refactor(bootstrap): root-cause hub-only handshake deadlock via typestate + IpcBudget#181

Merged: yishuiliunian merged 1 commit into main from worktree-dazzling-nibbling-bird on May 21, 2026
@yishuiliunian
Contributor

Summary

  • Eliminates the "hub child did not produce a handshake within 30s" startup deadlock by re-architecting the IPC + bootstrap layer so the same class of bug becomes a compile error rather than a runtime hang.
  • Measured: ALIVE 243ms, READY 488ms (no-MCP / no-secret startup) — ~60× faster than the previous 30s timeout.
  • Adds compile-time invariants (Connection / Bootstrap typestates), explicit-cost IPC API (IpcBudget), and a phased handshake protocol (LOOPAL_HUB_ALIVE / LOOPAL_HUB_READY).

Root cause (plan: .claude/plans/dazzling-nibbling-bird.md)

agent-server fired reverse IPC (hub/mcp/snapshot, hub/secret/*) on the synchronous response path of agent/start, but the hub-only side's dispatcher had not yet started consuming incoming_rx. Three independent 30s timeouts (HANDSHAKE_TIMEOUT / proxy_rpc_timeout / start_agent) stacked and masked the real failure point.
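
The failure mode above can be reproduced in miniature with plain channels. This is a minimal sketch, not the project's code: `Request`, `call`, and `run` are hypothetical names, `std::sync::mpsc` stands in for the real IPC transport, and the 200ms timeout stands in for the 30s ones. It shows that a request sent before anything consumes `incoming_rx` queues silently and only fails when the caller's timeout expires.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a reverse IPC request that carries its own reply channel.
struct Request {
    method: &'static str,
    reply: Sender<String>,
}

/// Sends one request and waits up to `timeout` for the reply.
fn call(
    incoming_tx: &Sender<Request>,
    method: &'static str,
    timeout: Duration,
) -> Result<String, RecvTimeoutError> {
    let (reply_tx, reply_rx) = channel();
    // The send succeeds even if nobody is reading yet: the message just queues.
    incoming_tx
        .send(Request { method, reply: reply_tx })
        .expect("receiver dropped");
    reply_rx.recv_timeout(timeout)
}

/// Issues the call with or without a dispatcher thread draining `incoming_rx`.
fn run(dispatcher_started: bool) -> Result<String, RecvTimeoutError> {
    let (incoming_tx, incoming_rx) = channel::<Request>();
    // Keep the receiver alive but unread when no dispatcher is running.
    let mut parked = Some(incoming_rx);
    if dispatcher_started {
        let rx = parked.take().expect("receiver present");
        thread::spawn(move || {
            for req in rx {
                let _ = req.reply.send(format!("ok:{}", req.method));
            }
        });
    }
    let result = call(&incoming_tx, "hub/mcp/snapshot", Duration::from_millis(200));
    drop(parked);
    result
}
```

In this sketch `run(false)` waits out the full timeout and returns `Err(RecvTimeoutError::Timeout)`, while `run(true)` gets its reply as soon as the dispatcher thread picks up the request.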

Changes

IPC layer (crates/loopal-ipc/)

  • Connection<Inactive→Listening> typestate (connection.rs): into_listening() consumes self; send_request only on Listening. The old start() escape hatch is gone.
  • DispatcherBuilder + Dispatcher (dispatcher.rs): handler-table API. &'static Method key catches typos at compile time. build() consumes self.
  • IpcBudget (budget.rs): Allowed(Duration) | Forbidden with no Default impl. HUB_RPC_BUDGET = 8s is a named constant sitting inside the layered budget proxy(8s) < start_agent(20s) < HANDSHAKE(30s).
  • HandshakeLine (handshake_protocol.rs): encode/parse for LOOPAL_HUB_ALIVE / LOOPAL_HUB_READY / LOOPAL_HUB_ERROR (plus legacy line for back-compat).
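
The `Connection` typestate from the list above can be illustrated with a marker type parameter. This is a sketch under assumptions: only the names `Connection`, `Inactive`, `Listening`, `into_listening`, and `send_request` come from this PR; the `peer` field, the `new` constructor, and the return values are invented for illustration and do not reflect the real signatures in `connection.rs`.

```rust
use std::marker::PhantomData;

/// Marker states; they carry no data and exist only at the type level.
struct Inactive;
struct Listening;

/// Illustrative connection; the real struct holds transport state instead of `peer`.
struct Connection<State> {
    peer: String,
    _state: PhantomData<State>,
}

impl Connection<Inactive> {
    /// Hypothetical constructor: every connection starts out inactive.
    fn new(peer: &str) -> Self {
        Connection { peer: peer.to_string(), _state: PhantomData }
    }

    /// Consumes the inactive connection. A real implementation would start
    /// the reader here, before handing back the `Listening` value.
    fn into_listening(self) -> Connection<Listening> {
        Connection { peer: self.peer, _state: PhantomData }
    }
}

impl Connection<Listening> {
    /// Defined only for `Listening`, so calling it on `Connection<Inactive>`
    /// is rejected by the type checker.
    fn send_request(&self, method: &str) -> String {
        format!("{} <- {}", self.peer, method)
    }
}
```

With these definitions, `Connection::new("hub").into_listening().send_request("agent/start")` compiles, whereas `Connection::new("hub").send_request("agent/start")` fails with a "no method named `send_request`" error.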

Bootstrap layer (src/bootstrap/)

  • typestate/ module: HubBuilt → ListenerBound → DispatcherReady → AgentSpawned → Ready. Each transition consumes self. alive_info() lives on ListenerBound so ALIVE handshake fires the moment the listener is bound, not gated on uplink connect.
  • hub_bootstrap.rs is now a thin orchestrator (~76 lines).
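
A rough sketch of how such a chain forces ordering, using one struct per stage. The stage names and `alive_info()` are taken from the description above; the transition method names (`bind_listener`, `build_dispatcher`, `spawn_agent`, `finish`), the fields, and the `bootstrap` driver are assumptions made for this example.

```rust
/// One type per bootstrap stage; each transition takes `self` by value.
struct HubBuilt;
struct ListenerBound { addr: String, token: String }
struct DispatcherReady;
struct AgentSpawned { session_id: String }
struct Ready { session_id: String }

impl HubBuilt {
    fn bind_listener(self, addr: &str, token: &str) -> ListenerBound {
        ListenerBound { addr: addr.to_string(), token: token.to_string() }
    }
}

impl ListenerBound {
    /// Available as soon as the listener is bound, before any later stage exists.
    fn alive_info(&self) -> (&str, &str) {
        (&self.addr, &self.token)
    }

    fn build_dispatcher(self) -> DispatcherReady {
        DispatcherReady
    }
}

impl DispatcherReady {
    fn spawn_agent(self, session_id: &str) -> AgentSpawned {
        AgentSpawned { session_id: session_id.to_string() }
    }
}

impl AgentSpawned {
    fn finish(self) -> Ready {
        Ready { session_id: self.session_id }
    }
}

/// Walks the chain in the only order the types allow and records what was announced.
fn bootstrap() -> Vec<String> {
    let mut events = Vec::new();
    let bound = HubBuilt.bind_listener("127.0.0.1:7000", "tok");
    let (addr, _token) = bound.alive_info();
    events.push(format!("alive {addr}"));
    let ready = bound.build_dispatcher().spawn_agent("s1").finish();
    events.push(format!("ready {}", ready.session_id));
    events
}
```

Because `spawn_agent` exists only on `DispatcherReady`, writing `HubBuilt.bind_listener(..).spawn_agent(..)` does not compile; skipping the dispatcher stage is a type error rather than a hang.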

Hub (crates/loopal-agent-hub/)

  • agent_io/ directory split (was 227-line single file): {mod,dispatch_loop,spawn}.rs. Loop body flattened with guard-clause early-returns. start_agent_io_with_ready collapsed into start_agent_io(... ready_tx: Option<…>).
  • dispatch/registry/{lifecycle,mcp,secret,spawn,topology,relay}.rs: hub/* handlers registered per business domain.
  • ui_request_loop.rs: shared UI client IO loop used by both UiSession::connect (in-process) and start_tcp_ui_io (TCP). Replaces two near-identical copies.
  • 4 IO loops (agent_io, ui_session, tcp_ui_io, uplink) now hold Arc<Dispatcher> and call dispatch_hub_request_with per-message. Legacy one-shot dispatch_hub_request is #[doc(hidden)] (kept for ~54 tests).
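
The handler-table idea can be sketched as follows. Only `DispatcherBuilder`, `Dispatcher`, `register_fn`, `build`, and the `&'static Method` key are named in this PR; the `Method` struct layout, the `hub/secret/get` method name, the `&str -> String` handler signature, and the `dispatch` and `build_hub_dispatcher` functions are hypothetical simplifications.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// A method descriptor. Handlers are registered against constants of this
/// type, so a misspelled constant name is a compile error.
struct Method {
    name: &'static str,
}

// Hypothetical method constants for illustration.
static MCP_SNAPSHOT: Method = Method { name: "hub/mcp/snapshot" };
static SECRET_GET: Method = Method { name: "hub/secret/get" };

type Handler = Arc<dyn Fn(&str) -> String + Send + Sync>;

#[derive(Default)]
struct DispatcherBuilder {
    handlers: HashMap<&'static str, Handler>,
}

impl DispatcherBuilder {
    fn register_fn<F>(mut self, method: &'static Method, handler: F) -> Self
    where
        F: Fn(&str) -> String + Send + Sync + 'static,
    {
        self.handlers.insert(method.name, Arc::new(handler));
        self
    }

    /// Consumes the builder, so no handler can be added after the table is built.
    fn build(self) -> Dispatcher {
        Dispatcher { handlers: self.handlers }
    }
}

struct Dispatcher {
    handlers: HashMap<&'static str, Handler>,
}

impl Dispatcher {
    /// Looks up the wire-level method name; `None` means no handler is registered.
    fn dispatch(&self, method: &str, params: &str) -> Option<String> {
        self.handlers.get(method).map(|h| (**h)(params))
    }
}

/// Builds the table once; an IO loop would clone the `Arc` and reuse it per message.
fn build_hub_dispatcher() -> Arc<Dispatcher> {
    Arc::new(
        DispatcherBuilder::default()
            .register_fn(&MCP_SNAPSHOT, |_| "snapshot".to_string())
            .register_fn(&SECRET_GET, |key| format!("secret:{key}"))
            .build(),
    )
}
```

Building the table once and sharing it behind an `Arc` keeps registration cost out of the per-message path, which is the trade-off the per-loop `Arc<Dispatcher>` is described as making.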

MCP / Secret

  • McpProvider / SecretClient traits: every method takes IpcBudget. LocalMcpProvider ignores it; McpProxyClient / HubSecretClient respect it (Forbidden returns TransportClosed immediately, no wait).
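
A minimal sketch of the explicit-cost idea, assuming a channel-based reply path. `IpcBudget`, its two variants, `HUB_RPC_BUDGET`, and the `TransportClosed` outcome for `Forbidden` come from this PR; `IpcError`, `await_reply`, `START_AGENT_BUDGET`, `budgets_are_layered`, and the constant name `HANDSHAKE_TIMEOUT` as used here are illustrative assumptions.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Deliberately has no `Default` impl: every call site must pick a budget.
#[derive(Clone, Copy, Debug, PartialEq)]
enum IpcBudget {
    Allowed(Duration),
    Forbidden,
}

// Layered budgets from the PR description: proxy(8s) < start_agent(20s) < HANDSHAKE(30s).
const HUB_RPC_BUDGET: Duration = Duration::from_secs(8);
const START_AGENT_BUDGET: Duration = Duration::from_secs(20); // hypothetical constant name
const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum IpcError {
    TransportClosed,
    TimedOut,
}

/// Waits for a reply within the budget; `Forbidden` fails without waiting at all.
fn await_reply(reply_rx: &Receiver<String>, budget: IpcBudget) -> Result<String, IpcError> {
    match budget {
        IpcBudget::Forbidden => Err(IpcError::TransportClosed),
        IpcBudget::Allowed(limit) => reply_rx.recv_timeout(limit).map_err(|e| match e {
            RecvTimeoutError::Timeout => IpcError::TimedOut,
            RecvTimeoutError::Disconnected => IpcError::TransportClosed,
        }),
    }
}

/// The innermost budget must expire first so the real failure point surfaces.
fn budgets_are_layered() -> bool {
    HUB_RPC_BUDGET < START_AGENT_BUDGET && START_AGENT_BUDGET < HANDSHAKE_TIMEOUT
}
```

A caller on the synchronous `agent/start` path would pass `IpcBudget::Forbidden`, while a background refresh might pass `IpcBudget::Allowed(HUB_RPC_BUDGET)`; either way the cost is visible at the call site.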

Test plan

  • tests/hub_only_mcp_deadlock_test.rs — handshake completes in <5s even with an unresponsive MCP server (would hang 30s before)
  • tests/bootstrap_typestate_e2e_test.rs — ALIVE within 3s, READY within 8s
  • crates/loopal-mcp/tests/suite/ipc_budget_test.rs: Forbidden returns immediately (<50ms)
  • bazel build //... clean
  • bazel build //... --config=clippy zero warnings
  • bazel build //... --config=rustfmt clean
  • bazel test //... — 88/88 pass
  • Manual smoke: loopal --hub-only --ephemeral → ALIVE 243ms / READY 488ms
  • CI passes

…tate + IpcBudget

`loopal` startup used to hang for 30s with "hub child did not produce a
handshake within 30s" because agent-server fired reverse IPC
(`hub/mcp/snapshot`, `hub/secret/*`) before responding to `agent/start`,
but the hub-side dispatcher was not yet consuming `incoming_rx`.

Fixed at the architectural layer so the same class of bug is impossible:

- **Connection<Inactive→Listening> typestate**: `into_listening()` consumes
  self and is the only way to enable `send_request`; the old `start()`
  escape hatch is gone. "Send before reader started" is now a compile
  error.

- **Bootstrap typestate chain**: `HubBuilt → ListenerBound →
  DispatcherReady → AgentSpawned → Ready`. Each transition consumes self;
  skip-a-step / reorder is a compile error. `alive_info()` lives on
  `ListenerBound` so the ALIVE handshake fires the moment the TCP listener
  is bound — no longer gated on uplink connect or dispatcher build.

- **DispatcherBuilder**: hub/* handlers are registered via
  `register_fn(method, handler)` (cheap: ~20 entries + Arc allocs). 4 IO
  loops (agent_io, ui_session, tcp_ui_io, uplink) each build their own
  dispatcher at startup and share it per-message via
  `dispatch_hub_request_with(&Arc<Dispatcher>, ...)`. The old one-shot
  `dispatch_hub_request` is `#[doc(hidden)]`, kept for ~54 tests.

- **IpcBudget {Allowed(Duration), Forbidden}**: no `Default` impl —
  every IPC callsite must explicitly choose. `HUB_RPC_BUDGET = 8s` is a
  named constant sitting inside the layered handshake budget
  (`proxy(8s) < start_agent(20s) < HANDSHAKE(30s)`), so the innermost
  failure surfaces first.

- **Phased handshake**: `LOOPAL_HUB_ALIVE <addr> <token>` (listener
  bound) and `LOOPAL_HUB_READY <session_id>` (root agent started) replace
  the single-line legacy form. Legacy line is still emitted for
  back-compat. Encode/parse lives in
  `loopal_ipc::handshake_protocol::HandshakeLine`.

- **agent_io split**: was 227 lines mixing 5 concerns; now
  `agent_io/{mod,dispatch_loop,spawn}.rs` (165/78 lines). Loop body
  flattened with guard-clause early-returns. `start_agent_io` /
  `start_agent_io_with_ready` collapsed; `spawn_io_loop` retained for
  callers that already registered.

- **UI loop deduplication**: extracted `ui_request_loop::ui_client_io_loop`
  shared by both `UiSession::connect` (in-process) and `start_tcp_ui_io`
  (TCP). Protocol-level UI behavior is now enforced by code structure.
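
The phased handshake lines described above could be encoded and parsed along these lines. This is a sketch, not the contents of `loopal_ipc::handshake_protocol`: the `LOOPAL_HUB_ALIVE <addr> <token>` and `LOOPAL_HUB_READY <session_id>` shapes are from the commit message, but the enum layout, the free-text payload after `LOOPAL_HUB_ERROR`, and the `Option` return type are assumptions. The legacy line is omitted because its format is not given here.

```rust
/// Illustrative handshake line type; variant fields are assumptions.
#[derive(Debug, PartialEq)]
enum HandshakeLine {
    Alive { addr: String, token: String },
    Ready { session_id: String },
    Error { message: String },
}

impl HandshakeLine {
    /// Renders the line without a trailing newline.
    fn encode(&self) -> String {
        match self {
            HandshakeLine::Alive { addr, token } => format!("LOOPAL_HUB_ALIVE {addr} {token}"),
            HandshakeLine::Ready { session_id } => format!("LOOPAL_HUB_READY {session_id}"),
            HandshakeLine::Error { message } => format!("LOOPAL_HUB_ERROR {message}"),
        }
    }

    /// Parses one line of child stdout; returns `None` for anything unrecognized.
    fn parse(line: &str) -> Option<Self> {
        let (tag, rest) = line.trim_end().split_once(' ')?;
        match tag {
            "LOOPAL_HUB_ALIVE" => {
                let (addr, token) = rest.split_once(' ')?;
                Some(HandshakeLine::Alive { addr: addr.to_string(), token: token.to_string() })
            }
            "LOOPAL_HUB_READY" => Some(HandshakeLine::Ready { session_id: rest.to_string() }),
            "LOOPAL_HUB_ERROR" => Some(HandshakeLine::Error { message: rest.to_string() }),
            _ => None,
        }
    }
}
```

A parent process reading the child's output can then treat `Alive` as "safe to connect" and keep waiting for `Ready` or `Error`, instead of blocking on a single all-or-nothing line.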

Measured startup (no MCP / no secret, ephemeral): ALIVE 243ms, READY
488ms (~60× faster than the 30s timeout). Regression tests:
`tests/hub_only_mcp_deadlock_test.rs` (5s budget with unresponsive MCP),
`tests/bootstrap_typestate_e2e_test.rs` (ALIVE 3s / READY 8s budget),
`crates/loopal-mcp/tests/suite/ipc_budget_test.rs` (`Forbidden` returns
immediately).
@yishuiliunian yishuiliunian merged commit ba13450 into main May 21, 2026
4 checks passed
@yishuiliunian yishuiliunian deleted the worktree-dazzling-nibbling-bird branch May 21, 2026 18:04