refactor(bootstrap): root-cause hub-only handshake deadlock via typestate + IpcBudget #181
Merged
`loopal` startup used to hang for 30s with "hub child did not produce a
handshake within 30s" because agent-server fired reverse IPC
(`hub/mcp/snapshot`, `hub/secret/*`) before responding to `agent/start`,
but the hub-side dispatcher was not yet consuming `incoming_rx`.

Fixed at the architectural layer, so this class of bug now fails at
compile time instead of hanging at runtime:
- **Connection<Inactive→Listening> typestate**: `into_listening()` consumes
self and is the only way to enable `send_request`; the old `start()`
escape hatch is gone. "Send before reader started" is now a compile
error.
- **Bootstrap typestate chain**: `HubBuilt → ListenerBound →
DispatcherReady → AgentSpawned → Ready`. Each transition consumes self;
skip-a-step / reorder is a compile error. `alive_info()` lives on
`ListenerBound` so the ALIVE handshake fires the moment the TCP listener
is bound — no longer gated on uplink connect or dispatcher build.
- **DispatcherBuilder**: hub/* handlers are registered via
`register_fn(method, handler)` (cheap: ~20 entries + Arc allocs). 4 IO
loops (agent_io, ui_session, tcp_ui_io, uplink) each build their own
dispatcher at startup and share it per-message via
`dispatch_hub_request_with(&Arc<Dispatcher>, ...)`. The old one-shot
`dispatch_hub_request` is `#[doc(hidden)]`, kept for ~54 tests.
- **IpcBudget {Allowed(Duration), Forbidden}**: no `Default` impl —
every IPC callsite must explicitly choose. `HUB_RPC_BUDGET = 8s` is a
named constant sitting inside the layered handshake budget
(`proxy(8s) < start_agent(20s) < HANDSHAKE(30s)`), so the innermost
failure surfaces first.
- **Phased handshake**: `LOOPAL_HUB_ALIVE <addr> <token>` (listener
bound) and `LOOPAL_HUB_READY <session_id>` (root agent started) replace
the single-line legacy form. Legacy line is still emitted for
back-compat. Encode/parse lives in
`loopal_ipc::handshake_protocol::HandshakeLine`.
- **agent_io split**: was 227 lines mixing 5 concerns; now
`agent_io/{mod,dispatch_loop,spawn}.rs` (165/78 lines). Loop body
flattened with guard-clause early-returns. `start_agent_io` /
`start_agent_io_with_ready` collapsed; `spawn_io_loop` retained for
callers that already registered.
- **UI loop deduplication**: extracted `ui_request_loop::ui_client_io_loop`
shared by both `UiSession::connect` (in-process) and `start_tcp_ui_io`
(TCP). Protocol-level UI behavior is now enforced by code structure.
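
To illustrate the first bullet, here is a minimal sketch of how a `Connection<Inactive→Listening>` typestate can make "send before reader started" a compile error. Only the names `Connection`, `Inactive`, `Listening`, `into_listening` and `send_request` come from this description; the `sent` field, the `new` constructor and the return value are assumptions made for the sketch, and the real type in `loopal-ipc` is likely to differ.

```rust
use std::marker::PhantomData;

/// Marker types for the two connection states named in the description.
pub struct Inactive;
pub struct Listening;

/// Hypothetical stand-in for the real connection. `sent` only records
/// request names so the sketch can be exercised without real IO.
pub struct Connection<State> {
    sent: Vec<String>,
    _state: PhantomData<State>,
}

impl Connection<Inactive> {
    /// Assumed constructor; a fresh connection starts out inactive.
    pub fn new() -> Self {
        Connection { sent: Vec::new(), _state: PhantomData }
    }

    /// Consumes the inactive connection. The real code would start the
    /// reader here; this sketch only changes the type parameter.
    pub fn into_listening(self) -> Connection<Listening> {
        Connection { sent: self.sent, _state: PhantomData }
    }
}

impl Connection<Listening> {
    /// Defined only for `Listening`, so calling it on a
    /// `Connection<Inactive>` is rejected by the compiler.
    /// Returns the number of requests recorded so far.
    pub fn send_request(&mut self, method: &str) -> usize {
        self.sent.push(method.to_string());
        self.sent.len()
    }
}
```

Because `into_listening()` takes `self` by value, the inactive handle cannot be kept around and used afterwards, and `Connection::<Inactive>::new().send_request(...)` does not type-check at all.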

Measured startup (no MCP / no secret, ephemeral): ALIVE 243ms, READY
488ms (about 60× under the 30s timeout). Regression tests:
`tests/hub_only_mcp_deadlock_test.rs` (5s budget with unresponsive MCP),
`tests/bootstrap_typestate_e2e_test.rs` (ALIVE 3s / READY 8s budget),
`crates/loopal-mcp/tests/suite/ipc_budget_test.rs` (`Forbidden` returns
immediately).
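
The `Forbidden` behavior checked by `ipc_budget_test.rs` can be sketched roughly as follows. `IpcBudget`, its two variants, `HUB_RPC_BUDGET`, `HANDSHAKE_TIMEOUT` and `TransportClosed` are named in this PR; `START_AGENT_BUDGET`, `IpcError` and `call_with_budget` are hypothetical names introduced only for this example, which assumes the budget is checked before any IO is attempted.

```rust
use std::time::Duration;

/// Budget for a single IPC call. There is deliberately no `Default`
/// impl, so every call site has to pick a variant explicitly.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpcBudget {
    Allowed(Duration),
    Forbidden,
}

/// Layered budgets quoted in the description: 8s < 20s < 30s.
/// `START_AGENT_BUDGET` is an assumed name for the 20s layer.
pub const HUB_RPC_BUDGET: Duration = Duration::from_secs(8);
pub const START_AGENT_BUDGET: Duration = Duration::from_secs(20);
pub const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical error type holding the one variant the PR mentions.
#[derive(Debug, PartialEq, Eq)]
pub enum IpcError {
    TransportClosed,
}

/// Hypothetical proxy call. `Forbidden` fails before the transport is
/// touched; `Allowed(d)` passes the timeout to the supplied transport.
pub fn call_with_budget<T>(
    budget: IpcBudget,
    transport: impl FnOnce(Duration) -> Result<T, IpcError>,
) -> Result<T, IpcError> {
    match budget {
        IpcBudget::Forbidden => Err(IpcError::TransportClosed),
        IpcBudget::Allowed(timeout) => transport(timeout),
    }
}
```

Keeping the innermost budget strictly smaller than the outer ones means the proxy call times out first, so the error that reaches the user points at the actual failing layer rather than at the outer 30s handshake timeout.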
## Summary

- Fixes the `hub child did not produce a handshake within 30s` deadlock by re-architecting the IPC + bootstrap layer so the same class of bug becomes a compile error rather than a runtime hang.
- […] (`IpcBudget`), and a phased handshake protocol (`LOOPAL_HUB_ALIVE` / `LOOPAL_HUB_READY`).

## Root cause (plan: `.claude/plans/dazzling-nibbling-bird.md`)

`agent-server` fired reverse IPC (`hub/mcp/snapshot`, `hub/secret/*`) on the synchronous response path of `agent/start`, but the hub-only side's dispatcher had not yet started consuming `incoming_rx`. Three independent 30s timeouts (`HANDSHAKE_TIMEOUT` / `proxy_rpc_timeout` / `start_agent`) stacked and masked the real failure point.

## Changes

### IPC layer (`crates/loopal-ipc/`)

- `Connection<Inactive→Listening>` typestate (`connection.rs`): `into_listening()` consumes `self`; `send_request` only on `Listening`. The old `start()` escape hatch is gone.
- `DispatcherBuilder` + `Dispatcher` (`dispatcher.rs`): handler-table API. `&'static Method` key catches typos at compile time. `build()` consumes self.
- `IpcBudget` (`budget.rs`): `Allowed(Duration) | Forbidden` with no `Default` impl. `HUB_RPC_BUDGET = 8s` is a named constant sitting inside the layered budget `proxy(8s) < start_agent(20s) < HANDSHAKE(30s)`.
- `HandshakeLine` (`handshake_protocol.rs`): encode/parse for `LOOPAL_HUB_ALIVE` / `LOOPAL_HUB_READY` / `LOOPAL_HUB_ERROR` (plus legacy line for back-compat).

### Bootstrap layer (`src/bootstrap/`)

- `typestate/` module: `HubBuilt → ListenerBound → DispatcherReady → AgentSpawned → Ready`. Each transition consumes self.
- `alive_info()` lives on `ListenerBound` so the ALIVE handshake fires the moment the listener is bound, not gated on uplink connect.
- `hub_bootstrap.rs` is now a thin orchestrator (~76 lines).

### Hub (`crates/loopal-agent-hub/`)

- `agent_io/` directory split (was 227-line single file): `{mod,dispatch_loop,spawn}.rs`. Loop body flattened with guard-clause early-returns. `start_agent_io_with_ready` collapsed into `start_agent_io(... ready_tx: Option<…>)`.
- `dispatch/registry/{lifecycle,mcp,secret,spawn,topology,relay}.rs`: hub/* handlers registered per business domain.
- `ui_request_loop.rs`: shared UI client IO loop used by both `UiSession::connect` (in-process) and `start_tcp_ui_io` (TCP). Replaces two near-identical copies.
- The 4 IO loops each build their own `Arc<Dispatcher>` and call `dispatch_hub_request_with` per-message. Legacy one-shot `dispatch_hub_request` is `#[doc(hidden)]` (kept for ~54 tests).

### MCP / Secret

- `McpProvider` / `SecretClient` traits: every method takes `IpcBudget`. `LocalMcpProvider` ignores it; `McpProxyClient` / `HubSecretClient` respect it (`Forbidden` returns `TransportClosed` immediately, no wait).

## Test plan

- `tests/hub_only_mcp_deadlock_test.rs`: handshake completes in <5s even with an unresponsive MCP server (would hang 30s before)
- `tests/bootstrap_typestate_e2e_test.rs`: ALIVE within 3s, READY within 8s
- `crates/loopal-mcp/tests/suite/ipc_budget_test.rs`: `Forbidden` returns immediately (<50ms)
- `bazel build //...`: clean
- `bazel build //... --config=clippy`: zero warnings
- `bazel build //... --config=rustfmt`: clean
- `bazel test //...`: 88/88 pass
- `loopal --hub-only --ephemeral` → ALIVE 243ms / READY 488ms
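
For readers unfamiliar with the phased handshake, the sketch below shows one way the `LOOPAL_HUB_ALIVE <addr> <token>` and `LOOPAL_HUB_READY <session_id>` lines could be encoded and parsed. It is not the actual `HandshakeLine` from `loopal_ipc::handshake_protocol`: the variant and field names are assumptions, whitespace-separated fields are assumed, and `LOOPAL_HUB_ERROR` and the legacy line are left out because their payloads are not described here.

```rust
/// Sketch of the two phased handshake lines. Variant and field names
/// are assumptions, not the real `HandshakeLine` definition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HandshakeLine {
    /// Emitted once the listener is bound.
    Alive { addr: String, token: String },
    /// Emitted once the root agent has started.
    Ready { session_id: String },
}

impl HandshakeLine {
    /// Renders the line without a trailing newline.
    pub fn encode(&self) -> String {
        match self {
            HandshakeLine::Alive { addr, token } => {
                format!("LOOPAL_HUB_ALIVE {addr} {token}")
            }
            HandshakeLine::Ready { session_id } => {
                format!("LOOPAL_HUB_READY {session_id}")
            }
        }
    }

    /// Returns `None` for anything that is not a well-formed ALIVE or
    /// READY line, including lines with missing or extra fields.
    pub fn parse(line: &str) -> Option<Self> {
        let mut parts = line.split_whitespace();
        let parsed = match parts.next()? {
            "LOOPAL_HUB_ALIVE" => HandshakeLine::Alive {
                addr: parts.next()?.to_string(),
                token: parts.next()?.to_string(),
            },
            "LOOPAL_HUB_READY" => HandshakeLine::Ready {
                session_id: parts.next()?.to_string(),
            },
            _ => return None,
        };
        // Reject trailing fields so malformed lines are not accepted.
        if parts.next().is_some() {
            return None;
        }
        Some(parsed)
    }
}
```

A parent process reading the hub child's stdout could call `HandshakeLine::parse` on each line and treat ALIVE and READY as two separate milestones, each with its own timeout.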