feat(hub): move Vault + MCP to Hub process with sharing model #179
Merged
Systemic fix for the macmini-03-64 "Memory tool killed after 30s"
incident: chrome-devtools-mcp's slow startup blocked agent/start, and
SingletonLock collisions across concurrent agents made the situation
unrecoverable. Instead of patching the symptom, this branch moves both
Vault and MCP up to the Hub process and introduces a proper resource
sharing model so the root causes can't recur.
Stage 1 — Vault on Hub
- loopal-secret-client crate: SecretClient trait + HubSecretClient
IPC impl + HubHealth + RetryPolicy + structured SecretIpcError
- loopal-hub-vault crate: HubVaultService + with_identity DI seam
- Agent gets plaintext only via hub/secret/* IPC; Provider trait stays
clean (no SecretClient awareness, per architectural decision)
- SpawnRegistry::verify_vault_access guards cwd-scoped boundaries
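The PR summary describes the secret client as retrying with exponential backoff. A minimal sketch of what such a policy could look like follows; the field names (`max_attempts`, `base_delay`, `max_delay`) and the `delay_for`/`run` methods are assumptions for illustration, not the actual `RetryPolicy` API in loopal-secret-client.

```rust
use std::time::Duration;

/// Hypothetical retry policy; field and method names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// Delay before retrying after 0-based `attempt`: base * 2^attempt, capped at `max_delay`.
    pub fn delay_for(&self, attempt: u32) -> Duration {
        let factor = 2u32.saturating_pow(attempt);
        self.base_delay.saturating_mul(factor).min(self.max_delay)
    }

    /// Call `op` until it succeeds or `max_attempts` calls have failed.
    /// `op` receives the 0-based attempt number.
    pub fn run<T, E>(&self, mut op: impl FnMut(u32) -> Result<T, E>) -> Result<T, E> {
        let mut attempt = 0;
        loop {
            match op(attempt) {
                Ok(value) => return Ok(value),
                // Out of attempts: surface the last error unchanged.
                Err(e) if attempt + 1 >= self.max_attempts => return Err(e),
                Err(_) => {
                    std::thread::sleep(self.delay_for(attempt));
                    attempt += 1;
                }
            }
        }
    }
}
```

Capping the delay keeps a long outage from pushing individual waits past a known bound; a real client would presumably also retry only on errors classified as transient.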
Stage 2 — MCP on Hub
- HubMcpService in agent-hub owns LocalMcpProvider per scope
- Sub-agents use McpProxyClient over IPC; never spawn MCP children
- LocalMcpProvider spawns connections in background (fire-and-forget);
bounded finalize_mcp_tools + late settle-poll for slow servers
- LocalMcpProvider::has_server O(1) replaces N+1 list_tools probe
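The fire-and-forget spawn plus bounded wait described above can be sketched with a background thread and a channel. This is an illustration only: `spawn_connect`, `Pending`, and the closure standing in for an MCP server handshake are hypothetical, and the real code is presumably async rather than thread-based.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

/// Handle to a connection attempt that keeps running in the background.
pub struct Pending {
    rx: Receiver<Vec<String>>,
}

/// Start `connect` (a stand-in for a slow MCP server handshake) on a detached
/// thread and return immediately.
pub fn spawn_connect<F>(connect: F) -> Pending
where
    F: FnOnce() -> Vec<String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails only if the handle was dropped; nothing to do then.
        let _ = tx.send(connect());
    });
    Pending { rx }
}

impl Pending {
    /// Bounded wait: return the tool list if it arrives within `budget`.
    pub fn wait(&self, budget: Duration) -> Option<Vec<String>> {
        self.rx.recv_timeout(budget).ok()
    }

    /// Non-blocking check, usable from a later settle poll.
    pub fn poll(&self) -> Option<Vec<String>> {
        self.rx.try_recv().ok()
    }
}
```

Because the caller keeps the handle, a server that misses the startup budget is not lost: a later poll can still pick up its tools once the handshake finishes.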
Stage 3 — Sharing model
- McpSharing { HubSingleton (default) | PerAgent | SpawnTree }
- CwdIsolation { arg, cache_subdir }: config-driven chrome-style
per-cwd --user-data-dir injection (no hardcoded server names)
- SpawnRegistry as single source of truth for parent chain (ADR §7);
walk_to_root returns Result distinguishing NotFound vs CycleDetected
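To illustrate why `walk_to_root` benefits from distinguishing its two failure modes, here is a minimal sketch over an in-memory parent map. The `WalkError` name, the `register` helper, and the string-keyed map are assumptions; the real `SpawnRegistry` may store different data.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum WalkError {
    /// The named agent is not registered.
    NotFound(String),
    /// The parent chain revisits the named agent.
    CycleDetected(String),
}

/// Hypothetical registry: agent name -> parent name (`None` marks a root).
#[derive(Default)]
pub struct SpawnRegistry {
    parents: HashMap<String, Option<String>>,
}

impl SpawnRegistry {
    pub fn register(&mut self, agent: &str, parent: Option<&str>) {
        self.parents
            .insert(agent.to_string(), parent.map(str::to_string));
    }

    /// Follow parent links from `start` until a root is reached.
    pub fn walk_to_root(&self, start: &str) -> Result<String, WalkError> {
        let mut seen = HashSet::new();
        let mut current = start.to_string();
        loop {
            // `insert` returns false when the name was already visited.
            if !seen.insert(current.clone()) {
                return Err(WalkError::CycleDetected(current));
            }
            match self.parents.get(&current) {
                None => return Err(WalkError::NotFound(current)),
                Some(None) => return Ok(current),
                Some(Some(parent)) => current = parent.clone(),
            }
        }
    }
}
```

A caller can treat `NotFound` as an unknown or already-removed agent and `CycleDetected` as registry corruption, which would be indistinguishable if the function returned a bare `Option`.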
Stage 4 — Resilience
- HubHealth { degraded_at_unix_ms, single-writer atomics }
- 1Hz hub_health_poller emits HubDegraded/HubRecovered with synthetic
initial emit at agent boot so a pre-degraded Hub surfaces in TUI
- SessionViewState.hub_degraded_since_ms cleared on session_resume
- emit failures log via tracing::warn instead of silent break
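The degrade/recover transitions can be sketched with two atomics. Everything beyond the `degraded_at_unix_ms` field and the `HubDegraded`/`HubRecovered` event pair is an assumption here: the failure counter, the `HealthEvent` enum, and the method names are hypothetical, `with_threshold` is described in this PR only as a future constructor, and the sketch relies on the single-writer property stated above.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub enum HealthEvent {
    Degraded { since_ms: u64 },
    Recovered,
}

/// Hypothetical shape; 0 in `degraded_at_unix_ms` means "healthy".
pub struct HubHealth {
    threshold: u32,
    consecutive_failures: AtomicU32,
    degraded_at_unix_ms: AtomicU64,
}

impl HubHealth {
    pub fn with_threshold(threshold: u32) -> Self {
        Self {
            threshold,
            consecutive_failures: AtomicU32::new(0),
            degraded_at_unix_ms: AtomicU64::new(0),
        }
    }

    /// Record a failed IPC call. Emits `Degraded` once, when the streak first
    /// reaches the threshold. The load-then-store is safe only with one writer.
    pub fn record_failure(&self, now_ms: u64) -> Option<HealthEvent> {
        let failures = self.consecutive_failures.fetch_add(1, Ordering::Relaxed) + 1;
        if failures >= self.threshold && self.degraded_at_unix_ms.load(Ordering::Relaxed) == 0 {
            self.degraded_at_unix_ms.store(now_ms, Ordering::Relaxed);
            return Some(HealthEvent::Degraded { since_ms: now_ms });
        }
        None
    }

    /// Record a successful IPC call. Emits `Recovered` only if previously degraded.
    pub fn record_success(&self) -> Option<HealthEvent> {
        self.consecutive_failures.store(0, Ordering::Relaxed);
        if self.degraded_at_unix_ms.swap(0, Ordering::Relaxed) != 0 {
            Some(HealthEvent::Recovered)
        } else {
            None
        }
    }

    /// Readers (for example a poller) may call this from any thread.
    pub fn degraded_since_ms(&self) -> Option<u64> {
        match self.degraded_at_unix_ms.load(Ordering::Relaxed) {
            0 => None,
            ms => Some(ms),
        }
    }
}
```

Emitting only on state transitions keeps a 1Hz poller from flooding the TUI with repeated events while the Hub stays degraded.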
Phase 5 — Stage debt cleanup
- spawn_proxy_mcp_settle_poll terminates after quiet_streak ticks
- SecretIpcError JSON over IPC; classify_rpc reads RpcError directly,
no string-matching
- expand_template rewrites via manual cursor + push_str so plaintext
lives in one buffer (zeroize on SecretString::from)
- verify_caller returns SecretError, error paths uniformly structured
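The quiet-streak termination rule for the settle poll can be sketched as a plain loop. This is an assumption-laden illustration: `settle_poll`, the `list_tool_count` closure, and the `max_ticks` safety cap are hypothetical, and the sleep between ticks is omitted so the logic stays easy to test.

```rust
/// Poll until `quiet_streak` consecutive ticks report no new tools, or until
/// `max_ticks` is reached. Returns (ticks executed, highest tool count seen).
pub fn settle_poll(
    mut list_tool_count: impl FnMut() -> usize,
    quiet_streak: u32,
    max_ticks: u32,
) -> (u32, usize) {
    let mut known = 0usize;
    let mut quiet = 0u32;
    let mut ticks = 0u32;
    while ticks < max_ticks && quiet < quiet_streak {
        ticks += 1;
        let count = list_tool_count();
        if count > known {
            // New tools appeared: remember them and restart the quiet window.
            known = count;
            quiet = 0;
        } else {
            quiet += 1;
        }
    }
    (ticks, known)
}
```

Resetting the streak whenever new tools appear lets a slow server extend the poll, while an idle Hub causes it to stop after a fixed number of ticks instead of running for the lifetime of the sub-agent.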
E2E + mutation testing
- 30+ integration tests across view-state, mcp_service, agent-hub
- e2e_real_vault_test exercises real age encrypt/decrypt → IPC →
byte-for-byte plaintext assertion
- cross-cwd request asserted to PermissionDenied (single variant)
- 4 deliberate mutations each caught by 1-3 tests
Env vars exposed: LOOPAL_HUB_HEALTH_TICK_SECS, LOOPAL_MCP_POLL_SECS,
LOOPAL_MCP_POLL_QUIET_STREAK, LOOPAL_MCP_STARTUP_WAIT_SECS.
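One way to read such tunables is a small helper that falls back to the default on missing or malformed input. The helper names are hypothetical, and rejecting zero is an assumption made here so that a cadence can never be set to "no delay"; the actual parsing rules are not shown in this PR.

```rust
use std::env;

/// Parse a raw env value; fall back to `default` when it is absent,
/// not a number, or zero (zero is rejected by assumption, see above).
pub fn parse_or_default(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&v| v > 0)
        .unwrap_or(default)
}

/// Read `name` from the process environment with a default.
pub fn env_u64(name: &str, default: u64) -> u64 {
    parse_or_default(env::var(name).ok().as_deref(), default)
}
```

With the defaults listed in the PR description, a call such as `env_u64("LOOPAL_MCP_POLL_SECS", 3)` would yield 3 unless the variable is set to a valid positive integer.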
Windows CI runners have no ~/.ssh/id_ed25519 so HubVaultService::with_noop_audit() (which calls discover()) was panicking before reaching the actual permission check. Use with_identity() with the test SSH key fixture instead, matching the pattern already used in e2e_real_vault_test.rs.
yishuiliunian added a commit that referenced this pull request on May 26, 2026
Closes the e2e coverage gap identified in the post-deletion audit: `build_kernel_from_config` had no direct tests for its depth- and production-based selection of the MCP backend after the original tests were deleted in #179 (Vault+MCP-to-Hub migration).

Restored 7 tests covering still-applicable invariants:
- depth_zero_uses_local_backend_with_manager: root agent always owns a LocalMcpProvider (mcp_manager().is_some())
- depth_gt0_with_hub_client_uses_proxy_backend: sub-agent + hub_client → McpProxyClient, no local manager
- depth_gt0_without_hub_client_falls_back_to_local: defensive fallback
- non_production_skips_mcp_entirely: test mode keeps Local backend but skips MCP server spawn
- build_kernel_with_slow_mcp_server_returns_within_bounded_wait: core startup-resilience promise, 1s budget + overhead ≤3s wall-clock
- build_kernel_mixed_servers_completes_within_bounded_wait: even with multiple problem servers, build respects bounded wait
- sub_agent_build_with_slow_root_config_does_not_spawn_local_mcp: anti-process-explosion, depth>0 + hub_client skips local MCP spawn entirely (chrome-devtools-mcp duplication scenario from the original incident)

Two tests from the original were genuinely obsolete and NOT restored:
- build_kernel_with_failing_mcp_server_does_not_propagate_error
- build_kernel_skips_disabled_servers_entirely

Both asserted on `mcp_provider().snapshot()` showing per-server status. Post-#179, MCP server spawning happens in `loopal-agent-hub::mcp_service` (Hub side), not in the agent process. The agent's `build_kernel_from_config` only selects the provider type; actual spawning and failure reporting are now the Hub's responsibility and are covered by `loopal-agent-hub` tests.

Audit also confirmed that deleting `goal_kickoff_barren_test` in #180 was legitimate: `barren_continuation_count` was replaced by ContinuationGate + DegenerationDetector, and that behavior is now covered by `degeneration_e2e_test.rs` and `idle_e2e_test.rs`. No restoration needed there.
Adjusted McpServerConfig::Stdio variant to include the new fields (`sharing`, `cwd_isolation`) introduced after 5e41a0f.
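The branch selection that the restored tests pin down can be summarized as a small decision function. This is a sketch, not the code in `build_kernel_from_config`: `McpBackend`, `select_backend`, and the `spawn_servers` flag are hypothetical names, and checking the non-production case before the depth case is an assumption.

```rust
/// Hypothetical summary of which MCP backend an agent ends up with.
#[derive(Debug, PartialEq, Eq)]
pub enum McpBackend {
    /// Agent owns a local provider; `spawn_servers` says whether MCP
    /// server processes are actually started.
    Local { spawn_servers: bool },
    /// Agent forwards MCP calls to the Hub over IPC.
    Proxy,
}

pub fn select_backend(depth: u32, has_hub_client: bool, production: bool) -> McpBackend {
    if !production {
        // Test mode: keep the Local backend but skip spawning MCP servers.
        return McpBackend::Local { spawn_servers: false };
    }
    if depth > 0 && has_hub_client {
        // Sub-agent with a Hub connection: never spawn MCP children locally.
        McpBackend::Proxy
    } else {
        // Root agent, or defensive fallback for a sub-agent without a hub client.
        McpBackend::Local { spawn_servers: true }
    }
}
```

Expressed this way, each of the first four restored tests corresponds to one input combination, which makes it easy to see that no branch is left uncovered.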
Summary
Systemic fix for the macmini-03-64 "Memory tool killed after 30s" incident: chrome-devtools-mcp's slow startup blocked agent/start, and SingletonLock collisions across concurrent agents made the situation unrecoverable. Instead of patching the symptom (raise timeout / move user-data-dir), this branch moves both Vault and MCP up to the Hub process and introduces a proper resource sharing model so the root causes can't recur.
- HubSingleton (default cwd-scoped) / PerAgent / SpawnTree: sub-agents reuse root's MCP via proxy IPC
- HubHealth + 1Hz poller surface IPC degradation in TUI; structured SecretIpcError over IPC; retry with exponential backoff
Changes
New crates (2):
- loopal-secret-client: SecretClient trait + HubSecretClient IPC impl + HubHealth + RetryPolicy + structured SecretIpcError
- loopal-hub-vault: HubVaultService with with_identity DI seam
New modules:
- agent-hub/mcp_service/ (5 files): HubMcpService with 3-scope HashMap (independent RwLock partitioning)
- agent-hub/spawn_registry.rs: single source of truth for parent chain (ADR §7)
- agent-hub/dispatch/secret_handlers.rs: hub/secret/* IPC handlers
- runtime/agent_loop/hub_health.rs: poller emits HubDegraded/HubRecovered
Modified crates:
- loopal-{agent-hub, agent-server, config, kernel, mcp, protocol, runtime, secret-runtime, session, view-state, tool-api}: Vault→SecretClient migration, McpSharing wiring, view-state integration
Test additions:
Architectural decisions
See docs/architecture.md (ADR table with 14 decisions) and docs/stage{1..4}-*.md (stage-by-stage design rationale).
Key principles:
- SecretClient awareness (per user feedback during review)
- (HubHealth.degraded_at_unix_ms, SpawnRegistry)
- RetryPolicy, env-var-tunable poll intervals
- SecretIpcError over wire: no string matching
Env vars
- LOOPAL_MCP_STARTUP_WAIT_SECS (default 5): bounded MCP startup wait
- LOOPAL_HUB_HEALTH_TICK_SECS (default 1): health poller cadence
- LOOPAL_MCP_POLL_SECS (default 3): sub-agent settle poll cadence
- LOOPAL_MCP_POLL_QUIET_STREAK (default 4): ticks of no new tools before poll terminates
- LOOPAL_HUB_DEGRADE_THRESHOLD: not yet exposed (default 3, future HubHealth::with_threshold)
Test plan
- agent/start
- --user-data-dir
- ⚠ Hub degraded XXs on Hub IPC failure