feat(hub): move Vault + MCP to Hub process with sharing model by yishuiliunian · Pull Request #179 · AgentsMesh/Loopal

yishuiliunian · 2026-05-21T06:43:49Z

Summary

Systemic fix for the macmini-03-64 "Memory tool killed after 30s" incident: chrome-devtools-mcp's slow startup blocked agent/start, and SingletonLock collisions across concurrent agents made the situation unrecoverable. Instead of patching the symptom (raise timeout / move user-data-dir), this branch moves both Vault and MCP up to the Hub process and introduces a proper resource sharing model so the root causes can't recur.

Hub-Centric architecture: Agent is execution unit; Hub is resource management center
Three MCP sharing modes: HubSingleton (default cwd-scoped) / PerAgent / SpawnTree — sub-agents reuse root's MCP via proxy IPC
Resilience layer: HubHealth + 1Hz poller surface IPC degradation in TUI; structured SecretIpcError over IPC; retry with exponential backoff
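
As a rough illustration of how the three sharing modes could map a caller onto a shared MCP instance, here is a minimal sketch. Only the `McpSharing` variant names come from this PR; `ScopeKey`, `Caller`, its fields, and `scope_key` are hypothetical names invented for the example, and the real resolution logic in `HubMcpService` may differ.

```rust
use std::path::PathBuf;

/// Sharing modes as named in the PR summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum McpSharing {
    HubSingleton,
    PerAgent,
    SpawnTree,
}

/// Hypothetical key: two callers that resolve to the same key share one MCP instance.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ScopeKey {
    Cwd(PathBuf),
    Agent(String),
    Tree(String),
}

/// Hypothetical caller context; the field names are assumptions.
pub struct Caller {
    pub agent_id: String,
    pub root_agent_id: String,
    pub cwd: PathBuf,
}

/// Resolve the sharing scope for a caller under the given mode.
pub fn scope_key(mode: McpSharing, caller: &Caller) -> ScopeKey {
    match mode {
        // One instance per working directory, shared by every agent in it.
        McpSharing::HubSingleton => ScopeKey::Cwd(caller.cwd.clone()),
        // One instance per agent, never shared.
        McpSharing::PerAgent => ScopeKey::Agent(caller.agent_id.clone()),
        // One instance per spawn tree, keyed by the root agent.
        McpSharing::SpawnTree => ScopeKey::Tree(caller.root_agent_id.clone()),
    }
}
```

Under these assumptions a root agent and its sub-agent in the same cwd resolve to the same key in HubSingleton and SpawnTree modes, and to different keys in PerAgent mode.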

Changes

New crates (2):

loopal-secret-client — SecretClient trait + HubSecretClient IPC impl + HubHealth + RetryPolicy + structured SecretIpcError
loopal-hub-vault — HubVaultService with with_identity DI seam
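
The retry behaviour mentioned for `loopal-secret-client` can be pictured with a small exponential-backoff policy. This is a sketch under stated assumptions: the PR names `RetryPolicy` but does not show its fields or methods, so `max_attempts`, `base_delay`, `max_delay`, `delay_for`, and `run` below are hypothetical. The sleep function is passed in so the policy stays testable without real delays.

```rust
use std::time::Duration;

/// Hypothetical policy shape; the field names are assumptions, not the crate's API.
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at max_delay.
    pub fn delay_for(&self, attempt: u32) -> Duration {
        let factor = 2u32.saturating_pow(attempt);
        self.base_delay.saturating_mul(factor).min(self.max_delay)
    }

    /// Run `op` until it succeeds or `max_attempts` calls have been made.
    /// `sleep` is injected so callers decide how waiting actually happens.
    pub fn run<T, E>(
        &self,
        mut op: impl FnMut() -> Result<T, E>,
        mut sleep: impl FnMut(Duration),
    ) -> Result<T, E> {
        let mut attempt = 0;
        loop {
            match op() {
                Ok(value) => return Ok(value),
                // Out of attempts: surface the last error unchanged.
                Err(err) if attempt + 1 >= self.max_attempts => return Err(err),
                Err(_) => {
                    sleep(self.delay_for(attempt));
                    attempt += 1;
                }
            }
        }
    }
}
```

Keeping the delay computation separate from the loop mirrors the mechanism / policy split the PR describes: the loop is fixed, while the numbers are data.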

New modules:

agent-hub/mcp_service/ (5 files) — HubMcpService with 3-scope HashMap (independent RwLock partitioning)
agent-hub/spawn_registry.rs — single source of truth for parent chain (ADR §7)
agent-hub/dispatch/secret_handlers.rs — hub/secret/* IPC handlers
runtime/agent_loop/hub_health.rs — poller emits HubDegraded / HubRecovered
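
The "3-scope HashMap (independent RwLock partitioning)" idea can be sketched as one map per scope, each behind its own lock, so contention in one scope does not block lookups in another. `Provider`, `Scope`, `ScopedProviders`, and `get_or_create` are hypothetical stand-ins chosen for this example; the actual `HubMcpService` layout is not shown in the PR.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for whatever each scope owns (the PR mentions LocalMcpProvider).
pub struct Provider {
    pub label: String,
}

/// Hypothetical scope selector matching the three sharing modes.
pub enum Scope {
    Cwd,
    Agent,
    Tree,
}

type Slot = RwLock<HashMap<String, Arc<Provider>>>;

/// One map per scope, each with its own lock.
#[derive(Default)]
pub struct ScopedProviders {
    by_cwd: Slot,
    by_agent: Slot,
    by_tree: Slot,
}

impl ScopedProviders {
    fn slot(&self, scope: &Scope) -> &Slot {
        match scope {
            Scope::Cwd => &self.by_cwd,
            Scope::Agent => &self.by_agent,
            Scope::Tree => &self.by_tree,
        }
    }

    /// Return the provider for `key`, creating it on first use.
    pub fn get_or_create(&self, scope: Scope, key: &str) -> Arc<Provider> {
        let slot = self.slot(&scope);
        {
            // Fast path: shared read lock, released at the end of this block.
            let guard = slot.read().unwrap();
            if let Some(existing) = guard.get(key) {
                return Arc::clone(existing);
            }
        }
        // Slow path: the entry API re-checks under the write lock, so two
        // racing callers still end up with the same provider.
        let mut guard = slot.write().unwrap();
        let provider = guard
            .entry(key.to_string())
            .or_insert_with(|| Arc::new(Provider { label: key.to_string() }));
        Arc::clone(provider)
    }
}
```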

Modified crates: loopal-{agent-hub, agent-server, config, kernel, mcp, protocol, runtime, secret-runtime, session, view-state, tool-api} — Vault→SecretClient migration, McpSharing wiring, view-state integration

Test additions:

30+ unit + integration tests
4 e2e tests with real age vault (byte-for-byte plaintext via IPC)
Mutation testing verified: 4 deliberate bugs each caught by 1-3 tests

Architectural decisions

See docs/architecture.md (ADR table with 14 decisions) and docs/stage{1..4}-*.md (stage-by-stage design rationale).

Key principles:

Provider trait stays clean — no SecretClient awareness (per user feedback during review)
Single-writer invariants enforced by code structure, not comments (HubHealth.degraded_at_unix_ms, SpawnRegistry)
Mechanism / policy separation: RetryPolicy, env-var-tunable poll intervals
Structured SecretIpcError over wire — no string matching
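
One way to read "single-writer invariants enforced by code structure" is to hand out exactly one non-cloneable writer handle, so only its owner can mutate the health state while everyone else gets a read-only view. The sketch below assumes that reading. Only `HubHealth` and `degraded_at_unix_ms` are names from the PR; `HealthWriter`, `hub_health`, and the "N consecutive failures" threshold semantics are assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Read side: shared freely, exposes no mutating methods.
pub struct HubHealth {
    /// 0 means healthy; otherwise the Unix time in ms when degradation began.
    degraded_at_unix_ms: AtomicU64,
}

impl HubHealth {
    pub fn degraded_since_ms(&self) -> Option<u64> {
        match self.degraded_at_unix_ms.load(Ordering::Acquire) {
            0 => None,
            ms => Some(ms),
        }
    }
}

/// Write side: not Clone, and its methods take `&mut self`, so there is
/// exactly one place that can change the state.
pub struct HealthWriter {
    shared: Arc<HubHealth>,
    consecutive_failures: u32,
    threshold: u32,
}

/// Create the shared reader and its single writer together.
pub fn hub_health(threshold: u32) -> (Arc<HubHealth>, HealthWriter) {
    let shared = Arc::new(HubHealth { degraded_at_unix_ms: AtomicU64::new(0) });
    let writer = HealthWriter {
        shared: Arc::clone(&shared),
        consecutive_failures: 0,
        threshold,
    };
    (shared, writer)
}

impl HealthWriter {
    /// Count a failed probe; mark degraded once the threshold is reached.
    pub fn record_failure(&mut self, now_unix_ms: u64) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        let already_degraded = self.shared.degraded_since_ms().is_some();
        if self.consecutive_failures >= self.threshold && !already_degraded {
            // Clamp to 1 so a zero timestamp never collides with the "healthy" sentinel.
            self.shared.degraded_at_unix_ms.store(now_unix_ms.max(1), Ordering::Release);
        }
    }

    /// A successful probe resets the streak and clears the degraded mark.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.shared.degraded_at_unix_ms.store(0, Ordering::Release);
    }
}
```

The timestamp of the first degraded tick is preserved across later failures, which is what a "degraded for XXs" display would need.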

Env vars

LOOPAL_MCP_STARTUP_WAIT_SECS (default 5) — bounded MCP startup wait
LOOPAL_HUB_HEALTH_TICK_SECS (default 1) — health poller cadence
LOOPAL_MCP_POLL_SECS (default 3) — sub-agent settle poll cadence
LOOPAL_MCP_POLL_QUIET_STREAK (default 4) — ticks of no new tools before poll terminates
LOOPAL_HUB_DEGRADE_THRESHOLD not yet exposed (default 3, future HubHealth::with_threshold)
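
For readers wiring these variables into their own tooling, a seconds-valued setting with a default can be read along the following lines. This is a sketch, not the PR's parsing code: `secs_or_default` and `env_secs` are hypothetical helpers, and treating `0` or non-numeric input as "use the default" is an assumption about the desired behaviour.

```rust
use std::time::Duration;

/// Parse a seconds value; fall back to `default_secs` when the value is
/// missing, not a non-negative integer, or zero (assumed policy).
pub fn secs_or_default(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Read `name` from the process environment and apply the same rules.
pub fn env_secs(name: &str, default_secs: u64) -> Duration {
    secs_or_default(std::env::var(name).ok().as_deref(), default_secs)
}
```

With this helper, `env_secs("LOOPAL_MCP_STARTUP_WAIT_SECS", 5)` would yield the documented 5-second default when the variable is unset.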

Test plan

CI passes (bazel build //... + bazel test //... + clippy zero-warning)
Verify on macmini-03-64: chrome-devtools-mcp slow startup no longer blocks agent/start
Verify multiple agents in different cwds get isolated --user-data-dir
Verify TUI shows ⚠ Hub degraded XXs on Hub IPC failure
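
The per-cwd `--user-data-dir` isolation checked in the test plan could work roughly as sketched below: derive a stable directory name from the agent's cwd and append the configured flag to the server's arguments. `CwdIsolation`, `arg`, and `cache_subdir` are names from the PR; `isolated_args`, the `flag=value` form, and the use of `DefaultHasher` for the directory name are assumptions made for illustration. `DefaultHasher` output is not guaranteed to be stable across Rust releases, so real code would likely want a fixed hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Config fields named in the PR: the flag to inject and the cache subdirectory.
pub struct CwdIsolation {
    pub arg: String,
    pub cache_subdir: String,
}

/// Hypothetical helper: append `<arg>=<cache_root>/<cache_subdir>/<hash of cwd>`
/// to the server's base arguments.
pub fn isolated_args(
    iso: &CwdIsolation,
    cache_root: &Path,
    cwd: &Path,
    base_args: &[String],
) -> Vec<String> {
    // Hash the cwd so each working directory gets its own data directory.
    let mut hasher = DefaultHasher::new();
    cwd.hash(&mut hasher);
    let dir: PathBuf = cache_root
        .join(&iso.cache_subdir)
        .join(format!("{:016x}", hasher.finish()));

    let mut args = base_args.to_vec();
    args.push(format!("{}={}", iso.arg, dir.display()));
    args
}
```

Because the flag and subdirectory come from configuration, nothing in the helper needs to know the server's name, which matches the "no hardcoded server names" goal.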

Systemic fix for the macmini-03-64 "Memory tool killed after 30s" incident: chrome-devtools-mcp's slow startup blocked agent/start, and SingletonLock collisions across concurrent agents made the situation unrecoverable. Instead of patching the symptom, this branch moves both Vault and MCP up to the Hub process and introduces a proper resource sharing model so the root causes can't recur.

Stage 1: Vault on Hub
- loopal-secret-client crate: SecretClient trait + HubSecretClient IPC impl + HubHealth + RetryPolicy + structured SecretIpcError
- loopal-hub-vault crate: HubVaultService + with_identity DI seam
- Agent gets plaintext only via hub/secret/* IPC; Provider trait stays clean (no SecretClient awareness, per architectural decision)
- SpawnRegistry::verify_vault_access guards cwd-scoped boundaries

Stage 2: MCP on Hub
- HubMcpService in agent-hub owns LocalMcpProvider per scope
- Sub-agents use McpProxyClient over IPC; never spawn MCP children
- LocalMcpProvider spawns connections in background (fire-and-forget); bounded finalize_mcp_tools + late settle-poll for slow servers
- LocalMcpProvider::has_server O(1) replaces N+1 list_tools probe

Stage 3: Sharing model
- McpSharing { HubSingleton (default) | PerAgent | SpawnTree }
- CwdIsolation { arg, cache_subdir }: config-driven chrome-style per-cwd --user-data-dir injection (no hardcoded server names)
- SpawnRegistry as single source of truth for parent chain (ADR §7); walk_to_root returns Result distinguishing NotFound vs CycleDetected

Stage 4: Resilience
- HubHealth { degraded_at_unix_ms, single-writer atomics }
- 1Hz hub_health_poller emits HubDegraded/HubRecovered with synthetic initial emit at agent boot so a pre-degraded Hub surfaces in TUI
- SessionViewState.hub_degraded_since_ms cleared on session_resume
- emit failures log via tracing::warn instead of silent break

Phase 5: Stage debt cleanup
- spawn_proxy_mcp_settle_poll terminates after quiet_streak ticks
- SecretIpcError JSON over IPC; classify_rpc reads RpcError directly, no string-matching
- expand_template rewrites via manual cursor + push_str so plaintext lives in one buffer (zeroize on SecretString::from)
- verify_caller returns SecretError, error paths uniformly structured

E2E + mutation testing
- 30+ integration tests across view-state, mcp_service, agent-hub
- e2e_real_vault_test exercises real age encrypt/decrypt → IPC → byte-for-byte plaintext assertion
- cross-cwd request asserted to PermissionDenied (single variant)
- 4 deliberate mutations each caught by 1-3 tests

Env vars exposed: LOOPAL_HUB_HEALTH_TICK_SECS, LOOPAL_MCP_POLL_SECS, LOOPAL_MCP_POLL_QUIET_STREAK, LOOPAL_MCP_STARTUP_WAIT_SECS.
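
The parent-chain walk described for SpawnRegistry, with distinct NotFound and CycleDetected outcomes, can be illustrated as follows. This is a sketch under assumptions: the PR states only that `walk_to_root` returns a `Result` distinguishing those two cases, so the `WalkError` type, the `register` method, the string agent IDs, and the map layout are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical error type carrying the agent ID where the walk stopped.
#[derive(Debug, PartialEq, Eq)]
pub enum WalkError {
    NotFound(String),
    CycleDetected(String),
}

/// Maps each registered agent to its parent; roots map to `None`.
#[derive(Default)]
pub struct SpawnRegistry {
    parent_of: HashMap<String, Option<String>>,
}

impl SpawnRegistry {
    pub fn register(&mut self, agent: &str, parent: Option<&str>) {
        self.parent_of.insert(agent.to_string(), parent.map(str::to_string));
    }

    /// Follow parent links from `agent` until a root is reached.
    pub fn walk_to_root(&self, agent: &str) -> Result<String, WalkError> {
        let mut seen = HashSet::new();
        let mut current = agent.to_string();
        loop {
            // Visiting the same agent twice means the chain loops back on itself.
            if !seen.insert(current.clone()) {
                return Err(WalkError::CycleDetected(current));
            }
            match self.parent_of.get(&current) {
                // Unregistered agent, or a parent link pointing nowhere.
                None => return Err(WalkError::NotFound(current)),
                // Registered with no parent: this is the root.
                Some(None) => return Ok(current),
                Some(Some(parent)) => current = parent.clone(),
            }
        }
    }
}
```

Separating the two errors lets a caller treat a missing entry as a stale or unknown agent and a cycle as registry corruption, rather than collapsing both into one failure.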

Windows CI runners have no ~/.ssh/id_ed25519 so HubVaultService::with_noop_audit() (which calls discover()) was panicking before reaching the actual permission check. Use with_identity() with the test SSH key fixture instead, matching the pattern already used in e2e_real_vault_test.rs.
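
The value of a `with_identity` style seam is that tests can supply a key directly instead of depending on what happens to exist on the runner. A minimal sketch of that pattern follows; `Identity`, `VaultError`, `VaultService`, and the `discover` signature here are hypothetical and deliberately simplified, and they are not the `HubVaultService` API.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical identity: just the location of a key file.
pub struct Identity {
    pub key_path: PathBuf,
}

#[derive(Debug, PartialEq, Eq)]
pub enum VaultError {
    NoIdentity,
}

pub struct VaultService {
    identity: Identity,
}

impl VaultService {
    /// Environment-dependent path: look for a key under `home`.
    /// Fails on machines without one (for example a fresh CI runner).
    pub fn discover(home: &Path) -> Result<Self, VaultError> {
        let key_path = home.join(".ssh").join("id_ed25519");
        if key_path.exists() {
            Ok(Self::with_identity(Identity { key_path }))
        } else {
            Err(VaultError::NoIdentity)
        }
    }

    /// Injection seam: the caller supplies the identity, so no filesystem
    /// lookup happens and the result is the same on every platform.
    pub fn with_identity(identity: Identity) -> Self {
        Self { identity }
    }

    pub fn key_path(&self) -> &Path {
        &self.identity.key_path
    }
}
```

A test built on `with_identity` reaches the logic under test regardless of the host's home directory, which is the property the Windows fix relies on.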

Closes the e2e coverage gap identified in the post-deletion audit: `build_kernel_from_config` had no direct tests for its depth-/production-based selection of the MCP backend after the original tests were deleted in #179 (Vault+MCP-to-Hub migration).

Restored 7 tests covering still-applicable invariants:
- depth_zero_uses_local_backend_with_manager: root agent always owns a LocalMcpProvider (mcp_manager().is_some())
- depth_gt0_with_hub_client_uses_proxy_backend: sub-agent + hub_client → McpProxyClient, no local manager
- depth_gt0_without_hub_client_falls_back_to_local: defensive fallback
- non_production_skips_mcp_entirely: test mode keeps Local backend but skips MCP server spawn
- build_kernel_with_slow_mcp_server_returns_within_bounded_wait: core startup-resilience promise, 1s budget + overhead ≤3s wall-clock
- build_kernel_mixed_servers_completes_within_bounded_wait: even with multiple problem servers, build respects bounded wait
- sub_agent_build_with_slow_root_config_does_not_spawn_local_mcp: anti-process-explosion, depth>0 + hub_client skips local MCP spawn entirely (chrome-devtools-mcp duplication scenario from the original incident)

Two tests from the original were genuinely obsolete and NOT restored:
- build_kernel_with_failing_mcp_server_does_not_propagate_error
- build_kernel_skips_disabled_servers_entirely

Both asserted on `mcp_provider().snapshot()` showing per-server status. Post-#179, MCP server spawning happens in `loopal-agent-hub::mcp_service` (Hub side), not in the agent process. The agent's `build_kernel_from_config` only selects the provider type; actual spawn and failure reporting are now the Hub's responsibility and are covered by `loopal-agent-hub` tests.

Audit also confirmed that the deletion of `goal_kickoff_barren_test` in #180 is legitimate: `barren_continuation_count` was replaced by ContinuationGate + DegenerationDetector, and that behavior is now covered by `degeneration_e2e_test.rs` and `idle_e2e_test.rs`. No restoration needed there.

Adjusted McpServerConfig::Stdio variant to include the new fields (`sharing`, `cwd_isolation`) introduced after 5e41a0f.
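
The bounded-wait promise exercised by the two `..._within_bounded_wait` tests can be pictured as: start the connection in the background, wait up to a budget, and return either way. The sketch below uses a thread and a channel to show the shape; `Startup` and `start_with_budget` are hypothetical names, and the real code path (`finalize_mcp_tools` plus the settle poll) is not reproduced here.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of a bounded startup wait.
pub enum Startup<T> {
    /// The server connected within the budget.
    Ready(T),
    /// Still connecting; the receiver yields the result when it arrives.
    StillStarting(mpsc::Receiver<T>),
    /// The background task ended without producing a result (e.g. it panicked).
    Failed,
}

/// Run `connect` on a background thread and wait at most `budget` for it.
pub fn start_with_budget<T: Send + 'static>(
    connect: impl FnOnce() -> T + Send + 'static,
    budget: Duration,
) -> Startup<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be dropped if the caller gave up; ignore that.
        let _ = tx.send(connect());
    });
    match rx.recv_timeout(budget) {
        Ok(value) => Startup::Ready(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Startup::StillStarting(rx),
        Err(mpsc::RecvTimeoutError::Disconnected) => Startup::Failed,
    }
}
```

A slow server then costs the caller at most the budget, and its tools can be picked up later from the returned receiver, which is the role the late settle poll plays in the PR's design.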

yishuiliunian added 3 commits May 21, 2026 14:42

fix: rustfmt across modified crates

9ec1243

yishuiliunian merged commit 5e41a0f into main May 21, 2026
4 checks passed

yishuiliunian deleted the feat/hub-vault-mcp-migration branch May 21, 2026 07:18

yishuiliunian mentioned this pull request May 26, 2026

test: restore build_kernel_depth_test (7 tests, originally from #179) #185

Merged

2 tasks
