feat(hub): move Vault + MCP to Hub process with sharing model#179

Merged: yishuiliunian merged 3 commits into main from feat/hub-vault-mcp-migration on May 21, 2026

@yishuiliunian
Contributor

Summary

Systemic fix for the macmini-03-64 "Memory tool killed after 30s" incident: chrome-devtools-mcp's slow startup blocked agent/start, and SingletonLock collisions across concurrent agents made the situation unrecoverable. Instead of patching the symptom (raise timeout / move user-data-dir), this branch moves both Vault and MCP up to the Hub process and introduces a proper resource sharing model so the root causes can't recur.

  • Hub-centric architecture: the Agent is the execution unit; the Hub is the resource-management center
  • Three MCP sharing modes: HubSingleton (default, cwd-scoped) / PerAgent / SpawnTree; sub-agents reuse the root's MCP via proxy IPC
  • Resilience layer: HubHealth + 1Hz poller surface IPC degradation in the TUI; structured SecretIpcError over IPC; retry with exponential backoff
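
One way to picture the three sharing modes is as a function from the mode to the key that decides which MCP instance a request lands on. The sketch below is only an illustration: `ScopeKey`, `scope_key`, and the exact keying (cwd for HubSingleton, agent id for PerAgent, root agent id for SpawnTree) are assumptions inferred from the bullet descriptions, not the types in this branch.

```rust
use std::path::{Path, PathBuf};

/// Sharing modes named in the PR summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum McpSharing {
    HubSingleton,
    PerAgent,
    SpawnTree,
}

/// Hypothetical key: two requests with equal keys would share one MCP instance.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ScopeKey {
    Cwd(PathBuf),
    Agent(String),
    Tree(String),
}

/// Assumed mapping from sharing mode to scope key.
pub fn scope_key(sharing: McpSharing, cwd: &Path, agent_id: &str, root_id: &str) -> ScopeKey {
    match sharing {
        // One instance per working directory, shared by every agent in it.
        McpSharing::HubSingleton => ScopeKey::Cwd(cwd.to_path_buf()),
        // One instance per agent, never shared.
        McpSharing::PerAgent => ScopeKey::Agent(agent_id.to_string()),
        // One instance per spawn tree, keyed by the root agent.
        McpSharing::SpawnTree => ScopeKey::Tree(root_id.to_string()),
    }
}
```

Under this reading, two agents in the same cwd resolve to the same HubSingleton key, while two sub-agents of one root resolve to the same SpawnTree key regardless of cwd.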

Changes

New crates (2):

  • loopal-secret-client: SecretClient trait + HubSecretClient IPC impl + HubHealth + RetryPolicy + structured SecretIpcError
  • loopal-hub-vault: HubVaultService with with_identity DI seam

New modules:

  • agent-hub/mcp_service/ (5 files): HubMcpService with 3-scope HashMap (independent RwLock partitioning)
  • agent-hub/spawn_registry.rs: single source of truth for parent chain (ADR §7)
  • agent-hub/dispatch/secret_handlers.rs: hub/secret/* IPC handlers
  • runtime/agent_loop/hub_health.rs: poller emits HubDegraded / HubRecovered

Modified crates: loopal-{agent-hub, agent-server, config, kernel, mcp, protocol, runtime, secret-runtime, session, view-state, tool-api} — Vault→SecretClient migration, McpSharing wiring, view-state integration

Test additions:

  • 30+ unit + integration tests
  • 4 e2e tests with real age vault (byte-for-byte plaintext via IPC)
  • Mutation testing verified: 4 deliberate bugs each caught by 1-3 tests

Architectural decisions

See docs/architecture.md (ADR table with 14 decisions) and docs/stage{1..4}-*.md (stage-by-stage design rationale).

Key principles:

  1. Provider trait stays clean — no SecretClient awareness (per user feedback during review)
  2. Single-writer invariants enforced by code structure, not comments (HubHealth.degraded_at_unix_ms, SpawnRegistry)
  3. Mechanism / policy separation: RetryPolicy, env-var-tunable poll intervals
  4. Structured SecretIpcError over wire — no string matching

Env vars

  • LOOPAL_MCP_STARTUP_WAIT_SECS (default 5) — bounded MCP startup wait
  • LOOPAL_HUB_HEALTH_TICK_SECS (default 1) — health poller cadence
  • LOOPAL_MCP_POLL_SECS (default 3) — sub-agent settle poll cadence
  • LOOPAL_MCP_POLL_QUIET_STREAK (default 4) — ticks of no new tools before poll terminates
  • LOOPAL_HUB_DEGRADE_THRESHOLD: not yet exposed (the threshold defaults to 3; planned via a future HubHealth::with_threshold)
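
A tunable like the ones above is typically read once at startup with a fallback to the documented default. The helper below is a sketch of that pattern only: `secs_or_default` and `env_secs` are hypothetical names, and treating zero or unparsable values as "use the default" is an assumption, not behavior confirmed by this PR.

```rust
use std::time::Duration;

/// Parse an optional raw value as whole seconds.
/// Falls back to `default_secs` when the value is absent, unparsable, or zero
/// (the zero rule is an assumption made for this sketch).
pub fn secs_or_default(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Read a seconds-valued environment variable, e.g. LOOPAL_MCP_STARTUP_WAIT_SECS.
pub fn env_secs(name: &str, default_secs: u64) -> Duration {
    secs_or_default(std::env::var(name).ok().as_deref(), default_secs)
}
```

For example, `env_secs("LOOPAL_MCP_STARTUP_WAIT_SECS", 5)` would yield the 5-second default when the variable is unset.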

Test plan

  • CI passes (bazel build //... + bazel test //... + clippy zero-warning)
  • Verify on macmini-03-64: chrome-devtools-mcp slow startup no longer blocks agent/start
  • Verify multiple agents in different cwds get isolated --user-data-dir
  • Verify TUI shows ⚠ Hub degraded XXs on Hub IPC failure

Systemic fix for the macmini-03-64 "Memory tool killed after 30s"
incident: chrome-devtools-mcp's slow startup blocked agent/start, and
SingletonLock collisions across concurrent agents made the situation
unrecoverable. Instead of patching the symptom, this branch moves both
Vault and MCP up to the Hub process and introduces a proper resource
sharing model so the root causes can't recur.

Stage 1 — Vault on Hub
  - loopal-secret-client crate: SecretClient trait + HubSecretClient
    IPC impl + HubHealth + RetryPolicy + structured SecretIpcError
  - loopal-hub-vault crate: HubVaultService + with_identity DI seam
  - Agent gets plaintext only via hub/secret/* IPC; Provider trait stays
    clean (no SecretClient awareness, per architectural decision)
  - SpawnRegistry::verify_vault_access guards cwd-scoped boundaries

Stage 2 — MCP on Hub
  - HubMcpService in agent-hub owns LocalMcpProvider per scope
  - Sub-agents use McpProxyClient over IPC; never spawn MCP children
  - LocalMcpProvider spawns connections in background (fire-and-forget);
    bounded finalize_mcp_tools + late settle-poll for slow servers
  - LocalMcpProvider::has_server O(1) replaces N+1 list_tools probe
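
The "spawn in background, wait only a bounded time" idea in Stage 2 can be
sketched with plain std threads and a channel. This is a stand-in, not the
branch's code: the real implementation presumably uses the project's async
runtime, and `spawn_connect` / `wait_bounded` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Start a (possibly slow) connection in the background and return a handle
/// that will eventually yield its result.
pub fn spawn_connect<T, F>(connect: F) -> mpsc::Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone; a failed send is not an error here.
        let _ = tx.send(connect());
    });
    rx
}

/// Wait at most `budget` for the connection. `None` means "still starting":
/// the caller proceeds without it and can poll the same receiver again later.
pub fn wait_bounded<T>(rx: &mpsc::Receiver<T>, budget: Duration) -> Option<T> {
    rx.recv_timeout(budget).ok()
}
```

A slow server then costs the caller at most the budget, and a later call to
`wait_bounded` on the same receiver plays the role of the late settle-poll.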

Stage 3 — Sharing model
  - McpSharing { HubSingleton (default) | PerAgent | SpawnTree }
  - CwdIsolation { arg, cache_subdir }: config-driven chrome-style
    per-cwd --user-data-dir injection (no hardcoded server names)
  - SpawnRegistry as single source of truth for parent chain (ADR §7);
    walk_to_root returns Result distinguishing NotFound vs CycleDetected
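
A parent-chain walk that tells a missing entry apart from a cycle might look
like the sketch below. The storage layout (a map from agent id to optional
parent id), the `register` helper, and the `WalkError` payloads are
assumptions; only the names SpawnRegistry, walk_to_root, NotFound, and
CycleDetected come from this PR.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum WalkError {
    /// The chain reached an id the registry does not know.
    NotFound(String),
    /// The chain revisited an id, so it can never reach a root.
    CycleDetected(String),
}

#[derive(Default)]
pub struct SpawnRegistry {
    /// agent id -> parent id (`None` marks a root agent).
    parents: HashMap<String, Option<String>>,
}

impl SpawnRegistry {
    pub fn register(&mut self, agent: &str, parent: Option<&str>) {
        self.parents.insert(agent.to_string(), parent.map(str::to_string));
    }

    /// Follow parent links from `agent` until a root is reached.
    pub fn walk_to_root(&self, agent: &str) -> Result<String, WalkError> {
        let mut seen = HashSet::new();
        let mut current = agent.to_string();
        loop {
            // `insert` returns false when the id was already visited.
            if !seen.insert(current.clone()) {
                return Err(WalkError::CycleDetected(current));
            }
            match self.parents.get(&current) {
                None => return Err(WalkError::NotFound(current)),
                Some(None) => return Ok(current),
                Some(Some(parent)) => current = parent.clone(),
            }
        }
    }
}
```

Returning distinct variants lets a caller treat an unknown agent as a lookup
failure while treating a cycle as a registry-corruption bug.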

Stage 4 — Resilience
  - HubHealth { degraded_at_unix_ms, single-writer atomics }
  - 1Hz hub_health_poller emits HubDegraded/HubRecovered with synthetic
    initial emit at agent boot so a pre-degraded Hub surfaces in TUI
  - SessionViewState.hub_degraded_since_ms cleared on session_resume
  - emit failures log via tracing::warn instead of silent break
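
The Stage 4 pieces fit together roughly as follows: one writer records IPC
outcomes into atomics, and the poller compares what it last reported with the
current state and emits an event only on a transition. Everything below is a
sketch under assumptions: `record`, `degraded_since`, `poll_tick`, the
"0 means healthy" encoding, and the consecutive-failure counter are
hypothetical; the PR only names HubHealth, degraded_at_unix_ms,
HubDegraded/HubRecovered, and a future with_threshold.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub enum HealthEvent {
    HubDegraded { since_unix_ms: u64 },
    HubRecovered,
}

pub struct HubHealth {
    threshold: u32,
    consecutive_failures: AtomicU32,
    /// 0 means healthy; otherwise the unix-ms time at which degradation began.
    degraded_at_unix_ms: AtomicU64,
}

impl HubHealth {
    pub fn with_threshold(threshold: u32) -> Self {
        Self {
            threshold,
            consecutive_failures: AtomicU32::new(0),
            degraded_at_unix_ms: AtomicU64::new(0),
        }
    }

    /// Intended to be called from a single writer (the IPC client).
    pub fn record(&self, ok: bool, now_unix_ms: u64) {
        if ok {
            self.consecutive_failures.store(0, Ordering::Relaxed);
            self.degraded_at_unix_ms.store(0, Ordering::Release);
        } else {
            let failures = self.consecutive_failures.fetch_add(1, Ordering::Relaxed) + 1;
            // Keep the first degradation timestamp; later failures do not move it.
            if failures >= self.threshold && self.degraded_at_unix_ms.load(Ordering::Relaxed) == 0 {
                self.degraded_at_unix_ms.store(now_unix_ms, Ordering::Release);
            }
        }
    }

    pub fn degraded_since(&self) -> Option<u64> {
        match self.degraded_at_unix_ms.load(Ordering::Acquire) {
            0 => None,
            t => Some(t),
        }
    }
}

/// One poller tick: emit an event only when the state changed since the last tick.
pub fn poll_tick(health: &HubHealth, last_reported: &mut Option<u64>) -> Option<HealthEvent> {
    let now = health.degraded_since();
    let event = match (*last_reported, now) {
        (None, Some(t)) => Some(HealthEvent::HubDegraded { since_unix_ms: t }),
        (Some(_), None) => Some(HealthEvent::HubRecovered),
        _ => None,
    };
    *last_reported = now;
    event
}
```

Starting the poller with `last_reported = None` gives the boot-time behavior
described above for free: if the Hub is already degraded, the very first tick
emits HubDegraded.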

Phase 5 — Stage debt cleanup
  - spawn_proxy_mcp_settle_poll terminates after quiet_streak ticks
  - SecretIpcError JSON over IPC; classify_rpc reads RpcError directly,
    no string-matching
  - expand_template rewrites via manual cursor + push_str so plaintext
    lives in one buffer (zeroize on SecretString::from)
  - verify_caller returns SecretError, error paths uniformly structured
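
The quiet-streak termination rule for the settle poll can be sketched as a
small loop: keep polling while new tools keep appearing, and stop once a
configured number of consecutive ticks bring nothing new. The function name,
the `max_ticks` safety cap, and the omission of the sleep between ticks are
assumptions made to keep the example self-contained.

```rust
/// Poll a tool count until `quiet_streak` consecutive ticks show no growth.
/// Returns (tools seen, ticks polled). `max_ticks` is a hypothetical hard cap.
/// A real poller would sleep between ticks (cf. LOOPAL_MCP_POLL_SECS).
pub fn settle_poll(
    mut poll_tool_count: impl FnMut() -> usize,
    quiet_streak: u32,
    max_ticks: u32,
) -> (usize, u32) {
    let mut known = 0usize;
    let mut quiet = 0u32;
    let mut ticks = 0u32;
    while ticks < max_ticks {
        ticks += 1;
        let count = poll_tool_count();
        if count > known {
            // New tools appeared: remember them and restart the quiet streak.
            known = count;
            quiet = 0;
        } else {
            quiet += 1;
            if quiet >= quiet_streak {
                break;
            }
        }
    }
    (known, ticks)
}
```

With a streak of 4 (the documented default for LOOPAL_MCP_POLL_QUIET_STREAK),
a server that stops registering tools is abandoned four ticks later rather
than polled indefinitely.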

E2E + mutation testing
  - 30+ integration tests across view-state, mcp_service, agent-hub
  - e2e_real_vault_test exercises real age encrypt/decrypt → IPC →
    byte-for-byte plaintext assertion
  - cross-cwd request asserted to PermissionDenied (single variant)
  - 4 deliberate mutations each caught by 1-3 tests

Env vars exposed: LOOPAL_HUB_HEALTH_TICK_SECS, LOOPAL_MCP_POLL_SECS,
LOOPAL_MCP_POLL_QUIET_STREAK, LOOPAL_MCP_STARTUP_WAIT_SECS.

Windows CI runners have no ~/.ssh/id_ed25519 so
HubVaultService::with_noop_audit() (which calls discover()) was
panicking before reaching the actual permission check. Use
with_identity() with the test SSH key fixture instead, matching the
pattern already used in e2e_real_vault_test.rs.

yishuiliunian merged commit 5e41a0f into main on May 21, 2026 (4 checks passed)
yishuiliunian deleted the feat/hub-vault-mcp-migration branch on May 21, 2026 07:18

yishuiliunian added a commit that referenced this pull request on May 26, 2026:

Closes the e2e coverage gap identified in the post-deletion audit:
`build_kernel_from_config` has had no direct tests for its depth- and
production-based selection of the MCP backend since the original tests
were deleted in #179 (Vault+MCP-to-Hub migration).

Restored 7 tests covering still-applicable invariants:

- depth_zero_uses_local_backend_with_manager — root agent always owns a
  LocalMcpProvider (mcp_manager().is_some())
- depth_gt0_with_hub_client_uses_proxy_backend — sub-agent + hub_client →
  McpProxyClient, no local manager
- depth_gt0_without_hub_client_falls_back_to_local — defensive fallback
- non_production_skips_mcp_entirely — test mode keeps Local backend but
  skips MCP server spawn
- build_kernel_with_slow_mcp_server_returns_within_bounded_wait — core
  startup-resilience promise: 1s budget + overhead ≤3s wall-clock
- build_kernel_mixed_servers_completes_within_bounded_wait — even with
  multiple problem servers, build respects bounded wait
- sub_agent_build_with_slow_root_config_does_not_spawn_local_mcp — anti-
  process-explosion: depth>0 + hub_client skips local MCP spawn entirely
  (chrome-devtools-mcp duplication scenario from the original incident)
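
The branch selection these tests pin down can be summarized as a small
decision function. This is a sketch inferred from the test names only:
`McpBackend`, `select_backend`, and the `spawn_servers` flag are hypothetical,
and the behavior for a non-production sub-agent with a hub client is an
assumption the list above does not cover.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum McpBackend {
    /// Agent-side local provider; `spawn_servers` is false in non-production mode.
    Local { spawn_servers: bool },
    /// Sub-agent proxy that forwards MCP calls over IPC.
    Proxy,
}

/// Assumed selection rule: only a sub-agent (depth > 0) that has a hub client
/// uses the proxy; everything else falls back to the local backend.
pub fn select_backend(depth: u32, has_hub_client: bool, production: bool) -> McpBackend {
    if depth > 0 && has_hub_client {
        McpBackend::Proxy
    } else {
        McpBackend::Local { spawn_servers: production }
    }
}
```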

Two tests from the original were genuinely obsolete and NOT restored:
- build_kernel_with_failing_mcp_server_does_not_propagate_error
- build_kernel_skips_disabled_servers_entirely

Both asserted on `mcp_provider().snapshot()` showing per-server status.
Post-#179, MCP server spawning happens in `loopal-agent-hub::mcp_service`
(Hub side), not in the agent process. The agent's
`build_kernel_from_config` only selects the provider type — actual
spawn/failure-reporting is now the Hub's responsibility and is covered by
`loopal-agent-hub` tests.

Audit also confirmed that `goal_kickoff_barren_test` deleted in #180 is
legitimate — `barren_continuation_count` was replaced by ContinuationGate
+ DegenerationDetector, and is now covered by `degeneration_e2e_test.rs`
and `idle_e2e_test.rs`. No restoration needed there.

Adjusted McpServerConfig::Stdio variant to include the new fields
(`sharing`, `cwd_isolation`) introduced after 5e41a0f.