fix: add cross-cycle circuit breaker to respawn loop (#16) #17

Open
doobidoo wants to merge 1 commit into ModernOps888:master from doobidoo:fix/respawn-circuit-breaker

@doobidoo

Problem

dead_server_monitor respawns dead stdio servers with exponential backoff, capped at MAX_RESPAWN_ATTEMPTS = 5. That cap only guards a single respawn task — it does not stop the crash-after-successful-connect loop:

  1. StdioConnection::connect() succeeds — child spawns, MCP handshake completes.
  2. Child crashes seconds later (a misconfigured backend, etc.).
  3. Watchdog emits a fresh death event on death_rx.
  4. dead_server_monitor tokio::spawns another respawn task with a fresh 1..=5 counter.
  5. Goto 1, forever.
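The loop above can be modeled in a few lines. This is a simplified, synchronous sketch, not the mcplex code: `respawn_task` and `simulate_crash_loop` are hypothetical names, backoff is omitted, and `connect` is a stand-in closure for `StdioConnection::connect()`. It shows why a per-task cap never fires when every connect succeeds on the first attempt.

```rust
const MAX_RESPAWN_ATTEMPTS: u32 = 5;

/// Hypothetical stand-in for one respawn task: retry `connect` up to the cap.
/// Returns the attempt number that succeeded, or None if the cap was reached.
fn respawn_task(connect: impl Fn(u32) -> bool) -> Option<u32> {
    (1..=MAX_RESPAWN_ATTEMPTS).find(|&attempt| connect(attempt))
}

/// Models `deaths` consecutive death events for a server whose connect
/// always succeeds but whose child crashes shortly afterwards.
/// Returns the total number of child processes spawned.
fn simulate_crash_loop(deaths: u32) -> u32 {
    let mut spawned = 0;
    for _ in 0..deaths {
        // Every death event starts a new task with its own fresh counter,
        // so the cap of 5 is never reached across cycles.
        if let Some(attempt) = respawn_task(|_| true) {
            spawned += attempt; // one child per connect attempt
        }
    }
    spawned
}
```

In this model the number of spawned children grows linearly with the number of death events, and nothing in `respawn_task` can stop it.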

In production a misconfigured mcp-telegram backend crash-looped this way, piling up ~76 concurrent child processes, exhausting file descriptors (Too many open files (os error 24)), and driving host load average to 143.

See #16 for the full root-cause writeup.

Fix

dead_server_monitor now keeps a per-server death-timestamp history. The monitor while loop is the single consumer of death_rx, so a plain HashMap<String, Vec<Instant>> needs no lock.

On each death the timestamp is recorded and stale entries (older than CRASH_WINDOW) are pruned. If a server has died more than MAX_CRASHES_IN_WINDOW times within the window, the circuit breaker trips: respawn is abandoned for that server and an error is logged telling the operator to fix the server and restart mcplex.

  • CRASH_WINDOW = 60s
  • MAX_CRASHES_IN_WINDOW = 5

This bounds the blast radius — a single broken backend can no longer melt the host.
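A minimal sketch of the bookkeeping described in this section, assuming the "more than MAX_CRASHES_IN_WINDOW" check runs after the new timestamp is recorded. `CrashHistory` and `record_death` are hypothetical names chosen for illustration; the actual patch may structure this differently. Taking `now` as a parameter keeps the logic testable without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const CRASH_WINDOW: Duration = Duration::from_secs(60);
const MAX_CRASHES_IN_WINDOW: usize = 5;

/// Hypothetical per-server death history. It is owned by the single
/// consumer of the death channel, so it needs no lock.
#[derive(Default)]
struct CrashHistory {
    deaths: HashMap<String, Vec<Instant>>,
}

impl CrashHistory {
    /// Records a death at `now`, prunes entries older than CRASH_WINDOW,
    /// and returns true when the breaker should trip for this server.
    fn record_death(&mut self, server: &str, now: Instant) -> bool {
        let history = self.deaths.entry(server.to_owned()).or_default();
        // Keep only deaths that are still inside the window.
        history.retain(|&t| now.saturating_duration_since(t) <= CRASH_WINDOW);
        history.push(now);
        history.len() > MAX_CRASHES_IN_WINDOW
    }
}
```

Under these assumptions the breaker trips on the sixth death inside any 60-second window, and the monitor would skip spawning a respawn task whenever `record_death` returns true.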

Test

Built and deployed locally; mcplex reconnects 5/5 servers normally, host load back to ~2.

🤖 Generated with Claude Code

The MAX_RESPAWN_ATTEMPTS cap only guards a single respawn task. A stdio
server whose connect() succeeds but then crashes seconds later emits a
fresh death event every cycle, each spawning a new respawn task with its
own fresh counter — an unbounded crash loop.

Observed in production: a misconfigured `mcp-telegram` backend (no
api_id/api_hash) crash-looped, piling up ~76 concurrent child processes
and exhausting file descriptors (`Too many open files`), driving host
load average to 143.

dead_server_monitor now tracks death timestamps per server (the monitor
while-loop is the single consumer of death_rx, so a plain HashMap needs
no lock). A server that dies more than MAX_CRASHES_IN_WINDOW (5) times
within CRASH_WINDOW (60s) trips the breaker: respawn is abandoned and an
error is logged. Bounds the blast radius of any single broken backend.

Fixes ModernOps888#16

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>