fix: add cross-cycle circuit breaker to respawn loop (#16) by doobidoo · Pull Request #17 · ModernOps888/mcplex

doobidoo · 2026-05-15T09:31:22Z

Problem

dead_server_monitor respawns dead stdio servers with exponential backoff, capped at MAX_RESPAWN_ATTEMPTS = 5. That cap only guards a single respawn task — it does not stop the crash-after-successful-connect loop:

1. StdioConnection::connect() succeeds: the child spawns and the MCP handshake completes.
2. The child crashes seconds later (a misconfigured backend, etc.).
3. The watchdog emits a fresh death event on death_rx.
4. dead_server_monitor tokio::spawns another respawn task with a fresh 1..=5 counter.
5. Goto 1, forever.
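
The loop above can be reduced to a small synchronous model. This is only an illustrative sketch: `respawn_task` and `simulate_crash_loop` are hypothetical names, not functions in mcplex, and the real monitor is async and spawns actual child processes. It assumes the backend always connects on the first attempt and then dies, which is the failure mode the steps describe.

```rust
/// Per-task cap named in the PR description.
const MAX_RESPAWN_ATTEMPTS: u32 = 5;

/// Hypothetical model of one respawn task: tries attempts 1..=MAX_RESPAWN_ATTEMPTS
/// and returns the attempt on which connect succeeded, or None if the cap was hit.
fn respawn_task(connect_succeeds_on: u32) -> Option<u32> {
    (1..=MAX_RESPAWN_ATTEMPTS).find(|&attempt| attempt == connect_succeeds_on)
}

/// Hypothetical model of the monitor without a cross-cycle breaker: every death
/// event starts a new task with a fresh counter. Returns total children spawned.
fn simulate_crash_loop(death_events: u32) -> u32 {
    let mut spawned = 0;
    for _ in 0..death_events {
        // Assumption: the broken backend connects on attempt 1, then crashes.
        if let Some(attempts_used) = respawn_task(1) {
            spawned += attempts_used;
        }
    }
    spawned
}
```

Because each task succeeds on its first attempt, no task ever gets near its own cap, so the total number of spawns grows with the number of death events rather than stopping at 5.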

In production a misconfigured mcp-telegram backend crash-looped this way, piling up ~76 concurrent child processes, exhausting file descriptors (Too many open files (os error 24)), and driving host load average to 143.

See #16 for the full root-cause writeup.

Fix

dead_server_monitor now keeps a per-server death-timestamp history. The monitor while loop is the single consumer of death_rx, so a plain HashMap<String, Vec<Instant>> needs no lock.

On each death the timestamp is recorded and stale entries (older than CRASH_WINDOW) are pruned. If a server has died more than MAX_CRASHES_IN_WINDOW times within the window, the circuit breaker trips: respawn is abandoned for that server and an error is logged telling the operator to fix the server and restart mcplex.

CRASH_WINDOW = 60s
MAX_CRASHES_IN_WINDOW = 5

This bounds the blast radius — a single broken backend can no longer melt the host.
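
A minimal sketch of the bookkeeping described in this section, assuming the two constants above. The `CrashHistory` type and `record_death` method are hypothetical names introduced for illustration; the PR describes the map as plain state owned by the monitor loop, and the actual diff may be structured differently. Taking `now` as a parameter is also an assumption, made so the logic can be exercised without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const CRASH_WINDOW: Duration = Duration::from_secs(60);
const MAX_CRASHES_IN_WINDOW: usize = 5;

/// Hypothetical wrapper around the per-server death-timestamp history.
/// No lock is needed as long as a single task owns it.
#[derive(Default)]
struct CrashHistory {
    deaths: HashMap<String, Vec<Instant>>,
}

impl CrashHistory {
    /// Records a death at `now`, prunes entries older than CRASH_WINDOW,
    /// and returns true when the breaker should trip for `server`.
    fn record_death(&mut self, server: &str, now: Instant) -> bool {
        let history = self.deaths.entry(server.to_string()).or_default();
        history.push(now);
        // Keep only deaths that fall inside the sliding window.
        history.retain(|&t| now.saturating_duration_since(t) <= CRASH_WINDOW);
        // "More than MAX_CRASHES_IN_WINDOW" means the sixth death trips it.
        history.len() > MAX_CRASHES_IN_WINDOW
    }
}
```

In the monitor loop, a `true` result would presumably mean skipping the respawn `tokio::spawn` for that server and logging the error instead; a `false` result would proceed with the existing backoff path.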

Notes

No new dependencies; HashMap + Instant from std.
Behavior unchanged for healthy servers and for servers that fail transiently within the existing 5-attempt window.
Branched off current upstream/master, which includes #13 ("Add mcplex doctor / healthcheck subcommand for liveness probing"), #14 ("Error log is append-only with no rotation/timestamp → stale errors mislead debugging"), #15 ("Docs: add launchd/systemd deployment + troubleshooting section to README/QUICKSTART"), #10 ("Dashboard 'Tool Statistics' empty on macOS — meta-tools skip ToolCall metric"), and #11 ("Gateway crashes (SIGABRT) when two MCP servers register overlapping tool names"). cargo build --release is clean.

Test

Built and deployed locally; mcplex reconnects 5/5 servers normally, host load back to ~2.

🤖 Generated with Claude Code

The MAX_RESPAWN_ATTEMPTS cap only guards a single respawn task. A stdio server whose connect() succeeds but then crashes seconds later emits a fresh death event every cycle, each spawning a new respawn task with its own fresh counter: an unbounded crash loop.

Observed in production: a misconfigured `mcp-telegram` backend (no api_id/api_hash) crash-looped, piling up ~76 concurrent child processes and exhausting file descriptors (`Too many open files`), driving host load average to 143.

dead_server_monitor now tracks death timestamps per server (the monitor while-loop is the single consumer of death_rx, so a plain HashMap needs no lock). A server that dies more than MAX_CRASHES_IN_WINDOW (5) times within CRASH_WINDOW (60s) trips the breaker: respawn is abandoned and an error is logged. Bounds the blast radius of any single broken backend.

Fixes ModernOps888#16

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

doobidoo mentioned this pull request May 15, 2026

Respawn loop has no cross-cycle circuit breaker — one crash-looping stdio server melts the host (load avg 143) doobidoo/mcplex#2

Closed

doobidoo wants to merge 1 commit into ModernOps888:master from doobidoo:fix/respawn-circuit-breaker
