fix: add cross-cycle circuit breaker to respawn loop (#16) #17
Open
doobidoo wants to merge 1 commit into
The MAX_RESPAWN_ATTEMPTS cap only guards a single respawn task. A stdio server whose connect() succeeds but then crashes seconds later emits a fresh death event every cycle, each spawning a new respawn task with its own fresh counter: an unbounded crash loop.

Observed in production: a misconfigured `mcp-telegram` backend (no api_id/api_hash) crash-looped, piling up ~76 concurrent child processes and exhausting file descriptors (`Too many open files`), driving host load average to 143.

dead_server_monitor now tracks death timestamps per server (the monitor while-loop is the single consumer of death_rx, so a plain HashMap needs no lock). A server that dies more than MAX_CRASHES_IN_WINDOW (5) times within CRASH_WINDOW (60s) trips the breaker: respawn is abandoned and an error is logged. Bounds the blast radius of any single broken backend.

Fixes ModernOps888#16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
`dead_server_monitor` respawns dead stdio servers with exponential backoff, capped at `MAX_RESPAWN_ATTEMPTS = 5`. That cap only guards a single respawn task; it does not stop the crash-after-successful-connect loop:
1. `StdioConnection::connect()` succeeds: the child spawns and the MCP handshake completes.
2. The child crashes seconds later, and a death event arrives on `death_rx`.
3. `dead_server_monitor` `tokio::spawn`s another respawn task with a fresh `1..=5` counter.
In production a misconfigured `mcp-telegram` backend crash-looped this way, piling up ~76 concurrent child processes, exhausting file descriptors (`Too many open files (os error 24)`), and driving host load average to 143.
See #16 for the full root-cause writeup.
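To make the failure mode concrete, here is a minimal, synchronous sketch of why a per-task cap does not bound the loop. It is not mcplex code: `respawn_task` and `simulate_crash_loop` are hypothetical stand-ins, and the real monitor is async and spawns actual child processes. Only `MAX_RESPAWN_ATTEMPTS = 5` comes from the description above.

```rust
/// Per-task cap, as described in this PR.
const MAX_RESPAWN_ATTEMPTS: u32 = 5;

/// Hypothetical stand-in for one respawn task. It tries `connect` up to
/// MAX_RESPAWN_ATTEMPTS times and returns the attempt number that succeeded,
/// or None if every attempt failed (the only case the cap protects against).
fn respawn_task(connect_succeeds: impl Fn(u32) -> bool) -> Option<u32> {
    (1..=MAX_RESPAWN_ATTEMPTS).find(|&attempt| connect_succeeds(attempt))
}

/// Simulates `death_events` consecutive deaths of a server whose connect
/// always succeeds on the first attempt but which crashes shortly after.
/// Returns the total number of children spawned across all respawn tasks.
fn simulate_crash_loop(death_events: u32) -> u32 {
    let mut children_spawned = 0;
    for _ in 0..death_events {
        // Every death event starts a new task with its own fresh counter,
        // so no task ever gets anywhere near MAX_RESPAWN_ATTEMPTS.
        if let Some(attempts) = respawn_task(|_| true) {
            children_spawned += attempts;
        }
    }
    children_spawned
}
```

In this model each task uses exactly one attempt, so the total number of spawned children grows linearly with the number of death events and the cap never engages.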
Fix
`dead_server_monitor` now keeps a per-server death-timestamp history. The monitor `while` loop is the single consumer of `death_rx`, so a plain `HashMap<String, Vec<Instant>>` needs no lock.
On each death the timestamp is recorded and stale entries (older than `CRASH_WINDOW`) are pruned. If a server has died more than `MAX_CRASHES_IN_WINDOW` times within the window, the circuit breaker trips: respawn is abandoned for that server and an error is logged telling the operator to fix the server and restart mcplex.
- `CRASH_WINDOW` = 60s
- `MAX_CRASHES_IN_WINDOW` = 5
This bounds the blast radius: a single broken backend can no longer melt the host.
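The record, prune, and trip steps can be sketched roughly as follows. This is an illustration under assumptions, not the code in the diff: the `CrashTracker` type and its `record_death` method are hypothetical names, and the timestamp is passed in as a parameter so the logic can be exercised without waiting. The two constants and the `HashMap<String, Vec<Instant>>` shape come from the description above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Values stated in this PR.
const CRASH_WINDOW: Duration = Duration::from_secs(60);
const MAX_CRASHES_IN_WINDOW: usize = 5;

/// Hypothetical wrapper around the per-server death history. The monitor
/// loop is the only consumer of death events, so no lock is needed.
#[derive(Default)]
struct CrashTracker {
    deaths: HashMap<String, Vec<Instant>>,
}

impl CrashTracker {
    /// Records a death of `server` at `now`, prunes entries older than
    /// CRASH_WINDOW, and returns true if the breaker trips, i.e. the server
    /// has died more than MAX_CRASHES_IN_WINDOW times within the window.
    fn record_death(&mut self, server: &str, now: Instant) -> bool {
        let history = self.deaths.entry(server.to_string()).or_default();
        history.push(now);
        // Keep only deaths that fall inside the sliding window.
        history.retain(|&t| now.saturating_duration_since(t) <= CRASH_WINDOW);
        history.len() > MAX_CRASHES_IN_WINDOW
    }
}
```

With these values, five deaths in quick succession still allow a respawn and the sixth trips the breaker. A server that dies once every 30 seconds never holds more than three entries in its history, so it keeps being respawned.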
Notes
- `HashMap` + `Instant` from `std`.
- `upstream/master` (includes #13/#14/#15, #10/#11).
- `cargo build --release` clean.
Test
Built and deployed locally; mcplex reconnects 5/5 servers normally, host load back to ~2.
🤖 Generated with Claude Code