Raft leader re-election takes up to 90s after SIGTERM on a 3-node cluster #3229
Description
Environment
- Cluster: 3 nodes running ev-node-evm, raft consensus enabled
- Deployment: Docker Compose, one container per node
- Test: cyclic SIGTERM of the current raft leader, restarted after 60s, repeated every 150s
Expected behavior
After a SIGTERM on the current raft leader, the two surviving nodes elect a new leader in under 5 seconds.
Observed behavior
Election time is inconsistent and frequently exceeds the 5s target. Across 8 cycles of the same test:
| Cycle | Killed node | Election time |
|---|---|---|
| 1 | poc-ha-3 | +3s ✅ |
| 2 | poc-ha-1 | +3s ✅ |
| 3 | poc-ha-2 | +90s ❌ |
| 4 | poc-ha-2 | not detected ❌ |
| 5 | poc-ha-2 | +76s ❌ |
| 6 | poc-ha-1 | +3s ✅ |
| 7 | poc-ha-2 | +23s ❌ |
| 8 | poc-ha-2 | +24s ❌ |
The slow elections occur specifically when the same node is killed repeatedly while the previously killed node from the prior cycle has not yet fully rejoined. When different nodes are killed in rotation and the cluster is fully healthy between kills, election completes in 2–3s.
Errors observed in container logs
When the election is slow, the following errors appear in the logs of the node that was previously SIGTERM'd and restarted, while it is running as the active leader:
```
WRN apply channel full, dropping message component=raft-fsm
ERR raft: failed to heartbeat to: peer="..." error="dial tcp ...: connection refused"
Rollback failed: tx closed
WRN apply channel full, dropping message component=raft-fsm
Rollback failed: tx closed
Rollback failed: tx closed
```
The `Rollback failed: tx closed` messages appear continuously at a high rate (~30/s) while the node is producing blocks normally. Block production itself is not interrupted at this stage, but the raft FSM appears to be in a degraded state.
When the node subsequently receives SIGTERM, it exits with:
```
WRN timed out waiting for raft messages to land during shutdown error="max wait time reached"
ERR node error error="leader lock lost" component=main
Error: leader lock lost
```
Reproduction pattern
The degraded election is reliably triggered by this sequence:
- SIGTERM node A (leader) → node B becomes leader in ~3s ✅
- SIGTERM node B (leader, while node A is still restarting) → election takes 23–90s ❌
- SIGTERM node B again → election slow again ❌
- SIGTERM a different node (follower) → election back to ~3s ✅
- SIGTERM node B again → election slow again ❌
The degradation appears to be cumulative on a node that has been SIGTERM'd without the cluster being fully healthy first.
Raft configuration (from process flags)
```
--evnode.raft.heartbeat_timeout 350ms
--evnode.raft.leader_lease_timeout 175ms
--evnode.raft.send_timeout 200ms
```
Notes
- The `Rollback failed: tx closed` errors appear on the running leader before it is killed, not only on restart. This suggests the issue manifests during normal operation after an earlier unclean shutdown, not just at startup.
- When the condition is present, the surviving nodes do eventually elect a new leader, so the cluster self-heals; the concern is the duration.
- Container logs for the affected cycles are available locally if additional context is needed.