Raft leader re-election takes up to 90s after SIGTERM on a 3-node cluster #3229

@auricom

Description

Environment

  • Cluster: 3 nodes running ev-node-evm, raft consensus enabled
  • Deployment: Docker Compose, one container per node
  • Test: cyclic SIGTERM of the current raft leader, restarted after 60s, repeated every 150s

Expected behavior

After a SIGTERM on the current raft leader, the two surviving nodes elect a new leader in under 5 seconds.

Observed behavior

Election time is inconsistent and frequently exceeds the 5s target. Across 8 cycles of the same test:

Cycle   Killed node   Election time
  1     poc-ha-3      +3s ✅
  2     poc-ha-1      +3s ✅
  3     poc-ha-2      +90s
  4     poc-ha-2      not detected
  5     poc-ha-2      +76s
  6     poc-ha-1      +3s ✅
  7     poc-ha-2      +23s
  8     poc-ha-2      +24s

The slow elections occur specifically when the same node is killed repeatedly while the node killed in the prior cycle has not yet fully rejoined. When different nodes are killed in rotation and the cluster is fully healthy between kills, election completes in 2–3s.

Errors observed in container logs

When the election is slow, the following errors appear in the logs of the node that was previously SIGTERM'd and restarted, while it is running as the active leader:

WRN  apply channel full, dropping message  component=raft-fsm
ERR  raft: failed to heartbeat to: peer="..." error="dial tcp ...: connection refused"
     Rollback failed: tx closed
WRN  apply channel full, dropping message  component=raft-fsm
     Rollback failed: tx closed
     Rollback failed: tx closed

The "Rollback failed: tx closed" messages appear continuously at a high rate (~30/s) while the node is producing blocks normally. Block production itself is not interrupted at this stage, but the raft FSM appears to be in a degraded state.

When the node subsequently receives SIGTERM, it exits with:

WRN  timed out waiting for raft messages to land during shutdown  error="max wait time reached"
ERR  node error  error="leader lock lost"  component=main
Error: leader lock lost

Reproduction pattern

The degraded election is reliably triggered by this sequence:

  1. SIGTERM node A (leader) → node B becomes leader in ~3s ✅
  2. SIGTERM node B (leader, while node A is still restarting) → election takes 23–90s ❌
  3. SIGTERM node B again → election slow again ❌
  4. SIGTERM a different node (follower) → election back to ~3s ✅
  5. SIGTERM node B again → election slow again ❌

The degradation appears to accumulate on a node that has been SIGTERM'd before the cluster returned to full health.

Raft configuration (from process flags)

--evnode.raft.heartbeat_timeout     350ms
--evnode.raft.leader_lease_timeout  175ms
--evnode.raft.send_timeout          200ms

Notes

  • The "Rollback failed: tx closed" errors appear on the running leader before it is killed, not only on restart. This suggests the issue manifests during normal operation after an earlier unclean shutdown, not just at startup.
  • When the condition is present, the surviving nodes do eventually elect a new leader — the cluster self-heals. The concern is the duration.
  • Container logs for the affected cycles are available locally if additional context is needed.
