
fix: Publisher-mode synchronization option for failover scenario #3222

Merged
julienrbrt merged 7 commits into main from alex/sync_race
Apr 7, 2026

Conversation

@alpe (Contributor) commented Mar 31, 2026

Overview

E2E HA tests sometimes fail due to a race where a freshly started leader blocks waiting for P2P sync to complete.

Summary by CodeRabbit

  • New Features

    • Publisher-mode sync startup and a "start for publishing" service mode for early P2P readiness and failover publishing.
  • Documentation

    • Clarified Raft bootstrap/startup docs and CLI help: bootstrap is compatibility-only; startup mode is auto-selected based on local state.
  • Improvements

    • More robust Raft startup/leader election, failover recovery, sync startup/retry, P2P initialization, and logging/diagnostics.
  • Tests

    • Expanded unit and E2E tests covering raft bootstrapping, failover, sync publishing, and logging.
  • Chores

    • Changelog updated with unreleased publisher-mode entry.


coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a publisher-mode startup path for sync services during Raft leader failover, refactors syncer lifecycle/status APIs and sync service startup/retry, adjusts Raft startup/election semantics and recovery, improves E2E P2P identities and logging, enables local Go module replaces, and updates docs and changelog.

Changes

  • Publisher-mode & failover (node/failover.go, test/e2e/failover_e2e_test.go): Add publisher-mode decision path for raft leaders with empty stores; wire publisher startup into failover; adjust E2E P2P address handling and leader detection.
  • Sync service lifecycle & status (pkg/sync/sync_service.go, pkg/sync/sync_service_test.go, pkg/sync/syncer_status.go, pkg/sync/syncer_status_test.go): Introduce cancellable start context, background p2p-init retry loop, startSyncer(bool,error), mutex-guarded startOnce/stopIfStarted, and corresponding tests.
  • Block sync recovery & tests (block/internal/syncing/syncer.go, block/internal/syncing/syncer_test.go): Extract trySyncNextBlockWithState(...), use explicit currentState during RecoverFromRaft, set InitialHeight on genesis fallback, and add Raft recovery tests.
  • Raft startup & election (pkg/raft/node.go, pkg/raft/node_test.go, pkg/raft/election.go, pkg/raft/election_test.go): Auto-select startup mode from persisted Raft config, refactor follower start/verification with polling and recover-on-sync-failure behavior, and add nil-receiver test and mock adjustments.
  • E2E infra & helpers (test/e2e/sut_helper.go, test/e2e/failover_e2e_test.go): Per-node libp2p identity/multiaddr support, separate listen vs peer addresses, improved process wait logging, conditional log-dir creation, and safer ticker/timeouts.
  • Docs, flags & changelog (docs/guides/raft_production.md, docs/learn/config.md, pkg/config/config.go, pkg/config/config_test.go, CHANGELOG.md): Document Raft startup-mode auto-selection and mark raft.bootstrap legacy; update example flag namespaces; add raft flag assertions in tests; add changelog entry for publisher-mode sync option.
  • Go module replace activation (apps/evm/go.mod): Uncomment local replace directives to resolve github.com/evstack/ev-node and .../execution/evm to local relative paths.

Sequence Diagram(s)

sequenceDiagram
    participant Failover as Failover Manager
    participant Raft as Raft Node
    participant Store as Block Store
    participant Sync as Sync Service
    participant P2P as P2P Init

    Note over Failover,Raft: failover / startup decision

    Failover->>Raft: query leader status & raft config
    Raft-->>Failover: leader? / persisted config?
    alt Leader & stores empty
        Failover->>Store: read header/data heights
        Store-->>Failover: heights empty
        Failover->>Sync: StartForPublishing(ctx)
        Sync->>Sync: prepareStart (create syncer, startSubscriber)
        Sync->>P2P: startSubscriber (no blocking init)
        Sync-->>Failover: publishing started
    else Normal startup
        Failover->>Sync: Start(ctx)
        Sync->>P2P: initFromP2POnce
        loop retry until started or ctx.Done
            P2P-->>Sync: success/fail
            Sync->>Sync: startSyncer if needed
        end
        Sync-->>Failover: normal started
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • tac0turtle
  • chatton

Poem

🐇 I hop with a ledger and tiny drum,
Publisher-mode wakes when the peers are numb,
Mutexes guard starts, retries hum in the night,
Raft finds its footing, P2P sparks the light,
I nibble changelogs and cheer: “All systems might!”

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 23.81%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check: ❓ Inconclusive. The description identifies the problem (race in E2E HA tests during fresh start) but lacks implementation details, a testing approach, or a reference to issue #3222. Resolution: expand the overview to explain the publisher-mode solution, how it addresses the race condition, and the testing strategy, and link the related issue (#3222) for complete context.
✅ Passed checks (1 passed)
  • Title check: ✅ Passed. The title accurately describes the primary change: adding publisher-mode synchronization for failover scenarios.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



github-actions bot commented Mar 31, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

  • Build: ✅ passed
  • Format: ⏩ skipped
  • Lint: ✅ passed
  • Breaking: ✅ passed
  • Updated (UTC): Apr 7, 2026, 4:54 PM

claude bot commented Mar 31, 2026

Claude finished @julienrbrt's task in 1m 49s


PR Review: Publisher-mode Synchronization for Failover

  • Gathering context and reading changed files
  • Reviewing sync service changes
  • Reviewing failover logic
  • Reviewing raft election changes
  • Reviewing block syncing changes
  • Reviewing tests
  • Final summary

Overview

The PR fixes a deadlock/race in the E2E HA failover scenario: a newly-elected Raft leader with an empty store was blocked waiting for P2P peers to provide block 1, while those peers were waiting for the leader to produce block 1. The fix introduces StartForPublishing, a new startup mode that skips the blocking P2P init and lets the leader produce the first block immediately.

The overall approach is sound. The extracted prepareStart/startSubscriber factoring is clean. Here are the findings:


Critical / Must-Fix

1. Stop() panics on partial startup — pkg/sync/sync_service.go:450

func (syncService *SyncService[H]) Stop(ctx context.Context) error {
    // unsubscribe from topic first so that sub.Stop() does not fail
    syncService.topicSubscription.Cancel()  // ← nil panic if startSubscriber never ran

If Start or StartForPublishing returns an error before startSubscriber assigns topicSubscription, then any subsequent Stop call will nil-panic. This is reachable in the existing Run method — if syncWg.Wait() returns an error from one service, the deferred stopService calls still fire for both.

Similarly, p2pServer, ex, and sub are only assigned in setupP2PInfrastructure, so Stop calling .Stop() on them can also panic if setup failed partway through.


Major

2. storeHeight read but not used in publisher-mode guard — node/failover.go:200

storeHeight, err := f.store.Height(ctx)   // read
// ...
if headerHeight > 0 || dataHeight > 0 {   // storeHeight never checked!
    return false
}

A node with existing blocks in the main store but empty sync stores (e.g., after a crash mid-sync) will still enter publisher mode and skip the blocking sync. The guard should be:

if storeHeight > 0 || headerHeight > 0 || dataHeight > 0 {


3. startSyncer passes request context to long-lived syncer — pkg/sync/sync_service.go:254

startSyncer is called from WriteToStoreAndBroadcast(ctx, ...), which can receive a short-lived request context. If that context is cancelled (timeout, HTTP request end) while syncer.Start(ctx) is running, the syncer may be left in a broken state. The syncer should be started with a service-scoped context that outlives individual requests.

4. apps/evm/go.mod has live replace directives pointing to ../../

replace (
    github.com/evstack/ev-node => ../../
    github.com/evstack/ev-node/execution/evm => ../../execution/evm
)

These were intentionally uncommented (the PR walkthrough mentions it), but committing local replace directives into a published module's go.mod breaks downstream consumers who go get the module. These should live in a go.work file for local dev, not in go.mod. If this is intentional for CI testing only, consider a comment and/or CI guard to prevent publishing.
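If the goal is local development linking only, a workspace file keeps go.mod clean for consumers. A hedged sketch of what the go.work alternative could look like (Go version and module layout assumed; the actual repo structure may differ):

```
go 1.22

use (
	.
	./execution/evm
	./apps/evm
)
```

go.work files are ignored by downstream `go get`, so the published go.mod stays free of repo-local paths.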


Minor / Nitpick

5. waitForMsgsLanded not cancellable during follower startup — pkg/raft/election.go:93

if err = d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {

waitForMsgsLanded internally uses context.Background(), so a shutdown/role-change while startFollower is in this wait will block for up to SendTimeout. Per the repo's own guidelines, "use context.Context for cancellation". Consider threading ctx through this call.

6. Genesis fallback in RecoverFromRaft sets incomplete state — block/internal/syncing/syncer.go:1206-1212

The fallback when store.GetState fails sets only ChainID, InitialHeight, and LastBlockHeight, but not AppHash, DAHeight, or LastBlockTime. The normal initializeState() path runs InitChain to seed all of these. Applying a raft block against this minimal state could miscompute the next AppHash. This is a low-probability path but worth hardening.

7. Duplicate log line — pkg/raft/election.go:144 and 148

After winning leadership from follower, the log "became leader, stopping follower operations" is printed twice (lines 130 and 144 in the final file).

8. Test assertion gap — pkg/sync/sync_service_test.go

The TestStartForPublishing test only asserts svc.storeInitialized after WriteToStoreAndBroadcast. A regression where publishing starts locally but never broadcasts to peers would pass. Consider asserting that client2 receives the broadcasted header via a require.Eventually.


Concurrency / Logic Review

The SyncerStatus mutex-based startOnce/stopIfStarted is a clean replacement for the previous atomic.Bool pattern. No issues found.

The initFromP2PWithRetry timeout-and-continue approach (defer to DA sync after 30s) is sensible. The exponential backoff resets correctly.

The verifyState polling loop for height-0 raft state is a good improvement over the immediate skip. The min(50ms, max(timeout/4, 1ms)) ticker calculation is slightly convoluted — worth a comment explaining it's capped to avoid spinning.


Summary

The core deadlock fix is correct and the factoring is clean. The two must-fix items before merge are:

  1. Nil-panic in Stop() on partial startup (affects reliability in error paths)
  2. storeHeight not used in publisher-mode guard (logic gap that widens the blast radius of publisher mode)

The apps/evm/go.mod replace directives need a decision on the right strategy (go.work vs go.mod).

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 53.84615% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.67%. Comparing base (ff88b95) to head (5218cc6).
⚠️ Report is 2 commits behind head on main.

Files with missing lines:
  • node/failover.go: 31.25% patch coverage (16 missing, 6 partials) ⚠️
  • pkg/sync/sync_service.go: 47.36% patch coverage (13 missing, 7 partials) ⚠️
  • pkg/raft/election.go: 53.12% patch coverage (10 missing, 5 partials) ⚠️
  • pkg/sync/syncer_status.go: 90.47% patch coverage (1 missing, 1 partial) ⚠️
  • pkg/raft/node.go: 0.00% patch coverage (1 missing) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3222      +/-   ##
==========================================
+ Coverage   61.45%   61.67%   +0.22%     
==========================================
  Files         120      120              
  Lines       12504    12635     +131     
==========================================
+ Hits         7684     7793     +109     
- Misses       3959     3968       +9     
- Partials      861      874      +13     
Flag coverage: combined 61.67% <53.84%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-07 17:09 UTC

@alpe alpe force-pushed the alex/sync_race branch 2 times, most recently from ec0ffc4 to 59dc917 Compare April 2, 2026 08:51
@alpe alpe force-pushed the alex/sync_race branch from 59dc917 to 74641e0 Compare April 2, 2026 10:06
@alpe alpe changed the title fix: Publisher-mode synchronization option for failover scenario [WIP]fix: Publisher-mode synchronization option for failover scenario Apr 2, 2026
@alpe alpe marked this pull request as ready for review April 2, 2026 10:28
@coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test/e2e/sut_helper.go (1)

189-195: ⚠️ Potential issue | 🟡 Minor

Close the per-process logfile handle.

os.Create on Line 192 leaves an open fd that is never closed. E2E suites spawn a lot of processes, so this can leak descriptors and delay flushing the captured logs.

Suggested change
 		logfile, err := os.Create(logfileName)
 		require.NoError(s.t, err)
+		s.t.Cleanup(func() { _ = logfile.Close() })
 		errReader = io.NopCloser(io.TeeReader(errReader, logfile))
 		outReader = io.NopCloser(io.TeeReader(outReader, logfile))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/sut_helper.go` around lines 189 - 195, The per-process logfile
created by os.Create (logfile) in the block that checks s.processLogDir() is
never closed; wrap the logfile so it is closed when the returned readers are
closed by replacing the io.NopCloser(io.TeeReader(..., logfile)) usage with a
ReadCloser that closes the underlying logfile on Close, or otherwise ensure
logfile.Close() is called (e.g., return a combined io.ReadCloser that delegates
Read to the TeeReader and Close to logfile.Close) for both errReader and
outReader so the file descriptor is released and buffered data flushed.
block/internal/syncing/syncer.go (1)

1204-1213: ⚠️ Potential issue | 🔴 Critical

Bootstrap the same genesis state as initializeState().

This fallback only sets ChainID, InitialHeight, and LastBlockHeight, but Line 808 later executes the recovered block against currentState.AppHash. On a fresh node that means raft recovery runs against a different execution baseline than normal startup, which calls exec.InitChain and seeds AppHash, DAHeight, and LastBlockTime. Please reuse that bootstrap path here before applying the raft block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/syncer.go` around lines 1204 - 1213, The fallback
after s.store.GetState should bootstrap the same full genesis state as
initializeState() instead of only setting ChainID/InitialHeight/LastBlockHeight;
update the recovery path that sets currentState and stateBootstrapped to call
the same initialization logic (e.g., run the InitChain/bootstrap sequence used
in initializeState) so that currentState.AppHash, DAHeight, LastBlockTime (and
any other genesis-seeded fields) are populated before the raft/recovered block
is executed; ensure you invoke the same executor/init routine (the InitChain or
genesis bootstrap used by initializeState) rather than manually setting only
those three fields.
pkg/sync/sync_service.go (1)

190-205: ⚠️ Potential issue | 🟠 Major

Unwind partially started components on startup errors.

By the time Line 233 returns from setupP2PInfrastructure, the store, exchange server, and exchange can already be running. If newSyncer fails here — or startSubscriber fails in the new publisher-mode path — these methods return without tearing those pieces down, so a failed start leaks live components into the process.

As per coding guidelines, "Be mindful of goroutine leaks in Go code".

Also applies to: 219-225, 231-245

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service.go` around lines 190 - 205, The startup may leave
store/exchange/exchange-server running if later steps fail (e.g. newSyncer or
startSubscriber); update the startup sequence (functions prepareStart /
setupP2PInfrastructure / initFromP2PWithRetry) to register and invoke cleanup on
error: after setupP2PInfrastructure returns and before calling newSyncer or
startSubscriber, capture the created components (store, exchangeServer, exchange
or any returned tearDown/Close funcs) and ensure they are stopped when a
subsequent error occurs (use a deferred cleanup or explicit teardown call on
error paths in the block around prepareStart -> initFromP2PWithRetry ->
newSyncer/startSubscriber). Reference symbols to change: prepareStart,
setupP2PInfrastructure, initFromP2PWithRetry, newSyncer, and startSubscriber —
ensure each path that returns an error calls the appropriate stop/Close/Shutdown
methods for the started components to avoid goroutine/resource leaks.
🧹 Nitpick comments (1)
pkg/sync/sync_service_test.go (1)

63-97: Assert remote peer delivery, not just local initialization.

This test pays the cost of bringing up client2, but the only postcondition is svc.storeInitialized. A regression where StartForPublishing initializes locally but never broadcasts to peers would still pass. Please add an assertion on peer 2's view of the header so the new publisher-mode path is actually covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service_test.go` around lines 63 - 97, The test currently only
asserts svc.storeInitialized after broadcasting; add an assertion that the
remote peer (client2) actually receives the broadcasted header: before calling
svc.WriteToStoreAndBroadcast, register or subscribe a handler on client2 to
capture incoming P2PSignedHeader messages (using client2's subscribe/handler
API), then after WriteToStoreAndBroadcast use require.Eventually to poll/assert
that the handler received a header equal to signedHeader (or matching
height/DataHash/AppHash). Reference client1/client2, NewHeaderSyncService,
StartForPublishing, WriteToStoreAndBroadcast and svc.storeInitialized when
locating where to add the subscription and the eventual assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: Remove the repo-local replace directives from the module
manifest (the replace block that points github.com/evstack/ev-node and
github.com/evstack/ev-node/execution/evm to relative ../../ paths) so the
published module no longer contains repository-specific local wiring; keep the
go.mod clean for downstream consumers and, if local development linking is
needed, use a go.work file or temporary local replace only in your working copy
before publishing.

In `@block/internal/syncing/syncer.go`:
- Around line 1230-1235: The retry uses the identical currentState so
errInvalidState from AssertValidForNextState will always recur; before retrying
the call to trySyncNextBlockWithState, refresh the snapshot/state (e.g. read
s.GetLastState() or obtain the latest snapshot used by AssertValidForNextState)
and pass that refreshed state into trySyncNextBlockWithState instead of reusing
the original currentState, or recompute/validate the next state after calling
s.SetLastState(currentState) so the second call sees the updated global state.

In `@CHANGELOG.md`:
- Around line 12-21: The changelog contains unresolved git conflict markers
(<<<<<<<, =======, >>>>>>>) — remove those markers and produce a single resolved
section (e.g., keep "### Added") combining both bullet entries into the final
list so both PRs stay: the publisher-mode synchronization line [`#3222`] and the
"Improve execution/evm check..." line [`#3221`]; ensure only one header remains
and that the PR links and bullets are intact and properly formatted.

In `@docs/guides/raft_production.md`:
- Around line 93-100: Update the example to use the canonical aggregator flag
and drop the legacy bootstrap flag: replace the --rollkit.node.aggregator=true
occurrence with --evnode.node.aggregator=true, and remove the
--evnode.raft.bootstrap=true line (only keep raft-related flags like
--evnode.raft.enable, --evnode.raft.node_id, --evnode.raft.raft_addr,
--evnode.raft.raft_dir, and --evnode.raft.peers). Ensure the rest of the example
(e.g., --rollkit.p2p.listen_address) stays unchanged unless also intended to be
canonicalized.

In `@node/failover.go`:
- Around line 193-209: The gating logic ignores storeHeight when deciding to
start in publisher mode; change the condition that currently checks headerHeight
and dataHeight to include storeHeight so publishing only starts when the main
block store and both sync stores are empty. In other words, adjust the boolean
check around headerSyncService.Store().Height() and
dataSyncService.Store().Height() (the if that returns false) to also consider
the local storeHeight variable so StartForPublishing is only taken when
storeHeight == 0 && headerHeight == 0 && dataHeight == 0.

In `@pkg/raft/election.go`:
- Around line 93-96: The wait here uses d.node.waitForMsgsLanded which
internally uses context.Background(), so follower startup isn’t cancellable;
update this call to use the Run cancellation context (e.g., pass the existing
ctx) and add a ctx-aware variant on the node (e.g., waitForMsgsLandedCtx(ctx,
timeout) or change waitForMsgsLanded to accept a context) so the call in
election.go uses that context and honors cancellation/role changes; modify the
node implementation (pkg/raft/node.go – waitForMsgsLanded) to accept and
propagate the provided context instead of creating context.Background().



ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8b5476c3-9ec8-4d67-a4cb-fcde2300edbd

📥 Commits

Reviewing files that changed from the base of the PR and between 022b565 and 74641e0.

⛔ Files ignored due to path filters (1)
  • apps/evm/go.sum is excluded by !**/*.sum
📒 Files selected for processing (19)
  • CHANGELOG.md
  • apps/evm/go.mod
  • block/internal/syncing/syncer.go
  • block/internal/syncing/syncer_test.go
  • docs/guides/raft_production.md
  • docs/learn/config.md
  • node/failover.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/raft/election.go
  • pkg/raft/election_test.go
  • pkg/raft/node.go
  • pkg/raft/node_test.go
  • pkg/sync/sync_service.go
  • pkg/sync/sync_service_test.go
  • pkg/sync/syncer_status.go
  • pkg/sync/syncer_status_test.go
  • test/e2e/failover_e2e_test.go
  • test/e2e/sut_helper.go

Comment on lines +1230 to +1235
err := s.trySyncNextBlockWithState(ctx, event, currentState)
if err != nil && stateBootstrapped && errors.Is(err, errInvalidState) {
s.logger.Debug().Err(err).Msg("raft recovery failed after bootstrap state init, retrying once")
// Keep strict validation semantics; this retry only guards startup ordering races.
s.SetLastState(currentState)
err = s.trySyncNextBlockWithState(ctx, event, currentState)

⚠️ Potential issue | 🟠 Major

The retry cannot clear errInvalidState.

Lines 719-727 validate against the currentState argument, and Line 1235 passes the same snapshot again. s.SetLastState(currentState) only updates the global pointer, so an errInvalidState from AssertValidForNextState will reproduce deterministically on the retry.


Comment on lines +93 to +96
if err = d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
// this wait can legitimately time out
d.logger.Debug().Err(err).Msg("timed out waiting for raft messages before follower verification; continuing")
}

⚠️ Potential issue | 🟡 Minor

Keep follower startup cancellable.

This new wait path ignores Run cancellation because pkg/raft/node.go:145-175 implements waitForMsgsLanded with its own context.Background() timeout. On shutdown or role changes, follower startup can now sit here for up to SendTimeout. A ctx-aware variant would keep this path responsive. As per coding guidelines, "Use context.Context for cancellation in Go".


@alpe alpe changed the title [WIP]fix: Publisher-mode synchronization option for failover scenario fix: Publisher-mode synchronization option for failover scenario Apr 2, 2026
@alpe alpe requested a review from julienrbrt April 2, 2026 11:42
@coderabbitai bot left a comment


♻️ Duplicate comments (1)
docs/guides/raft_production.md (1)

93-100: ⚠️ Potential issue | 🟡 Minor

Remove the bootstrap flag from the example command.

The documentation now describes --evnode.raft.bootstrap as a compatibility-only flag (line 36) and states that "startup mode is selected automatically from persisted raft configuration state" (lines 60-62). However, the example still includes --evnode.raft.bootstrap=true on line 98, making the configuration appear more complex than necessary and contradicting the "optional" guidance.

For the primary/happy-path example, omit the bootstrap flag to demonstrate the recommended automatic mode detection.

✂️ Proposed fix
   --evnode.raft.enable=true \
   --evnode.raft.node_id="node-1" \
   --evnode.raft.raft_addr="0.0.0.0:5001" \
   --evnode.raft.raft_dir="/var/lib/ev-node/raft" \
-  --evnode.raft.bootstrap=true \
   --evnode.raft.peers="node-1@10.0.1.1:5001,node-2@10.0.1.2:5001,node-3@10.0.1.3:5001" \
   --evnode.p2p.listen_address="/ip4/0.0.0.0/tcp/26656" \

As per coding guidelines, docs/guides/**/*.md: Structure guides with clear step-by-step instructions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` around lines 93 - 100, The example command
still includes the compatibility-only flag --evnode.raft.bootstrap=true which
contradicts the guide text that startup mode is selected automatically; remove
the --evnode.raft.bootstrap option from the example lines (the stanza containing
--evnode.raft.enable, --evnode.raft.node_id, --evnode.raft.raft_addr,
--evnode.raft.raft_dir, --evnode.raft.peers, and --evnode.p2p.listen_address) so
the primary/happy-path example demonstrates automatic mode detection without the
bootstrap flag.
🧹 Nitpick comments (2)
docs/guides/raft_production.md (2)

58-67: Fix markdown formatting issues.

Static analysis flagged several markdown formatting violations in this section:

  • Line 58: Missing blank line above the heading
  • Lines 60-66: Inconsistent spacing after list markers (3 spaces instead of 1)

These don't affect the rendered output in most viewers but should be fixed for consistency and linter compliance.

🔧 Proposed formatting fixes
 
 ### 1. Static Peering & Automatic Startup Mode
 Use static peering with automatic mode selection from local raft configuration:
-*   If local raft configuration already exists in `--evnode.raft.raft_dir`, the node starts in rejoin mode.
-*   If no local raft configuration exists yet, the node bootstraps from configured peers.
-*   `--evnode.raft.bootstrap` is retained for compatibility but does not control mode selection.
-*   **All configured cluster members** should list the full set of peers in `--evnode.raft.peers`.
+* If local raft configuration already exists in `--evnode.raft.raft_dir`, the node starts in rejoin mode.
+* If no local raft configuration exists yet, the node bootstraps from configured peers.
+* `--evnode.raft.bootstrap` is retained for compatibility but does not control mode selection.
+* **All configured cluster members** should list the full set of peers in `--evnode.raft.peers`.
 *   The `peers` list format is strict: `NodeID@Host:Port`.
-*   **Limitation**: Dynamic addition of peers (run-time membership changes) via RPC/CLI is not currently exposed.
-*   **Not supported**: Joining an existing cluster as a brand-new node that was not part of the initial static membership.
+* **Limitation**: Dynamic addition of peers (run-time membership changes) via RPC/CLI is not currently exposed.
+* **Not supported**: Joining an existing cluster as a brand-new node that was not part of the initial static membership.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` around lines 58 - 67, Add a blank line before
the "### 1. Static Peering & Automatic Startup Mode" heading and normalize the
list bullets to use a single space after each marker and consistent single-space
indentation; update the list items referencing flags like
--evnode.raft.raft_dir, --evnode.raft.bootstrap, --evnode.raft.peers and the
peers format `NodeID@Host:Port` so each bullet begins with "- " (dash + one
space) and remove extra spaces after markers to satisfy markdown linter
expectations.

36-36: Clarify what "compatibility flag" means.

The description states the bootstrap flag is for "compatibility" but doesn't explain with what or whether it has any effect. Consider being explicit: does the flag still influence behavior, or is it completely ignored? If it's retained only for backward config-file compatibility (parsed but unused), state that clearly.

📝 Suggested clarification
-| `--evnode.raft.bootstrap` | `raft.bootstrap` | Compatibility flag. Startup mode is selected automatically from persisted raft configuration state. | optional |
+| `--evnode.raft.bootstrap` | `raft.bootstrap` | Legacy flag (parsed but unused). Startup mode is selected automatically from persisted raft configuration state. | omit |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/guides/raft_production.md` at line 36, The table entry for
`--evnode.raft.bootstrap` / `raft.bootstrap` is vague about "compatibility
flag"; update the description to explicitly state its current behavior (e.g., if
it is retained only for backward compatibility and parsed but ignored at
runtime, change the text to "Deprecated compatibility flag — parsed for backward
compatibility but ignored; startup mode is selected automatically from persisted
Raft configuration state"; if it still affects startup behavior, state that and
describe the exact effect). Make the change to the table cell containing
`--evnode.raft.bootstrap` / `raft.bootstrap` so readers know whether the flag is
effective or merely retained for legacy config parsing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b93e6a52-29c2-4bb1-a32e-10cacf79f60b

📥 Commits

Reviewing files that changed from the base of the PR and between 53f5ded and a2607e5.

📒 Files selected for processing (1)
  • docs/guides/raft_production.md

Source: "",
})
}
err := s.trySyncNextBlockWithState(ctx, event, currentState)
Member

this change is effectively to use the correct state when we override it at L1207, right?

Contributor Author

yes, before this change the state was re-read inside the method and could be empty

func (syncService *SyncService[H]) startSyncer(ctx context.Context) error {
if syncService.syncerStatus.isStarted() {
return nil
// StartForPublishing starts the sync service in publisher mode.
Member

Trying to wrap my head around this. Why is it exactly needed? A node can fetch block 1 over DA and p2p 🤔.
Or is it because the connection happens before DA fetching, effectively deadlocking it?

Contributor Author

@alpe alpe Apr 2, 2026

There was a race that could lead to a deadlock:

  • P2P init waits for block 1 from peers.
  • Block 1 production waits for P2P init to finish.

Member

Ok. I think in all cases we shouldn't need to wait for p2p init to start when we have DA, so we can probably simplify this to the latter.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8102abcd-d222-49a2-bcda-27bfca7d4cf5

📥 Commits

Reviewing files that changed from the base of the PR and between a2607e5 and 1b8da2e.

⛔ Files ignored due to path filters (1)
  • apps/evm/go.sum is excluded by !**/*.sum
📒 Files selected for processing (1)
  • apps/evm/go.mod

Comment on lines +5 to +8
replace (
github.com/evstack/ev-node => ../../
github.com/evstack/ev-node/execution/evm => ../../execution/evm
)
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if execution/evm has local replace directives or uses registry versions

echo "=== Checking execution/evm/go.mod for replace directives ==="
rg -n "^replace" execution/evm/go.mod

echo ""
echo "=== Checking what version of ev-node execution/evm requires ==="
rg -n "github.com/evstack/ev-node" execution/evm/go.mod | head -5

echo ""
echo "=== Checking for active go.work file that might reconcile versions ==="
fd -H '^go\.work$' . --max-depth 1

Repository: evstack/ev-node

Length of output: 385


Add replace directive to execution/evm/go.mod for version consistency.

execution/evm/go.mod explicitly requires github.com/evstack/ev-node v1.0.0 from the registry (line 7), but apps/evm/go.mod replaces it with the local ../../ path. To avoid implicit version inconsistency and ensure that execution/evm explicitly uses the same local ev-node code as apps/evm, add the matching replace directive to execution/evm/go.mod.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/evm/go.mod` around lines 5 - 8, Add the same replace directive used by
apps/evm to the execution/evm module so it uses the local
github.com/evstack/ev-node source instead of the v1.0.0 registry version: edit
the execution/evm module's go.mod and add a replace entry mapping
github.com/evstack/ev-node to the local ../../ path (and similarly map
github.com/evstack/ev-node/execution/evm to ../../execution/evm if needed) so
both modules resolve the same local codebase.
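A sketch of the addition the prompt asks for in execution/evm/go.mod (path assumed relative to that module; verify against the actual module layout):

```
// execution/evm/go.mod — resolve the parent module from the local checkout
replace github.com/evstack/ev-node => ../..
```

This mirrors the apps/evm directive one level shallower, so both modules build against the same local source.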

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/sync/sync_service.go`:
- Line 428: The Stop method currently calls
syncService.topicSubscription.Cancel() which can nil-panics if Start failed
before startSubscriber/prepareStart completed; add a nil guard in Stop so it
only calls Cancel when syncService.topicSubscription != nil (and similarly
protect any other dereferences of topicSubscription), ensuring Start/Stop handle
partial startup without panicking; locate symbols Stop, Start, startSubscriber,
prepareStart, and topicSubscription to apply the check.
- Around line 304-318: startSyncer currently forwards the caller's request
context into syncer.Start, which can cause the syncer startup to be cancelled by
request timeouts; modify startSyncer so the function uses a service-scoped
background context when invoking syncer.Start (instead of the incoming ctx) —
e.g., use an existing SyncService field like syncService.ctx or create a
detached context with context.WithCancel tied to the service lifecycle, and pass
that into the closure given to syncerStatus.startOnce when calling syncer.Start;
keep the startOnce usage and error handling the same but ensure the syncer.Start
call is insulated from request cancellation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd1d47db-74de-45a4-b680-de1b903700cf

📥 Commits

Reviewing files that changed from the base of the PR and between 1b8da2e and 0982a03.

📒 Files selected for processing (3)
  • node/failover.go
  • pkg/sync/sync_service.go
  • pkg/sync/sync_service_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • node/failover.go
  • pkg/sync/sync_service_test.go

syncService.wg.Wait()

// unsubscribe from topic first so that sub.Stop() does not fail
syncService.topicSubscription.Cancel()
Contributor

⚠️ Potential issue | 🟡 Minor

Potential nil dereference if Stop called after partial Start failure.

If Start() fails before startSubscriber() completes (e.g., during prepareStart()), then topicSubscription will be nil. Calling Cancel() on nil will panic.

🛡️ Add nil guard
-	syncService.topicSubscription.Cancel()
+	if syncService.topicSubscription != nil {
+		syncService.topicSubscription.Cancel()
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/sync/sync_service.go` at line 428, The Stop method currently calls
syncService.topicSubscription.Cancel() which can nil-panics if Start failed
before startSubscriber/prepareStart completed; add a nil guard in Stop so it
only calls Cancel when syncService.topicSubscription != nil (and similarly
protect any other dereferences of topicSubscription), ensuring Start/Stop handle
partial startup without panicking; locate symbols Stop, Start, startSubscriber,
prepareStart, and topicSubscription to apply the check.

@julienrbrt julienrbrt merged commit d163059 into main Apr 7, 2026
35 of 36 checks passed
@julienrbrt julienrbrt deleted the alex/sync_race branch April 7, 2026 17:06
