Skip to content

fix(datasets-raw): replace random provider selection with attempt-based rotation#1995

Open
Theodus wants to merge 4 commits intomainfrom
theodus/rpc-fallback
Open

fix(datasets-raw): replace random provider selection with attempt-based rotation#1995
Theodus wants to merge 4 commits intomainfrom
theodus/rpc-fallback

Conversation

@Theodus
Copy link
Member

@Theodus Theodus commented Mar 19, 2026

Replace random provider shuffling with deterministic provider selection based on job attempt count. On each retry, the job selects the next provider in lexicographic order, cycling through all available providers.

  • Add find_providers method returning all matching providers in deterministic order
  • Select provider in job_impl using attempt_count % num_providers
  • Remove rand dependency

claude[bot]

This comment was marked as outdated.

Copy link
Contributor

@shiyasmohd shiyasmohd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Added some suggestions as comments.

My only concern is, this approach forces us to have a fallback provider.
Eg:- If only one provider is configured, and if it fails, the job process fails instantly without retry (correct me if i'm wrong). This job will the retry only when the scheduler retries the next time. My suggestion would be to add 3 or 5 retries with fallback. Leaving this decision to you. This could be added in a follow up PR or this PR itself.

@Theodus Theodus marked this pull request as draft March 23, 2026 14:34
Replace random provider selection with deterministic ordering and add
BlockStreamerWithFallback to try each provider once on recoverable errors.
- Add `BlockStreamerWithFallback` that iterates providers in order,
  falling back to the next on recoverable errors and aborting on fatal
- Add `find_providers` and `create_block_stream_clients` to
  `ProvidersRegistry` returning all matching providers in name order
- Add `AllClientCreationsFailed` error variant to distinguish from
  `ProviderNotFound` when providers exist but all fail to connect
- Replace random shuffle with deterministic lexicographic selection
- Remove `rand` dependency from `providers-registry`
@Theodus Theodus force-pushed the theodus/rpc-fallback branch 2 times, most recently from 85e4bfc to 3cd3996 Compare March 23, 2026 15:40
@Theodus Theodus changed the title feat(datasets-raw): add provider fallback for block streaming fix(datasets-raw): replace random provider selection with attempt-based rotation Mar 23, 2026
…ed rotation

Replace random provider shuffling with deterministic provider selection
based on job attempt count. On each retry, the job selects the next
provider in lexicographic order, cycling through all available providers.

- Add find_providers method returning all matching providers in deterministic order
- Select provider in job_impl using attempt_count % num_providers
- Remove rand dependency
Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Most important finding: Likely off-by-one bug in provider rotation. get_attempt_count returns 1 on the first job execution (it counts SCHEDULED events), so 1 % num_providers skips provider index 0 on the initial attempt. The first provider in lexicographic order is only used when attempt is a multiple of num_providers.

Other findings:

  • Silent error swallowing when get_attempt_count DB query fails (no logging)
  • Silent skip of unparseable provider configs in find_providers (no warning logged, unlike env substitution failures)
  • Doc/implementation mismatch in provider-registry.md — describes single-provider selection but find_providers returns a list for callers to select from
  • The ordering guarantee of find_providers depends on BTreeMap iteration order — worth documenting this dependency
  • eth_call always picks the first provider while block streaming rotates — potential provider affinity concern noted via TODO

@Theodus Theodus force-pushed the theodus/rpc-fallback branch from 1938b9f to 3037507 Compare March 23, 2026 16:16
@Theodus Theodus requested a review from shiyasmohd March 23, 2026 16:21
@Theodus Theodus marked this pull request as ready for review March 23, 2026 16:21
@Theodus
Copy link
Member Author

Theodus commented Mar 23, 2026

@shiyasmohd, can I get another review on this since there has been significant changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants