fix(datasets-raw): replace random provider selection with attempt-based rotation#1995
fix(datasets-raw): replace random provider selection with attempt-based rotation#1995
Conversation
shiyasmohd
left a comment
There was a problem hiding this comment.
LGTM. Added some suggestions as comments.
My only concern is, this approach forces us to have a fallback provider.
Eg:- If only one provider is configured, and if it fails, the job process fails instantly without retry (correct me if i'm wrong). This job will the retry only when the scheduler retries the next time. My suggestion would be to add 3 or 5 retries with fallback. Leaving this decision to you. This could be added in a follow up PR or this PR itself.
Replace random provider selection with deterministic ordering and add BlockStreamerWithFallback to try each provider once on recoverable errors. - Add `BlockStreamerWithFallback` that iterates providers in order, falling back to the next on recoverable errors and aborting on fatal - Add `find_providers` and `create_block_stream_clients` to `ProvidersRegistry` returning all matching providers in name order - Add `AllClientCreationsFailed` error variant to distinguish from `ProviderNotFound` when providers exist but all fail to connect - Replace random shuffle with deterministic lexicographic selection - Remove `rand` dependency from `providers-registry`
85e4bfc to
3cd3996
Compare
…ed rotation Replace random provider shuffling with deterministic provider selection based on job attempt count. On each retry, the job selects the next provider in lexicographic order, cycling through all available providers. - Add find_providers method returning all matching providers in deterministic order - Select provider in job_impl using attempt_count % num_providers - Remove rand dependency
3cd3996 to
1a4dcd1
Compare
There was a problem hiding this comment.
Code Review Summary
Most important finding: Likely off-by-one bug in provider rotation. get_attempt_count returns 1 on the first job execution (it counts SCHEDULED events), so 1 % num_providers skips provider index 0 on the initial attempt. The first provider in lexicographic order is only used when attempt is a multiple of num_providers.
Other findings:
- Silent error swallowing when
get_attempt_countDB query fails (no logging) - Silent skip of unparseable provider configs in
find_providers(no warning logged, unlike env substitution failures) - Doc/implementation mismatch in
provider-registry.md— describes single-provider selection butfind_providersreturns a list for callers to select from - The ordering guarantee of
find_providersdepends onBTreeMapiteration order — worth documenting this dependency - eth_call always picks the first provider while block streaming rotates — potential provider affinity concern noted via TODO
1938b9f to
3037507
Compare
|
@shiyasmohd, can I get another review on this since there has been significant changes? |
Replace random provider shuffling with deterministic provider selection based on job attempt count. On each retry, the job selects the next provider in lexicographic order, cycling through all available providers.