
Improve HTTP response handling and retry logic#520

Open
MichaelGHSeg wants to merge 12 commits into master from response-status-updates

Conversation

@MichaelGHSeg (Contributor) commented Feb 12, 2026

Summary

Implements improved HTTP response handling and retry logic to align with changes in analytics-java and analytics-next:

  • ✅ Add Authorization header (Basic auth with write key, OAuth Bearer token)
  • ✅ Add X-Retry-Count header on all requests (starts at 0)
  • ✅ Implement Retry-After header support (capped at 300s, doesn't count against retry budget)
  • ✅ Granular status code classification for retryable vs non-retryable errors
  • ✅ Replace backoff decorator with custom retry loop for better control
  • ✅ Exponential backoff with jitter (0.5s base, 60s cap)
  • ✅ Increase max retries from 10 to 1000 (to accommodate faster retry cadence)
  • ✅ Clear OAuth token on 511 Network Authentication Required
  • ✅ 413 Payload Too Large is non-retryable

Status Code Handling

Retryable 4xx: 408, 410, 429, 460
Non-retryable 4xx: 400, 401, 403, 404, 413, 422, and all other 4xx
Retryable 5xx: All except 501, 505
Non-retryable 5xx: 501, 505
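
The classification above can be sketched as a small predicate. The names here (is_retryable, RETRYABLE_4XX, NON_RETRYABLE_5XX) are illustrative, not the PR's actual identifiers:

```python
# Hypothetical sketch of the status-code classification described above.
RETRYABLE_4XX = {408, 410, 429, 460}
NON_RETRYABLE_5XX = {501, 505}

def is_retryable(status: int) -> bool:
    """True if a request that returned this HTTP status should be retried."""
    if 500 <= status < 600:
        return status not in NON_RETRYABLE_5XX  # retry all 5xx except 501/505
    if 400 <= status < 500:
        return status in RETRYABLE_4XX          # only these 4xx are transient
    return False                                # 2xx/3xx need no retry
```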

Retry-After Behavior

  • Respected only for status code 429 (408 and 503 use exponential backoff)
  • Retry-After attempts do NOT count against backoff retry budget
  • Capped at 300 seconds maximum delay
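
A minimal sketch of the parsing behavior described above, assuming a numeric Retry-After value and a dict-like headers object; the PR's parse_retry_after() in request.py may differ in signature and may also handle the HTTP-date form:

```python
def parse_retry_after(headers, cap=300):
    """Parse a numeric Retry-After header, capping the delay at `cap` seconds.

    Returns None when the header is absent or unparseable, so callers can
    distinguish "no header" from an explicit Retry-After: 0.
    """
    value = headers.get('Retry-After')
    if value is None:
        return None
    try:
        delay = float(value)
    except ValueError:
        return None  # HTTP-date form not handled in this sketch
    return min(max(delay, 0), cap)
```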

Test Coverage

  • Added 30 new comprehensive tests
  • Total test suite: 106 tests passing
  • Tests cover: Authorization headers, X-Retry-Count, Retry-After parsing, status code classification, backoff logic, network errors, retry budget separation

Changes

Modified Files

  • segment/analytics/consumer.py: Custom retry loop, status code classification, Retry-After support
  • segment/analytics/request.py: Authorization header, X-Retry-Count header, parse_retry_after()
  • segment/analytics/client.py: Increased default max_retries to 1000
  • segment/analytics/test/test_consumer.py: 13 new test cases
  • segment/analytics/test/test_request.py: 17 new test cases

Alignment

This implementation aligns with:

  • analytics-java commit c3f60c5 and 80cc33d
  • analytics-next commit 31e275d2

🤖 Generated with Claude Code

MichaelGHSeg and others added 2 commits February 11, 2026 19:20
- Add Authorization header with Basic auth (base64 encoded write key)
- Add X-Retry-Count header on all requests (starts at 0)
- Implement Retry-After header support (capped at 300s)
- Retry-After attempts don't count against backoff retry budget
- Add granular status code classification:
  - Retryable 4xx: 408, 410, 429, 460
  - Non-retryable 4xx: 400, 401, 403, 404, 413, 422
  - Retryable 5xx: all except 501, 505
  - Non-retryable 5xx: 501, 505
- Replace backoff decorator with custom retry loop
- Exponential backoff with jitter (0.5s base, 60s cap)
- Clear OAuth token on 511 Network Authentication Required
- 413 Payload Too Large is non-retryable
- Add 30 new comprehensive tests (106 total tests)

Aligns with analytics-java and analytics-next retry behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns with analytics-java change to accommodate shorter backoff
periods (0.5s base, 60s cap). With faster retries, a higher retry
limit allows for better resilience during extended outages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR improves HTTP response handling and retry logic in the analytics-python client, aligning with changes from analytics-java and analytics-next. The implementation replaces the backoff decorator with a custom retry loop that provides fine-grained control over retry behavior, distinguishing between Retry-After attempts and exponential backoff attempts.

Changes:

  • Replaced backoff decorator with custom retry loop implementing exponential backoff with jitter (0.5s base, 60s cap)
  • Added Authorization header support (Basic auth with write key and OAuth Bearer token)
  • Added X-Retry-Count header to track attempt numbers
  • Implemented Retry-After header support (capped at 300s, doesn't count against retry budget)
  • Granular status code classification distinguishing retryable from non-retryable errors
  • Increased default max_retries from 10 to 1000 to accommodate faster retry cadence
  • OAuth token clearing expanded to include 511 status code

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

  • segment/analytics/request.py: Added Authorization header (Basic/Bearer), X-Retry-Count header, parse_retry_after() function, and response object to APIError
  • segment/analytics/consumer.py: Replaced backoff decorator with custom retry loop implementing exponential backoff, Retry-After support, and granular status code classification
  • segment/analytics/client.py: Increased default max_retries from 10 to 1000
  • segment/analytics/test/test_request.py: Added 17 tests for authorization headers, X-Retry-Count, Retry-After parsing, and OAuth token clearing
  • segment/analytics/test/test_consumer.py: Added 13 tests for retry logic, status code classification, backoff behavior, and Retry-After support


Comment on lines 196 to 221
                # Calculate exponential backoff delay with jitter
                base_delay = 0.5 * (2 ** (backoff_attempts - 1))
                jitter = random.uniform(0, 0.1 * base_delay)
                delay = min(base_delay + jitter, 60)  # Cap at 60 seconds

                self.log.debug(
                    f"Retry attempt {backoff_attempts}/{self.retries} (total attempts: {total_attempts}) "
                    f"after {delay:.2f}s for status {e.status}"
                )
                time.sleep(delay)

            except Exception as e:
                # Network errors or other exceptions - retry with backoff
                total_attempts += 1
                backoff_attempts += 1

                if backoff_attempts >= max_backoff_attempts:
                    self.log.error(
                        f"All {self.retries} retries exhausted after {total_attempts} total attempts. Final error: {e}"
                    )
                    raise

                # Calculate exponential backoff delay with jitter
                base_delay = 0.5 * (2 ** (backoff_attempts - 1))
                jitter = random.uniform(0, 0.1 * base_delay)
                delay = min(base_delay + jitter, 60)  # Cap at 60 seconds

Copilot AI Feb 12, 2026


The exponential backoff calculation is duplicated between the APIError handler (lines 196-199) and the generic Exception handler (lines 218-221). Consider extracting this logic into a helper function to reduce code duplication and improve maintainability.
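
The duplicated calculation could be factored out roughly as follows; the helper name and defaults here are hypothetical, mirroring the 0.5s base, 10% jitter, and 60s cap shown in the quoted code:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Exponential backoff with jitter for the given 1-based attempt number.

    Delay is base * 2**(attempt - 1) plus up to 10% jitter, capped at `cap`.
    """
    base_delay = base * (2 ** (attempt - 1))
    jitter = random.uniform(0, 0.1 * base_delay)
    return min(base_delay + jitter, cap)
```

Both exception handlers could then call `time.sleep(backoff_delay(backoff_attempts))`, keeping the retry policy in one place.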

Comment on lines 172 to 179
                if should_use_retry_after(e.status) and e.response:
                    retry_after = parse_retry_after(e.response)
                    if retry_after:
                        self.log.debug(
                            f"Retry-After header present: waiting {retry_after}s (attempt {total_attempts})"
                        )
                        time.sleep(retry_after)
                        continue  # Does not count against backoff budget

Copilot AI Feb 12, 2026


The retry logic does not impose an upper bound on Retry-After attempts. If the server continuously responds with Retry-After headers (for status codes 408, 429, or 503), the client could retry indefinitely. Consider adding a maximum total attempt limit to prevent infinite retry loops, even when Retry-After is present. For example, you could add a check like if total_attempts >= some_large_limit: raise before processing Retry-After.

MichaelGHSeg and others added 6 commits February 12, 2026 18:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Extract duplicate exponential backoff calculation into helper function
- Add upper bound (max_total_attempts) to prevent infinite retry loops with Retry-After
- Improves code maintainability and prevents edge case of continuous Retry-After responses

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove 408/503 from Retry-After eligibility (only 429 uses Retry-After)
- Add rate-limit state to Consumer (rate_limited_until, rate_limit_start_time)
- 429 with Retry-After: set rate-limit state, raise to caller for requeue
- 429 without Retry-After: counted backoff (not pipeline blocking)
- Add maxTotalBackoffDuration / maxRateLimitDuration config (default 43200s)
- upload() checks rate-limit state before request(), enforces duration limit
- 511 OAuth gating: only retry when OauthManager is configured
- Add tests: T04, T17, T19, T20; update 429/408/503 behavior tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Handle Retry-After: 0 correctly by checking 'is not None' instead of
truthiness. Prevent silent batch re-queue on 429 without Retry-After
by gating the upload() re-queue path on rate_limited_until being set.
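
A minimal illustration of the truthiness bug this commit fixes: an explicit `Retry-After: 0` parses to 0, which is falsy, so a bare `if retry_after:` treats it the same as a missing header.

```python
# Server sent "Retry-After: 0", meaning: retry immediately.
retry_after = 0

if retry_after:                  # 0 is falsy, so this branch is skipped
    honored_by_truthiness = True
else:
    honored_by_truthiness = False

# The explicit check distinguishes "header absent" (None) from "delay of 0".
honored_by_is_not_none = retry_after is not None
```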

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

segment/analytics/consumer.py:158

  • When a 429 with Retry-After is received (lines 136-144), the batch items are re-queued using queue.put(), but then in the finally block (lines 156-158), all items in the batch are marked as task_done(). This means the items that were re-queued are still marked as done, which could cause queue.join() to return prematurely before those re-queued items are actually processed. Consider either not calling task_done() for re-queued items, or handling the re-queueing differently.
            if e.status == 429 and self.rate_limited_until is not None:
                # 429: rate-limit state already set by request(). Re-queue batch.
                self.log.debug('429 received. Re-queuing batch and halting upload iteration.')
                for item in batch:
                    try:
                        self.queue.put(item, block=False)
                    except Exception:
                        pass  # Queue full, item lost
                success = False
            else:
                self.log.error('error uploading: %s', e)
                success = False
                if self.on_error:
                    self.on_error(e, batch)
        except Exception as e:
            self.log.error('error uploading: %s', e)
            success = False
            if self.on_error:
                self.on_error(e, batch)
        finally:
            # mark items as acknowledged from queue
            for _ in batch:
                self.queue.task_done()


    log.debug('received response: %s', payload)
-   raise APIError(res.status_code, payload['code'], payload['message'])
+   raise APIError(res.status_code, payload['code'], payload['message'], res)
    except ValueError:

Copilot AI Feb 25, 2026


The exception handler catches ValueError for when res.json() fails to parse JSON, but it doesn't catch KeyError that would be raised if the parsed JSON doesn't contain 'code' or 'message' keys (line 102). Consider catching both ValueError and KeyError to handle malformed JSON responses more gracefully.

Suggested change
except ValueError:
except (ValueError, KeyError):

Comment on lines +387 to +417
    def test_408_and_503_use_backoff(self):
        """Test that 408 and 503 use exponential backoff"""
        track = {'type': 'track', 'event': 'python event', 'userId': 'userId'}

        for status_code in [408, 503]:
            consumer = Consumer(None, 'testsecret', retries=2)
            call_count = 0
            sleep_durations = []

            def mock_post_fn(*args, **kwargs):
                nonlocal call_count
                call_count += 1
                if call_count == 1:
                    response = mock.Mock()
                    response.headers = {'Retry-After': '5'}
                    error = APIError(status_code, 'error', 'Error')
                    error.response = response
                    raise error
                return mock.Mock(status_code=200)

            def mock_sleep(duration):
                sleep_durations.append(duration)

            with mock.patch('segment.analytics.consumer.post', side_effect=mock_post_fn):
                with mock.patch('time.sleep', side_effect=mock_sleep):
                    consumer.request([track])

            # Should use backoff delay (0 for first retry), NOT the Retry-After value of 5
            self.assertEqual(call_count, 2)
            self.assertEqual(len(sleep_durations), 1)
            self.assertEqual(sleep_durations[0], 0, f'{status_code} should use backoff, not Retry-After')

Copilot AI Feb 25, 2026


The PR description states "Retry-After Behavior - Respected for status codes: 408, 429, 503", but the implementation only respects Retry-After for 429. Status codes 408 and 503 always use exponential backoff, ignoring any Retry-After header. This test confirms this behavior is intentional. Consider updating the PR description to accurately reflect that only 429 respects Retry-After headers.

MichaelGHSeg and others added 4 commits February 25, 2026 16:39
- Add KeyError to except clause when parsing JSON response to handle
  missing 'code' or 'message' keys
- Add explanatory comment on pre-existing except-pass pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
requests.Response.__bool__() returns False for non-2xx status codes.
The checks `if e.response` and `if response` evaluated to False for
429 responses, so parse_retry_after() was never called and the SDK
fell back to normal backoff instead of respecting Retry-After.

Changed both checks to explicit `is not None` comparisons.
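
A stand-in class illustrating the gotcha: requests.Response defines __bool__ to return its ok property, which is False for any status >= 400, so an error response is falsy even though it exists and carries headers.

```python
# Minimal illustration; FakeResponse is a stand-in, not the requests class.
class FakeResponse:
    def __init__(self, status_code: int):
        self.status_code = status_code

    def __bool__(self) -> bool:
        # Mirrors requests.Response.ok: truthy only for status < 400.
        return self.status_code < 400

resp = FakeResponse(429)
assert not resp           # `if e.response:` skips the Retry-After path
assert resp is not None   # the explicit check still sees the response
```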

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add on_error handler to capture delivery failures from the SDK.
Reports success=false with the first error message when any batch
fails (non-retryable error or retries exhausted).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>