…java into release/3.5.4
* Changing release plugin
* Deploy working
* Rolling back to snapshot
* [maven-release-plugin] prepare release analytics-parent-3.5.4
* [maven-release-plugin] prepare for next development iteration
* Putting back release instructions
…java into release/3.5.4
…to response-status-updates
- Add Authorization header with Basic auth to all requests
- Reduce retry backoff times for faster recovery (100ms base, 1min cap)
- Respect Retry-After header with 5-minute max cap
- Always include X-Retry-Count header in requests
- Consolidate SegmentService upload methods
- Increase max flush attempts to accommodate shorter backoff times

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove 413 (Payload Too Large) from retryable status codes: payload size won't shrink on retry, so retrying is pointless
- Verify 503 Retry-After handling matches spec: 503 checks the Retry-After header first, then falls back to exponential backoff
- Add test for 413 non-retryable behavior
- Add test for Retry-After delay mechanism (validates cap logic); the 300-second cap is enforced at AnalyticsClient.java:556-558

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix import ordering (anyInt moved to correct position)
- Split long lines to respect line length limits
- Remove extra blank line

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This pull request implements comprehensive improvements to the response status handling and retry mechanism for the Segment Analytics client. The changes introduce Retry-After header support, update the retry logic to distinguish between server-directed and backoff-based retries, add an X-Retry-Count header to all requests, and move authentication from URL to Authorization header using Basic auth.
Changes:
- Implemented Retry-After header parsing and handling for status codes 429, 408, and 503, with retries based on server-specified delays not counting against the client's retry budget
- Added X-Retry-Count header to all upload requests to communicate retry attempt number (starting at 0)
- Moved authentication from URL-based to Authorization header using HTTP Basic auth with the write key
- Expanded retryable status codes to include specific 4xx errors (408, 410, 429, 460) and refined 5xx retry logic to exclude 501, 505, and 511
- Adjusted backoff configuration (100ms base, 1min cap) and increased default maximumFlushAttempts from 3 to 1000
- Added comprehensive test coverage for new retry scenarios
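The backoff adjustment described above (100ms base, 1min cap) can be sketched as a simple capped exponential. The class and method names here are hypothetical, and the jitter-free form is an assumption for illustration, not the client's actual implementation:

```java
import java.util.concurrent.TimeUnit;

public class BackoffSketch {
    static final long BASE_MS = 100;                          // 100 ms base
    static final long CAP_MS = TimeUnit.MINUTES.toMillis(1);  // 1 min cap

    // Capped exponential backoff: base * 2^attempt, never above the cap.
    // The shift is clamped to avoid long overflow for large attempt counts.
    static long backoffMillis(int attempt) {
        long delay = BASE_MS * (1L << Math.min(attempt, 30));
        return Math.min(delay, CAP_MS);
    }

    public static void main(String[] args) {
        System.out.println(backoffMillis(0));  // 100
        System.out.println(backoffMillis(9));  // 51200
        System.out.println(backoffMillis(10)); // 60000 (capped)
    }
}
```

With these parameters the cap is reached after roughly ten attempts, which is why the attempt budget matters more than it did under the old 15s/1h configuration.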
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| analytics-core/src/main/java/com/segment/analytics/http/SegmentService.java | Added X-Retry-Count header parameter to upload method signature |
| analytics/src/main/java/com/segment/analytics/AnalyticsRequestInterceptor.java | Added writeKey parameter and Authorization header construction using Basic auth |
| analytics/src/main/java/com/segment/analytics/Analytics.java | Updated interceptor instantiation with writeKey and increased default maximumFlushAttempts to 1000 |
| analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java | Implemented retry strategy enum, Retry-After header parsing, updated backoff configuration, refined status code retry logic, and separated tracking of total attempts vs backoff attempts |
| analytics/src/test/java/com/segment/analytics/internal/AnalyticsClientTest.java | Updated all mock calls to include retry count parameter and added 7 new test cases covering retryable client errors, non-retryable 5xx errors, Retry-After handling, X-Retry-Count header verification, and 413 Payload Too Large behavior |
| analytics/src/test/java/com/segment/analytics/AnalyticsRequestInterceptorTest.java | Added assertion to verify Authorization header is correctly set with Basic auth credentials |
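For reference, the Basic-auth header construction the interceptor rows describe can be sketched as follows. The class and method names are hypothetical, and the empty-password convention (write key as username) is an assumption taken from the PR description:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthSketch {
    // Builds "Basic <base64(writeKey + ':')>"; assumes the write key is the
    // username and the password is empty, per the PR description.
    static String authorizationHeader(String writeKey) {
        String credentials = writeKey + ":";
        String encoded =
            Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        System.out.println(authorizationHeader("key")); // Basic a2V5Og==
    }
}
```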
analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java
```java
// Adjusted upward to accomodate shorter max retry backoff.
maximumFlushAttempts = 1000;
```
The default value increase from 3 to 1000 for maximumFlushAttempts seems excessive and potentially problematic. This could lead to: 1) significantly longer delays before giving up on failed requests, 2) increased resource consumption for repeatedly failing requests, and 3) operational issues if the service is down. While the comment says this is to "accommodate shorter max retry backoff", the backoff only changed from a 15s base/1h cap to a 100ms base/1min cap; retries will happen faster, but 1000 attempts is still unreasonably high. Consider a more balanced value that accounts for the faster backoff without creating an effectively unbounded retry scenario (e.g., 10-50 attempts).
Suggested change:

```diff
- // Adjusted upward to accomodate shorter max retry backoff.
- maximumFlushAttempts = 1000;
+ // Bounded default to accommodate shorter max retry backoff without excessive retries.
+ maximumFlushAttempts = 50;
```
```java
private static Long parseRetryAfterSeconds(String headerValue) {
  if (headerValue == null) {
    return null;
  }
  headerValue = headerValue.trim();
  if (headerValue.isEmpty()) {
    return null;
  }
  try {
    long seconds = Long.parseLong(headerValue);
    if (seconds <= 0L) {
      return null;
    }
    if (seconds > MAX_RETRY_AFTER_SECONDS) {
      return MAX_RETRY_AFTER_SECONDS;
    }
    return seconds;
  } catch (NumberFormatException ignored) {
    return null;
  }
```
The parseRetryAfterSeconds method handles several edge cases (null, empty, negative, zero, over cap) but lacks explicit test coverage for these scenarios. While the happy path is tested in retryAfterHeaderRespectsShortDelay, consider adding tests for edge cases such as: invalid numeric format, negative values, zero, values exceeding MAX_RETRY_AFTER_SECONDS, empty strings, and whitespace-only strings to ensure the method is robust.
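A self-contained sketch of those edge cases, using a copy of the parsing logic shown above. The 300-second MAX_RETRY_AFTER_SECONDS value is an assumption based on the PR description's 5-minute cap, and the class name is hypothetical:

```java
public class RetryAfterParseSketch {
    static final long MAX_RETRY_AFTER_SECONDS = 300; // assumed 5-minute cap

    // Copy of the parsing logic under review, condensed for the sketch.
    static Long parseRetryAfterSeconds(String headerValue) {
        if (headerValue == null) return null;
        headerValue = headerValue.trim();
        if (headerValue.isEmpty()) return null;
        try {
            long seconds = Long.parseLong(headerValue);
            if (seconds <= 0L) return null;
            return Math.min(seconds, MAX_RETRY_AFTER_SECONDS);
        } catch (NumberFormatException ignored) {
            return null;
        }
    }

    public static void main(String[] args) {
        assert parseRetryAfterSeconds(null) == null;       // null header
        assert parseRetryAfterSeconds("   ") == null;      // whitespace-only
        assert parseRetryAfterSeconds("abc") == null;      // invalid format
        assert parseRetryAfterSeconds("-5") == null;       // negative
        assert parseRetryAfterSeconds("0") == null;        // zero
        assert parseRetryAfterSeconds("10") == 10L;        // happy path
        assert parseRetryAfterSeconds("9999") == 300L;     // clamped to cap
    }
}
```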
```java
if (status == 408 || status == 410 || status == 429 || status == 460) {
  return true;
```
Missing test coverage for HTTP status codes 410 and 460, which are explicitly marked as retryable in the isStatusRetryWithBackoff method. Add tests to verify that these status codes are indeed retried with backoff as intended.
```java
private static boolean isStatusRetryAfterEligible(int status) {
  return status == 429 || status == 408 || status == 503;
```
Missing test coverage for HTTP status code 503 (Service Unavailable), which is both Retry-After eligible and retryable with backoff. Add tests to verify: 1) 503 with a valid Retry-After header uses the RETRY_AFTER strategy, and 2) 503 without a Retry-After header falls back to the BACKOFF strategy.
```java
if (status >= 500 && status < 600) {
  return status != 501 && status != 505 && status != 511;
```
Missing test coverage for HTTP status codes 505 (HTTP Version Not Supported) and 511 (Network Authentication Required), which are explicitly excluded from retry logic in the isStatusRetryWithBackoff method. While 501 is tested, add tests for 505 and 511 to ensure these status codes are not retried as intended.
```java
return status >= 500 && status < 600;

private static boolean isStatusRetryWithBackoff(int status) {
  // Explicitly retry these client errors
  if (status == 408 || status == 410 || status == 429 || status == 460) {
```
HTTP status 410 (Gone) is being treated as retryable, but this status code typically indicates that a resource has been permanently removed and will never be available again. Retrying such requests is unlikely to succeed and wastes resources. Consider removing 410 from the list of retryable client errors unless there's a specific reason the Segment API uses 410 in a non-standard way to indicate a temporary condition.
Suggested change:

```diff
- if (status == 408 || status == 410 || status == 429 || status == 460) {
+ if (status == 408 || status == 429 || status == 460) {
```
```java
private static Long parseRetryAfterSeconds(String headerValue) {
  if (headerValue == null) {
    return null;
  }
  headerValue = headerValue.trim();
  if (headerValue.isEmpty()) {
    return null;
  }
  try {
    long seconds = Long.parseLong(headerValue);
    if (seconds <= 0L) {
      return null;
    }
    if (seconds > MAX_RETRY_AFTER_SECONDS) {
      return MAX_RETRY_AFTER_SECONDS;
    }
    return seconds;
  } catch (NumberFormatException ignored) {
    return null;
  }
```
The Retry-After header parsing only handles numeric (delay-seconds) format but not HTTP-date format. According to RFC 7231, the Retry-After header can be either a delay in seconds or an HTTP-date. While delay-seconds is more common, some servers may use HTTP-date format. Consider adding support for HTTP-date parsing to be fully RFC-compliant, or document that only delay-seconds format is supported.
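If HTTP-date support were added, a minimal fallback could look like the following. The method name is hypothetical, and the clock is passed in for testability rather than read from the system:

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class RetryAfterDateSketch {
    // Fallback for the HTTP-date form of Retry-After (RFC 7231 section 7.1.3).
    // Returns a delay in whole seconds relative to 'now', or null if the value
    // is unparseable or already in the past.
    static Long parseRetryAfterDate(String headerValue, ZonedDateTime now) {
        try {
            ZonedDateTime when =
                ZonedDateTime.parse(headerValue, DateTimeFormatter.RFC_1123_DATE_TIME);
            long seconds = Duration.between(now, when).getSeconds();
            return seconds > 0 ? seconds : null;
        } catch (DateTimeParseException ignored) {
            return null;
        }
    }
}
```

The caller would try numeric parsing first and fall back to this only when Long.parseLong fails, so the common delay-seconds path stays cheap.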
```diff
  if (isStatusRetryWithBackoff(status)) {
    client.log.print(
-       DEBUG, "Could not upload batch %s due to rate limiting. Retrying.", batch.sequence());
-   return true;
+       DEBUG,
+       "Could not upload batch %s due to retryable status %s. Retrying with backoff.",
+       batch.sequence(),
+       status);
+   return new UploadResult(RetryStrategy.BACKOFF);
```
Missing test coverage for the fallback behavior when a status code is Retry-After eligible (429, 408, 503) but the Retry-After header is missing or invalid. In this case, the code should fall back to the BACKOFF strategy. Add a test to verify this fallback behavior works correctly, ensuring that requests with invalid Retry-After headers still get retried using exponential backoff instead of failing immediately.
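That fallback path can be exercised in isolation with a simplified model of the strategy selection. The enum and method below are hypothetical stand-ins built from the status-code rules quoted in this review, not the client's actual internals:

```java
public class RetryStrategySketch {
    enum RetryStrategy { RETRY_AFTER, BACKOFF, NONE }

    // Mirrors the described flow: a Retry-After-eligible status with a valid
    // parsed delay uses RETRY_AFTER; otherwise a retryable status falls back
    // to BACKOFF; everything else is not retried.
    static RetryStrategy chooseStrategy(int status, Long retryAfterSeconds) {
        boolean eligible = status == 429 || status == 408 || status == 503;
        if (eligible && retryAfterSeconds != null) {
            return RetryStrategy.RETRY_AFTER;
        }
        boolean retryable =
            status == 408 || status == 410 || status == 429 || status == 460
                || (status >= 500 && status < 600
                    && status != 501 && status != 505 && status != 511);
        return retryable ? RetryStrategy.BACKOFF : RetryStrategy.NONE;
    }
}
```

Under this model, a 429 whose Retry-After header failed to parse (retryAfterSeconds == null) lands in BACKOFF rather than being dropped, which is the behavior the missing test should pin down.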
…csClient.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove 408/503 from Retry-After eligibility (only 429 uses Retry-After)
- Add 429 pipeline blocking: shared rate-limit state on AnalyticsClient (rateLimited, rateLimitWaitUntil, rateLimitStartTime), Looper checks before submitting BatchUploadTasks
- Add maxTotalBackoffDuration / maxRateLimitDuration config with 12h defaults, exposed via Analytics.Builder
- Restructure BatchUploadTask.run() retry loop: RATE_LIMITED strategy sets shared state and sleeps, BACKOFF strategy tracks firstFailureTime for duration guard, success clears rate-limit state
- Add tests: 408/503 use BACKOFF, 429 sets/clears rate-limit state, maxTotalBackoffDuration drops batch, maxRateLimitDuration drops batch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the Looper defers batch submission due to rate limiting and the trigger was a batch-size overflow, the overflow message was being lost because it had already been consumed from the queue but not added to the batch. Now re-offers the overflow message back to the queue so it can be retried on the next flush cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.
```java
void setRateLimitState(long retryAfterSeconds) {
  long now = System.currentTimeMillis();
  if (rateLimitStartTime == 0) {
    rateLimitStartTime = now;
  }
  rateLimitWaitUntil = now + (retryAfterSeconds * 1000);
  rateLimited = true;
}
```
There's a race condition in the setRateLimitState method. Multiple threads could call this method concurrently (from different BatchUploadTask instances), and the compound operation of checking rateLimitStartTime and setting it is not atomic. This could result in rateLimitStartTime being set incorrectly or multiple times, leading to inaccurate maxRateLimitDuration checks.
The check-then-act pattern at lines 234-236 should be synchronized or use atomic operations to ensure thread safety.
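One way to make the check-then-act atomic is to synchronize the state methods. This is a minimal sketch with the clock injected for testability; the field names come from the diff above, but the class shape and getter are hypothetical:

```java
public class RateLimitState {
    private long rateLimitStartTime;
    private long rateLimitWaitUntil;
    private boolean rateLimited;

    // synchronized makes the rateLimitStartTime check-then-act atomic across
    // concurrent BatchUploadTask callers.
    synchronized void setRateLimitState(long retryAfterSeconds, long nowMillis) {
        if (rateLimitStartTime == 0) {
            rateLimitStartTime = nowMillis;
        }
        rateLimitWaitUntil = nowMillis + retryAfterSeconds * 1000;
        rateLimited = true;
    }

    synchronized void clearRateLimitState() {
        rateLimited = false;
        rateLimitStartTime = 0;
        rateLimitWaitUntil = 0;
    }

    synchronized long startTime() {
        return rateLimitStartTime;
    }
}
```

Synchronizing on the instance is enough here because every read and write of the three fields goes through these methods; AtomicLong fields alone would not fix the compound check-then-set.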
```java
private static boolean is5xx(int status) {
  return status >= 500 && status < 600;

private static boolean isStatusRetryWithBackoff(int status) {
  // Explicitly retry these client errors
```
Status code 429 is checked in both isStatusRetryAfterEligible (line 594) and isStatusRetryWithBackoff (line 706). Since isStatusRetryAfterEligible is checked first in the upload method (line 543), any 429 response will always be handled by the RATE_LIMITED strategy before isStatusRetryWithBackoff is ever called. This makes the inclusion of 429 in isStatusRetryWithBackoff redundant, though it may serve as defensive programming.
Consider removing 429 from isStatusRetryWithBackoff to avoid confusion, or add a comment explaining it's defensive.
Suggested change:

```diff
- // Explicitly retry these client errors
+ // Explicitly retry these client errors.
+ // Note: 429 is also handled earlier via isStatusRetryAfterEligible (RATE_LIMITED strategy).
+ // Including 429 here is defensive, so that if call-site ordering changes it will still be retried.
```
```java
// Explicitly retry these client errors
if (status == 408 || status == 410 || status == 429 || status == 460) {
```
HTTP status code 410 (Gone) is marked as retryable in the isStatusRetryWithBackoff method. However, 410 typically indicates that a resource is permanently gone and will not come back, making it semantically different from transient errors. Retrying a 410 is unlikely to succeed and may waste resources.
Consider removing 410 from the retryable status codes unless there's a specific reason why the Segment API uses 410 for transient errors. If 410 is intentionally retryable for this API, add a comment explaining why.
Suggested change:

```diff
- // Explicitly retry these client errors
- if (status == 408 || status == 410 || status == 429 || status == 460) {
+ // Explicitly retry these client errors (transient or rate-limited)
+ if (status == 408 || status == 429 || status == 460) {
```
```java
// Check maxRateLimitDuration
if (client.rateLimitStartTime > 0
    && System.currentTimeMillis() - client.rateLimitStartTime
        > client.maxRateLimitDurationMs) {
  client.clearRateLimitState();
  break;
}
```
The maxRateLimitDuration check at lines 640-644 happens AFTER setRateLimitState is called (line 637). This means that if maxRateLimitDuration is very short (e.g., 1ms as tested), and the first 429 response is received, the rate limit state is set, then immediately the check might pass and the batch is dropped. However, the batch hasn't even attempted to wait for the Retry-After period yet (that happens at line 650).
Consider moving the maxRateLimitDuration check to AFTER the sleep, or restructure the logic so that the batch gets at least one chance to retry after the Retry-After period expires.
```java
// Adjusted upward to accommodate shorter max retry backoff.
maximumFlushAttempts = 1000;
```
The default maximumFlushAttempts has been increased from 3 to 1000. While the comment mentions this is to "accommodate shorter max retry backoff," this is a dramatic increase (over 300x) that significantly changes the retry behavior. Combined with the new maxTotalBackoffDurationMs limit (12 hours by default), this could result in many more retry attempts within that window.
Consider whether 1000 is the appropriate value, or if a more moderate increase would be sufficient. Also, ensure this change is documented in release notes as it significantly affects operational behavior.
Suggested change:

```diff
- // Adjusted upward to accommodate shorter max retry backoff.
- maximumFlushAttempts = 1000;
+ // Increased from 3 to accommodate shorter max retry backoff while keeping retries bounded.
+ maximumFlushAttempts = 10;
```
```java
boolean isRateLimited() {
  if (!rateLimited) return false;
  if (System.currentTimeMillis() >= rateLimitWaitUntil) {
    rateLimited = false;
    return false;
  }
  return true;
}
```
There's a race condition in the isRateLimited method. The compound operation at lines 249-251 (checking the time and clearing the rateLimited flag) is not atomic. Multiple threads calling this method concurrently could result in inconsistent state where some threads see the rate limit as active while others see it as inactive, even though they're checking at the same time.
This should be synchronized or use atomic operations to ensure thread safety.
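A synchronized variant that makes the expiry check and flag reset one atomic step might look like this. It is a sketch with an injected clock, and the setter is a hypothetical helper, not the client's actual API:

```java
public class RateLimitCheck {
    private boolean rateLimited;
    private long rateLimitWaitUntil;

    // Hypothetical setter used to drive the sketch.
    synchronized void rateLimitUntil(long untilMillis) {
        rateLimited = true;
        rateLimitWaitUntil = untilMillis;
    }

    // synchronized so the expiry comparison and the flag reset happen as a
    // single atomic step; all threads observe a consistent answer.
    synchronized boolean isRateLimited(long nowMillis) {
        if (!rateLimited) return false;
        if (nowMillis >= rateLimitWaitUntil) {
            rateLimited = false;
            return false;
        }
        return true;
    }
}
```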
```java
private static boolean is5xx(int status) {
  return status >= 500 && status < 600;

private static boolean isStatusRetryWithBackoff(int status) {
  // Explicitly retry these client errors
```
The status code 460 is checked in isStatusRetryWithBackoff at line 706. However, HTTP status code 460 is not a standard HTTP status code (it's a non-standard code used by some specific services). This appears to be intentional for compatibility with specific backend behavior, but it would be helpful to add a comment explaining why this non-standard status code is included in the retry logic.
Suggested change:

```diff
- // Explicitly retry these client errors
+ // Explicitly retry these client errors.
+ // Note: 460 is a non-standard status code used by certain backends to indicate
+ // transient, retryable failures; it is intentionally included here for compatibility.
```
```java
public Builder maxTotalBackoffDuration(long duration, TimeUnit unit) {
  long seconds = unit.toSeconds(duration);
  if (seconds < 1) {
    throw new IllegalArgumentException("maxTotalBackoffDuration must be at least 1 second.");
  }
  this.maxTotalBackoffDurationMs = unit.toMillis(duration);
  return this;
}
```
The validation converts the duration to seconds but stores the value in milliseconds, and the truncation in unit.toSeconds makes the boundary inconsistent: 999 milliseconds rounds down to 0 seconds and is rejected, while 1999 milliseconds rounds down to 1 second and is accepted, even though both fall between 1 and 2 seconds.
Consider validating in milliseconds instead (if (unit.toMillis(duration) < 1000)) so the boundary is exact, or document that the minimum is 1 second (not 1 millisecond).
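The millisecond-based validation could be sketched as follows; the class and method names are hypothetical:

```java
import java.util.concurrent.TimeUnit;

public class DurationValidationSketch {
    // Validates in milliseconds so the 1-second minimum is exact:
    // 1000 ms passes, 999 ms fails, regardless of the TimeUnit the caller used.
    static long toBoundedMillis(long duration, TimeUnit unit) {
        long millis = unit.toMillis(duration);
        if (millis < 1000) {
            throw new IllegalArgumentException("duration must be at least 1 second.");
        }
        return millis;
    }
}
```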
```java
public Builder maxRateLimitDuration(long duration, TimeUnit unit) {
  long seconds = unit.toSeconds(duration);
  if (seconds < 1) {
    throw new IllegalArgumentException("maxRateLimitDuration must be at least 1 second.");
  }
  this.maxRateLimitDurationMs = unit.toMillis(duration);
  return this;
}
```
The validation converts the duration to seconds but stores the value in milliseconds, and the truncation in unit.toSeconds makes the boundary inconsistent: 999 milliseconds rounds down to 0 seconds and is rejected, while 1999 milliseconds rounds down to 1 second and is accepted, even though both fall between 1 and 2 seconds.
Consider validating in milliseconds instead (if (unit.toMillis(duration) < 1000)) so the boundary is exact, or document that the minimum is 1 second (not 1 millisecond).
```java
if (isRateLimited() && message != StopMessage.STOP) {
  log.print(DEBUG, "Rate-limited. Deferring batch submission.");
  // Don't clear messages — they'll be picked up on the next flush trigger
  if (batchSizeLimitReached) {
    // Preserve overflow message while deferring submission due to rate limiting.
    // This message was consumed from the queue but not added to the current batch.
    if (!messageQueue.offer(message)) {
      log.print(
          ERROR,
          "Failed to preserve overflow message while rate-limited; message may be dropped.");
    }
    batchSizeLimitReached = false;
  }
```
When rate-limited and the batch is deferred in the Looper (lines 407-419), the messages remain in the LinkedList but are never submitted if rate-limiting persists indefinitely. While the Looper thread will eventually process these on the next flush trigger or shutdown, if the rate-limit state lasts a very long time, these messages accumulate in memory.
Consider adding monitoring or logging to track how long messages remain in the deferred state, especially if rate-limiting persists for extended periods.
- Add synchronized to rate-limit state methods to fix check-then-act race condition
- Add comments explaining defensive 429 and non-standard 460 in isStatusRetryWithBackoff
- Fix outdated test comments and line references
- Widen timing assertion upper bound (3s → 5s) to reduce CI flakiness

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>