OCPBUGS-81270: fix(external-dns): mitigate Azure DNS API throttling#8098
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe Azure External DNS provider config is updated to add an environment variable Sequence Diagram(s)✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/util/install.go (1)
57-57: Consider making the interval configurable or adding a comment explaining the rationale.This hardcodes
"3m"rather than passing through fromoptslike other external-dns settings (lines 54-56). While this aligns with the PR's goal to mitigate Azure DNS throttling in CI, it silently overrides the default"1m"interval for all e2e test installs.If this is intentional CI-specific behavior, a brief comment would clarify intent for future maintainers. Alternatively, expose it via
HyperShiftOperatorInstallOptionsfor flexibility.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/util/install.go` at line 57, installOpts.ExternalDNSInterval is hardcoded to "3m", silently overriding the default; either expose this as a configurable field on HyperShiftOperatorInstallOptions and pass opts.ExternalDNSInterval through when setting installOpts.ExternalDNSInterval (add the new field to HyperShiftOperatorInstallOptions and propagate it in the install setup), or if the 3m value is an intentional CI-only workaround add a concise comment next to the installOpts.ExternalDNSInterval assignment explaining the CI-throttling rationale and why it differs from the default so future maintainers understand the decision.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/e2e/util/install.go`:
- Line 57: installOpts.ExternalDNSInterval is hardcoded to "3m", silently
overriding the default; either expose this as a configurable field on
HyperShiftOperatorInstallOptions and pass opts.ExternalDNSInterval through when
setting installOpts.ExternalDNSInterval (add the new field to
HyperShiftOperatorInstallOptions and propagate it in the install setup), or if
the 3m value is an intentional CI-only workaround add a concise comment next to
the installOpts.ExternalDNSInterval assignment explaining the CI-throttling
rationale and why it differs from the default so future maintainers understand
the decision.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: ba8bebfd-6406-41ff-be93-8022ea9cd1d7
📒 Files selected for processing (3)
cmd/install/assets/hypershift_operator.gocmd/install/assets/hypershift_operator_test.gotest/e2e/util/install.go
|
@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aks |
Test Resultse2e-aws
e2e-aks
|
|
/retest |
1 similar comment
|
/retest |
Add Azure-specific throttling mitigations to the external-dns deployment to prevent DNS record creation failures caused by Azure DNS API rate limiting (HTTP 429). Changes: - Add --azure-zones-cache-duration=1h to cache DNS zone listings, preventing repeated listAllByDNSZoneNextResults API calls that trigger Azure throttling. This mirrors the existing --aws-zones-cache-duration=1h used for the AWS provider. - Set AZURE_SDK_MAX_RETRIES=5 environment variable to increase the Azure SDK retry count (default is 3), allowing transient 429 errors to be retried with backoff as documented in the upstream external-dns Azure tutorial. - Increase the external-dns polling interval to 3m for e2e tests, which create 7+ HostedClusters simultaneously and generate heavy DNS API load. Root cause analysis: In the periodic-ci AKS e2e job (build 2037565923602206720), all 7 HostedClusters failed with ExternalDNSReachable=False because the external-dns pod received a 429 Throttled response from Azure DNS: dns.RecordSetsClient#listAllByDNSZoneNextResults: StatusCode=429 Code="Throttled" Message="Too many operations are requested" The external-dns pod had zero successful sync cycles across its entire lifetime - the only log entries after startup were the 429 error. Without DNS records, the API server hostnames (*.aks-e2e.hypershift.azure.devcluster. openshift.com) never resolved, causing all clusters to fail validation with "no such host" errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
b81af5f to
f74873b
Compare
|
@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
1 similar comment
|
/retest |
|
/test e2e-azure-self-managed |
|
/retest-required |
|
/test e2e-azure-self-managed |
|
/test e2e-aks |
|
/test e2e-aks |
|
/lgtm |
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox, Nirshal The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by azure-e2es |
|
@bryan-cox: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@bryan-cox: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
2aa4e8a
into
openshift:main
|
@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-81270 Jira Issue OCPBUGS-81270 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in accepted release 4.22.0-0.nightly-2026-04-03-204456 |
What this PR does / why we need it:
Adds Azure-specific throttling mitigations to the external-dns deployment to prevent DNS record creation failures caused by Azure DNS API rate limiting (HTTP 429).
Root Cause
In the periodic AKS e2e job (build 2037565923602206720), all 7 HostedClusters failed with
ExternalDNSReachable=False. The external-dns pod logged a single error after startup:The external-dns pod had zero successful sync cycles across its entire lifetime — the only log entry after startup was the 429 error. Without DNS records being created, the API server hostnames (
*.aks-e2e.hypershift.azure.devcluster.openshift.com) never resolved, causing all clusters to fail validation withno such hosterrors.Changes
cmd/install/assets/hypershift_operator.go— AddAZURE_SDK_MAX_RETRIES=5env var to the Azure external-dns deployment. This increases the Azure SDK retry count from the default of 3, allowing transient 429 errors to be retried with exponential backoff. This is the upstream-recommended mitigation.test/e2e/util/install.go— SetExternalDNSInterval = "3m"for e2e test installs. The default1mcreates excessive API pressure when 7+ HostedClusters run simultaneously in CI. This only affects CI e2e tests, not production deployments.cmd/install/assets/hypershift_operator_test.go— Updated the Azure ExternalDNS test case to verify the newAZURE_SDK_MAX_RETRIES=5env var is set.Note on
--azure-zones-cache-durationThe upstream docs also recommend
--azure-zones-cache-durationto cache DNS zone listings, but this flag was added in a newer version of external-dns than the v0.13.x image currently shipped (registry.redhat.io/edo/external-dns-rhel8@sha256:638fb6b5..., tag 1.1.0-3). Adding this flag would crash the external-dns pod with an "unknown flag" error. This mitigation should be added when the external-dns image is upgraded.Which issue(s) this PR fixes:
Fixes periodic AKS e2e job failures caused by Azure DNS API throttling (all 15 test failures in build 2037565923602206720).
Special notes for your reviewer:
--aws-zones-cache-duration=1hand--aws-batch-change-interval=10s. This PR brings Azure closer to parity.AZURE_SDK_MAX_RETRIESenv var is documented upstream as the recommended way to handle Azure rate limiting.3mis deliberately only in the e2e install path (test/e2e/util/install.go), not in the default for all users.Checklist: