OCPBUGS-81270: fix(external-dns): mitigate Azure DNS API throttling by bryan-cox · Pull Request #8098 · openshift/hypershift

bryan-cox · 2026-03-27T19:21:29Z

What this PR does / why we need it:

Adds Azure-specific throttling mitigations to the external-dns deployment to prevent DNS record creation failures caused by Azure DNS API rate limiting (HTTP 429).

Root Cause

In the periodic AKS e2e job (build 2037565923602206720), all 7 HostedClusters failed with ExternalDNSReachable=False. The external-dns pod logged a single error after startup:

dns.RecordSetsClient#listAllByDNSZoneNextResults: Failure responding to next results request:
StatusCode=429 -- Original Error: autorest/azure: Service returned an error.
Status=429 Code="Throttled" Message="Too many operations are requested. Current operation is throttled."

The external-dns pod had zero successful sync cycles across its entire lifetime; the only log entry after startup was the 429 error. Because no DNS records were created, the API server hostnames (*.aks-e2e.hypershift.azure.devcluster.openshift.com) never resolved, so every cluster failed validation with "no such host" errors.

Changes

cmd/install/assets/hypershift_operator.go — Add AZURE_SDK_MAX_RETRIES=5 env var to the Azure external-dns deployment. This increases the Azure SDK retry count from the default of 3, allowing transient 429 errors to be retried with exponential backoff. This is the upstream-recommended mitigation.
test/e2e/util/install.go — Set ExternalDNSInterval = "3m" for e2e test installs. The default 1m creates excessive API pressure when 7+ HostedClusters run simultaneously in CI. This only affects CI e2e tests, not production deployments.
cmd/install/assets/hypershift_operator_test.go — Updated the Azure ExternalDNS test case to verify the new AZURE_SDK_MAX_RETRIES=5 env var is set.
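
As a rough illustration of the first change, the sketch below sets the env var on an external-dns container only when the provider is Azure. It is not the code from hypershift_operator.go: the `EnvVar` and `Container` types are simplified stand-ins for the Kubernetes `corev1` types, and `applyAzureThrottlingMitigations` is a hypothetical helper name chosen for this example.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for corev1.EnvVar and
// corev1.Container so that this sketch runs without external modules.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Args []string
	Env  []EnvVar
}

// applyAzureThrottlingMitigations is a hypothetical helper. For the Azure
// provider it adds AZURE_SDK_MAX_RETRIES=5 unless the variable is already set.
func applyAzureThrottlingMitigations(c *Container, provider string) {
	if provider != "azure" {
		return
	}
	for _, e := range c.Env {
		if e.Name == "AZURE_SDK_MAX_RETRIES" {
			return // respect an existing value
		}
	}
	c.Env = append(c.Env, EnvVar{Name: "AZURE_SDK_MAX_RETRIES", Value: "5"})
}

func main() {
	c := Container{
		Name: "external-dns",
		Args: []string{"--provider=azure", "--interval=1m"},
	}
	applyAzureThrottlingMitigations(&c, "azure")
	fmt.Println(c.Env)
}
```

A unit test in the same spirit as the updated Azure test case would assert that the variable is present for Azure and absent for other providers.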

Note on `--azure-zones-cache-duration`

The upstream docs also recommend --azure-zones-cache-duration to cache DNS zone listings, but this flag was added in a newer version of external-dns than the v0.13.x image currently shipped (registry.redhat.io/edo/external-dns-rhel8@sha256:638fb6b5..., tag 1.1.0-3). Adding this flag would crash the external-dns pod with an "unknown flag" error. This mitigation should be added when the external-dns image is upgraded.

Which issue(s) this PR fixes:

Fixes periodic AKS e2e job failures caused by Azure DNS API throttling (all 15 test failures in build 2037565923602206720).

Special notes for your reviewer:

The AWS provider already has equivalent throttling mitigations: --aws-zones-cache-duration=1h and --aws-batch-change-interval=10s. This PR brings Azure closer to parity.
The AZURE_SDK_MAX_RETRIES env var is documented upstream as the recommended way to handle Azure rate limiting.
The e2e interval increase to 3m is deliberately only in the e2e install path (test/e2e/util/install.go), not in the default for all users.

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

openshift-ci-robot · 2026-03-27T19:21:39Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai · 2026-03-27T19:21:47Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8c808b7b-1138-4f7c-9d2d-863c178a5247

📥 Commits

Reviewing files that changed from the base of the PR and between b81af5f and f74873b.

📒 Files selected for processing (3)

cmd/install/assets/hypershift_operator.go
cmd/install/assets/hypershift_operator_test.go
test/e2e/util/install.go

✅ Files skipped from review due to trivial changes (2)

cmd/install/assets/hypershift_operator_test.go
cmd/install/assets/hypershift_operator.go

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e/util/install.go

📝 Walkthrough

Walkthrough

The Azure External DNS provider config is updated to add an environment variable AZURE_SDK_MAX_RETRIES=5 to the external-dns container and to append the CLI argument --azure-config-file=/etc/provider/credentials. Unit tests were extended to allow asserting container env vars for Azure and verify the new env var. The e2e install helper now sets installOpts.ExternalDNSInterval to "3m".

coderabbitai

🧹 Nitpick comments (1)

test/e2e/util/install.go (1)
57-57: Consider making the interval configurable or adding a comment explaining the rationale.

This hardcodes "3m" rather than passing through from opts like other external-dns settings (lines 54-56). While this aligns with the PR's goal to mitigate Azure DNS throttling in CI, it silently overrides the default "1m" interval for all e2e test installs.

If this is intentional CI-specific behavior, a brief comment would clarify intent for future maintainers. Alternatively, expose it via HyperShiftOperatorInstallOptions for flexibility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/install.go` at line 57, installOpts.ExternalDNSInterval is
hardcoded to "3m", silently overriding the default; either expose this as a
configurable field on HyperShiftOperatorInstallOptions and pass
opts.ExternalDNSInterval through when setting installOpts.ExternalDNSInterval
(add the new field to HyperShiftOperatorInstallOptions and propagate it in the
install setup), or if the 3m value is an intentional CI-only workaround add a
concise comment next to the installOpts.ExternalDNSInterval assignment
explaining the CI-throttling rationale and why it differs from the default so
future maintainers understand the decision.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: ba8bebfd-6406-41ff-be93-8022ea9cd1d7

📥 Commits

Reviewing files that changed from the base of the PR and between c25481f and b81af5f.

📒 Files selected for processing (3)

cmd/install/assets/hypershift_operator.go
cmd/install/assets/hypershift_operator_test.go
test/e2e/util/install.go

openshift-ci-robot · 2026-03-27T19:27:13Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

Adds Azure-specific throttling mitigations to the external-dns deployment to prevent DNS record creation failures caused by Azure DNS API rate limiting (HTTP 429).

Root Cause

In the periodic AKS e2e job (build 2037565923602206720), all 7 HostedClusters failed with ExternalDNSReachable=False. The external-dns pod logged a single error after startup:
dns.RecordSetsClient#listAllByDNSZoneNextResults: Failure responding to next results request:
StatusCode=429 -- Original Error: autorest/azure: Service returned an error.
Status=429 Code="Throttled" Message="Too many operations are requested. Current operation is throttled."
The external-dns pod had zero successful sync cycles across its entire lifetime — the only log entry after startup was the 429 error. Without DNS records being created, the API server hostnames (*.aks-e2e.hypershift.azure.devcluster.openshift.com) never resolved, causing all clusters to fail validation with no such host errors.

The 429 was triggered by the listAllByDNSZoneNextResults call — a paginated zone record listing that external-dns performs every sync cycle because no zone caching was configured for the Azure provider. With 7+ HostedClusters creating routes simultaneously, the API call volume exceeded Azure's per-subscription rate limit.

Key Evidence

| Evidence | Detail |
| --- | --- |
| external-dns log | Only startup + 1 error (429 Throttled on listAllByDNSZoneNextResults); zero sync messages in ~25 min |
| Pod status | restartCount: 0, running since 16:35:30Z; pod was healthy, just throttled |
| HostedCluster conditions | All 7 HCs: ExternalDNSReachable=False, Available=False (KASLoadBalancerNotReachable), DataPlaneConnectionAvailable=Unknown (no workers) |
| LoadBalancer | Shared ingress LB had external IP 20.88.124.35; networking was fine, only DNS was broken |
| Control plane | EtcdAvailable=True, KubeAPIServerAvailable=True, ReconciliationSucceeded=True; control plane was healthy |
| Passing test comparison | TestHAEtcdChaos passed because it uses type: LoadBalancer (not Route), so it doesn't depend on ExternalDNS at all |
| DNS failures | 65 "no such host" errors across all 7 unique API hostnames |
| Upstream docs | external-dns Azure throttling docs confirm --azure-zones-cache-duration and AZURE_SDK_MAX_RETRIES as the recommended mitigations |

Changes

cmd/install/assets/hypershift_operator.go — Add Azure-specific external-dns configuration:

--azure-zones-cache-duration=1h: Caches DNS zone listings to avoid repeated listAllByDNSZoneNextResults API calls. This mirrors the existing --aws-zones-cache-duration=1h already set for the AWS provider.

AZURE_SDK_MAX_RETRIES=5 env var: Increases the Azure SDK retry count from the default of 3, allowing transient 429 errors to be retried with exponential backoff.

test/e2e/util/install.go — Set ExternalDNSInterval = "3m" for e2e test installs. The default 1m creates excessive API pressure when 7+ HostedClusters run simultaneously in CI. This only affects CI e2e tests, not production deployments.

cmd/install/assets/hypershift_operator_test.go — Updated the Azure test case to verify the new --azure-zones-cache-duration=1h arg and AZURE_SDK_MAX_RETRIES=5 env var.

Why all three fixes

| Fix | Scope | What it prevents |
| --- | --- | --- |
| --azure-zones-cache-duration=1h | All Azure deployments | Eliminates the uncached zone listing call that caused the 429 |
| AZURE_SDK_MAX_RETRIES=5 | All Azure deployments | Retries transient 429s with backoff instead of failing immediately |
| ExternalDNSInterval = "3m" | CI e2e tests only | Reduces polling frequency under the heavy concurrent load unique to CI |

Which issue(s) this PR fixes:

Fixes periodic AKS e2e job failures caused by Azure DNS API throttling (all 15 test failures in build 2037565923602206720).

Special notes for your reviewer:

The AWS provider already has equivalent throttling mitigations: --aws-zones-cache-duration=1h and --aws-batch-change-interval=10s. This PR brings Azure to parity.

The AZURE_SDK_MAX_RETRIES env var is documented upstream as the recommended way to handle Azure rate limiting.

The e2e interval increase to 3m is deliberately only in the e2e install path (test/e2e/util/install.go), not in the default for all users.

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Summary by CodeRabbit

New Features

Azure External DNS provider configuration enhanced with SDK retry limits and DNS zone caching duration settings.

Tests

Expanded test coverage for Azure External DNS provider configuration validation and e2e installation options.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

bryan-cox · 2026-03-27T19:28:14Z

/jira refresh

openshift-ci-robot · 2026-03-27T19:28:19Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

bryan-cox · 2026-03-28T01:46:32Z

/test e2e-aks

cwbotbot · 2026-03-28T03:00:02Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-04-02T14:48:53Z

e2e-aks

Status: ✅ PASS
Started: 2026-04-02T14:49:26Z

bryan-cox · 2026-03-30T10:35:56Z

/retest

bryan-cox · 2026-03-30T12:30:25Z

/retest

Add Azure-specific throttling mitigations to the external-dns deployment to prevent DNS record creation failures caused by Azure DNS API rate limiting (HTTP 429).

Changes:
- Add --azure-zones-cache-duration=1h to cache DNS zone listings, preventing repeated listAllByDNSZoneNextResults API calls that trigger Azure throttling. This mirrors the existing --aws-zones-cache-duration=1h used for the AWS provider.
- Set AZURE_SDK_MAX_RETRIES=5 environment variable to increase the Azure SDK retry count (default is 3), allowing transient 429 errors to be retried with backoff as documented in the upstream external-dns Azure tutorial.
- Increase the external-dns polling interval to 3m for e2e tests, which create 7+ HostedClusters simultaneously and generate heavy DNS API load.

Root cause analysis:
In the periodic-ci AKS e2e job (build 2037565923602206720), all 7 HostedClusters failed with ExternalDNSReachable=False because the external-dns pod received a 429 Throttled response from Azure DNS:

dns.RecordSetsClient#listAllByDNSZoneNextResults: StatusCode=429 Code="Throttled" Message="Too many operations are requested"

The external-dns pod had zero successful sync cycles across its entire lifetime - the only log entries after startup were the 429 error. Without DNS records, the API server hostnames (*.aks-e2e.hypershift.azure.devcluster.openshift.com) never resolved, causing all clusters to fail validation with "no such host" errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci-robot · 2026-03-30T14:34:30Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:

expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Summary by CodeRabbit

New Features

Azure External DNS provider: add SDK retry limit (AZURE_SDK_MAX_RETRIES=5) and include Azure config file argument for provider credentials.

Tests

Expanded unit tests to validate Azure External DNS environment and args.

e2e install option now sets ExternalDNSInterval to "3m".

bryan-cox · 2026-03-30T14:35:28Z

/retest

bryan-cox · 2026-03-30T18:59:25Z

/retest

bryan-cox · 2026-03-30T18:59:49Z

/test e2e-azure-self-managed
/test e2e-aks

bryan-cox · 2026-03-31T13:28:57Z

/retest-required

bryan-cox · 2026-03-31T16:52:19Z

/test e2e-azure-self-managed
/test e2e-aks

bryan-cox · 2026-04-01T12:38:04Z

/test e2e-aks

bryan-cox · 2026-04-01T14:16:23Z

/test e2e-aks
/test e2e-azure-self-managed

Nirshal · 2026-04-02T14:48:21Z

/lgtm

openshift-ci-robot · 2026-04-02T14:48:22Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

openshift-ci · 2026-04-02T14:48:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, Nirshal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bryan-cox · 2026-04-02T15:06:53Z

/verified by azure-e2es

openshift-ci-robot · 2026-04-02T15:07:07Z

@bryan-cox: This PR has been marked as verified by azure-e2es.

bryan-cox · 2026-04-02T15:07:23Z

/jira refresh

openshift-ci-robot · 2026-04-02T15:07:27Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:

expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

bryan-cox · 2026-04-02T15:07:45Z

/jira refresh

openshift-ci-robot · 2026-04-02T15:07:53Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-merge-bot · 2026-04-02T15:21:20Z

/retest-required

Remaining retests: 0 against base HEAD e087807 and 2 for PR HEAD f74873b in total

openshift-ci · 2026-04-02T17:23:20Z

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-04-02T17:26:03Z

@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-81270
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-81270 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

openshift-merge-robot · 2026-04-04T04:51:21Z

Fix included in accepted release 4.22.0-0.nightly-2026-04-03-204456

openshift-ci Bot added the do-not-merge/needs-area label Mar 27, 2026

openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Mar 27, 2026

openshift-ci Bot requested review from enxebre and sjenning March 27, 2026 19:22

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2026

coderabbitai Bot reviewed Mar 27, 2026

bryan-cox changed the title ~~fix(external-dns): mitigate Azure DNS API throttling~~ OCPBUGS-81270: fix(external-dns): mitigate Azure DNS API throttling Mar 27, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 27, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 27, 2026

bryan-cox force-pushed the fix-azure-external-dns-throttling-v2 branch from b81af5f to f74873b Compare March 30, 2026 14:31

openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 30, 2026

Nirshal approved these changes Apr 2, 2026

openshift-ci Bot assigned Nirshal Apr 2, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 2, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 2, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 2, 2026

openshift-merge-bot Bot merged commit 2aa4e8a into openshift:main Apr 2, 2026
27 of 28 checks passed
