OCPBUGS-81270: fix(external-dns): mitigate Azure DNS API throttling#8098

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from bryan-cox:fix-azure-external-dns-throttling-v2
Apr 2, 2026

Conversation

@bryan-cox
Member

@bryan-cox bryan-cox commented Mar 27, 2026

What this PR does / why we need it:

Adds Azure-specific throttling mitigations to the external-dns deployment to prevent DNS record creation failures caused by Azure DNS API rate limiting (HTTP 429).

Root Cause

In the periodic AKS e2e job (build 2037565923602206720), all 7 HostedClusters failed with ExternalDNSReachable=False. The external-dns pod logged a single error after startup:

dns.RecordSetsClient#listAllByDNSZoneNextResults: Failure responding to next results request:
StatusCode=429 -- Original Error: autorest/azure: Service returned an error.
Status=429 Code="Throttled" Message="Too many operations are requested. Current operation is throttled."

The external-dns pod had zero successful sync cycles across its entire lifetime — the only log entry after startup was the 429 error. Without DNS records being created, the API server hostnames (*.aks-e2e.hypershift.azure.devcluster.openshift.com) never resolved, causing all clusters to fail validation with no such host errors.

Changes

  1. cmd/install/assets/hypershift_operator.go — Add AZURE_SDK_MAX_RETRIES=5 env var to the Azure external-dns deployment. This increases the Azure SDK retry count from the default of 3, allowing transient 429 errors to be retried with exponential backoff. This is the upstream-recommended mitigation.

  2. test/e2e/util/install.go — Set ExternalDNSInterval = "3m" for e2e test installs. The default 1m creates excessive API pressure when 7+ HostedClusters run simultaneously in CI. This only affects CI e2e tests, not production deployments.

  3. cmd/install/assets/hypershift_operator_test.go — Updated the Azure ExternalDNS test case to verify the new AZURE_SDK_MAX_RETRIES=5 env var is set.
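
As a rough illustration of changes 1 and 3, the sketch below sets the env var on a container and then reads it back the way a unit test might. It is not the code from this PR: `EnvVar` and `Container` are simplified stand-ins for the Kubernetes `corev1` types, and `withAzureThrottlingMitigations` and `envValue` are hypothetical helper names.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for corev1.EnvVar and
// corev1.Container (assumption: the real installer uses the k8s.io/api types).
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Args []string
	Env  []EnvVar
}

// withAzureThrottlingMitigations is a hypothetical helper. It returns a copy
// of c with AZURE_SDK_MAX_RETRIES set, replacing any existing value.
func withAzureThrottlingMitigations(c Container, maxRetries int) Container {
	out := c
	out.Env = make([]EnvVar, 0, len(c.Env)+1)
	for _, e := range c.Env {
		if e.Name != "AZURE_SDK_MAX_RETRIES" {
			out.Env = append(out.Env, e)
		}
	}
	out.Env = append(out.Env, EnvVar{Name: "AZURE_SDK_MAX_RETRIES", Value: fmt.Sprint(maxRetries)})
	return out
}

// envValue looks up an env var by name, as a unit test assertion might.
func envValue(c Container, name string) (string, bool) {
	for _, e := range c.Env {
		if e.Name == name {
			return e.Value, true
		}
	}
	return "", false
}

func main() {
	c := Container{
		Name: "external-dns",
		// Illustrative args only; the real deployment sets more flags.
		Args: []string{"--provider=azure", "--azure-config-file=/etc/provider/credentials"},
	}
	c = withAzureThrottlingMitigations(c, 5)
	v, ok := envValue(c, "AZURE_SDK_MAX_RETRIES")
	fmt.Println(v, ok)
}
```

Replacing an existing entry rather than appending blindly keeps the container from ending up with two conflicting values for the same variable.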

Note on --azure-zones-cache-duration

The upstream docs also recommend --azure-zones-cache-duration to cache DNS zone listings, but this flag was added in a newer version of external-dns than the v0.13.x image currently shipped (registry.redhat.io/edo/external-dns-rhel8@sha256:638fb6b5..., tag 1.1.0-3). Adding this flag would crash the external-dns pod with an "unknown flag" error. This mitigation should be added when the external-dns image is upgraded.
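
One possible way to add the flag safely once the image is upgraded is to gate it on the external-dns version. The sketch below is an assumption, not something this PR implements: `azureArgs` is a hypothetical helper, and the minimum version is passed in by the caller because the release that introduced the flag is not stated here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v0.13.5" or "0.14" into [major, minor, patch].
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("invalid version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether have >= want, comparing major, minor, then patch.
func atLeast(have, want [3]int) bool {
	for i := range have {
		if have[i] != want[i] {
			return have[i] > want[i]
		}
	}
	return true
}

// azureArgs (hypothetical) appends the zone cache flag only when the image
// version is at least minVersion, so older images never see an unknown flag.
func azureArgs(base []string, imageVersion, minVersion string) ([]string, error) {
	have, err := parseVersion(imageVersion)
	if err != nil {
		return nil, err
	}
	want, err := parseVersion(minVersion)
	if err != nil {
		return nil, err
	}
	args := append([]string{}, base...)
	if atLeast(have, want) {
		args = append(args, "--azure-zones-cache-duration=1h")
	}
	return args, nil
}

func main() {
	// "v0.99.0" is a placeholder threshold, not the real release number.
	args, err := azureArgs([]string{"--provider=azure"}, "v0.13.6", "v0.99.0")
	fmt.Println(args, err)
}
```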

Which issue(s) this PR fixes:

Fixes periodic AKS e2e job failures caused by Azure DNS API throttling (all 15 test failures in build 2037565923602206720).

Special notes for your reviewer:

  • The AWS provider already has equivalent throttling mitigations: --aws-zones-cache-duration=1h and --aws-batch-change-interval=10s. This PR brings Azure closer to parity.
  • The AZURE_SDK_MAX_RETRIES env var is documented upstream as the recommended way to handle Azure rate limiting.
  • The e2e interval increase to 3m is deliberately only in the e2e install path (test/e2e/util/install.go), not in the default for all users.
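
To make the effect of the interval change concrete, the short sketch below counts sync cycles per hour for the two interval values mentioned above. The helper name `syncsPerHour` is hypothetical, and the number of Azure API requests issued per cycle is not known from this PR, so only the cycle count is computed.

```go
package main

import (
	"fmt"
	"time"
)

// syncsPerHour returns how many sync cycles fit in one hour for an
// external-dns interval string such as "1m" or "3m".
func syncsPerHour(interval string) (int, error) {
	d, err := time.ParseDuration(interval)
	if err != nil {
		return 0, err
	}
	if d <= 0 {
		return 0, fmt.Errorf("interval must be positive, got %s", interval)
	}
	return int(time.Hour / d), nil
}

func main() {
	for _, iv := range []string{"1m", "3m"} {
		n, err := syncsPerHour(iv)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("interval=%s syncs/hour=%d\n", iv, n)
	}
	// prints:
	// interval=1m syncs/hour=60
	// interval=3m syncs/hour=20
}
```

Moving from 1m to 3m cuts the number of sync cycles to one third, which should reduce list traffic against the Azure DNS API by roughly the same factor if each cycle costs about the same.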

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Contributor

coderabbitai Bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 8c808b7b-1138-4f7c-9d2d-863c178a5247

📥 Commits

Reviewing files that changed from the base of the PR and between b81af5f and f74873b.

📒 Files selected for processing (3)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/assets/hypershift_operator_test.go
  • test/e2e/util/install.go
✅ Files skipped from review due to trivial changes (2)
  • cmd/install/assets/hypershift_operator_test.go
  • cmd/install/assets/hypershift_operator.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/util/install.go

📝 Walkthrough

Walkthrough

The Azure External DNS provider config is updated to add an environment variable AZURE_SDK_MAX_RETRIES=5 to the external-dns container and to append the CLI argument --azure-config-file=/etc/provider/credentials. Unit tests were extended to allow asserting container env vars for Azure and verify the new env var. The e2e install helper now sets installOpts.ExternalDNSInterval to "3m".

Sequence Diagram(s)

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Mar 27, 2026
@openshift-ci openshift-ci Bot requested review from enxebre and sjenning March 27, 2026 19:22
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
test/e2e/util/install.go (1)

57-57: Consider making the interval configurable or adding a comment explaining the rationale.

This hardcodes "3m" rather than passing through from opts like other external-dns settings (lines 54-56). While this aligns with the PR's goal to mitigate Azure DNS throttling in CI, it silently overrides the default "1m" interval for all e2e test installs.

If this is intentional CI-specific behavior, a brief comment would clarify intent for future maintainers. Alternatively, expose it via HyperShiftOperatorInstallOptions for flexibility.
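
A minimal sketch of the "configurable with a documented default" option suggested here might look like the following. `Options`, its `ExternalDNSInterval` field, and `resolveExternalDNSInterval` are hypothetical names chosen for illustration; the actual e2e options struct may be shaped differently.

```go
package main

import "fmt"

// e2eDefaultExternalDNSInterval is the CI-only default. A slower poll than
// the regular 1m reduces Azure DNS API pressure when many HostedClusters
// are created at once.
const e2eDefaultExternalDNSInterval = "3m"

// Options is a hypothetical stand-in for the e2e options struct.
type Options struct {
	// ExternalDNSInterval overrides the e2e default when non-empty.
	ExternalDNSInterval string
}

// resolveExternalDNSInterval returns the caller's override if one is set,
// otherwise the documented CI default.
func resolveExternalDNSInterval(opts Options) string {
	if opts.ExternalDNSInterval != "" {
		return opts.ExternalDNSInterval
	}
	return e2eDefaultExternalDNSInterval
}

func main() {
	fmt.Println(resolveExternalDNSInterval(Options{}))
	fmt.Println(resolveExternalDNSInterval(Options{ExternalDNSInterval: "1m"}))
}
```

Keeping the value in a named constant with a comment addresses the maintainability concern even if the override field is never added.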

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/util/install.go` at line 57, installOpts.ExternalDNSInterval is
hardcoded to "3m", silently overriding the default; either expose this as a
configurable field on HyperShiftOperatorInstallOptions and pass
opts.ExternalDNSInterval through when setting installOpts.ExternalDNSInterval
(add the new field to HyperShiftOperatorInstallOptions and propagate it in the
install setup), or if the 3m value is an intentional CI-only workaround add a
concise comment next to the installOpts.ExternalDNSInterval assignment
explaining the CI-throttling rationale and why it differs from the default so
future maintainers understand the decision.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: ba8bebfd-6406-41ff-be93-8022ea9cd1d7

📥 Commits

Reviewing files that changed from the base of the PR and between c25481f and b81af5f.

📒 Files selected for processing (3)
  • cmd/install/assets/hypershift_operator.go
  • cmd/install/assets/hypershift_operator_test.go
  • test/e2e/util/install.go

@bryan-cox bryan-cox changed the title fix(external-dns): mitigate Azure DNS API throttling OCPBUGS-81270: fix(external-dns): mitigate Azure DNS API throttling Mar 27, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 27, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

What this PR does / why we need it:

Adds Azure-specific throttling mitigations to the external-dns deployment to prevent DNS record creation failures caused by Azure DNS API rate limiting (HTTP 429).

Root Cause

In the periodic AKS e2e job (build 2037565923602206720), all 7 HostedClusters failed with ExternalDNSReachable=False. The external-dns pod logged a single error after startup:

dns.RecordSetsClient#listAllByDNSZoneNextResults: Failure responding to next results request:
StatusCode=429 -- Original Error: autorest/azure: Service returned an error.
Status=429 Code="Throttled" Message="Too many operations are requested. Current operation is throttled."

The external-dns pod had zero successful sync cycles across its entire lifetime — the only log entry after startup was the 429 error. Without DNS records being created, the API server hostnames (*.aks-e2e.hypershift.azure.devcluster.openshift.com) never resolved, causing all clusters to fail validation with no such host errors.

The 429 was triggered by the listAllByDNSZoneNextResults call — a paginated zone record listing that external-dns performs every sync cycle because no zone caching was configured for the Azure provider. With 7+ HostedClusters creating routes simultaneously, the API call volume exceeded Azure's per-subscription rate limit.

Key Evidence

| Evidence | Detail |
|---|---|
| external-dns log | Only startup + 1 error (429 Throttled on listAllByDNSZoneNextResults); zero sync messages in ~25 min |
| Pod status | restartCount: 0, running since 16:35:30Z; pod was healthy, just throttled |
| HostedCluster conditions | All 7 HCs: ExternalDNSReachable=False, Available=False (KASLoadBalancerNotReachable), DataPlaneConnectionAvailable=Unknown (no workers) |
| LoadBalancer | Shared ingress LB had external IP 20.88.124.35; networking was fine, only DNS was broken |
| Control plane | EtcdAvailable=True, KubeAPIServerAvailable=True, ReconciliationSucceeded=True; control plane was healthy |
| Passing test comparison | TestHAEtcdChaos passed because it uses type: LoadBalancer (not Route), so it doesn't depend on ExternalDNS at all |
| DNS failures | 65 "no such host" errors across all 7 unique API hostnames |
| Upstream docs | external-dns Azure throttling docs confirm --azure-zones-cache-duration and AZURE_SDK_MAX_RETRIES as the recommended mitigations |

Changes

  1. cmd/install/assets/hypershift_operator.go: Add Azure-specific external-dns configuration:
     • --azure-zones-cache-duration=1h: Caches DNS zone listings to avoid repeated listAllByDNSZoneNextResults API calls. This mirrors the existing --aws-zones-cache-duration=1h already set for the AWS provider.
     • AZURE_SDK_MAX_RETRIES=5 env var: Increases the Azure SDK retry count from the default of 3, allowing transient 429 errors to be retried with exponential backoff.

  2. test/e2e/util/install.go: Set ExternalDNSInterval = "3m" for e2e test installs. The default 1m creates excessive API pressure when 7+ HostedClusters run simultaneously in CI. This only affects CI e2e tests, not production deployments.

  3. cmd/install/assets/hypershift_operator_test.go: Updated the Azure test case to verify the new --azure-zones-cache-duration=1h arg and AZURE_SDK_MAX_RETRIES=5 env var.

Why all three fixes

| Fix | Scope | What it prevents |
|---|---|---|
| --azure-zones-cache-duration=1h | All Azure deployments | Eliminates the uncached zone listing call that caused the 429 |
| AZURE_SDK_MAX_RETRIES=5 | All Azure deployments | Retries transient 429s with backoff instead of failing immediately |
| ExternalDNSInterval = "3m" | CI e2e tests only | Reduces polling frequency under the heavy concurrent load unique to CI |
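
The "retry with backoff" behavior can be sketched generically as below. This is an illustration of the technique only, not the Azure SDK's actual retry policy: `retryOnThrottle` and `errThrottled` are hypothetical names, and the doubling delay is an assumed schedule.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errThrottled stands in for an HTTP 429 "Throttled" response.
var errThrottled = errors.New("429 Throttled")

// retryOnThrottle calls op once and then retries up to maxRetries more times
// while it keeps returning errThrottled, doubling the delay after each
// throttled attempt. It returns the number of attempts made and the last error.
func retryOnThrottle(op func() error, maxRetries int, baseDelay time.Duration) (attempts int, err error) {
	delay := baseDelay
	for attempts = 1; ; attempts++ {
		err = op()
		if err == nil || !errors.Is(err, errThrottled) || attempts > maxRetries {
			return attempts, err
		}
		time.Sleep(delay)
		delay *= 2
	}
}

func main() {
	// Simulated API that is throttled for its first four calls.
	calls := 0
	op := func() error {
		calls++
		if calls <= 4 {
			return errThrottled
		}
		return nil
	}

	n, err := retryOnThrottle(op, 3, time.Millisecond)
	fmt.Printf("maxRetries=3: attempts=%d err=%v\n", n, err)

	calls = 0
	n, err = retryOnThrottle(op, 5, time.Millisecond)
	fmt.Printf("maxRetries=5: attempts=%d err=%v\n", n, err)
}
```

In this simulation a budget of 3 retries gives up while the API is still throttled, whereas 5 retries outlasts the throttling window. That is the intuition behind raising the retry count, though real throttling windows may be longer than any retry budget.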

Which issue(s) this PR fixes:

Fixes periodic AKS e2e job failures caused by Azure DNS API throttling (all 15 test failures in build 2037565923602206720).

Special notes for your reviewer:

  • The AWS provider already has equivalent throttling mitigations: --aws-zones-cache-duration=1h and --aws-batch-change-interval=10s. This PR brings Azure to parity.
  • The AZURE_SDK_MAX_RETRIES env var is documented upstream as the recommended way to handle Azure rate limiting.
  • The e2e interval increase to 3m is deliberately only in the e2e install path (test/e2e/util/install.go), not in the default for all users.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • New Features

  • Azure External DNS provider configuration enhanced with SDK retry limits and DNS zone caching duration settings.

  • Tests

  • Expanded test coverage for Azure External DNS provider configuration validation and e2e installation options.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bryan-cox
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 27, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

@bryan-cox
Member Author

/test e2e-aks

@cwbotbot

cwbotbot commented Mar 28, 2026

Test Results

e2e-aws

e2e-aks

@bryan-cox
Member Author

/retest

Add Azure-specific throttling mitigations to the external-dns deployment
to prevent DNS record creation failures caused by Azure DNS API rate
limiting (HTTP 429).

Changes:
- Add --azure-zones-cache-duration=1h to cache DNS zone listings,
  preventing repeated listAllByDNSZoneNextResults API calls that trigger
  Azure throttling. This mirrors the existing --aws-zones-cache-duration=1h
  used for the AWS provider.
- Set AZURE_SDK_MAX_RETRIES=5 environment variable to increase the Azure
  SDK retry count (default is 3), allowing transient 429 errors to be
  retried with backoff as documented in the upstream external-dns Azure
  tutorial.
- Increase the external-dns polling interval to 3m for e2e tests, which
  create 7+ HostedClusters simultaneously and generate heavy DNS API load.

Root cause analysis:
In the periodic-ci AKS e2e job (build 2037565923602206720), all 7
HostedClusters failed with ExternalDNSReachable=False because the
external-dns pod received a 429 Throttled response from Azure DNS:

  dns.RecordSetsClient#listAllByDNSZoneNextResults: StatusCode=429
  Code="Throttled" Message="Too many operations are requested"

The external-dns pod had zero successful sync cycles across its entire
lifetime - the only log entry after startup was the 429 error. Without
DNS records, the API server hostnames
(*.aks-e2e.hypershift.azure.devcluster.openshift.com) never resolved,
causing all clusters to fail validation with "no such host" errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox bryan-cox force-pushed the fix-azure-external-dns-throttling-v2 branch from b81af5f to f74873b Compare March 30, 2026 14:31
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 30, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Azure External DNS provider: add SDK retry limit (AZURE_SDK_MAX_RETRIES=5) and include Azure config file argument for provider credentials.

  • Tests

  • Expanded unit tests to validate Azure External DNS environment and args.

  • e2e install option now sets ExternalDNSInterval to "3m".

@bryan-cox
Member Author

/retest

@bryan-cox
Member Author

/test e2e-azure-self-managed
/test e2e-aks

@bryan-cox
Member Author

/retest-required

@bryan-cox
Member Author

/test e2e-azure-self-managed
/test e2e-aks

@bryan-cox
Member Author

/test e2e-aks

@bryan-cox
Member Author

/test e2e-aks
/test e2e-azure-self-managed

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 2, 2026
@Nirshal
Contributor

Nirshal commented Apr 2, 2026

/lgtm

@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@openshift-ci
Contributor

openshift-ci Bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, Nirshal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bryan-cox
Member Author

/verified by azure-e2es

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 2, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by azure-e2es.

Details

In response to this:

/verified by azure-e2es

@bryan-cox
Member Author

/jira refresh

@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

@bryan-cox
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 2, 2026
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-81270, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD e087807 and 2 for PR HEAD f74873b in total

@openshift-ci
Contributor

openshift-ci Bot commented Apr 2, 2026

@bryan-cox: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 2aa4e8a into openshift:main Apr 2, 2026
27 of 28 checks passed
@openshift-ci-robot

@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-81270
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-81270 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@openshift-merge-robot
Contributor

Fix included in accepted release 4.22.0-0.nightly-2026-04-03-204456

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cli Indicates the PR includes changes for CLI area/testing Indicates the PR includes changes for e2e testing jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

5 participants