Skip to content

feat: add --lazy-bind-reconcilers flag for deferred CRD binding#240

Open
michaelhtm wants to merge 1 commit into
aws-controllers-k8s:mainfrom
michaelhtm:feat/lazybindreconciler
Open

feat: add --lazy-bind-reconcilers flag for deferred CRD binding#240
michaelhtm wants to merge 1 commit into
aws-controllers-k8s:mainfrom
michaelhtm:feat/lazybindreconciler

Conversation

@michaelhtm
Copy link
Copy Markdown
Member

Issue #, if available:

Description of changes:
When a reconciler is bound for a CRD that doesn't exist in the cluster,
controller-runtime's cache sync times out and crash-loops the pod. This
adds an opt-in flag that defers reconciler binding for missing CRDs
until they become available, allowing the controller to start serving
installed CRDs immediately.

When enabled, a background goroutine polls the discovery API every 10s
for each missing CRD, binding the reconciler once detected. The context
is respected for graceful shutdown, and bind failures trigger os.Exit(1)
since they indicate a permanent configuration error.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented May 8, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ack-prow ack-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2026
@ack-prow ack-prow Bot requested review from jlbutler and knottnt May 8, 2026 16:13
@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented May 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: michaelhtm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow Bot added the approved label May 8, 2026
@michaelhtm michaelhtm marked this pull request as ready for review May 11, 2026 17:11
@ack-prow ack-prow Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@ack-prow ack-prow Bot requested a review from a-hilaly May 11, 2026 17:11
michaelhtm added a commit to michaelhtm/ack-code-generator that referenced this pull request May 11, 2026
When a reconciler is bound for a CRD that doesn't exist in the cluster,
controller-runtime's cache sync times out and crash-loops the pod. This
adds an opt-in flag that defers reconciler binding for missing CRDs
until they become available, allowing the controller to start serving
installed CRDs immediately.

When enabled, a background goroutine polls the discovery API every 10s
for each missing CRD, binding the reconciler once detected. The context
is respected for graceful shutdown, and bind failures trigger os.Exit(1)
since they indicate a permanent configuration error.
@michaelhtm michaelhtm force-pushed the feat/lazybindreconciler branch from c2692d6 to bbbfd88 Compare May 11, 2026 17:42
@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented May 11, 2026

@michaelhtm: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
iam-controller-test bbbfd88 link true /test iam-controller-test
s3-controller-test bbbfd88 link true /test s3-controller-test
sagemaker-controller-test bbbfd88 link true /test sagemaker-controller-test
ec2-controller-test bbbfd88 link true /test ec2-controller-test
ecr-controller-test bbbfd88 link true /test ecr-controller-test
sqs-controller-test bbbfd88 link true /test sqs-controller-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Comment thread pkg/config/config.go
flag.DurationVar(
&cfg.LazyBindRetryInterval, flagLazyBindRetryInterval,
10*time.Second,
"The interval at which the controller checks for CRD availability when lazy-bind-reconcilers is enabled.",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Mentioning the unit of time in the description and/or the flag itself would be helpful.

// bindReconcilerFunc is an optional override for binding a reconciler.
// When nil, the default bindReconciler is used. This exists to support
// testing without hitting controller-runtime's global controller registry.
bindReconcilerFunc func(ctrlrt.Manager, acktypes.AWSResourceManagerFactory, ackcfg.Config, ackrtcache.Caches, *iamroleselector.Cache, bool, logr.Logger) error
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Is this being used?


if err != nil {
log.Error(err, "failed to bind reconciler after CRD became available, exiting")
os.Exit(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this would immediately exit the process. Would it be possible to initiate a graceful shutdown instead? Terminating immediately could prevent the controller from patching status updates which we've seen result in orphaned AWS resources.

aws-controllers-k8s/community#2497

},
)
rd.On("GroupVersionResource").Return(
schema.GroupVersionResource{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this also include a version?

// checkResourceInstalled returns whether the given GVR is installed in the
// cluster. It delegates to resourceInstalledCheckFunc if set (for testing),
// otherwise falls through to getResourceInstalled.
func (c *serviceController) checkResourceInstalled(mgr ctrlrt.Manager, gvr schema.GroupVersionResource) (bool, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a function to wrap the test func it might better to update the getResourceInstalled() function to accept what we need to mock as a dependency. Looking at getResourceInstalled by itself it isn't clear that it should be wrapped by the testing function and could lead to different code paths using getResourceInstalled and checkResourceInstalled in the future.

I do see that ctrlrt.Manager already has a GetRESTMapper that we may be able to mock.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants