feat: add --lazy-bind-reconcilers flag for deferred CRD binding#240
feat: add --lazy-bind-reconcilers flag for deferred CRD binding#240michaelhtm wants to merge 1 commit into
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: michaelhtm The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
When a reconciler is bound for a CRD that doesn't exist in the cluster, controller-runtime's cache sync times out and crash-loops the pod. This adds an opt-in flag that defers reconciler binding for missing CRDs until they become available, allowing the controller to start serving installed CRDs immediately. When enabled, a background goroutine polls the discovery API every 10s for each missing CRD, binding the reconciler once detected. The context is respected for graceful shutdown, and bind failures trigger os.Exit(1) since they indicate a permanent configuration error.
c2692d6 to
bbbfd88
Compare
|
@michaelhtm: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
| flag.DurationVar( | ||
| &cfg.LazyBindRetryInterval, flagLazyBindRetryInterval, | ||
| 10*time.Second, | ||
| "The interval at which the controller checks for CRD availability when lazy-bind-reconcilers is enabled.", |
There was a problem hiding this comment.
nit: Mentioning the unit of time in the description and/or the flag itself would be helpful.
| // bindReconcilerFunc is an optional override for binding a reconciler. | ||
| // When nil, the default bindReconciler is used. This exists to support | ||
| // testing without hitting controller-runtime's global controller registry. | ||
| bindReconcilerFunc func(ctrlrt.Manager, acktypes.AWSResourceManagerFactory, ackcfg.Config, ackrtcache.Caches, *iamroleselector.Cache, bool, logr.Logger) error |
|
|
||
| if err != nil { | ||
| log.Error(err, "failed to bind reconciler after CRD became available, exiting") | ||
| os.Exit(1) |
There was a problem hiding this comment.
It looks like this would immediately exit the process. Would it be possible to initiate a graceful shutdown instead? Terminating immediately could prevent the controller from patching status updates which we've seen result in orphaned AWS resources.
| }, | ||
| ) | ||
| rd.On("GroupVersionResource").Return( | ||
| schema.GroupVersionResource{ |
There was a problem hiding this comment.
nit: should this also include a version?
| // checkResourceInstalled returns whether the given GVR is installed in the | ||
| // cluster. It delegates to resourceInstalledCheckFunc if set (for testing), | ||
| // otherwise falls through to getResourceInstalled. | ||
| func (c *serviceController) checkResourceInstalled(mgr ctrlrt.Manager, gvr schema.GroupVersionResource) (bool, error) { |
There was a problem hiding this comment.
Instead of creating a function to wrap the test func it might better to update the getResourceInstalled() function to accept what we need to mock as a dependency. Looking at getResourceInstalled by itself it isn't clear that it should be wrapped by the testing function and could lead to different code paths using getResourceInstalled and checkResourceInstalled in the future.
I do see that ctrlrt.Manager already has a GetRESTMapper that we may be able to mock.
Issue #, if available:
Description of changes:
When a reconciler is bound for a CRD that doesn't exist in the cluster,
controller-runtime's cache sync times out and crash-loops the pod. This
adds an opt-in flag that defers reconciler binding for missing CRDs
until they become available, allowing the controller to start serving
installed CRDs immediately.
When enabled, a background goroutine polls the discovery API every 10s
for each missing CRD, binding the reconciler once detected. The context
is respected for graceful shutdown, and bind failures trigger os.Exit(1)
since they indicate a permanent configuration error.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.