Move webhook registration behind feature gate flag #5099

Open: bryan-cox wants to merge 1 commit into main from fix-webhook-registration


Contributor

@bryan-cox bryan-cox commented Aug 28, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
Move webhook registration behind feature gate flags similar to controller registration.

Without this PR, in a self-managed (externally managed infrastructure) setup, excluding the CRDs behind the MachinePool and ASOAPI feature flags produces the error below, because the webhooks for those CRDs are still registered.

E0828 10:05:27.972237       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
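
The idea of gating registration can be sketched with the standard library alone. This is a minimal illustration, not the actual change in main.go: the `gates` type, the `webhook` struct, and `kindsToRegister` are hypothetical names, and treating AzureMachine as ungated is an assumption made only for the example.

```go
package main

import "fmt"

// gates is a hypothetical stand-in for the manager's feature gate set.
type gates map[string]bool

// Enabled reports whether the named gate is on; unknown gates count as off.
func (g gates) Enabled(name string) bool { return g[name] }

// webhook pairs a kind with the gate that guards it. An empty gate means
// the webhook is always registered.
type webhook struct {
	kind string
	gate string
}

// kindsToRegister returns the kinds whose webhooks should be set up,
// skipping every webhook whose gate is disabled.
func kindsToRegister(g gates, all []webhook) []string {
	var kinds []string
	for _, w := range all {
		if w.gate == "" || g.Enabled(w.gate) {
			kinds = append(kinds, w.kind)
		}
	}
	return kinds
}

func main() {
	g := gates{"MachinePool": false}
	all := []webhook{
		{kind: "AzureMachine"}, // assumed ungated for this sketch
		{kind: "AzureMachinePool", gate: "MachinePool"},
	}
	// With MachinePool disabled, only the ungated kind is registered.
	fmt.Println(kindsToRegister(g, all))
}
```

With the gate off, nothing is registered for AzureMachinePool, so nothing needs that CRD to exist at startup.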

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Moves webhook registration behind feature gate flags, matching how controller registration is already gated.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 28, 2024
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 28, 2024
@k8s-ci-robot
Contributor

Hi @bryan-cox. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 28, 2024
@muraee

muraee commented Aug 28, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 28, 2024
@nojnhuh
Contributor

nojnhuh commented Aug 29, 2024

We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?

This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.

Are you seeing any adverse behavior besides the error message?

@bryan-cox
Contributor Author

> We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
>
> This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.
>
> Are you seeing any adverse behavior besides the error message?

We aren't using AzureMachinePool. Yeah, we are seeing more than just the log message; the CAPZ pod restarts constantly. Here are some additional logs before the pod restarts:

E0829 15:50:31.089094       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
I0829 15:50:38.588560       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-8z465" namespace="clusters-generic-hc" name="generic-hc-9npwz-8z465" reconcileID="743788a0-e979-4c1e-9ca4-0c854d575fc0" x-ms-correlation-request-id="0951596f-73b2-4a57-801b-40faca63ef50"
I0829 15:50:38.809896       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-7p4fb" namespace="clusters-generic-hc" name="generic-hc-9npwz-7p4fb" reconcileID="f9e21048-ea5d-44bf-9c2d-195d7ad86e74" x-ms-correlation-request-id="1e6492aa-fe1d-413c-9cac-292107e030f7"
E0829 15:50:41.091628       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
E0829 15:50:41.235638       1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.235695       1 internal.go:516] "Stopping and waiting for non leader election runnables"
I0829 15:50:41.235829       1 internal.go:520] "Stopping and waiting for leader election runnables"
I0829 15:50:41.235949       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236026       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236232       1 controller.go:242] "All workers finished" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236158       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236386       1 controller.go:242] "All workers finished" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236177       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236823       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237036       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237121       1 internal.go:528] "Stopping and waiting for caches"
I0829 15:50:41.237583       1 internal.go:532] "Stopping and waiting for webhooks"
I0829 15:50:41.237981       1 server.go:249] "Shutting down webhook server with timeout of 1 minute" logger="controller-runtime.webhook"
I0829 15:50:41.238191       1 internal.go:535] "Stopping and waiting for HTTP servers"
I0829 15:50:41.238323       1 server.go:231] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0829 15:50:41.238458       1 server.go:43] "shutting down server" kind="health probe" addr="[::]:9440"
I0829 15:50:41.238568       1 internal.go:539] "Wait completed, proceeding to shutdown the manager"
E0829 15:50:41.238677       1 main.go:353] "problem running manager" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" logger="setup"

We have the MachinePool feature turned off in our pod deployment:

      containers:
      - args:
        - --namespace=$(MY_NAMESPACE)
        - --leader-elect=true
        - --feature-gates=MachinePool=false
...
        name: manager
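
For readers unfamiliar with the flag format, here is a simplified sketch of how a value such as `MachinePool=false` maps to per-gate booleans. `parseFeatureGates` is a hypothetical helper written for illustration; the real manager presumably relies on the Kubernetes feature gate machinery rather than hand-rolled parsing like this.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates turns a comma-separated list such as
// "MachinePool=false,ASOAPI=true" into a map of gate name to state.
// Hypothetical helper, not the parser the manager actually uses.
func parseFeatureGates(s string) (map[string]bool, error) {
	out := map[string]bool{}
	if strings.TrimSpace(s) == "" {
		return out, nil
	}
	for _, pair := range strings.Split(s, ",") {
		name, value, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(value))
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %w", name, err)
		}
		out[strings.TrimSpace(name)] = enabled
	}
	return out, nil
}

func main() {
	g, err := parseFeatureGates("MachinePool=false")
	if err != nil {
		panic(err)
	}
	fmt.Println(g["MachinePool"])
}
```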

@bryan-cox
Contributor Author

FWIW the machines do get provisioned and join our cluster. The CAPZ pod just consistently restarts.

@bryan-cox
Contributor Author

Contributor

@mboersma mboersma left a comment

/lgtm
/assign @nojnhuh

This is causing pod restarts and the fix follows an approach similar to CAPA's, so I think it's reasonable.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 11c709afba588489dedad6cf904e145c3451676a


codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 0% with 50 lines in your changes missing coverage. Please review.

Project coverage is 52.98%. Comparing base (dbc4d54) to head (32f73b1).
Report is 122 commits behind head on main.

Files with missing lines Patch % Lines
main.go 0.00% 50 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5099   +/-   ##
=======================================
  Coverage   52.98%   52.98%           
=======================================
  Files         273      273           
  Lines       29197    29145   -52     
=======================================
- Hits        15469    15443   -26     
+ Misses      12926    12900   -26     
  Partials      802      802           


@nojnhuh
Contributor

nojnhuh commented Sep 27, 2024

If they're no longer reachable, can we remove the checks in the webhooks that the corresponding feature gates for resources are enabled? e.g.

// NOTE: AzureMachinePool is behind MachinePool feature gate flag; the webhook
// must prevent creating new objects in case the feature flag is disabled.
if !feature.Gates.Enabled(capifeature.MachinePool) {
    return nil, field.Forbidden(
        field.NewPath("spec"),
        "can be set only if the MachinePool feature flag is enabled",
    )
}

@bryan-cox
Contributor Author

> If they're no longer reachable, can we remove the checks in the webhooks that the corresponding feature gates for resources are enabled? e.g.
>
> // NOTE: AzureMachinePool is behind MachinePool feature gate flag; the webhook
> // must prevent creating new objects in case the feature flag is disabled.
> if !feature.Gates.Enabled(capifeature.MachinePool) {
>     return nil, field.Forbidden(
>         field.NewPath("spec"),
>         "can be set only if the MachinePool feature flag is enabled",
>     )
> }

@nojnhuh - I don't understand. Could you clarify a bit more? Wouldn't those checks still be valid since those are behind a featuregate flag?

@nojnhuh
Contributor

nojnhuh commented Oct 23, 2024

> > If they're no longer reachable, can we remove the checks in the webhooks that the corresponding feature gates for resources are enabled? e.g.
> >
> > // NOTE: AzureMachinePool is behind MachinePool feature gate flag; the webhook
> > // must prevent creating new objects in case the feature flag is disabled.
> > if !feature.Gates.Enabled(capifeature.MachinePool) {
> >     return nil, field.Forbidden(
> >         field.NewPath("spec"),
> >         "can be set only if the MachinePool feature flag is enabled",
> >     )
> > }
>
> @nojnhuh - I don't understand. Could you clarify a bit more? Wouldn't those checks still be valid since those are behind a featuregate flag?

If we only start the webhooks in main.go when the feature gate is enabled, is it possible for the feature gate to be disabled when we're inside the webhook? The feature gates can't be toggled at runtime AFAIK.
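
A minimal sketch of that reasoning follows. It is a toy model, not the CAPZ code: `validateCreate` and `admit` are hypothetical, and the gate is assumed to be fixed for the life of the process.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden reuses the message from the quoted check.
var errForbidden = errors.New("can be set only if the MachinePool feature flag is enabled")

// validateCreate mimics the in-webhook check: reject when the gate is off.
func validateCreate(machinePoolEnabled bool) error {
	if !machinePoolEnabled {
		return errForbidden
	}
	return nil
}

// admit models one create request. gateRegistration selects whether the
// webhook is registered only when the gate is enabled. It reports whether
// the handler ran and what it returned.
func admit(machinePoolEnabled, gateRegistration bool) (handlerRan bool, err error) {
	registered := !gateRegistration || machinePoolEnabled
	if !registered {
		return false, nil // the handler is never invoked
	}
	return true, validateCreate(machinePoolEnabled)
}

func main() {
	for _, enabled := range []bool{false, true} {
		ran, err := admit(enabled, true)
		fmt.Printf("gated registration, enabled=%v: handlerRan=%v err=%v\n", enabled, ran, err)
	}
	ran, err := admit(false, false)
	fmt.Printf("ungated registration, enabled=false: handlerRan=%v err=%v\n", ran, err)
}
```

In this model the forbidden branch is reachable only when registration is not gated. The sketch does not cover what the API server does when a webhook configuration still points at a path that is no longer served; that depends on the configured failure policy.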

@bryan-cox bryan-cox force-pushed the fix-webhook-registration branch from 83f3f66 to 2f2e523 Compare October 23, 2024 18:26
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from mboersma. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2024
@bryan-cox bryan-cox force-pushed the fix-webhook-registration branch from 2f2e523 to 362e74d Compare October 23, 2024 19:09
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2024
@bryan-cox bryan-cox force-pushed the fix-webhook-registration branch from 362e74d to 26db138 Compare October 24, 2024 12:48
Move webhook registration behind feature gate flags similar to
controller registration.

Signed-off-by: Bryan Cox <[email protected]>
@bryan-cox bryan-cox force-pushed the fix-webhook-registration branch from 26db138 to 32f73b1 Compare October 24, 2024 13:45
@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-e2e-aks

Contributor

@mboersma mboersma left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 802baeabc173b13515eb73bc8556cf322c7e37db

@bryan-cox
Contributor Author

Hey @nojnhuh 👋🏻 - if you're good with the changes, can I get a /approve on this PR please?

@nojnhuh
Contributor

nojnhuh commented Nov 7, 2024

Sorry, still working my way back to this.

/assign

@nawazkh nawazkh added this to the v1.18 milestone Nov 7, 2024
@nojnhuh
Contributor

nojnhuh commented Nov 12, 2024

> Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?

^ (and when the AzureMachinePool CRD is installed, as it would be for default CAPZ installations?)

@bryan-cox Can you confirm that this is the case?

@bryan-cox
Contributor Author

> > Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
>
> ^ (and when the AzureMachinePool CRD is installed, as it would be for default CAPZ installations?)
>
> @bryan-cox Can you confirm that this is the case?

@nojnhuh I can try and take a look. The project I'm on doesn't use that CRD in our self-managed setup.

@bryan-cox
Contributor Author

bryan-cox commented Nov 26, 2024

> > > Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
>
> > ^ (and when the AzureMachinePool CRD is installed, as it would be for default CAPZ installations?)
> > @bryan-cox Can you confirm that this is the case?
>
> @nojnhuh I can try and take a look. The project I'm on doesn't use that CRD in our self-managed setup.

@nojnhuh - Sorry for the delay, but I tested this yesterday. If I install the AzureMachinePool CRD, there are no errors in the CAPZ pod, and the machines are provisioned and join a nodepool in my self-managed setup.

However, if the AzureMachinePool CRD isn't installed, the issue is still there: the pod restarts continually.

@willie-yao
Contributor

@nojnhuh Did you have time to take a look at this again? No problem if not; feel free to assign it to me in that case.

@nojnhuh
Contributor

nojnhuh commented Dec 12, 2024

If you could take a look that would be great!

/assign @willie-yao
/unassign

@k8s-ci-robot k8s-ci-robot assigned willie-yao and unassigned nojnhuh Dec 12, 2024
@willie-yao
Contributor

> However, if the AzureMachinePool CRD isn't installed, the issue is still there: the pod restarts continually.

Looks good to me, although I just wanted a bit of clarification on this statement. Does this issue still exist and how can you reproduce it? Which pod is restarting continually? @bryan-cox @mboersma

@bryan-cox
Contributor Author

bryan-cox commented Dec 13, 2024

> > However, if the AzureMachinePool CRD isn't installed, the issue is still there: the pod restarts continually.
>
> Looks good to me, although I just wanted a bit of clarification on this statement. Does this issue still exist and how can you reproduce it? Which pod is restarting continually? @bryan-cox @mboersma

@willie-yao - The issue was still present when I made that comment. I haven't tested it again since. The project I work on at Red Hat uses the self-managed flavor of CAPZ (we BYO cloud infra) with the exception of machine management. We ran into this issue when we were reducing the set of CAPZ-related Azure CRDs we install. This is related to #5294. We don't use AzureMachinePool and several other CAPZ-related CRDs. One could reproduce it by using externally managed Azure infra and not installing the AzureMachinePool CRD.

The pod that was restarting was the one running the main CAPZ program, manager.

@willie-yao
Contributor

Just wanted to make sure: This only happens when they try to create an AzureMachinePool when the MachinePool flag is disabled, or just when the MachinePool flag is disabled in general? If it's the former, I think we can go ahead and merge this, otherwise this would probably block the PR. Thoughts? @mboersma

@bryan-cox
Contributor Author

> Just wanted to make sure: This only happens when they try to create an AzureMachinePool when the MachinePool flag is disabled, or just when the MachinePool flag is disabled in general? If it's the former, I think we can go ahead and merge this, otherwise this would probably block the PR. Thoughts? @mboersma

@willie-yao - The pod restarts consistently when the AzureMachinePool CRD isn't installed. I can't recall completely what the flag status was.

Would it be helpful to have a working example of this to go over at one of the office hours?

@willie-yao
Contributor

> Would it be helpful to have a working example of this to go over at one of the office hours?

Yup that would be great! Thank you so much!! I'll try to reproduce this in the meantime, and if I can, I'll let you know. Apologies for the hassle and delay on this.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Needs Review
8 participants