Live Control Plane Migration (CPM) or CPM with zero downtime #10686
Comments
Live CPM POC/Status
Main problems
ETCD Migration - Approaches evaluated -
DNS switching -
VPN migration - The VPN should remain available during the whole migration so that webhooks in the shoot stay reachable.
Actual Flow
As of now, two gardenlets work in parallel on one shoot, based on a set of annotations such as -
The following diagram explains the flow in more detail -
Things that will/can change during implementation
Points for Discussion
Manual Work (Not implemented)
Changes required in repos -
Limitations
Data
KAPI downtime was eliminated (achieving zero downtime) by introducing a delay in deleting the source KAPI after migrating the DNS record.
@acumino would you be able to provide an approximate timeline for when this would be available and productive?
@adenitiu The work on this is currently paused to prioritize InPlace. It will probably start in Q2 2025. We don't have an exact timeline.
How to categorize this issue?
/area open-source
/kind enhancement
What would you like to be added:
Currently, shoot control plane migrations cause temporary downtime for the shoot cluster because ETCD needs to be backed up and deleted before being restored in the new seed cluster. During this time, the API server, along with all other control plane components, is also taken down. Although the workload within the shoot cluster continues running, it cannot be reconciled, scaled, or updated, leading to downtime since the control plane is unavailable to users.
We would like to support live Control Plane Migration (CPM), allowing migrations to happen without causing downtime for the API server, thereby preventing downtime for the users. This ensures that the shoot cluster remains fully operational, with the control plane continuously available to users.
We (@acumino, @shafeeqes, and @ary1992) conducted a POC on this, and it is feasible to implement. More details can be found here.
Why is this needed:
Prevent downtime during control plane migration, ✨ enabling support for more use cases and scenarios, such as 'seed draining/shoot evictions' or streamlined seed cluster deletions.