DRA: add canary jobs and sync mechanism #33993
base: master
Conversation
This makes editing a bit easier because one doesn't have to jump back and forth between different files while trying to keep periodic and presubmit jobs in sync. In the past, changes were made to one but not the other, perhaps because author and reviewer forgot about the other half of the jobs. It also enables usage of YAML anchors and aliases to define some settings only in one place; anchors and aliases are scoped to a single YAML document, which is why the jobs have to live in one file for this to work.
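For illustration, a minimal sketch of the anchor/alias pattern; the job names and preset labels below are made up for the example and are not the actual DRA job definitions:

```yaml
periodics:
  - name: ci-kind-dra-example                # hypothetical job name
    cluster: eks-prow-build-cluster
    interval: 6h
    labels: &dra_labels                      # define the shared settings once...
      preset-dind-enabled: "true"
      preset-kind-volume-mounts: "true"
  - name: ci-kind-dra-example-all-features   # hypothetical job name
    cluster: eks-prow-build-cluster
    interval: 6h
    labels: *dra_labels                      # ...and reuse them via an alias
```

Anything shared this way only has to be touched in one place when it changes.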
Breaking the jobs which are in use while making changes is annoying, but hard to avoid because testing these jobs locally is difficult. The https://docs.prow.k8s.io/docs/build-test-update/#how-to-test-a-prowjob method doesn't work (or at least not easily) because of nested containers (kind inside kind, for E2E) and the need for a special test environment (E2E node).
The wrong CI job got referenced.
The benefit of trying out changes in canary jobs is diminished if the actual change then still needs to be done manually. There has been at least one case elsewhere where the canary job changes were okay, but then copying them into the real jobs was bungled such that they broke. To avoid this, the shell script automates copying of changes. To use it, run `dra-sync.sh` on a new, clean branch and submit the generated commit in a PR.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
# on a kind cluster with containerd updated to a version with CDI support.
- name: ci-kind-dra-canary
  cluster: eks-prow-build-cluster
  interval: 1000000h # Run only once on creation and when manually triggered.
/hold
I am not sure yet how this "manually triggered" could work. I asked on Slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1734506374693689
It's convenient to run `go test .` while editing the YAML files to see if there are any parse errors. Furthermore, `go test -v` prints the files in a normalized form. This can be used for before/after comparisons when making larger changes.
These different jobs were all similar, with small variations. Keeping them in sync was an on-going challenge that we lost:
- at some point, features were enabled differently in E2E node presubmits and periodics (didn't make a difference, but they weren't the same anymore)
- resource settings for containerd vs CRI-O were different and it is unclear whether that was intentional (no comment about it)
- clusters were different

Defining common elements via YAML anchors once and reusing them via YAML aliases avoids this. `go test -v -run=TestYaml/dynamic-resource-allocation-canary.yaml .` can be used to see the full job definitions.
#
# Lists cannot be extended the same way (https://github.com/yaml/yaml/issues/35).
#
# If unsure what the expanded jobs look like or to test parsing, run `go test -v .`
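To illustrate the comment above (a sketch with invented field names, not taken from the actual job file): YAML merge keys can splice mappings, but there is no equivalent for sequences, so an alias inside a list only nests the referenced list instead of extending it.

```yaml
# Mappings: a merge key splices the anchored entries into the new mapping.
defaults: &defaults
  cluster: eks-prow-build-cluster
  decorate: true
job:
  <<: *defaults          # job now has cluster, decorate and interval
  interval: 6h

# Sequences: an alias inside a list nests the referenced list.
base_args: &base_args
  - --provider=skeleton
args:
  - *base_args           # results in [["--provider=skeleton"], "--extra-flag"]
  - --extra-flag
```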
@kannon92 @bart0sh: I went ahead and rewrote this file so that it avoids duplication between the different jobs. From 3942639:
dra: use YAML anchors and aliases to avoid duplication
These different jobs were all similar, with small variations.
Keeping them in sync was an on-going challenge that we lost:
- at some point, features were enabled differently in E2E node
presubmits and periodics (didn't make a difference, but
they weren't the same anymore)
- resource settings for containerd vs CRI-O were different
and it is unclear whether that was intentional (no comment
about it)
- clusters were different
Defining common elements via YAML anchors once and reusing them via YAML
aliases avoids this.
`go test -v -run=TestYaml/dynamic-resource-allocation-canary.yaml .`
can be used to see the full job definitions.
What do you think?
I know @bart0sh pointed out the difficulty of keeping periodics and presubmits in sync. This approach solves that, but is it readable and usable enough?
There are some test failures that would need further work. But let's first hear whether this is worth it at all.
Perhaps there is some other templating mechanism that can be used instead to generate the YAML files?
This looks better than what we currently have.
However, I'd prefer to use the existing solution suggested by @dims on Slack.
@pohly It would be even better to avoid having a lot of configuration details in the generation script (https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/kops/build_jobs.py) and instead have a YAML file with a set of values, templates for the various job types, and a generic build_jobs script that generates the result by rendering the templates based on values.yaml. I can play with this if you don't have time.
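A rough, hypothetical sketch of what such a values.yaml might contain (all keys and names here are invented for illustration, not an existing format):

```yaml
# Hypothetical values.yaml consumed by a generic job generator.
defaults:
  cluster: eks-prow-build-cluster
jobs:
  - name: kind-dra
    template: kind-e2e        # which job template to render
    canary: true              # also emit a -canary variant
  - name: kind-dra-all
    template: kind-e2e
    feature_gates: [ExampleFeatureGate]   # placeholder value
```

The generator would then render the regular and canary presubmits and the periodics from the same values, so they could not drift apart.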
I agree, this PR is going down the wrong track. Instead of two files which we need to keep in sync, we should have a single `dynamic-resource-allocation.yaml` which is fully generated. That file should also include canary presubmits where we can try out experimental changes before changing the generator to apply them also to the normal jobs.
We don't need canary periodic jobs, for two reasons:
- If the difference between presubmit and periodic is guaranteed to be limited to the `interval` vs. `branch` settings, then testing the job spec with a canary presubmit is sufficient.
- It is less critical if a periodic job breaks because it doesn't affect other developers in the PRs.
I can play with this if you don't have time.
Thanks, please do.
@pohly: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This consolidates all DRA jobs (presubmit and periodic) in a single file. That makes it possible to mirror all jobs in a second file with `-canary` as suffix for the file and job names and to compare them with diff. It also enables usage of YAML anchors and aliases to reuse settings across jobs.

The intent is to try out changes first in the canary jobs, then copy those changes into the real jobs. Because this process is error-prone (doing it manually elsewhere failed at least once), the `dra-sync.sh` helper script can be used to automate the copying. One commit was already generated that way in this PR.

/cc @bart0sh @kannon92
This is a follow-up to: