DRA: add canary jobs and sync mechanism #33993

pohly · 2024-12-18T09:18:24Z

This consolidates all DRA jobs (presubmit and periodic) in a single file. That makes it possible to mirror all jobs in a second file with -canary as suffix for the file and job names and to compare them with diff. It also enables usage of YAML anchors and aliases to reuse settings across jobs.

The intent is to try out changes first in the canary jobs, then copy those changes into the real jobs. Because this process is error-prone (doing it manually elsewhere failed at least once), the dra-sync.sh helper script can be used to automate the copying. One commit was already generated that way in this PR.

/cc @bart0sh @kannon92

This is a follow-up to:

This makes editing a bit easier because one doesn't have to jump back and forth between different files while trying to keep periodic and presubmit jobs in sync. In the past, changes were made to one but not the other, perhaps because author and reviewer forgot about the other half of the jobs. It also enables usage of YAML anchors and aliases to define some settings only in one place. Anchors are scoped to a single YAML document, so jobs that share settings have to live in the same file.

Breaking the jobs which are in use while making changes is annoying, but hard to avoid because testing these jobs locally is difficult. The https://docs.prow.k8s.io/docs/build-test-update/#how-to-test-a-prowjob method doesn't work (or at least not easily) because of nested containers (kind inside kind, for E2E) and the need for a special test environment (E2E node).

The benefit of trying out changes in canary jobs is diminished if the actual change then still needs to be done manually. There has been at least one case elsewhere where the canary job changes were okay, but then copying them into the real jobs was bungled such that they broke. To avoid this, the shell script automates copying of changes. To use it, run dra-sync.sh on a new, clean branch and submit the generated commit in a PR.
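As a rough illustration of the mirroring described above, the two files might differ only in the -canary suffix and in a few scheduling settings. This is a sketch, not the content of the PR: the name of the non-canary file and the `interval` of the real job are assumptions, and most job fields are omitted.

```yaml
# File 1: config/jobs/kubernetes/sig-node/dynamic-resource-allocation-canary.yaml
periodics:
- name: ci-kind-dra-canary
  cluster: eks-prow-build-cluster
  interval: 1000000h # Run only once on creation and when manually triggered.
---
# File 2: config/jobs/kubernetes/sig-node/dynamic-resource-allocation.yaml
# (file name assumed; shown as a second YAML document only to fit in one block)
periodics:
- name: ci-kind-dra
  cluster: eks-prow-build-cluster
  interval: 6h # assumed value, for illustration only
```

With such a layout, a plain diff of the two files would show only the renamed jobs and the settings that intentionally differ, which is what makes an automated copy feasible.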

k8s-ci-robot · 2024-12-18T09:18:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign endocrimes for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

config/jobs/kubernetes/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2024-12-18T09:21:31Z

config/jobs/kubernetes/sig-node/dynamic-resource-allocation-canary.yaml

+  # on a kind cluster with containerd updated to a version with CDI support.
+  - name: ci-kind-dra-canary
+    cluster: eks-prow-build-cluster
+    interval: 1000000h # Run only once on creation and when manually triggered.


/hold

I am not sure yet how this "manually triggered" could work. I asked on Slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1734506374693689

pohly · 2024-12-18T14:53:05Z

config/jobs/kubernetes/sig-node/dynamic-resource-allocation-canary.yaml

+#
+# Lists cannot be extended the same way (https://github.com/yaml/yaml/issues/35).
+#
+# If unsure what the expanded jobs look like or to test parsing, run `go test -v .`


@kannon92 @bart0sh: I went ahead and rewrote this file so that it avoids duplication between the different jobs. From 3942639:

dra: use YAML anchors and aliases to avoid duplication

These different jobs were all similar, with small variations. Keeping them in sync was an ongoing challenge that we lost:

- at some point, features were enabled differently in E2E node presubmits and periodics (it didn't make a difference, but they weren't the same anymore)
- resource settings for containerd vs. CRI-O were different and it is unclear whether that was intentional (no comment about it)
- clusters were different

Defining common elements via YAML anchors once and reusing them via YAML aliases avoids this. `go test -v -run=TestYaml/dynamic-resource-allocation-canary.yaml .` can be used to see the full job definitions.
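To make the anchor/alias idea concrete, here is a minimal sketch of how a pod spec could be defined once and reused by a periodic and a presubmit job in the same file. It is not taken from the commit: the job names, the image, and the resource values are placeholders chosen for illustration.

```yaml
# Sketch only: job names, image, and resource values are placeholders.
periodics:
- name: ci-example-dra
  cluster: eks-prow-build-cluster
  interval: 6h
  decorate: true
  spec: &dra-spec                # anchor: the pod spec is defined once here
    containers:
    - image: registry.example/e2e-image:latest
      command:
      - runner.sh
      resources:
        requests: &dra-resources # anchor: one set of resource values
          cpu: "2"
          memory: 6Gi
        limits: *dra-resources   # alias: limits are identical to requests

presubmits:
  kubernetes/kubernetes:
  - name: pull-example-dra
    cluster: eks-prow-build-cluster
    decorate: true
    always_run: false
    spec: *dra-spec              # alias: same pod spec as the periodic job
```

An alias reuses the anchored value as a whole. Parsers that support the YAML merge key (`<<`) can additionally extend an anchored map with extra keys, but, as the header comment in the canary file notes, lists cannot be extended that way.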

What do you think?

I know @bart0sh pointed out the difficulty of keeping periodics and presubmits in sync. This approach solves that, but is it readable and usable enough?

There are some test failures that would need further work. But let's first hear whether this is worth it at all.

Perhaps there is some other templating mechanism that can be used instead to generate the YAML files?

This looks better than what we currently have.
However, I'd prefer to use the existing solution suggested by @dims on Slack.

@pohly It would be even better to avoid having a lot of configuration details in the generation script (https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/kops/build_jobs.py). Instead, we could have a YAML file with a set of values, templates for the various job types, and a generic build_jobs script that generates the result by rendering the templates based on values.yaml. I can play with this if you don't have time.
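A values file along those lines could look roughly like the sketch below. Nothing here comes from an existing generator: every key, the template file name, and all values are invented to illustrate the idea of keeping job-specific data separate from the templates.

```yaml
# values.yaml (hypothetical layout; every key below is invented for illustration)
defaults:
  cluster: eks-prow-build-cluster
  resources:
    cpu: "2"
    memory: 6Gi

jobs:
- suffix: kind-dra              # would become ci-kind-dra / pull-kind-dra
  template: e2e-kind.yaml.tmpl  # hypothetical template file
  periodic:
    interval: 6h                # assumed value
  presubmit:
    branches:
    - master
```

A generic script would then render each template once per entry in `jobs`, so the periodic and presubmit variants come from the same source and cannot drift apart.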

I agree, this PR is going down the wrong track. Instead of two files which we need to keep in sync, we should have a single dynamic-resource-allocation.yaml which is fully generated. That file should also include canary presubmits where we can try out experimental changes before changing the generator to apply them also to the normal jobs.

We don't need canary periodic jobs, for two reasons:

- If the difference between presubmit and periodic is guaranteed to be limited to the interval vs. branch settings, then testing the job spec with a canary presubmit is sufficient.
- It is less critical if a periodic job breaks because it doesn't affect other developers in the PRs.
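The proposed layout could be sketched as follows, assuming a single generated file in which the periodic and presubmit jobs share their spec and a canary presubmit carries the experimental change. Job names, image tags, and the interval are placeholders, and most job fields are omitted.

```yaml
# Hypothetical generator output; names and values are placeholders.
periodics:
- name: ci-example-dra
  interval: 6h                  # periodic-only setting
  spec:
    containers:
    - image: registry.example/e2e-image:stable
      command: [runner.sh]

presubmits:
  kubernetes/kubernetes:
  - name: pull-example-dra
    branches: [master]          # presubmit-only setting
    always_run: false
    spec:                       # identical to the periodic spec
      containers:
      - image: registry.example/e2e-image:stable
        command: [runner.sh]
  - name: pull-example-dra-canary
    branches: [master]
    always_run: false           # only runs when requested with /test
    optional: true              # a failure does not block merging
    spec:
      containers:
      - image: registry.example/e2e-image:experimental  # change under test
        command: [runner.sh]
```

Once the canary presubmit passes, the change would be moved into the generator so that it reaches the normal presubmit and periodic jobs together.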

> I can play with this if you don't have time.

Thanks, please do.

@pohly this is what I came up with so far, PTAL: #34010

k8s-ci-robot · 2024-12-18T14:58:39Z

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-test-infra-prow-checkconfig | `3942639` | link | true | `/test pull-test-infra-prow-checkconfig` |
| pull-test-infra-unit-test | `3942639` | link | true | `/test pull-test-infra-unit-test` |
| pull-test-infra-unit-test-race-detector-nonblocking | `3942639` | link | false | `/test pull-test-infra-unit-test-race-detector-nonblocking` |
| pull-test-infra-verify-lint | `3942639` | link | true | `/test pull-test-infra-verify-lint` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot · 2024-12-19T01:50:36Z

PR needs rebase.


pohly and others added 5 commits December 18, 2024 08:39

dra canary: fix comment

b4f79a9

The wrong CI job got referenced.

dra: apply changes from canary jobs

c82feb7

k8s-ci-robot requested review from bart0sh and kannon92 December 18, 2024 09:18

k8s-ci-robot added the cncf-cla: yes, size/XXL, and area/config labels Dec 18, 2024

k8s-ci-robot added the area/jobs, sig/node, and sig/testing labels Dec 18, 2024

k8s-ci-robot added the do-not-merge/hold label Dec 18, 2024

pohly added 2 commits December 18, 2024 15:48

node: add YAML test

71e5959

It's convenient to run `go test .` while editing the YAML files to see if there are any parse errors. Furthermore, `go test -v` prints the files in a normalized form. This can be used for before/after comparisons when making larger changes.

k8s-ci-robot added the needs-rebase label Dec 19, 2024

bart0sh mentioned this pull request Dec 19, 2024

Failing SIG-Node presubmit jobs kubernetes/kubernetes#127831

Closed

8 tasks

pohly mentioned this pull request Dec 21, 2024

generate DRA job configs from a Jinja template #34010

Draft
