DRA: add canary jobs and sync mechanism #33993

Open
wants to merge 7 commits into
base: master

pohly
Contributor

@pohly pohly commented Dec 18, 2024

This consolidates all DRA jobs (presubmit and periodic) in a single file. That makes it possible to mirror all jobs in a second file with -canary as suffix for the file and job names and to compare them with diff. It also enables usage of YAML anchors and aliases to reuse settings across jobs.

The intent is to try out changes first in the canary jobs, then copy those changes into the real jobs. Because this process is error-prone (doing it manually elsewhere failed at least once), the dra-sync.sh helper script can be used to automate the copying. One commit was already generated that way in this PR.

/cc @bart0sh @kannon92

This is a follow-up to:

pohly and others added 5 commits December 18, 2024 08:39
This makes editing a bit easier because one doesn't have to jump back and forth
between different files while trying to keep periodic and presubmit jobs in
sync. In the past, changes were made to one but not the other, perhaps because
author and reviewer forgot about the other half of the jobs.

It also enables usage of YAML anchors and aliases to define some settings only
in one place. They are scoped to one document.

Breaking the jobs which are in use while making changes is annoying, but hard
to avoid because testing these jobs locally is difficult. The
https://docs.prow.k8s.io/docs/build-test-update/#how-to-test-a-prowjob method
doesn't work (or at least not easily) because of nested containers (kind inside
kind, for E2E) and the need for a special test environment (E2E node).

The wrong CI job got referenced.

The benefit of trying out changes in canary jobs is diminished if the actual
change then still needs to be done manually. There has been at least one case
elsewhere where the canary job changes were okay, but then copying them into
the real jobs was bungled such that they broke.

To avoid this, the shell script automates copying of changes. To use it, run
dra-sync.sh on a new, clean branch and submit the generated commit in a PR.
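
The core of such a sync can be sketched as follows. This is not the actual dra-sync.sh from this PR. It is a minimal Python illustration which assumes that the only intended difference between the two files is the `-canary` suffix on job names. The helper name `canary_to_real` is hypothetical.

```python
import re

# Hypothetical helper, not the real dra-sync.sh: derive the content of the
# real job file from the canary file, assuming the "-canary" suffix on job
# names is the only intended difference between the two files.
CANARY_NAME = re.compile(
    r"^([ \t]*(?:-[ \t]+)?name:[ \t]*\S+?)-canary[ \t]*$",
    re.MULTILINE,
)


def canary_to_real(canary_yaml: str) -> str:
    """Return canary_yaml with the -canary suffix removed from job names."""
    return CANARY_NAME.sub(r"\1", canary_yaml)


canary = """\
periodics:
- name: ci-kind-dra-canary
  cluster: eks-prow-build-cluster
"""

real = canary_to_real(canary)
print(real)
```

The result could then be compared with the existing real job file using diff before committing, which is the kind of check that consolidating the jobs in one file makes possible.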
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/config Issues or PRs related to code in /config labels Dec 18, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign endocrimes for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Dec 18, 2024
# on a kind cluster with containerd updated to a version with CDI support.
- name: ci-kind-dra-canary
  cluster: eks-prow-build-cluster
  interval: 1000000h # Run only once on creation and when manually triggered.
Contributor Author

/hold

I am not sure yet how this "manually triggered" could work. I asked on Slack: https://kubernetes.slack.com/archives/C09QZ4DQB/p1734506374693689

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 18, 2024
It's convenient to run `go test .` while editing the YAML files to see if there
are any parse errors.

Furthermore, `go test -v` prints the files in a normalized form. This can be
used for before/after comparisons when making larger changes.
These different jobs were all similar, with small variations.
Keeping them in sync was an ongoing challenge that we lost:
- at some point, features were enabled differently in E2E node
  presubmits and periodics (it didn't make a difference, but
  they weren't the same anymore)
- resource settings for containerd vs CRI-O were different
  and it is unclear whether that was intentional (no comment
  about it)
- clusters were different

Defining common elements via YAML anchors once and reusing them via YAML
aliases avoids this.

`go test -v -run=TestYaml/dynamic-resource-allocation-canary.yaml .`
can be used to see the full job definitions.
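
As a rough analogy for readers less familiar with YAML: when an alias is pulled in through the `<<` merge key (assumed here to be the mechanism in use), the result resembles a shallow dictionary merge in which keys set on the job itself win. The Python sketch below imitates that with plain dicts. All names and values are made up for illustration, and this is not how Prow parses the job files.

```python
# Stand-in for a YAML anchor: settings defined once (values are made up).
common = {
    "cluster": "example-cluster",
    "decorate": True,
    "labels": ["example-a"],
}

# Stand-ins for two jobs that alias the anchor and add their own keys.
# Keys given on the job override the shared ones.
periodic = {**common, "name": "example-periodic", "interval": "6h"}
presubmit = {**common, "name": "example-presubmit", "labels": ["example-b"]}

# Shared scalar settings are identical in both jobs by construction.
print(periodic["cluster"] == presubmit["cluster"])

# A list given on the job replaces the shared list instead of extending it.
print(presubmit["labels"])
```

The second print shows the limitation that also applies to merged YAML mappings: a job can override a shared list, but it cannot append to it.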
#
# Lists cannot be extended the same way (https://github.com/yaml/yaml/issues/35).
#
# If unsure what the expanded jobs look like or to test parsing, run `go test -v .`
Contributor Author

@kannon92 @bart0sh: I went ahead and rewrote this file so that it avoids duplication between the different jobs. From 3942639:

dra: use YAML anchors and aliases to avoid duplication

These different jobs were all similar, with small variations.
Keeping them in sync was an ongoing challenge that we lost:
- at some point, features were enabled differently in E2E node
  presubmits and periodics (it didn't make a difference, but
  they weren't the same anymore)
- resource settings for containerd vs CRI-O were different
  and it is unclear whether that was intentional (no comment
  about it)
- clusters were different

Defining common elements via YAML anchors once and reusing them via YAML
aliases avoids this.

`go test -v -run=TestYaml/dynamic-resource-allocation-canary.yaml .`
can be used to see the full job definitions.

What do you think?

I know @bart0sh pointed out the difficulty of keeping periodics and presubmits in sync. This approach solves that, but is it readable and usable enough?

Contributor Author

There are some test failures that would need further work. But let's first hear whether this is worth it at all.

Perhaps there is some other templating mechanism that can be used instead to generate the YAML files?

Contributor

This looks better than what we currently have.
However, I'd prefer to use the existing solution suggested by @dims on Slack.

Contributor

@pohly It would be even better to avoid having a lot of configuration details in the generation script (https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/kops/build_jobs.py) and have a yaml file with a set of values, templates for various job types and a generic build_jobs script that generates result by rendering templates based on values.yaml. I can play with this if you don't have time.
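
A minimal sketch of that split between values, templates and a generic script, using only the Python standard library. The template, the value sets and the `build_jobs` function are hypothetical. A real values.yaml would be parsed with a YAML library, and real Prow jobs need more fields than shown.

```python
from string import Template

# Hypothetical template for one job type. It only illustrates the
# values-plus-template split and is not a complete Prow job.
PERIODIC_TEMPLATE = Template("""\
- name: $name
  cluster: $cluster
  interval: $interval
""")

# Hypothetical stand-in for the parsed content of a values.yaml file.
VALUES = [
    {"name": "example-job-a", "cluster": "example-cluster", "interval": "6h"},
    {"name": "example-job-b", "cluster": "example-cluster", "interval": "24h"},
]


def build_jobs(template: Template, values: list[dict[str, str]]) -> str:
    """Render the template once per value set and concatenate the results."""
    return "periodics:\n" + "".join(template.substitute(v) for v in values)


generated = build_jobs(PERIODIC_TEMPLATE, VALUES)
print(generated)
```

With this layout the script contains no job-specific details: adding a job means adding a value set, and changing all jobs of one type means editing one template.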

Contributor Author

I agree, this PR is going down the wrong track. Instead of two files which we need to keep in sync, we should have a single dynamic-resource-allocation.yaml which is fully generated. That file should also include canary presubmits where we can try out experimental changes before changing the generator to apply them also to the normal jobs.

We don't need canary periodic jobs, for two reasons:

  • If the difference between presubmit and periodic is guaranteed to be limited to the interval vs. branch settings, then testing the job spec with a canary presubmit is sufficient.
  • It is less critical if a periodic job breaks because it doesn't affect other developers in the PRs.
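
The first point can be illustrated with a small generator sketch. It assumes a simplified job model in which a presubmit and a periodic share one spec and differ only in `branches` vs. `interval`. All job names, the spec content and the function names are hypothetical, and the real Prow configuration has more structure (for example, presubmits grouped by repository).

```python
import copy

# Hypothetical shared job spec; the real DRA jobs carry far more settings.
SHARED_SPEC = {"containers": [{"image": "example-image", "args": ["--example"]}]}


def presubmit(name, branches, spec):
    """A presubmit is the shared spec plus a branch selection."""
    return {"name": name, "branches": list(branches), "spec": copy.deepcopy(spec)}


def periodic(name, interval, spec):
    """A periodic is the shared spec plus an interval."""
    return {"name": name, "interval": interval, "spec": copy.deepcopy(spec)}


def generate(spec, experimental_spec=None):
    """Emit normal jobs from spec; the canary presubmit may try a new spec."""
    canary_spec = spec if experimental_spec is None else experimental_spec
    return {
        "presubmits": [
            presubmit("pull-example", ["master"], spec),
            presubmit("pull-example-canary", ["master"], canary_spec),
        ],
        "periodics": [periodic("ci-example", "6h", spec)],
    }


jobs = generate(SHARED_SPEC)
trial = generate(SHARED_SPEC, {"containers": [{"image": "example-image-next"}]})
print(jobs["presubmits"][0]["spec"] == jobs["periodics"][0]["spec"])
```

In `trial`, only the canary presubmit picks up the experimental spec. Once it has passed, the change would be moved into the shared spec so that the normal presubmit and the periodic receive it together.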

> I can play with this if you don't have time.

Thanks, please do.

Contributor

@pohly this is what I came up with so far, PTAL: #34010

@k8s-ci-robot
Contributor

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-test-infra-prow-checkconfig | 3942639 | link | true | /test pull-test-infra-prow-checkconfig |
| pull-test-infra-unit-test | 3942639 | link | true | /test pull-test-infra-unit-test |
| pull-test-infra-unit-test-race-detector-nonblocking | 3942639 | link | false | /test pull-test-infra-unit-test-race-detector-nonblocking |
| pull-test-infra-verify-lint | 3942639 | link | true | /test pull-test-infra-verify-lint |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 19, 2024
@k8s-ci-robot
Contributor

PR needs rebase.


Labels
area/config Issues or PRs related to code in /config area/jobs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: PRs Waiting on Author

3 participants