Note: Currently, the test-infra lead has to be a member of the Google GKE Engprod team in order to have access to the Prow cluster. This will change once we migrate our testing infrastructure to a CNCF account. (xref kubernetes/test-infra#5085)
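There are three major areas that the test-infra lead needs to take care of during the release cycle: creating CI and presubmit jobs for the new release, configuring merge automation for code slush, freeze, and thaw, and watching the health of test-infra.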
If you are blocked by anything, you can work with @kubernetes/test-infra-maintainers or the test-infra on-call. Also feel free to ask for help in the #sig-testing and #testing-ops Kubernetes Slack channels.
Creating the CI and presubmit jobs for the new release should happen in weeks 6-7, when we create the new release branch.
Most of the release-blocking jobs are named with -beta or -stableX, which map to our release channels.
Note that this section reflects the current state of the process; we are actively looking to simplify it.
- Bump the build job branches for the k8s build jobs.
- Create kubekins images for the new release by adding a new release target in the kubekins Makefile.
- Update the release version in the image bump script, then run the script to push the new kubekins images. (The person running it needs access to the k8s-testimages GCP project.)
- Similarly, make a new Dockerfile for the kubekins-test image, which is the image we use for our integration and verify jobs. Also bump the image tags in the kubernetes_verify scenario.
- grep for manual-release-bump-required under test-infra. The matches are the jobs that need to be bumped manually each release cycle; remap them to the up-to-date branches.
- Similar to step 2, fork a new version of the kubernetes/kubernetes presubmit job and remove references to the older branches.
- Update the Testgrid config. This is currently manual work: find the dashboard tabs for release-1.x and bump them, along with the jobs inside them, to release-1.(x+1).
- Finally, update the release target section.
Not all of the steps need to happen together, but some new jobs, such as bazel-build, integration, and verify, require the images to be pushed before they can work properly.
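Two of the manual steps above, finding the jobs tagged manual-release-bump-required and rewriting release-1.x references, lend themselves to scripting. The following is a minimal Python sketch and not part of test-infra: the helper names find_marked_lines and bump_release_branch are hypothetical, and the plain text substitution is only an assumption about how a bump could be automated, so the resulting diff should still be reviewed by hand.

```python
import os
import re

MARKER = "manual-release-bump-required"


def find_marked_lines(root, marker=MARKER):
    """Yield (path, line_number, line) for each line under root that contains marker."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    for number, line in enumerate(handle, start=1):
                        if marker in line:
                            yield path, number, line.rstrip("\n")
            except (UnicodeDecodeError, OSError):
                # Skip binary or unreadable files instead of aborting the scan.
                continue


def bump_release_branch(text, minor):
    """Rewrite release-1.<minor> to release-1.<minor + 1> in text.

    The negative lookahead keeps release-1.1 from matching inside release-1.12.
    """
    pattern = re.compile(r"release-1\.%d(?!\d)" % minor)
    return pattern.sub("release-1.%d" % (minor + 1), text)
```

For example, list(find_marked_lines("path/to/test-infra")) would list the lines that still need attention (the path is a placeholder for a local checkout), and bump_release_branch(text, 11) rewrites release-1.11 to release-1.12 while leaving release-1.1 and release-1.110 untouched.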
The code slush, code freeze, and code thaw dates in the release cycle mark points at which merge requirements for PRs in the master branch and the release-<current-release-number> branch change. The remaining branches are release-X.X branches for previous releases and are unaffected by the release cycle.
Code slush and freeze are the two phases of the release cycle with additional merge requirements. Code thaw marks the switch back to the development (normal) phase.
The tool that we use to automate merges is called Tide. Its configuration lives in config.yaml. Tide identifies PRs that are mergeable using GitHub queries that correspond to the entries in the queries field.
Here is an example of what the query config for kubernetes/kubernetes looks like without additional constraints related to the release cycle:
- repos:
  - kubernetes/kubernetes
  labels:
  - lgtm
  - approved
  - "cncf-cla: yes"
  missingLabels:
  - do-not-merge
  - do-not-merge/blocked-paths
  - do-not-merge/cherry-pick-not-approved
  - do-not-merge/hold
  - do-not-merge/invalid-owners-file
  - do-not-merge/release-note-label-needed
  - do-not-merge/work-in-progress
  - needs-kind
  - needs-rebase
  - needs-sig
During code slush and freeze we use two queries instead of one for the kubernetes/kubernetes repo. One query handles the master and current release branches while the other query handles all other branches. The partition is achieved with the includedBranches and excludedBranches fields.
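To illustrate how the two fields partition branches, here is a small Python model. It is a simplified sketch of the filtering semantics only, not Tide's implementation (Tide turns each entry into a GitHub search query); the function name branch_selected and the sample branch names are assumptions for illustration.

```python
from typing import Any, Dict


def branch_selected(query: Dict[str, Any], branch: str) -> bool:
    """Return True if the query's branch filter admits PRs targeting this branch."""
    if "includedBranches" in query:
        # Only the listed branches are covered.
        return branch in query["includedBranches"]
    if "excludedBranches" in query:
        # Every branch except the listed ones is covered.
        return branch not in query["excludedBranches"]
    # No branch filter: the query covers every branch.
    return True


# The two queries used during slush and freeze, reduced to their branch filters.
current_release_query = {"includedBranches": ["master", "release-1.12"]}
older_branches_query = {"excludedBranches": ["master", "release-1.12"]}

for branch in ["master", "release-1.12", "release-1.11"]:
    covered_by = [
        name
        for name, query in [
            ("current", current_release_query),
            ("older", older_branches_query),
        ]
        if branch_selected(query, branch)
    ]
    print(branch, covered_by)
```

Because both queries list the same branches, every branch is covered by exactly one of them, which is what lets the two queries carry different merge requirements.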
Code slush is when merge requirements for the master and current release branches diverge from the requirements for the other branches, so this is when we split the kubernetes/kubernetes Tide query into two queries.
We only add one additional merge requirement for PRs to these two branches for code slush:
- PRs must be in the GitHub milestone for the current release (e.g. v1.12).
Milestone requirements are configured by adding milestone: foo to a query config.
- repos:
  - kubernetes/kubernetes
  milestone: v1.12
  includedBranches:
  - master
  - release-1.12
  labels:
  - lgtm
  - approved
  - "cncf-cla: yes"
  missingLabels:
  # as above...
- repos:
  - kubernetes/kubernetes
  excludedBranches:
  - master
  - release-1.12
  labels:
  - lgtm
  - approved
  - "cncf-cla: yes"
  missingLabels:
  # as above...
Code freeze adds one more merge requirement for PRs in the master and current release branches:
- PRs must have the priority/critical-urgent label.
This label requirement is configured by adding priority/critical-urgent to the list specified by the labels field.
- repos:
  - kubernetes/kubernetes
  milestone: v1.12
  includedBranches:
  - master
  - release-1.12
  labels:
  - lgtm
  - approved
  - priority/critical-urgent
  - "cncf-cla: yes"
  missingLabels:
  # as above...
- repos:
  - kubernetes/kubernetes
  excludedBranches:
  - master
  - release-1.12
  labels:
  - lgtm
  - approved
  - "cncf-cla: yes"
  missingLabels:
  # as above...
Code thaw removes the release cycle merge restrictions and replaces the two queries with a single one. We remain in this state until the next code slush.
- repos:
  - kubernetes/kubernetes
  labels:
  - lgtm
  - approved
  - "cncf-cla: yes"
  missingLabels:
  # as above...
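The three phases differ only in a few fields, which the following Python sketch summarizes by generating the query entries for each phase. In practice the config is edited by hand in config.yaml; the function tide_queries and the phase names passed to it are hypothetical and exist only for this illustration.

```python
import copy

# The kubernetes/kubernetes query without release cycle constraints.
BASE_QUERY = {
    "repos": ["kubernetes/kubernetes"],
    "labels": ["lgtm", "approved", "cncf-cla: yes"],
    "missingLabels": [
        "do-not-merge",
        "do-not-merge/blocked-paths",
        "do-not-merge/cherry-pick-not-approved",
        "do-not-merge/hold",
        "do-not-merge/invalid-owners-file",
        "do-not-merge/release-note-label-needed",
        "do-not-merge/work-in-progress",
        "needs-kind",
        "needs-rebase",
        "needs-sig",
    ],
}


def tide_queries(phase, minor):
    """Return the list of query entries for a phase of the 1.<minor> release cycle."""
    if phase not in ("normal", "slush", "freeze", "thaw"):
        raise ValueError("unknown phase: %r" % phase)
    if phase in ("normal", "thaw"):
        # Thaw returns to the single unrestricted query.
        return [copy.deepcopy(BASE_QUERY)]

    branches = ["master", "release-1.%d" % minor]

    # Query for master and the current release branch: milestone required.
    current = copy.deepcopy(BASE_QUERY)
    current["milestone"] = "v1.%d" % minor
    current["includedBranches"] = list(branches)
    if phase == "freeze":
        # Freeze additionally requires the priority label.
        current["labels"].append("priority/critical-urgent")

    # Query for all other branches: requirements unchanged.
    others = copy.deepcopy(BASE_QUERY)
    others["excludedBranches"] = list(branches)
    return [current, others]
```

Calling tide_queries("freeze", 12) yields two entries: the first requires the v1.12 milestone and the priority/critical-urgent label on master and release-1.12, and the second leaves every other branch with the base requirements.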
During the release cycle, and especially during code freeze, the test-infra lead needs to actively watch for the following:
- Whether presubmit or CI jobs are failing due to test-infra issues (do some initial triage with the CI Signal Lead).
- Whether Tide is merging PRs into the master and release branches.
We record test-infra commit SHAs in each Testgrid tab. If CI starts to fail between two test-infra commits, the test-infra lead can diff the SHAs to determine whether the failure was caused by a test-infra change.
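As a rough sketch of that triage step, the Python helpers below turn the two SHAs read from Testgrid into a GitHub compare link and an equivalent git log invocation. The helper names are hypothetical and the SHAs in the usage note are placeholders; only the GitHub compare URL format and the git log range syntax are standard.

```python
import re

# Accept abbreviated or full hexadecimal commit SHAs.
SHA_PATTERN = re.compile(r"^[0-9a-f]{7,40}$")


def _check_sha(sha):
    if not SHA_PATTERN.match(sha):
        raise ValueError("not a commit SHA: %r" % sha)
    return sha


def compare_url(old_sha, new_sha, repo="kubernetes/test-infra"):
    """Build the GitHub compare URL listing commits after old_sha up to new_sha."""
    return "https://github.com/%s/compare/%s...%s" % (
        repo,
        _check_sha(old_sha),
        _check_sha(new_sha),
    )


def git_log_command(old_sha, new_sha):
    """Build the argument list for listing the same commits in a local checkout."""
    return ["git", "log", "--oneline", "%s..%s" % (_check_sha(old_sha), _check_sha(new_sha))]
```

For instance, compare_url("abc1234", "def5678") gives a link whose commit list can be scanned for changes to the failing job's config, and git_log_command returns arguments suitable for subprocess.run inside a test-infra checkout.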
The Velodrome monitoring dashboard is a useful tool for this.
It is important to monitor Tide after config changes are made for code slush, freeze and thaw to ensure that the changes are having the intended effect.
Until the CNCF infra migration is complete, a member of Google's gke-engprod team will need to monitor Tide logs. However, most of Tide's behavior can be monitored without access to the cluster. The Tide dashboard and Velodrome monitoring dashboard provide insight into what Tide is currently doing, how much load it is handling, and how it is performing.
The stability of our test infra is critical to getting reliable testing signals throughout the release cycle, but the signal is most important at the end of the release cycle during code slush and freeze. While the kubernetes/test-infra repo does not enforce additional merge restrictions related to the release cycle, we do try to limit the changes that are merged. Specifically, during slush and freeze, changes to test-infra should be limited to important fixes and work that doesn't impact critical infrastructure. Large changes should be delayed if possible.
In particular, bumping the kubekins-e2e images should be avoided unless a critical fix is necessary.