OSAC-854: add nightly vmaas snapshot build job#79377
Conversation
Add cluster-tool snapshot step and workflow that provisions a baremetal cluster via assisted-installer, installs OSAC, snapshots the cluster, and pushes the OCI image to quay.io. Runs nightly at 2am UTC.
|
@omer-vishlitzky: This pull request references OSAC-854 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis PR adds a new ChangesCluster Snapshot Workflow
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: omer-vishlitzky The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh`:
- Around line 50-55: The podman login call currently passes the password with -p
which exposes QUAY_PASS in the process list; change it to use --password-stdin
and pipe the password into podman login instead of using -p. Locate the podman
login invocation (the line using podman login --root ... "$(echo ${REGISTRY} |
cut -d/ -f1)" -u "${QUAY_USER}" -p "${QUAY_PASS}") and replace the -p usage by
supplying QUAY_PASS via stdin (keep the --root, registry host extraction, and -u
"${QUAY_USER}" as-is) so the password is not visible in process arguments.
- Around line 35-37: Download of the cluster-tool binary should include SHA256
verification: download to a temp path (instead of writing directly to
/usr/local/bin/cluster-tool), ensure CLUSTER_TOOL_SHA256 environment variable is
present, compute the SHA256 of the downloaded file (e.g., via sha256sum or
shasum -a 256), compare it to CLUSTER_TOOL_SHA256 and exit non‑zero on mismatch,
only move the verified file into /usr/local/bin/cluster-tool and then run chmod
+x on that path; update the curl -> temp file step and references to COMMIT and
/usr/local/bin/cluster-tool accordingly so the file is never executed or
installed unless the checksum matches.
- Around line 15-25: The script currently enables xtrace around the ssh
invocation and passes QUAY_PASS as a positional argument (QUAY_PASS and the
timeout/ssh invocation block), which risks leaking the password; change this by
saving the current xtrace state (e.g., store "$(set +x; false || true)" or check
$- for xtrace), then disable xtrace before reading/using QUAY_PASS and before
the ssh command that includes "${QUAY_PASS}", and finally restore the original
xtrace state immediately after; apply the identical save/disable/restore pattern
around the podman login invocation (the podman login block that uses QUAY_PASS)
so credentials are never printed to CI logs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 51bedd0b-c514-4e65-9a76-664080c5dcb6
⛔ Files ignored due to path filters (1)
ci-operator/jobs/osac-project/osac-test-infra/osac-project-osac-test-infra-main-periodics.yamlis excluded by!ci-operator/jobs/**
📒 Files selected for processing (8)
ci-operator/config/osac-project/osac-test-infra/osac-project-osac-test-infra-main.yamlci-operator/step-registry/osac-project/cluster-tool/snapshot-vmaas/OWNERSci-operator/step-registry/osac-project/cluster-tool/snapshot-vmaas/osac-project-cluster-tool-snapshot-vmaas-workflow.metadata.jsonci-operator/step-registry/osac-project/cluster-tool/snapshot-vmaas/osac-project-cluster-tool-snapshot-vmaas-workflow.yamlci-operator/step-registry/osac-project/cluster-tool/snapshot/OWNERSci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.shci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-ref.metadata.jsonci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-ref.yaml
| set +x | ||
| QUAY_PASS=$(cat /var/run/vault/osac-quay-creds/password) | ||
| set -x | ||
|
|
||
| timeout -s 9 90m ssh -F "${SHARED_DIR}/ssh_config" ci_machine bash -s \ | ||
| "${CLUSTER_TOOL_COMMIT}" \ | ||
| "${SNAPSHOT_REGISTRY}" \ | ||
| "${FLAVOR_NAME}" \ | ||
| "${QUAY_USER}" \ | ||
| "${QUAY_PASS}" \ | ||
| <<'REMOTE_EOF' |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
head -50 ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.shRepository: openshift/release
Length of output: 1570
🏁 Script executed:
tail -30 ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.shRepository: openshift/release
Length of output: 1221
🏁 Script executed:
# Let me verify the exact line numbers and check the context around the credential handling
wc -l ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.shRepository: openshift/release
Length of output: 172
🏁 Script executed:
cat -n ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.shRepository: openshift/release
Length of output: 2382
🏁 Script executed:
# Verify the fix syntax is correct by checking bash parameter expansion behavior
bash -c 'set -o nounset; was_xtrace=0; [[ "$-" == *x* ]] && was_xtrace=1; echo "was_xtrace=$was_xtrace"; set -x; echo "After set -x"; [[ "$-" == *x* ]] && was_xtrace=1; echo "was_xtrace=$was_xtrace"; (( was_xtrace )) && echo "Would re-enable xtrace"'Repository: openshift/release
Length of output: 272
Prevent Quay password leakage through xtrace during SSH invocation.
Line 17 enables xtrace, and the ssh command on lines 19–24 passes QUAY_PASS as an argument. With xtrace enabled, bash expands and outputs the full command including the password to CI logs before execution. Save the xtrace state before disabling it for credential operations, then restore it afterward, rather than always re-enabling.
Suggested fix
- set +x
- QUAY_PASS=$(cat /var/run/vault/osac-quay-creds/password)
- set -x
+was_xtrace=0
+[[ "$-" == *x* ]] && was_xtrace=1
+set +x
+QUAY_PASS=$(cat /var/run/vault/osac-quay-creds/password)
timeout -s 9 90m ssh -F "${SHARED_DIR}/ssh_config" ci_machine bash -s \
"${CLUSTER_TOOL_COMMIT}" \
"${SNAPSHOT_REGISTRY}" \
"${FLAVOR_NAME}" \
"${QUAY_USER}" \
"${QUAY_PASS}" \
<<'REMOTE_EOF'
@@
REMOTE_EOF
+
+(( was_xtrace )) && set -xPer coding guidelines, step registry command scripts should avoid accidental disclosure of passwords, tokens, API keys, and cluster URLs via xtrace. Apply the same pattern to the podman login invocation at lines 51–55.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh`
around lines 15 - 25, The script currently enables xtrace around the ssh
invocation and passes QUAY_PASS as a positional argument (QUAY_PASS and the
timeout/ssh invocation block), which risks leaking the password; change this by
saving the current xtrace state (e.g., store "$(set +x; false || true)" or check
$- for xtrace), then disable xtrace before reading/using QUAY_PASS and before
the ssh command that includes "${QUAY_PASS}", and finally restore the original
xtrace state immediately after; apply the identical save/disable/restore pattern
around the podman login invocation (the podman login block that uses QUAY_PASS)
so credentials are never printed to CI logs.
| curl -fsSL "https://raw.githubusercontent.com/omer-vishlitzky/cluster-tool/${COMMIT}/cluster-tool" \ | ||
| -o /usr/local/bin/cluster-tool | ||
| chmod +x /usr/local/bin/cluster-tool |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find the file
find . -name "osac-project-cluster-tool-snapshot-commands.sh" -type fRepository: openshift/release
Length of output: 171
🏁 Script executed:
#!/bin/bash
# Read the file to verify the exact content at lines 35-37
if [ -f "ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh" ]; then
echo "=== File content (full) ==="
wc -l ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh
echo ""
echo "=== Lines 30-45 for context ==="
sed -n '30,45p' ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh
fiRepository: openshift/release
Length of output: 899
🏁 Script executed:
#!/bin/bash
# Check for any environment variable definitions or setup related to CLUSTER_TOOL_SHA256
# Search across the directory structure for related references
rg -i "cluster_tool_sha256|CLUSTER_TOOL_SHA256" --type shRepository: openshift/release
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Search for the COMMIT variable definition and usage context
rg -i "COMMIT" ci-operator/step-registry/osac-project/cluster-tool/snapshot/ -A 2 -B 2Repository: openshift/release
Length of output: 3880
🏁 Script executed:
#!/bin/bash
# Check the ref.yaml file to see what env variables are defined
cat ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-ref.yamlRepository: openshift/release
Length of output: 976
🏁 Script executed:
#!/bin/bash
# Check if there are any related steps that might define CLUSTER_TOOL_SHA256 or similar patterns
rg "SHA256|sha256|checksum" ci-operator/step-registry/osac-project/cluster-tool/ --type shRepository: openshift/release
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Look at the full context of the commands.sh file to understand the overall flow
cat -n ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.shRepository: openshift/release
Length of output: 2382
Add integrity verification for downloaded cluster-tool binary.
Lines [35-37] download and execute a binary from a mutable GitHub ref (main by default) without checksum verification. This poses a supply-chain risk, particularly in a step with mounted credentials that could be compromised if the tool is hijacked.
Add SHA256 verification before execution:
Suggested hardening
curl -fsSL "https://raw.githubusercontent.com/omer-vishlitzky/cluster-tool/${COMMIT}/cluster-tool" \
-o /usr/local/bin/cluster-tool
+echo "${CLUSTER_TOOL_SHA256} /usr/local/bin/cluster-tool" | sha256sum -c -
chmod +x /usr/local/bin/cluster-toolNote: Requires defining CLUSTER_TOOL_SHA256 as an environment variable in the step reference.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh`
around lines 35 - 37, Download of the cluster-tool binary should include SHA256
verification: download to a temp path (instead of writing directly to
/usr/local/bin/cluster-tool), ensure CLUSTER_TOOL_SHA256 environment variable is
present, compute the SHA256 of the downloaded file (e.g., via sha256sum or
shasum -a 256), compare it to CLUSTER_TOOL_SHA256 and exit non‑zero on mismatch,
only move the verified file into /usr/local/bin/cluster-tool and then run chmod
+x on that path; update the curl -> temp file step and references to COMMIT and
/usr/local/bin/cluster-tool accordingly so the file is never executed or
installed unless the checksum matches.
| echo "=== Authenticating to registry ===" | ||
| set +x | ||
| podman login --root /home/cluster-tool/containers/storage \ | ||
| "$(echo ${REGISTRY} | cut -d/ -f1)" \ | ||
| -u "${QUAY_USER}" -p "${QUAY_PASS}" | ||
| set -x |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check if the file exists and examine lines around 50-55
if [ -f "ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh" ]; then
echo "=== File exists, showing lines 40-65 for context ==="
sed -n '40,65p' "ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh"
else
echo "File not found, checking what exists in the directory"
find . -path "*/osac-project/cluster-tool/snapshot/*-commands.sh" 2>/dev/null || echo "No matching files found"
fiRepository: openshift/release
Length of output: 1022
🌐 Web query:
podman login --password-stdin option documentation
💡 Result:
The --password-stdin option for the podman login command is used to provide a registry password via standard input (STDIN) rather than through command-line arguments or interactive prompts [1][2]. This approach is commonly used in automated scripts and CI/CD pipelines to enhance security by avoiding the exposure of passwords in process lists or shell history [1][3]. Usage Examples: 1. Using a pipe: echo
Citations:
- 1: https://docs.podman.io/en/v5.2.1/markdown/podman-login.1.html
- 2: https://man.archlinux.org/man/podman-login.1.en.txt
- 3: https://docs.podman.io/en/stable/markdown/podman-login.1.html
🏁 Script executed:
# Check if printf is available and test the approach
which printf
echo "Testing printf syntax for password piping:"
QUAY_PASS="test_password"
# Verify the printf syntax doesn't add newline
printf '%s' "${QUAY_PASS}" | wc -c
echo "Password length:"
echo -n "${QUAY_PASS}" | wc -cRepository: openshift/release
Length of output: 189
Use --password-stdin instead of passing password via -p flag.
The current approach on line 54 exposes the password in the process list, which can be accessed via process inspection. Use stdin-based authentication as per the coding guidelines requiring sensitive data to bypass process visibility.
Suggested fix
set +x
-podman login --root /home/cluster-tool/containers/storage \
- "$(echo ${REGISTRY} | cut -d/ -f1)" \
- -u "${QUAY_USER}" -p "${QUAY_PASS}"
+printf '%s' "${QUAY_PASS}" | podman login --root /home/cluster-tool/containers/storage \
+ "$(echo "${REGISTRY}" | cut -d/ -f1)" \
+ -u "${QUAY_USER}" --password-stdin
set -x🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/osac-project/cluster-tool/snapshot/osac-project-cluster-tool-snapshot-commands.sh`
around lines 50 - 55, The podman login call currently passes the password with
-p which exposes QUAY_PASS in the process list; change it to use
--password-stdin and pipe the password into podman login instead of using -p.
Locate the podman login invocation (the line using podman login --root ...
"$(echo ${REGISTRY} | cut -d/ -f1)" -u "${QUAY_USER}" -p "${QUAY_PASS}") and
replace the -p usage by supplying QUAY_PASS via stdin (keep the --root, registry
host extraction, and -u "${QUAY_USER}" as-is) so the password is not visible in
process arguments.
Pass the same SNO cluster specs (CNV, LVM, 56GB RAM, 2x200GB disks, 24 vCPUs) that the existing periodic E2E jobs use. Without this, assisted-installer would provision a cluster with wrong defaults.
0e0c663 to
9f38a20
Compare
|
[REHEARSALNOTIFIER]
Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
/pj-rehearse periodic-ci-osac-project-osac-test-infra-main-snapshot-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-osac-project-osac-test-infra-main-snapshot-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@omer-vishlitzky: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/pj-rehearse periodic-ci-osac-project-osac-test-infra-main-snapshot-vmaas |
|
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Summary
OSAC presubmit E2E jobs boot OpenShift clusters from pre-built snapshots using
cluster-tool, which brings
cluster boot time down from ~2 hours to ~6 minutes. Currently these snapshots
are built manually. This PR automates that with a nightly periodic job.
What this adds
osac-project-cluster-tool-snapshotstep-registry ref: installs cluster-toolon the provisioned machine, snapshots the running OSAC cluster, and pushes the
resulting OCI image to quay.io
osac-project-cluster-tool-snapshot-vmaasworkflow: chains the full pipeline —acquire baremetal machine, provision OCP via assisted-installer, install OSAC via
setup.sh, then snapshot and push
snapshot-vmaasperiodic job: runs the workflow nightly at 2am UTC fromosac-test-infra, keeping the snapshot current with the latest osac-installer
Flow
The pushed snapshot image is what all
e2e-vmaaspresubmit jobs pull viaCLUSTER_TOOL_FLAVOR_IMAGEto boot clusters in ~6 minutes.Test plan
/test snapshot-vmaasSummary
This PR updates OpenShift CI (openshift/release) configuration for the osac-project/osac-test-infra repository to add an automated nightly pipeline that builds and publishes VMAAS cluster snapshots used by OSAC E2E jobs.
What changed (practical effect)
New periodic test job: snapshot-vmaas (ci-operator/config/osac-project/osac-test-infra/osac-project-osac-test-infra-main.yaml)
New workflow: osac-project-cluster-tool-snapshot-vmaas (ci-operator/step-registry/osac-project/cluster-tool/snapshot-vmaas/osac-project-cluster-tool-snapshot-vmaas-workflow.yaml)
New reusable step-ref and commands to perform snapshot and push:
Ownership/metadata:
Impact / motivation
Testing notes
Robot feedback