Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -65,19 +65,30 @@ echo "Version: ${VERSION}, Stream: ${STREAM}"
echo "Release API: ${API_URL}"
echo "Automatic reverts enabled: ${ENABLE_PAYLOAD_REVERT}"

# Poll until all blocking jobs have finished (no Pending jobs remain).
# We can't wait for the payload to reach terminal state because this
# analysis job is itself part of the payload's verification jobs.
# Poll until blocking jobs finish OR the payload reaches a terminal state.
# The release controller can report jobs as Pending even after they complete
# in Prow, so also check the payload phase as a fallback. The phase can be
# briefly wrong at creation time (RC bug), so we require PHASE_GRACE_PERIOD
# to elapse before trusting a terminal phase.
MAX_WAIT=36000 # 10 hours in seconds
POLL_INTERVAL=300 # 5 minutes
PHASE_GRACE_PERIOD=3600 # 1 hour
ELAPSED=0
POLL_COUNT=0

# Extract payload creation time from the tag (e.g. 4.22.0-0.nightly-2026-02-25-152806)
PAYLOAD_TIMESTAMP=$(echo "${PAYLOAD_TAG}" | grep -oP '\d{4}-\d{2}-\d{2}-\d{6}$')
PAYLOAD_CREATED=$(date -u -d "${PAYLOAD_TIMESTAMP:0:10} ${PAYLOAD_TIMESTAMP:11:2}:${PAYLOAD_TIMESTAMP:13:2}:${PAYLOAD_TIMESTAMP:15:2}" +%s 2>/dev/null || echo "0")

PHASE_WAIT_START=$(date +%s)

echo ""
echo "=== Waiting for all blocking jobs to complete before analysis ==="
echo "=== Waiting for blocking jobs to complete (or payload to reach terminal state) ==="
echo "Polling every $((POLL_INTERVAL / 60)) minutes (timeout: $((MAX_WAIT / 3600)) hours)"
if [[ "${PAYLOAD_CREATED}" -gt 0 ]]; then
echo "Payload created at: $(date -u -d "@${PAYLOAD_CREATED}" '+%Y-%m-%d %H:%M:%S UTC' 2>/dev/null || echo "${PAYLOAD_TIMESTAMP}")"
echo "Terminal phase trusted after: $((PHASE_GRACE_PERIOD / 60)) minutes from creation"
fi
echo ""

while true; do
Expand All @@ -87,11 +98,12 @@ while true; do
FAILED=$(echo "${RELEASE_JSON}" | jq '[.results.blockingJobs // {} | to_entries[] | select(.value.state == "Failed")] | length')
SUCCEEDED=$(echo "${RELEASE_JSON}" | jq '[.results.blockingJobs // {} | to_entries[] | select(.value.state == "Succeeded")] | length')
TOTAL=$(echo "${RELEASE_JSON}" | jq '[.results.blockingJobs // {} | to_entries[]] | length')
PHASE=$(echo "${RELEASE_JSON}" | jq -r '.phase // "Unknown"')
# Guard against release controller race: jobs may briefly report as Succeeded before being dispatched
NOT_STARTED=$(echo "${RELEASE_JSON}" | jq '[.results.blockingJobs // {} | to_entries[] | select(.value.url == null or .value.url == "")] | length')

ELAPSED_MIN=$((ELAPSED / 60))
echo "[Poll #${POLL_COUNT} | ${ELAPSED_MIN}m elapsed] Blocking jobs: ${SUCCEEDED}/${TOTAL} succeeded, ${PENDING} pending, ${FAILED} failed"
echo "[Poll #${POLL_COUNT} | ${ELAPSED_MIN}m elapsed] Phase: ${PHASE} | Blocking jobs: ${SUCCEEDED}/${TOTAL} succeeded, ${PENDING} pending, ${FAILED} failed"

if [[ "${NOT_STARTED}" -gt 0 ]]; then
echo " ${NOT_STARTED} job(s) have no prow URL yet, waiting for them to be dispatched..."
Expand Down Expand Up @@ -125,13 +137,30 @@ _Agent: ${CLAUDE_MODEL}_"
fi
echo "All blocking jobs have completed. ${FAILED}/${TOTAL} failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then
NOW=$(date +%s)
PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))
if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then
REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))
echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."
else
echo ""
echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."
if [[ "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" ]]; then
echo "Payload was accepted. No analysis needed."
exit 0
fi
Comment on lines +140 to +155
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat Rejected as terminal without waiting for FAILED to converge.

This branch still requires FAILED > 0 before it breaks into analysis. If the release-controller phase has already advanced to Rejected while blockingJobs is still stale, the loop keeps polling until timeout, which is the failure mode this change is trying to avoid.

Suggested fix
-            if [[ "${FAILED}" -gt 0 ]]; then
+            if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then
                 echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
                 break
             elif [[ "${PHASE}" == "Accepted" ]]; then
                 echo "Payload was accepted. No analysis needed."
                 exit 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then
NOW=$(date +%s)
PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))
if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then
REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))
echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."
else
echo ""
echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."
if [[ "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" ]]; then
echo "Payload was accepted. No analysis needed."
exit 0
fi
elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then
NOW=$(date +%s)
PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))
if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then
REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))
echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."
else
echo ""
echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."
if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" ]]; then
echo "Payload was accepted. No analysis needed."
exit 0
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`
around lines 140 - 155, The branch treating PHASE "Accepted" or "Rejected"
currently only starts analysis when FAILED > 0, causing the loop to keep polling
if the release-controller advanced to "Rejected" while blockingJobs are stale;
modify the logic in the PHASE handling (inside the loop where PHASE,
PAYLOAD_CREATED, PHASE_GRACE_PERIOD, PAYLOAD_AGE, REMAINING are computed) to
treat "Rejected" as immediately terminal: if [[ "${PHASE}" == "Rejected" ]];
then print the terminal message about ${PENDING} and ${FAILED}/${TOTAL} and
immediately start analysis (i.e., break into the analysis path) regardless of
FAILED, while preserving the existing early-exit for "Accepted" (exit 0) and the
grace-period check for both phases.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not a single job shows up as failed, there's little reason to analyze. The changes below this will still analyze on exit if we do hit the timeout just in case there is something to look at.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stbenjam, that's a fair point — if the release controller shows Rejected but FAILED=0, the job data is likely stale and there isn't a concrete failure signal to analyze anyway. Kicking off analysis without any failed jobs to examine would mostly generate noise. And with the timeout path now breaking into analysis rather than exiting, there's still a safety net if the stale state never resolves. Makes sense, I'll withdraw the suggestion.


🧠 Learnings used
Learnt from: tthvo
Repo: openshift/release PR: 79214
File: ci-operator/step-registry/upi/install/aws/usgov/private/cluster/upi-install-aws-usgov-private-cluster-commands.sh:834-836
Timestamp: 2026-05-13T04:27:10.115Z
Learning: When reviewing shell scripts that invoke the AWS CLI, do not flag use of `--filter` as incorrect if the intended argument is `--filters` (plural). AWS CLI supports prefix matching for argument names, so `--filter` is an accepted alias for `--filters` for relevant `aws ec2 describe-*` commands (e.g., `describe-availability-zones`, `describe-instances`, `describe-subnets`). Both forms should be treated as equivalent.

Learnt from: ggiguash
Repo: openshift/release PR: 79349
File: ci-operator/step-registry/openshift/edge-tooling/lvms-ci/post/openshift-edge-tooling-lvms-ci-post-commands.sh:3-3
Timestamp: 2026-05-18T09:59:38.396Z
Learning: When reviewing step-registry shell command scripts in this repository for `set -x`, only treat `set -x` as a potential security issue if the script handles sensitive data. Flag when `set -x` could cause secrets/credentials/tokens/API keys/kubeconfig contents (or other sensitive values) to be printed to logs—for example, when the script reads sensitive env vars or files, constructs/uses authenticated requests with tokens, processes kubeconfig, or exports/echoes values that are secret. If the script performs only non-sensitive operations (e.g., building URLs and downloading public artifacts with curl without credentials or secret material), `set -x` may be used without special concern and should not be singled out.

fi
fi

ELAPSED=$((ELAPSED + POLL_INTERVAL))
if [[ ${ELAPSED} -ge ${MAX_WAIT} ]]; then
echo ""
echo "Timed out after $((MAX_WAIT / 3600)) hours waiting for blocking jobs to complete (${PENDING} still pending)."
exit 1
echo "ERROR: Timed out after $((MAX_WAIT / 3600)) hours waiting for blocking jobs to complete (${PENDING} still pending). Proceeding to analyze what we have..."
break
fi

echo " Next check in $((POLL_INTERVAL / 60)) minutes..."
Expand Down