payload agent: also proceed when payload reaches terminal state#79400
payload agent: also proceed when payload reaches terminal state#79400stbenjam wants to merge 2 commits into
Conversation
The release controller can leave blocking jobs stuck as Pending even after they complete in Prow. Previously, the agent waited for all blocking jobs to leave Pending state, which could hang indefinitely. Now also check the payload phase (Accepted/Rejected) as a signal to proceed. A 1-hour grace period from payload creation guards against a release controller bug where the phase can be briefly wrong at creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughA single shell script's release-controller polling loop was updated to track the payload's reported phase and conditionally trust terminal phases only after a configurable grace period from payload creation, with extended logging and early-exit handling based on phase and blocking-job state. ChangesClaude Payload Agent Phase Tracking
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: stbenjam The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`:
- Around line 140-155: The branch treating PHASE "Accepted" or "Rejected"
currently only starts analysis when FAILED > 0, causing the loop to keep polling
if the release-controller advanced to "Rejected" while blockingJobs are stale;
modify the logic in the PHASE handling (inside the loop where PHASE,
PAYLOAD_CREATED, PHASE_GRACE_PERIOD, PAYLOAD_AGE, REMAINING are computed) to
treat "Rejected" as immediately terminal: if [[ "${PHASE}" == "Rejected" ]];
then print the terminal message about ${PENDING} and ${FAILED}/${TOTAL} and
immediately start analysis (i.e., break into the analysis path) regardless of
FAILED, while preserving the existing early-exit for "Accepted" (exit 0) and the
grace-period check for both phases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 7513c6bc-b9d3-4a28-bb99-814f349ebbcf
📒 Files selected for processing (1)
ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh
| elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then | ||
| NOW=$(date +%s) | ||
| PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED )) | ||
| if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then | ||
| REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 )) | ||
| echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..." | ||
| else | ||
| echo "" | ||
| echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending." | ||
| if [[ "${FAILED}" -gt 0 ]]; then | ||
| echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..." | ||
| break | ||
| elif [[ "${PHASE}" == "Accepted" ]]; then | ||
| echo "Payload was accepted. No analysis needed." | ||
| exit 0 | ||
| fi |
There was a problem hiding this comment.
Treat Rejected as terminal without waiting for FAILED to converge.
This branch still requires FAILED > 0 before it breaks into analysis. If the release-controller phase has already advanced to Rejected while blockingJobs is still stale, the loop keeps polling until timeout, which is the failure mode this change is trying to avoid.
Suggested fix
- if [[ "${FAILED}" -gt 0 ]]; then
+ if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" ]]; then
echo "Payload was accepted. No analysis needed."
exit 0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then | |
| NOW=$(date +%s) | |
| PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED )) | |
| if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then | |
| REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 )) | |
| echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..." | |
| else | |
| echo "" | |
| echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending." | |
| if [[ "${FAILED}" -gt 0 ]]; then | |
| echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..." | |
| break | |
| elif [[ "${PHASE}" == "Accepted" ]]; then | |
| echo "Payload was accepted. No analysis needed." | |
| exit 0 | |
| fi | |
| elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then | |
| NOW=$(date +%s) | |
| PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED )) | |
| if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then | |
| REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 )) | |
| echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..." | |
| else | |
| echo "" | |
| echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending." | |
| if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then | |
| echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..." | |
| break | |
| elif [[ "${PHASE}" == "Accepted" ]]; then | |
| echo "Payload was accepted. No analysis needed." | |
| exit 0 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`
around lines 140 - 155, The branch treating PHASE "Accepted" or "Rejected"
currently only starts analysis when FAILED > 0, causing the loop to keep polling
if the release-controller advanced to "Rejected" while blockingJobs are stale;
modify the logic in the PHASE handling (inside the loop where PHASE,
PAYLOAD_CREATED, PHASE_GRACE_PERIOD, PAYLOAD_AGE, REMAINING are computed) to
treat "Rejected" as immediately terminal: if [[ "${PHASE}" == "Rejected" ]];
then print the terminal message about ${PENDING} and ${FAILED}/${TOTAL} and
immediately start analysis (i.e., break into the analysis path) regardless of
FAILED, while preserving the existing early-exit for "Accepted" (exit 0) and the
grace-period check for both phases.
There was a problem hiding this comment.
If not a single job shows up as failed, there's little reason to analyze. The changes below this will still analyze on exit if we do hit the timeout just in case there is something to look at.
There was a problem hiding this comment.
@stbenjam, that's a fair point — if the release controller shows Rejected but FAILED=0, the job data is likely stale and there isn't a concrete failure signal to analyze anyway. Kicking off analysis without any failed jobs to examine would mostly generate noise. And with the timeout path now breaking into analysis rather than exiting, there's still a safety net if the stale state never resolves. Makes sense, I'll withdraw the suggestion.
🧠 Learnings used
Learnt from: tthvo
Repo: openshift/release PR: 79214
File: ci-operator/step-registry/upi/install/aws/usgov/private/cluster/upi-install-aws-usgov-private-cluster-commands.sh:834-836
Timestamp: 2026-05-13T04:27:10.115Z
Learning: When reviewing shell scripts that invoke the AWS CLI, do not flag use of `--filter` as incorrect if the intended argument is `--filters` (plural). AWS CLI supports prefix matching for argument names, so `--filter` is an accepted alias for `--filters` for relevant `aws ec2 describe-*` commands (e.g., `describe-availability-zones`, `describe-instances`, `describe-subnets`). Both forms should be treated as equivalent.
Learnt from: ggiguash
Repo: openshift/release PR: 79349
File: ci-operator/step-registry/openshift/edge-tooling/lvms-ci/post/openshift-edge-tooling-lvms-ci-post-commands.sh:3-3
Timestamp: 2026-05-18T09:59:38.396Z
Learning: When reviewing step-registry shell command scripts in this repository for `set -x`, only treat `set -x` as a potential security issue if the script handles sensitive data. Flag when `set -x` could cause secrets/credentials/tokens/API keys/kubeconfig contents (or other sensitive values) to be printed to logs—for example, when the script reads sensitive env vars or files, constructs/uses authenticated requests with tokens, processes kubeconfig, or exports/echoes values that are secret. If the script performs only non-sensitive operations (e.g., building URLs and downloading public artifacts with curl without credentials or secret material), `set -x` may be used without special concern and should not be singled out.
|
[REHEARSALNOTIFIER] Note: If this PR includes changes to step registry files ( Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
@stbenjam: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Pendingeven after they complete in Prow, causing the payload agent to wait indefinitely (up to 10h timeout)AcceptedorRejected, proceed with analysis even if some jobs are stillPendingper the release controllerfetch_payloads.pyfix that cross-checks stale Pending jobs against Prow: fetch-payloads: resolve stale Pending jobs against Prow openshift-eng/ai-helpers#481Test plan
Rejectedwith 1 stale Pending job, well past the grace period🤖 Generated with Claude Code
Payload Agent: proceed when payload reaches terminal phase instead of waiting on stale Pending jobs
This change updates the payload agent used by OpenShift's release CI (claude payload agent step in openshift/release) so it no longer remains blocked waiting for jobs that the release controller left
Pendingafter they actually finished in Prow.Practical effect
phase. If the payload reaches a terminal phase (AcceptedorRejected), the agent will move forward with analysis rather than waiting for potentially stalePendingjob statuses reported by the release controller.Why this matters
Pendingstatuses when the release controller has already marked the payload terminal.Related
Infrastructure affected