Dnm hcp runs for 422#79362
Conversation
The rebalanceInfra function was restarting prometheus-k8s statefulset without configuring node placement, causing pods to randomly land on worker nodes instead of infra nodes. This led to OOM issues on workers as prometheus workload is resource-intensive. Root cause: Missing nodeSelector and tolerations configuration for prometheus before pod restart. Previously, topologySpreadConstraints helped ensure at least one prometheus pod landed on infra nodes (as described in RFE-5107), but topologySpreadConstraints is no longer present in the current prometheus-k8s StatefulSet. Without explicit nodeSelector and tolerations, prometheus pods schedule to workers. Changes: - Apply cluster-monitoring-config ConfigMap with nodeSelector and tolerations for prometheusK8s to explicitly target infra nodes (other monitoring components consume minimal resources and remain on workers) - Wait for cluster-monitoring-operator to reconcile the StatefulSet template spec before restarting pods (poll up to 5 minutes using jq to verify nodeSelector and tolerations are present in the spec) - Add inline verification after rollout to ensure prometheus pods actually land on infra nodes (12 retries over 2 minutes) - Fail fast with explicit error if StatefulSet reconciliation times out or pods don't schedule to infra nodes, preventing silent OOM failures on workers Related: https://redhat.atlassian.net/browse/RFE-5107 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughUpdates perfscale CI job definitions (switch CLUSTER_SECTOR to perf4, add cron on rosa_hcp-4.21, add GC on 4.22, remove three test refs) and enhances hypershift infra script to apply monitoring config, wait for reconciliation, restart prometheus-k8s, and verify pod placement on infra nodes. ChangesPerfscale CI Configuration Updates
Prometheus Infra Migration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Sandeepyadav93 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/uncc @mohit-sheth @rsevilla87 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`:
- Around line 107-121: The current check can set VERIFY_SUCCESS=true when no
prometheus pods were evaluated because ALL_ON_INFRA defaults to true; modify the
for node in $(oc get pods -n openshift-monitoring -o wide | grep -i
"prometheus-k8s-" | grep -i running | awk '{print$7}') loop to track whether any
pods were found (e.g., increment a POD_COUNT or set a FOUND_PROM_POD flag inside
the loop) and after the loop only set VERIFY_SUCCESS=true and break if BOTH
ALL_ON_INFRA is true and the pod count/FOUND_PROM_POD indicates at least one pod
was evaluated; ensure you update the conditional that currently checks [
"$ALL_ON_INFRA" = true ] to require the additional non-zero pod condition.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 2ff3fde4-3dc1-47ab-8104-03514ededc9e
⛔ Files ignored due to path filters (2)
ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-periodics.yamlis excluded by!ci-operator/jobs/**ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yamlis excluded by!ci-operator/jobs/**
📒 Files selected for processing (3)
ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__rosa_hcp-4.21-nightly-x86.yamlci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__rosa_hcp-4.22-nightly-x86.yamlci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh
| while [ $RETRY -lt $MAX_RETRIES ]; do | ||
| ALL_ON_INFRA=true | ||
| for node in $(oc get pods -n openshift-monitoring -o wide | grep -i "prometheus-k8s-" | grep -i running | awk '{print$7}'); do | ||
| if [[ $(oc get nodes --no-headers -l node-role.kubernetes.io/infra | awk '{print$1}' | grep -w "$node") != "" ]]; then | ||
| log "$(date) - prometheus pod on $node (infra node) ✓" | ||
| else | ||
| log "$(date) - WARNING: prometheus pod on $node is NOT an infra node" | ||
| ALL_ON_INFRA=false | ||
| fi | ||
| done | ||
|
|
||
| if [ "$ALL_ON_INFRA" = true ]; then | ||
| log "$(date) - All prometheus-k8s pods are on infra nodes ✓" | ||
| VERIFY_SUCCESS=true | ||
| break |
There was a problem hiding this comment.
Avoid false success when no Prometheus pods are matched.
Line 118 can mark verification successful even when zero pods were evaluated, because ALL_ON_INFRA starts as true and the loop may not run. This can incorrectly pass migration validation.
Proposed fix
while [ $RETRY -lt $MAX_RETRIES ]; do
ALL_ON_INFRA=true
- for node in $(oc get pods -n openshift-monitoring -o wide | grep -i "prometheus-k8s-" | grep -i running | awk '{print$7}'); do
+ mapfile -t PROM_NODES < <(oc get pods -n openshift-monitoring -o wide | awk '/prometheus-k8s-/ && tolower($3)=="running" {print $7}')
+
+ if [ "${`#PROM_NODES`[@]}" -eq 0 ]; then
+ log "$(date) - WARNING: no running prometheus-k8s pods found yet"
+ ALL_ON_INFRA=false
+ fi
+
+ for node in "${PROM_NODES[@]}"; do
if [[ $(oc get nodes --no-headers -l node-role.kubernetes.io/infra | awk '{print$1}' | grep -w "$node") != "" ]]; then
log "$(date) - prometheus pod on $node (infra node) ✓"
else
log "$(date) - WARNING: prometheus pod on $node is NOT an infra node"
ALL_ON_INFRA=false
fi
done📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| while [ $RETRY -lt $MAX_RETRIES ]; do | |
| ALL_ON_INFRA=true | |
| for node in $(oc get pods -n openshift-monitoring -o wide | grep -i "prometheus-k8s-" | grep -i running | awk '{print$7}'); do | |
| if [[ $(oc get nodes --no-headers -l node-role.kubernetes.io/infra | awk '{print$1}' | grep -w "$node") != "" ]]; then | |
| log "$(date) - prometheus pod on $node (infra node) ✓" | |
| else | |
| log "$(date) - WARNING: prometheus pod on $node is NOT an infra node" | |
| ALL_ON_INFRA=false | |
| fi | |
| done | |
| if [ "$ALL_ON_INFRA" = true ]; then | |
| log "$(date) - All prometheus-k8s pods are on infra nodes ✓" | |
| VERIFY_SUCCESS=true | |
| break | |
| while [ $RETRY -lt $MAX_RETRIES ]; do | |
| ALL_ON_INFRA=true | |
| mapfile -t PROM_NODES < <(oc get pods -n openshift-monitoring -o wide | awk '/prometheus-k8s-/ && tolower($3)=="running" {print $7}') | |
| if [ "${`#PROM_NODES`[@]}" -eq 0 ]; then | |
| log "$(date) - WARNING: no running prometheus-k8s pods found yet" | |
| ALL_ON_INFRA=false | |
| fi | |
| for node in "${PROM_NODES[@]}"; do | |
| if [[ $(oc get nodes --no-headers -l node-role.kubernetes.io/infra | awk '{print$1}' | grep -w "$node") != "" ]]; then | |
| log "$(date) - prometheus pod on $node (infra node) ✓" | |
| else | |
| log "$(date) - WARNING: prometheus pod on $node is NOT an infra node" | |
| ALL_ON_INFRA=false | |
| fi | |
| done | |
| if [ "$ALL_ON_INFRA" = true ]; then | |
| log "$(date) - All prometheus-k8s pods are on infra nodes ✓" | |
| VERIFY_SUCCESS=true | |
| break |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/openshift-qe/hypershift-infra/openshift-qe-hypershift-infra-commands.sh`
around lines 107 - 121, The current check can set VERIFY_SUCCESS=true when no
prometheus pods were evaluated because ALL_ON_INFRA defaults to true; modify the
for node in $(oc get pods -n openshift-monitoring -o wide | grep -i
"prometheus-k8s-" | grep -i running | awk '{print$7}') loop to track whether any
pods were found (e.g., increment a POD_COUNT or set a FOUND_PROM_POD flag inside
the loop) and after the loop only set VERIFY_SUCCESS=true and break if BOTH
ALL_ON_INFRA is true and the pod count/FOUND_PROM_POD indicates at least one pod
was evaluated; ensure you update the conditional that currently checks [
"$ALL_ON_INFRA" = true ] to require the additional non-zero pod condition.
2316b75 to
085ae44
Compare
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
085ae44 to
077b269
Compare
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-control-plane-120nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
[REHEARSALNOTIFIER]
A total of 48 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-control-plane-120nodes-onperfsector |
|
@Sandeepyadav93: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@Sandeepyadav93: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR updates OpenShift CI job definitions for ROSA HCP performance testing and enhances a Hypershift infra helper script used when rebalancing Prometheus onto infra nodes.
CI / job changes (ocp-qe-perfscale-ci)
Hypershift infra script changes (ci-operator step)
Other notes