Skip to content

SREP-4796: Allow CreateMustGather alert in conformance tests#31128

Open
dustman9000 wants to merge 3 commits intoopenshift:mainfrom
dustman9000:exclude-createmustgather-alert
Open

SREP-4796: Allow CreateMustGather alert in conformance tests#31128
dustman9000 wants to merge 3 commits intoopenshift:mainfrom
dustman9000:exclude-createmustgather-alert

Conversation

@dustman9000
Copy link
Copy Markdown
Member

@dustman9000 dustman9000 commented May 5, 2026

Summary

Add CreateMustGather to the AllowedAlertNames list so it does not fail the "shouldn't report any alerts in firing state" conformance test.

Why

CreateMustGather is a managed-cluster-config meta-alert that fires when any other alert has been active for 15 minutes. It triggers CAD (configuration-anomaly-detection) to collect a must-gather for debugging. On CI clusters, transient alerts during cluster install settling trigger CreateMustGather, which persists after the underlying alert resolves and fails conformance testing.

This has caused the alert to be reverted from MCC twice.

Context

Test plan

  • CI conformance jobs should no longer fail when CreateMustGather fires during cluster settling
  • The alert can then be re-added to MCC with an additional install-settling guard

Summary by CodeRabbit

  • New Features

    • Recognizes an additional alert type: CreateMustGather, expanding monitored alert coverage.
  • Bug Fixes / Behavior Change

    • Improved alert matching for skip decisions to use pattern-based matching, ensuring alerts are correctly identified against the allowed list.

CreateMustGather is a managed-cluster-config meta-alert that fires
when any other alert has been active for 15 minutes, triggering CAD
to collect a must-gather. It fires during cluster install settling
and fails the alerts-in-firing-state conformance test.

Jira: https://redhat.atlassian.net/browse/SREP-4796
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 5, 2026

@dustman9000: This pull request references SREP-4796 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add CreateMustGather to the AllowedAlertNames list so it does not fail the "shouldn't report any alerts in firing state" conformance test.

Why

CreateMustGather is a managed-cluster-config meta-alert that fires when any other alert has been active for 15 minutes. It triggers CAD (configuration-anomaly-detection) to collect a must-gather for debugging. On CI clusters, transient alerts during cluster install settling trigger CreateMustGather, which persists after the underlying alert resolves and fails conformance testing.

This has caused the alert to be reverted from MCC twice.

Context

Test plan

  • CI conformance jobs should no longer fail when CreateMustGather fires during cluster settling
  • The alert can then be re-added to MCC with an additional install-settling guard

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Walkthrough

Adds a regex-style allowed alert pattern CreateMustGather.* to pkg/monitortestlibrary/allowedalerts/types.go and changes isSkippedAlert in pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go to perform regex matching (anchored) against the allowed patterns instead of direct string equality.

Changes

Allowed-alert matching and skip decision

Layer / File(s) Summary
Data Shape
pkg/monitortestlibrary/allowedalerts/types.go
AllowedAlertNames gains a new pattern entry: CreateMustGather.* (comment: https://issues.redhat.com/browse/SREP-4796).
Core Matching Logic
pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go
isSkippedAlert now imports regexp and matches alerts by compiling ^<allowedName>$ and using MatchString instead of direct equality comparison.
Behavioral Impact
pkg/.../alerts.go + pkg/.../allowedalerts/types.go
Skip-decision now uses regex patterns from AllowedAlertNames, enabling pattern-based matches (e.g., prefixes) rather than only exact names.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main change: allowing CreateMustGather alert in conformance tests, matching the primary objective of adding it to AllowedAlertNames.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not contain any Ginkgo test definitions. Changes are limited to alert allowlisting configuration (types.go) and alert filtering logic (alerts.go). No test names added or modified.
Test Structure And Quality ✅ Passed Custom check requires reviewing Ginkgo test code. This PR modifies alert utility code but contains no Ginkgo test code, making the check not applicable.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. Changes are limited to alert allowlist configuration and regex matching logic in non-test code.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. The changes only modify an alert allowlist and update the matching logic in test infrastructure. The check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only test infrastructure and alert allowlists. No deployment manifests, operator code, controllers, or scheduling constraints present. Topology-aware scheduling check does not apply.
Ote Binary Stdout Contract ✅ Passed The PR changes do not introduce stdout writes at process level. Modifications add a regex pattern to data and use regexp.MatchString in test evaluation context, not process-level code.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. Changes are limited to test infrastructure utilities and allowed alert configuration. The check is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dustman9000
Once this PR has been reviewed and has the lgtm label, please assign dgoodwin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from p0lyn0mial and sjenning May 5, 2026 14:12
@openshift-merge-bot openshift-merge-bot Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 5, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@dustman9000
Copy link
Copy Markdown
Member Author

/retest

The CAD team plans to add a suffix to the alert name. Use a regex
pattern so the allowlist matches both current and future variants.

Jira: https://redhat.atlassian.net/browse/SREP-4796
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go`:
- Line 243: The loop in isSkippedAlert currently ignores the error from
regexp.MatchString which hides malformed regexes in AllowedAlertNames; change
isSkippedAlert to check the error returned by regexp.MatchString("^"+a+"$",
alertName) and fail fast by returning the error (i.e., change signature to
isSkippedAlert(...) (bool, error) or panic/log.Fatalf in test code), update all
callers accordingly to propagate or handle the error, and ensure any malformed
pattern causes an immediate test failure rather than being silently skipped.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 17d9042d-bd24-4629-aa13-473c77eed93a

📥 Commits

Reviewing files that changed from the base of the PR and between 88d7dc6 and 925b1f8.

📒 Files selected for processing (2)
  • pkg/monitortestlibrary/allowedalerts/types.go
  • pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/monitortestlibrary/allowedalerts/types.go

// Some alerts we always skip over in CI:
for _, a := range allowedalerts.AllowedAlertNames {
if a == alertName {
if matched, _ := regexp.MatchString("^"+a+"$", alertName); matched {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Verify ignored regexp errors in isSkippedAlert:"
rg -n --type=go 'regexp\.MatchString\(' pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go
rg -n --type=go 'if matched,\s*_\s*:=' pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go

echo
echo "Inspect AllowedAlertNames patterns to audit regex validity/intent:"
awk '/AllowedAlertNames/,/\}/ {print NR ":" $0}' pkg/monitortestlibrary/allowedalerts/types.go

Repository: openshift/origin

Length of output: 1156


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check imports in the alerts.go file
echo "Checking imports in alerts.go:"
head -20 pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go

echo
echo "Checking full isSkippedAlert function:"
sed -n '238,260p' pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go

Repository: openshift/origin

Length of output: 1577


Handle regex errors explicitly in isSkippedAlert.

Line 243 ignores regexp.MatchString errors, so an invalid allowlist pattern fails silently and can cause hard-to-debug invariant test failures. Since AllowedAlertNames contains regex patterns like "CreateMustGather.*", malformed patterns must be caught early. Please fail fast instead of dropping the error.

Proposed fix
 func isSkippedAlert(alertName string) bool {
 	// Some alerts we always skip over in CI:
 	for _, a := range allowedalerts.AllowedAlertNames {
-		if matched, _ := regexp.MatchString("^"+a+"$", alertName); matched {
+		pattern := "^" + a + "$"
+		matched, err := regexp.MatchString(pattern, alertName)
+		if err != nil {
+			panic(fmt.Sprintf("invalid allowed alert regex %q: %v", pattern, err))
+		}
+		if matched {
 			return true
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if matched, _ := regexp.MatchString("^"+a+"$", alertName); matched {
pattern := "^" + a + "$"
matched, err := regexp.MatchString(pattern, alertName)
if err != nil {
panic(fmt.Sprintf("invalid allowed alert regex %q: %v", pattern, err))
}
if matched {
return true
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/testframework/legacytestframeworkmonitortests/alerts.go` at
line 243, The loop in isSkippedAlert currently ignores the error from
regexp.MatchString which hides malformed regexes in AllowedAlertNames; change
isSkippedAlert to check the error returned by regexp.MatchString("^"+a+"$",
alertName) and fail fast by returning the error (i.e., change signature to
isSkippedAlert(...) (bool, error) or panic/log.Fatalf in test code), update all
callers accordingly to propagate or handle the error, and ensure any malformed
pattern causes an immediate test failure rather than being silently skipped.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

@dustman9000: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images 925b1f8 link true /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants