OBSINTA-1219: add e2e tests for UIPlugin incident detection #1038
DavidRajnoha wants to merge 1 commit into rhobs:main from
Conversation
Hi @DavidRajnoha. Thanks for your PR. I'm waiting for a rhobs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed 89ac678 to 0768ae2
/ok-to-test

/retest
@DavidRajnoha: This pull request references OBSINTA-1219 which is a valid Jira issue. Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed 970ac8c to 90b891c
/hold

/test observability-operator-e2e

/retest
Force-pushed 90b891c to 23c1e1d
/unhold

Thank you!
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: DavidRajnoha, tremes

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
simonpasquier left a comment
I've checked the test output and it adds 3m30s to the total duration, which is not negligible. Do we really need to deploy a crashlooping pod? Can't we write a "dummy" alerting rule with a static expression (`vector(1)`) and hardcoded labels?
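A static rule along those lines could look like the sketch below. The rule name, alert name, and namespace here are illustrative placeholders, not values from this PR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: e2e-dummy-rule            # placeholder name
  namespace: openshift-monitoring # placeholder namespace
spec:
  groups:
    - name: e2e-dummy
      rules:
        - alert: E2EDummyAlwaysFiring # placeholder alert name
          expr: vector(1)             # static expression: fires immediately, no crashlooping pod needed
          labels:
            severity: warning
          annotations:
            summary: Always-firing alert for e2e incident detection
```

Because `vector(1)` always returns a sample, the alert enters the firing state as soon as Prometheus evaluates the group, which avoids waiting for kube-state-metrics to observe a CrashLoopBackOff.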
t.Log("Waiting for health-analyzer deployment to become ready...")
haDeployment := appsv1.Deployment{}
f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment)
(nit) do we really need to get the resource since we check that it's ready just after?
GetResourceWithRetry has a shorter timeout and provides an explicit error message when the deployment is never created.
Expr: intstr.FromString(fmt.Sprintf(
	`max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="%s", pod=~"%s.*", job="kube-state-metrics"}[5m]) >= 1`,
	e2eTestNamespace, podPrefix)),
For: ptr.To(monv1.Duration("1m")),
Removing the `for` clause would speed up the test.
	return *p
}
func skipIfClusterVersionBelow(t *testing.T, minVersion string) {
(nit) could it be a framework function?
	t.Log("=== END DEBUG DUMP ===")
}

func ptrInt32(p *int32) int32 {
(nit) ptr.To() already exists
	}
}

func dumpClusterHealthAnalyzerDebug(t *testing.T, pluginName string) {
Eventually this should be generalized for reuse across different tests.
assert.NilError(t, err, "alert %s never fired", alertName)

t.Log("Waiting for cluster-health-analyzer to expose incident metric...")
incidentQuery := fmt.Sprintf(`cluster_health_components_map{src_alertname="%s"}`, alertName)
apart from checking the metric generated by the cluster health analyzer, is there any other outcome that we should verify?
	return fmt.Errorf("expected incident metric, got: %v", v)
}
for _, sample := range vec {
	if string(sample.Metric["src_alertname"]) != alertName {
Already checked by the PromQL expression.
		return fmt.Errorf("expected src_alertname=%s, got %s", alertName, sample.Metric["src_alertname"])
	}
	if string(sample.Metric["src_severity"]) != "warning" {
		return fmt.Errorf("expected src_severity=warning, got %s", sample.Metric["src_severity"])
Could also be checked in the PromQL expression.
New changes are detected. LGTM label has been removed.

/hold
📝 Walkthrough

Adds a configurable PromQL assertion helper, framework utilities to skip tests by cluster version and dump namespace diagnostics, and a new E2E test that deploys an always-firing PrometheusRule to verify the cluster health analyzer produces an incident metric. Also renames a UIPlugin deletion wait helper and wires the new test into the UIPlugin test table.
Sequence Diagram(s)

sequenceDiagram
participant TestRunner as Test Runner
participant Framework as E2E Framework
participant K8s as Kubernetes API
participant Prom as Prometheus
participant Analyzer as Health Analyzer
TestRunner->>Framework: clusterHealthAnalyzer()
Framework->>K8s: Get ClusterVersion (skip if < minVersion)
Framework->>K8s: Ensure health-analyzer Deployment ready
TestRunner->>K8s: Ensure monitoring UIPlugin exists/enabled
TestRunner->>K8s: Create PrometheusRule (always-firing)
loop Poll until alert fires
TestRunner->>Prom: Query ALERTS for rule (via AssertPromQLResultWithOptions)
Prom-->>TestRunner: alert state
end
loop Poll until incident metric appears
TestRunner->>Prom: Query cluster_health_components_map by alert
Prom-->>TestRunner: incident metric
end
TestRunner->>K8s: Cleanup PrometheusRule & resources
alt on failure
TestRunner->>Framework: DumpNamespaceDebug(...) and emit diagnostics
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

⚠️ There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)
Error: can't load config: unsupported version of the configuration: ""
See https://golangci-lint.run/docs/product/migration-guide for migration instructions
/test observability-operator-e2e
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/e2e/uiplugin_cluster_health_analyzer_test.go (2)
101-109: Consider checking the Delete error for pre-existing UIPlugin.

While rare, the `Delete` call on line 105 could fail for reasons other than the resource being already gone. Logging or checking this error would improve debuggability.

Proposed fix:

  if err == nil {
  	t.Log("UIPlugin 'monitoring' already exists, deleting before recreation...")
- 	f.K8sClient.Delete(t.Context(), existing)
+ 	if err := f.K8sClient.Delete(t.Context(), existing); err != nil && !errors.IsNotFound(err) {
+ 		t.Logf("Warning: failed to delete existing UIPlugin: %v", err)
+ 	}
  	waitForUIPluginDeletion(existing)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 101 - 109, The Delete call on the pre-existing UIPlugin is ignored; update the block that finds an existing UIPlugin (variable existing) to capture and handle the error returned from f.K8sClient.Delete(t.Context(), existing) — after calling Delete check the returned error and either t.Logf the failure with the error details or t.Fatalf if deletion must succeed; keep waitForUIPluginDeletion(existing) only after a successful delete, and reference the existing variable and f.K8sClient.Delete to locate the change.
43-45: `GetResourceWithRetry` is redundant before `AssertDeploymentReady`.

`AssertDeploymentReady` already polls until the deployment exists and is ready, so the prior `GetResourceWithRetry` call adds unnecessary overhead without additional guarantees.

Proposed simplification:

  t.Log("Waiting for health-analyzer deployment to become ready...")
- haDeployment := appsv1.Deployment{}
- f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment)
  f.AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS, framework.WithTimeout(5*time.Minute))(t)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 43 - 45, Remove the redundant call to GetResourceWithRetry: the test currently calls f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment) before f.AssertDeploymentReady(...), but AssertDeploymentReady already waits for the deployment to exist and be ready; delete the GetResourceWithRetry invocation and keep AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS, framework.WithTimeout(5*time.Minute))(t); if you actually need the deployment object (haDeployment) after readiness, call a single non-retrying getter (e.g., GetResource) or populate haDeployment after AssertDeploymentReady returns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Line 30: monv1.AddToScheme currently ignores its returned error; update the
call to capture and handle the error (e.g., err :=
monv1.AddToScheme(f.K8sClient.Scheme()); if err != nil { t.Fatalf("failed to add
monv1 to scheme: %v", err) }) so the test fails loudly on scheme registration
problems; refer to the monv1.AddToScheme call and f.K8sClient.Scheme() to locate
where to add the error check and use the test context (t) to report the failure.
---
Nitpick comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 101-109: The Delete call on the pre-existing UIPlugin is ignored;
update the block that finds an existing UIPlugin (variable existing) to capture
and handle the error returned from f.K8sClient.Delete(t.Context(), existing) —
after calling Delete check the returned error and either t.Logf the failure with
the error details or t.Fatalf if deletion must succeed; keep
waitForUIPluginDeletion(existing) only after a successful delete, and reference
the existing variable and f.K8sClient.Delete to locate the change.
- Around line 43-45: Remove the redundant call to GetResourceWithRetry: the test
currently calls f.GetResourceWithRetry(t, healthAnalyzerDeploymentName,
uiPluginInstallNS, &haDeployment) before f.AssertDeploymentReady(...), but
AssertDeploymentReady already waits for the deployment to exist and be ready;
delete the GetResourceWithRetry invocation and keep
AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS,
framework.WithTimeout(5*time.Minute))(t); if you actually need the deployment
object (haDeployment) after readiness, call a single non-retrying getter (e.g.,
GetResource) or populate haDeployment after AssertDeploymentReady returns.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 47e2c224-67bf-4d35-b07e-a68060e8765c
📒 Files selected for processing (4)

- test/e2e/framework/assertions.go
- test/e2e/framework/framework.go
- test/e2e/uiplugin_cluster_health_analyzer_test.go
- test/e2e/uiplugin_test.go
Force-pushed bfb1cf6 to 7dc043f
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/e2e/uiplugin_cluster_health_analyzer_test.go (1)
48-50: Use a collision-resistant suffix for test resource names.

`UnixNano()%100000` gives only 100k possibilities; parallel/retried CI runs can collide and fail with `AlreadyExists`. Prefer a UUID or full nanoseconds.

Proposed fix:

- suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)
+ suffix := strconv.FormatInt(time.Now().UnixNano(), 10)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 48 - 50, The test uses a short, collision-prone suffix (suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)) to build ruleName and alertName; replace that suffix with a collision-resistant value (e.g., uuid.New().String() or the full time.Now().UnixNano() without modulus) and update imports accordingly so ruleName ("e2e-health-analyzer-"+suffix) and alertName ("E2EHealthAnalyzer"+suffix) use the new suffix generation to avoid AlreadyExists failures in parallel CI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 106-107: The Delete calls like f.K8sClient.Delete(t.Context(),
existing) currently ignore returned errors which can mask API failures; update
each deletion (the calls that precede waitForUIPluginDeletion(existing) and the
similar calls later) to capture the error (e.g., err :=
f.K8sClient.Delete(t.Context(), existing)) and fail the test immediately on
error (use t.Fatalf or your test helper like require.NoError(t, err)) before
calling waitForUIPluginDeletion(existing) so the real delete failure is reported
instead of timing out waiting for deletion.
- Around line 102-116: Currently the test permanently deletes a pre-existing
UIPlugin named "monitoring" and only cleans up the recreated plugin; instead,
detect the existing resource (existing from f.K8sClient.Get), make a DeepCopy or
clone of it into a variable like originalPlugin, then proceed to delete it to
allow recreation but register a cleanup that will restore originalPlugin (using
f.K8sClient.Create or Update) after the test finishes; also keep the existing
cleanup for the newly created plugin (plugin) and ensure waitForUIPluginDeletion
is called as needed and errors from restore/create are handled and logged.
---
Nitpick comments:
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 48-50: The test uses a short, collision-prone suffix (suffix :=
strconv.FormatInt(time.Now().UnixNano()%100000, 10)) to build ruleName and
alertName; replace that suffix with a collision-resistant value (e.g.,
uuid.New().String() or the full time.Now().UnixNano() without modulus) and
update imports accordingly so ruleName ("e2e-health-analyzer-"+suffix) and
alertName ("E2EHealthAnalyzer"+suffix) use the new suffix generation to avoid
AlreadyExists failures in parallel CI.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 370d48ba-2d12-4849-b8bd-1d792776f68f
📒 Files selected for processing (2)

- test/e2e/framework/framework.go
- test/e2e/uiplugin_cluster_health_analyzer_test.go
✅ Files skipped from review due to trivial changes (1)
- test/e2e/framework/framework.go
/retest

Force-pushed 25f4b52 to a0245db

@simonpasquier I updated the test to use a static rule. For reference, here is the current timing breakdown of the test:
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/framework/framework.go`:
- Around line 297-302: The DumpNamespaceDebug method uses t.Context() which is
cancelled before t.Cleanup runs; replace the cancelled context with a
non-cancelled one (e.g. use context.Background() or a background context with a
timeout) so the List calls inside DumpNamespaceDebug succeed during cleanup.
Update the ctx variable in Framework.DumpNamespaceDebug (replace ctx :=
t.Context()) to use a persistent background context and ensure the context
package is imported; keep all List calls (deployments, pods, events) using this
new ctx.
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 42-46: The cleanup callbacks and the debug helper are using
t.Context() which is already canceled when t.Cleanup runs; replace usages of
t.Context() in the t.Cleanup closures and inside dumpClusterHealthAnalyzerDebug
by creating an uncancelable context first (ctx :=
context.WithoutCancel(t.Context())) and pass that ctx into any API calls
(delete/get/list) and into dumpClusterHealthAnalyzerDebug instead of
t.Context(); update the anonymous t.Cleanup funcs and the
dumpClusterHealthAnalyzerDebug function to accept/use the new ctx so
cleanup-time API calls are not given a canceled context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 99ba088d-c938-4352-9e06-056a58ac5d82
📒 Files selected for processing (4)

- test/e2e/framework/assertions.go
- test/e2e/framework/framework.go
- test/e2e/uiplugin_cluster_health_analyzer_test.go
- test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- test/e2e/uiplugin_test.go
- test/e2e/framework/assertions.go
// DumpNamespaceDebug logs deployments (with conditions), pods (with container
// statuses), and events for the given namespace. Useful as a t.Cleanup or
// on-failure diagnostic helper.
func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
	t.Helper()
	ctx := t.Context()
🧩 Analysis chain
🏁 Script executed:
cat -n test/e2e/framework/framework.go | sed -n '297,354p'

Repository: rhobs/observability-operator
Length of output: 2987
🏁 Script executed:
head -20 test/e2e/framework/framework.go

Repository: rhobs/observability-operator
Length of output: 391
🌐 Web query:
Go testing.T.Context() cleanup callbacks behavior when context is canceled
💡 Result:
In Go 1.24 and later, testing.T.Context returns a context.Context that is canceled just before any T.Cleanup-registered functions are called. This allows cleanup functions to wait for resources (e.g., goroutines) that shut down when listening on ctx.Done() before the test or subtest completes. When the context is canceled (ctx.Done() is closed):

- Cleanup callbacks registered via T.Cleanup are executed after the cancellation.
- The context error (ctx.Err()) is context.Canceled (not DeadlineExceeded, unless a test timeout interacts).
- Goroutines or resources watching <-ctx.Done() can perform shutdown before cleanups run, ensuring proper ordering.

Example usage:

func TestFoo(t *testing.T) {
	ctx := t.Context()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		<-ctx.Done() // Detects cancellation
		// Cleanup resources
		wg.Done()
	}()
	t.Cleanup(wg.Wait) // Waits after ctx is canceled
}

This behavior was finalized in proposal golang/go#36532 (accepted Aug 2024, Go 1.24). Note: Do not use t.Context() inside Cleanup functions, as it may lead to unexpected cancellation of cleanup logic itself (observed in practice). Use context.Background() or manual contexts for cleanup internals.
Citations:
- 1: testing: reconsider adding Context method to testing.T golang/go#36532
- 2: https://pkg.go.dev/testing
- 3: https://pkg.go.dev/testing@go1.25.6
- 4: https://pkg.go.dev/testing/
- 5: https://pkg.go.dev/testing@go1.24.3
- 6: https://pkg.go.dev/testing@go1.24.2
🏁 Script executed:
rg "DumpNamespaceDebug" --type go -B 2 -A 5

Repository: rhobs/observability-operator
Length of output: 1156
🏁 Script executed:
# Check Go version requirement in go.mod
cat go.mod | head -5

Repository: rhobs/observability-operator
Length of output: 140
🏁 Script executed:
rg "t\.Cleanup" test/e2e/ -B 3 -A 3 --type go

Repository: rhobs/observability-operator
Length of output: 1524
🏁 Script executed:
cat -n test/e2e/uiplugin_cluster_health_analyzer_test.go | sed -n '1,50p'

Repository: rhobs/observability-operator
Length of output: 1983
🏁 Script executed:
# Check if context.WithoutCancel is available in Go 1.25
python3 <<'EOF'
# context.WithoutCancel was introduced in Go 1.21
# The project uses Go 1.25.5, so it's definitely available
print("context.WithoutCancel is available in Go 1.21+")
print("Project requires Go 1.25.5, so the API is available.")
EOF

Repository: rhobs/observability-operator
Length of output: 171
🏁 Script executed:
# Verify if the imports include context
rg "import.*context" test/e2e/framework/framework.go

Repository: rhobs/observability-operator
Length of output: 54
🏁 Script executed:
# Check the imports in framework.go to see if context is imported
head -30 test/e2e/framework/framework.go | grep -E "^import|context"

Repository: rhobs/observability-operator
Length of output: 91
Use a non-cancelled context in DumpNamespaceDebug.
This helper is documented for t.Cleanup, but testing.T.Context() is cancelled before cleanup callbacks run (Go 1.24+). When called from cleanup, the List calls degrade to context canceled, so the namespace dump fails exactly when diagnostic output is needed most.
Suggested fix
func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
t.Helper()
- ctx := t.Context()
+ ctx := context.WithoutCancel(t.Context())

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// DumpNamespaceDebug logs deployments (with conditions), pods (with container
// statuses), and events for the given namespace. Useful as a t.Cleanup or
// on-failure diagnostic helper.
func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
	t.Helper()
	ctx := context.WithoutCancel(t.Context())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/framework/framework.go` around lines 297 - 302, The
DumpNamespaceDebug method uses t.Context() which is cancelled before t.Cleanup
runs; replace the cancelled context with a non-cancelled one (e.g. use
context.Background() or a background context with a timeout) so the List calls
inside DumpNamespaceDebug succeed during cleanup. Update the ctx variable in
Framework.DumpNamespaceDebug (replace ctx := t.Context()) to use a persistent
background context and ensure the context package is imported; keep all List
calls (deployments, pods, events) using this new ctx.
♻️ Duplicate comments (1)
test/e2e/framework/framework.go (1)
300-303: ⚠️ Potential issue | 🟠 Major — Use a non-cancelled context in cleanup diagnostics (already reported).

Line 302 uses `t.Context()`. When this helper is invoked from `t.Cleanup`, list calls can fail with "context canceled", which defeats failure-time diagnostics.

Suggested fix:

  func (f *Framework) DumpNamespaceDebug(t *testing.T, namespace string) {
  	t.Helper()
- 	ctx := t.Context()
+ 	ctx := context.WithoutCancel(t.Context())

In Go 1.24+ testing package behavior: is testing.T.Context canceled before T.Cleanup callbacks run?

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/framework/framework.go` around lines 300 - 303, The helper DumpNamespaceDebug currently grabs ctx from t.Context() which can be canceled when run inside t.Cleanup; change it to use a non-cancelled context (e.g., ctx := context.Background() or context.TODO()) so diagnostic list calls in DumpNamespaceDebug won’t fail with “context canceled”; update the function (DumpNamespaceDebug) to import context if needed and replace the t.Context() usage with the background context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/framework/framework.go`:
- Around line 300-303: The helper DumpNamespaceDebug currently grabs ctx from
t.Context() which can be canceled when run inside t.Cleanup; change it to use a
non-cancelled context (e.g., ctx := context.Background() or context.TODO()) so
diagnostic list calls in DumpNamespaceDebug won’t fail with “context canceled”;
update the function (DumpNamespaceDebug) to import context if needed and replace
the t.Context() usage with the background context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 566b5429-753e-4378-bcfa-22e74ddb0c0d
📒 Files selected for processing (4)
- test/e2e/framework/assertions.go
- test/e2e/framework/framework.go
- test/e2e/uiplugin_cluster_health_analyzer_test.go
- test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- test/e2e/uiplugin_test.go
- test/e2e/framework/assertions.go
- test/e2e/uiplugin_cluster_health_analyzer_test.go
Validate that the Monitoring UIPlugin with ClusterHealthAnalyzer enabled deploys the health-analyzer and correctly processes alerts into incident metrics. The test creates a crashing pod, waits for the corresponding PrometheusRule alert to fire, then verifies the cluster_health_components_map metric is exposed with the expected labels.

Also extends the framework's AssertPromQLResult with configurable timeout and poll interval options (AssertPromQLResultWithOptions).

Made-with: Cursor
Force-pushed a0245db to f5b49a6
🧹 Nitpick comments (2)
test/e2e/framework/framework.go (1)
262-295: Minor: the `-0` suffix on canonical versions may skew comparisons for pre-releases.

`semver.Canonical("v4.19")` returns `"v4.19.0"`, so appending `-0` produces `"v4.19.0-0"` (a pre-release lower than `v4.19.0`). For plain release versions on both sides this is harmless since the suffix cancels out, but for a cluster on something like `v4.19.0-rc.1` the result becomes `"v4.19.0-rc.1-0"`, which compares below `"v4.19.0-0"` and would cause release candidates of the minimum version to be treated as below minimum. If that's intentional (require GA), a comment explaining the trick would help; otherwise drop the `-0` suffix.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/e2e/framework/framework.go` around lines 262 - 295, The code in SkipIfClusterVersionBelow builds canonicalActual and canonicalMin by calling semver.Canonical(...) then appending "-0", which incorrectly downgrades pre-release versions (e.g., v4.19.0-rc.1 becomes v4.19.0-rc.1-0) and can cause RCs to be treated as older than the minimum; remove the "-0" suffix so canonicalActual := semver.Canonical(actual) and canonicalMin := semver.Canonical(minVersion) (or, if the intent is to require GA only, instead add a clear comment near the semver.Canonical calls explaining the deliberate "-0" trick).

test/e2e/uiplugin_cluster_health_analyzer_test.go (1)
49-51: Low-entropy suffix can collide across reruns.

`time.Now().UnixNano()%100000` wraps every ~100µs and can collide with a leftover PrometheusRule/alert from a recent rerun if cleanup failed. Consider a stronger unique suffix (full nanos, or `rand.Int63()`) to avoid "AlreadyExists" flakes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 49 - 51, The current low-entropy suffix generation (suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)) can collide; replace it with a higher-entropy value (e.g., use full nanos or rand.Int63()) and update the variables that use it (suffix, ruleName, alertName). Specifically, change the suffix assignment to something like strconv.FormatInt(rand.Int63(), 10) or strconv.FormatInt(time.Now().UnixNano(), 10), ensure math/rand is imported and seeded appropriately in the test setup, and leave ruleName := "e2e-health-analyzer-"+suffix and alertName := "E2EHealthAnalyzer"+suffix unchanged so they get the stronger unique suffix.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/e2e/framework/framework.go`:
- Around line 262-295: The code in SkipIfClusterVersionBelow builds
canonicalActual and canonicalMin by calling semver.Canonical(...) then appending
"-0", which incorrectly downgrades pre-release versions (e.g., v4.19.0-rc.1
becomes v4.19.0-rc.1-0) and can cause RCs to be treated as older than the
minimum; remove the "-0" suffix so canonicalActual := semver.Canonical(actual)
and canonicalMin := semver.Canonical(minVersion) (or, if the intent is to
require GA only, instead add a clear comment near the semver.Canonical calls
explaining the deliberate "-0" trick).
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go`:
- Around line 49-51: The current low-entropy suffix generation (suffix :=
strconv.FormatInt(time.Now().UnixNano()%100000, 10)) can collide; replace it
with a higher-entropy value (e.g., use full nanos or rand.Int63()) and update
the variables that use it (suffix, ruleName, alertName). Specifically, change
the suffix assignment to something like strconv.FormatInt(rand.Int63(), 10) or
strconv.FormatInt(time.Now().UnixNano(), 10), ensure math/rand is imported and
seeded appropriately in the test setup, and leave ruleName :=
"e2e-health-analyzer-"+suffix and alertName := "E2EHealthAnalyzer"+suffix
unchanged so they get the stronger unique suffix.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: ffbb481e-3ce6-4f90-9884-e0f3363dc6c8
📒 Files selected for processing (4)
- test/e2e/framework/assertions.go
- test/e2e/framework/framework.go
- test/e2e/uiplugin_cluster_health_analyzer_test.go
- test/e2e/uiplugin_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/uiplugin_test.go
/unhold

/retest
Actionable comments posted: 2
🧹 Nitpick comments (2)
test/e2e/uiplugin_cluster_health_analyzer_test.go (2)
38-42: Cleanup ordering note.
`f.CleanUp` calls inside `newAlwaysFiringRule` (registered after this block) will run before this dump, so the rule is gone by the time diagnostics are captured. If the rule state matters for debugging, register the dump cleanup after rule creation, or capture rule status inside the dump helper before resource deletion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 38 - 42, The current t.Cleanup that calls dumpClusterHealthAnalyzerDebug runs before cleanup registered inside newAlwaysFiringRule, so the rule may be deleted before diagnostics are captured; to fix, either move this t.Cleanup registration to after newAlwaysFiringRule is called (so the dump runs last) or modify dumpClusterHealthAnalyzerDebug to first capture and store the rule's status/state (e.g., query the rule created by newAlwaysFiringRule using plugin.Name) before other cleanups run; update references in the test to ensure newAlwaysFiringRule, dumpClusterHealthAnalyzerDebug, and the t.Cleanup call order reflect this change.
59-85: Long timeouts (10m + 15m) dominate test duration. The PR description reports the alert firing in ~32s and the incident metric appearing in ~32s. The 10/15 minute timeouts are roughly 20x the observed latency and will slow failure feedback substantially. Consider tightening to something like 3–5 minutes to fail fast while keeping headroom.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 59 - 85, The test uses f.AssertPromQLResultWithOptions for alertQuery and incidentQuery with overly long timeouts (framework.WithTimeout(10*time.Minute) and framework.WithTimeout(15*time.Minute)) and a 30s poll interval; tighten these to fail faster while allowing headroom—update the two calls to use shorter timeouts (e.g., framework.WithTimeout(5*time.Minute) for both) and reduce framework.WithPollInterval from 30*time.Second to something like 10*time.Second so AssertPromQLResultWithOptions (used with alertQuery, incidentQuery, and alertName) returns quicker on failures.
```go
if !strings.HasPrefix(actual, "v") {
	actual = "v" + actual
}
if !strings.HasPrefix(minVersion, "v") {
	minVersion = "v" + minVersion
}

canonicalActual := fmt.Sprintf("%s-0", semver.Canonical(actual))
canonicalMin := fmt.Sprintf("%s-0", semver.Canonical(minVersion))

if semver.Compare(canonicalActual, canonicalMin) < 0 {
	t.Skipf("Skipping: cluster version %s is below minimum required %s", cv.Status.Desired.Version, minVersion)
}
```
Non-semver cluster versions silently pass the gate.
semver.Canonical returns "" for inputs it cannot parse (e.g., OpenShift pre-release strings like 4.19.0-ec.2 normalize fine, but 4.19.0-0.nightly-2025-... does not canonicalize). When that happens, canonicalActual becomes "-0" and the comparison yields a stable but meaningless result: the skip may fire or not regardless of the actual cluster version. Consider logging or skipping explicitly when semver.Canonical returns an empty string so the intent is clear.
```diff
 canonicalActual := fmt.Sprintf("%s-0", semver.Canonical(actual))
 canonicalMin := fmt.Sprintf("%s-0", semver.Canonical(minVersion))
+if semver.Canonical(actual) == "" || semver.Canonical(minVersion) == "" {
+	t.Skipf("Skipping: unable to parse version (actual=%q, min=%q)", actual, minVersion)
+	return
+}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if !strings.HasPrefix(actual, "v") {
	actual = "v" + actual
}
if !strings.HasPrefix(minVersion, "v") {
	minVersion = "v" + minVersion
}

canonicalActual := fmt.Sprintf("%s-0", semver.Canonical(actual))
canonicalMin := fmt.Sprintf("%s-0", semver.Canonical(minVersion))

if semver.Canonical(actual) == "" || semver.Canonical(minVersion) == "" {
	t.Skipf("Skipping: unable to parse version (actual=%q, min=%q)", actual, minVersion)
	return
}

if semver.Compare(canonicalActual, canonicalMin) < 0 {
	t.Skipf("Skipping: cluster version %s is below minimum required %s", cv.Status.Desired.Version, minVersion)
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/framework/framework.go` around lines 282 - 294, Check the outputs of
semver.Canonical for both actual and minVersion and if either returns an empty
string treat the version as non-semver instead of proceeding with the bogus
comparison: after computing canonicalActual and canonicalMin, if canonicalActual
== "" or canonicalMin == "" call t.Skipf with a clear message (e.g., include
cv.Status.Desired.Version and minVersion) indicating the cluster version could
not be parsed, so the test explicitly skips/logs non-semver versions rather than
silently making a meaningless comparison.
```go
suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)
ruleName := "e2e-health-analyzer-" + suffix
alertName := "E2EHealthAnalyzer" + suffix
```
Suffix collision risk across repeated test runs.
time.Now().UnixNano() % 100000 yields only 5 digits; with parallel/retried runs or fast successive executions this can collide with a still-terminating PrometheusRule or a lingering alert series, producing confusing results. Prefer a wider range (e.g., no modulo, or use rand.Int63()) or include t.Name()-derived entropy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/uiplugin_cluster_health_analyzer_test.go` around lines 49 - 51, The
suffix generation using time.Now().UnixNano()%100000 is too small and risks
collisions; update the code that sets suffix (used to build ruleName and
alertName) to produce a larger-entropy identifier — e.g., use rand.Int63()
(seeded once with time.Now().UnixNano()) or concatenate time.Now().UnixNano()
with t.Name() — so the generated suffix is effectively unique across
parallel/retried runs and prevents PrometheusRule/alert name collisions.
|
/ok-to-test |
simonpasquier
left a comment
Most of my comments are non-blocking and can be addressed in a follow-up (my main concern is the failure mode if/when the version can't be determined).
```go
// It returns an error if the request fails. Otherwise the result is passed to
// the callback function for additional checks.
func (f *Framework) AssertPromQLResult(t *testing.T, expr string, callback func(model.Value) error) error {
	return f.AssertPromQLResultWithOptions(t, expr, callback)
```
```diff
-	return f.AssertPromQLResultWithOptions(t, expr, callback)
+	t.Helper()
+	return f.AssertPromQLResultWithOptions(t, expr, callback)
```
```go
// SkipIfClusterVersionBelow skips the test if the cluster version is below
// minVersion. The minVersion string should be a semver-compatible version
// (e.g. "4.19" or "v4.19").
func (f *Framework) SkipIfClusterVersionBelow(t *testing.T, minVersion string) {
```
in general, I'd rather skip only when we're sure that we don't meet the min version criteria. Any unhandled situation should default to test failure.
```go
f.GetResourceWithRetry(t, healthAnalyzerDeploymentName, uiPluginInstallNS, &haDeployment)
f.AssertDeploymentReady(healthAnalyzerDeploymentName, uiPluginInstallNS, framework.WithTimeout(5*time.Minute))(t)

suffix := strconv.FormatInt(time.Now().UnixNano()%100000, 10)
```
is this to ensure that we can run the test multiple times locally and not hit conflicts?
```go
	assert.NilError(t, err, "incident metric for %s never appeared", alertName)
}

func newMonitoringUIPlugin(t *testing.T) *uiv1.UIPlugin {
```
The name is a bit misleading since it will delete the plugin if one already exists. IIUC and given that UI plugin resources are singletons, we can't test a given plugin in parallel. It might be better to enforce this invariant in the structure of TestUIPlugin but it can be a follow-up exercise.
```go
const (
	healthAnalyzerDeploymentName = "health-analyzer"
	prometheusRuleNamespace      = "openshift-monitoring"
```
(nit) I'd rather use a throw-away namespace than openshift-monitoring.
Add end-to-end tests that validate the monitoring UIPlugin with cluster-health-analyzer: a deployment readiness check and a functional test that triggers a CrashLoopBackOff alert and verifies the cluster_health_components_map incident metric is produced.
Also introduce AssertPromQLResultWithOptions to allow callers to override the default poll interval and timeout, and generalize waitForDBUIPluginDeletion to waitForUIPluginDeletion.