
fix: add retry logic to kuadrant metrics tests #910

Open

silvi-t wants to merge 1 commit into Kuadrant:main from silvi-t:kuadrant-metrics-fix

Conversation

@silvi-t
Contributor

@silvi-t silvi-t commented Mar 26, 2026

Description

  • Fix intermittent failures observed in nightly test pipeline for test_metric_kuadrant_policies_enforced where kuadrant_policies_enforced metric shows 0.0 for RateLimitPolicy and TokenRateLimitPolicy
  • Fix intermittent failures observed in nightly test pipeline for test_metric_policy_lifecycle where metrics don't reflect policy deletion in time
  • Possible root cause: the kuadrant-operator updates policy metrics from an in-memory topology snapshot that may lag one reconciliation cycle behind the actual policy state in etcd, causing a race condition between policy
    enforcement and metric updates. Policies committed last (RateLimitPolicy, TokenRateLimitPolicy) are most affected, as they leave the operator the least time to complete the follow-up reconciliation cycle

Changes

New Features

  • Add wait_for_metric method to the Prometheus class that polls a metric value with backoff retries, supporting custom comparisons via the operator module (e.g. operator.eq, operator.ge, operator.lt)

Bug Fixes

  • Replace single-shot metric assertions with prometheus.wait_for_metric in test_metric_kuadrant_policies_enforced to wait for Prometheus to reflect enforced status
  • Replace wait_for_scrape with prometheus.wait_for_metric in test_metric_policy_lifecycle for both policy creation and deletion metric checks
  • Add request.addfinalizer(policy.delete) to test_metric_policy_lifecycle to prevent policy resource leaks when assertions fail before policy.delete() is reached

Refactoring

  • Move service_monitor fixture to conftest.py as kuadrant_service_monitor for reuse across test modules
  • Add wait_for_scrape to kuadrant_operator_metrics fixture in conftest.py to ensure fresh metrics after policy commit

Summary by CodeRabbit

  • Tests

    • Added a polling-based Prometheus wait that retries with configurable comparison, handles absent or multiple series, and surfaces latest observed values on failure.
    • Tests now explicitly wait for the operator service scrape readiness before validating metrics and use the new polling approach for lifecycle assertions.
    • Improved teardown sequencing to ensure resources are deleted reliably.
  • Refactor

    • Consolidated and centralised metric label construction for consistent observability queries.

@silvi-t silvi-t self-assigned this Mar 26, 2026
@github-project-automation github-project-automation bot moved this to 🆕 New in Kuadrant QE Mar 26, 2026
@silvi-t silvi-t moved this to In Review in Kuadrant Mar 26, 2026
@silvi-t silvi-t force-pushed the kuadrant-metrics-fix branch from 11dbe02 to 321b98b Compare March 27, 2026 11:44
@coderabbitai

coderabbitai bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b78862d-2fdf-4802-b2bc-81030fb63808

📥 Commits

Reviewing files that changed from the base of the PR and between 80eb587 and 7e2a290.

📒 Files selected for processing (4)
  • testsuite/prometheus.py
  • testsuite/tests/singlecluster/observability/conftest.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/conftest.py

📝 Walkthrough

Adds a polling-based Prometheus helper (Prometheus.wait_for_metric) and updates observability tests/fixtures to explicitly wait for a ServiceMonitor scrape and to poll metrics (with constructed label sets and comparison predicates) instead of immediate assertions.

Changes

Cohort / File(s) Summary
Core Prometheus Utility
testsuite/prometheus.py
Added wait_for_metric(metric_name, metric_value, labels=None, compare=operator.eq) -> bool: polls get_metrics on fixed backoff (interval=10s, max_tries=5); treats missing series as value 0; raises if >1 series; returns predicate result. Added import operator.
Test Fixtures
testsuite/tests/singlecluster/observability/conftest.py
Added module-scoped kuadrant_service_monitor fixture; updated kuadrant_operator_metrics to depend on it and to call prometheus.wait_for_scrape(kuadrant_service_monitor, "/metrics") before fetching metrics.
Policy Metrics Tests
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py, testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
Tests now use prometheus.wait_for_metric(...) with constructed label dictionaries (including service, namespace, kind, optional status) and operator.ge where appropriate; added _metric_labels() helper and lifecycle test uses request.addfinalizer(policy.delete) for teardown instead of a module service_monitor fixture. Added operator and settings imports where needed.

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant Prometheus
    participant RetryLoop as Backoff Retry Loop
    participant MetricsDB as Prometheus DB

    Test->>Prometheus: wait_for_metric(metric, expected, labels, compare_fn)
    Prometheus->>RetryLoop: start polling (interval=10s, max_tries=5)
    loop until predicate true or max tries
        RetryLoop->>Prometheus: get_metrics(key=metric, labels)
        Prometheus->>MetricsDB: query series for metric+labels
        MetricsDB-->>Prometheus: series (0..n) and values
        Prometheus-->>RetryLoop: current_value (0 if no series) or raise on >1 series
        RetryLoop->>RetryLoop: evaluate compare_fn(current_value, expected)
        alt compare_fn true
            RetryLoop-->>Prometheus: return True
        else retries remain
            RetryLoop->>RetryLoop: wait 10s and retry
        else no retries
            RetryLoop-->>Prometheus: return False
        end
    end
    Prometheus-->>Test: boolean result

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰✨ I nibble at metrics, I wait and I peer,
Polling each value until clarity's near.
With hops and with patience I tally each score,
Prometheus hums — now the tests worry no more.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix: add retry logic to kuadrant metrics tests' accurately summarises the main change—adding retry/polling logic to metric assertions.
Description check ✅ Passed The PR description is comprehensive, covering root cause, specific changes (new features, bug fixes, refactoring), and follows good structure; all key sections are present.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py (1)

123-128: Consider centralising policy-metric label construction.

This label map duplicates logic already implemented in _metric_labels(...) in testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py (lines 13-22). A shared helper would prevent drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`
around lines 123 - 128, The labels dict in this test duplicates logic already
implemented in _metric_labels(...) — replace the inline labels construction with
a call to the shared helper to avoid drift: remove the hardcoded labels
assignment and call _metric_labels(policy_kind,
settings["service_protection"]["system_project"], "kuadrant-operator-metrics")
(or adapt parameter order to match _metric_labels signature) so both tests use
the same helper; ensure you import or reference _metric_labels from the module
where it’s defined and adjust any variable names accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@testsuite/prometheus.py`:
- Around line 124-126: The wait_for_metric() implementation currently takes only
the first series from get_metrics() (metrics.values[0]) causing first-series
bias and flaky results when Prometheus returns multiple series; change
wait_for_metric() to examine all returned series from
get_metrics(key=metric_name, labels=labels) and return success if any series'
value satisfies compare(series_value, metric_value) (i.e., replace the
single-index lookup with an any/all-style check over metrics.values or iterate
through each metric in the returned collection and apply compare), ensuring you
still handle the case of no series returned by treating it as a non-match.
- Line 115: The parameter annotation for labels is using an implicit optional
(dict[str, str] = None); update the function/method signature that declares the
labels parameter to use an explicit optional union type (dict[str, str] | None =
None) to satisfy RUF013 — locate the function that defines the labels parameter
(the signature containing "labels: dict[str, str] = None") and change its type
annotation to "dict[str, str] | None = None".

In `@testsuite/tests/singlecluster/observability/conftest.py`:
- Around line 100-103: The fixture kuadrant_service_monitor currently uses
next(...) without a default which raises StopIteration if no match is found;
change it to search safely (e.g., use next(..., None) or iterate) and
assert/raise a clear test failure when no ServiceMonitor contains
"kuadrant-operator-monitor". Update the kuadrant_service_monitor function to
obtain the matching item into a variable (e.g., match = next(..., None)) and
then call pytest.fail or an assert with a helpful message that includes the
available service_monitors names if match is None.

---

Nitpick comments:
In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`:
- Around line 123-128: The labels dict in this test duplicates logic already
implemented in _metric_labels(...) — replace the inline labels construction with
a call to the shared helper to avoid drift: remove the hardcoded labels
assignment and call _metric_labels(policy_kind,
settings["service_protection"]["system_project"], "kuadrant-operator-metrics")
(or adapt parameter order to match _metric_labels signature) so both tests use
the same helper; ensure you import or reference _metric_labels from the module
where it’s defined and adjust any variable names accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 98205cfe-985f-464b-b941-a2787a9a7bdd

📥 Commits

Reviewing files that changed from the base of the PR and between 99179be and 321b98b.

📒 Files selected for processing (4)
  • testsuite/prometheus.py
  • testsuite/tests/singlecluster/observability/conftest.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py

@averevki
Contributor

@coderabbitai help

@coderabbitai

coderabbitai bot commented Mar 27, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit configuration file (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@silvi-t silvi-t force-pushed the kuadrant-metrics-fix branch from 321b98b to 80eb587 Compare March 27, 2026 12:28
Signed-off-by: Silvia Tarabova <starabov@redhat.com>
@silvi-t silvi-t force-pushed the kuadrant-metrics-fix branch from 80eb587 to 7e2a290 Compare March 27, 2026 12:30

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py (1)

123-128: Consider extracting shared label construction to conftest.py.

The label construction pattern here is similar to _metric_labels() in test_kuadrant_policy_metrics_lifecycle.py. If more tests adopt this pattern, extracting a shared helper to conftest.py would reduce duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`
around lines 123 - 128, Extract the duplicated label-building logic into a
shared helper (e.g., metric_labels or a pytest fixture) placed in conftest.py
and replace the inline dict in
tests/singlecluster/observability/test_kuadrant_policy_metrics.py (the labels
variable) and the similar _metric_labels() in
test_kuadrant_policy_metrics_lifecycle.py to call that shared helper; ensure the
helper accepts parameters used above (service name, namespace from
settings["service_protection"]["system_project"], policy_kind, and status) so
both tests can call metric_labels(service="kuadrant-operator-metrics",
namespace=..., kind=policy_kind, status="true") or use a fixture to supply
defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@testsuite/prometheus.py`:
- Around line 122-132: Run the code formatter on the failing block: format the
_wait function (decorated with backoff.on_predicate) with Black by running the
repository formatter (e.g., poetry run black testsuite/prometheus.py) so
spacing, indentation and line breaks match Black’s style; ensure the decorator,
def _wait(): line, and the subsequent if/return/raise lines (including the
f-string in the AssertionError and the return compare(...) calls) are
reformatted as Black would produce to unblock CI.

---

Nitpick comments:
In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`:
- Around line 123-128: Extract the duplicated label-building logic into a shared
helper (e.g., metric_labels or a pytest fixture) placed in conftest.py and
replace the inline dict in
tests/singlecluster/observability/test_kuadrant_policy_metrics.py (the labels
variable) and the similar _metric_labels() in
test_kuadrant_policy_metrics_lifecycle.py to call that shared helper; ensure the
helper accepts parameters used above (service name, namespace from
settings["service_protection"]["system_project"], policy_kind, and status) so
both tests can call metric_labels(service="kuadrant-operator-metrics",
namespace=..., kind=policy_kind, status="true") or use a fixture to supply
defaults.
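The shared helper both review rounds describe might look like this (a sketch; the parameter names are taken from the labels listed in the comments and are otherwise assumptions):

```python
def metric_labels(kind, namespace, service="kuadrant-operator-metrics", status=None):
    """Build the label set used to query kuadrant policy metrics; `status`
    is optional since only the enforced-status test filters on it."""
    labels = {"service": service, "namespace": namespace, "kind": kind}
    if status is not None:
        labels["status"] = status
    return labels
```

Both tests could then call `metric_labels(kind=policy_kind, namespace=..., status="true")` instead of building the dict inline.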

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdf95a64-dd3f-4dae-875d-74ea5b98177e

📥 Commits

Reviewing files that changed from the base of the PR and between 321b98b and 80eb587.

📒 Files selected for processing (4)
  • testsuite/prometheus.py
  • testsuite/tests/singlecluster/observability/conftest.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • testsuite/tests/singlecluster/observability/conftest.py

@crstrn13
Contributor

crstrn13 commented Mar 27, 2026

I think this could have been an occasion to step back and rethink the way we test the metrics in kuadrant. From my experience with the kua-* environments, their network is very brittle and Prometheus scraping is very slow. Especially if you are running the whole testsuite, there is no real guarantee that this approach of retrying 5 times every 10 seconds will be bulletproof on a bad day.

The question still remains: do we really want to test whether Prometheus is able to scrape some metric, or whether kuadrant is exposing that metric in Prometheus format on that endpoint? I believe there is little benefit in querying Prometheus, given the drawbacks it adds.

@silvi-t
Contributor Author

silvi-t commented Mar 27, 2026

@crstrn13 I'm not sure what you're suggesting here. Should we remove these tests, replace the approach, or something else? This PR addresses the immediate flakiness we've been seeing in nightly runs. From my observation, the race condition window is small (one operator reconciliation cycle), and the retry logic handles it reliably.


Labels

None yet

Projects

Status: In Review
Status: 🆕 New

3 participants