
fix: add retry logic to kuadrant metrics tests #910

Open

silvi-t wants to merge 1 commit into Kuadrant:main from silvi-t:kuadrant-metrics-fix

Conversation

@silvi-t
Contributor

@silvi-t silvi-t commented Mar 26, 2026

Description

  • Fix intermittent failures observed in nightly test pipeline for test_metric_kuadrant_policies_enforced where kuadrant_policies_enforced metric shows 0.0 for RateLimitPolicy and TokenRateLimitPolicy
  • Fix intermittent failures observed in nightly test pipeline for test_metric_policy_lifecycle where metrics don't reflect policy deletion in time
  • Possible root cause: the kuadrant-operator updates policy metrics from an in-memory topology snapshot that may lag one reconciliation cycle behind the actual policy state in etcd, causing a race condition between policy
    enforcement and metric updates. Policies committed last (RateLimitPolicy, TokenRateLimitPolicy) are most affected, as they leave the operator the least time to complete the follow-up reconciliation cycle

Changes

New Features

  • Add wait_for_metric method to the Prometheus class that polls a metric value with backoff retries, supporting custom comparisons via the operator module (e.g. operator.eq, operator.ge, operator.lt)

Bug Fixes

  • Replace single-shot metric assertions with prometheus.wait_for_metric in test_metric_kuadrant_policies_enforced to wait for Prometheus to reflect enforced status
  • Replace wait_for_scrape with prometheus.wait_for_metric in test_metric_policy_lifecycle for both policy creation and deletion metric checks
  • Add request.addfinalizer(policy.delete) to test_metric_policy_lifecycle to prevent policy resource leaks when assertions fail before policy.delete() is reached

Refactoring

  • Move service_monitor fixture to conftest.py as kuadrant_service_monitor for reuse across test modules
  • Add wait_for_scrape to kuadrant_operator_metrics fixture in conftest.py to ensure fresh metrics after policy commit

Summary by CodeRabbit

  • Tests

    • Added a polling-based Prometheus wait that retries with configurable comparison, handles absent or multiple series, and surfaces latest observed values on failure.
    • Tests now explicitly wait for the operator service scrape readiness before validating metrics and use the new polling approach for lifecycle assertions.
    • Improved teardown sequencing to ensure resources are deleted reliably.
  • Refactor

    • Consolidated and centralised metric label construction for consistent observability queries.

@silvi-t silvi-t self-assigned this Mar 26, 2026
@github-project-automation github-project-automation bot moved this to 🆕 New in Kuadrant QE Mar 26, 2026
@silvi-t silvi-t moved this to In Review in Kuadrant Mar 26, 2026
@silvi-t silvi-t force-pushed the kuadrant-metrics-fix branch from 11dbe02 to 321b98b Compare March 27, 2026 11:44
@coderabbitai

coderabbitai bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b78862d-2fdf-4802-b2bc-81030fb63808

📥 Commits

Reviewing files that changed from the base of the PR and between 80eb587 and 7e2a290.

📒 Files selected for processing (4)
  • testsuite/prometheus.py
  • testsuite/tests/singlecluster/observability/conftest.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/conftest.py

📝 Walkthrough

Adds a polling-based Prometheus helper (Prometheus.wait_for_metric) and updates observability tests/fixtures to explicitly wait for a ServiceMonitor scrape and to poll metrics (with constructed label sets and comparison predicates) instead of immediate assertions.

Changes

Cohort / File(s) Summary
Core Prometheus Utility
testsuite/prometheus.py
Added wait_for_metric(metric_name, metric_value, labels=None, compare=operator.eq) -> bool: polls get_metrics on fixed backoff (interval=10s, max_tries=5); treats missing series as value 0; raises if >1 series; returns predicate result. Added import operator.
Test Fixtures
testsuite/tests/singlecluster/observability/conftest.py
Added module-scoped kuadrant_service_monitor fixture; updated kuadrant_operator_metrics to depend on it and to call prometheus.wait_for_scrape(kuadrant_service_monitor, "/metrics") before fetching metrics.
Policy Metrics Tests
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py, testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
Tests now use prometheus.wait_for_metric(...) with constructed label dictionaries (including service, namespace, kind, optional status) and operator.ge where appropriate; added _metric_labels() helper and lifecycle test uses request.addfinalizer(policy.delete) for teardown instead of a module service_monitor fixture. Added operator and settings imports where needed.

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant Prometheus
    participant RetryLoop as Backoff Retry Loop
    participant MetricsDB as Prometheus DB

    Test->>Prometheus: wait_for_metric(metric, expected, labels, compare_fn)
    Prometheus->>RetryLoop: start polling (interval=10s, max_tries=5)
    loop until predicate true or max tries
        RetryLoop->>Prometheus: get_metrics(key=metric, labels)
        Prometheus->>MetricsDB: query series for metric+labels
        MetricsDB-->>Prometheus: series (0..n) and values
        Prometheus-->>RetryLoop: current_value (0 if no series) or raise on >1 series
        RetryLoop->>RetryLoop: evaluate compare_fn(current_value, expected)
        alt compare_fn true
            RetryLoop-->>Prometheus: return True
        else retries remain
            RetryLoop->>RetryLoop: wait 10s and retry
        else no retries
            RetryLoop-->>Prometheus: return False
        end
    end
    Prometheus-->>Test: boolean result

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰✨ I nibble at metrics, I wait and I peer,
Polling each value until clarity's near.
With hops and with patience I tally each score,
Prometheus hums — now the tests worry no more.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix: add retry logic to kuadrant metrics tests' accurately summarises the main change—adding retry/polling logic to metric assertions.
Description check ✅ Passed The PR description is comprehensive, covering root cause, specific changes (new features, bug fixes, refactoring), and follows good structure; all key sections are present.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py (1)

123-128: Consider centralising policy-metric label construction.

This label map duplicates logic already implemented in _metric_labels(...) in testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py (lines 13-22). A shared helper would prevent drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`
around lines 123 - 128, The labels dict in this test duplicates logic already
implemented in _metric_labels(...) — replace the inline labels construction with
a call to the shared helper to avoid drift: remove the hardcoded labels
assignment and call _metric_labels(policy_kind,
settings["service_protection"]["system_project"], "kuadrant-operator-metrics")
(or adapt parameter order to match _metric_labels signature) so both tests use
the same helper; ensure you import or reference _metric_labels from the module
where it’s defined and adjust any variable names accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@testsuite/prometheus.py`:
- Around line 124-126: The wait_for_metric() implementation currently takes only
the first series from get_metrics() (metrics.values[0]) causing first-series
bias and flaky results when Prometheus returns multiple series; change
wait_for_metric() to examine all returned series from
get_metrics(key=metric_name, labels=labels) and return success if any series'
value satisfies compare(series_value, metric_value) (i.e., replace the
single-index lookup with an any/all-style check over metrics.values or iterate
through each metric in the returned collection and apply compare), ensuring you
still handle the case of no series returned by treating it as a non-match.
- Line 115: The parameter annotation for labels is using an implicit optional
(dict[str, str] = None); update the function/method signature that declares the
labels parameter to use an explicit optional union type (dict[str, str] | None =
None) to satisfy RUF013 — locate the function that defines the labels parameter
(the signature containing "labels: dict[str, str] = None") and change its type
annotation to "dict[str, str] | None = None".

In `@testsuite/tests/singlecluster/observability/conftest.py`:
- Around line 100-103: The fixture kuadrant_service_monitor currently uses
next(...) without a default which raises StopIteration if no match is found;
change it to search safely (e.g., use next(..., None) or iterate) and
assert/raise a clear test failure when no ServiceMonitor contains
"kuadrant-operator-monitor". Update the kuadrant_service_monitor function to
obtain the matching item into a variable (e.g., match = next(..., None)) and
then call pytest.fail or an assert with a helpful message that includes the
available service_monitors names if match is None.

---

Nitpick comments:
In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`:
- Around line 123-128: The labels dict in this test duplicates logic already
implemented in _metric_labels(...) — replace the inline labels construction with
a call to the shared helper to avoid drift: remove the hardcoded labels
assignment and call _metric_labels(policy_kind,
settings["service_protection"]["system_project"], "kuadrant-operator-metrics")
(or adapt parameter order to match _metric_labels signature) so both tests use
the same helper; ensure you import or reference _metric_labels from the module
where it’s defined and adjust any variable names accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 98205cfe-985f-464b-b941-a2787a9a7bdd

📥 Commits

Reviewing files that changed from the base of the PR and between 99179be and 321b98b.

📒 Files selected for processing (4)
  • testsuite/prometheus.py
  • testsuite/tests/singlecluster/observability/conftest.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py

@averevki
Contributor

@coderabbitai help

@coderabbitai

coderabbitai bot commented Mar 27, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit configuration file (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@silvi-t silvi-t force-pushed the kuadrant-metrics-fix branch from 321b98b to 80eb587 Compare March 27, 2026 12:28
Signed-off-by: Silvia Tarabova <starabov@redhat.com>
@silvi-t silvi-t force-pushed the kuadrant-metrics-fix branch from 80eb587 to 7e2a290 Compare March 27, 2026 12:30

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py (1)

123-128: Consider extracting shared label construction to conftest.py.

The label construction pattern here is similar to _metric_labels() in test_kuadrant_policy_metrics_lifecycle.py. If more tests adopt this pattern, extracting a shared helper to conftest.py would reduce duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`
around lines 123 - 128, Extract the duplicated label-building logic into a
shared helper (e.g., metric_labels or a pytest fixture) placed in conftest.py
and replace the inline dict in
tests/singlecluster/observability/test_kuadrant_policy_metrics.py (the labels
variable) and the similar _metric_labels() in
test_kuadrant_policy_metrics_lifecycle.py to call that shared helper; ensure the
helper accepts parameters used above (service name, namespace from
settings["service_protection"]["system_project"], policy_kind, and status) so
both tests can call metric_labels(service="kuadrant-operator-metrics",
namespace=..., kind=policy_kind, status="true") or use a fixture to supply
defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@testsuite/prometheus.py`:
- Around line 122-132: Run the code formatter on the failing block: format the
_wait function (decorated with backoff.on_predicate) with Black by running the
repository formatter (e.g., poetry run black testsuite/prometheus.py) so
spacing, indentation and line breaks match Black’s style; ensure the decorator,
def _wait(): line, and the subsequent if/return/raise lines (including the
f-string in the AssertionError and the return compare(...) calls) are
reformatted as Black would produce to unblock CI.

---

Nitpick comments:
In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`:
- Around line 123-128: Extract the duplicated label-building logic into a shared
helper (e.g., metric_labels or a pytest fixture) placed in conftest.py and
replace the inline dict in
tests/singlecluster/observability/test_kuadrant_policy_metrics.py (the labels
variable) and the similar _metric_labels() in
test_kuadrant_policy_metrics_lifecycle.py to call that shared helper; ensure the
helper accepts parameters used above (service name, namespace from
settings["service_protection"]["system_project"], policy_kind, and status) so
both tests can call metric_labels(service="kuadrant-operator-metrics",
namespace=..., kind=policy_kind, status="true") or use a fixture to supply
defaults.
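The shared helper both review rounds describe might look like this (a sketch; the parameter names are taken from the labels listed in the comments and are otherwise assumptions):

```python
def metric_labels(kind, namespace, service="kuadrant-operator-metrics", status=None):
    """Build the label set used to query kuadrant policy metrics; `status`
    is optional since only the enforced-status test filters on it."""
    labels = {"service": service, "namespace": namespace, "kind": kind}
    if status is not None:
        labels["status"] = status
    return labels
```

Both tests could then call `metric_labels(kind=policy_kind, namespace=..., status="true")` instead of building the dict inline.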

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdf95a64-dd3f-4dae-875d-74ea5b98177e

📥 Commits

Reviewing files that changed from the base of the PR and between 321b98b and 80eb587.

📒 Files selected for processing (4)
  • testsuite/prometheus.py
  • testsuite/tests/singlecluster/observability/conftest.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
  • testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • testsuite/tests/singlecluster/observability/conftest.py

@crstrn13
Contributor

crstrn13 commented Mar 27, 2026

I think this could have been an occasion to step back and rethink the way we test the metrics in kuadrant. From my experience with the kua-* environments, their network is very brittle and Prometheus scraping is very slow. Especially if you are running the whole testsuite, there is no real guarantee that this approach of retrying 5 times every 10 seconds will be bulletproof on a bad day.

The question still remains: do we really want to test whether Prometheus is able to scrape some metric, or whether kuadrant is exposing that metric in Prometheus format on that endpoint? I believe there is little benefit in querying Prometheus, given the drawbacks it adds.

@silvi-t
Contributor Author

silvi-t commented Mar 27, 2026

@crstrn13 I'm not sure what you're suggesting here. Should we remove these tests, replace the approach, or something else? This PR addresses the immediate flakiness we've been seeing in nightly runs. From my observation, the race condition window is small (one operator reconciliation cycle), and the retry logic handles it reliably.


Labels

None yet

Projects

Status: In Review
Status: 🆕 New

3 participants