fix: add retry logic to kuadrant metrics tests #910

silvi-t wants to merge 1 commit into Kuadrant:main
Conversation
Force-pushed from 11dbe02 to 321b98b
📝 Walkthrough

Adds a polling-based Prometheus helper (`wait_for_metric`).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test
    participant Prometheus
    participant RetryLoop as Backoff Retry Loop
    participant MetricsDB as Prometheus DB
    Test->>Prometheus: wait_for_metric(metric, expected, labels, compare_fn)
    Prometheus->>RetryLoop: start polling (interval=10s, max_tries=5)
    loop until predicate true or max tries
        RetryLoop->>Prometheus: get_metrics(key=metric, labels)
        Prometheus->>MetricsDB: query series for metric+labels
        MetricsDB-->>Prometheus: series (0..n) and values
        Prometheus-->>RetryLoop: current_value (0 if no series) or raise on >1 series
        RetryLoop->>RetryLoop: evaluate compare_fn(current_value, expected)
        alt compare_fn true
            RetryLoop-->>Prometheus: return True
        else retries remain
            RetryLoop->>RetryLoop: wait 10s and retry
        else no retries
            RetryLoop-->>Prometheus: return False
        end
    end
    Prometheus-->>Test: boolean result
```

Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🧹 Nitpick comments (1)
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py (1)
123-128: Consider centralising policy-metric label construction.

This label map duplicates logic already implemented in `_metric_labels(...)` in `testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py` (lines 13-22). A shared helper would prevent drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py` around lines 123 - 128, The labels dict in this test duplicates logic already implemented in _metric_labels(...) — replace the inline labels construction with a call to the shared helper to avoid drift: remove the hardcoded labels assignment and call _metric_labels(policy_kind, settings["service_protection"]["system_project"], "kuadrant-operator-metrics") (or adapt parameter order to match _metric_labels signature) so both tests use the same helper; ensure you import or reference _metric_labels from the module where it’s defined and adjust any variable names accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@testsuite/prometheus.py`:
- Around line 124-126: The wait_for_metric() implementation currently takes only
the first series from get_metrics() (metrics.values[0]) causing first-series
bias and flaky results when Prometheus returns multiple series; change
wait_for_metric() to examine all returned series from
get_metrics(key=metric_name, labels=labels) and return success if any series'
value satisfies compare(series_value, metric_value) (i.e., replace the
single-index lookup with an any/all-style check over metrics.values or iterate
through each metric in the returned collection and apply compare), ensuring you
still handle the case of no series returned by treating it as a non-match.
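The fix the comment asks for can be sketched as a small predicate. The shape of `metrics` (an object with a `values` list) is an assumption for illustration, not the real client API:

```python
import operator
import types

# Hypothetical shapes: 'metrics' mimics the object returned by get_metrics,
# with a 'values' list of series values (names assumed, not the real client API).
def metric_satisfied(metrics, expected, compare=operator.eq):
    # No series returned yet: treat the current value as 0 (a non-match
    # unless 0 itself satisfies the comparison)
    if not metrics.values:
        return compare(0, expected)
    # Fix for the first-series bias: succeed if ANY series matches,
    # instead of checking only metrics.values[0]
    return any(compare(value, expected) for value in metrics.values)
```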
- Line 115: The parameter annotation for labels is using an implicit optional
(dict[str, str] = None); update the function/method signature that declares the
labels parameter to use an explicit optional union type (dict[str, str] | None =
None) to satisfy RUF013 — locate the function that defines the labels parameter
(the signature containing "labels: dict[str, str] = None") and change its type
annotation to "dict[str, str] | None = None".
In `@testsuite/tests/singlecluster/observability/conftest.py`:
- Around line 100-103: The fixture kuadrant_service_monitor currently uses
next(...) without a default which raises StopIteration if no match is found;
change it to search safely (e.g., use next(..., None) or iterate) and
assert/raise a clear test failure when no ServiceMonitor contains
"kuadrant-operator-monitor". Update the kuadrant_service_monitor function to
obtain the matching item into a variable (e.g., match = next(..., None)) and
then call pytest.fail or an assert with a helpful message that includes the
available service_monitors names if match is None.
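The safer lookup the comment describes might look like the following. ServiceMonitor objects with a `name` attribute are an assumed shape, not the real testsuite API:

```python
import types
import pytest

# Sketch of the safer fixture lookup; objects with a 'name' attribute
# stand in for real ServiceMonitor resources.
def find_kuadrant_monitor(service_monitors):
    match = next(
        (sm for sm in service_monitors if "kuadrant-operator-monitor" in sm.name),
        None,  # default prevents StopIteration when nothing matches
    )
    if match is None:
        available = [sm.name for sm in service_monitors]
        pytest.fail(f"No ServiceMonitor matching 'kuadrant-operator-monitor'; available: {available}")
    return match
```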
---
Nitpick comments:
In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`:
- Around line 123-128: The labels dict in this test duplicates logic already
implemented in _metric_labels(...) — replace the inline labels construction with
a call to the shared helper to avoid drift: remove the hardcoded labels
assignment and call _metric_labels(policy_kind,
settings["service_protection"]["system_project"], "kuadrant-operator-metrics")
(or adapt parameter order to match _metric_labels signature) so both tests use
the same helper; ensure you import or reference _metric_labels from the module
where it’s defined and adjust any variable names accordingly.
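A shared helper along the lines this nitpick suggests could look like the sketch below. The exact label keys and the `kuadrant-operator-metrics` default mirror the comment above but are assumptions, not the real testsuite code:

```python
# Hypothetical shared label-building helper; label keys and defaults
# are assumed for illustration.
def metric_labels(kind: str, namespace: str,
                  service: str = "kuadrant-operator-metrics",
                  status: str = "true") -> dict[str, str]:
    """Build the label selector both policy-metric tests could share."""
    return {
        "service": service,
        "namespace": namespace,
        "kind": kind,
        "status": status,
    }
```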
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 98205cfe-985f-464b-b941-a2787a9a7bdd
📒 Files selected for processing (4)
- testsuite/prometheus.py
- testsuite/tests/singlecluster/observability/conftest.py
- testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
- testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
@coderabbitai help
Force-pushed from 321b98b to 80eb587
Signed-off-by: Silvia Tarabova <starabov@redhat.com>
Force-pushed from 80eb587 to 7e2a290
Actionable comments posted: 1
🧹 Nitpick comments (1)
testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py (1)
123-128: Consider extracting shared label construction to conftest.py.

The label construction pattern here is similar to `_metric_labels()` in `test_kuadrant_policy_metrics_lifecycle.py`. If more tests adopt this pattern, extracting a shared helper to `conftest.py` would reduce duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py` around lines 123 - 128, Extract the duplicated label-building logic into a shared helper (e.g., metric_labels or a pytest fixture) placed in conftest.py and replace the inline dict in tests/singlecluster/observability/test_kuadrant_policy_metrics.py (the labels variable) and the similar _metric_labels() in test_kuadrant_policy_metrics_lifecycle.py to call that shared helper; ensure the helper accepts parameters used above (service name, namespace from settings["service_protection"]["system_project"], policy_kind, and status) so both tests can call metric_labels(service="kuadrant-operator-metrics", namespace=..., kind=policy_kind, status="true") or use a fixture to supply defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@testsuite/prometheus.py`:
- Around line 122-132: Run the code formatter on the failing block: format the
_wait function (decorated with backoff.on_predicate) with Black by running the
repository formatter (e.g., poetry run black testsuite/prometheus.py) so
spacing, indentation and line breaks match Black’s style; ensure the decorator,
def _wait(): line, and the subsequent if/return/raise lines (including the
f-string in the AssertionError and the return compare(...) calls) are
reformatted as Black would produce to unblock CI.
---
Nitpick comments:
In `@testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py`:
- Around line 123-128: Extract the duplicated label-building logic into a shared
helper (e.g., metric_labels or a pytest fixture) placed in conftest.py and
replace the inline dict in
tests/singlecluster/observability/test_kuadrant_policy_metrics.py (the labels
variable) and the similar _metric_labels() in
test_kuadrant_policy_metrics_lifecycle.py to call that shared helper; ensure the
helper accepts parameters used above (service name, namespace from
settings["service_protection"]["system_project"], policy_kind, and status) so
both tests can call metric_labels(service="kuadrant-operator-metrics",
namespace=..., kind=policy_kind, status="true") or use a fixture to supply
defaults.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bdf95a64-dd3f-4dae-875d-74ea5b98177e
📒 Files selected for processing (4)
- testsuite/prometheus.py
- testsuite/tests/singlecluster/observability/conftest.py
- testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics.py
- testsuite/tests/singlecluster/observability/test_kuadrant_policy_metrics_lifecycle.py
🚧 Files skipped from review as they are similar to previous changes (1)
- testsuite/tests/singlecluster/observability/conftest.py
I think this could have been an occasion to step back and rethink the way we test the metrics in kuadrant. From my experience with the kua-* environments, their network is very brittle and Prometheus scraping is very slow. Especially if you are running the whole testsuite, there is no real guarantee this approach of retrying 5 times every 10 seconds will be bulletproof on a bad day. The question remains: do we really want to test whether Prometheus is able to scrape some metric, or whether kuadrant is exposing that metric in Prometheus format on that endpoint? I believe there is little benefit in querying Prometheus, given the drawbacks it adds.
@crstrn13 I'm not sure what you're suggesting here. Should we remove these tests, replace the approach, or something else? This PR addresses the immediate flakiness we've been seeing in nightly runs. From my observation, the race condition window is small (one operator reconciliation cycle), and the retry logic handles it reliably.
Description
Fixes flaky tests:

- `test_metric_kuadrant_policies_enforced`, where the `kuadrant_policies_enforced` metric shows `0.0` for `RateLimitPolicy` and `TokenRateLimitPolicy`
- `test_metric_policy_lifecycle`, where metrics don't reflect policy deletion in time

The root cause is a race between policy enforcement and metric updates. Policies committed last (`RateLimitPolicy`, `TokenRateLimitPolicy`) are most affected, as they have less time for the operator to complete the follow-up reconciliation cycle.

Changes
New Features

- Added a `wait_for_metric` method to the `Prometheus` class that polls a metric value with backoff retries, supporting custom comparisons via the `operator` module (e.g. `operator.eq`, `operator.ge`, `operator.lt`)
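A minimal sketch of such a polling helper. The real method is built on the `backoff` package's `on_predicate` decorator and lives on the Prometheus client class; this stdlib-only version with an injected `get_metrics` callable is illustrative, not the actual implementation:

```python
import operator
import time

# Stdlib-only sketch of the polling behaviour described above; the real
# helper uses backoff.on_predicate. get_metrics is injected to keep the
# example self-contained.
def wait_for_metric(get_metrics, metric_name, expected, labels=None,
                    compare=operator.eq, interval=10, max_tries=5):
    """Poll until compare(value, expected) holds for some series, or give up."""
    for attempt in range(max_tries):
        series = get_metrics(metric_name, labels)
        # A metric with no series yet is treated as having value 0
        values = series if series else [0]
        if any(compare(value, expected) for value in values):
            return True
        if attempt < max_tries - 1:
            time.sleep(interval)
    return False
```

A test would then assert on `wait_for_metric(..., compare=operator.ge)` instead of reading the metric once and racing the operator's reconciliation.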
Bug Fixes

- Use `prometheus.wait_for_metric` in `test_metric_kuadrant_policies_enforced` to wait for Prometheus to reflect enforced status
- Replace `wait_for_scrape` with `prometheus.wait_for_metric` in `test_metric_policy_lifecycle` for both policy creation and deletion metric checks
- Add `request.addfinalizer(policy.delete)` to `test_metric_policy_lifecycle` to prevent policy resource leaks when assertions fail before `policy.delete()` is reached

Refactoring

- Moved the `service_monitor` fixture to `conftest.py` as `kuadrant_service_monitor` for reuse across test modules
- Moved `wait_for_scrape` to the `kuadrant_operator_metrics` fixture in `conftest.py` to ensure fresh metrics after policy commit
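The `addfinalizer` fix can be illustrated with a toy stand-in for pytest's finalizer mechanics (`FakeRequest` is purely illustrative, not pytest itself):

```python
# Toy illustration of why request.addfinalizer(policy.delete) should be
# registered right after commit: cleanup runs even when a later assertion
# fails. FakeRequest mimics pytest's finalizer mechanics.
class FakeRequest:
    def __init__(self):
        self._finalizers = []

    def addfinalizer(self, fn):
        self._finalizers.append(fn)

    def teardown(self):
        # pytest invokes finalizers in reverse registration order
        for fn in reversed(self._finalizers):
            fn()

deleted = []
request = FakeRequest()
request.addfinalizer(lambda: deleted.append("policy"))  # registered before asserting
try:
    assert False, "metric never reached the expected value"  # the failing check
except AssertionError:
    pass
finally:
    request.teardown()  # the policy is still cleaned up
```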
Tests
Refactor