Fix calico E2E flake by waiting for default FelixConfiguration before patching #23674
Open
AAraKKe wants to merge 1 commit into
…setup

The default FelixConfiguration is created asynchronously by calico-node after pods report Ready, so the immediate calicoctl patch sometimes ran before the resource existed. The patch failed silently (run_command defaults to check=False), Prometheus metrics were never enabled, and the downstream metrics endpoint check timed out 200s later with a generic connection-refused error.

Wait for the resource to exist before patching, and use check=True on the patch so any future regression fails fast and visibly.
Validation Report: All 20 validations passed.

🎉 All green! ❄️ No new flaky tests detected. 🔗 Commit SHA: c775e9d
What does this PR do?
In `calico/tests/conftest.py::setup_calico`, wait for the `default` `FelixConfiguration` resource to exist before patching it to enable Prometheus metrics, and use `check=True` on the patch so any future regression fails loudly at the patch step rather than as an opaque downstream timeout.

Motivation
The Calico E2E job has been flaking intermittently on PR and `master` runs (e.g. master run 25589885652, this PR's parent commit, and others). Each time, the failure looks like:

The root cause:

1. `kubectl wait --for=condition=Ready pods ...` returns as soon as Calico pods pass their readiness probe, but the `default` FelixConfiguration CR is created asynchronously by `calico-node` after it starts. There is a small window where pods are Ready but `default` does not yet exist.
2. `calicoctl patch felixConfiguration default ...` then fails with `not found`. Because `datadog_checks.dev.subprocess.run_command` defaults to `check=False`, the failure is silently swallowed.
3. `CheckEndpoints` retries for ~200s and then raises `RetryError`, failing the entire session-scoped fixture and every test in the run.

The fix waits for the resource to exist before patching, and makes the patch loud on failure so this class of regression cannot recur silently.
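For readers who want the shape of the fix without opening the diff, here is a minimal sketch of the wait-then-patch pattern described above. It is not the code from this PR: the helper names `wait_for_resource` and `enable_felix_metrics`, the timeout and interval values, and the exact `calicoctl` argument lists are illustrative assumptions, and it uses the standard library's `subprocess.run` in place of `datadog_checks.dev.subprocess.run_command`.

```python
import subprocess
import time


def wait_for_resource(get_cmd, timeout=60.0, interval=2.0,
                      run=subprocess.run, sleep=time.sleep):
    """Poll `get_cmd` until it exits 0; raise TimeoutError after `timeout` seconds.

    Hypothetical helper: `run` and `sleep` are injectable so the loop can be
    exercised without a cluster.
    """
    deadline = time.monotonic() + timeout
    while True:
        # No check=True here: a non-zero exit just means "not created yet".
        result = run(get_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "resource not found after {}s: {}".format(timeout, " ".join(get_cmd))
            )
        sleep(interval)


def enable_felix_metrics(run=subprocess.run, sleep=time.sleep):
    """Wait for the default FelixConfiguration, then patch it loudly."""
    # Assumed command lines; the real fixture may invoke calicoctl differently.
    wait_for_resource(
        ["calicoctl", "get", "felixConfiguration", "default"],
        run=run,
        sleep=sleep,
    )
    # check=True makes a failed patch raise immediately instead of surfacing
    # later as a timeout in the metrics endpoint check.
    run(
        [
            "calicoctl", "patch", "felixConfiguration", "default",
            "--patch", '{"spec": {"prometheusMetricsEnabled": true}}',
        ],
        check=True,
    )
```

Polling with a deadline keeps the fixture bounded: if `calico-node` never creates the resource, the setup fails at the wait step with a message naming the missing resource, and if the patch itself is rejected, `check=True` raises `subprocess.CalledProcessError` at the patch step.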
Review checklist (to be filled by reviewers)