Fix calico E2E flake by waiting for default FelixConfiguration before patching by AAraKKe · Pull Request #23674 · DataDog/integrations-core

AAraKKe · 2026-05-12T10:52:19Z

What does this PR do?

In calico/tests/conftest.py::setup_calico, wait for the default FelixConfiguration resource to exist before patching it to enable Prometheus metrics, and use check=True on the patch so any future regression fails loudly at the patch step rather than as an opaque downstream timeout.

Motivation

The Calico E2E job has been flaking intermittently on PR and master runs (e.g. master run 25589885652, this PR's parent commit, and others). Each time, the failure looks like:

Hit error: resource does not exist: FelixConfiguration(default) with error:
  felixconfigurations.crd.projectcalico.org "default" not found
command terminated with exit code 1
...
ERROR at setup of test_check - datadog_checks.dev.errors.RetryError:
  Endpoint: http://10.1.0.x:port/metrics
  Error: <urlopen error [Errno 111] Connection refused>

The root cause:

1. `kubectl wait --for=condition=Ready pods ...` returns as soon as Calico pods pass their readiness probe, but the `default` FelixConfiguration CR is created asynchronously by calico-node after it starts. There is a small window where pods are Ready but `default` does not yet exist.
2. The immediate `calicoctl patch felixConfiguration default ...` then fails with `not found`. Because `datadog_checks.dev.subprocess.run_command` defaults to `check=False`, the failure is silently swallowed.
3. Setup proceeds, port-forward succeeds (the Service exists), but Felix never starts serving Prometheus on port 9091 because the patch never landed.
4. `CheckEndpoints` retries for ~200s and then raises `RetryError`, failing the entire session-scoped fixture and every test in the run.
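
The silent-failure step can be reproduced with the standard library alone. The sketch below uses `subprocess.run` as a stand-in for `datadog_checks.dev.subprocess.run_command` (an assumption: only the `check=False` default is taken from the description above), and a trivial always-failing command as a stand-in for the failed `calicoctl patch` call. The function names are hypothetical.

```python
import subprocess
import sys

# A command that always exits with status 1. It stands in for the failed
# `calicoctl patch` call (hypothetical stand-in, not the real command).
FAILING_CMD = [sys.executable, "-c", "import sys; sys.exit(1)"]


def run_quietly():
    """With check=False the non-zero exit code is returned, not raised."""
    result = subprocess.run(FAILING_CMD, check=False)
    return result.returncode


def run_loudly():
    """With check=True a non-zero exit code raises CalledProcessError."""
    try:
        subprocess.run(FAILING_CMD, check=True)
    except subprocess.CalledProcessError as exc:
        return f"failed with exit code {exc.returncode}"
    return "succeeded"
```

A caller that ignores the return value of `run_quietly()` carries on as if the command had worked, which is the behaviour that let setup proceed past the failed patch.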

The fix waits for the resource to exist before patching, and makes the patch loud on failure so this class of regression cannot recur silently.
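
A minimal sketch of the wait-then-patch pattern, assuming plain `subprocess.run` rather than the repository's own helpers. This is not the code from the PR: `wait_for_resource`, `enable_felix_metrics`, and their parameters are hypothetical names, and both commands are passed in by the caller because the exact invocations are elided above.

```python
import subprocess
import time


def wait_for_resource(get_cmd, timeout=60.0, interval=2.0, run=subprocess.run):
    """Poll `get_cmd` until it exits 0, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        # check=False on purpose: a non-zero exit here only means "not there yet".
        result = run(get_cmd, capture_output=True, text=True, check=False)
        if result.returncode == 0:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"resource not found after {timeout}s: {result.stderr.strip()}"
            )
        time.sleep(interval)


def enable_felix_metrics(get_cmd, patch_cmd, run=subprocess.run, **wait_kwargs):
    """Wait for the resource to exist, then patch it and fail loudly on error."""
    wait_for_resource(get_cmd, run=run, **wait_kwargs)
    # check=True: a failed patch raises here instead of surfacing later
    # as a connection-refused timeout on the metrics endpoint.
    run(patch_cmd, check=True)
```

In the Calico case, `get_cmd` might be something like `kubectl get felixconfigurations.crd.projectcalico.org default` (the resource name appears in the error log above; the command the PR actually uses is not shown here), and `patch_cmd` would be the existing `calicoctl patch felixConfiguration default ...` invocation.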

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
If you need to backport this PR to another branch, you can add the `backport/` label to the PR and it will automatically open a backport PR once this one is merged

…setup The default FelixConfiguration is created asynchronously by calico-node after pods report Ready, so the immediate calicoctl patch sometimes ran before the resource existed. The patch failed silently (run_command defaults to check=False), Prometheus metrics were never enabled, and the downstream metrics endpoint check timed out 200s later with a generic connection-refused error. Wait for the resource to exist before patching, and use check=True on the patch so any future regression fails fast and visibly.

dd-octo-sts · 2026-05-12T10:54:03Z

Validation Report

All 20 validations passed.

Validation	Description	Status
`agent-reqs`	Verify check versions match the Agent requirements file	✅
`ci`	Validate CI configuration and Codecov settings	✅
`codeowners`	Validate every integration has a CODEOWNERS entry	✅
`config`	Validate default configuration files against spec.yaml	✅
`dep`	Verify dependency pins are consistent and Agent-compatible	✅
`http`	Validate integrations use the HTTP wrapper correctly	✅
`imports`	Validate check imports do not use deprecated modules	✅
`integration-style`	Validate check code style conventions	✅
`jmx-metrics`	Validate JMX metrics definition files and config	✅
`labeler`	Validate PR labeler config matches integration directories	✅
`legacy-signature`	Validate no integration uses the legacy Agent check signature	✅
`license-headers`	Validate Python files have proper license headers	✅
`licenses`	Validate third-party license attribution list	✅
`metadata`	Validate metadata.csv metric definitions	✅
`models`	Validate configuration data models match spec.yaml	✅
`openmetrics`	Validate OpenMetrics integrations disable the metric limit	✅
`package`	Validate Python package metadata and naming	✅
`readmes`	Validate README files have required sections	✅
`saved-views`	Validate saved view JSON file structure and fields	✅
`version`	Validate version consistency between package and changelog	✅

datadog-official · 2026-05-12T10:56:54Z

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
• Patch Coverage: 40.00%
• Overall Coverage: 81.71% (-5.52%)

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: c775e9d

codecov · 2026-05-12T10:56:56Z

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.02%. Comparing base (57b7286) to head (c775e9d).
⚠️ Report is 8 commits behind head on master.

dd-octo-sts Bot added the integration/calico label May 12, 2026

AAraKKe added the qa/skip-qa label (Automatically skip this PR for the next QA) May 12, 2026

AAraKKe marked this pull request as ready for review May 12, 2026 10:54

AAraKKe requested a review from a team as a code owner May 12, 2026 10:54

dd-octo-sts Bot added the team/agent-integrations label May 12, 2026

AAraKKe wants to merge 1 commit into master from aarakke/fix-calico-e2e-flake
