Fix flaky OTel integration test with DNS health check (#61070) #61242

Merged
jason810496 merged 2 commits into apache:main from Abhishekmishra2808:main
Jan 31, 2026

Conversation

@Abhishekmishra2808
Contributor

Description

This PR fixes a flaky integration test: test_scheduler_change_after_the_first_task_finishes in tests/integration/otel/test_otel.py.

The Problem:
The test frequently failed in CI and local Breeze environments with an AssertionError (missing task2 span) and a urllib3.exceptions.NameResolutionError for the host breeze-otel-collector.

This was caused by a race condition where the Airflow test components attempted to connect to the OpenTelemetry (OTel) collector before Docker's internal DNS had fully propagated or before the collector service was ready to accept connections. This resulted in dropped spans and failed assertions.
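The two failure modes described above look different at the socket level. As a rough illustration only (the helper name `probe_collector` and its return labels are hypothetical and not part of this PR), a single connection attempt could be classified like this:

```python
import socket


def probe_collector(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt (hypothetical helper, not from the PR)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ready"
    except socket.gaierror:
        # The hostname cannot be resolved yet, e.g. Docker DNS has not caught up.
        return "dns_unresolved"
    except ConnectionRefusedError:
        # The name resolves, but nothing is listening on the port yet.
        return "refused"
    except OSError:
        # Timeouts, unreachable networks, and any other socket-level error.
        return "other_error"
```

Either of the first two failures at the moment a span is exported would be consistent with the dropped spans reported here.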

The Fix:
I added a health check, wait_for_otel_collector(), to the TestOtelIntegration class.

  • The function uses socket.create_connection to poll the collector's availability.
  • It catches socket.gaierror (DNS resolution failure) and ConnectionRefusedError, and keeps polling until a 60-second timeout elapses.
  • The setup_class method now calls this health check before any tests execute, ensuring the infrastructure is stable.
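The bullets above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the port (4318, the usual OTLP/HTTP default), the polling interval, the parameter names, and the warning message are all assumed, and the function is written at module level for brevity.

```python
import logging
import socket
import time

log = logging.getLogger(__name__)


def wait_for_otel_collector(
    host: str = "breeze-otel-collector",
    port: int = 4318,  # assumption: OTLP/HTTP default port
    timeout: float = 60.0,
    interval: float = 1.0,  # assumption: pause between attempts
) -> None:
    """Poll until a TCP connection to the collector succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return  # The collector accepted a connection.
        except (socket.gaierror, ConnectionRefusedError) as exc:
            # DNS not resolvable yet, or nothing listening yet: retry.
            # Other OSError subclasses are not caught and will propagate.
            if time.monotonic() >= deadline:
                log.warning("Collector %s:%s not reachable: %s", host, port, exc)
                return  # Give up quietly; setup continues regardless.
            time.sleep(interval)


class TestOtelIntegration:
    @classmethod
    def setup_class(cls):
        # Block until the collector is reachable before any test runs.
        wait_for_otel_collector()
```

Using time.monotonic() for the deadline keeps the wait unaffected by wall-clock adjustments inside the container.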

This is a targeted fix that addresses the root cause of the network flakiness and infrastructure timing issues without modifying core production code.


Related Issues


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

@boring-cyborg

boring-cyborg bot commented Jan 30, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It is a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Member

@jason810496 jason810496 left a comment

Nice! Thank you for the PR.
I will wait until CI passes.

@Abhishekmishra2808
Contributor Author

Abhishekmishra2808 commented Jan 30, 2026

@jason810496
One CI check failed previously, so I have updated the timeout to 120 seconds as suggested (and pushed it as the default).

Quick follow-up on the approach: currently, if the collector isn't reachable after the timeout, setup_class continues, but the individual OTel tests will eventually fail their assertions.

Would you prefer if I modified this to return a boolean and use pytest.skip in setup_class if the collector is unreachable? This would turn those hard failures into "Skipped" status, keeping the CI cleaner when the OTel infrastructure has transient issues. Let me know what you think!

@henry3260
Contributor

+1 on returning a boolean. However, I'm not entirely sure about skipping the tests. If the collector is supposed to be running but isn't reachable, I feel it might be better to let the tests fail (or fail explicitly) so we don't overlook infrastructure issues.
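A rough sketch of the boolean-returning variant with an explicit failure might look like the following. All names here (`collector_is_reachable`, the class attributes, the port 4318, the polling interval, and the error message) are assumptions made for illustration; the thread does not show which option the merged code took.

```python
import socket
import time


def collector_is_reachable(
    host: str, port: int, timeout: float = 120.0, interval: float = 1.0
) -> bool:
    """Return True once a TCP connection succeeds, False if the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except (socket.gaierror, ConnectionRefusedError):
            # Not resolvable or not listening yet: wait and try again.
            time.sleep(interval)
    return False


class TestOtelIntegration:
    # Assumed values; the real test class may configure these differently.
    collector_host = "breeze-otel-collector"
    collector_port = 4318
    collector_timeout = 120.0

    @classmethod
    def setup_class(cls):
        if not collector_is_reachable(
            cls.collector_host, cls.collector_port, timeout=cls.collector_timeout
        ):
            # Failing here surfaces the infrastructure problem immediately.
            # The skip alternative would call pytest.skip(...) at this point.
            raise RuntimeError(
                f"OTel collector {cls.collector_host}:{cls.collector_port} is unreachable"
            )
```

Raising in setup_class reports every test in the class as an error, which keeps a broken collector visible in CI; skipping would hide it behind a "Skipped" status.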

@Abhishekmishra2808
Contributor Author

@jason810496
All CI checks have now passed, confirming the stability of the 120s timeout and the health check implementation. Is there anything further required from my side, or is this ready for a final review?

Member

@jason810496 jason810496 left a comment

Nice! LGTM. I will merge after CI passes again, as I just triggered a rebase onto the latest main on GitHub.

@jason810496 jason810496 added the area:dev-env (CI, pre-commit, pylint and other changes that do not change the behavior of the final code) and backport-to-v3-1-test (Mark PR with this label to backport to v3-1-test branch) labels Jan 31, 2026
@Abhishekmishra2808
Contributor Author

Thank you @jason810496 and @henry3260 for the guidance and reviews! Happy to see this stabilized.

@jason810496 jason810496 merged commit 8ac25dd into apache:main Jan 31, 2026
71 checks passed
@boring-cyborg

boring-cyborg bot commented Jan 31, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

github-actions bot pushed a commit that referenced this pull request Jan 31, 2026
…1070) (#61242)

* Fix flaky OTel integration test with DNS health check (#61070)

* Update airflow-core/tests/integration/otel/test_otel.py

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

---------
(cherry picked from commit 8ac25dd)

Co-authored-by: Abhishek Mishra <mishra.abhishek2808@gmail.com>
Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
@github-actions

Backport successfully created: v3-1-test

Status Branch Result
v3-1-test PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Jan 31, 2026
…ache#61070) (apache#61242)

* Fix flaky OTel integration test with DNS health check (apache#61070)

* Update airflow-core/tests/integration/otel/test_otel.py

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

---------
(cherry picked from commit 8ac25dd)

Co-authored-by: Abhishek Mishra <mishra.abhishek2808@gmail.com>
Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
shahar1 pushed a commit that referenced this pull request Jan 31, 2026
…1070) (#61242) (#61286)

* Fix flaky OTel integration test with DNS health check (#61070)

* Update airflow-core/tests/integration/otel/test_otel.py



---------
(cherry picked from commit 8ac25dd)

Co-authored-by: Abhishek Mishra <mishra.abhishek2808@gmail.com>
Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
morelgeorge pushed a commit to morelgeorge/airflow that referenced this pull request Feb 1, 2026
…pache#61242)

* Fix flaky OTel integration test with DNS health check (apache#61070)

* Update airflow-core/tests/integration/otel/test_otel.py

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

---------

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
shashbha14 pushed a commit to shashbha14/airflow that referenced this pull request Feb 2, 2026
…pache#61242)

* Fix flaky OTel integration test with DNS health check (apache#61070)

* Update airflow-core/tests/integration/otel/test_otel.py

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

---------

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
ephraimbuddy pushed a commit that referenced this pull request Feb 3, 2026
…1070) (#61242) (#61286)

* Fix flaky OTel integration test with DNS health check (#61070)

* Update airflow-core/tests/integration/otel/test_otel.py



---------
(cherry picked from commit 8ac25dd)

Co-authored-by: Abhishek Mishra <mishra.abhishek2808@gmail.com>
Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
potiuk pushed a commit that referenced this pull request Feb 3, 2026
* [v3-1-test] Add Keycloak token documentation to Security/API (#61228) (#61248)

(cherry picked from commit bb04b5d)

Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>

* [v3-1-test] Fix language selector state not updating on change (#61060) (#61263)

(cherry picked from commit 975cfe6)

* [v3-1-test] Clarify template context for asset-triggered DAGs in airflow-core docs (#61258) (#61282)

(cherry picked from commit f7aa502)

Co-authored-by: Rachana Dutta <rupss2105@gmail.com>
Co-authored-by: kevinhongzl <zhenlun.hong01@gmail.com>

* [v3-1-test] Fix flaky OTel integration test with DNS health check (#61070) (#61242) (#61286)

* Fix flaky OTel integration test with DNS health check (#61070)

* Update airflow-core/tests/integration/otel/test_otel.py



---------
(cherry picked from commit 8ac25dd)

Co-authored-by: Abhishek Mishra <mishra.abhishek2808@gmail.com>
Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

* [v3-1-test] Update pmc verification docs (#61271) (#61294)

* Update Helm Chart release instructions for PMC Checks

* Update KEY download instructions for PMC Checks

* Update dev/README_RELEASE_HELM_CHART.md
(cherry picked from commit c74b24a)

* [v3-1-test] update version for release command (#61260) (#61328)

(cherry picked from commit 7790482)

Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>

* CI: Upgrade important CI environment (#61327)

* [v3-1-test] Fix JWT token generation with unset issuer/audience config (#61278) (#61331)

* Fix JWT token generation with unset issuer/audience config
(cherry picked from commit a440d1d)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>

* [v3-1-test] Remove empty `apache_airflow_site.py` file (#61308)
(cherry picked from commit d65ff01)

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>
Co-authored-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
Co-authored-by: Rachana Dutta <rupss2105@gmail.com>
Co-authored-by: kevinhongzl <zhenlun.hong01@gmail.com>
Co-authored-by: Abhishek Mishra <mishra.abhishek2808@gmail.com>
Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
jason810496 pushed a commit to abhijeets25012-tech/airflow that referenced this pull request Feb 3, 2026
…pache#61242)

* Fix flaky OTel integration test with DNS health check (apache#61070)

* Update airflow-core/tests/integration/otel/test_otel.py

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

---------

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>
jhgoebbert pushed a commit to jhgoebbert/airflow_Owen-CH-Leung that referenced this pull request Feb 8, 2026
…pache#61242)

* Fix flaky OTel integration test with DNS health check (apache#61070)

* Update airflow-core/tests/integration/otel/test_otel.py

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

---------

Co-authored-by: Henry Chen <henryhenry0512@gmail.com>

Labels

area:dev-env: CI, pre-commit, pylint and other changes that do not change the behavior of the final code
backport-to-v3-1-test: Mark PR with this label to backport to v3-1-test branch

3 participants