Find the pytest test(s) that poison a flaky target.
You have a test that passes when you run it alone but fails as part of the full
suite. Some other test mutates global state — os.environ, a singleton, a
module-level cache, a database row, the current working directory, a registered
signal handler — and your target is the one that notices. flake-bisect
narrows the polluter down to a minimal set using delta-debugging over the
test ordering, so you stop guessing and start reading the right diff.
$ python -m flake_bisect --workdir examples/polluting_demo \
--target test_target.py::test_assumes_clean_env
flake-bisect 0.1.0
workdir : .../examples/polluting_demo
target : test_target.py::test_assumes_clean_env
Collecting tests...
Collected 8 tests (7 candidates).
Sanity check: target alone...
OK (passes alone)
Sanity check: target after full suite...
OK (target outcome: FAILED)
Bisecting 7 candidate predecessors...
Minimal poisoning set (1 test):
test_pollute.py::test_sets_env_flag
Reproduce locally:
pytest test_pollute.py::test_sets_env_flag test_target.py::test_assumes_clean_env
pytest invocations during bisect: 3 (cap: 200)
A naive linear search across N candidate predecessors would take up to N runs
of the suite. flake-bisect typically converges in O(log N) pytest
invocations when there is a single polluter, and stays sub-linear with a small
number of polluters.
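To make the run-count claim concrete, here is a minimal sketch of the single-polluter case as a plain halving search. It is a simplification for illustration, not the ddmin procedure flake-bisect uses; the function name `find_single_polluter` and the `target_fails_after` callback (a stand-in for "run this subset, then the target, and report whether the target failed") are hypothetical.

```python
from typing import Callable, Sequence


def find_single_polluter(
    candidates: Sequence[str],
    target_fails_after: Callable[[Sequence[str]], bool],
) -> tuple[str, int]:
    """Halve a non-empty candidate list until one test remains.

    Assumes exactly one polluter. Returns (polluter, number of probe runs).
    """
    remaining = list(candidates)
    runs = 0
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]
        runs += 1
        if target_fails_after(half):
            remaining = half  # the polluter is in the first half
        else:
            remaining = remaining[len(half):]  # it must be in the rest
    return remaining[0], runs


# Simulated suite: 7 candidates, one of which poisons the target.
candidates = [f"test_mod.py::test_{i}" for i in range(7)]
polluter = candidates[5]
found, runs = find_single_polluter(candidates, lambda subset: polluter in subset)
print(found, runs)
```

With 7 candidates the search never needs more than 3 probes, whereas a linear scan could need up to 7.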
The Python testing & debugging community has converged on a clear playbook for
non-flaky suites: ban global state in tests, use fixtures with monkeypatch,
isolate the DB per test, run with pytest-randomly in CI to
surface ordering bugs early. The hard part is what to do when CI catches one.
The failing line tells you what broke; it never tells you who set the
landmine 200 tests earlier.
flake-bisect does that last mile: given a known-flaky target, it points at
the test that poisons it.
flake-bisect is a self-contained Python package with no third-party
dependencies. It needs pytest available in the same Python environment as
the project you're bisecting (it shells out to python -m pytest).
Clone the repo and run from source:
git clone https://github.com/python-testing-debugging/flake-bisect.git
cd flake-bisect
python -m flake_bisect --help

To use it against your own project, point --workdir at your project root and
add flake-bisect to PYTHONPATH so the module is importable:
PYTHONPATH=/path/to/flake-bisect python -m flake_bisect \
--workdir /path/to/your/project \
--target tests/test_widgets.py::test_render_safely

Or run it from inside the flake-bisect checkout with an absolute --workdir.
| Flag | Purpose |
|---|---|
| --target | Required. The flaky test's nodeid as pytest reports it. |
| --testpaths | Limit candidate predecessors to specific paths (otherwise full suite). |
| --workdir | Run pytest from this directory (default: cwd). |
| --max-runs | Hard cap on pytest invocations during bisect (default: 200). |
| --pytest-arg | Forward an arg to every pytest invocation. Repeat to pass multiple. |
| -v | Show per-iteration progress. |
If your project needs particular pytest options to even collect (a -p plugin,
-o override, marker filter, etc.), forward them with repeated --pytest-arg:
python -m flake_bisect \
--target tests/test_x.py::test_y \
--pytest-arg -o --pytest-arg "addopts=" \
--pytest-arg -m --pytest-arg "not slow"

- Collect all nodeids in the suite via pytest --collect-only.
- Sanity check 1: run the target alone; bail out if it fails (then the issue isn't ordering, it's the test itself).
- Sanity check 2: run […all other tests…, target] in order; bail out if the target passes (no reproducible pollution to bisect).
- Delta-debug the predecessor list with Zeller's ddmin. Each candidate subset is run as pytest <subset…> <target> in a fresh subprocess, with collection order pinned by a bundled internal plugin so the result doesn't depend on pytest-randomly or alphabetical ordering surprises.
- Report the minimal subset that still reproduces the failure plus a copy-pasteable pytest command to reproduce locally.
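The delta-debugging step can be sketched as a generic textbook ddmin over an ordered list. This is an illustration under assumptions, not flake-bisect's actual code: the names `ddmin` and `split` are chosen here, and the `reproduces` callback stands in for running pytest <subset…> <target> in a subprocess and checking whether the target failed. It assumes the full list already reproduces the failure, as sanity check 2 guarantees.

```python
from typing import Callable, List, Sequence


def split(items: Sequence[str], n: int) -> List[List[str]]:
    """Split items into n contiguous, order-preserving, near-equal chunks."""
    k, m = divmod(len(items), n)
    return [
        list(items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)])
        for i in range(n)
    ]


def ddmin(items: Sequence[str], reproduces: Callable[[List[str]], bool]) -> List[str]:
    """Shrink items to a 1-minimal sublist for which reproduces() is still True."""
    items = list(items)
    n = 2  # current granularity: number of chunks
    while len(items) >= 2:
        chunks = split(items, n)
        reduced = False
        # Try each chunk on its own.
        for chunk in chunks:
            if reproduces(chunk):
                items, n, reduced = chunk, 2, True
                break
        # Otherwise try removing one chunk at a time.
        if not reduced:
            for i in range(len(chunks)):
                rest = [x for j, c in enumerate(chunks) if j != i for x in c]
                if reproduces(rest):
                    items, n, reduced = rest, max(n - 1, 2), True
                    break
        # Nothing worked: refine the granularity, or stop at single tests.
        if not reduced:
            if n >= len(items):
                break
            n = min(n * 2, len(items))
    return items


# Simulated suite where the target fails only after BOTH t1 and t6 have run.
suite = [f"t{i}" for i in range(8)]
minimal = ddmin(suite, lambda subset: "t1" in subset and "t6" in subset)
print(minimal)
```

Order is preserved throughout, which matters here because the subsets are test orderings rather than unordered sets.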
Determinism note: flake-bisect cannot help with flakes caused by time,
threads, networking, or RNG without a fixed seed. Those are not ordering
bugs. If sanity check 2 doesn't reproduce the failure deterministically, the
bug is somewhere else and the CLI will say so.
| Code | Meaning |
|---|---|
| 0 | Bisect completed; poisoning set printed. |
| 2 | Collection problem (no tests, target nodeid not found, ...). |
| 3 | Target fails when run alone — not an ordering issue. |
| 4 | Target passes in the full ordered run — no pollution reproduced. |
| 5 | --max-runs budget exhausted. |
These are stable; CI can branch on them.
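A CI wrapper might branch on these codes roughly as follows. This is a minimal sketch: `describe_exit` and `run_bisect` are hypothetical helper names, and only the --workdir and --target flags documented above are used.

```python
import subprocess
import sys

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "bisect completed; poisoning set printed",
    2: "collection problem",
    3: "target fails alone; not an ordering issue",
    4: "target passes in the full ordered run; no pollution reproduced",
    5: "--max-runs budget exhausted",
}


def describe_exit(code: int) -> str:
    """Map a flake-bisect exit code to a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_bisect(target: str, workdir: str = ".") -> int:
    """Run flake-bisect as a subprocess and return its exit code."""
    cmd = [
        sys.executable, "-m", "flake_bisect",
        "--workdir", workdir,
        "--target", target,
    ]
    return subprocess.run(cmd).returncode
```

A CI step could call `run_bisect(...)` with the flaky nodeid, log `describe_exit(code)`, and treat 3 and 4 as "not an ordering bug" rather than as a tool failure.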
The examples/polluting_demo/ directory contains a small suite (8 tests in the sample run above) with one
polluter and one target. Use it to verify the tool runs in your environment:
python -m flake_bisect \
--workdir examples/polluting_demo \
--target test_target.py::test_assumes_clean_env

You should see test_pollute.py::test_sets_env_flag named as the culprit.
Deeper material on flaky tests, pytest internals, isolation patterns, and the delta-debugging algorithm lives at python-testing-debugging.com.