A purpose-built repository for evaluating AI-powered flaky test detection and fix generation systems like UnfoldCI.
This repository contains intentionally flaky tests across four categories, each with known root causes and expected fix patterns. It's designed to evaluate:
- Detection accuracy - Can the system identify which tests are flaky?
- Root cause analysis - Can the system identify WHY tests are flaky?
- Fix quality - Are the generated fixes correct and comprehensive?
| Category | File | Failure Rate | Fix Difficulty | Expected AI Success |
|---|---|---|---|---|
| Order-Dependent | `test_order_dependent.py` | ~30% | Easy | 79% |
| Timing Issues | `test_timing_issues.py` | ~25% | Easy-Medium | 70% |
| Async Race Conditions | `test_async_race.py` | ~35% | Medium | 58% |
| Resource Cleanup | `test_resource_cleanup.py` | ~20% | Medium | 60% |
Root Cause: Tests share state through the Database singleton without proper isolation.
Symptoms:
- Tests pass when run individually but fail in certain orders
- Database contains unexpected data from previous tests
- Transaction counts don't match expectations
Expected Fix:

```python
@pytest.fixture(autouse=True)
def cleanup_database():
    from database import Database
    db = Database()
    db.clear_all()
    db.disconnect()
    yield
    db.clear_all()
    db.disconnect()
```

Root Cause: Hardcoded timeouts and timing assertions that don't account for system load variation.
Symptoms:
- Tests fail intermittently with "timeout" or "too slow" errors
- Assertions on response time fail during CI load spikes
- Fixed sleep durations cause race conditions
Expected Fixes:
- Replace fixed sleeps with polling/retry patterns
- Increase timeout margins (e.g., 100ms → 500ms)
- Use `wait_for_condition` with appropriate timeouts
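A `wait_for_condition` helper of the kind suggested above fits in a few lines (the name and signature are illustrative, not an existing API):

```python
import time


def wait_for_condition(predicate, timeout=5.0, interval=0.05):
    # Poll until predicate() is truthy or the timeout elapses.
    # Returns True on success, False on timeout, so tests can assert on it.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

Unlike a fixed `time.sleep`, the test finishes as soon as the condition holds and only pays the full timeout in the failing case.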
Root Cause: Missing locks, non-atomic operations, and uncontrolled concurrency in async code.
Symptoms:
- Request count doesn't match expected value
- Cache inconsistencies in concurrent access
- Results arrive in unexpected order
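The cache symptom is a check-then-act race: every `await` between the cache lookup and the cache write is a point where another task can interleave. A minimal sketch (hypothetical `UserClient`, not the repository's `api_client.py`):

```python
import asyncio


class UserClient:
    def __init__(self):
        self._cache = {}
        self.fetch_count = 0  # how many "network" fetches actually happened

    async def fetch_user(self, user_id: int):
        if user_id not in self._cache:       # check
            await asyncio.sleep(0)           # suspension point: other tasks interleave here
            self.fetch_count += 1            # act: duplicated under concurrency
            self._cache[user_id] = {"id": user_id}
        return self._cache[user_id]


async def main():
    client = UserClient()
    await asyncio.gather(*(client.fetch_user(1) for _ in range(5)))
    return client.fetch_count

# All five tasks pass the cache check before any of them writes,
# so fetch_count ends up greater than 1 even though one user was requested.
```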
Expected Fixes:

```python
# Add asyncio.Lock for shared state
self._cache_lock = asyncio.Lock()

async def fetch_user(self, user_id: int):
    async with self._cache_lock:
        if user_id in self._cache:
            return self._cache[user_id]
```

Root Cause: Missing teardown, class-level state pollution, improper file/lock handling.
Symptoms:
- Open file count accumulates across tests
- Temp files not cleaned up
- Lock files left behind
Expected Fixes:
- Add proper fixtures with cleanup
- Use context managers
- Clear class-level state in teardown
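The context-manager approach can be sketched with the standard library alone (the helper name is illustrative):

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def managed_tempfile(suffix=".txt"):
    # Create a temp file and guarantee removal even if the test body raises.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)
```

Wrapping the same body in a `@pytest.fixture` with `yield` gives the per-test cleanup that the fix list calls for.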
```bash
# Install dependencies
pip install -r requirements.txt

# Run all tests (some will fail - that's expected!)
pytest tests/ -v

# Run with random order to expose flakiness
pytest tests/ -p pytest_randomly -v

# Run multiple times to catch flaky failures
for i in {1..10}; do pytest tests/ -p pytest_randomly --randomly-seed=$i; done

# Run with JUnit output for UnfoldCI
pytest tests/ --junitxml=test-results/junit-results.xml
```

The repository includes a GitHub Actions workflow that:
- Runs tests 10 times in parallel with different random seeds
- Generates JUnit XML reports
- Uploads results to UnfoldCI for analysis
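Such a workflow could be shaped roughly like this (a sketch, not the repository's actual `unfoldci-test.yml`; the action versions, Python version, and artifact-upload step are assumptions):

```yaml
name: unfoldci-test
on: [push, pull_request]

jobs:
  flaky-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let every seed finish so all flakes are captured
      matrix:
        seed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: >
          pytest tests/ -p pytest_randomly
          --randomly-seed=${{ matrix.seed }}
          --junitxml=test-results/junit-${{ matrix.seed }}.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: junit-results-${{ matrix.seed }}
          path: test-results/
```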
When evaluating AI-generated fixes, check:
- Correctness - Does the fix actually prevent the flaky behavior?
- Completeness - Does it fix all instances of the pattern?
- Minimal Impact - Does it avoid changing unrelated code?
- Best Practices - Does it follow Python/pytest conventions?
| Score | Criteria |
|---|---|
| 5 | Perfect fix, follows best practices |
| 4 | Correct fix, minor style issues |
| 3 | Partially correct, fixes main issue |
| 2 | Incorrect but reasonable attempt |
| 1 | Wrong approach entirely |
```
flaky-test-eval/
├── src/
│   ├── __init__.py
│   ├── database.py              # Simulated database with singleton issues
│   ├── api_client.py            # API client with timing/async issues
│   └── file_handler.py          # File operations with cleanup issues
├── tests/
│   ├── __init__.py
│   ├── conftest.py              # Pytest fixtures (intentionally minimal)
│   ├── test_order_dependent.py  # Order-dependent tests
│   ├── test_timing_issues.py    # Timing-related tests
│   ├── test_async_race.py       # Async race condition tests
│   └── test_resource_cleanup.py # Resource cleanup tests
├── .github/
│   └── workflows/
│       └── unfoldci-test.yml    # CI workflow
├── pytest.ini
├── requirements.txt
└── README.md
```
This repository is designed for evaluation purposes. If you find additional flaky patterns that would be valuable to test, please open an issue or PR.
MIT License - Feel free to use this for your own testing tool evaluations.