
feat: Mutation caching and transitive dependency tracking#509

Open
nicklafleur wants to merge 3 commits into boxed:main from lyft:nicklafleur/function_hashing

Conversation

@nicklafleur
Collaborator

@nicklafleur nicklafleur commented Apr 26, 2026

Summary

Adds incremental mutation testing to mutmut by skipping mutants in unchanged code, with transitive invalidation via a runtime call graph. On re-runs, only mutants in functions whose source (or whose dependencies' source) changed are re-tested.

High-level

  • Incremental mutation testing cuts mutation run duration roughly linearly with the ratio of code changed (the less code has changed, the faster the run).
    • In practice, on large codebases this means a >95% reduction in runtime on average, since the amount of unchanged code far outweighs the amount changed.
    • Utility functions are particularly susceptible to "cache busting": even a no-op syntactic change that modifies the AST invalidates every call chain that relies on them (technically correct, since the code did change, but something to be aware of).
  • UI support will come in a future PR

Commit Breakdown:

  1. feat: add function hashing for incremental mutation testing
  • Foundation: hash each function's source (SHA-256, 12 chars) and persist via hash_by_function_name on SourceFileMutationData. On subsequent runs, compare old vs. new hashes and reset mutant results to None for changed functions only.
  • Introduces MutationMetadata (line number, mutation type, human-readable description) carried on every Mutation and serialized to JSON, plus an OPERATOR_TO_TYPE mapping and helpers (_determine_mutation_type, _describe_mutation).
  • Tightens naming conventions (private helpers prefixed with _, MUTATION_OPERATORS constant) and adds explicit type annotations.
  • New e2e_projects/benchmark_1k/ project (1000 mutants across a broad range of mutation types) with configurable delays via BENCHMARK_IMPORT_DELAY, BENCHMARK_CONFTEST_DELAY, and BENCHMARK_TEST_DELAY.
  2. refactor: relocate formatting utils
  • Consolidates formatting helpers previously scattered across __main__.py, file_mutation.py, and trampoline_templates.py into src/mutmut/utils/format_utils.py.
  • Pure code move with no behavior change; tests updated to import from the new location.
  3. feat: Add dependency tracking with function hash persistence
    • Builds the transitive invalidation layer on top of (1):
    • New MutmutState dataclass + state() singleton (state.py) consolidating old_function_hashes, current_function_hashes, and function_dependencies instead of leaking module-level globals.
    • New core.py with MutmutCallStack (backed by ContextVar for async/thread safety) and a relocated record_trampoline_hit that now records caller→callee edges during stats collection.
    • Trampoline updated to track call depth and emit dependency edges.
    • load_stats/save_stats extended to persist function_hashes and dependencies in mutmut-stats.json.
    • _cleanup_stale_stats and _invalidate_stale_dependency_edges prune state for functions that no longer exist or whose hashes changed, and transitively invalidate callers of changed callees.
    • New config options: track_dependencies and dependency_tracking_depth.
    • README/docs updated to describe the feature.
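The transitive invalidation step described in the bullets above can be sketched as a reverse-edge traversal over the recorded caller→callee graph. This is a hypothetical illustration (function and argument names are invented, not the actual _invalidate_stale_dependency_edges implementation):

```python
from collections import deque


def transitively_invalidated(
    changed: set[str], dependencies: dict[str, set[str]]
) -> set[str]:
    """Given caller -> callees edges, return every function whose cached
    mutant results must be reset: the changed functions plus all their
    transitive callers. Sketch only; mutmut's real logic may differ."""
    # Build reverse edges: callee -> set of callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in dependencies.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    invalid = set(changed)
    queue = deque(changed)
    while queue:
        fn = queue.popleft()
        # Anyone who calls an invalidated function is also invalidated.
        for caller in callers.get(fn, ()):
            if caller not in invalid:
                invalid.add(caller)
                queue.append(caller)
    return invalid
```

Any function on a call path into a changed function gets its cached results reset, while disconnected parts of the graph keep their cache.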

Known Issues

  • Because we only track dependencies at runtime through the trampoline logic, un-mutated functions are omitted from the dependency graph. The graph covers calls among mutated functions, not the global call graph.
  • We end up looping over all walkable files a few times, pushing time complexity higher than before. This is still a smaller penalty than the caching gain, but definitely something that can be improved.
  • The "cache" is currently a JSON file, which is horrifically inefficient for the sparse reads/writes typical of this workflow; moving to an SQLite-based store of the state could unlock significant storage and parallelism gains.
    • I have a follow-up PR that will branch out into different forking strategies that could be extended to include easy hookups for this kind of reporting strategy.

This commit implements function-level hashing to skip re-testing unchanged
mutants, along with fixes for mypy type errors and architectural improvements.

A follow-up commit will implement transitive invalidation of mutants based on
function call graphs and the new hashing mechanism.

INCREMENTAL MUTATION TESTING
- Add _compute_function_hashes() in file_mutation.py to generate SHA-256 hashes
  (truncated to 12 chars) for each mutated function's source code
- Store hash_by_function_name in SourceFileMutationData for persistence
- On subsequent runs, compare old vs new hashes to identify changed functions
- Reset mutant results to None (needs re-testing) when function hash changes
- Return changed_functions and current_hashes from create_mutants_for_file()

MUTATION METADATA TRACKING
- Add MutationMetadata dataclass with line_number, mutation_type, and description
- Each Mutation now carries metadata about what changed and where
- Add OPERATOR_TO_TYPE mapping to categorize mutations (number, string, boolean, etc.)
- Add _determine_mutation_type() to disambiguate operator categories
- Add _describe_mutation() for human-readable mutation descriptions
- Serialize/deserialize metadata to JSON via to_dict()/from_dict()
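As a sketch, the metadata described above could look like the following (field names are taken from the bullet list; the actual dataclass may differ):

```python
from dataclasses import dataclass, asdict


@dataclass
class MutationMetadata:
    """Per-mutation metadata: where the mutation is, what category it
    belongs to, and a human-readable summary. Illustrative sketch."""
    line_number: int
    mutation_type: str   # e.g. "number", "string", "boolean"
    description: str     # e.g. "replaced 1 with 2"

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "MutationMetadata":
        return cls(**d)
```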

NAMING AND CONVENTIONS
- Rename public functions to private (_create_mutations, _combine_mutations_to_source, etc.)
- Rename mutation_operators to MUTATION_OPERATORS (constant naming convention)
- Add explicit type annotations throughout (dict[str, MutationMetadata], etc.)

NEW BENCHMARK PROJECT
- Add e2e_projects/benchmark_1k/ with ~1000 mutants for testing
- Includes modules: numbers, strings, booleans, operators, comparisons,
  arguments, returns, complex (recursion, higher-order functions)
- Configurable delays via BENCHMARK_IMPORT_DELAY, BENCHMARK_CONFTEST_DELAY,
  BENCHMARK_TEST_DELAY environment variables

Introduce MutmutState class to more easily manage runtime state for dependency
tracking (old_function_hashes, current_function_hashes, function_dependencies).
Persist hashes and dependencies to mutmut-stats.json for incremental runs.

Changes:
- Add state.py with MutmutState dataclass and state() singleton accessor
- Add core.py with MutmutCallStack (ContextVar-based) for async-safe tracking
- Move record_trampoline_hit to core.py, now tracks caller->callee edges
- Update trampoline to track call depth and record dependencies during stats
- Extend load_stats/save_stats to persist function_hashes and dependencies
- Add _cleanup_stale_stats and _invalidate_stale_dependency_edges functions
- Add track_dependencies and dependency_tracking_depth config options
- Update documentation describing the dependency tracking feature
@nicklafleur nicklafleur changed the title Nicklafleur/function hashing feat: Mutation caching and transitive dependency tracking Apr 26, 2026
@Otto-AA
Collaborator

Otto-AA commented May 1, 2026

Hi, thanks for the PR, I think this will improve working with mutmut in general :)

I think I would fix #477 before taking the time to review this PR (because I think it would be nice to fix the regression some time soon, also because I'd like to unify the external / "normal" method injection setup a bit to reduce complexity, and tbh also because currently I'm more in the mood of writing code myself rather than reviewing, as I only spend little time on open source currently).

Some initial thoughts on this PR:

I guess a (reasonable) limitation is that caching will only notice changes within functions/methods. So none of the following would trigger mutant reruns:

  • external library changes (dependency updates)
  • configuration changes (pyproject.toml, yaml files, etc.)
  • data file changes (my_query.sql, etc.)
  • import-time code changes (dataclass/pydantic model change, import statements, etc.)

All of these cannot be tied to some function/method, so we would need some other system than callstacks for tracking dependencies. I think it's fair to say these are out of scope.

What happens when mutmut configs change? e.g. in the first run we set the filter to only mutate some files and in the next run other files? Or we add a new pytest flag. Should we simply keep the cache, or clear it, or ask the user?

Introduces MutationMetadata (line number, mutation type, human-readable description) carried on every Mutation and serialized to JSON, plus an OPERATOR_TO_TYPE mapping and helpers (_determine_mutation_type, _describe_mutation).

Is this relevant to caching or an additional feature? The _describe_mutation method feels like the git diff of the mutmut browse

@nicklafleur
Collaborator Author

nicklafleur commented May 1, 2026

Yeah, #477 and the unification of the trampoline patterns seem like great candidates to merge before this work. The dependency change thing is something that I don't really have a great answer to. My personal view here is that generally people should be proactive about doing full reruns when making big library changes, but having a "false cache" is definitely not the kind of thing that most people would clue into.

The naive approach would be to detect these things in some way and simply force a rerun in those cases, which is effectively the status quo today so there's no regression in that sense.

The mutation metadata is something I've been kinda messing with in the context of LLM-driven testing. There's been a big industry push to having unit tests be written by AI, but there isn't really a mechanism to give AI meaningful feedback on the quality of passing tests. One can imagine that a math-focused lib may want to kill all calculation-based/boolean mutants but not care as much for string mutations for example. Having this kind of metadata is what would be needed to be able to filter for/express this data.

I believe (I'd have to go back and check, been a while since I made the changes) that I've included this information in my updates to the browser in the TMP branch, but I'll be sure to include that if not.

On a more general note, if you're having reviewer burnout, please take some time to just do some code changes. I've been blessed by @boxed as a collaborator, and will be happy to take on the review burden of your (and others') changes in the short term and leave mine to sit on the sidelines for a bit. You've reviewed more than enough of my code to have earned that break, especially given the size and density of my changes :). Though if you have the opportunity to test out this branch to get a feel for the speed increases and workflows, I would love to hear your hands-on experience.

@Otto-AA
Collaborator

Otto-AA commented May 2, 2026

On a more general note, if you're having reviewer burnout, please take some time to just do some code changes. I've been blessed by @boxed as a collaborator, and will be happy to take on the review burden of your (and others') changes in the short term and leave mine to sit on the sidelines for a bit. You've reviewed more than enough of my code to have earned that break, especially given the size and density of my changes :). Though if you have the opportunity to test out this branch to get a feel for the speed increases and workflows, I would love to hear your hands-on experience.

Thanks for your offer ❤️ I am already taking it slow, only looking at open source a few times a month and then doing only the work I feel happy doing right now. Regarding reviewing other PRs, feel free to do so but no pressure. You could also review and ask me if there are any open questions.

My personal view here is that generally people should be proactive about doing full reruns when making big library changes, but having a "false cache" is definitely not the kind of thing that most people would clue into.
The naive approach would be to detect these things in some way and simply force a rerun in those cases, which is effectively the status quo today so there's no regression in that sense.

If we want to pull in git as a dependency, we could:

  • on a full run: store the commit hash (+ changes? not sure if that's possible)
  • on a cached run
    • make a git diff to the last full run
    • inform the users about changed non-python files

So something like this (just a first idea, feel free to redesign):

# initial full run
mutmut run

# modify some files
vim src/main.py
vim src/config.yml
vim pyproject.toml

# partially cached run
mutmut run
[info] following files changed since the last full run, but cannot be tracked for changes:
[info] src/config.yml pyproject.toml (not displaying src/main.py, because we track changes there)
[info] Consider clearing the mutants cache if the changes are relevant for your tests

This would help the external files issue. I think only the import-time code caching would still be a blind spot.
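A rough sketch of the detection step, assuming git is available and the commit hash of the last full run was stored (all names here are hypothetical, not an existing mutmut API):

```python
import subprocess


def changed_files_since(last_full_run_commit: str) -> list[str]:
    """Files changed between the recorded full-run commit and HEAD.
    Assumes git is on PATH and the cwd is inside the repository."""
    out = subprocess.run(
        ["git", "diff", "--name-only", last_full_run_commit, "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f]


def untracked_changes(changed: list[str]) -> list[str]:
    # Only non-.py files need a warning: .py changes are already
    # covered by the function-hash cache.
    return [f for f in changed if not f.endswith(".py")]
```

The warning from the example output above would then just print `untracked_changes(changed_files_since(stored_commit))`.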

I already previously thought about using git archive to set up the mutants directory, instead of the source_paths and also_include configs. So maybe adding git as an (optional?) dependency could be nice anyway.

Also somewhat related is the git option by infection: https://infection.github.io/guide/command-line-options.html#git-diff-filter (probably useful for CI; could be added in addition to this PR imo)

The mutation metadata is something I've been kinda messing with in the context of LLM-driven testing. There's been a big industry push to having unit tests be written by AI, but there isn't really a mechanism to give AI meaningful feedback on the quality of passing tests. One can imagine that a math-focused lib may want to kill all calculation-based/boolean mutants but not care as much for string mutations for example. Having this kind of metadata is what would be needed to be able to filter for/express this data.

I've been thinking about a setting to enable/disable specific types of mutations (disable_mutation_operators = [ 'string.case', 'number' ] or something like this), maybe that would be helpful for this use case as well? Though the mutation operators are also changing more frequently, so the identifiers are probably not 100% stable.
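As a sketch, such a setting could be matched hierarchically, so a prefix like "string" disables everything beneath it (purely illustrative; disable_mutation_operators is a proposal here, not an existing mutmut option):

```python
def is_disabled(operator_id: str, disabled: list[str]) -> bool:
    """True if operator_id matches a disabled entry exactly, or falls
    under a disabled prefix ("string" disables "string.case" etc.)."""
    return any(
        operator_id == d or operator_id.startswith(d + ".")
        for d in disabled
    )
```

Prefix matching would soften the identifier-stability concern a little: renaming "string.case" to "string.casing" still gets caught by a "string" entry.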

I haven't given a lot of thought yet to how mutmut can be used by agents. I'd guess the git diff could work well (diffing the old and new function), and we could also output a short description in the mutation operators in node_mutation.py. But I'm pretty sure you have more AI experience, so take it just as input :)

@nicklafleur
Collaborator Author

nicklafleur commented May 2, 2026

Thanks for your offer ❤️ I am already taking it slow, only looking at open source a few times a month and then doing only the work I feel happy doing right now. Regarding reviewing other PRs, feel free to do so but no pressure. You could also review and ask me if there are any open questions.

Glad to hear you're prioritizing yourself, I've been merging the easy ones like dependabot, I plan on checking out some of the more recent ones without conflicts and potentially poking the older ones for signs of life.

I've been thinking about a setting to enable/disable specific types of mutations (disable_mutation_operators = [ 'string.case', 'number' ] or something like this), maybe that would be helpful for this use case as well? Though the mutation operators are also changing more frequently, so the identifiers are probably not 100% stable.

I haven't given a lot of thought yet to how mutmut can be used by agents. I'd guess the git diff could work well (diffing the old and new function), and we could also output a short description in the mutation operators in node_mutation.py. But I'm pretty sure you have more AI experience, so take it just as input :)

Having agents use mutmut is actually a big reason why I worked on the caching. For the workflow loop to be somewhat reasonable for our larger repos we needed to bring the runtime as low as possible so that it could be driven by subagent-type flows. I figured that diff style workflows are a large part of modern agent training data so wanted to make use of that in the way we report uncaught mutations to the LLMs instead of the json results which would require a lot of parsing and token spend to extract semantic meaning.

on a cached run

  • make a git diff to the last full run
  • inform the users about changed non-python files

That's an interesting idea, we could pretty reliably capture most typical python configs (toml, reqs.txt, manifests, etc) and potentially even offer a mechanism for people to register their own in case they have some custom internal tooling. That way it assumes that no change happened (bumping a lib patch version practically never affects behaviour in a meaningful way) while also avoiding a completely silent pass.

btw, I plan on taking on #404 sometime soon, just need to set it up on my personal setup and I'll get a working windows impl that doesn't require wsl.

@Otto-AA
Collaborator

Otto-AA commented May 3, 2026

That's an interesting idea, we could pretty reliably capture most typical python configs (toml, reqs.txt, manifests, etc) and potentially even offer a mechanism for people to register their own in case they have some custom internal tooling. That way it assumes that no change happened (bumping a lib patch version practically never affects behaviour in a meaningful way) while also avoiding a completely silent pass.

I think simply informing the user about changed files (excluding ones ending with .py) would be good enough. Usually not many files change, so the user should be able to decide if that's worth a full re-run or they want to continue with cached runs.

btw, I plan on taking on #404 sometime soon, just need to set it up on my personal setup and I'll get a working windows impl that doesn't require wsl.

The main reason I discontinued working on this is that re-using workers from a pool is more brittle. If I run mutant A in a process and this mutant breaks some global setup, then running mutant B in the same process will produce wrong results. The fork method executes each mutant in its own sandbox, so if mutant A breaks some global setup, mutant B won't be affected by it.

@boxed
Owner

boxed commented May 3, 2026

A method to handle the brittleness is to ensure that a full test run runs cleanly inside the recycled worker before it gets a new process, but I think that will destroy the performance gains anyway. I just don't see how to get away from using fork and keep all the upsides.
