feat: Mutation caching and transitive dependency tracking #509

nicklafleur wants to merge 3 commits into boxed:main from
Conversation
This commit implements function-level hashing to skip re-testing unchanged mutants, along with fixes for mypy type errors and architectural improvements. A follow-up commit will implement transitive invalidation of mutants based on function call graphs and the new hashing mechanism.

INCREMENTAL MUTATION TESTING
- Add _compute_function_hashes() in file_mutation.py to generate SHA-256 hashes (truncated to 12 chars) for each mutated function's source code
- Store hash_by_function_name in SourceFileMutationData for persistence
- On subsequent runs, compare old vs new hashes to identify changed functions
- Reset mutant results to None (needs re-testing) when function hash changes
- Return changed_functions and current_hashes from create_mutants_for_file()

MUTATION METADATA TRACKING
- Add MutationMetadata dataclass with line_number, mutation_type, and description
- Each Mutation now carries metadata about what changed and where
- Add OPERATOR_TO_TYPE mapping to categorize mutations (number, string, boolean, etc.)
- Add _determine_mutation_type() to disambiguate operator categories
- Add _describe_mutation() for human-readable mutation descriptions
- Serialize/deserialize metadata to JSON via to_dict()/from_dict()

NAMING AND CONVENTIONS
- Rename public functions to private (_create_mutations, _combine_mutations_to_source, etc.)
- Rename mutation_operators to MUTATION_OPERATORS (constant naming convention)
- Add explicit type annotations throughout (dict[str, MutationMetadata], etc.)

NEW BENCHMARK PROJECT
- Add e2e_projects/benchmark_1k/ with ~1000 mutants for testing
- Includes modules: numbers, strings, booleans, operators, comparisons, arguments, returns, complex (recursion, higher-order functions)
- Configurable delays via BENCHMARK_IMPORT_DELAY, BENCHMARK_CONFTEST_DELAY, BENCHMARK_TEST_DELAY environment variables
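The function-hashing step could be sketched roughly as follows. This is illustrative only: `compute_function_hashes` is a hypothetical stand-in for the PR's `_compute_function_hashes()`, and the `ast`-based source extraction is an assumption, not necessarily how the commit implements it.

```python
import ast
import hashlib


def compute_function_hashes(source: str) -> dict[str, str]:
    """Map each function name to a SHA-256 hash of its source,
    truncated to 12 chars (as the commit message describes)."""
    hashes: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Recover the exact source text of this function definition.
            segment = ast.get_source_segment(source, node) or ""
            hashes[node.name] = hashlib.sha256(segment.encode()).hexdigest()[:12]
    return hashes
```

On a re-run, comparing the persisted hashes against freshly computed ones yields the set of changed functions, whose mutant results would then be reset to `None` for re-testing.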
Introduce MutmutState class to more easily manage runtime state for dependency tracking (old_function_hashes, current_function_hashes, function_dependencies). Persist hashes and dependencies to mutmut-stats.json for incremental runs.

Changes:
- Add state.py with MutmutState dataclass and state() singleton accessor
- Add core.py with MutmutCallStack (ContextVar-based) for async-safe tracking
- Move record_trampoline_hit to core.py, now tracks caller->callee edges
- Update trampoline to track call depth and record dependencies during stats
- Extend load_stats/save_stats to persist function_hashes and dependencies
- Add _cleanup_stale_stats and _invalidate_stale_dependency_edges functions
- Add track_dependencies and dependency_tracking_depth config options
- Update documentation describing the dependency tracking feature
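A `ContextVar`-backed call stack for recording caller→callee edges might look like this minimal sketch. The names (`entering`, the module-level `dependencies` dict) are illustrative, not the PR's actual `MutmutCallStack` API.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Each async task / thread context sees its own stack tuple.
_stack: ContextVar[tuple[str, ...]] = ContextVar("mutmut_stack", default=())

# Observed runtime edges: caller -> set of callees.
dependencies: dict[str, set[str]] = {}


@contextmanager
def entering(function_name: str):
    """Record a caller->callee edge, then push the callee so that
    nested calls see it as their caller."""
    stack = _stack.get()
    if stack:
        dependencies.setdefault(stack[-1], set()).add(function_name)
    token = _stack.set(stack + (function_name,))
    try:
        yield
    finally:
        _stack.reset(token)
```

Because `ContextVar` values are per-context, concurrent coroutines each see their own stack, so interleaved async test code cannot corrupt another task's caller information the way a plain global list would.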
Hi, thanks for the PR, I think this will improve working with mutmut in general :) I think I would fix #477 before taking the time to review this PR (because I think it would be nice to fix the regression some time soon, also because I'd like to unify the external / "normal" method injection setup a bit to reduce complexity, and tbh also because currently I'm more in the mood of writing code myself rather than reviewing, as I only spend little time on open source currently).

Some initial thoughts on this PR: I guess a (reasonable) limitation is that caching will only notice changes within functions/methods. So all of the following would not trigger mutant reruns:
All of these cannot be tied to some function/method, so we would need some other system than callstacks for tracking dependencies. I think it's fair to say these are out of scope. What happens when mutmut configs change? e.g. in the first run we set the filter to only mutate some files and in the next run other files? Or we add a new pytest flag. Should we simply keep the cache, or clear it, or ask the user?
Is this relevant to caching or an additional feature? The
yeah #477 and the unification of the trampoline patterns seem like great candidates to merge before this work. The dependency change thing is something that I don't really have a great answer to. My personal view here is that generally people should be proactive about doing full reruns when making big library changes, but having a "false cache" is definitely not the kind of thing that most people would clue into. The naive approach would be to detect these things in some way and simply force a rerun in those cases, which is effectively the status quo today, so there's no regression in that sense.

The mutation metadata is something I've been kinda messing with in the context of LLM-driven testing. There's been a big industry push to having unit tests be written by AI, but there isn't really a mechanism to give AI meaningful feedback on the quality of passing tests. One can imagine that a math-focused lib may want to kill all calculation-based/boolean mutants but not care as much for string mutations, for example. Having this kind of metadata is what would be needed to be able to filter for/express this data. I believe (I'd have to go back and check, been a while since I made the changes) that I've included this information in my updates to the browser in the TMP branch, but I'll be sure to include that if not.

On a more general note, if you're having reviewer burnout please take some time to just do some code changes. I've been blessed by @boxed as a collaborator, and will be happy to take on the review burden of your (and others') changes in the short term and leave mine to sit on the sidelines for a bit. You've reviewed more than enough of my code to have earned that break, especially given the size and density of my changes :), though if you have the opportunity to test out this branch to get a feel for the speed increases and workflows, I would love to hear your hands-on experience.
Thanks for your offer ❤️ I am already taking it slow, only looking at open source a few times a month and then doing only the work I feel happy doing right now. Regarding reviewing other PRs, feel free to do so but no pressure. You could also review and ask me if there are any open questions.
If we want to pull in git as a dependency, we could:
So something like this (just a first idea, feel free to redesign):

```shell
# initial full run
mutmut run

# modify some files
vim src/main.py
vim src/config.yml
vim pyproject.toml

# partially cached run
mutmut run
[info] following files changed since the last full run, but cannot be tracked for changes:
[info] src/config.yml pyproject.toml (not displaying src/main.py, because we track changes there)
[info] Consider clearing the mutants cache if the changes are relevant for your tests
```

This would help the external files issue. I think only the import-time code caching would still be a blind spot, which I already previously thought about. Also somewhat related is the git option by infection: https://infection.github.io/guide/command-line-options.html#git-diff-filter (probably useful for CI; could be added in addition to this PR imo)
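The "files changed since the last full run" idea above could lean on plain `git diff`. A rough sketch, where the function names and the split between hash-trackable `.py` files and everything else are assumptions for illustration:

```python
import subprocess


def changed_files_since(commit: str) -> list[str]:
    """List files changed since `commit` in the current git repo.
    A real integration would also need to store which commit the
    last full run happened at, and handle non-git checkouts."""
    out = subprocess.run(
        ["git", "diff", "--name-only", commit],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def untracked_changes(changed: list[str]) -> list[str]:
    """Files we cannot track at function level (non-Python), which
    would be reported in the [info] warning above."""
    return [f for f in changed if not f.endswith(".py")]
```

In the example run above, `untracked_changes` would surface `src/config.yml` and `pyproject.toml` while staying silent about `src/main.py`, which the function hashes already cover.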
I've been thinking about a setting to enable/disable specific types of mutations. I haven't given a lot of thought yet to how mutmut can be used by agents. I'd guess the git diff could work well (diffing the old and new function), and we could also output a short description in the mutation operators in node_mutation.py. But I'm pretty sure you have more AI experience, so take it just as input :)
Glad to hear you're prioritizing yourself, I've been merging the easy ones like dependabot, I plan on checking out some of the more recent ones without conflicts and potentially poking the older ones for signs of life.
Having agents use mutmut is actually a big reason why I worked on the caching. For the workflow loop to be somewhat reasonable for our larger repos we needed to bring the runtime as low as possible so that it could be driven by subagent-type flows. I figured that diff style workflows are a large part of modern agent training data so wanted to make use of that in the way we report uncaught mutations to the LLMs instead of the json results which would require a lot of parsing and token spend to extract semantic meaning.
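Rendering an uncaught mutant as a unified diff instead of raw JSON is cheap with the stdlib. A small illustrative helper, not mutmut's actual reporting code:

```python
import difflib


def mutant_as_diff(original: str, mutated: str, path: str) -> str:
    """Render a surviving mutant as a unified diff, the patch-style
    format LLMs see heavily in training data, rather than JSON
    that needs extra parsing and token spend to interpret."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        mutated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))
```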
That's an interesting idea, we could pretty reliably capture most typical python configs (toml, reqs.txt, manifests, etc) and potentially even offer a mechanism for people to register their own in case they have some custom internal tooling. That way it assumes that no change happened (bumping a lib patch version practically never affects behaviour in a meaningful way) while also avoiding a completely silent pass. btw, I plan on taking on #404 sometime soon, just need to set it up on my personal setup and I'll get a working windows impl that doesn't require wsl.
I think simply informing the user about changed files (excluding ones ending with
The main reason I discontinued working on this is that re-using workers from a pool is more brittle to errors. If I run mutant A in a process and this mutant breaks some global setup, then running mutant B in the process will produce wrong results. The
A method to handle the brittleness is to ensure that a full test run runs cleanly inside the recycled worker before it gets a new process, but I think that will destroy the performance gains anyway. I just don't see how to get away from using fork and keep all the upsides.
Summary
Adds incremental mutation testing to mutmut by skipping mutants in unchanged code, with transitive invalidation via a runtime call graph. On re-runs, only mutants in functions whose source (or whose dependencies' source) changed are re-tested.
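The transitive invalidation described above amounts to a reverse reachability walk over the recorded caller→callee edges. A minimal sketch (names are assumptions, not the PR's actual code):

```python
from collections import deque


def invalidate_transitively(changed: set[str],
                            dependencies: dict[str, set[str]]) -> set[str]:
    """Given caller->callees edges and the directly changed functions,
    return every function whose mutants need re-testing: the changed
    ones plus all of their transitive callers."""
    # Build reverse edges: callee -> callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in dependencies.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # BFS upward through the call graph from every changed function.
    stale = set(changed)
    queue = deque(changed)
    while queue:
        fn = queue.popleft()
        for caller in callers.get(fn, ()):
            if caller not in stale:
                stale.add(caller)
                queue.append(caller)
    return stale
```

For example, with edges a→b, b→c, and d→c, a change in `c` invalidates `b`, `d`, and (transitively through `b`) `a` as well.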
High-level
Commit Breakdown:
feat: add function hashing for incremental mutation testing
- Stores hash_by_function_name on SourceFileMutationData. On subsequent runs, compare old vs. new hashes and reset mutant results to None for changed functions only.
- Adds MutationMetadata (line number, mutation type, human-readable description) carried on every Mutation and serialized to JSON, plus an OPERATOR_TO_TYPE mapping and helpers (_determine_mutation_type, _describe_mutation).
- Naming cleanups (private _ prefixes, MUTATION_OPERATORS constant) and adds explicit type annotations.
- New e2e_projects/benchmark_1k/ project (1000 mutants across a broad range of mutation types) with configurable delays via BENCHMARK_IMPORT_DELAY, BENCHMARK_CONFTEST_DELAY, and BENCHMARK_TEST_DELAY.

refactor: relocate formatting utils
- Relocates formatting helpers from __main__.py, file_mutation.py, and trampoline_templates.py into src/mutmut/utils/format_utils.py.

feat: Add dependency tracking with function hash persistence
- MutmutState dataclass + state() singleton (state.py) consolidating old_function_hashes, current_function_hashes, and function_dependencies instead of leaking module-level globals.
- core.py with MutmutCallStack (backed by ContextVar for async/thread safety) and a relocated record_trampoline_hit that now records caller→callee edges during stats collection.
- load_stats/save_stats extended to persist function_hashes and dependencies in mutmut-stats.json.
- _cleanup_stale_stats and _invalidate_stale_dependency_edges prune state for functions that no longer exist or whose hashes changed, and transitively invalidate callers of changed callees.
- New config options: track_dependencies and dependency_tracking_depth.

Known Issues