Add NumPy-aware default hashing, regression test, and xxhash benchmark#337
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##            master      #337   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           11        11
  Lines         1477      1508    +31
  Branches       185       192     +7
=========================================
+ Hits          1477      1508    +31
```
Pull request overview
This pull request addresses issue #43 by implementing NumPy-aware hashing for the default hash function, replacing the previous inefficient pickle-based approach for large NumPy arrays. The changes optimize cache key generation for NumPy arrays while maintaining lazy imports to avoid breaking installations without NumPy.
Changes:
- Implemented incremental blake2b-based hashing with special handling for NumPy arrays, lists, tuples, and dicts in `src/cachier/config.py`
- Added a regression test to verify cache hits for identical arrays and misses for modified arrays across the memory and pickle backends
- Added a benchmark script to compare default hasher performance against an xxhash reference implementation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `src/cachier/config.py` | Replaced SHA256+pickle hash with blake2b incremental hashing; added NumPy-aware helpers `_is_numpy_array`, `_hash_numpy_array`, and `_update_hash_for_value` |
| `tests/test_numpy_hash.py` | Added parametrized test verifying cache behavior with large NumPy arrays across memory and pickle backends |
| `scripts/benchmark_numpy_hash.py` | Added benchmark comparing default hasher against xxhash for large arrays with configurable size and run count |
```python
np = pytest.importorskip("numpy")


@pytest.mark.parametrize("backend", ["memory", "pickle"])
```
The test should be decorated with either @pytest.mark.memory or @pytest.mark.pickle for each backend being tested. According to the codebase conventions, tests should be marked with the appropriate backend marker so they can be selectively run in the CI matrix. Consider splitting this into two separate test functions, one for each backend with the appropriate marker, or using pytest.mark.parametrize with marks on each parameter value as demonstrated in tests/test_defaults.py lines 188-194.
Suggested change:

```python
@pytest.mark.parametrize(
    "backend",
    [
        pytest.param("memory", marks=pytest.mark.memory),
        pytest.param("pickle", marks=pytest.mark.pickle),
    ],
)
```
```python
    if isinstance(value, dict):
        hasher.update(b"dict")
        for dict_key in sorted(value):
```
Calling sorted(value) on a dict with non-comparable keys will raise a TypeError. While dict keys in kwargs are typically strings (and thus sortable), users could pass dicts with arbitrary key types as function arguments. Consider wrapping the sort in a try-except block that falls back to sorting by the string representation of keys, or by using sorted(value.items(), key=lambda x: str(x[0])) to ensure robustness.
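A minimal sketch of such a fallback, slotting into the `_update_hash_for_value` implementation shown below; the try/except split and the `str(kv[0])` sort key are illustrative choices, not part of this PR:

```python
    if isinstance(value, dict):
        hasher.update(b"dict")
        try:
            items = sorted(value.items())  # fast path: comparable keys
        except TypeError:
            # Hypothetical fallback: order by each key's string form so
            # mixed key types (e.g. int and str) still hash deterministically.
            items = sorted(value.items(), key=lambda kv: str(kv[0]))
        for dict_key, dict_value in items:
            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, dict_value)
        return
```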
```python
def _update_hash_for_value(hasher: "hashlib._Hash", value: Any) -> None:
    """Update hasher with a stable representation of a Python value.

    Parameters
    ----------
    hasher : hashlib._Hash
        The hasher to update.
    value : Any
        Value to encode.

    """
    if _is_numpy_array(value):
        _hash_numpy_array(hasher, value)
        return

    if isinstance(value, tuple):
        hasher.update(b"tuple")
        for item in value:
            _update_hash_for_value(hasher, item)
        return

    if isinstance(value, list):
        hasher.update(b"list")
        for item in value:
            _update_hash_for_value(hasher, item)
        return

    if isinstance(value, dict):
        hasher.update(b"dict")
        for dict_key in sorted(value):
            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, value[dict_key])
        return

    hasher.update(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
```
The recursive implementation of _update_hash_for_value could cause a stack overflow with deeply nested data structures (e.g., nested lists/tuples/dicts). Consider adding a maximum recursion depth check or using an iterative approach with an explicit stack to prevent potential stack overflow errors when users pass deeply nested arguments to cached functions.
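One possible shape for such a guard — a hypothetical `_MAX_HASH_DEPTH` cap with a pickle fallback past the limit, shown here only for the tuple/list branches; this is a sketch, not what the PR implements:

```python
import hashlib
import pickle
from typing import Any

_MAX_HASH_DEPTH = 50  # hypothetical cap; tune to expected argument nesting


def _update_hash_for_value(
    hasher: "hashlib._Hash", value: Any, depth: int = 0
) -> None:
    """Depth-guarded variant: recursion is bounded by _MAX_HASH_DEPTH."""
    if depth >= _MAX_HASH_DEPTH:
        # Past the guard, hash the pickled subtree instead of recursing.
        hasher.update(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
        return
    if isinstance(value, (tuple, list)):
        hasher.update(b"tuple" if isinstance(value, tuple) else b"list")
        for item in value:
            _update_hash_for_value(hasher, item, depth + 1)
        return
    hasher.update(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
```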
| hasher.update(b"numpy.ndarray") | ||
| hasher.update(value.dtype.str.encode("utf-8")) | ||
| hasher.update(str(value.shape).encode("utf-8")) | ||
| hasher.update(value.tobytes(order="C")) |
The tobytes(order="C") call creates a copy if the array is not already C-contiguous, which is correct for ensuring consistent hashing. However, consider documenting this behavior in the function docstring, as it has performance implications for large non-contiguous arrays (e.g., sliced views, transposed arrays). This is working as intended but worth noting for users.
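A quick illustration of the copy cost this comment refers to; the array shape is arbitrary:

```python
import numpy as np

arr = np.zeros((4096, 4096))
view = arr[::2, ::2]  # strided view: not C-contiguous
assert not view.flags["C_CONTIGUOUS"]
# tobytes(order="C") must materialize a C-ordered copy of the view's
# elements before hashing, so non-contiguous inputs pay extra memory/time.
data = view.tobytes(order="C")
```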
```python
            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, value[dict_key])
        return
```
The hash function does not handle set or frozenset types explicitly. When these types are passed to cached functions, they'll be pickled via the fallback at line 80. Consider adding explicit handling for sets (converting to sorted tuples) to ensure deterministic hashing, similar to how dicts are handled with sorted keys. This would improve hash consistency and performance for set arguments.
```python
    if isinstance(value, (set, frozenset)):
        # Use a deterministic ordering of elements for hashing.
        hasher.update(b"frozenset" if isinstance(value, frozenset) else b"set")
        try:
            # Fast path: works for homogeneous, orderable element types.
            iterable = sorted(value)
        except TypeError:
            # Fallback: impose a deterministic order based on type name and repr.
            iterable = sorted(
                value, key=lambda item: (type(item).__name__, repr(item))
            )
        for item in iterable:
            _update_hash_for_value(hasher, item)
        return
```
```diff
 def _default_hash_func(args, kwds):
-    # Sort the kwargs to ensure consistent ordering
-    sorted_kwargs = sorted(kwds.items())
-    # Serialize args and sorted_kwargs using pickle or similar
-    serialized = pickle.dumps((args, sorted_kwargs))
-    # Create a hash of the serialized data
-    return hashlib.sha256(serialized).hexdigest()
+    """Compute a stable hash key for function arguments.
+
+    Parameters
+    ----------
+    args : tuple
+        Positional arguments.
+    kwds : dict
+        Keyword arguments.
+
+    Returns
+    -------
+    str
+        A hex digest representing the call arguments.
+
+    """
+    hasher = hashlib.blake2b(digest_size=32)
+    hasher.update(b"args")
+    _update_hash_for_value(hasher, args)
+    hasher.update(b"kwds")
+    _update_hash_for_value(hasher, dict(sorted(kwds.items())))
+    return hasher.hexdigest()
```
Changing the hash algorithm from SHA256 to blake2b is a breaking change that will invalidate all existing caches. When users upgrade to this version, their cached function results will not be found because the cache keys will be different. Consider documenting this breaking change in the PR description or adding a migration guide. Alternatively, consider versioning the hash function or providing a compatibility mode that can read old cache entries.
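For users who need existing caches to keep resolving, one workable stopgap — assuming cachier's decorator accepts a custom `hash_func`, as the notion of a "default" hash function suggests — is to pin the pre-change implementation:

```python
import hashlib
import pickle


def _legacy_sha256_hash(args, kwds):
    """The pre-change key derivation, copied verbatim from the old code."""
    sorted_kwargs = sorted(kwds.items())
    serialized = pickle.dumps((args, sorted_kwargs))
    return hashlib.sha256(serialized).hexdigest()


# Hypothetical usage; the parameter name is an assumption:
# @cachier(hash_func=_legacy_sha256_hash)
# def expensive(x):
#     ...
```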
(Diff context: the `_update_hash_for_value` implementation shown in full above.)
The new hash function adds type prefixes (e.g., b"tuple", b"list", b"dict") to distinguish between different container types. This is good for correctness, but it means that hash_func((1,), {}) and hash_func([1], {}) will produce different hashes (as expected). However, this is a semantic change from the old implementation where both might have been pickled to similar byte sequences. Ensure this behavior is desired and documented, particularly that tuple (1,) and list [1] are now guaranteed to have different cache keys.
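A self-contained check of that property, using the same `blake2b` construction with an illustrative payload byte:

```python
import hashlib


def _tagged_digest(tag: bytes, payload: bytes) -> str:
    hasher = hashlib.blake2b(digest_size=32)
    hasher.update(tag)  # container-type prefix, e.g. b"tuple" or b"list"
    hasher.update(payload)
    return hasher.hexdigest()


# Identical contents under different container tags yield different keys.
assert _tagged_digest(b"tuple", b"\x01") != _tagged_digest(b"list", b"\x01")
```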
Motivation
- Replace the slow pickle-based hashing of large NumPy arrays in the default hash function (issue #43).
- Verify cache behavior across the `memory` and `pickle` backends.
- Compare against an `xxhash`-based reference on very large arrays to validate performance tradeoffs.

Description
- In `src/cachier/config.py`, add `_is_numpy_array`, `_hash_numpy_array`, and `_update_hash_for_value`, and replace the old pickle+SHA256 approach with an incremental `blake2b`-based default hasher (`_default_hash_func`) that treats ndarray metadata and raw bytes specially.
- Add `tests/test_numpy_hash.py`, which verifies cache hits for identical large arrays and misses when array content changes, parametrized across the `memory` and `pickle` backends.
- Add `scripts/benchmark_numpy_hash.py`, which benchmarks the default hasher against an `xxhash` reference implementation on configurable large NumPy arrays and prints median timings and the ratio.

Testing
- `pytest -q tests/test_numpy_hash.py`, result: `2 passed`.
- `ruff check src/cachier/config.py tests/test_numpy_hash.py scripts/benchmark_numpy_hash.py`, result: all checks passed.
- `mypy src/cachier/config.py`, result: no issues.
- `python scripts/benchmark_numpy_hash.py --elements 10000000 --runs 5` after installing `xxhash`, result: `cachier_default` median 0.326273 s, `xxhash_reference` median 0.229056 s, ratio `1.42x` (benchmark succeeded; requires `numpy` and `xxhash` in the environment); a sketch of the reference hasher follows below.
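For reference, a plausible shape for the benchmark's `xxhash` reference hasher — the function name and exact inputs are assumptions, but it mirrors the dtype/shape/bytes inputs the blake2b path hashes, and `xxhash.xxh3_64` is the library's streaming 64-bit hasher:

```python
import numpy as np
import xxhash  # pip install xxhash


def xxhash_reference(arr: np.ndarray) -> str:
    # Hash the same inputs as the blake2b path: dtype, shape, raw C-order bytes.
    hasher = xxhash.xxh3_64()
    hasher.update(arr.dtype.str.encode("utf-8"))
    hasher.update(str(arr.shape).encode("utf-8"))
    hasher.update(arr.tobytes(order="C"))
    return hasher.hexdigest()
```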