Add NumPy-aware default hashing, regression test, and xxhash benchmark#337
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##            master      #337   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           11        11
  Lines         1477      1508    +31
  Branches       185       192     +7
=========================================
+ Hits          1477      1508    +31
```
Pull request overview
This pull request addresses issue #43 by implementing NumPy-aware hashing for the default hash function, replacing the previous inefficient pickle-based approach for large NumPy arrays. The changes optimize cache key generation for NumPy arrays while maintaining lazy imports to avoid breaking installations without NumPy.
Changes:
- Implemented incremental blake2b-based hashing with special handling for NumPy arrays, lists, tuples, and dicts in `src/cachier/config.py`
- Added a regression test to verify cache hits for identical arrays and misses for modified arrays across the memory and pickle backends
- Added a benchmark script to compare default hasher performance against an xxhash reference implementation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `src/cachier/config.py` | Replaced SHA256+pickle hash with blake2b incremental hashing; added NumPy-aware helpers `_is_numpy_array`, `_hash_numpy_array`, and `_update_hash_for_value` |
| `tests/test_numpy_hash.py` | Added parametrized test verifying cache behavior with large NumPy arrays across memory and pickle backends |
| `scripts/benchmark_numpy_hash.py` | Added benchmark comparing default hasher against xxhash for large arrays with configurable size and run count |
```python
np = pytest.importorskip("numpy")


@pytest.mark.parametrize("backend", ["memory", "pickle"])
```
The test should be decorated with either @pytest.mark.memory or @pytest.mark.pickle for each backend being tested. According to the codebase conventions, tests should be marked with the appropriate backend marker so they can be selectively run in the CI matrix. Consider splitting this into two separate test functions, one for each backend with the appropriate marker, or using pytest.mark.parametrize with marks on each parameter value as demonstrated in tests/test_defaults.py lines 188-194.
Suggested change:

```python
@pytest.mark.parametrize(
    "backend",
    [
        pytest.param("memory", marks=pytest.mark.memory),
        pytest.param("pickle", marks=pytest.mark.pickle),
    ],
)
```
```python
    if isinstance(value, dict):
        hasher.update(b"dict")
        for dict_key in sorted(value):
```
Calling sorted(value) on a dict with non-comparable keys will raise a TypeError. While dict keys in kwargs are typically strings (and thus sortable), users could pass dicts with arbitrary key types as function arguments. Consider wrapping the sort in a try-except block that falls back to sorting by the string representation of keys, or by using sorted(value.items(), key=lambda x: str(x[0])) to ensure robustness.
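A minimal sketch of such a fallback, slotting into the `_update_hash_for_value` implementation shown below; the try/except split and the `str(kv[0])` sort key are illustrative choices, not part of this PR:

```python
    if isinstance(value, dict):
        hasher.update(b"dict")
        try:
            items = sorted(value.items())  # fast path: comparable keys
        except TypeError:
            # Hypothetical fallback: order by each key's string form so
            # mixed key types (e.g. int and str) still hash deterministically.
            items = sorted(value.items(), key=lambda kv: str(kv[0]))
        for dict_key, dict_value in items:
            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, dict_value)
        return
```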
```python
def _update_hash_for_value(hasher: "hashlib._Hash", value: Any) -> None:
    """Update hasher with a stable representation of a Python value.

    Parameters
    ----------
    hasher : hashlib._Hash
        The hasher to update.
    value : Any
        Value to encode.

    """
    if _is_numpy_array(value):
        _hash_numpy_array(hasher, value)
        return

    if isinstance(value, tuple):
        hasher.update(b"tuple")
        for item in value:
            _update_hash_for_value(hasher, item)
        return

    if isinstance(value, list):
        hasher.update(b"list")
        for item in value:
            _update_hash_for_value(hasher, item)
        return

    if isinstance(value, dict):
        hasher.update(b"dict")
        for dict_key in sorted(value):
            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, value[dict_key])
        return

    hasher.update(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
```
The recursive implementation of _update_hash_for_value could cause a stack overflow with deeply nested data structures (e.g., nested lists/tuples/dicts). Consider adding a maximum recursion depth check or using an iterative approach with an explicit stack to prevent potential stack overflow errors when users pass deeply nested arguments to cached functions.
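One possible shape for such a guard — a hypothetical `_MAX_HASH_DEPTH` cap with a pickle fallback past the limit, shown here only for the tuple/list branches; this is a sketch, not what the PR implements:

```python
import hashlib
import pickle
from typing import Any

_MAX_HASH_DEPTH = 50  # hypothetical cap; tune to expected argument nesting


def _update_hash_for_value(
    hasher: "hashlib._Hash", value: Any, depth: int = 0
) -> None:
    """Depth-guarded variant: recursion is bounded by _MAX_HASH_DEPTH."""
    if depth >= _MAX_HASH_DEPTH:
        # Past the guard, hash the pickled subtree instead of recursing.
        hasher.update(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
        return
    if isinstance(value, (tuple, list)):
        hasher.update(b"tuple" if isinstance(value, tuple) else b"list")
        for item in value:
            _update_hash_for_value(hasher, item, depth + 1)
        return
    hasher.update(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
```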
| hasher.update(b"numpy.ndarray") | ||
| hasher.update(value.dtype.str.encode("utf-8")) | ||
| hasher.update(str(value.shape).encode("utf-8")) | ||
| hasher.update(value.tobytes(order="C")) |
The tobytes(order="C") call creates a copy if the array is not already C-contiguous, which is correct for ensuring consistent hashing. However, consider documenting this behavior in the function docstring, as it has performance implications for large non-contiguous arrays (e.g., sliced views, transposed arrays). This is working as intended but worth noting for users.
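A quick illustration of the copy cost this comment refers to; the array shape is arbitrary:

```python
import numpy as np

arr = np.zeros((4096, 4096))
view = arr[::2, ::2]  # strided view: not C-contiguous
assert not view.flags["C_CONTIGUOUS"]
# tobytes(order="C") must materialize a C-ordered copy of the view's
# elements before hashing, so non-contiguous inputs pay extra memory/time.
data = view.tobytes(order="C")
```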
```python
            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, value[dict_key])
        return
```
The hash function does not handle set or frozenset types explicitly. When these types are passed to cached functions, they'll be pickled via the fallback at line 80. Consider adding explicit handling for sets (converting to sorted tuples) to ensure deterministic hashing, similar to how dicts are handled with sorted keys. This would improve hash consistency and performance for set arguments.
```python
    if isinstance(value, (set, frozenset)):
        # Use a deterministic ordering of elements for hashing.
        hasher.update(b"frozenset" if isinstance(value, frozenset) else b"set")
        try:
            # Fast path: works for homogeneous, orderable element types.
            iterable = sorted(value)
        except TypeError:
            # Fallback: impose a deterministic order based on type name and repr.
            iterable = sorted(
                value, key=lambda item: (type(item).__name__, repr(item))
            )
        for item in iterable:
            _update_hash_for_value(hasher, item)
        return
```
```diff
 def _default_hash_func(args, kwds):
-    # Sort the kwargs to ensure consistent ordering
-    sorted_kwargs = sorted(kwds.items())
-    # Serialize args and sorted_kwargs using pickle or similar
-    serialized = pickle.dumps((args, sorted_kwargs))
-    # Create a hash of the serialized data
-    return hashlib.sha256(serialized).hexdigest()
+    """Compute a stable hash key for function arguments.
+
+    Parameters
+    ----------
+    args : tuple
+        Positional arguments.
+    kwds : dict
+        Keyword arguments.
+
+    Returns
+    -------
+    str
+        A hex digest representing the call arguments.
+
+    """
+    hasher = hashlib.blake2b(digest_size=32)
+    hasher.update(b"args")
+    _update_hash_for_value(hasher, args)
+    hasher.update(b"kwds")
+    _update_hash_for_value(hasher, dict(sorted(kwds.items())))
+    return hasher.hexdigest()
```
Changing the hash algorithm from SHA256 to blake2b is a breaking change that will invalidate all existing caches. When users upgrade to this version, their cached function results will not be found because the cache keys will be different. Consider documenting this breaking change in the PR description or adding a migration guide. Alternatively, consider versioning the hash function or providing a compatibility mode that can read old cache entries.
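For users who need existing caches to keep resolving, one workable stopgap — assuming cachier's decorator accepts a custom `hash_func`, as the notion of a "default" hash function suggests — is to pin the pre-change implementation:

```python
import hashlib
import pickle


def _legacy_sha256_hash(args, kwds):
    """The pre-change key derivation, copied verbatim from the old code."""
    sorted_kwargs = sorted(kwds.items())
    serialized = pickle.dumps((args, sorted_kwargs))
    return hashlib.sha256(serialized).hexdigest()


# Hypothetical usage; the parameter name is an assumption:
# @cachier(hash_func=_legacy_sha256_hash)
# def expensive(x):
#     ...
```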
(Diff context: the `_update_hash_for_value` implementation shown in full above.)
The new hash function adds type prefixes (e.g., b"tuple", b"list", b"dict") to distinguish between different container types. This is good for correctness, but it means that hash_func((1,), {}) and hash_func([1], {}) will produce different hashes (as expected). However, this is a semantic change from the old implementation where both might have been pickled to similar byte sequences. Ensure this behavior is desired and documented, particularly that tuple (1,) and list [1] are now guaranteed to have different cache keys.
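A self-contained check of that property, using the same `blake2b` construction with an illustrative payload byte:

```python
import hashlib


def _tagged_digest(tag: bytes, payload: bytes) -> str:
    hasher = hashlib.blake2b(digest_size=32)
    hasher.update(tag)  # container-type prefix, e.g. b"tuple" or b"list"
    hasher.update(payload)
    return hasher.hexdigest()


# Identical contents under different container tags yield different keys.
assert _tagged_digest(b"tuple", b"\x01") != _tagged_digest(b"list", b"\x01")
```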
Motivation
- Replace the slow pickle-based hashing of large NumPy arrays in the default hash function (issue #43).
- Verify cache behavior across the `memory` and `pickle` backends.
- Compare against an `xxhash`-based reference on very large arrays to validate performance tradeoffs.

Description
- In `src/cachier/config.py`, add `_is_numpy_array`, `_hash_numpy_array`, and `_update_hash_for_value`, and replace the old pickle+SHA256 approach with an incremental `blake2b`-based default hasher (`_default_hash_func`) that treats ndarray metadata and raw bytes specially.
- Add `tests/test_numpy_hash.py`, which verifies cache hits for identical large arrays and misses when array content changes, parametrized across the `memory` and `pickle` backends.
- Add `scripts/benchmark_numpy_hash.py`, which benchmarks the default hasher against an `xxhash` reference implementation on configurable large NumPy arrays and prints median timings and the ratio.

Testing
- `pytest -q tests/test_numpy_hash.py`, result: `2 passed`.
- `ruff check src/cachier/config.py tests/test_numpy_hash.py scripts/benchmark_numpy_hash.py`, result: all checks passed.
- `mypy src/cachier/config.py`, result: no issues.
- `python scripts/benchmark_numpy_hash.py --elements 10000000 --runs 5` after installing `xxhash`, result: `cachier_default` median 0.326273 s, `xxhash_reference` median 0.229056 s, ratio `1.42x` (benchmark succeeded; requires `numpy` and `xxhash` in the environment); a sketch of the reference hasher follows below.
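For reference, a plausible shape for the benchmark's `xxhash` reference hasher — the function name and exact inputs are assumptions, but it mirrors the dtype/shape/bytes inputs the blake2b path hashes, and `xxhash.xxh3_64` is the library's streaming 64-bit hasher:

```python
import numpy as np
import xxhash  # pip install xxhash


def xxhash_reference(arr: np.ndarray) -> str:
    # Hash the same inputs as the blake2b path: dtype, shape, raw C-order bytes.
    hasher = xxhash.xxh3_64()
    hasher.update(arr.dtype.str.encode("utf-8"))
    hasher.update(str(arr.shape).encode("utf-8"))
    hasher.update(arr.tobytes(order="C"))
    return hasher.hexdigest()
```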