
Add NumPy-aware default hashing, regression test, and xxhash benchmark #337

Open
shaypal5 wants to merge 1 commit into master from codex/add-test-and-benchmark-for-numpy-array-performance

Conversation


@shaypal5 shaypal5 commented Feb 17, 2026

Motivation

  • This will close issue #43 (NumPy array default hasher).
  • Large NumPy ndarrays were being hashed via generic pickle serialization, which is inefficient and can be incorrect for array-content comparison at scale. The change ensures array content is hashed deterministically and efficiently, without importing NumPy at module import time.
  • Provide a regression test to ensure equal arrays produce cache hits and content-changed arrays produce misses for both memory and pickle backends.
  • Provide a simple benchmark to compare the current default hasher against an xxhash-based reference on very large arrays to validate performance tradeoffs.

Description

  • Implement NumPy-aware hashing helpers in src/cachier/config.py: _is_numpy_array, _hash_numpy_array, and _update_hash_for_value, and replace the old pickle+SHA256 approach with an incremental blake2b-based default hasher (_default_hash_func) that treats ndarray metadata and raw bytes specially.
  • Add tests/test_numpy_hash.py which verifies cache hits for identical large arrays and misses when array content changes, parametrized across memory and pickle backends.
  • Add scripts/benchmark_numpy_hash.py which benchmarks the default hasher vs an xxhash reference implementation on configurable large NumPy arrays and prints median timings and the ratio.
  • Keep NumPy import lazy/dynamic so missing optional dependencies do not break import-time behavior.

Testing

  • Ran the regression test with pytest -q tests/test_numpy_hash.py, result: 2 passed.
  • Ran linting with ruff check src/cachier/config.py tests/test_numpy_hash.py scripts/benchmark_numpy_hash.py, result: all checks passed.
  • Ran static typing with mypy src/cachier/config.py, result: no issues.
  • Executed the benchmark python scripts/benchmark_numpy_hash.py --elements 10000000 --runs 5 after installing xxhash, result: cachier_default median: 0.326273s, xxhash_reference median: 0.229056s, ratio 1.42x (benchmark succeeded; requires numpy and xxhash in the environment).

Codex Task


codecov bot commented Feb 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (85539dc) to head (6a149bc).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##            master      #337   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           11        11           
  Lines         1477      1508   +31     
  Branches       185       192    +7     
=========================================
+ Hits          1477      1508   +31     
Flag Coverage Δ
local 64.19% <100.00%> (+0.75%) ⬆️
mongodb 44.36% <58.82%> (+0.21%) ⬆️
postgres 47.54% <58.82%> (+0.15%) ⬆️
redis 51.39% <58.82%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/cachier/config.py 100.00% <100.00%> (ø)


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 85539dc...6a149bc.


Contributor

Copilot AI left a comment


Pull request overview

This pull request addresses issue #43 by implementing NumPy-aware hashing for the default hash function, replacing the previous inefficient pickle-based approach for large NumPy arrays. The changes optimize cache key generation for NumPy arrays while maintaining lazy imports to avoid breaking installations without NumPy.

Changes:

  • Implemented incremental blake2b-based hashing with special handling for NumPy arrays, lists, tuples, and dicts in src/cachier/config.py
  • Added regression test to verify cache hits for identical arrays and misses for modified arrays across memory and pickle backends
  • Added benchmark script to compare default hasher performance against xxhash reference implementation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
src/cachier/config.py Replaced SHA256+pickle hash with blake2b incremental hashing; added NumPy-aware helpers _is_numpy_array, _hash_numpy_array, and _update_hash_for_value
tests/test_numpy_hash.py Added parametrized test verifying cache behavior with large NumPy arrays across memory and pickle backends
scripts/benchmark_numpy_hash.py Added benchmark comparing default hasher against xxhash for large arrays with configurable size and run count

np = pytest.importorskip("numpy")


@pytest.mark.parametrize("backend", ["memory", "pickle"])

Copilot AI Feb 17, 2026


The test should be decorated with either @pytest.mark.memory or @pytest.mark.pickle for each backend being tested. According to the codebase conventions, tests should be marked with the appropriate backend marker so they can be selectively run in the CI matrix. Consider splitting this into two separate test functions, one for each backend with the appropriate marker, or using pytest.mark.parametrize with marks on each parameter value as demonstrated in tests/test_defaults.py lines 188-194.

Suggested change
-@pytest.mark.parametrize("backend", ["memory", "pickle"])
+@pytest.mark.parametrize(
+    "backend",
+    [
+        pytest.param("memory", marks=pytest.mark.memory),
+        pytest.param("pickle", marks=pytest.mark.pickle),
+    ],
+)


if isinstance(value, dict):
    hasher.update(b"dict")
    for dict_key in sorted(value):

Copilot AI Feb 17, 2026


Calling sorted(value) on a dict with non-comparable keys will raise a TypeError. While dict keys in kwargs are typically strings (and thus sortable), users could pass dicts with arbitrary key types as function arguments. Consider wrapping the sort in a try-except block that falls back to sorting by the string representation of keys, or by using sorted(value.items(), key=lambda x: str(x[0])) to ensure robustness.

Comment on lines +46 to +80
def _update_hash_for_value(hasher: "hashlib._Hash", value: Any) -> None:
    """Update hasher with a stable representation of a Python value.

    Parameters
    ----------
    hasher : hashlib._Hash
        The hasher to update.
    value : Any
        Value to encode.

    """
    if _is_numpy_array(value):
        _hash_numpy_array(hasher, value)
        return

    if isinstance(value, tuple):
        hasher.update(b"tuple")
        for item in value:
            _update_hash_for_value(hasher, item)
        return

    if isinstance(value, list):
        hasher.update(b"list")
        for item in value:
            _update_hash_for_value(hasher, item)
        return

    if isinstance(value, dict):
        hasher.update(b"dict")
        for dict_key in sorted(value):
            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, value[dict_key])
        return

    hasher.update(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))

Copilot AI Feb 17, 2026


The recursive implementation of _update_hash_for_value could cause a stack overflow with deeply nested data structures (e.g., nested lists/tuples/dicts). Consider adding a maximum recursion depth check or using an iterative approach with an explicit stack to prevent potential stack overflow errors when users pass deeply nested arguments to cached functions.

    hasher.update(b"numpy.ndarray")
    hasher.update(value.dtype.str.encode("utf-8"))
    hasher.update(str(value.shape).encode("utf-8"))
    hasher.update(value.tobytes(order="C"))

Copilot AI Feb 17, 2026


The tobytes(order="C") call creates a copy if the array is not already C-contiguous, which is correct for ensuring consistent hashing. However, consider documenting this behavior in the function docstring, as it has performance implications for large non-contiguous arrays (e.g., sliced views, transposed arrays). This is working as intended but worth noting for users.
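The behavior the reviewer describes can be demonstrated directly (this demo assumes NumPy is installed and is guarded accordingly; it is not part of the PR):

```python
try:
    import numpy as np
except ImportError:  # the demo needs numpy; skip gracefully if absent
    np = None

if np is not None:
    a = np.arange(6, dtype=np.int64).reshape(2, 3)
    t = a.T  # transposed view: same buffer, but not C-contiguous
    assert not t.flags["C_CONTIGUOUS"]
    # tobytes(order="C") serializes the logical content in row-major
    # order, making a temporary contiguous copy of the view first.
    # Equal content therefore yields equal bytes regardless of layout:
    assert t.tobytes(order="C") == np.ascontiguousarray(t).tobytes()
```

The copy is exactly the performance cost the reviewer flags for large non-contiguous inputs such as slices and transposes.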

            _update_hash_for_value(hasher, dict_key)
            _update_hash_for_value(hasher, value[dict_key])
        return


Copilot AI Feb 17, 2026


The hash function does not handle set or frozenset types explicitly. When these types are passed to cached functions, they'll be pickled via the fallback at line 80. Consider adding explicit handling for sets (converting to sorted tuples) to ensure deterministic hashing, similar to how dicts are handled with sorted keys. This would improve hash consistency and performance for set arguments.

Suggested change
if isinstance(value, (set, frozenset)):
    # Use a deterministic ordering of elements for hashing.
    hasher.update(b"frozenset" if isinstance(value, frozenset) else b"set")
    try:
        # Fast path: works for homogeneous, orderable element types.
        iterable = sorted(value)
    except TypeError:
        # Fallback: impose a deterministic order based on type name and repr.
        iterable = sorted(
            value, key=lambda item: (type(item).__name__, repr(item))
        )
    for item in iterable:
        _update_hash_for_value(hasher, item)
    return

Comment on lines 83 to +104
 def _default_hash_func(args, kwds):
-    # Sort the kwargs to ensure consistent ordering
-    sorted_kwargs = sorted(kwds.items())
-    # Serialize args and sorted_kwargs using pickle or similar
-    serialized = pickle.dumps((args, sorted_kwargs))
-    # Create a hash of the serialized data
-    return hashlib.sha256(serialized).hexdigest()
+    """Compute a stable hash key for function arguments.
+
+    Parameters
+    ----------
+    args : tuple
+        Positional arguments.
+    kwds : dict
+        Keyword arguments.
+
+    Returns
+    -------
+    str
+        A hex digest representing the call arguments.
+
+    """
+    hasher = hashlib.blake2b(digest_size=32)
+    hasher.update(b"args")
+    _update_hash_for_value(hasher, args)
+    hasher.update(b"kwds")
+    _update_hash_for_value(hasher, dict(sorted(kwds.items())))
+    return hasher.hexdigest()

Copilot AI Feb 17, 2026


Changing the hash algorithm from SHA256 to blake2b is a breaking change that will invalidate all existing caches. When users upgrade to this version, their cached function results will not be found because the cache keys will be different. Consider documenting this breaking change in the PR description or adding a migration guide. Alternatively, consider versioning the hash function or providing a compatibility mode that can read old cache entries.

Comment on lines +46 to +80

Copilot AI Feb 17, 2026


The new hash function adds type prefixes (e.g., b"tuple", b"list", b"dict") to distinguish between different container types. This is good for correctness, but it means that hash_func((1,), {}) and hash_func([1], {}) will produce different hashes (as expected). However, this is a semantic change from the old implementation where both might have been pickled to similar byte sequences. Ensure this behavior is desired and documented, particularly that tuple (1,) and list [1] are now guaranteed to have different cache keys.

