Include metadata in numpy/polars cache fingerprints to prevent collisions by skrawcz · Pull Request #1616 · apache/hamilton

skrawcz · 2026-05-29T22:19:46Z

Summary

Fix hash collisions in the caching subsystem's fingerprinting for numpy arrays and polars DataFrames.

Problem

hash_numpy_array used only obj.tobytes(), which discards shape and dtype. Arrays with identical raw bytes but different shapes (e.g., shape=(6,) vs shape=(2,3)) or different dtypes (e.g., float32(1.0) vs int32(1065353216)) produced identical cache keys.
hash_polars_dataframe used only obj.hash_rows(), which discards column names. DataFrames with identical cell values but different schemas produced identical cache keys.

Both could cause the cache to silently return incorrect results from a previous execution.
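
The numpy collisions described above can be reproduced with a minimal sketch. `naive_fingerprint` is a hypothetical stand-in for the pre-fix behavior (hashing only `obj.tobytes()`), and SHA-256 is an assumption chosen for illustration, not necessarily the digest Hamilton uses.

```python
import hashlib

import numpy as np


def naive_fingerprint(arr: np.ndarray) -> str:
    """Hash only the raw buffer; shape and dtype are ignored (illustrative)."""
    return hashlib.sha256(arr.tobytes()).hexdigest()


# Same bytes, different shapes.
flat = np.arange(6, dtype=np.int64)  # shape (6,)
grid = flat.reshape(2, 3)            # shape (2, 3)

# Same bytes, different dtypes: 1065353216 == 0x3F800000,
# which is the bit pattern of float32 1.0.
as_float = np.array([1.0], dtype=np.float32)
as_int = np.array([1065353216], dtype=np.int32)

# Both pairs collide when only the raw bytes are hashed.
shape_collision = naive_fingerprint(flat) == naive_fingerprint(grid)
dtype_collision = naive_fingerprint(as_float) == naive_fingerprint(as_int)
```

Both `shape_collision` and `dtype_collision` evaluate to `True`, which is the situation in which a cache keyed on such a fingerprint could return a value computed for a different input.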

Fix

hash_numpy_array: prepend f"{obj.shape}:{obj.dtype}" to the bytes before hashing
hash_polars_dataframe: include column names and dtypes (schema) alongside row hashes
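
As a rough sketch of the numpy half of the fix, the shape and dtype can be encoded into a header that is hashed ahead of the raw bytes. `fingerprint_numpy_array` is a hypothetical name used for illustration, and SHA-256 is again an assumption; the actual `hash_numpy_array` in Hamilton may combine the pieces differently.

```python
import hashlib

import numpy as np


def fingerprint_numpy_array(arr: np.ndarray) -> str:
    """Hash shape and dtype ahead of the raw bytes (illustrative sketch)."""
    # Mirrors the f"{obj.shape}:{obj.dtype}" prefix described in the fix.
    header = f"{arr.shape}:{arr.dtype}".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


flat = np.arange(6, dtype=np.int64)
grid = flat.reshape(2, 3)
as_float = np.array([1.0], dtype=np.float32)
as_int = np.array([1065353216], dtype=np.int32)

# The metadata prefix separates arrays that share raw bytes.
shapes_differ = fingerprint_numpy_array(flat) != fingerprint_numpy_array(grid)
dtypes_differ = fingerprint_numpy_array(as_float) != fingerprint_numpy_array(as_int)

# Equal arrays still map to the same fingerprint.
copies_match = fingerprint_numpy_array(flat) == fingerprint_numpy_array(flat.copy())
```

With the prefix in place, the two colliding pairs from the problem description receive distinct fingerprints, while an array and its copy still hash identically.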

Backwards compatibility

This changes the hash output for numpy arrays and polars DataFrames. Existing cache entries for these types will no longer match, so affected values are recomputed rather than served incorrectly. Users will see a one-time recomputation after upgrading, but no manual cache clearing is needed.

Tests

Added tests verifying:

Different shapes produce different hashes
Different dtypes with same bit pattern produce different hashes
Different column names produce different hashes
Identical data still produces identical hashes

Reported-by: Dem0


skrawcz wants to merge 1 commit into main from stefan/fix-cache-fingerprinting-collisions
