Include metadata in numpy/polars cache fingerprints to prevent collisions #1616

Open
skrawcz wants to merge 1 commit into main from stefan/fix-cache-fingerprinting-collisions

skrawcz commented May 29, 2026

Summary

Fix hash collisions in the caching subsystem's fingerprinting for numpy arrays and polars DataFrames.

Problem

  • hash_numpy_array used only obj.tobytes(), which discards shape and dtype. Arrays with identical raw bytes but different shapes (e.g., shape=(6,) vs shape=(2,3)) or different dtypes (e.g., float32(1.0) vs int32(1065353216)) produced identical cache keys.

  • hash_polars_dataframe used only obj.hash_rows(), which discards column names. DataFrames with identical cell values but different schemas produced identical cache keys.

Both could cause the cache to silently return incorrect results from a previous execution.
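
The two numpy collisions described above can be reproduced with plain `tobytes()`. This is a minimal illustration, not the project's `hash_numpy_array`: the helper `bytes_only_digest` is a hypothetical stand-in for the old bytes-only behavior, and `sha256` is an assumption about the digest.

```python
import hashlib

import numpy as np


def bytes_only_digest(arr: np.ndarray) -> str:
    """Hypothetical stand-in for the old behavior: hash the raw bytes only."""
    # Assumption: sha256 is used here for illustration; the real digest may differ.
    return hashlib.sha256(arr.tobytes()).hexdigest()


# Same six int32 values, two different shapes.
flat = np.arange(6, dtype=np.int32)
grid = flat.reshape(2, 3)

# float32(1.0) and int32(1065353216) share the bit pattern 0x3F800000.
as_float = np.array([1.0], dtype=np.float32)
as_int = np.array([1065353216], dtype=np.int32)

# Both comparisons are True: the digest cannot tell the arrays apart.
shape_collision = bytes_only_digest(flat) == bytes_only_digest(grid)
dtype_collision = bytes_only_digest(as_float) == bytes_only_digest(as_int)
```

In both cases the raw buffers are byte-for-byte identical, so any hash computed from `tobytes()` alone has to collide.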

Fix

  • hash_numpy_array: prepend f"{obj.shape}:{obj.dtype}" to the bytes before hashing
  • hash_polars_dataframe: include column names and dtypes (schema) alongside row hashes
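
For the numpy half of the fix, a rough sketch of the idea could look like the following. The function name `fingerprint_array` and the choice of `sha256` are assumptions for illustration; only the `f"{obj.shape}:{obj.dtype}"` prefix comes from this PR.

```python
import hashlib

import numpy as np


def fingerprint_array(arr: np.ndarray) -> str:
    """Hypothetical sketch: hash shape and dtype ahead of the raw bytes."""
    digest = hashlib.sha256()  # assumption: the real digest may differ
    # Metadata prefix, e.g. "(2, 3):int32", so layout and type affect the hash.
    digest.update(f"{arr.shape}:{arr.dtype}".encode())
    digest.update(arr.tobytes())
    return digest.hexdigest()


flat = np.arange(6, dtype=np.int32)
grid = flat.reshape(2, 3)
as_float = np.array([1.0], dtype=np.float32)
as_int = np.array([1065353216], dtype=np.int32)

shapes_differ = fingerprint_array(flat) != fingerprint_array(grid)
dtypes_differ = fingerprint_array(as_float) != fingerprint_array(as_int)
same_data_matches = fingerprint_array(flat) == fingerprint_array(flat.copy())
```

Because the prefix is hashed before the buffer, arrays that share raw bytes but differ in shape or dtype feed different input to the digest, while equal arrays still hash identically.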

Backwards compatibility

This changes the hash output for numpy arrays and polars DataFrames. Existing cache entries for these types will miss and be recomputed (different hash = recomputation); they will not produce incorrect results. Users will see a one-time recomputation after upgrading, but no manual cache clearing is needed.

Tests

Added tests verifying:

  • Different shapes produce different hashes
  • Different dtypes with same bit pattern produce different hashes
  • Different column names produce different hashes
  • Identical data still produces identical hashes

Reported-by: Dem0

…ions

hash_numpy_array now includes shape and dtype in the hash, preventing
collisions between arrays with identical raw bytes but different
semantics (e.g., shape=(6,) vs shape=(2,3)).

hash_polars_dataframe now includes column names and dtypes (schema)
in the hash, preventing collisions between DataFrames with identical
cell values but different column schemas.

Existing caches will simply miss (different hash = recomputation),
not produce incorrect results.

Reported-by: Dem0