feat(fts): implement CacheCodec for posting lists and positions#6691

Open
wkalt wants to merge 1 commit into lance-format:main from wkalt:ticket/oss-741/cachecodec-fts

Conversation


@wkalt wkalt commented May 5, 2026

OSS-741. Follows lance#6223. PostingListKey and PositionKey gain a codec mirroring partition_serde.rs (variant tag + JSON header + Arrow IPC streams). SharedPositionStream is backed by bytes::Bytes so the cache read path is end-to-end zero-copy.


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label May 5, 2026
@wkalt wkalt force-pushed the ticket/oss-741/cachecodec-fts branch from 7e2cd07 to 1235bce Compare May 5, 2026 22:24

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 82.82353% with 73 lines in your changes missing coverage. Please review.

Files with missing lines                                 | Patch %  | Lines
...ust/lance-index/src/scalar/inverted/cache_codec.rs    | 83.29%   | 25 Missing and 43 partials ⚠️
rust/lance-index/src/scalar/inverted/index.rs            | 72.22%   | 5 Missing ⚠️


Contributor Author

wkalt commented May 5, 2026

@claude review once


@claude claude Bot left a comment


Found one nit (inline) about preferring as_primitive_opt for type-safe IPC column reads — otherwise the codec mirrors partition_serde.rs cleanly and the round-trip + zero-copy tests look solid. Deferring to a human for the format-design call (variant tags, JSON header layout) since this is the persistence contract for FTS cache entries going forward.

Extended reasoning...

Overview

This PR introduces a new ~665-line cache_codec.rs implementing CacheCodecImpl for PostingList and Positions, plus changes to inverted/index.rs to (a) switch SharedPositionStream::bytes from Vec<u8> to bytes::Bytes for zero-copy reads, (b) wire PostingListKey / PositionKey to the new codec via CacheKey::codec(), and (c) make Positions field visibility pub(super). Tests exercise round-trip across all variants (Plain/Compressed × None/Legacy/SharedStream positions), verify that the deserialized shared-stream byte slice points back into the input Bytes (true zero-copy), and add an integration test (test_fts_prewarm_with_serializing_backend_serves_query_with_no_io) that drives a real FTS prewarm + phrase query through a serializing cache backend and asserts zero IO post-prewarm.

Security risks

None of significance. The codec consumes Bytes from a cache backend, which is trusted infrastructure rather than untrusted input. The IPC stream + JSON header path could panic on malformed input via as_primitive::<T>() (the nit flagged inline), but the realistic threat is corruption / version-skew rather than an attacker-controlled payload, and the fix is the standard _opt + ok_or_else pattern already used elsewhere in the same file.

Level of scrutiny

Moderate-to-high. This is a persistence format — once cache entries are written by a real persistent backend, the variant tags (POSTING_VARIANT_PLAIN = 0, etc.), JSON header field names (max_score, length, posting_tail_codec, codec), and IPC schema column ordering become a forward-compat contract. Worth a human eye specifically on: (1) whether the JSON headers should carry a format version field for future evolution, (2) whether the panic-on-type-mismatch behavior at cache_codec.rs:199-203 and :295-304 is acceptable for persistent backends, and (3) whether Positions::deserialize rejecting POSITIONS_TAG_NONE is the right policy.

Other factors

  • Codecov reports 82.83% patch coverage with 73 lines uncovered, mostly in error paths (truncated input, unknown tags, JSON header parse failures). The happy-path round-trip coverage is good.
  • The only notable behavioral change in index.rs is SharedPositionStream::size() now returns bytes.len() instead of bytes.capacity() — a small accounting tweak that is correct given Bytes does not expose capacity.
  • The integration test in dataset_index.rs introduces a SerializingBackend test helper inline and the comment notes "if a third user appears, lift this into a shared test utility" — reasonable deferral.
  • Author explicitly requested review (@claude review once), so a deferral with the nit context is more useful than silence here.

Comment on lines +199 to +203
let block_offsets = batch
    .column(0)
    .as_primitive::<UInt32Type>()
    .values()
    .to_vec();


🟡 The codec deserialization path uses panicking .as_primitive::<T>() calls at lines 199-203 and 295-304 to read the IPC-decoded block_offsets, row_ids, and frequencies columns, while the same file uses the safer downcast_ref + ok_or_else pattern for ListArray (line 188) and LargeBinaryArray (in deserialize_compressed). Per rust/AGENTS.md line 34, _opt variants should be preferred when the data type has not been pre-verified — and here the IPC schema is part of the (potentially corrupted or version-skewed) input. Suggest switching to as_primitive_opt::<T>().ok_or_else(|| Error::io("...")) so a malformed cache entry returns Error::io instead of panicking the process.

Extended reasoning...

What's happening. The new cache_codec.rs is the persistent-cache serialization layer for FTS posting lists and positions. Its deserialization entry points (deserialize for PostingList and Positions) take a Bytes buffer that, by design, has crossed a serialization boundary — possibly into a persistent backend, possibly between Lance versions. Inside read_position_storage (lines 199-203) and deserialize_plain (lines 295-304), the code reads three primitive-typed columns from an IPC-decoded RecordBatch using .as_primitive::<UInt32Type>(), .as_primitive::<UInt64Type>(), and .as_primitive::<Float32Type>().

Why this is a defect. The non-_opt form of as_primitive panics on type mismatch instead of returning a Result. rust/AGENTS.md line 34 explicitly says: "Prefer _opt variants (e.g., as_string_opt) unless the data type has already been verified." Here the type has not been verified — the IPC stream's schema is part of the input, not a precondition. The same file already follows the safer pattern for non-primitive columns: ListArray at line 188 and LargeBinaryArray inside deserialize_compressed both use downcast_ref + ok_or_else and surface Error::io. The primitives are inconsistent.

Concrete failure path (step-by-step).

1. A persistent cache backend (e.g. a future on-disk store) writes a Positions entry serialized by today's codec.
2. A future Lance version, or a corrupted entry (single-bit flip in the IPC schema metadata bytes), is read back and decoded by read_ipc_stream_single_at. The IPC stream parses successfully but yields a RecordBatch whose block_offsets column has, say, DataType::UInt64 instead of UInt32.
3. Line 201 calls .as_primitive::<UInt32Type>() on that column. Per arrow_array::cast::AsArray, the non-_opt variant unwraps internally and panics with "Expected primitive of type ...".
4. The panic propagates out of Positions::deserialize, past the CacheBackend::get call, into whatever async task is doing FTS query execution — process abort instead of an Err the caller could log and fall back from.

Why existing code does not prevent this. read_ipc_stream_single_at validates that some IPC stream parses, but it does not enforce the column types the codec expects. The codec writes typed columns (UInt32Array, UInt64Array, Float32Array) but does not re-check those types on read; it trusts the schema embedded in the bytes.

Impact. Realistically small — the codec is mostly fed bytes the same codec just wrote, so type mismatches are unlikely outside of corruption or version skew. However: (a) this codec is explicitly marketed for persistent cache backends (per the module docstring and PR description), where corruption and version skew are the whole point of having serialization; (b) the panic is a worse failure mode than Error::io because it cannot be caught by callers; (c) the fix is purely mechanical and the file already establishes the right pattern two lines above.

Fix. Replace each .as_primitive::<T>() with .as_primitive_opt::<T>().ok_or_else(|| Error::io("<column> column is not a <type>".to_string()))?. For example at lines 199-203:

```rust
let block_offsets = batch
    .column(0)
    .as_primitive_opt::<UInt32Type>()
    .ok_or_else(|| Error::io("block_offsets column is not UInt32".to_string()))?
    .values()
    .to_vec();
```

Apply the same change to the row_ids (UInt64) and frequencies (Float32) reads in deserialize_plain. This aligns with the convention already used for ListArray / LargeBinaryArray in this file and with the project's Arrow-access guidance.

Severity: nit. Real but minor: a defensive-programming inconsistency rather than an active bug, since the realistic likelihood of a type mismatch under normal operation is low (the same codec writes and reads). Worth flagging because the fix is one line per column, the file is brand new, and it removes a process-panic failure mode in code that is explicitly designed to handle externally-stored bytes.

