feat(fts): implement CacheCodec for posting lists and positions#6691
feat(fts): implement CacheCodec for posting lists and positions#6691wkalt wants to merge 1 commit intolance-format:mainfrom
Conversation
OSS-741. Follows lance#6223. PostingListKey and PositionKey gain a codec mirroring partition_serde.rs (variant tag + JSON header + Arrow IPC streams). SharedPositionStream is backed by bytes::Bytes so the cache read path is end-to-end zero-copy.
7e2cd07 to
1235bce
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@claude review once |
There was a problem hiding this comment.
Found one nit (inline) about preferring as_primitive_opt for type-safe IPC column reads — otherwise the codec mirrors partition_serde.rs cleanly and the round-trip + zero-copy tests look solid. Deferring to a human for the format-design call (variant tags, JSON header layout) since this is the persistence contract for FTS cache entries going forward.
Extended reasoning...
Overview
This PR introduces a new ~665-line cache_codec.rs implementing CacheCodecImpl for PostingList and Positions, plus changes to inverted/index.rs to (a) switch SharedPositionStream::bytes from Vec<u8> to bytes::Bytes for zero-copy reads, (b) wire PostingListKey / PositionKey to the new codec via CacheKey::codec(), and (c) make Positions field visibility pub(super). Tests exercise round-trip across all variants (Plain/Compressed × None/Legacy/SharedStream positions), verify that the deserialized shared-stream byte slice points back into the input Bytes (true zero-copy), and add an integration test (test_fts_prewarm_with_serializing_backend_serves_query_with_no_io) that drives a real FTS prewarm + phrase query through a serializing cache backend and asserts zero IO post-prewarm.
Security risks
None of significance. The codec consumes Bytes from a cache backend, which is trusted infrastructure rather than untrusted input. The IPC stream + JSON header path could panic on malformed input via as_primitive::<T>() (the nit flagged inline), but the realistic threat is corruption / version-skew rather than an attacker-controlled payload, and the fix is the standard _opt + ok_or_else pattern already used elsewhere in the same file.
Level of scrutiny
Moderate-to-high. This is a persistence format — once cache entries are written by a real persistent backend, the variant tags (POSTING_VARIANT_PLAIN = 0, etc.), JSON header field names (max_score, length, posting_tail_codec, codec), and IPC schema column ordering become a forward-compat contract. Worth a human eye specifically on: (1) whether the JSON headers should carry a format version field for future evolution, (2) whether the panic-on-type-mismatch behavior at cache_codec.rs:199-203 and :295-304 is acceptable for persistent backends, and (3) whether Positions::deserialize rejecting POSITIONS_TAG_NONE is the right policy.
Other factors
- Codecov reports 82.83% patch coverage with 73 lines uncovered, mostly in error paths (truncated input, unknown tags, JSON header parse failures). The happy-path round-trip coverage is good.
- The only notable behavioral change in
index.rsisSharedPositionStream::size()now returnsbytes.len()instead ofbytes.capacity()— a small accounting tweak that is correct givenBytesdoes not expose capacity. - The integration test in
dataset_index.rsintroduces aSerializingBackendtest helper inline and the comment notes "if a third user appears, lift this into a shared test utility" — reasonable deferral. - Author explicitly requested review (
@claude review once), so a deferral with the nit context is more useful than silence here.
| let block_offsets = batch | ||
| .column(0) | ||
| .as_primitive::<UInt32Type>() | ||
| .values() | ||
| .to_vec(); |
There was a problem hiding this comment.
🟡 The codec deserialization path uses panicking .as_primitive::<T>() calls at lines 199-203 and 295-304 to read the IPC-decoded block_offsets, row_ids, and frequencies columns, while the same file uses the safer downcast_ref + ok_or_else pattern for ListArray (line 188) and LargeBinaryArray (in deserialize_compressed). Per rust/AGENTS.md line 34, _opt variants should be preferred when the data type has not been pre-verified — and here the IPC schema is part of the (potentially corrupted or version-skewed) input. Suggest switching to as_primitive_opt::<T>().ok_or_else(|| Error::io("...")) so a malformed cache entry returns Error::io instead of panicking the process.
Extended reasoning...
What's happening. The new cache_codec.rs is the persistent-cache serialization layer for FTS posting lists and positions. Its deserialization entry points (deserialize for PostingList and Positions) take a Bytes buffer that, by design, has crossed a serialization boundary — possibly into a persistent backend, possibly between Lance versions. Inside read_position_storage (line 199-203) and deserialize_plain (line 295-304), the code reads three primitive-typed columns from an IPC-decoded RecordBatch using .as_primitive::<UInt32Type>(), .as_primitive::<UInt64Type>(), and .as_primitive::<Float32Type>().\n\nWhy this is a defect. The non-_opt form of as_primitive panics on type mismatch instead of returning a Result. rust/AGENTS.md line 34 explicitly says: "Prefer _opt variants (e.g., as_string_opt) unless the data type has already been verified." Here the type has not been verified — the IPC stream's schema is part of the input, not a precondition. The same file already follows the safer pattern for non-primitive columns: ListArray at line 188 and LargeBinaryArray inside deserialize_compressed both use downcast_ref + ok_or_else and surface Error::io. The primitives are inconsistent.\n\nConcrete failure path (step-by-step).\n1. A persistent cache backend (e.g. a future on-disk store) writes a Positions entry serialized by today's codec.\n2. A future Lance version, or a corrupted entry (single-bit flip in the IPC schema metadata bytes), is read back and decoded by read_ipc_stream_single_at. The IPC stream parses successfully but yields a RecordBatch whose block_offsets column has, say, DataType::UInt64 instead of UInt32.\n3. Line 201 calls .as_primitive::<UInt32Type>() on that column. Per arrow_array::cast::AsArray, the non-_opt variant unwraps internally and panics with "Expected primitive of type ...".\n4. The panic propagates out of Positions::deserialize, past the CacheBackend::get call, into whatever async task is doing FTS query execution — process abort instead of an Err the caller could log and fall back from.\n\nWhy existing code does not prevent this. read_ipc_stream_single_at validates that some IPC stream parses, but it does not enforce the column types the codec expects. The codec writes typed columns (UInt32Array, UInt64Array, Float32Array) but does not re-check those types on read; it trusts the schema embedded in the bytes.\n\nImpact. Realistically small — the codec is mostly fed bytes the same codec just wrote, so type mismatches are unlikely outside of corruption or version skew. However: (a) this codec is explicitly marketed for persistent cache backends (per the module docstring and PR description), where corruption and version skew are the whole point of having serialization; (b) the panic is a worse failure mode than Error::io because it cannot be caught by callers; (c) the fix is purely mechanical and the file already establishes the right pattern two lines above.\n\nFix. Replace each .as_primitive::<T>() with .as_primitive_opt::<T>().ok_or_else(|| Error::io("<column> column is not a <type>".to_string()))?. For example at line 199-203:\n\nrust\nlet block_offsets = batch\n .column(0)\n .as_primitive_opt::<UInt32Type>()\n .ok_or_else(|| Error::io("block_offsets column is not UInt32".to_string()))?\n .values()\n .to_vec();\n\n\nApply the same change to the row_ids (UInt64) and frequencies (Float32) reads in deserialize_plain. This aligns with the convention already used for ListArray / LargeBinaryArray in this file and with the project's Arrow-access guidance.\n\nSeverity: nit. Real but minor: defensive-programming inconsistency rather than an active bug, since the realistic likelihood of a type-mismatch under normal operation is low (the same codec writes and reads). Worth flagging because the fix is one-line per column, the file is brand new, and it removes a process-panic failure mode in code that is explicitly designed to handle externally-stored bytes.
OSS-741. Follows lance#6223. PostingListKey and PositionKey gain a codec mirroring partition_serde.rs (variant tag + JSON header + Arrow IPC streams). SharedPositionStream is backed by bytes::Bytes so the cache read path is end-to-end zero-copy.