Add single-field text_and_string indexing with native fast field support #156
Closed
tlee732 wants to merge 5 commits into
Creates two tantivy fields from one parquet string column:
- <name> with raw tokenizer (exact match, aggregation, sorting)
- <name>__text with default tokenizer (full-text search)
Includes collision detection, hash field rewriter skip, and 7 tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ases
- Fix full_text/phrase queries on TextAndString fields silently hitting wrong
field by adding explicit routing to __text companion in hash_field_rewriter
- Cache text_companion_field lookup outside per-document loop to avoid 100M+
string allocations and HashMap lookups on large parquet files
- Add serde wire format test pinning {"mode":"text_and_string"} JSON format
- Normalize text_and_string/exact_only to "raw" in build_column_mapping to
prevent storing invalid tokenizer names in fast_field_tokenizer
- Add design comment explaining why TextAndString omits set_stored/set_fast
- Add edge case integration test covering empty strings and multiple
text_and_string columns
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single field uses default-tokenized inverted index for full-text search and PhraseQuery equality, plus raw fast field for aggregations and sorting. Eliminates the __text companion field, halving index size per text_and_string column.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TextAndString fields have native fast data (from set_fast(Some("raw")))
but were also transcoded from parquet in Hybrid mode. The merge of native
+ transcoded data doubled fast field ordinals, causing GROUP BY counts
to be 2x.
- Skip parquet transcoding for TextAndString by checking
manifest.string_indexing_modes (not fast_field_tokenizer, which is
set on ALL Str columns)
- Set fast_field_tokenizer=None for TextAndString in build_column_mapping
(it has native fast data, no transcoding needed)
- Classify TextAndString as native in ensure_fast_fields_for_query
- Add debug logging for transcode skip decisions
- Add error logging in jni_prewarm.rs for serialization failures
- 3 new regression tests + updated fixture to match production manifests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
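To see why holding both native and transcoded data for the same column doubles GROUP BY counts, here is a toy model in Rust. It is only an illustration: `merge_sources` and `group_by_count` are hypothetical helpers, not the project's merge code, and the model assumes the merged column ends up holding each row's value once per source.

```rust
use std::collections::BTreeMap;

/// Count occurrences per value, the way a GROUP BY with COUNT(*) would.
fn group_by_count(values: &[&str]) -> BTreeMap<String, u64> {
    let mut counts = BTreeMap::new();
    for v in values {
        *counts.entry(v.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical stand-in for merging two fast-field sources of one column.
/// It concatenates both inputs, so a value present in both is kept twice.
fn merge_sources<'a>(native: &[&'a str], transcoded: &[&'a str]) -> Vec<&'a str> {
    let mut merged = native.to_vec();
    merged.extend_from_slice(transcoded);
    merged
}
```

Under this model, merging a column with itself and grouping yields exactly twice the count of every value, which is the symptom the commit describes. Skipping the transcoded source for TextAndString leaves a single copy.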
Author (Contributor) commented:
Closing in favor of a clean branch rebased on latest main (no stacked PR dependencies). Reopening as new PR from feature/text-and-string-clean.
Author (Contributor) commented:
Replaced by #157 (same code, clean branch rebased on latest main).
Summary
Single-field text_and_string indexing mode for companion splits. One tantivy field serves both full-text search and aggregations, replacing the dual-field __text companion approach from the original PR for lower storage cost and simpler query routing.

Architecture
Each text_and_string column creates one tantivy field with two independent behaviors:
- Inverted index: default tokenizer (lowercase + split on non-alphanumeric)
- Fast field: raw tokenizer (stores original string)

Write path
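As a rough sketch of the two behaviors listed under Architecture, the following std-only Rust approximates what happens to one column value at write time. The names `default_tokens`, `raw_token`, `IndexedValue`, and `index_value` are hypothetical, and `default_tokens` only approximates the tokenizer as described here (lowercase + split on non-alphanumeric); it is not tantivy's implementation.

```rust
/// Approximation of a tokenizer that lowercases and splits on
/// non-alphanumeric characters. Empty fragments are dropped.
fn default_tokens(value: &str) -> Vec<String> {
    value
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// The raw behavior keeps the whole string as one unmodified value.
fn raw_token(value: &str) -> String {
    value.to_string()
}

/// Hypothetical view of what one text_and_string value contributes:
/// terms for the inverted index and the original string for the fast field.
struct IndexedValue {
    inverted_index_terms: Vec<String>,
    fast_field_value: String,
}

fn index_value(value: &str) -> IndexedValue {
    IndexedValue {
        inverted_index_terms: default_tokens(value),
        fast_field_value: raw_token(value),
    }
}
```

For example, `index_value("Hello, World!")` would give the search terms `hello` and `world` while the fast field keeps `Hello, World!` intact for aggregation and sorting.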
Read path — fast field transcoding
The companion read path normally transcodes string fast fields from parquet at query time (Hybrid mode). TextAndString fields are excluded from transcoding because they already have native fast data from set_fast(Some("raw")).
The exclusion uses manifest.string_indexing_modes (checking for TextAndString) rather than fast_field_tokenizer.is_some() because build_column_mapping sets fast_field_tokenizer on ALL Str columns; only string_indexing_modes correctly distinguishes TextAndString from regular string fields.
Without this exclusion, merge_two_columnars() combines native + transcoded data, producing duplicate ordinals that double GROUP BY counts (the bug this PR fixes).

Manifest representation
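A minimal sketch of how the two manifest fields could drive the transcode decision. The struct layout, the `Disabled` variant, the `ExactOnly` variant, and all function bodies are assumptions made for illustration; only the names `string_indexing_modes`, `fast_field_tokenizer`, `TextAndString`, `FastFieldMode::Hybrid`, and `ParquetOnly` come from this PR. For simplicity every column is treated as a Str column.

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum FastFieldMode { Disabled, Hybrid, ParquetOnly }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum StringIndexingMode { TextAndString, ExactOnly }

/// Hypothetical per-column manifest entry.
struct ColumnEntry {
    name: String,
    fast_field_tokenizer: Option<String>,
}

/// Hypothetical manifest shape holding both discriminator candidates.
struct Manifest {
    columns: Vec<ColumnEntry>,
    string_indexing_modes: HashMap<String, StringIndexingMode>,
}

/// Rejected discriminator: when every Str column carries Some("raw"),
/// this is true for all of them and cannot single out TextAndString.
fn has_fast_field_tokenizer(col: &ColumnEntry) -> bool {
    col.fast_field_tokenizer.is_some()
}

/// Chosen discriminator: look the column up in string_indexing_modes.
fn is_text_and_string(manifest: &Manifest, col: &ColumnEntry) -> bool {
    manifest.string_indexing_modes.get(&col.name)
        == Some(&StringIndexingMode::TextAndString)
}

/// Columns that still need transcoding from parquet in the given mode.
/// TextAndString is skipped only in Hybrid, where native fast data is used.
fn columns_to_transcode(manifest: &Manifest, mode: FastFieldMode) -> BTreeSet<String> {
    if mode == FastFieldMode::Disabled {
        return BTreeSet::new();
    }
    manifest
        .columns
        .iter()
        .filter(|col| !(mode == FastFieldMode::Hybrid && is_text_and_string(manifest, col)))
        .map(|col| col.name.clone())
        .collect()
}
```

With a manifest in which both a TextAndString column and a regular string column carry `Some("raw")`, `has_fast_field_tokenizer` is true for both, while `columns_to_transcode` in Hybrid mode still returns only the regular column.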
Regular string fields have fast_field_tokenizer: Some("raw") (needs transcoding).

Design decisions
- string_indexing_modes as discriminator, not fast_field_tokenizer: All Str columns have fast_field_tokenizer set in production (build_column_mapping defaults to Some("raw")). Using fast_field_tokenizer.is_some() would skip transcoding for ALL Str fields, breaking GROUP BY on regular string columns. string_indexing_modes is the authoritative source.
- fast_field_tokenizer: None for TextAndString: Fixed build_column_mapping to set None instead of Some("raw"). The field has native fast data and doesn't need transcoding, so None accurately represents this. Previously the misleading Some("raw") suggested it needed transcoding.
- Hybrid-only skip: The transcode skip only applies in FastFieldMode::Hybrid. In ParquetOnly mode, native .fast data is ignored entirely, so TextAndString must be transcoded to have any fast data at all.
- Backward compatibility: string_indexing_modes is #[serde(default)] so old manifests deserialize with an empty map. TextAndString and string_indexing_modes were introduced together, so no old manifest can have TextAndString native fast data without the corresponding entry.

Testing
6 Rust unit tests (transcode.rs):
- test_columns_to_transcode_hybrid
- test_columns_to_transcode_hybrid_distinguishes_text_and_string_from_regular: … fast_field_tokenizer but only TextAndString is skipped
- test_columns_to_transcode_parquet_only
- test_columns_to_transcode_hybrid_requested_text_and_string_still_skipped: … requested_columns can't force transcoding
- test_columns_to_transcode_disabled
- test_columns_to_transcode_with_filter
Test fixture (make_test_manifest) matches production: TextAndString field has fast_field_tokenizer: None + string_indexing_modes entry. Regular string field has fast_field_tokenizer: Some("raw") with no indexing mode.
3 Rust integration tests (indexing.rs):
- … default tokenizer + raw fast field on same field, no __text companion

Open items (out of scope for this PR)
- Rust-side fast field post-filter: Would eliminate ~10% PhraseQuery false positives in tantivy before Spark sees them. Rejected because the companion streaming path (nativeStartStreamingRetrieval) bypasses searchWithSplitQuery, so the filter would need to exist in two separate code paths. Spark's candidate post-filter already guarantees correctness. Revisit as performance optimization.
- Non-companion text_and_string: The Java SchemaBuilder.addTextField(fast=true) uses one tokenizer for both inverted index and fast field. Separate tokenizers (default for search, raw for fast) are only possible through the companion schema_derivation.rs path. Not a limitation in practice since text_and_string is companion-only.

Dependencies
🤖 Generated with Claude Code