Context
Users need to index Array[String] columns with specialized field types (IP, text, string, compact string modes) for full search semantics — CIDR/range queries on IP arrays, tokenized full-text search on text arrays. Tantivy natively supports multi-valued fields (calling addIpAddr()/addText() multiple times on the same document creates multiple indexed values), and queries automatically match against any value. The tantivy4java Document.getAll() method already exists for reading multi-valued fields back.
The Spark-side validation and Direct Interface changes are tracked in: indextables/indextables_spark (companion issue).
This issue covers the Rust-side changes needed in tantivy4java to handle List<Utf8> Arrow vectors and Parquet LIST columns as multi-valued fields.
Implementation Steps
Step 1: Arrow batch ingestion — handle List<Utf8> vectors
Location: The Arrow batch ingestion code in Rust (add_arrow_batch function / related ingestion path)
Current behavior: When processing a field config with "type":"ip" (or "text"/"string"), the code expects a scalar Utf8 Arrow vector. A List<Utf8> vector would fail or be misinterpreted.
Change: When processing a field config with "type":"ip" (or "text"/"string"), check the Arrow vector type. If it's a List<Utf8> instead of Utf8:
- Iterate list elements for each row
- Call add_ip_addr() / add_text() per element
- Skip null elements
Alternative approach: The Spark side can emit a "multi_valued":true flag in the field config JSON for clarity, so Rust can explicitly detect this case rather than inferring from the Arrow vector type alone. This may be cleaner than type-sniffing.
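As a concrete illustration of approach (a), a per-field config entry might look like the following. The key names shown here are an assumption for illustration, not a finalized schema:

```json
{
  "name": "client_ips",
  "type": "ip",
  "multi_valued": true
}
```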
// Pseudocode for the List<Utf8> handling (arrow-rs style):
let is_list_of_utf8 = matches!(
    arrow_column.data_type(),
    DataType::List(item) if item.data_type() == &DataType::Utf8
);
if field_config.multi_valued || is_list_of_utf8 {
    let list_array = arrow_column
        .as_any()
        .downcast_ref::<ListArray>()
        .expect("List<Utf8> column");
    for row in 0..list_array.len() {
        if list_array.is_null(row) {
            continue; // null list: no values for this document
        }
        let values = list_array.value(row);
        let string_array = values
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("Utf8 list items");
        for i in 0..string_array.len() {
            if string_array.is_null(i) {
                continue; // skip null elements inside the list
            }
            let elem = string_array.value(i);
            match field_config.field_type.as_str() {
                "ip" => document.add_ip_addr(field_name, elem),
                "text" => document.add_text(field_name, elem),
                _ => document.add_text(field_name, elem),
            }
        }
    }
}
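Independent of the Arrow API details, the null-skipping rules above (a null list yields no values for the document; null elements inside a list are skipped) can be sketched and checked in plain Rust. `flatten_multi_valued` is a hypothetical helper for illustration, not part of tantivy4java:

```rust
// Flatten multi-valued rows into (row, value) pairs, applying the
// null-skipping rules from Step 1. Each surviving pair corresponds
// to one add_ip_addr()/add_text() call on the document for that row.
fn flatten_multi_valued(rows: &[Option<Vec<Option<&str>>>]) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    for (row, list) in rows.iter().enumerate() {
        let Some(values) = list else { continue }; // null list: no values
        for v in values {
            if let Some(s) = v {
                out.push((row, s.to_string())); // one add_* call per element
            }
        }
    }
    out
}
```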
Step 2: Parquet ingestion (companion/sync) — handle LIST columns
Location: createFromParquet() / Parquet ingestion path in Rust
Current behavior: When the companion config marks a field as IP (or text with tokenizer overrides) and the Parquet column is a scalar UTF8, the code calls add_ip_addr() once per row. If the Parquet column is a LIST<UTF8> (repeated group), it would fail or be mishandled.
Change: In createFromParquet(), when the companion config marks a field as IP (or text/tokenizer override) and the Parquet column is a LIST<UTF8> (repeated group):
- Iterate list elements for each row
- Call add_ip_addr() / add_text() per element
- Skip null elements
This is the same pattern as Step 1 but for the Parquet reader instead of Arrow batch ingestion.
// Pseudocode for Parquet LIST handling (method names illustrative,
// not the actual parquet crate API):
if parquet_column.is_list() {
    let list_column = parquet_column.as_list();
    for row in 0..num_rows {
        if list_column.is_null(row) {
            continue; // null list: no values for this document
        }
        let values = list_column.value(row);
        for i in 0..values.len() {
            if values.is_null(i) {
                continue; // skip null elements inside the list
            }
            let elem = values.value(i); // &str
            match field_type {
                "ip" => document.add_ip_addr(field_name, elem),
                "text" => document.add_text(field_name, elem),
                _ => document.add_text(field_name, elem),
            }
        }
    }
}
Notes
- The Document.getAll() Java method already exists for reading multi-valued fields back — no read-side changes needed in tantivy4java.
- The field config JSON from Spark will either: (a) include "multi_valued":true alongside the type, or (b) rely on Rust detecting List<Utf8> at runtime. Approach (a) is recommended for explicitness.
- Tantivy's indexing semantics handle multi-valued fields naturally — calling add_ip_addr() multiple times on the same document creates multiple indexed values, and queries automatically match against any value.
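The "match against any value" behavior can be modeled as a simple disjunction over a document's values. This is a toy stand-in for the real query evaluation, with a plain closure in place of a tantivy query (such as a CIDR range check):

```rust
// Toy model of multi-valued match semantics: a document matches when
// ANY of its indexed values satisfies the predicate.
fn matches_any(values: &[&str], predicate: impl Fn(&str) -> bool) -> bool {
    values.iter().any(|v| predicate(v))
}
```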