Multi-Valued Field Support: Handle List<Utf8> Arrow vectors and Parquet LIST columns #161

@Zach-Zach7

Description

Context

Users need to index Array[String] columns with specialized field types (IP, text, string, compact string modes) for full search semantics — CIDR/range queries on IP arrays, tokenized full-text search on text arrays. Tantivy natively supports multi-valued fields (calling addIpAddr()/addText() multiple times on the same document creates multiple indexed values), and queries automatically match against any value. The tantivy4java Document.getAll() method already exists for reading multi-valued fields back.

The Spark-side validation and Direct Interface changes are tracked in: indextables/indextables_spark (companion issue).

This issue covers the Rust-side changes needed in tantivy4java to handle List<Utf8> Arrow vectors and Parquet LIST columns as multi-valued fields.


Implementation Steps

Step 1: Arrow batch ingestion — handle List<Utf8> vectors

Location: The Arrow batch ingestion code in Rust (add_arrow_batch function / related ingestion path)

Current behavior: When processing a field config with "type":"ip" (or "text"/"string"), the code expects a scalar Utf8 Arrow vector. A List<Utf8> vector would fail or be misinterpreted.

Change: For those field types, inspect the Arrow vector's data type. If it is List<Utf8> rather than scalar Utf8:

  • Iterate list elements for each row
  • Call add_ip_addr() / add_text() per element
  • Skip null elements

Alternative approach: The Spark side can emit a "multi_valued":true flag in the field config JSON for clarity, so Rust can explicitly detect this case rather than inferring from the Arrow vector type alone. This may be cleaner than type-sniffing.
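If the explicit-flag approach is adopted, a field config entry might look like the following. The exact JSON shape is an assumption for illustration; only the `"multi_valued":true` flag is what this issue proposes:

```json
{
  "field_name": "source_ips",
  "type": "ip",
  "multi_valued": true
}
```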

// Pseudocode for the List<Utf8> handling:
let is_list_utf8 = matches!(
    arrow_column.data_type(),
    DataType::List(field) if field.data_type() == &DataType::Utf8
);
if field_config.multi_valued || is_list_utf8 {
    let list_array = arrow_column
        .as_any()
        .downcast_ref::<ListArray>()
        .expect("List<Utf8> column");
    for row in 0..list_array.len() {
        if list_array.is_null(row) { continue; }
        let values = list_array.value(row); // ArrayRef holding this row's elements
        let string_array = values
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("Utf8 elements");
        for i in 0..string_array.len() {
            if string_array.is_null(i) { continue; }
            let elem = string_array.value(i);
            match field_config.field_type {
                "ip" => document.add_ip_addr(field_name, elem), // elem must parse as an IP
                "text" => document.add_text(field_name, elem),
                _ => document.add_text(field_name, elem), // "string" and other raw modes
            }
        }
    }
}
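As a sanity check on the traversal above, here is a std-only sketch that models a List<Utf8> column as nested `Option`s and flattens it with the same null-skipping rules. The `ListColumn` type and function name are illustrative stand-ins, not the actual Arrow types:

```rust
// Std-only model of a List<Utf8> column: one entry per row, where an
// outer None marks a null row and an inner None marks a null element.
type ListColumn = Vec<Option<Vec<Option<String>>>>;

// Flatten a multi-valued column into (row, value) pairs, skipping null
// rows and null elements -- the same traversal the ingestion code would
// perform before calling add_ip_addr()/add_text() once per element.
fn flatten_multi_valued(column: &ListColumn) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    for (row, entry) in column.iter().enumerate() {
        let Some(values) = entry else { continue }; // null row: no values
        for value in values {
            if let Some(v) = value { // null element: skipped
                out.push((row, v.clone()));
            }
        }
    }
    out
}

fn main() {
    let column: ListColumn = vec![
        Some(vec![Some("10.0.0.1".into()), None, Some("10.0.0.2".into())]),
        None, // whole row is null
        Some(vec![Some("192.168.1.1".into())]),
    ];
    let flat = flatten_multi_valued(&column);
    assert_eq!(flat.len(), 3); // two values from row 0, one from row 2
    assert_eq!(flat[1], (0, "10.0.0.2".to_string()));
    println!("{:?}", flat);
}
```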

Step 2: Parquet ingestion (companion/sync) — handle LIST columns

Location: createFromParquet() / Parquet ingestion path in Rust

Current behavior: When the companion config marks a field as IP (or text with tokenizer overrides) and the Parquet column is a scalar UTF8, the code calls add_ip_addr() once per row. If the Parquet column is a LIST<UTF8> (repeated group), it would fail or be mishandled.

Change: In createFromParquet(), when such a field's Parquet column is a LIST&lt;UTF8&gt; (repeated group):

  • Iterate list elements for each row
  • Call add_ip_addr() / add_text() per element
  • Skip null elements

This is the same pattern as Step 1 but for the Parquet reader instead of Arrow batch ingestion.

// Pseudocode for Parquet LIST handling:
if parquet_column.is_list() {
    let list_column = parquet_column.as_list();
    for row in 0..num_rows {
        if list_column.is_null(row) { continue; }
        let values = list_column.value(row);
        for i in 0..values.len() {
            if values.is_null(i) { continue; }
            let elem = values.value(i);  // &str
            match field_type {
                "ip" => document.add_ip_addr(field_name, elem),
                "text" => document.add_text(field_name, elem),
                _ => document.add_text(field_name, elem),
            }
        }
    }
}
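One detail worth noting for the "ip" arm in both sketches: each string element still has to parse as an IP address, and Tantivy's Rust API stores IP fields as `Ipv6Addr`, so IPv4 values are mapped into IPv6 space before indexing. A std-only sketch of that per-element conversion; `parse_ip_element` is an illustrative name, not an existing helper:

```rust
use std::net::{IpAddr, Ipv6Addr};

// Parse one string element into the Ipv6Addr form Tantivy indexes,
// mapping IPv4 into IPv6 space. Returns None for unparseable input,
// which the caller can skip (or surface as an ingestion error).
fn parse_ip_element(elem: &str) -> Option<Ipv6Addr> {
    match elem.parse::<IpAddr>().ok()? {
        IpAddr::V4(v4) => Some(v4.to_ipv6_mapped()),
        IpAddr::V6(v6) => Some(v6),
    }
}

fn main() {
    assert_eq!(
        parse_ip_element("10.0.0.1").unwrap().to_string(),
        "::ffff:10.0.0.1"
    );
    assert!(parse_ip_element("not-an-ip").is_none()); // bad element: skip, don't index
    println!("ok");
}
```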

Notes

  • The Document.getAll() Java method already exists for reading multi-valued fields back — no read-side changes needed in tantivy4java.
  • The field config JSON from Spark will either: (a) include "multi_valued":true alongside the type, or (b) rely on Rust detecting List<Utf8> at runtime. Approach (a) is recommended for explicitness.
  • Tantivy's indexing semantics handle multi-valued fields naturally — calling add_ip_addr() multiple times on the same document creates multiple indexed values, and queries automatically match against any value.
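The "any value matches" semantics in the last bullet can be illustrated with a toy std-only model; the `HashMap` stands in for the index and the names are illustrative:

```rust
use std::collections::HashMap;

// Toy stand-in for an index over one multi-valued field: doc id -> all
// values added for that document. A term query matches a document if
// ANY of its values equals the term, mirroring how repeated
// add_ip_addr()/add_text() calls on one document behave in Tantivy.
fn matches_any(index: &HashMap<u32, Vec<String>>, doc: u32, term: &str) -> bool {
    index
        .get(&doc)
        .map_or(false, |values| values.iter().any(|v| v.as_str() == term))
}

fn main() {
    let mut index = HashMap::new();
    index.insert(1, vec!["10.0.0.1".to_string(), "192.168.1.5".to_string()]);
    assert!(matches_any(&index, 1, "192.168.1.5")); // second value still matches
    assert!(!matches_any(&index, 1, "172.16.0.1"));
    assert!(!matches_any(&index, 2, "10.0.0.1")); // unknown doc
    println!("ok");
}
```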
