Multi-Valued Field Support: Handle List<Utf8> Arrow vectors and Parquet LIST columns #161

@Zach-Zach7

Description

Context

Users need to index Array[String] columns with specialized field types (IP, text, string, compact string modes) for full search semantics — CIDR/range queries on IP arrays, tokenized full-text search on text arrays. Tantivy natively supports multi-valued fields (calling addIpAddr()/addText() multiple times on the same document creates multiple indexed values), and queries automatically match against any value. The tantivy4java Document.getAll() method already exists for reading multi-valued fields back.

The Spark-side validation and Direct Interface changes are tracked in: indextables/indextables_spark (companion issue).

This issue covers the Rust-side changes needed in tantivy4java to handle List<Utf8> Arrow vectors and Parquet LIST columns as multi-valued fields.


Implementation Steps

Step 1: Arrow batch ingestion — handle List<Utf8> vectors

Location: The Arrow batch ingestion code in Rust (add_arrow_batch function / related ingestion path)

Current behavior: When processing a field config with "type":"ip" (or "text"/"string"), the code expects a scalar Utf8 Arrow vector. A List<Utf8> vector would fail or be misinterpreted.

Change: For those field types, inspect the Arrow vector's data type. If it is List<Utf8> rather than scalar Utf8:

  • Iterate list elements for each row
  • Call add_ip_addr() / add_text() per element
  • Skip null elements

Alternative approach: The Spark side can emit a "multi_valued":true flag in the field config JSON for clarity, so Rust can explicitly detect this case rather than inferring from the Arrow vector type alone. This may be cleaner than type-sniffing.
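If the explicit-flag approach is adopted, a field config entry might look like the following. The exact JSON shape is an assumption for illustration; only the `"multi_valued":true` flag is what this issue proposes:

```json
{
  "field_name": "source_ips",
  "type": "ip",
  "multi_valued": true
}
```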

// Pseudocode for the List<Utf8> handling:
let is_list_utf8 = matches!(
    arrow_column.data_type(),
    DataType::List(field) if field.data_type() == &DataType::Utf8
);
if field_config.multi_valued || is_list_utf8 {
    let list_array = arrow_column
        .as_any()
        .downcast_ref::<ListArray>()
        .expect("List<Utf8> column");
    for row in 0..list_array.len() {
        if list_array.is_null(row) { continue; }
        let values = list_array.value(row); // ArrayRef holding this row's elements
        let string_array = values
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("Utf8 elements");
        for i in 0..string_array.len() {
            if string_array.is_null(i) { continue; }
            let elem = string_array.value(i);
            match field_config.field_type {
                "ip" => document.add_ip_addr(field_name, elem), // elem must parse as an IP
                "text" => document.add_text(field_name, elem),
                _ => document.add_text(field_name, elem), // "string" and other raw modes
            }
        }
    }
}
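As a sanity check on the traversal above, here is a std-only sketch that models a List<Utf8> column as nested `Option`s and flattens it with the same null-skipping rules. The `ListColumn` type and function name are illustrative stand-ins, not the actual Arrow types:

```rust
// Std-only model of a List<Utf8> column: one entry per row, where an
// outer None marks a null row and an inner None marks a null element.
type ListColumn = Vec<Option<Vec<Option<String>>>>;

// Flatten a multi-valued column into (row, value) pairs, skipping null
// rows and null elements -- the same traversal the ingestion code would
// perform before calling add_ip_addr()/add_text() once per element.
fn flatten_multi_valued(column: &ListColumn) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    for (row, entry) in column.iter().enumerate() {
        let Some(values) = entry else { continue }; // null row: no values
        for value in values {
            if let Some(v) = value { // null element: skipped
                out.push((row, v.clone()));
            }
        }
    }
    out
}

fn main() {
    let column: ListColumn = vec![
        Some(vec![Some("10.0.0.1".into()), None, Some("10.0.0.2".into())]),
        None, // whole row is null
        Some(vec![Some("192.168.1.1".into())]),
    ];
    let flat = flatten_multi_valued(&column);
    assert_eq!(flat.len(), 3); // two values from row 0, one from row 2
    assert_eq!(flat[1], (0, "10.0.0.2".to_string()));
    println!("{:?}", flat);
}
```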

Step 2: Parquet ingestion (companion/sync) — handle LIST columns

Location: createFromParquet() / Parquet ingestion path in Rust

Current behavior: When the companion config marks a field as IP (or text with tokenizer overrides) and the Parquet column is a scalar UTF8, the code calls add_ip_addr() once per row. If the Parquet column is a LIST<UTF8> (repeated group), it would fail or be mishandled.

Change: In createFromParquet(), when such a field's Parquet column is a LIST&lt;UTF8&gt; (repeated group):

  • Iterate list elements for each row
  • Call add_ip_addr() / add_text() per element
  • Skip null elements

This is the same pattern as Step 1 but for the Parquet reader instead of Arrow batch ingestion.

// Pseudocode for Parquet LIST handling:
if parquet_column.is_list() {
    let list_column = parquet_column.as_list();
    for row in 0..num_rows {
        if list_column.is_null(row) { continue; }
        let values = list_column.value(row);
        for i in 0..values.len() {
            if values.is_null(i) { continue; }
            let elem = values.value(i);  // &str
            match field_type {
                "ip" => document.add_ip_addr(field_name, elem),
                "text" => document.add_text(field_name, elem),
                _ => document.add_text(field_name, elem),
            }
        }
    }
}
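One detail worth noting for the "ip" arm in both sketches: each string element still has to parse as an IP address, and Tantivy's Rust API stores IP fields as `Ipv6Addr`, so IPv4 values are mapped into IPv6 space before indexing. A std-only sketch of that per-element conversion; `parse_ip_element` is an illustrative name, not an existing helper:

```rust
use std::net::{IpAddr, Ipv6Addr};

// Parse one string element into the Ipv6Addr form Tantivy indexes,
// mapping IPv4 into IPv6 space. Returns None for unparseable input,
// which the caller can skip (or surface as an ingestion error).
fn parse_ip_element(elem: &str) -> Option<Ipv6Addr> {
    match elem.parse::<IpAddr>().ok()? {
        IpAddr::V4(v4) => Some(v4.to_ipv6_mapped()),
        IpAddr::V6(v6) => Some(v6),
    }
}

fn main() {
    assert_eq!(
        parse_ip_element("10.0.0.1").unwrap().to_string(),
        "::ffff:10.0.0.1"
    );
    assert!(parse_ip_element("not-an-ip").is_none()); // bad element: skip, don't index
    println!("ok");
}
```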

Notes

  • The Document.getAll() Java method already exists for reading multi-valued fields back — no read-side changes needed in tantivy4java.
  • The field config JSON from Spark will either: (a) include "multi_valued":true alongside the type, or (b) rely on Rust detecting List<Utf8> at runtime. Approach (a) is recommended for explicitness.
  • Tantivy's indexing semantics handle multi-valued fields naturally — calling add_ip_addr() multiple times on the same document creates multiple indexed values, and queries automatically match against any value.
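The "any value matches" semantics in the last bullet can be illustrated with a toy std-only model; the `HashMap` stands in for the index and the names are illustrative:

```rust
use std::collections::HashMap;

// Toy stand-in for an index over one multi-valued field: doc id -> all
// values added for that document. A term query matches a document if
// ANY of its values equals the term, mirroring how repeated
// add_ip_addr()/add_text() calls on one document behave in Tantivy.
fn matches_any(index: &HashMap<u32, Vec<String>>, doc: u32, term: &str) -> bool {
    index
        .get(&doc)
        .map_or(false, |values| values.iter().any(|v| v.as_str() == term))
}

fn main() {
    let mut index = HashMap::new();
    index.insert(1, vec!["10.0.0.1".to_string(), "192.168.1.5".to_string()]);
    assert!(matches_any(&index, 1, "192.168.1.5")); // second value still matches
    assert!(!matches_any(&index, 1, "172.16.0.1"));
    assert!(!matches_any(&index, 2, "10.0.0.1")); // unknown doc
    println!("ok");
}
```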
