
feat: column-major streaming Parquet writer primitive #6384

Open

g-talbot wants to merge 1 commit into main from gtt/streaming-parquet-writer

Conversation

g-talbot (Contributor) commented May 5, 2026

Summary

PR-2 of the streaming-merge stack. Lands a primitive in
quickwit-parquet-engine/src/storage/streaming_writer.rs that wraps
SerializedFileWriter directly (not ArrowWriter) and exposes a
per-column write API. Each column's chunk is appended and flushed
before the next column is written, so peak memory per row group is
bounded by one column's encoded chunk plus small bookkeeping (bloom
filters + page indexes) — not by the entire row group.

No production callers in this PR. PR-3 will cut ingest over to a
single-RG writer built on this primitive; PR-6 will cut the merge
engine over and add direct page copy.
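For orientation, the column-at-a-time flush falls out of how the underlying parquet API is shaped. A minimal sketch against SerializedFileWriter itself, independent of this PR's wrapper (the schema, sink path, and elided value-appending are illustrative):

```rust
use std::{fs::File, sync::Arc};

use parquet::file::{properties::WriterProperties, writer::SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn demo() -> parquet::errors::Result<()> {
    let schema = Arc::new(parse_message_type(
        "message demo { required int64 ts; optional binary name (UTF8); }",
    )?);
    let props = Arc::new(WriterProperties::builder().build());
    let mut file_writer =
        SerializedFileWriter::new(File::create("/tmp/demo.parquet")?, schema, props)?;

    let mut rg_writer = file_writer.next_row_group()?;
    // Column writers are handed out strictly in schema order; closing one
    // flushes its encoded pages to the sink before the next is created, so
    // only a single column chunk is buffered at any point.
    while let Some(mut col_writer) = rg_writer.next_column()? {
        // ... append this column's values / def-levels via col_writer.typed::<T>() ...
        col_writer.close()?;
    }
    rg_writer.close()?;
    file_writer.close()?;
    Ok(())
}
```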

Caller contract

1. start_row_group()                    → RowGroupBuilder<'_, W>
2. RowGroupBuilder::write_next_column() once per top-level arrow field, in schema order
3. RowGroupBuilder::finish()
4. repeat (1)–(3) for additional row groups
5. StreamingParquetWriter::close()      → ParquetMetaData

Out-of-order calls (writing too many columns, calling finish before all
columns are written, or mismatched row counts across columns within an
RG) return a structured ParquetWriteError::SchemaValidation rather than
panicking. The compiler enforces that at most one row group is open at
a time via the lifetime on RowGroupBuilder.
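A sketch of the contract in use against the PR's types. Only the call order is taken from the contract above; the generic bound, the constructor, and a write_next_column that takes an arrow ArrayRef are assumptions:

```rust
use arrow_array::RecordBatch;
use parquet::file::metadata::ParquetMetaData;

// Hypothetical driver: one row group per batch, columns in schema order.
fn write_batches<W: std::io::Write + Send>(
    mut writer: StreamingParquetWriter<W>,
    batches: &[RecordBatch],
) -> Result<ParquetMetaData, ParquetWriteError> {
    for batch in batches {
        let mut rg = writer.start_row_group()?;  // (1)
        for column in batch.columns() {
            rg.write_next_column(column)?;       // (2) once per top-level field
        }
        rg.finish()?;                            // (3)
    }                                            // (4) repeat per row group
    writer.close()                               // (5) returns ParquetMetaData
}
```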

Limitations (PR-2)

Top-level arrow fields must each map to exactly one parquet leaf —
"flat" schemas of primitive, byte-array, and dictionary types. The
metrics schema is flat in this sense. Nested types (Struct, List,
Map) are rejected at start_row_group with a clear error message; if
PR-6 ever needs them, the implementation can extend the per-call
contract to consume multiple writers per arrow column.
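For instance, one List field should be enough to trip the check; writer_for_schema is a hypothetical test helper, and only the error type comes from this PR:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field};

// List maps to more than one parquet leaf plus repetition levels, so
// start_row_group must refuse it with SchemaValidation, not a panic.
let tags = Field::new(
    "tags",
    DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))),
    true,
);
let err = writer_for_schema(vec![tags]).start_row_group().unwrap_err();
assert!(matches!(err, ParquetWriteError::SchemaValidation { .. }));
```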

The streaming writer does not implicitly add the ARROW:schema IPC
entry that ArrowWriter writes by default. Callers that want the
self-describing arrow round-trip should pass their WriterProperties
through parquet::arrow::add_encoded_arrow_schema_to_metadata first.
PR-3 ingest and PR-6 merge will do this in their own setup helpers.
This is documented on try_new.
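Roughly, assuming that function is publicly reachable at the path above with a (&Schema, &mut WriterProperties) shape (worth verifying against the parquet version in use), and with the try_new constructor shape assumed:

```rust
use std::sync::Arc;
use parquet::file::properties::WriterProperties;

let mut props = WriterProperties::builder().build();
// Embed the IPC-encoded arrow schema as the ARROW:schema KV entry,
// matching what ArrowWriter does implicitly.
parquet::arrow::add_encoded_arrow_schema_to_metadata(&arrow_schema, &mut props);
// Hand try_new the enriched properties.
let writer = StreamingParquetWriter::try_new(sink, schema, Arc::new(props))?;
```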

Tests (13)

Round-trip: single RG, multi RG, nulls preserved.

Metadata identity vs ArrowWriter under identical WriterProperties:
single RG and multi RG. Asserts schema descriptor (column count +
names + physical types), per-RG sorting_columns, all qh.* KV
entries, per-column compression, per-column bloom filter presence,
per-column statistics_enabled level, num row groups, and per-RG
num_rows.

Functional contracts:

  • chunk-level statistics propagate (EnabledStatistics::Chunk)
  • bloom filter round-trip — present values match, absent values are
    rejected by the read-back filter
  • bounded memory — pending-writers' total memory_size is monotone
    non-increasing as columns are written
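The bounded-memory check might look like the following; pending_memory_size is a hypothetical test hook, and only the monotone non-increasing claim comes from the test description above:

```rust
let mut rg = writer.start_row_group()?;
let mut previous = rg.pending_memory_size(); // hypothetical accessor
for column in batch.columns() {
    rg.write_next_column(column)?;
    // Each column chunk is flushed as it is written, so the pending
    // writers' footprint must never grow while walking the schema.
    let current = rg.pending_memory_size();
    assert!(current <= previous);
    previous = current;
}
rg.finish()?;
```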

Caller-contract violations:

  • nested type rejected at start_row_group
  • writing past the last field → SchemaValidation
  • row-count mismatch across columns → SchemaValidation
  • finish before all columns → SchemaValidation
  • empty row group still produces a readable file with the expected
    schema
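As one concrete shape for these, the finish-before-all-columns case (batch setup elided; the error variant is from the PR, the match shape is an assumption):

```rust
let mut rg = writer.start_row_group()?;
rg.write_next_column(batch.column(0))?; // only the first of N fields
// Sealing the RG early must fail with a structured error, not a panic.
let err = rg.finish().unwrap_err();
assert!(matches!(err, ParquetWriteError::SchemaValidation { .. }));
```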

Position in stack

PR    Status        Description
PR-1  open (#6377)  Page-level stats + qh.rg_partition_prefix_len marker
PR-2  this PR       Streaming column-major writer primitive
PR-3  next          Ingest writer cuts over to single-RG using PR-2
PR-4                Streaming Parquet input reader (one footer GET + one body GET)
PR-5                Legacy multi-RG input adapter
PR-6                Streaming column-major merge engine; folds in direct page copy
PR-7                Wire ParquetMergeExecutor to streaming engine; delete downloader

PR-2 is independent of PR-1 review — it branches off main.

Test plan

  • cargo +nightly fmt --all -- --check (on the files touched)
  • RUSTFLAGS="-Dwarnings --cfg tokio_unstable" cargo clippy -p quickwit-parquet-engine --tests
  • cargo doc -p quickwit-parquet-engine --no-deps
  • cargo machete
  • bash quickwit/scripts/check_license_headers.sh
  • bash quickwit/scripts/check_log_format.sh
  • cargo nextest run -p quickwit-parquet-engine --all-features — 369 tests, all pass

🤖 Generated with Claude Code

Wraps SerializedFileWriter directly to expose a per-column write API
that flushes one column chunk at a time. Peak memory per row group is
bounded by a single column chunk plus small bookkeeping (bloom filters
+ page indexes), not by the whole row group.

PR-2 of the streaming-merge stack. No production callers in this PR;
PR-3 cuts ingest over to a single-RG writer built on this primitive,
PR-6 cuts the merge engine over.

The caller contract is a small state machine (start_row_group →
write_next_column × N → finish, repeated per row group, then close).
Out-of-order calls return a structured error rather than panicking.
Top-level arrow fields must each map to exactly one parquet leaf, which
the metrics schema satisfies; nested types are rejected at
start_row_group with a clear message.

13 tests cover: round-trip (single RG, multi RG, with nulls), metadata
identity vs ArrowWriter (single + multi RG, with KV entries and
sorting_columns populated), per-column statistics enabled
propagation, bloom filter functional round-trip, bounded-memory
contract, empty row group, and four caller-contract violations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>