
feat: column-major streaming Parquet writer primitive #6384

Open

g-talbot wants to merge 1 commit into main from gtt/streaming-parquet-writer

Conversation

g-talbot (Contributor) commented May 5, 2026

Summary

PR-2 of the streaming-merge stack. Lands a primitive in
quickwit-parquet-engine/src/storage/streaming_writer.rs that wraps
SerializedFileWriter directly (not ArrowWriter) and exposes a
per-column write API. Each column's chunk is appended and flushed
before the next column is written, so peak memory per row group is
bounded by one column's encoded chunk plus small bookkeeping (bloom
filters + page indexes) — not by the entire row group.

No production callers in this PR. PR-3 will cut ingest over to a
single-RG writer built on this primitive; PR-6 will cut the merge
engine over and add direct page copy.
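For orientation, the column-at-a-time flush falls out of how the underlying parquet API is shaped. A minimal sketch against SerializedFileWriter itself, independent of this PR's wrapper (the schema, sink path, and elided value-appending are illustrative):

```rust
use std::{fs::File, sync::Arc};

use parquet::file::{properties::WriterProperties, writer::SerializedFileWriter};
use parquet::schema::parser::parse_message_type;

fn demo() -> parquet::errors::Result<()> {
    let schema = Arc::new(parse_message_type(
        "message demo { required int64 ts; optional binary name (UTF8); }",
    )?);
    let props = Arc::new(WriterProperties::builder().build());
    let mut file_writer =
        SerializedFileWriter::new(File::create("/tmp/demo.parquet")?, schema, props)?;

    let mut rg_writer = file_writer.next_row_group()?;
    // Column writers are handed out strictly in schema order; closing one
    // flushes its encoded pages to the sink before the next is created, so
    // only a single column chunk is buffered at any point.
    while let Some(mut col_writer) = rg_writer.next_column()? {
        // ... append this column's values / def-levels via col_writer.typed::<T>() ...
        col_writer.close()?;
    }
    rg_writer.close()?;
    file_writer.close()?;
    Ok(())
}
```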

Caller contract

1. start_row_group()                    → RowGroupBuilder<'_, W>
2. RowGroupBuilder::write_next_column() once per top-level arrow field, in schema order
3. RowGroupBuilder::finish()
4. repeat (1)–(3) for additional row groups
5. StreamingParquetWriter::close()      → ParquetMetaData

Out-of-order calls (writing too many columns, calling finish before all
columns are written, or mismatched row counts across columns within an
RG) return a structured ParquetWriteError::SchemaValidation rather than
panicking. The compiler enforces that at most one row group is open at
a time via the lifetime on RowGroupBuilder.
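A sketch of the contract in use against the PR's types. Only the call order is taken from the contract above; the generic bound, the constructor, and a write_next_column that takes an arrow ArrayRef are assumptions:

```rust
use arrow_array::RecordBatch;
use parquet::file::metadata::ParquetMetaData;

// Hypothetical driver: one row group per batch, columns in schema order.
fn write_batches<W: std::io::Write + Send>(
    mut writer: StreamingParquetWriter<W>,
    batches: &[RecordBatch],
) -> Result<ParquetMetaData, ParquetWriteError> {
    for batch in batches {
        let mut rg = writer.start_row_group()?;  // (1)
        for column in batch.columns() {
            rg.write_next_column(column)?;       // (2) once per top-level field
        }
        rg.finish()?;                            // (3)
    }                                            // (4) repeat per row group
    writer.close()                               // (5) returns ParquetMetaData
}
```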

Limitations (PR-2)

Top-level arrow fields must each map to exactly one parquet leaf —
"flat" schemas of primitive, byte-array, and dictionary types. The
metrics schema is flat in this sense. Nested types (Struct, List,
Map) are rejected at start_row_group with a clear error message; if
PR-6 ever needs them, the implementation can extend the per-call
contract to consume multiple writers per arrow column.
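For instance, one List field should be enough to trip the check; writer_for_schema is a hypothetical test helper, and only the error type comes from this PR:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field};

// List maps to more than one parquet leaf plus repetition levels, so
// start_row_group must refuse it with SchemaValidation, not a panic.
let tags = Field::new(
    "tags",
    DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))),
    true,
);
let err = writer_for_schema(vec![tags]).start_row_group().unwrap_err();
assert!(matches!(err, ParquetWriteError::SchemaValidation { .. }));
```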

The streaming writer does not implicitly add the ARROW:schema IPC
entry that ArrowWriter writes by default. Callers that want the
self-describing arrow round-trip should pass their WriterProperties
through parquet::arrow::add_encoded_arrow_schema_to_metadata first.
PR-3 ingest and PR-6 merge will do this in their own setup helpers.
This is documented on try_new.
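Roughly, assuming that function is publicly reachable at the path above with a (&Schema, &mut WriterProperties) shape (worth verifying against the parquet version in use), and with the try_new constructor shape assumed:

```rust
use std::sync::Arc;
use parquet::file::properties::WriterProperties;

let mut props = WriterProperties::builder().build();
// Embed the IPC-encoded arrow schema as the ARROW:schema KV entry,
// matching what ArrowWriter does implicitly.
parquet::arrow::add_encoded_arrow_schema_to_metadata(&arrow_schema, &mut props);
// Hand try_new the enriched properties.
let writer = StreamingParquetWriter::try_new(sink, schema, Arc::new(props))?;
```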

Tests (13)

Round-trip: single RG, multi RG, nulls preserved.

Metadata identity vs ArrowWriter under identical WriterProperties:
single RG and multi RG. Asserts schema descriptor (column count +
names + physical types), per-RG sorting_columns, all qh.* KV
entries, per-column compression, per-column bloom filter presence,
per-column statistics_enabled level, num row groups, and per-RG
num_rows.

Functional contracts:

  • chunk-level statistics propagate (EnabledStatistics::Chunk)
  • bloom filter round-trip — present values match, absent values are
    rejected by the read-back filter
  • bounded memory — pending-writers' total memory_size is monotone
    non-increasing as columns are written
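The bounded-memory check might look like the following; pending_memory_size is a hypothetical test hook, and only the monotone non-increasing claim comes from the test description above:

```rust
let mut rg = writer.start_row_group()?;
let mut previous = rg.pending_memory_size(); // hypothetical accessor
for column in batch.columns() {
    rg.write_next_column(column)?;
    // Each column chunk is flushed as it is written, so the pending
    // writers' footprint must never grow while walking the schema.
    let current = rg.pending_memory_size();
    assert!(current <= previous);
    previous = current;
}
rg.finish()?;
```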

Caller-contract violations:

  • nested type rejected at start_row_group
  • writing past the last field → SchemaValidation
  • row-count mismatch across columns → SchemaValidation
  • finish before all columns → SchemaValidation
  • empty row group still produces a readable file with the expected
    schema
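As one concrete shape for these, the finish-before-all-columns case (batch setup elided; the error variant is from the PR, the match shape is an assumption):

```rust
let mut rg = writer.start_row_group()?;
rg.write_next_column(batch.column(0))?; // only the first of N fields
// Sealing the RG early must fail with a structured error, not a panic.
let err = rg.finish().unwrap_err();
assert!(matches!(err, ParquetWriteError::SchemaValidation { .. }));
```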

Position in stack

PR    Status        Description
PR-1  open (#6377)  Page-level stats + qh.rg_partition_prefix_len marker
PR-2  this PR       Streaming column-major writer primitive
PR-3  next          Ingest writer cuts over to single-RG using PR-2
PR-4                Streaming Parquet input reader (one footer GET + one body GET)
PR-5                Legacy multi-RG input adapter
PR-6                Streaming column-major merge engine; folds in direct page copy
PR-7                Wire ParquetMergeExecutor to streaming engine; delete downloader

PR-2 is independent of PR-1 review — it branches off main.

Test plan

  • cargo +nightly fmt --all -- --check (on the files touched)
  • RUSTFLAGS="-Dwarnings --cfg tokio_unstable" cargo clippy -p quickwit-parquet-engine --tests
  • cargo doc -p quickwit-parquet-engine --no-deps
  • cargo machete
  • bash quickwit/scripts/check_license_headers.sh
  • bash quickwit/scripts/check_log_format.sh
  • cargo nextest run -p quickwit-parquet-engine --all-features — 369 tests, all pass

🤖 Generated with Claude Code

Wraps SerializedFileWriter directly to expose a per-column write API
that flushes one column chunk at a time. Peak memory per row group is
bounded by a single column chunk plus small bookkeeping (bloom filters
+ page indexes), not by the whole row group.

PR-2 of the streaming-merge stack. No production callers in this PR;
PR-3 cuts ingest over to a single-RG writer built on this primitive,
PR-6 cuts the merge engine over.

The caller contract is a small state machine (start_row_group →
write_next_column × N → finish, repeated per row group, then close).
Out-of-order calls return a structured error rather than panicking.
Top-level arrow fields must each map to exactly one parquet leaf, which
the metrics schema satisfies; nested types are rejected at
start_row_group with a clear message.

13 tests cover: round-trip (single RG, multi RG, with nulls), metadata
identity vs ArrowWriter (single + multi RG, with KV entries and
sorting_columns populated), per-column statistics enabled
propagation, bloom filter functional round-trip, bounded-memory
contract, empty row group, and four caller-contract violations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>