feat: legacy multi-RG input adapter for ColumnPageStream (PR-5) #6408

Draft
g-talbot wants to merge 2 commits into gtt/column-page-stream-trait from gtt/legacy-input-adapter

@g-talbot g-talbot commented May 8, 2026

Summary

  • New LegacyInputAdapter exposes legacy multi-row-group parquet files through PR-5a's ColumnPageStream so PR-6's merge engine doesn't have to special-case them.
  • Pattern: one buffered GET → decode via ParquetRecordBatchReaderBuilder → arrow::compute::concat_batches (preserves order — does NOT re-sort) → re-encode as single-row-group via StreamingParquetWriter (set_max_row_group_row_count(num_rows + 1)) → wrap in private InMemoryByteSource → expose via inner StreamingParquetReader.
  • Carries forward key_value_metadata (qh.* keys) and RG0 sorting_columns to the consolidated output.
  • 4 GiB defensive cap on input file size (MAX_LEGACY_INPUT_BYTES) — production legacy files are well under 1 GiB.
  • Triggered when qh.rg_partition_prefix_len == 0 && num_row_groups > 1. (PR-6 implements the dispatch; this PR provides the adapter as one of the trait impls.)
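
The trigger condition, the size cap, and the order-preserving consolidation can be modelled without any parquet dependency. The sketch below is a toy model only: `FileInfo`, `needs_legacy_adapter`, `within_size_cap`, `write_row_groups`, and `consolidate` are hypothetical names, rows are plain `i64` values rather than Arrow arrays, and treating the cap as inclusive is an assumption. It is not the adapter's actual code.

```rust
/// Defensive cap from the summary: 4 GiB.
const MAX_LEGACY_INPUT_BYTES: u64 = 4 * 1024 * 1024 * 1024;

/// Hypothetical stand-in for the footer facts the dispatch inspects.
struct FileInfo {
    rg_partition_prefix_len: u32, // value of qh.rg_partition_prefix_len
    num_row_groups: usize,
    size_bytes: u64,
}

/// Trigger condition quoted in the summary.
fn needs_legacy_adapter(info: &FileInfo) -> bool {
    info.rg_partition_prefix_len == 0 && info.num_row_groups > 1
}

/// Assumed cap check: a file exactly at the cap is still accepted.
fn within_size_cap(info: &FileInfo) -> bool {
    info.size_bytes <= MAX_LEGACY_INPUT_BYTES
}

/// Toy row-group writer: splits rows into groups of at most `max_rows`.
/// `max_rows` must be non-zero, otherwise `chunks` panics.
fn write_row_groups(rows: &[i64], max_rows: usize) -> Vec<Vec<i64>> {
    rows.chunks(max_rows).map(|c| c.to_vec()).collect()
}

/// Concatenate row groups in file order (no re-sort), then re-chunk with a
/// cap of `num_rows + 1` so the toy writer never starts a second group.
fn consolidate(row_groups: &[Vec<i64>]) -> Vec<Vec<i64>> {
    let all: Vec<i64> = row_groups.iter().flatten().copied().collect();
    write_row_groups(&all, all.len() + 1)
}

fn main() {
    let info = FileInfo {
        rg_partition_prefix_len: 0,
        num_row_groups: 3,
        size_bytes: 512 << 20, // 512 MiB
    };
    assert!(needs_legacy_adapter(&info) && within_size_cap(&info));

    // Three unsorted row groups become one group in the original order.
    let out = consolidate(&[vec![3, 1], vec![2], vec![9, 4]]);
    println!("{:?}", out); // prints [[3, 1, 2, 9, 4]]
}
```

In this model the `+ 1` also keeps the cap non-zero for an empty input, which then consolidates to zero row groups; whether that matches the real writer's motivation is not stated in this PR.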

Stack

main
└── gtt/parquet-streaming-base (= main ∪ PR-2 #6384 ∪ PR-4 #6386)
    └── gtt/column-page-stream-trait (PR-5a #6406)
        ├── gtt/legacy-input-adapter ← PR-5 (this PR)
        └── gtt/parquet-page-decoder (PR-6a #6407)
            └── gtt/streaming-merge-engine-merger (PR-6b — next)
                └── gtt/streaming-merge-engine-multi-rg (PR-6c)

Test plan

  • test_empty_multi_rg_input — empty 2-RG fixture consolidates to 0 rows
  • test_multi_rg_consolidates_to_single_rg — 3-RG fixture → 1 RG, all pages at rg_idx == 0
  • test_data_roundtrip_through_adapter — adapter exposes oracle row count + schema
  • test_reencode_preserves_arrays_byte_equal — byte-equal data round trip through reencode_as_single_row_group (added on top of the agent's initial commit to close a value-corruption test gap)
  • test_kv_metadata_preserved — qh.sort_fields and qh.window_start_secs carried through
  • test_sorting_columns_preserved — RG0's sorting_columns preserved on the consolidated RG
  • test_dict_and_null_columns_preserved — data-page num_values sums match per column for dict + nullable columns
  • test_io_failure_surfaces_as_io_error — buffered GET failure surfaces as LegacyAdapterError::Io carrying the original io::Error
  • test_satisfies_column_page_stream_trait — drains via &mut dyn ColumnPageStream and confirms idempotent EOF
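
The "idempotent EOF" property checked by the last test can be illustrated with a stand-in trait. The real `ColumnPageStream` signature is not shown in this description, so `PageStream`, `next_page`, `InMemoryPages`, and `drain` below are hypothetical, and pages are modelled as raw byte vectors. This is a minimal sketch of the property, not the crate's test.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a page-stream trait; not the real ColumnPageStream.
trait PageStream {
    /// Returns the next page, or `None` once the stream is exhausted.
    fn next_page(&mut self) -> Option<Vec<u8>>;
}

/// Toy in-memory stream backed by a queue of pages.
struct InMemoryPages {
    pages: VecDeque<Vec<u8>>,
}

impl PageStream for InMemoryPages {
    fn next_page(&mut self) -> Option<Vec<u8>> {
        // An empty queue keeps returning None, so EOF is idempotent.
        self.pages.pop_front()
    }
}

/// Drains the stream through a trait object and counts the pages seen.
fn drain(stream: &mut dyn PageStream) -> usize {
    let mut count = 0;
    while stream.next_page().is_some() {
        count += 1;
    }
    count
}

fn main() {
    let mut source = InMemoryPages {
        pages: VecDeque::from(vec![vec![1u8], vec![2, 3]]),
    };
    let stream: &mut dyn PageStream = &mut source;

    assert_eq!(drain(stream), 2);
    // Polling again after EOF must keep reporting EOF rather than panic or restart.
    assert!(stream.next_page().is_none());
    assert!(stream.next_page().is_none());
}
```

Draining through `&mut dyn PageStream` rather than the concrete type also confirms the trait stays object-safe, which is presumably what lets a merge engine hold adapters and native readers behind one interface.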

CI gates locally green: clippy --workspace --all-features --tests with -Dwarnings, nightly fmt --check, cargo doc --no-deps, cargo machete, license headers, log format, typos. 9/9 adapter tests + 401/401 crate tests pass.

🤖 Generated with Claude Code

@g-talbot g-talbot force-pushed the gtt/column-page-stream-trait branch from af78c8f to 2714921 Compare May 8, 2026 20:49
@g-talbot g-talbot force-pushed the gtt/legacy-input-adapter branch from e169002 to 1af3a64 Compare May 8, 2026 20:49
@g-talbot g-talbot force-pushed the gtt/column-page-stream-trait branch from 2714921 to 61b6310 Compare May 8, 2026 21:27
@g-talbot g-talbot force-pushed the gtt/legacy-input-adapter branch from 1af3a64 to 65a686d Compare May 8, 2026 21:28
@g-talbot g-talbot force-pushed the gtt/column-page-stream-trait branch from 61b6310 to 4ae07e7 Compare May 8, 2026 21:46
@g-talbot g-talbot force-pushed the gtt/legacy-input-adapter branch from 65a686d to d2330a0 Compare May 8, 2026 21:46
@g-talbot g-talbot force-pushed the gtt/column-page-stream-trait branch from 4ae07e7 to d43186d Compare May 9, 2026 00:07
g-talbot and others added 2 commits May 8, 2026 20:07
Buffers a legacy parquet file fully, decodes via Arrow, concatenates
into a single RecordBatch preserving order, and re-encodes as a
single-row-group parquet stream that StreamingParquetReader can serve
through the ColumnPageStream contract. Carries forward the original
file's key_value_metadata and sorting_columns so downstream consumers
(merge engine, metadata readers) see an identical logical view. This
unblocks the merge engine's column-major streaming path on files
where the original RG layout is misaligned with the sort prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`test_data_roundtrip_through_adapter` checks row count and schema
names through the streaming path; that catches dropped rows but not
value-level corruption (e.g. a hypothetical dictionary key XOR or
column-value swap during the decode/concat/re-encode chain). Adds
`test_reencode_preserves_arrays_byte_equal` calling
`reencode_as_single_row_group` directly against a multi-RG fixture
that includes dict-encoded columns and nulls, and asserts each
column equals the oracle byte-for-byte.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/legacy-input-adapter branch from d2330a0 to bb9a78f Compare May 9, 2026 00:08