feat: canonical H2O coverage: q6/q8/q9 adapters + engine-only timing + DataFusion memtable fix #4
Merged
…karounds Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
register_csv produces a listing table that re-parses CSV on every timed query. register_record_batches with the collected batches caches the columnar layout in memory. q4 154→17ms, q6 312→148ms, q8 367→262ms — DataFusion now apples-to-apples with adapters that hold native columnar storage.
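A minimal sketch of the parse-once pattern this commit describes, assuming the `datafusion` Python package. The helper name `memtable_context` and the table names are hypothetical, not the adapter's actual code. The CSV is read through a throwaway context, and only the collected Arrow batches are registered in the context that gets timed.

```python
def memtable_context(csv_path, name="x"):
    """Parse csv_path once and register the result as an in-memory table."""
    # Imported lazily so this module still loads where datafusion is absent.
    from datafusion import SessionContext

    # Stage 1: a throwaway context reads the CSV through a listing table.
    loader = SessionContext()
    loader.register_csv("src", csv_path)
    batches = loader.table("src").collect()  # one-shot parse into Arrow batches

    # Stage 2: the benchmark context only ever sees the cached batches,
    # passed as a single partition (a list of lists of RecordBatch).
    ctx = SessionContext()
    ctx.register_record_batches(name, [batches])
    return ctx
```

Queries issued against the returned context no longer touch the CSV file, so a timed `ctx.sql(...)` call should measure scan and aggregation work rather than text parsing.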
q8's natural rayforce shape is 100k rows with LIST<F64>[2] cells — duckdb's ROW_NUMBER() <= 2 SQL emits 200k exploded rows. Timed bench was unfair: rayforce skipped the row-materialisation cost SQL adapters pay for. Move the explode into the timed engine query via raze + indexed gather (vectorised, no per-element lambda) so both sides materialise 200k. q8 163ms (100k rows) → 215ms (200k rows) vs duckdb 198ms — ~apples-to-apples now. Bundles the q9 two-stage adapter form already in the working tree.
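The explode described here can be modelled in plain Python. This is only an illustration of the raze + indexed gather idea under the K=2-uniform assumption, not the rayforce query itself; `explode_uniform` and the sample data are hypothetical.

```python
def explode_uniform(ids, cells, k=2):
    """Pure-Python model of raze + indexed gather for fixed-width list cells."""
    # raze: concatenate every cell into one flat value vector
    values = [v for cell in cells for v in cell]
    # indexed gather: output row j takes its key from input row j // k
    out_ids = [ids[j // k] for j in range(len(values))]
    return out_ids, values


ids = ["a", "b", "c"]
cells = [[9.5, 7.0], [3.2, 3.1], [8.8, 1.4]]  # top-2 values per group
out_ids, values = explode_uniform(ids, cells)
print(list(zip(out_ids, values)))
```

Three input rows become six output rows, which mirrors the 100k to 200k change in row count that the timed query now pays for.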
run_groupby_q8's fast vectorised explode assumes K=2 everywhere (true for canonical 10m k100, where every id6 group has ≥2 non-null v3). Small check sizes (10..1m) hit groups with K=1 cells; the K=2-uniform formula produces row-count mismatch. Split: timed path keeps the fast formula; materialize() reverts to a per-cell Python explode for correctness across all check sizes.
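To see why the uniform formula breaks at small sizes, here is a hedged sketch with a ragged input. The names `explode_per_cell` and `uniform_row_count` are illustrative and not taken from the adapter.

```python
def explode_per_cell(ids, cells):
    """Explode list cells one at a time; correct for any cell length."""
    out_ids, values = [], []
    for key, cell in zip(ids, cells):
        for v in cell:
            out_ids.append(key)
            values.append(v)
    return out_ids, values


def uniform_row_count(cells, k=2):
    """Row count the K-uniform formula implicitly assumes."""
    return k * len(cells)


ids = ["a", "b", "c"]
cells = [[9.5, 7.0], [3.2], [8.8, 1.4]]  # group "b" has one non-null value
out_ids, values = explode_per_cell(ids, cells)
print(len(values), uniform_row_count(cells))
```

The per-cell path yields five rows where the uniform formula assumes six, which is the kind of row-count mismatch that `materialize()` avoids by taking the slower route.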
Summary
62 commits accumulating the canonical H2O (h2oai/db-benchmark) coverage on rayforce-bench: engine-only timing for SQL adapters, rayforce wrappers for q6/q8/q9, fairness fixes across adapters, and dashboard polish.
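As a rough illustration of what engine-only timing means for a SQL adapter, the sketch below uses the standard-library `sqlite3` module as a stand-in engine; the PR's adapters target DuckDB, chDB, DataFusion, QuestDB and TimescaleDB, and the function names and the `bench_result` table name here are hypothetical. The first timer includes pulling rows into Python objects, while the second keeps the result inside the engine.

```python
import sqlite3
import time


def time_with_fetch(conn, query):
    """Time the query plus materialising every row as Python objects."""
    t0 = time.perf_counter()
    rows = conn.execute(query).fetchall()
    return time.perf_counter() - t0, len(rows)


def time_engine_only(conn, query, tmp="bench_result"):
    """Time only the engine work by landing the result in a temp table."""
    conn.execute(f"DROP TABLE IF EXISTS temp.{tmp}")  # untimed cleanup
    t0 = time.perf_counter()
    conn.execute(f"CREATE TEMPORARY TABLE {tmp} AS {query}")
    elapsed = time.perf_counter() - t0
    # Row count is read outside the timed region, for sanity checks only.
    n = conn.execute(f"SELECT COUNT(*) FROM {tmp}").fetchone()[0]
    return elapsed, n
```

Both functions return the elapsed seconds and the result row count, so a harness can check that the two paths agree on shape before comparing timings.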
Headline changes
Engine-only timing across SQL adapters (20f915a)
Replace `fetchall()` / IPC-materialization with server-side draining or `CREATE TEMPORARY TABLE` patterns so each adapter is timed on engine work only, not Arrow IPC / Python conversion. Affects DuckDB, chDB, DataFusion, QuestDB, TimescaleDB.
Rayforce q6 / q8 / q9 adapters (a50ab48, 611bcb3, 626cd34, 99ae025)
- `Column.median()` + `Column.std()` via new engine `OP_MEDIAN` and existing stddev
- `Column.top(2)` via engine `OP_TOP_N` then `OP_GROUP_TOPK_ROWFORM` (row-form emit, no LIST intermediate)
- `Column.pearson_corr(...)` then arithmetic squaring; required because `** 2` at top would block the DAG hash-agg lowering
Engine-side explode for q8 (raze + indexed gather) keeps the timed query in row form (200k rows), which matches DuckDB's `ROW_NUMBER OVER PARTITION` shape and SQL adapters' default materialization.
DataFusion memtable fix (eae3261)
`register_csv` produced a listing table that re-parsed CSV on every timed query (page cache avoided disk, but parse cost remained). Replaced with `register_record_batches` after one-shot `collect()`. Apples-to-apples vs duckdb/chdb/polars/pandas/rayforce, which all hold native columnar storage. q4 154→17 ms, q6 312→148 ms, q8 367→262 ms.
Dashboard / framework polish (multiple)
- `bench-bonus` target
- `make check`: cross-adapter result equivalence at all sizes 10..10m
Bench snapshots
- d354496: refresh after `OP_GROUP_TOPK_ROWFORM` (PR rayforce#203 merged)
- 03d1cf4: refresh after q6 + q10 bypass operators (PR rayforce#204)
Perf snapshot (10M rows, k=100 cardinality, engine-only timing)
Rayforce wins 9/10 (q4 within ~4 ms of duckdb; small-group `mean by id4` shape where shared path dispatch overhead dominates).
Related
`top`, `bot`, `pearson_corr`, `std`, `median`, `__pow__`)
Test plan
- `make check LOCAL=1` → pass: 665/665 comparisons matched polars, 0 NYI (rtol=1e-06, atol=1e-09) across all 7 sizes × all 19 ops × all 6 adapters
- `make bench LOCAL=1` reproduces the perf numbers above
- `RAYFORCE_LOCAL_PATH` pointing at rayforce#204 checkout) and re-run `make check` + `make bench`
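For readers unfamiliar with tolerance-based equivalence checks, the sketch below shows one way a comparison against a reference result could work with the quoted `rtol=1e-06, atol=1e-09`. It assumes the NumPy-style rule `|a - b| <= atol + rtol * |b|` and rows whose leading columns are unique group keys; the actual `make check` logic is not shown in this PR text, and `values_close` / `rows_match` are hypothetical names.

```python
def values_close(a, b, rtol=1e-6, atol=1e-9):
    """Compare two cells: floats within tolerance, everything else exactly."""
    numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
    if numeric and (isinstance(a, float) or isinstance(b, float)):
        return abs(a - b) <= atol + rtol * abs(b)
    return a == b


def rows_match(result, reference, rtol=1e-6, atol=1e-9):
    """Order-insensitive comparison of two lists of row tuples."""
    if len(result) != len(reference):
        return False
    # Sorting aligns rows by their leading key columns.
    for got, want in zip(sorted(result), sorted(reference)):
        if len(got) != len(want):
            return False
        if not all(values_close(g, w, rtol, atol) for g, w in zip(got, want)):
            return False
    return True
```

A row-count mismatch fails immediately, which is how a wrong explode shape would surface before any floating-point comparison happens.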