Skip to content

feat: canonical H2O coverage — q6/q8/q9 adapters + engine-only timing + DataFusion memtable fix#4

Merged
singaraiona merged 8 commits into
RayforceDB:masterfrom
ser-vasilich:prototype
May 18, 2026
Merged

feat: canonical H2O coverage — q6/q8/q9 adapters + engine-only timing + DataFusion memtable fix#4
singaraiona merged 8 commits into
RayforceDB:masterfrom
ser-vasilich:prototype

Conversation

@ser-vasilich
Copy link
Copy Markdown
Contributor

@ser-vasilich ser-vasilich commented May 15, 2026

Summary

62 commits accumulating the canonical H2O (h2oai/db-benchmark) coverage on rayforce-bench: engine-only timing for SQL adapters, rayforce wrappers for q6/q8/q9, fairness fixes across adapters, and dashboard polish.

Headline changes

Engine-only timing across SQL adapters (20f915a)

Replace fetchall() / IPC-materialization with server-side draining or CREATE TEMPORARY TABLE patterns so each adapter is timed on engine work only, not Arrow IPC / Python conversion. Affects DuckDB, chDB, DataFusion, QuestDB, TimescaleDB.

Rayforce q6 / q8 / q9 adapters (a50ab48, 611bcb3, 626cd34, 99ae025)

  • q6 (median + stddev by id4,id5): native Column.median() + Column.std() via new engine OP_MEDIAN and existing stddev
  • q8 (largest 2 v3 by id6): Column.top(2) via engine OP_TOP_N then OP_GROUP_TOPK_ROWFORM (row-form emit, no LIST intermediate)
  • q9 (pearson² by id2,id4): two-stage adapter — Column.pearson_corr(...) then arithmetic squaring; required because ** 2 at top would block the DAG hash-agg lowering

Engine-side explode for q8 (raze + indexed gather) keeps the timed query in row form (200k rows) — matches DuckDB's ROW_NUMBER OVER PARTITION shape and SQL adapters' default materialization.

DataFusion memtable fix (eae3261)

register_csv produced a listing table that re-parsed CSV on every timed query (page cache avoided disk, but parse cost remained). Replaced with register_record_batches after one-shot collect(). Apples-to-apples vs duckdb/chdb/polars/pandas/rayforce which all hold native columnar storage. q4 154→17 ms, q6 312→148 ms, q8 367→262 ms.

Dashboard / framework polish (multiple)

  • Canonical H2O suite (groupby q1..q10 + canonical-join q1..q5 + sort_single/sort_multi)
  • Bonus suite (3-key joins, full-row sorts) under separate bench-bonus target
  • Per-adapter QUERY_STRINGS shown on the compare panel
  • Scaling sweep with operations panel split into groupby/join/sort
  • Histogram split fast/heavy
  • make check — cross-adapter result equivalence at all sizes 10..10m

Bench snapshots

  • d354496 — refresh after OP_GROUP_TOPK_ROWFORM (PR rayforce#203 merged)
  • 03d1cf4 — refresh after q6 + q10 bypass operators (PR rayforce#204)

Perf snapshot (10M rows, k=100 cardinality, engine-only timing)

query rayforce duckdb datafusion polars
q1 7 ms 38 ms 19 ms 30 ms
q2 14 ms 56 ms 41 ms
q3 38 ms 117 ms 167 ms
q4 13 ms 9 ms 18 ms 29 ms
q5 63 ms 115 ms 137 ms
q6 75 ms 188 ms 148 ms 236 ms
q7 58 ms 105 ms 145 ms
q8 45 ms 162 ms 264 ms 503 ms
q9 66 ms 80 ms 75 ms 405 ms
q10 170 ms 390 ms 420 ms 2018 ms

Rayforce wins 9/10 (q4 within ~4ms of duckdb — small-group mean by id4 shape where shared path dispatch overhead dominates).

Related

Test plan

  • make check LOCAL=1pass — 665/665 comparisons matched polars, 0 NYI (rtol=1e-06, atol=1e-09) across all 7 sizes × all 19 ops × all 6 adapters
  • make bench LOCAL=1 reproduces the perf numbers above
  • Reviewer: build with companion branches (RAYFORCE_LOCAL_PATH pointing at rayforce#204 checkout) and re-run make check + make bench

ser-vasilich and others added 7 commits May 10, 2026 19:30
…karounds

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
register_csv produces a listing table that re-parses CSV on every
timed query.  register_record_batches with the collected batches
caches the columnar layout in memory.  q4 154→17ms, q6 312→148ms,
q8 367→262ms — DataFusion now apples-to-apples with adapters that
hold native columnar storage.
q8's natural rayforce shape is 100k rows with LIST<F64>[2] cells —
duckdb's ROW_NUMBER() <= 2 SQL emits 200k exploded rows.  Timed bench
was unfair: rayforce skipped the row-materialisation cost SQL
adapters pay for.  Move the explode into the timed engine query
via raze + indexed gather (vectorised, no per-element lambda) so
both sides materialise 200k.  q8 163ms (100k rows) → 215ms (200k
rows) vs duckdb 198ms — ~apples-to-apples now.

Bundles the q9 two-stage adapter form already in the working tree.
run_groupby_q8's fast vectorised explode assumes K=2 everywhere (true
for canonical 10m k100, where every id6 group has ≥2 non-null v3).
Small check sizes (10..1m) hit groups with K=1 cells; the K=2-uniform
formula produces row-count mismatch.  Split: timed path keeps the
fast formula; materialize() reverts to a per-cell Python explode for
correctness across all check sizes.
@singaraiona singaraiona merged commit 7326193 into RayforceDB:master May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants