Skip to content

perf(eap-items): Encode RowBinary inserts in the processor#7979

Open
phacops wants to merge 7 commits into
masterfrom
perf/eap-items-rowbinary-encode-in-processor
Open

perf(eap-items): Encode RowBinary inserts in the processor#7979
phacops wants to merge 7 commits into
masterfrom
perf/eap-items-rowbinary-encode-in-processor

Conversation

@phacops
Copy link
Copy Markdown
Contributor

@phacops phacops commented May 29, 2026

The RowBinary consumer kept Vec<EAPItemRow> alive in every batch from
the processor all the way to the writer. Each row carries ~80 attribute
buckets and their String allocations, so consumers ran with 2-3× the
in-flight memory of the JSONEachRow path and clickhouse-rs's Insert
builder double-buffered the encoded bytes during the write itself.

Each row is now serialized to RowBinary bytes inside
process_message_row_binary and the typed struct drops on the spot;
the pipeline carries only RowData { encoded_rows: Vec<u8>, num_rows }
the rest of the way. The writer is the same ClickhouseWriterStep the
JSON path uses, just with FORMAT RowBinary.

Vendored RowBinary serializer

clickhouse-rs's serializer is pub(crate) and Insert::write ties it
to an HTTP request, so there is no public surface to call it as
"serialize one row into a Vec<u8>". The new
strategies/clickhouse/rowbinary/ module is a ~280-line serde
Serializer covering exactly what EAPItemRow emits (LE primitives,
LEB128 string/seq length prefixes, raw-byte UUID with the half-reversal
ClickHouse's UUID type expects), with unit tests for each case.

Pipeline collapses to one shape

use_row_binary no longer takes a separate pipeline branch.
factory_v2.rs picks the processor function and the InsertFormat
once and the standard Reduce → ClickhouseWriterStep flow handles
both. TypedInsertBatch, EstimatedSize, the BytesInsertBatch<Vec<T>>
impl, make_rust_processor_row_binary, and
ClickhouseRowBinaryWriterStep are gone.

clickhouse-rs dependency dropped

The integration test (test_row_binary_clickhouse_insert) now exercises
the production path: it takes the processor's encoded bytes and POSTs
them with reqwest, then reads back via FORMAT JSON. With that, the
crate is removed from Cargo.toml and ~95 lines disappear from
Cargo.lock.

Agent transcript: https://claudescope.sentry.dev/share/ocWUXnBe7IomlfhNcOgtWgTf1Za2xq3Y5XMtqWqjSuo

The RowBinary consumer kept Vec<EAPItemRow> alive in every batch all the
way from the processor to the writer. Each row has ~80 attribute-bucket
Vecs plus per-bucket String allocations, so consumers ran with 2-3x the
in-flight memory of the JSONEachRow path. The clickhouse-rs Insert
builder also double-buffered the encoded bytes during writes.

Serialize each EAPItemRow to RowBinary bytes inside the processor and
carry only the resulting RowData<Vec<u8>> downstream. The typed struct
drops as soon as the bytes exist; the writer just POSTs them with
INSERT INTO ... FORMAT RowBinary, mirroring the JSON path.

Vendor a small (~280 line) RowBinary serde Serializer plus the UUID
adapter from clickhouse-rs, since the upstream serializer is pub(crate).
With that in place, drop the clickhouse-rs dependency entirely - the
integration test now talks to ClickHouse via reqwest + FORMAT JSON.

The RowBinary and JSONEachRow pipelines now share the same Reduce ->
Writer shape; use_row_binary just selects the processor function and
the FORMAT clause. TypedInsertBatch, EstimatedSize, the
BytesInsertBatch<Vec<T>> impl, make_rust_processor_row_binary, and
ClickhouseRowBinaryWriterStep are all gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/39S080clqOozGCknLHK2EbqiXwoO9gjlPnmgL_y0M0Q
@phacops phacops requested a review from a team as a code owner May 29, 2026 19:18
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5c262e1. Configure here.

Comment thread rust_snuba/src/factory_v2.rs
phacops and others added 6 commits May 29, 2026 12:28
When use_row_binary=true and use_rust_processor=false, the factory's
match fell through to the Python processor (which emits JSONEachRow
bytes) while the writer was already configured with FORMAT RowBinary -
ClickHouse would reject the batch. The pre-collapse code avoided this
by early-returning from create_row_binary_pipeline, which always used
the Rust processor regardless of use_rust_processor.

Coalesce use_rust_processor || use_row_binary at the dispatch site so
the override path is always reachable when RowBinary is requested.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without a column list, ClickHouse maps the RowBinary stream to columns
using the table's on-disk positional order, which does not match the
EAPItemRow struct's wire order in two places:

* `client_sample_rate` and `server_sample_rate` were added by migration
  0048 with identical `AFTER sampling_factor`, so on disk the pair is
  reversed (server precedes client) while the struct keeps client first.
* The initial 0024 migration places all `attributes_string_*` columns
  before all `attributes_float_*`, but the struct interleaves them per
  the `seq_attrs!` expansion.

The integration test exposed this as `CANNOT_READ_ALL_DATA` deep inside
the bucket maps. clickhouse-rs masked the same hazard by always emitting
the column list via the `Row` derive; our vendored serializer has to
opt back in. Add `EAPItemRow::COLUMN_NAMES` in struct order, thread an
`Option<&'static [&'static str]>` through `ClickhouseWriterStep` and
`ClickhouseClient`, and expand it as `INSERT INTO t (col1, ...) FORMAT
RowBinary`. JSONEachRow stays column-list-less.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/xYvkk5ydje_vV3XmhCpsmSV44ohsxFeKHfKRg-ZUaIk
ClickhouseWriterStep::new took 7 args before the column-list parameter
landed; the 8th tipped it over the clippy::too_many_arguments threshold
(set repo-wide via -D warnings) and broke "Linting - Rust" in CI.

The arguments map 1:1 to ClickhouseClient/ClickhouseConfig knobs the
factory already owns separately, so bundling them just to satisfy the
lint would add an indirection without a real abstraction. Allow the
lint at the call site instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The integration test's `assert!(status.is_success())` for the read-back
SELECT was hiding ClickHouse's error message — only the assertion's
static "Select failed" string surfaced in CI logs. Capture the status
and response body up front and include both in the assertion message so
the next CI failure tells us what ClickHouse actually rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/fQtr5P5jc3905fJGpnXjV3rTvV3OFF6pHP9bx72Meo4
ClickHouse 25.x rejects bodyless POST requests with HTTP 411
HTTP_LENGTH_REQUIRED ("Transfer-Encoding is not chunked and there is
no Content-Length header for POST request"). reqwest only emits
Content-Length when a body is set, so POST + URL-encoded query +
no body trips the check.

GET fits the SELECT semantically anyway. For the ALTER TABLE cleanup
keep POST but attach an empty body so reqwest emits Content-Length: 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The JSON processor builds its batch via `InsertBatch::from_rows(...)?`
and then sets `cogs_data` / `item_type_metrics` on the returned value.
The RowBinary processor was hand-constructing the `InsertBatch` literal
because there was no parallel "I already have the wire bytes"
constructor. The two paths now look the same: build the batch, set the
two metric fields, return.

Pure refactor — no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant