Skip to content

(improvement) perf: remove copies on the read path#734

Draft
mykaul wants to merge 5 commits intoscylladb:masterfrom
mykaul:remove_copies
Draft

(improvement) perf: remove copies on the read path#734
mykaul wants to merge 5 commits intoscylladb:masterfrom
mykaul:remove_copies

Conversation

@mykaul
Copy link

@mykaul mykaul commented Mar 9, 2026

This patch set aims to reduce the memory copies we perform in the read path, to improve overall performance - mainly reduce latency on the processing side of the driver receiving the payload.
This is more effective on larger payloads of course.

Comparison to master:

Using the 50-iteration rerun (more stable) for the key scenarios, and the 20-iteration run for the rest:
Scenario	Body Size	master (median)	remove_copies (median)	Speedup
small_100	7.3 KB	24.5 us	20.9 us	1.17x
medium_1k_256B	273 KB	265.1 us	220.2 us	1.20x
medium_1k_1KB	1.0 MB	1.34 ms	540.0 us	2.48x
large_5k_1KB	5.0 MB	10.64 ms	3.48 ms	3.06x
large_1k_4KB	3.9 MB	4.50 ms	847.9 us	5.31x
wide_5k_doubles	586 KB	2.54 ms	2.55 ms	1.00x
wide_1k_20cols	762 KB	1.49 ms	1.40 ms	1.06x
blob_1k_16KB	15.6 MB	16.11 ms	6.56 ms	2.46x

(note - we can see that wide_5k_doubles is bottlenecked on something else - CPU processing of this payload - unpacking it. This may be optimized in a different PR)

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

mykaul added 5 commits March 8, 2026 15:51
Replace getvalue() with getbuffer() memoryview in _read_frame_header
and frame body extraction to avoid full-buffer copies. Add _reset_buffer()
helper using getbuffer()[pos:] instead of read() to reduce allocations.
Wrap memoryview usage in try/finally to ensure release before mutation.
Increase in_buffer_size from 4096 to 65536 to reduce recv() call overhead.
Introduce a lightweight BytesReader class that operates directly on
bytes data without BytesIO overhead. Materializes memoryview to bytes
once in __init__ instead of checking on every read(). Includes
remaining_buffer() method for zero-copy handoff to Cython parsers.
Add offset parameter to BytesIOReader so it can start reading from the
middle of an existing buffer, avoiding the full-body copy at the
Python-to-Cython boundary. Update row_parser.pyx to use
f.remaining_buffer() for zero-copy handoff with hasattr fallback.
Track _initial_offset for error recovery.
13 BytesReader tests covering read operations, remaining_buffer(),
memoryview materialization, empty data, and EOFError handling.
9 BytesIOReader tests covering offset initialization, boundary
conditions, read behavior with offset, and error cases.
…pact

Standalone benchmark (no cluster required) that constructs synthetic
RESULT/ROWS wire-format bodies and measures ProtocolHandler.decode_message()
throughput across 8 scenarios (small to 16MB, narrow to 20-column wide).

Pins to a single CPU core via sched_setaffinity for consistent results.
Supports both Cython and pure-Python paths, with --cprofile option.
@mykaul mykaul marked this pull request as draft March 9, 2026 07:50
@mykaul
Copy link
Author

mykaul commented Mar 9, 2026

#630 would have helped with wide_5k_doubles of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant