(improvement) (Cython only) deserializers: use direct PyUnicode_DecodeUTF8/ASCII from C buffer pointer #743
mykaul wants to merge 1 commit into scylladb:master
Conversation
Pull request overview
Optimizes text deserialization performance in the Cython row parser by decoding UTF-8/ASCII directly from the underlying buffer, and adds a benchmark module intended to measure the improvement.
Changes:
- Replace `to_bytes(buf).decode(...)` with `PyUnicode_DecodeASCII/UTF8(buf.ptr, buf.size, NULL)` in the Cython deserializers.
- Add a new pytest-based benchmark suite for UTF-8/ASCII decode and pipeline parsing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cassandra/deserializers.pyx | Uses CPython unicode decode APIs to avoid intermediate bytes allocations during ASCII/UTF-8 deserialization. |
| benchmarks/test_utf8_decode_benchmark.py | Adds end-to-end and microbenchmarks plus correctness checks for the new decode approach. |
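The effect of the change can be illustrated in pure Python (this is an analogy, not the driver's actual Cython code): `bytes(view).decode(...)` materializes an intermediate bytes object, while `str(view, encoding)` decodes straight from the buffer, which is what `PyUnicode_DecodeUTF8` does at the C level.

```python
# Illustrative analogy of the old vs. new decode paths.
payload = "résumé".encode("utf-8")
view = memoryview(payload)

# Old pattern: materialize a bytes copy first, then decode it.
via_bytes = bytes(view).decode("utf-8")

# Direct pattern: decode straight from the buffer, no intermediate bytes.
direct = str(view, "utf-8")

assert via_bytes == direct == "résumé"
```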
(Cython only) deserializers: use direct PyUnicode_DecodeUTF8/ASCII from C buffer pointer
Replace the two-step to_bytes(buf).decode('utf8') pattern in DesUTF8Type and
DesAsciiType with direct CPython C API calls (PyUnicode_DecodeUTF8 and
PyUnicode_DecodeASCII). This eliminates an intermediate bytes object
allocation per text cell — the old code created a Python bytes object from
the C buffer pointer via to_bytes(buf), then immediately decoded it to str
and discarded the bytes.
Text (UTF8Type/VarcharType) is the most common CQL column type, so this
optimization applies to the majority of cells in typical workloads.
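For reference, the C API call the patch switches to can be exercised from plain Python via ctypes. This is a demonstration of its semantics only (a NULL `errors` argument selects the default "strict" error handling), not how the driver invokes it:

```python
import ctypes

# Call the CPython C API directly via ctypes (demo only).
decode_utf8 = ctypes.pythonapi.PyUnicode_DecodeUTF8
decode_utf8.restype = ctypes.py_object
decode_utf8.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t, ctypes.c_char_p]

raw = "héllo ✓".encode("utf-8")
# errors=NULL (None here) means strict error handling, as in the patch.
text = decode_utf8(raw, len(raw), None)
assert text == "héllo ✓"
```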
Benchmark results (Cython row parsing pipeline, median times):
| Scenario | Before (original) | After (direct decode) | Speedup |
|---------------------------------|-------------------:|----------------------:|--------:|
| UTF8 1row x 1col short (11B) | 565 ns | 454 ns | 1.24x |
| UTF8 1row x 10col short | 1,594 ns | 1,023 ns | 1.56x |
| UTF8 100rows x 5col medium | 61,396 ns | 28,766 ns | 2.13x |
| UTF8 1000rows x 5col medium | 547,145 ns | 290,361 ns | 1.88x |
| UTF8 100rows x 5col long(200B) | 57,940 ns | 35,680 ns | 1.62x |
| UTF8 100rows x 5col multibyte | 125,149 ns | 103,370 ns | 1.21x |
| ASCII 100rows x 5col medium | 41,608 ns | 35,817 ns | 1.16x |
| ASCII 1000rows x 5col medium | 416,350 ns | 374,341 ns | 1.11x |
| Mixed 100rows 3text+2int | 44,646 ns | 31,189 ns | 1.43x |
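A minimal pure-Python microbenchmark of the same idea (two-step decode vs. direct buffer decode) can be sketched with `timeit`; absolute numbers are machine-dependent and this does not reproduce the driver's Cython pipeline:

```python
import timeit

payload = ("x" * 60).encode("utf-8")  # roughly a medium-length text cell
view = memoryview(payload)

# Two-step: intermediate bytes allocation, then decode.
t_two_step = timeit.timeit(lambda: bytes(view).decode("utf-8"), number=100_000)
# Direct: decode straight from the buffer.
t_direct = timeit.timeit(lambda: str(view, "utf-8"), number=100_000)

print(f"two-step: {t_two_step:.4f}s  direct: {t_direct:.4f}s")
```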
All existing unit tests pass (62 type tests, 116 total across key suites).
Force-pushed cbb8652 to 97eb4e8.
Pull request overview
This PR optimizes Cython-based UTF-8 and ASCII text deserialization by decoding directly from the underlying C buffer (avoiding an intermediate bytes allocation), and adds correctness tests plus a benchmark to validate/measure the change.
Changes:
- Switch `DesUTF8Type`/`DesAsciiType` to use `PyUnicode_DecodeUTF8`/`PyUnicode_DecodeASCII` on `Buffer.ptr`/`Buffer.size`.
- Add unit tests covering common and edge-case text decode scenarios (empty, ASCII, multibyte, long strings, NULLs, multiple rows).
- Add a pytest-benchmark benchmark file to compare end-to-end Cython row parsing before/after the optimization.
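The edge cases listed above can be sketched as plain assertions against `str` decode semantics (a hedged illustration, not the driver's internal test API — the real tests live in tests/unit/cython/test_deserializers.py):

```python
# Edge cases described in the PR: empty, ASCII, multibyte, long strings.
cases = [
    b"",                           # empty value -> empty string
    b"ascii only",                 # pure ASCII
    "héllo wörld ✓".encode(),      # multibyte UTF-8
    ("long " * 100).encode(),      # long string (~500 bytes)
]
for raw in cases:
    # Two-step and direct decoding must agree for every case.
    assert bytes(raw).decode("utf-8") == str(memoryview(raw), "utf-8")

# ASCII decoding must reject non-ASCII input rather than mis-decode it.
try:
    "é".encode("utf-8").decode("ascii")
except UnicodeDecodeError:
    pass
else:
    raise AssertionError("expected UnicodeDecodeError")
```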
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cassandra/deserializers.pyx | Replace to_bytes(...).decode(...) with direct CPython Unicode decoding calls for UTF-8/ASCII deserializers. |
| tests/unit/cython/test_deserializers.py | New correctness-focused unit tests for the optimized Cython decoding path. |
| benchmarks/utf8_decode_benchmark.py | New benchmark suite to measure the performance impact of the UTF-8/ASCII decode optimization. |
    try:
        from cassandra.obj_parser import ListParser
        from cassandra.bytesio import BytesIOReader
        from cassandra.parsing import ParseDesc
        from cassandra.deserializers import make_deserializers
        from cassandra.cqltypes import UTF8Type, AsciiType
        from cassandra.policies import ColDesc

        HAS_CYTHON = True
    except ImportError:
        HAS_CYTHON = False
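An import guard like this typically pairs with a skip marker so the suite still passes on plain-Python installs. A minimal sketch of that pattern (the class and test names here are hypothetical, not from the PR):

```python
import unittest

try:
    # This module exists only when the driver's Cython extensions are built.
    from cassandra.obj_parser import ListParser  # noqa: F401
    HAS_CYTHON = True
except ImportError:
    HAS_CYTHON = False

# Skip the whole class when the compiled extensions are unavailable.
@unittest.skipUnless(HAS_CYTHON, "Cython extensions not available")
class TestCythonTextDecode(unittest.TestCase):
    def test_extension_importable(self):  # hypothetical placeholder test
        self.assertTrue(HAS_CYTHON)
```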
Pre-review checklist
- Update docs in ./docs/source/.
- Add Fixes: annotations to the PR description.