(improvement) (Cython only) deserializers: use direct PyUnicode_DecodeUTF8/ASCII fr…#743

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/direct-utf8-decode

Conversation

mykaul commented Mar 13, 2026

…om C buffer pointer

Replace the two-step to_bytes(buf).decode('utf8') pattern in DesUTF8Type and DesAsciiType with direct CPython C API calls (PyUnicode_DecodeUTF8 and PyUnicode_DecodeASCII). This eliminates an intermediate bytes object allocation per text cell — the old code created a Python bytes object from the C buffer pointer via to_bytes(buf), then immediately decoded it to str and discarded the bytes.
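The difference between the two patterns can be illustrated from plain Python. The sketch below is not the Cython code from this PR: it reaches `PyUnicode_DecodeUTF8` through `ctypes.pythonapi`, and the helper names `two_step_decode` and `direct_decode`, as well as the `ctypes` buffer standing in for the driver's C buffer, are assumptions made for illustration.

```python
import ctypes

# C signature:
#   PyObject *PyUnicode_DecodeUTF8(const char *str, Py_ssize_t size, const char *errors)
_decode_utf8 = ctypes.pythonapi.PyUnicode_DecodeUTF8
_decode_utf8.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t, ctypes.c_char_p]
_decode_utf8.restype = ctypes.py_object


def two_step_decode(ptr, size):
    """Old pattern: copy the buffer into a bytes object, then decode it."""
    intermediate = ctypes.string_at(ptr, size)  # extra bytes allocation
    return intermediate.decode("utf8")


def direct_decode(ptr, size):
    """New pattern: decode straight from pointer and length.

    Passing None for `errors` maps to NULL, which selects strict handling.
    """
    return _decode_utf8(ptr, size, None)


# Hypothetical stand-in for a cell's C buffer (pointer plus size).
payload = "héllo wörld".encode("utf8")
raw = ctypes.create_string_buffer(payload)
ptr, size = ctypes.addressof(raw), len(payload)

assert direct_decode(ptr, size) == two_step_decode(ptr, size)
```

Both helpers return the same `str`; the direct version simply never materializes the temporary `bytes` object.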

Text (UTF8Type/VarcharType) is the most common CQL column type, so this optimization applies to the majority of cells in typical workloads.

Benchmark results (Cython row parsing pipeline, median times):

| Scenario                        | Before (original) | After (direct decode) | Speedup |
|---------------------------------|-------------------:|----------------------:|--------:|
| UTF8 1row x 1col short (11B)   |             565 ns |                454 ns |   1.24x |
| UTF8 1row x 10col short        |           1,594 ns |              1,023 ns |   1.56x |
| UTF8 100rows x 5col medium     |          61,396 ns |             28,766 ns |   2.13x |
| UTF8 1000rows x 5col medium    |         547,145 ns |            290,361 ns |   1.88x |
| UTF8 100rows x 5col long(200B) |          57,940 ns |             35,680 ns |   1.62x |
| UTF8 100rows x 5col multibyte  |         125,149 ns |            103,370 ns |   1.21x |
| ASCII 100rows x 5col medium    |          41,608 ns |             35,817 ns |   1.16x |
| ASCII 1000rows x 5col medium   |         416,350 ns |            374,341 ns |   1.11x |
| Mixed 100rows 3text+2int       |          44,646 ns |             31,189 ns |   1.43x |
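
The Speedup column appears to be the ratio of the "before" median to the "after" median. A small sketch that recomputes it from the reported numbers (the `results` list and the `speedup` helper are illustrative names, not part of the PR):

```python
# (scenario, before_ns, after_ns, reported_speedup) copied from the table above.
results = [
    ("UTF8 1row x 1col short (11B)", 565, 454, "1.24"),
    ("UTF8 1row x 10col short", 1_594, 1_023, "1.56"),
    ("UTF8 100rows x 5col medium", 61_396, 28_766, "2.13"),
    ("UTF8 1000rows x 5col medium", 547_145, 290_361, "1.88"),
    ("UTF8 100rows x 5col long(200B)", 57_940, 35_680, "1.62"),
    ("UTF8 100rows x 5col multibyte", 125_149, 103_370, "1.21"),
    ("ASCII 100rows x 5col medium", 41_608, 35_817, "1.16"),
    ("ASCII 1000rows x 5col medium", 416_350, 374_341, "1.11"),
    ("Mixed 100rows 3text+2int", 44_646, 31_189, "1.43"),
]


def speedup(before_ns, after_ns):
    """Ratio of the old median time to the new median time."""
    return before_ns / after_ns


for name, before, after, reported in results:
    print(f"{name:32s} {speedup(before, after):.2f}x (reported {reported}x)")
```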

All existing unit tests pass (62 type tests, 116 total across key suites).

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

mykaul marked this pull request as draft March 13, 2026 11:48
mykaul changed the title from "(improvement) deserializers: use direct PyUnicode_DecodeUTF8/ASCII fr…" to "(improvement) (Cython only) deserializers: use direct PyUnicode_DecodeUTF8/ASCII fr…" Mar 13, 2026
mykaul requested a review from Copilot March 14, 2026 09:24
Copilot AI left a comment

Pull request overview

Optimizes text deserialization performance in the Cython row parser by decoding UTF-8/ASCII directly from the underlying buffer, and adds a benchmark module intended to measure the improvement.

Changes:

  • Replace to_bytes(buf).decode(...) with PyUnicode_DecodeASCII/UTF8(buf.ptr, buf.size, NULL) in Cython deserializers.
  • Add a new pytest-based benchmark suite for UTF-8/ASCII decode and pipeline parsing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| cassandra/deserializers.pyx | Uses CPython unicode decode APIs to avoid intermediate bytes allocations during ASCII/UTF-8 deserialization. |
| benchmarks/test_utf8_decode_benchmark.py | Adds end-to-end and microbenchmarks plus correctness checks for the new decode approach. |

Copilot AI left a comment

Pull request overview

This PR optimizes Cython-based UTF-8 and ASCII text deserialization by decoding directly from the underlying C buffer (avoiding an intermediate bytes allocation), and adds correctness tests plus a benchmark to validate/measure the change.

Changes:

  • Switch DesUTF8Type/DesAsciiType to use PyUnicode_DecodeUTF8 / PyUnicode_DecodeASCII on Buffer.ptr/Buffer.size.
  • Add unit tests covering common and edge-case text decode scenarios (empty, ASCII, multibyte, long strings, NULLs, multiple rows).
  • Add a pytest-benchmark benchmark file to compare end-to-end Cython row parsing before/after the optimization.
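
A minimal sketch of the kind of equivalence check such tests might perform, written in plain Python rather than against the driver. It assumes `ctypes.pythonapi` as a stand-in for the Cython call sites; the `CASES` table and helper names are hypothetical, and NULL cells and multi-row parsing are not covered because they depend on the driver's own row parser.

```python
import ctypes


def _bind(name):
    """Bind a CPython decode function taking (const char *, Py_ssize_t, const char *)."""
    fn = getattr(ctypes.pythonapi, name)
    fn.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t, ctypes.c_char_p]
    fn.restype = ctypes.py_object
    return fn


decode_utf8 = _bind("PyUnicode_DecodeUTF8")
decode_ascii = _bind("PyUnicode_DecodeASCII")


def direct(fn, payload):
    """Decode from a raw pointer and length, with strict error handling (errors=NULL)."""
    raw = ctypes.create_string_buffer(payload)
    return fn(ctypes.addressof(raw), len(payload), None)


# Hypothetical edge cases mirroring the list above.
CASES = {
    "empty": "",
    "ascii": "hello world",
    "multibyte": "héllo 世界 🚀",
    "long": "x" * 10_000,
}


def check_all():
    """Compare direct decoding with the two-step bytes.decode() result."""
    for label, text in CASES.items():
        payload = text.encode("utf8")
        assert direct(decode_utf8, payload) == payload.decode("utf8"), label
        if text.isascii():
            assert direct(decode_ascii, payload) == payload.decode("ascii"), label
    return len(CASES)


def rejects_non_ascii():
    """Strict ASCII decoding should raise on bytes outside the 7-bit range."""
    try:
        direct(decode_ascii, "é".encode("utf8"))
    except UnicodeDecodeError:
        return True
    return False
```

Calling `check_all()` walks every case, and `rejects_non_ascii()` confirms that the strict ASCII path still reports invalid input instead of silently accepting it.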

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
cassandra/deserializers.pyx Replace to_bytes(...).decode(...) with direct CPython Unicode decoding calls for UTF-8/ASCII deserializers.
tests/unit/cython/test_deserializers.py New correctness-focused unit tests for the optimized Cython decoding path.
benchmarks/utf8_decode_benchmark.py New benchmark suite to measure the performance impact of the UTF-8/ASCII decode optimization.

Comment on lines +25 to +35
try:
    from cassandra.obj_parser import ListParser
    from cassandra.bytesio import BytesIOReader
    from cassandra.parsing import ParseDesc
    from cassandra.deserializers import make_deserializers
    from cassandra.cqltypes import UTF8Type, AsciiType
    from cassandra.policies import ColDesc

    HAS_CYTHON = True
except ImportError:
    HAS_CYTHON = False
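
A flag like `HAS_CYTHON` is commonly paired with a skip decorator so that the Cython-only checks are skipped, rather than failing, when the extensions are not built. How the PR's file actually uses the flag is not shown here, so the following is only a sketch; the test class and `run_suite` helper are hypothetical.

```python
import unittest

try:
    # Only importable when the driver's Cython extensions have been built.
    from cassandra.deserializers import make_deserializers

    HAS_CYTHON = True
except ImportError:
    HAS_CYTHON = False


@unittest.skipUnless(HAS_CYTHON, "Cython extensions are not available")
class DirectDecodeSmokeTest(unittest.TestCase):  # hypothetical test class
    def test_factory_is_importable(self):
        # Only runs when the guarded import above succeeded.
        self.assertTrue(callable(make_deserializers))


def run_suite():
    """Run the smoke test; it is reported as skipped when Cython is missing."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DirectDecodeSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```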