(improvement) (Cython only) deserializers: use direct PyUnicode_DecodeUTF8/ASCII from C buffer pointer #743
mykaul wants to merge 1 commit into scylladb:master
Conversation
Pull request overview
Optimizes text deserialization performance in the Cython row parser by decoding UTF-8/ASCII directly from the underlying buffer, and adds a benchmark module intended to measure the improvement.
Changes:
- Replace `to_bytes(buf).decode(...)` with `PyUnicode_DecodeASCII/UTF8(buf.ptr, buf.size, NULL)` in the Cython deserializers.
- Add a new pytest-based benchmark suite for UTF-8/ASCII decode and pipeline parsing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cassandra/deserializers.pyx | Uses CPython unicode decode APIs to avoid intermediate bytes allocations during ASCII/UTF-8 deserialization. |
| benchmarks/test_utf8_decode_benchmark.py | Adds end-to-end and microbenchmarks plus correctness checks for the new decode approach. |
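The effect of the change can be illustrated in pure Python (this is an analogy, not the driver's actual Cython code): `bytes(view).decode(...)` materializes an intermediate bytes object, while `str(view, encoding)` decodes straight from the buffer, which is what `PyUnicode_DecodeUTF8` does at the C level.

```python
# Illustrative analogy of the old vs. new decode paths.
payload = "résumé".encode("utf-8")
view = memoryview(payload)

# Old pattern: materialize a bytes copy first, then decode it.
via_bytes = bytes(view).decode("utf-8")

# Direct pattern: decode straight from the buffer, no intermediate bytes.
direct = str(view, "utf-8")

assert via_bytes == direct == "résumé"
```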
(Cython only) deserializers: use direct PyUnicode_DecodeUTF8/ASCII from C buffer pointer
Replace the two-step to_bytes(buf).decode('utf8') pattern in DesUTF8Type and
DesAsciiType with direct CPython C API calls (PyUnicode_DecodeUTF8 and
PyUnicode_DecodeASCII). This eliminates an intermediate bytes object
allocation per text cell — the old code created a Python bytes object from
the C buffer pointer via to_bytes(buf), then immediately decoded it to str
and discarded the bytes.
Text (UTF8Type/VarcharType) is the most common CQL column type, so this
optimization applies to the majority of cells in typical workloads.
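For reference, the C API call the patch switches to can be exercised from plain Python via ctypes. This is a demonstration of its semantics only (a NULL `errors` argument selects the default "strict" error handling), not how the driver invokes it:

```python
import ctypes

# Call the CPython C API directly via ctypes (demo only).
decode_utf8 = ctypes.pythonapi.PyUnicode_DecodeUTF8
decode_utf8.restype = ctypes.py_object
decode_utf8.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t, ctypes.c_char_p]

raw = "héllo ✓".encode("utf-8")
# errors=NULL (None here) means strict error handling, as in the patch.
text = decode_utf8(raw, len(raw), None)
assert text == "héllo ✓"
```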
Benchmark results (Cython row parsing pipeline, median times):
| Scenario | Before (original) | After (direct decode) | Speedup |
|---------------------------------|-------------------:|----------------------:|--------:|
| UTF8 1row x 1col short (11B) | 565 ns | 454 ns | 1.24x |
| UTF8 1row x 10col short | 1,594 ns | 1,023 ns | 1.56x |
| UTF8 100rows x 5col medium | 61,396 ns | 28,766 ns | 2.13x |
| UTF8 1000rows x 5col medium | 547,145 ns | 290,361 ns | 1.88x |
| UTF8 100rows x 5col long(200B) | 57,940 ns | 35,680 ns | 1.62x |
| UTF8 100rows x 5col multibyte | 125,149 ns | 103,370 ns | 1.21x |
| ASCII 100rows x 5col medium | 41,608 ns | 35,817 ns | 1.16x |
| ASCII 1000rows x 5col medium | 416,350 ns | 374,341 ns | 1.11x |
| Mixed 100rows 3text+2int | 44,646 ns | 31,189 ns | 1.43x |
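A minimal pure-Python microbenchmark of the same idea (two-step decode vs. direct buffer decode) can be sketched with `timeit`; absolute numbers are machine-dependent and this does not reproduce the driver's Cython pipeline:

```python
import timeit

payload = ("x" * 60).encode("utf-8")  # roughly a medium-length text cell
view = memoryview(payload)

# Two-step: intermediate bytes allocation, then decode.
t_two_step = timeit.timeit(lambda: bytes(view).decode("utf-8"), number=100_000)
# Direct: decode straight from the buffer.
t_direct = timeit.timeit(lambda: str(view, "utf-8"), number=100_000)

print(f"two-step: {t_two_step:.4f}s  direct: {t_direct:.4f}s")
```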
All existing unit tests pass (62 type tests, 116 total across key suites).
Force-pushed cbb8652 to 97eb4e8.
Pull request overview
This PR optimizes Cython-based UTF-8 and ASCII text deserialization by decoding directly from the underlying C buffer (avoiding an intermediate bytes allocation), and adds correctness tests plus a benchmark to validate/measure the change.
Changes:
- Switch `DesUTF8Type`/`DesAsciiType` to use `PyUnicode_DecodeUTF8`/`PyUnicode_DecodeASCII` on `Buffer.ptr`/`Buffer.size`.
- Add unit tests covering common and edge-case text decode scenarios (empty, ASCII, multibyte, long strings, NULLs, multiple rows).
- Add a pytest-benchmark benchmark file to compare end-to-end Cython row parsing before/after the optimization.
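The edge cases listed above can be sketched as plain assertions against `str` decode semantics (a hedged illustration, not the driver's internal test API — the real tests live in tests/unit/cython/test_deserializers.py):

```python
# Edge cases described in the PR: empty, ASCII, multibyte, long strings.
cases = [
    b"",                           # empty value -> empty string
    b"ascii only",                 # pure ASCII
    "héllo wörld ✓".encode(),      # multibyte UTF-8
    ("long " * 100).encode(),      # long string (~500 bytes)
]
for raw in cases:
    # Two-step and direct decoding must agree for every case.
    assert bytes(raw).decode("utf-8") == str(memoryview(raw), "utf-8")

# ASCII decoding must reject non-ASCII input rather than mis-decode it.
try:
    "é".encode("utf-8").decode("ascii")
except UnicodeDecodeError:
    pass
else:
    raise AssertionError("expected UnicodeDecodeError")
```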
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cassandra/deserializers.pyx | Replace to_bytes(...).decode(...) with direct CPython Unicode decoding calls for UTF-8/ASCII deserializers. |
| tests/unit/cython/test_deserializers.py | New correctness-focused unit tests for the optimized Cython decoding path. |
| benchmarks/utf8_decode_benchmark.py | New benchmark suite to measure the performance impact of the UTF-8/ASCII decode optimization. |
    try:
        from cassandra.obj_parser import ListParser
        from cassandra.bytesio import BytesIOReader
        from cassandra.parsing import ParseDesc
        from cassandra.deserializers import make_deserializers
        from cassandra.cqltypes import UTF8Type, AsciiType
        from cassandra.policies import ColDesc

        HAS_CYTHON = True
    except ImportError:
        HAS_CYTHON = False
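An import guard like this typically pairs with a skip marker so the suite still passes on plain-Python installs. A minimal sketch of that pattern (the class and test names here are hypothetical, not from the PR):

```python
import unittest

try:
    # This module exists only when the driver's Cython extensions are built.
    from cassandra.obj_parser import ListParser  # noqa: F401
    HAS_CYTHON = True
except ImportError:
    HAS_CYTHON = False

# Skip the whole class when the compiled extensions are unavailable.
@unittest.skipUnless(HAS_CYTHON, "Cython extensions not available")
class TestCythonTextDecode(unittest.TestCase):
    def test_extension_importable(self):  # hypothetical placeholder test
        self.assertTrue(HAS_CYTHON)
```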
Pre-review checklist
- Update docs in ./docs/source/.
- Add Fixes: annotations to the PR description.