(improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer by mykaul · Pull Request #732 · scylladb/python-driver

mykaul · 2026-03-07T10:01:50Z

Summary

Optimize foundational Cython byte-unpacking with ntohs()/ntohl() intrinsics and int.from_bytes() — ~4-5% end-to-end row throughput improvement
Add float-specific ntohl() byte-swap (4-8x per-element float speedup)
Add dedicated DesVectorType Cython deserializer with type-specialized C-level deserialization
Add buffer bounds validation and Windows portability for the Cython layer

Commits (6, ordered by dependency)

1. Optimize Cython byte unpacking with ntohs/ntohl and int.from_bytes

Replace generic byte-swap loop in unpack_num() with ntohs()/ntohl() intrinsics for 16/32-bit types (compiles to single bswap on x86)
Replace varint_unpack() hex-string-based conversion with int.from_bytes(term, 'big', signed=True) — 4-18x speedup
Simplify read_int() to direct pointer cast + ntohl()
Remove slice_buffer(), replace all call sites with from_ptr_and_size()
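
The varint change can be illustrated in pure Python. This is a minimal sketch, not the driver's code: `varint_unpack_hex` is a hypothetical reconstruction of a hex-string-based conversion, and `varint_unpack_fast` shows the `int.from_bytes()` call named above. Both function names are assumptions made for illustration.

```python
def varint_unpack_hex(term: bytes) -> int:
    """Hypothetical hex-string approach: format each byte, parse base 16, fix the sign."""
    value = int("".join("%02x" % b for b in term), 16)
    # Two's complement: a set top bit means the value is negative.
    if term[0] & 0x80:
        value -= 1 << (len(term) * 8)
    return value


def varint_unpack_fast(term: bytes) -> int:
    """Single built-in call; no intermediate string is allocated."""
    return int.from_bytes(term, "big", signed=True)


for sample in (b"\x7f", b"\xff", b"\x80\x00", b"\x01\x00"):
    assert varint_unpack_hex(sample) == varint_unpack_fast(sample)
```

The hex version builds a temporary string whose length grows with the input, which is consistent with the description's note that gains are larger for longer varints.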

2. Optimize float deserialization with ntohl() intrinsic

Add float-specific branch: reinterpret float bits as uint32_t, apply ntohl(), reinterpret back to float
Eliminates 4-iteration byte-swap loop for every float value
4-8x speedup for float unmarshaling on little-endian systems
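
The reinterpret, swap, reinterpret sequence can be sketched in Python with `struct` and `socket.ntohl()`. This is only an analogy for the C-level code in the PR; the function name `unpack_float_via_ntohl` is an assumption made for illustration.

```python
import socket
import struct


def unpack_float_via_ntohl(data: bytes) -> float:
    """Decode a 4-byte big-endian float by swapping its bits as a uint32."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes, got %d" % len(data))
    # Read the wire bytes as a native-order unsigned 32-bit integer.
    (raw_bits,) = struct.unpack("=I", data)
    # Convert network (big-endian) order to host order; a no-op on big-endian hosts.
    host_bits = socket.ntohl(raw_bits)
    # Reinterpret the swapped bits as a native-order float.
    (value,) = struct.unpack("=f", struct.pack("=I", host_bits))
    return value


wire = struct.pack(">f", 1.5)
assert unpack_float_via_ntohl(wire) == struct.unpack(">f", wire)[0]
```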

3. Refactor to use from_ptr_and_size() helper consistently

Replace manual buf.ptr = ...; buf.size = ... patterns with from_ptr_and_size() calls
Use memcpy + ntohl instead of direct pointer dereference for alignment safety

4. Optimize VectorType deserialization with Cython deserializer

New DesVectorType class with specialized deserialization methods:
- _deserialize_float(): C-level memcpy + ntohl + pointer-cast loop (no Python dispatch per element)
- _deserialize_double() / _deserialize_int64(): 8-byte manual byte-swap
- _deserialize_int32(): memcpy + ntohl + cast
- Numpy fast-path for vectors >= 32 elements
- Generic fallback for other fixed-size types
Automatically registered via find_deserializer() for the Cython row parser
4.4-4.7x faster than pure Python for small vectors (3-4 elements)
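
The size-based dispatch described above can be sketched in pure Python. This is a rough illustration under stated assumptions, not the `DesVectorType` implementation: the function name and `LARGE_VECTOR_THRESHOLD` constant are hypothetical, and the standard-library `array` module stands in for the numpy fast path so the sketch has no third-party dependency.

```python
import struct
import sys
from array import array

# Threshold taken from the PR description; the constant name is an assumption.
LARGE_VECTOR_THRESHOLD = 32


def deserialize_float_vector(data: bytes, dimension: int) -> list:
    """Decode `dimension` big-endian 4-byte floats from `data`."""
    expected_size = dimension * 4
    if len(data) != expected_size:
        raise ValueError(
            "expected %d bytes for vector, got %d" % (expected_size, len(data))
        )
    if dimension < LARGE_VECTOR_THRESHOLD:
        # Small vectors: one struct call decodes every element.
        return list(struct.unpack(">%df" % dimension, data))
    # Large vectors: bulk copy, then swap in place on little-endian hosts.
    # Assumes array('f') uses 4-byte items, which holds on common platforms.
    values = array("f")
    values.frombytes(data)
    if sys.byteorder == "little":
        values.byteswap()
    return values.tolist()


small = [1.0, 2.5, -3.0]
assert deserialize_float_vector(struct.pack(">3f", *small), 3) == small
```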

5. Add Windows support for ntohl/ntohs with platform-specific headers

Replace arpa/inet.h with platform-conditional #ifdef _WIN32 block including winsock2.h on Windows
Use memcpy + ntohl in read_int() for alignment safety

6. Add buffer bounds validation to Cython deserializers

Full CQL protocol compliance in subelem(): NULL, "not set", and negative length handling
Bounds checking in _unpack_len(), DesTupleType, DesCompositeType, DesVectorType._deserialize_generic()
Remove int16 fast path (ShortType goes through generic path)
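
The length handling and bounds checks can be sketched in Python as follows. This is a simplified model of the checks listed for `subelem()` and `_unpack_len()`, not the Cython code itself; `read_value` and the `UNSET` sentinel are hypothetical names introduced for illustration.

```python
import struct

# Hypothetical sentinel for the protocol's "not set" value.
UNSET = object()


def read_value(buf: bytes, offset: int):
    """Read one [value] at `offset`; return (value, next_offset)."""
    # The 4-byte length field must fit inside the buffer.
    if offset < 0 or offset + 4 > len(buf):
        raise ValueError("buffer overrun reading length at offset %d" % offset)
    (n,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    if n == -1:
        return None, offset   # NULL value
    if n == -2:
        return UNSET, offset  # "not set" value
    if n < -2:
        raise ValueError("invalid value length %d" % n)
    # The payload must also fit inside the buffer.
    if offset + n > len(buf):
        raise ValueError(
            "value of %d bytes overruns buffer of %d bytes" % (n, len(buf))
        )
    return buf[offset:offset + n], offset + n


buf = struct.pack(">i", 3) + b"abc" + struct.pack(">i", -1)
value, pos = read_value(buf, 0)
assert (value, pos) == (b"abc", 7)
assert read_value(buf, pos) == (None, 11)
```

Checking the length field before reading it, and the payload before slicing it, means a truncated or malformed buffer raises an error instead of reading past the end.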

Performance Impact on Vector<float, 768/1536>

For small vectors (< 32 elements): The Cython C-loop path avoids all Python method dispatch, delivering 4-5x speedup over pure Python.

For large vectors (768/1536): Both the Cython and pure Python paths delegate to numpy.frombuffer().tolist(), so the Cython gain per vector is marginal. The Cython path still eliminates per-row Python dispatch overhead, which matters when processing many rows.

….from_bytes

Performance improvements to serialization/deserialization hot paths:

1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types instead of a byte-by-byte swapping loop. These compile to single bswap instructions on x86, providing more predictable performance.
2. read_int(): Simplify to use ntohl() directly instead of going through unpack_num() with a temporary Buffer.
3. varint_unpack(): Replace hex string conversion with int.from_bytes(). This eliminates string allocations and provides a 4-18x speedup for the function itself (larger gains for longer varints).
4. Remove slice_buffer() and replace it with direct assignment.
5. _unpack_len() is now implemented similarly to read_int().

Also removes unused 'start' and 'end' variables from unpack_num().

End-to-end benchmark shows ~4-5% improvement in row throughput.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Use hardware byte-swap intrinsic for float unmarshaling instead of manual 4-iteration loop, providing 4-8x speedup on little-endian systems.

All tests passing (609 total) [see next commit for a fix for an existing Cython-related issue!]

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Refactor deserializers.pyx to use from_ptr_and_size() consistently instead of manual Buffer field assignment for better code clarity and maintainability.

Changes:
- cassandra/deserializers.pyx: Refactor 4 locations to use helper

Tests: All Cython tests compile and pass (5 tests)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

…izer

Added DesVectorType Cython deserializer with C-level optimizations for improved performance in row parsing for vectors.

The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)

Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)

The Cython deserializer is automatically used by the row parser when available via find_deserializer().

Includes unit tests and benchmark code. Follow-up commits will try to return NumPy arrays, and perhaps more.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Replace POSIX-specific arpa/inet.h with conditional compilation that uses winsock2.h on Windows and arpa/inet.h on POSIX systems. This ensures the driver can be compiled on Windows without modification.

Changes:
- cassandra/cython_marshal.pyx: Add platform detection for ntohs/ntohl
- cassandra/ioutils.pyx: Add platform detection for ntohl

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Add bounds checking to prevent buffer overruns and properly handle CQL protocol value semantics in deserializers.

Changes:
- subelem(): Add bounds validation with protocol-compliant value handling
  * Happy path: Check elemlen >= 0 and offset + elemlen <= buf.size
  * Support NULL values (elemlen == -1) per CQL protocol
  * Support "not set" values (elemlen == -2) per CQL protocol
  * Reject invalid values (elemlen < -2) with clear error message
- _unpack_len(): Add bounds check before reading int32 length field
  * Validates offset + 4 <= buf.size before pointer dereference
  * Prevents reading beyond buffer boundaries
- DesTupleType: Add defensive bounds checking for tuple deserialization
  * Check p + 4 <= buf.size before reading item length
  * Check p + itemlen <= buf.size before reading item data
  * Explicit NULL value handling (itemlen < 0)
  * Clear error messages for buffer overruns
- DesCompositeType: Add bounds validation for composite type elements
  * Check 2 + element_length + 1 <= buf.size (length + data + EOC byte)
  * Prevents buffer overrun when reading composite elements
- DesVectorType._deserialize_generic(): Add size validation
  * Verify buf.size == expected_size before processing
  * Provides clear error message with expected vs actual sizes

Protocol specification reference:

[value] = [int] n, followed by n bytes if n >= 0
- n == -1: NULL value
- n == -2: not set value
- n < -2: invalid (error)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul added 6 commits March 7, 2026 12:00

mykaul marked this pull request as draft March 7, 2026 10:23

mykaul mentioned this pull request Mar 14, 2026

Tracking: Vector search (VectorType) performance improvement PRs #746

Open

mykaul wants to merge 6 commits into scylladb:master from mykaul:cython-vector-deser
