(improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer #732
Draft
mykaul wants to merge 6 commits into scylladb:master
….from_bytes
Performance improvements to serialization/deserialization hot paths:
1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types instead of a byte-by-byte swapping loop. These compile to single bswap instructions on x86, providing more predictable performance.
2. read_int(): Simplify to use ntohl() directly instead of going through unpack_num() with a temporary Buffer.
3. varint_unpack(): Replace hex string conversion with int.from_bytes(). This eliminates string allocations and provides a 4-18x speedup for the function itself (larger gains for longer varints).
4. slice_buffer(): Remove it and replace its uses with direct assignment.
5. _unpack_len(): Implement it the same way as read_int().
Also removes the unused 'start' and 'end' variables from unpack_num().
End-to-end benchmark shows ~4-5% improvement in row throughput.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
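To illustrate item 3, here is a minimal pure-Python sketch of the two decoding strategies. It is not the driver's Cython code: the function name varint_unpack_hex and the sample inputs are assumptions made for this example, and the hex-based variant only approximates what a hex-string conversion might look like.

```python
def varint_unpack_hex(term: bytes) -> int:
    """Hex-string style decode (illustrative): parse hex text, then fix the sign."""
    val = int(term.hex(), 16)            # allocates an intermediate string
    if term[0] & 0x80:                   # high bit set means a negative value
        val -= 1 << (len(term) * 8)      # undo two's complement
    return val


def varint_unpack(term: bytes) -> int:
    """Direct decode: big-endian two's complement, no intermediate string."""
    return int.from_bytes(term, "big", signed=True)


# Hypothetical sample encodings, chosen to cover sign and length edge cases.
samples = [b"\x00", b"\x7f", b"\x80", b"\xff", b"\x01\x00", b"\xff\x7f",
           b"\x80" + b"\x00" * 15]
for raw in samples:
    assert varint_unpack_hex(raw) == varint_unpack(raw)
```

Both functions agree on every sample; the second simply skips the string round trip, which is where the commit message attributes the speedup.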
Use hardware byte-swap intrinsic for float unmarshaling instead of a manual 4-iteration loop, providing a 4-8x speedup on little-endian systems.
All tests passing (609 total) [see next commit for a fix for an existing Cython-related issue!]
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
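The idea can be sketched in plain Python as follows. This only models the technique (load 32 bits, swap the whole word once, reinterpret the bits as a float); swap32 and read_float_be are hypothetical names for this example, and the real change uses ntohl() in Cython rather than anything shown here.

```python
import struct


def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer (models a bswap)."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def read_float_be(buf: bytes, offset: int = 0) -> float:
    """Decode a big-endian IEEE 754 float with a single whole-word swap."""
    # Load the 4 bytes the way a little-endian CPU would see them.
    (raw,) = struct.unpack_from("<I", buf, offset)
    # One swap converts network byte order to little-endian order.
    host = swap32(raw)
    # Reinterpret the same 32 bits as a float, without numeric conversion.
    return struct.unpack("<f", struct.pack("<I", host))[0]


# The result matches the standard big-endian float decode.
encoded = struct.pack(">f", 1.5)
assert read_float_be(encoded) == struct.unpack(">f", encoded)[0]
```

The point of the sketch is that the swap operates on the full 32-bit word at once instead of moving one byte per loop iteration.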
Refactor deserializers.pyx to use from_ptr_and_size() consistently instead of manual Buffer field assignment for better code clarity and maintainability.
Changes:
- cassandra/deserializers.pyx: Refactor 4 locations to use helper
Tests: All Cython tests compile and pass (5 tests)
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…izer
Added DesVectorType Cython deserializer with C-level optimizations for improved performance in row parsing for vectors.
The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)
Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)
The Cython deserializer is automatically used by the row parser when available via find_deserializer().
Includes unit tests and benchmark code. Follow-up commits will try to get NumPy arrays, and perhaps more.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
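A rough Python sketch of the size-based dispatch described above, for a vector of 32-bit floats. The function name deserialize_float_vector and the constant NUMPY_THRESHOLD are assumptions for this example (only the 32-element cutoff comes from the commit message), and the sketch falls back to struct.unpack when numpy is not installed.

```python
import struct

try:
    import numpy as np
except ImportError:  # numpy is optional in this sketch
    np = None

NUMPY_THRESHOLD = 32  # cutoff taken from the commit message; the name is hypothetical


def deserialize_float_vector(buf: bytes, n: int) -> list:
    """Decode n big-endian 32-bit floats from buf into a list of Python floats."""
    expected = 4 * n
    if len(buf) != expected:
        raise ValueError(f"expected {expected} bytes for vector, got {len(buf)}")
    if np is not None and n >= NUMPY_THRESHOLD:
        # Large vectors: one bulk conversion; '>f4' means big-endian float32.
        return np.frombuffer(buf, dtype=">f4", count=n).tolist()
    # Small vectors: a single struct call avoids per-element dispatch.
    return list(struct.unpack(f">{n}f", buf))


small = struct.pack(">3f", 1.0, 2.5, -4.0)
assert deserialize_float_vector(small, 3) == [1.0, 2.5, -4.0]
```

Both branches return the same values for the same input; only the conversion route differs with vector length.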
Replace POSIX-specific arpa/inet.h with conditional compilation that uses winsock2.h on Windows and arpa/inet.h on POSIX systems. This ensures the driver can be compiled on Windows without modification.
Changes:
- cassandra/cython_marshal.pyx: Add platform detection for ntohs/ntohl
- cassandra/ioutils.pyx: Add platform detection for ntohl
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add bounds checking to prevent buffer overruns and properly
handle CQL protocol value semantics in deserializers.
Changes:
- subelem(): Add bounds validation with protocol-compliant value handling
* Happy path: Check elemlen >= 0 and offset + elemlen <= buf.size
* Support NULL values (elemlen == -1) per CQL protocol
* Support "not set" values (elemlen == -2) per CQL protocol
* Reject invalid values (elemlen < -2) with clear error message
- _unpack_len(): Add bounds check before reading int32 length field
* Validates offset + 4 <= buf.size before pointer dereference
* Prevents reading beyond buffer boundaries
- DesTupleType: Add defensive bounds checking for tuple deserialization
* Check p + 4 <= buf.size before reading item length
* Check p + itemlen <= buf.size before reading item data
* Explicit NULL value handling (itemlen < 0)
* Clear error messages for buffer overruns
- DesCompositeType: Add bounds validation for composite type elements
* Check 2 + element_length + 1 <= buf.size (length + data + EOC byte)
* Prevents buffer overrun when reading composite elements
- DesVectorType._deserialize_generic(): Add size validation
* Verify buf.size == expected_size before processing
* Provides clear error message with expected vs actual sizes
Protocol specification reference:
[value] = [int] n, followed by n bytes if n >= 0
n == -1: NULL value
n == -2: not set value
n < -2: invalid (error)
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
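The [value] rules quoted above can be sketched in Python as a bounds-checked reader. This illustrates the semantics only and is not the Cython subelem() implementation: read_value and the NOT_SET sentinel are hypothetical names introduced for the example.

```python
import struct

NOT_SET = object()  # hypothetical sentinel for the "not set" marker (n == -2)


def read_value(buf: bytes, offset: int):
    """Read one [value] at offset; return (value, offset after the value)."""
    # Check the 4-byte length field fits before touching it.
    if offset < 0 or offset + 4 > len(buf):
        raise ValueError(f"cannot read length at offset {offset}: buffer has {len(buf)} bytes")
    (n,) = struct.unpack_from(">i", buf, offset)  # signed big-endian int32
    offset += 4
    if n == -1:
        return None, offset          # NULL value
    if n == -2:
        return NOT_SET, offset       # "not set" value
    if n < -2:
        raise ValueError(f"invalid value length {n}")
    # Check the payload fits before slicing.
    if offset + n > len(buf):
        raise ValueError(f"value of {n} bytes at offset {offset} overruns buffer of {len(buf)} bytes")
    return buf[offset:offset + n], offset + n


value, end = read_value(b"\x00\x00\x00\x02hi", 0)
assert (value, end) == (b"hi", 6)
```

Returning the next offset alongside the value lets a caller walk consecutive elements, for example the items of a tuple, while every read stays inside the buffer.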
Summary
- Optimize Cython byte unpacking with ntohs()/ntohl() intrinsics and int.from_bytes(): ~4-5% end-to-end row throughput improvement
- Optimize float deserialization with ntohl() byte-swap (4-8x per-element float speedup)
- Add DesVectorType Cython deserializer with type-specialized C-level deserialization
Commits (6, ordered by dependency)
1. Optimize Cython byte unpacking with ntohs/ntohl and int.from_bytes
- Replace the byte-swap loop in unpack_num() with ntohs()/ntohl() intrinsics for 16/32-bit types (compiles to a single bswap on x86)
- Replace varint_unpack() hex-string-based conversion with int.from_bytes(term, 'big', signed=True): 4-18x speedup
- Simplify read_int() to direct pointer cast + ntohl()
- Remove slice_buffer(), replace all call sites with from_ptr_and_size()
2. Optimize float deserialization with ntohl() intrinsic
- Read the float's bytes as uint32_t, apply ntohl(), reinterpret back to float
3. Refactor to use from_ptr_and_size() helper consistently
- Replace manual buf.ptr = ...; buf.size = ... patterns with from_ptr_and_size() calls
- Use memcpy + ntohl instead of direct pointer dereference for alignment safety
4. Optimize VectorType deserialization with Cython deserializer
- Add DesVectorType class with specialized deserialization methods:
  * _deserialize_float(): C-level memcpy + ntohl + pointer-cast loop (no Python dispatch per element)
  * _deserialize_double() / _deserialize_int64(): 8-byte manual byte-swap
  * _deserialize_int32(): memcpy + ntohl + cast
- Used automatically via find_deserializer() for the Cython row parser
5. Add Windows support for ntohl/ntohs with platform-specific headers
- Replace arpa/inet.h with a platform-conditional #ifdef _WIN32 block including winsock2.h on Windows
- Use memcpy + ntohl in read_int() for alignment safety
6. Add buffer bounds validation to Cython deserializers
- subelem(): NULL, "not set", and negative length handling
- Bounds checks in _unpack_len(), DesTupleType, DesCompositeType, DesVectorType._deserialize_generic()
Performance Impact on Vector<float, 768/1536>
For small vectors (< 32 elements): The Cython C-loop path avoids all Python method dispatch, delivering a 4-5x speedup over pure Python.
For large vectors (768/1536): Both the Cython and pure Python paths delegate to numpy.frombuffer().tolist(), so the Cython gain is marginal per vector, but it still eliminates per-row Python dispatch overhead when processing many rows.