(improvement) Optimize Cython deserialization primitives and add VectorType Cython deserializer#732

Draft
mykaul wants to merge 6 commits into scylladb:master from mykaul:cython-vector-deser

@mykaul mykaul commented Mar 7, 2026

Summary

  • Optimize foundational Cython byte-unpacking with ntohs()/ntohl() intrinsics and int.from_bytes() — ~4-5% end-to-end row throughput improvement
  • Add float-specific ntohl() byte-swap (4-8x per-element float speedup)
  • Add dedicated DesVectorType Cython deserializer with type-specialized C-level deserialization
  • Add buffer bounds validation and Windows portability for the Cython layer

Commits (6, ordered by dependency)

1. Optimize Cython byte unpacking with ntohs/ntohl and int.from_bytes

  • Replace generic byte-swap loop in unpack_num() with ntohs()/ntohl() intrinsics for 16/32-bit types (compiles to single bswap on x86)
  • Replace varint_unpack() hex-string-based conversion with int.from_bytes(term, 'big', signed=True) — 4-18x speedup
  • Simplify read_int() to direct pointer cast + ntohl()
  • Remove slice_buffer(), replace all call sites with from_ptr_and_size()
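
To illustrate the varint_unpack() change, here is a minimal pure-Python sketch comparing a hex-string-based conversion with int.from_bytes(). The function names varint_unpack_hex and varint_unpack_fast are hypothetical and used only for this comparison; the hex variant is an assumption about what a hex-string-based conversion looks like, not a copy of the driver's code.

```python
def varint_unpack_hex(term: bytes) -> int:
    """Hex-string approach (assumed shape): build a hex string, parse it,
    then apply the two's-complement correction by hand."""
    val = int(term.hex(), 16)
    if term[0] & 0x80:  # sign bit set, so the value is negative
        val -= 1 << (len(term) * 8)
    return val


def varint_unpack_fast(term: bytes) -> int:
    """Single built-in call: big-endian, signed two's complement."""
    return int.from_bytes(term, 'big', signed=True)


# Both variants should agree on any non-empty big-endian varint.
for sample in (b'\x00', b'\x7f', b'\x80', b'\xff', b'\x01\x00', b'\xff\x7f'):
    assert varint_unpack_hex(sample) == varint_unpack_fast(sample)
```

The int.from_bytes() variant avoids the intermediate string allocation, which is consistent with the larger gains reported for longer varints.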

2. Optimize float deserialization with ntohl() intrinsic

  • Add float-specific branch: reinterpret float bits as uint32_t, apply ntohl(), reinterpret back to float
  • Eliminates 4-iteration byte-swap loop for every float value
  • 4-8x speedup for float unmarshaling on little-endian systems
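
The reinterpret-swap-reinterpret sequence can be mimicked in pure Python for illustration. This is only a sketch of the idea, assuming the struct and socket standard-library modules stand in for the C-level memcpy and ntohl(); the helper name unpack_float_be is hypothetical and not part of the PR.

```python
import socket
import struct


def unpack_float_be(data: bytes) -> float:
    """Decode a 4-byte big-endian (network order) IEEE 754 float."""
    # Step 1: read the raw bits as a native-order uint32 (like memcpy into uint32_t).
    (raw_bits,) = struct.unpack('=I', data)
    # Step 2: convert network byte order to host byte order (no-op on big-endian hosts).
    host_bits = socket.ntohl(raw_bits)
    # Step 3: reinterpret the host-order bits as a float.
    return struct.unpack('=f', struct.pack('=I', host_bits))[0]


# Agrees with the direct big-endian decode on any host.
wire = struct.pack('>f', 1.5)
assert unpack_float_be(wire) == struct.unpack('>f', wire)[0]
```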

3. Refactor to use from_ptr_and_size() helper consistently

  • Replace manual buf.ptr = ...; buf.size = ... patterns with from_ptr_and_size() calls
  • Use memcpy + ntohl instead of direct pointer dereference for alignment safety

4. Optimize VectorType deserialization with Cython deserializer

  • New DesVectorType class with specialized deserialization methods:
    • _deserialize_float(): C-level memcpy + ntohl + pointer-cast loop (no Python dispatch per element)
    • _deserialize_double() / _deserialize_int64(): 8-byte manual byte-swap
    • _deserialize_int32(): memcpy + ntohl + cast
    • NumPy fast path for vectors >= 32 elements
    • Generic fallback for other fixed-size types
  • Automatically registered via find_deserializer() for the Cython row parser
  • 4.4-4.7x faster than pure Python for small vectors (3-4 elements)
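
The size-based dispatch can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the DesVectorType implementation: the function name deserialize_float_vector and the constant NUMPY_THRESHOLD are hypothetical, the small-vector path uses struct.unpack rather than a C loop, and NumPy is treated as optional.

```python
import struct

try:
    import numpy as np  # optional; the sketch falls back to struct without it
except ImportError:
    np = None

NUMPY_THRESHOLD = 32  # hypothetical name; mirrors the ">= 32 elements" cutoff


def deserialize_float_vector(data: bytes, size: int) -> list:
    """Decode `size` big-endian float32 values from `data` into a list."""
    expected = size * 4
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes for vector, got {len(data)}")
    if np is not None and size >= NUMPY_THRESHOLD:
        # Large vectors: one bulk conversion through NumPy.
        return np.frombuffer(data, dtype='>f4', count=size).tolist()
    # Small vectors: a single struct call, no per-element dispatch.
    return list(struct.unpack(f'>{size}f', data))


small = struct.pack('>3f', 1.0, 2.5, -3.0)
assert deserialize_float_vector(small, 3) == [1.0, 2.5, -3.0]
```

Both paths return plain Python floats, so callers should not observe a difference in result type across the threshold.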

5. Add Windows support for ntohl/ntohs with platform-specific headers

  • Replace arpa/inet.h with platform-conditional #ifdef _WIN32 block including winsock2.h on Windows
  • Use memcpy + ntohl in read_int() for alignment safety

6. Add buffer bounds validation to Cython deserializers

  • Full CQL protocol compliance in subelem(): NULL, "not set", and negative length handling
  • Bounds checking in _unpack_len(), DesTupleType, DesCompositeType, DesVectorType._deserialize_generic()
  • Remove int16 fast path (ShortType goes through generic path)
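
The [value] length handling and bounds checks can be sketched in pure Python. This is a minimal illustration, assuming the CQL encoding of a 4-byte signed big-endian length followed by that many bytes; read_value and the NOT_SET sentinel are hypothetical names, not the subelem() API.

```python
import struct

NOT_SET = object()  # hypothetical sentinel for the "not set" marker (length -2)


def read_value(buf: bytes, offset: int):
    """Read one CQL [value] at `offset`; return (value, next_offset)."""
    # Bounds check before reading the 4-byte length field.
    if offset < 0 or offset + 4 > len(buf):
        raise ValueError("buffer too short to read value length")
    (n,) = struct.unpack_from('>i', buf, offset)
    offset += 4
    if n == -1:
        return None, offset      # NULL value
    if n == -2:
        return NOT_SET, offset   # "not set" value
    if n < -2:
        raise ValueError(f"invalid value length {n}")
    # Bounds check before slicing the payload.
    if offset + n > len(buf):
        raise ValueError(f"value length {n} exceeds remaining buffer")
    return buf[offset:offset + n], offset + n


buf = struct.pack('>i', 3) + b'abc' + struct.pack('>i', -1)
value, pos = read_value(buf, 0)
assert value == b'abc'
assert read_value(buf, pos)[0] is None
```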

Performance Impact on Vector<float, 768/1536>

For small vectors (< 32 elements): The Cython C-loop path avoids all Python method dispatch, delivering 4-5x speedup over pure Python.

For large vectors (768/1536): Both the Cython and pure Python paths delegate to numpy.frombuffer().tolist(), so the per-vector gain from Cython is marginal. The Cython path still eliminates per-row Python dispatch overhead when processing many rows.

mykaul added 6 commits March 7, 2026 12:00
….from_bytes

Performance improvements to serialization/deserialization hot paths:

1. unpack_num(): Use ntohs()/ntohl() for 16-bit and 32-bit integer types
   instead of byte-by-byte swapping loop. These compile to single bswap
   instructions on x86, providing more predictable performance.

2. read_int(): Simplify to use ntohl() directly instead of going through
   unpack_num() with a temporary Buffer.

3. varint_unpack(): Replace hex string conversion with int.from_bytes().
   This eliminates string allocations and provides 4-18x speedup for the
   function itself (larger gains for longer varints).

4. Remove slice_buffer() and replace it with direct assignment.

5. _unpack_len() is now implemented similarly to read_int().

Also removes unused 'start' and 'end' variables from unpack_num().

End-to-end benchmark shows ~4-5% improvement in row throughput.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Use hardware byte-swap intrinsic for float unmarshaling instead of manual
4-iteration loop, providing 4-8x speedup on little-endian systems.

All tests passing (609 total) [see the next commit for a fix for an existing Cython-related issue]

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Refactor deserializers.pyx to use from_ptr_and_size() consistently
instead of manual Buffer field assignment for better code clarity and
maintainability.

Changes:
- cassandra/deserializers.pyx: Refactor 4 locations to use helper

Tests: All Cython tests compile and pass (5 tests)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

…izer

Added DesVectorType Cython deserializer with C-level optimizations for
improved performance in row parsing for vectors.
The deserializer uses:
- Direct C byte swapping (ntohl, ntohs) for numeric types
- Memory operations without Python object overhead
- Unified numpy path for large vectors (≥32 elements)
- struct.unpack fallback for small vectors (<32 elements)

Performance improvements:
- Small vectors (3-4 elements): 4.4-4.7x faster
- Medium vectors (128 elements): 1.0-1.5x faster
- Large vectors (384-1536 elements): 0.9-1.0x (marginal)

The Cython deserializer is automatically used by the row parser when
available via find_deserializer().

Includes unit tests and benchmark code.

Follow-up commits will try to return NumPy arrays, and perhaps more.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Replace POSIX-specific arpa/inet.h with conditional compilation that uses
winsock2.h on Windows and arpa/inet.h on POSIX systems.

This ensures the driver can be compiled on Windows without modification.

Changes:
- cassandra/cython_marshal.pyx: Add platform detection for ntohs/ntohl
- cassandra/ioutils.pyx: Add platform detection for ntohl

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Add bounds checking to prevent buffer overruns and properly
handle CQL protocol value semantics in deserializers.

Changes:
- subelem(): Add bounds validation with protocol-compliant value handling
  * Happy path: Check elemlen >= 0 and offset + elemlen <= buf.size
  * Support NULL values (elemlen == -1) per CQL protocol
  * Support "not set" values (elemlen == -2) per CQL protocol
  * Reject invalid values (elemlen < -2) with clear error message

- _unpack_len(): Add bounds check before reading int32 length field
  * Validates offset + 4 <= buf.size before pointer dereference
  * Prevents reading beyond buffer boundaries

- DesTupleType: Add defensive bounds checking for tuple deserialization
  * Check p + 4 <= buf.size before reading item length
  * Check p + itemlen <= buf.size before reading item data
  * Explicit NULL value handling (itemlen < 0)
  * Clear error messages for buffer overruns

- DesCompositeType: Add bounds validation for composite type elements
  * Check 2 + element_length + 1 <= buf.size (length + data + EOC byte)
  * Prevents buffer overrun when reading composite elements

- DesVectorType._deserialize_generic(): Add size validation
  * Verify buf.size == expected_size before processing
  * Provides clear error message with expected vs actual sizes

Protocol specification reference:
  [value] = [int] n, followed by n bytes if n >= 0
            n == -1: NULL value
            n == -2: not set value
            n < -2: invalid (error)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>