(improvement) serializers: add Cython-optimized serialization for VectorType#748
mykaul wants to merge 3 commits into scylladb:master
…torType

Add cassandra/serializers.pyx and cassandra/serializers.pxd implementing Cython-optimized serialization that mirrors the deserializers.pyx architecture.

Implements type-specialized serializers for the three subtypes commonly used in vector columns:
- SerFloatType: 4-byte big-endian IEEE 754 float
- SerDoubleType: 8-byte big-endian double
- SerInt32Type: 4-byte big-endian signed int32

SerVectorType pre-allocates a contiguous buffer and uses C-level byte swapping for float/double/int32 vectors, with a generic fallback for other subtypes. GenericSerializer delegates to the Python-level cqltype.serialize() classmethod.

Factory functions find_serializer() and make_serializers() allow easy lookup and batch creation of serializers for column types.

Benchmarks show ~30x speedup over the current io.BytesIO baseline and ~3x speedup over Python struct.pack for Vector<float, 1536> serialization.

No setup.py changes needed - the existing cassandra/*.pyx glob already picks up new .pyx files.
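The contiguous-buffer idea described for SerVectorType can be illustrated in pure Python. This is only a sketch of the wire layout, not the Cython implementation: `serialize_float_vector` is a hypothetical helper name, and `struct.pack_into` stands in for the C-level writes.

```python
import struct


def serialize_float_vector(values, size):
    """Pack `size` floats as contiguous 4-byte big-endian IEEE 754 values.

    Hypothetical pure-Python stand-in for the float fast path.
    """
    if len(values) != size:
        raise ValueError("expected %d elements, got %d" % (size, len(values)))
    buf = bytearray(4 * size)  # one allocation for the whole vector
    for i in range(size):
        # Write element i at its fixed offset; no intermediate bytes objects
        # are concatenated.
        struct.pack_into(">f", buf, 4 * i, values[i])
    return bytes(buf)


encoded = serialize_float_vector([1.0, -2.5, 0.0], 3)
```

Because every element has a fixed width, the output size is known up front, which is what makes a single allocation possible for float, double, and int32 subtypes.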
…nt.bind()

When Cython serializers (from cassandra.serializers) are available and no column encryption policy is active, BoundStatement.bind() now uses pre-built Serializer objects cached on the PreparedStatement instead of calling cqltype classmethods. This avoids per-value Python method dispatch overhead and enables the ~30x vector serialization speedup from the Cython serializers module.

The bind loop is split into three paths:
1. Column encryption policy path (unchanged behavior)
2. Cython serializers path (new fast path)
3. Plain Python path (no CE, no Cython -- removes per-value ColDesc/CE check)

Depends on PR scylladb#748 (Cython serializers module) and PR scylladb#630 (CE-policy bind split).
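A rough sketch of the three-way split, written in plain Python with stand-in objects. Everything here is hypothetical: `ToyInt32Type`, `ToySerializer`, `bind_values`, and the `encrypt` callback are illustration names, and the `serialize(value, protocol_version)` signature on the serializer object is an assumption rather than the driver's confirmed interface.

```python
import struct


class ToyInt32Type:
    """Stand-in for a cqltype class with a serialize() classmethod."""

    @classmethod
    def serialize(cls, value, protocol_version):
        return struct.pack(">i", value)


class ToySerializer:
    """Stand-in for a pre-built serializer object cached per column."""

    def __init__(self, cqltype):
        self.cqltype = cqltype

    def serialize(self, value, protocol_version):
        return self.cqltype.serialize(value, protocol_version)


def bind_values(values, col_types, protocol_version,
                serializers=None, encrypt=None):
    if encrypt is not None:
        # Path 1: an encryption hook is active, so each value goes through it.
        return [encrypt(index, cqltype.serialize(value, protocol_version))
                for index, (value, cqltype) in enumerate(zip(values, col_types))]
    if serializers is not None:
        # Path 2: cached serializer objects, no per-value type dispatch.
        return [ser.serialize(value, protocol_version)
                for value, ser in zip(values, serializers)]
    # Path 3: plain classmethod calls, with no encryption check per value.
    return [cqltype.serialize(value, protocol_version)
            for value, cqltype in zip(values, col_types)]


col_types = [ToyInt32Type, ToyInt32Type]
cached = [ToySerializer(t) for t in col_types]  # built once, reused per bind
fast = bind_values([1, 2], col_types, 4, serializers=cached)
plain = bind_values([1, 2], col_types, 4)
```

The point of choosing the path once, outside the loop, is that the per-value work in paths 2 and 3 contains no branching on encryption state.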
Pull request overview
This PR introduces a new Cython extension module to accelerate CQL value serialization—especially VectorType—using the same general “typed Serializer object + factory lookup” approach as the existing Cython deserialization stack.
Changes:
- Add `cassandra/serializers.pyx` implementing Cython serializers for `FloatType`, `DoubleType`, `Int32Type`, and an optimized `VectorType` serializer with generic fallback.
- Add `find_serializer()` / `make_serializers()` factory helpers for serializer creation.
- Add `cassandra/serializers.pxd` to expose the `Serializer` interface to other Cython modules.
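The "typed Serializer object + factory lookup" pattern can be sketched in plain Python as follows. This is a toy re-implementation under assumptions: the registry dict, the `Toy*` classes, and the local `Int32Type` / `UTF8Type` stand-ins are hypothetical, and the real module may resolve specialized classes differently.

```python
import struct


class ToySerializer:
    """Base class: holds the column type and exposes serialize()."""

    def __init__(self, cqltype):
        self.cqltype = cqltype

    def serialize(self, value, protocol_version):
        raise NotImplementedError


class ToySerInt32Type(ToySerializer):
    """Type-specialized serializer: 4-byte big-endian signed int."""

    def serialize(self, value, protocol_version):
        return struct.pack(">i", value)


class ToyGenericSerializer(ToySerializer):
    """Fallback: delegate to the type's own serialize() classmethod."""

    def serialize(self, value, protocol_version):
        return self.cqltype.serialize(value, protocol_version)


# Hypothetical registry keyed by type name.
_SPECIALIZED = {"Int32Type": ToySerInt32Type}


def find_serializer(cqltype):
    cls = _SPECIALIZED.get(cqltype.__name__, ToyGenericSerializer)
    return cls(cqltype)


def make_serializers(cqltypes):
    return [find_serializer(t) for t in cqltypes]


class Int32Type:  # stand-in for the driver's type class
    @classmethod
    def serialize(cls, value, protocol_version):
        return struct.pack(">i", value)


class UTF8Type:  # stand-in with no specialized serializer
    @classmethod
    def serialize(cls, value, protocol_version):
        return value.encode("utf-8")


int_ser, text_ser = make_serializers([Int32Type, UTF8Type])
```

Types without a specialized entry still work through the generic fallback, so callers can build serializers for every column without special-casing.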
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| cassandra/serializers.pyx | New Cython-optimized serialization implementations and factory lookup. |
| cassandra/serializers.pxd | Cython declarations for the Serializer interface. |
…lizers

Address all 8 Copilot review comments on PR scylladb#748:
- Add _check_float_range() for float overflow detection matching struct.pack
- Add _check_int32_range() for int32 bounds checking before C cast
- Wire bounds checks into SerFloatType, SerInt32Type, and VectorType fast-paths
- Replace malloc/free with PyBytes_FromStringAndSize(NULL,n)+PyBytes_AS_STRING
- Add empty vector early return (b'') before allocation
- Remove unused uint32_t cimport and libc.stdlib import
- Add comprehensive test suite (67 tests) covering equivalence, overflow, special values, vectors, round-trips, and factory functions
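The two range checks named above can be approximated in plain Python. This is a sketch under assumptions: the function names drop the leading underscore to mark them as illustrations, and the float rule (reject finite values whose magnitude exceeds the largest finite float32) is one plausible reading of "finite values outside float32 range".

```python
import math
import struct

# Largest finite float32, decoded from its bit pattern 0x7F7FFFFF.
FLT_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_float_range(value):
    """Reject finite values too large for float32; inf and nan pass."""
    if math.isfinite(value) and abs(value) > FLT_MAX:
        raise OverflowError("Value %r out of range for float32" % (value,))


def check_int32_range(value):
    """Reject integers that do not fit in a signed 32-bit field."""
    if value > INT32_MAX or value < INT32_MIN:
        raise OverflowError(
            "Value %r out of range for int32 "
            "(must be between %d and %d)" % (value, INT32_MIN, INT32_MAX))
```

Checks like these matter before a C cast because the cast itself does not raise: an out-of-range value would otherwise be written to the wire silently altered.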
Pull request overview
Adds a new Cython serialization module to speed up VectorType (and a few common scalar subtypes) while keeping wire-format output identical to existing cqltypes.*.serialize() implementations.
Changes:
- Introduce `cassandra/serializers.pyx` + `.pxd` implementing `Serializer` classes, including a specialized `SerVectorType` with float/double/int32 fast-paths.
- Add serializer lookup/batch factories (`find_serializer`, `make_serializers`).
- Add unit tests validating byte-for-byte equivalence and basic round-trips for the new serializers.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| cassandra/serializers.pyx | New Cython serializers, including optimized VectorType serialization and factory lookup functions. |
| cassandra/serializers.pxd | Cython declarations for the Serializer base class. |
| tests/unit/test_serializers.py | New unit tests covering scalar/vector equivalence, round-trips, and factory behavior. |
tests/unit/test_serializers.py

    # Import serializers only if Cython is available
    if HAVE_CYTHON:
        from cassandra.serializers import (
            Serializer,
            SerFloatType,
            SerDoubleType,
            SerInt32Type,
            SerVectorType,
            GenericSerializer,
            find_serializer,
            make_serializers,
        )


    cythontest = unittest.skipUnless(
        HAVE_CYTHON or VERIFY_CYTHON, "Cython is not available"
    )
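The import guard and the skip condition need to agree, otherwise a test can run with the serializer names undefined. A minimal sketch of a guard where both come from the same flag is shown below; it uses a try/except import instead of the `HAVE_CYTHON` flag from the snippet, and `HAVE_SERIALIZERS`, `requires_serializers`, and `FactoryTests` are hypothetical names.

```python
import unittest

try:
    # Compiled extension; absent when the driver was built without Cython.
    from cassandra.serializers import find_serializer
    HAVE_SERIALIZERS = True
except ImportError:
    find_serializer = None
    HAVE_SERIALIZERS = False

# The skip condition uses the same flag that guarded the import, so a test
# body never runs with the imported names missing.
requires_serializers = unittest.skipUnless(
    HAVE_SERIALIZERS, "cassandra.serializers extension is not built"
)


@requires_serializers
class FactoryTests(unittest.TestCase):
    def test_factory_is_callable(self):
        self.assertTrue(callable(find_serializer))
```

With this shape the suite reports a skip rather than an error when the extension is unavailable.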
cassandra/serializers.pyx

    if value > 2147483647 or value < -2147483648:
        raise OverflowError(
            "Value %r out of range for int32 "
            "(must be between -2147483648 and 2147483647)" % (value,))

cassandra/serializers.pyx

    for i in range(self.vector_size):
        _check_float_range(<double>values[i])
        val = <float>values[i]
        src = <char *>&val

cassandra/serializers.pyx

    for i in range(self.vector_size):
        val = <double>values[i]

cassandra/serializers.pyx

    for i in range(self.vector_size):
        _check_int32_range(values[i])
        val = <int32_t>values[i]
        src = <char *>&val

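The loops above take the address of a native-order value and copy its bytes into the output in big-endian order. The same byte swap can be demonstrated in pure Python; `to_big_endian` is a hypothetical helper, and `struct` with the `=` prefix stands in for the in-memory native representation.

```python
import struct
import sys


def to_big_endian(native_bytes):
    """Reverse the bytes on little-endian hosts; leave them as-is otherwise."""
    if sys.byteorder == "little":
        return native_bytes[::-1]
    return native_bytes


# "=f" packs in the host's native byte order with standard size (4 bytes).
native = struct.pack("=f", 1.5)
swapped = to_big_endian(native)
```

After the swap, the bytes match what `struct.pack(">f", 1.5)` produces, which is the network-order layout the wire format expects.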
cassandra/serializers.pyx

    """Raise OverflowError for finite values outside float32 range.

    Matches the behavior of struct.pack('>f', value), which raises
    struct.error for values that cannot be represented as a 32-bit
    IEEE 754 float. inf, -inf, and nan pass through unchanged.
    """

tests/unit/test_serializers.py
Outdated

    import math
    import struct
    import unittest

    from cassandra.cython_deps import HAVE_CYTHON

    try:
        from tests import VERIFY_CYTHON
    except ImportError:
        VERIFY_CYTHON = False

    from cassandra.cqltypes import (
        FloatType,
        DoubleType,
        Int32Type,
        VectorType,
        UTF8Type,
        LongType,
        BooleanType,
        parse_casstype_args,
    )

- Fix _check_float_range() docstring: clarify it raises OverflowError, not struct.error
- Fix _check_int32_range() docstring: same clarification
- Document __getitem__ requirement in vector fast-paths (_serialize_float, _serialize_double, _serialize_int32) as intentional for performance
- Expand test import guard to cover VERIFY_CYTHON
- Remove unused imports: math, parse_casstype_args
Summary

Adds `cassandra/serializers.pyx` and `cassandra/serializers.pxd` implementing Cython-optimized serialization that mirrors the `deserializers.pyx` architecture.

What's included
- `SerFloatType` (4-byte IEEE 754), `SerDoubleType` (8-byte), `SerInt32Type` (4-byte signed): the three subtypes commonly used in vector columns
- `SerVectorType`: pre-allocates a contiguous `char *` buffer and uses C-level byte swapping for float/double/int32 vectors, with a generic fallback for other subtypes
- `GenericSerializer`: delegates to the Python-level `cqltype.serialize()` classmethod for all other types
- `find_serializer(cqltype)` and `make_serializers(cqltypes_list)` for easy lookup and batch creation

Architecture

Mirrors `deserializers.pyx` exactly:

| `deserializers.pyx` | `serializers.pyx` |
|---|---|
| `Deserializer` base class | `Serializer` base class |
| `DesFloatType`, `DesDoubleType`, `DesInt32Type` | `SerFloatType`, `SerDoubleType`, `SerInt32Type` |
| `DesVectorType` (type-specialized) | `SerVectorType` (type-specialized) |
| `GenericDeserializer` | `GenericSerializer` |
| `find_deserializer()` | `find_serializer()` |
| `make_deserializers()` | `make_serializers()` |

Performance

Benchmarked on Vector<float, 1536> (typical embedding dimension), comparing three implementations:
- `VectorType.serialize()` (io.BytesIO loop)
- `struct.pack` batch format string
- `SerVectorType`

No `setup.py` changes needed: the existing `cassandra/*.pyx` glob already picks up new `.pyx` files.

Related PRs
- … `BoundStatement.bind()` (depends on this PR + "Optimize column_encryption_policy checks in recv_results_rows" #630)

Pre-review checklist
- … `./docs/source/`.
- … `Fixes:` annotations to PR description.
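The two pure-Python baselines named in the Performance section can be approximated with a small `timeit` harness. This is a sketch, not the benchmark used for the PR: `serialize_bytesio` and `serialize_batch` are hypothetical re-creations of the "io.BytesIO loop" and "batch format string" approaches, and the Cython `SerVectorType` is not included because it requires the compiled extension.

```python
import io
import struct
import timeit

DIM = 1536
values = [i / DIM for i in range(DIM)]


def serialize_bytesio(vals):
    """Per-element pack, appended to an io.BytesIO buffer."""
    buf = io.BytesIO()
    for v in vals:
        buf.write(struct.pack(">f", v))
    return buf.getvalue()


# Precompiled format for the whole vector, e.g. ">1536f".
_BATCH = struct.Struct(">%df" % DIM)


def serialize_batch(vals):
    """Single pack call over all elements."""
    return _BATCH.pack(*vals)


def compare(number=200):
    """Return total seconds for `number` runs of each approach."""
    return {
        "bytesio_loop": timeit.timeit(lambda: serialize_bytesio(values), number=number),
        "struct_batch": timeit.timeit(lambda: serialize_batch(values), number=number),
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print("%-14s %.4f s" % (name, seconds))
```

Both functions produce identical bytes, so any timing difference reflects the serialization strategy rather than the output format. Absolute numbers depend on the machine and Python version.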