(improvement) query: add Cython-aware serializer path in BoundStatement.bind()#749
(improvement) query: add Cython-aware serializer path in BoundStatement.bind()#749mykaul wants to merge 5 commits intoscylladb:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces an optional, Cython-accelerated serialization path for BoundStatement.bind() to reduce per-value overhead (especially for large VectorType columns) when Cython serializers are available and no column encryption policy is enabled.
Changes:
- Add a new Cython
cassandra.serializersextension (with.pyx+.pxd) providingSerializerimplementations and amake_serializers()factory. - Add lazy caching of per-column serializer objects on
PreparedStatementvia a_serializersproperty. - Split the bind loop into three branches: column encryption policy, Cython fast path, and pure-Python fallback (with reduced per-value overhead in the fallback).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
cassandra/serializers.pyx |
Adds Cython serializer implementations (scalar + VectorType) and lookup/factory functions. |
cassandra/serializers.pxd |
Exposes the Serializer cdef interface for Cython usage. |
cassandra/query.py |
Integrates the optional Cython serializer path into PreparedStatement/BoundStatement.bind() with lazy caching and branch selection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Adds an optimized parameter binding path that can leverage Cython Serializer objects (when available) to speed up BoundStatement.bind(), while preserving the existing column-encryption behavior and improving the plain-Python fallback.
Changes:
- Add
PreparedStatement._serializerslazy cache and a three-way bind loop inBoundStatement.bind()(CE policy / Cython fast path / Python fallback). - Introduce Cython
Serializerimplementations for Float/Double/Int32/Vector and amake_serializers()factory. - Extend unit tests to exercise the Cython-serializer bind branch via injected stub serializers.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
cassandra/query.py |
Adds Cython-serializer availability detection, caches serializers on PreparedStatement, and selects the new bind fast path when safe. |
cassandra/serializers.pyx |
Implements Cython serializers (including optimized VectorType) and factory/lookup helpers. |
cassandra/serializers.pxd |
Declares the Cython Serializer interface for cross-module typing. |
tests/unit/test_parameter_binding.py |
Adds tests for the new bind branch and error-wrapping behavior using stub serializers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Adds a Cython-aware fast path to BoundStatement.bind() so prepared statements can use cached per-column serializer objects (when available and no column encryption policy is active), reducing Python dispatch and per-value overhead during binding.
Changes:
- Add lazy
PreparedStatement._serializerscache and splitBoundStatement.bind()into CE-policy, Cython-serializer, and pure-Python paths. - Introduce Cython
Serializerimplementations (including optimizedVectorTypeserialization) and amake_serializers()factory. - Expand/adjust unit tests to cover the new Cython bind branch via injected stub serializers.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
cassandra/query.py |
Adds serializer caching and a new bind fast path (plus significant formatting-only edits in the same module). |
cassandra/serializers.pyx |
Introduces Cython serializer implementations and factory/lookup helpers. |
cassandra/serializers.pxd |
Declares the Cython Serializer interface for cross-module use. |
tests/unit/test_parameter_binding.py |
Adds tests to exercise the Cython bind path without requiring compiled Cython. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7de782f to
8c03c2f
Compare
There was a problem hiding this comment.
Pull request overview
Adds a Cython-aware fast path to BoundStatement.bind() by reworking the bind loop into distinct branches (column encryption vs Cython serializers vs pure-Python), and introduces the Cython serializer module used by the new path.
Changes:
- Add lazy cached
PreparedStatement._serializersand updateBoundStatement.bind()to use CythonSerializerobjects when available and CE policy is not active. - Factor bind-time serialization error wrapping into a shared helper and extend wrapping to include
OverflowError. - Add unit tests that exercise the new Cython bind branch via injected stub serializers, plus overflow wrapping coverage.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
cassandra/query.py |
Adds _serializers cache on PreparedStatement, a shared bind error wrapper, and splits BoundStatement.bind() into CE / Cython / Python paths. |
cassandra/serializers.pyx |
Introduces Cython Serializer implementations (scalar + vector) and serializer factory helpers. |
cassandra/serializers.pxd |
Exposes the Serializer cdef interface for Cython interop. |
tests/unit/test_parameter_binding.py |
Adds unit tests to validate the new Cython bind path behavior and error wrapping. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
94eee43 to
da1bc3f
Compare
…torType Add cassandra/serializers.pyx and cassandra/serializers.pxd implementing Cython-optimized serialization that mirrors the deserializers.pyx architecture. Implements type-specialized serializers for the three subtypes commonly used in vector columns: - SerFloatType: 4-byte big-endian IEEE 754 float - SerDoubleType: 8-byte big-endian double - SerInt32Type: 4-byte big-endian signed int32 SerVectorType pre-allocates a contiguous buffer and uses C-level byte swapping for float/double/int32 vectors, with a generic fallback for other subtypes. GenericSerializer delegates to the Python-level cqltype.serialize() classmethod. Range checks for float32 and int32 values prevent silent truncation from C-level casts, matching the behavior of struct.pack(). Factory functions find_serializer() and make_serializers() allow easy lookup and batch creation of serializers for column types. Benchmarks show ~30x speedup over the current io.BytesIO baseline and ~3x speedup over Python struct.pack for Vector<float, 1536> serialization. No setup.py changes needed - the existing cassandra/*.pyx glob already picks up new .pyx files.
…nt.bind() When Cython serializers (from cassandra.serializers) are available and no column encryption policy is active, BoundStatement.bind() now uses pre-built Serializer objects cached on the PreparedStatement instead of calling cqltype classmethods. This avoids per-value Python method dispatch overhead and enables the ~30x vector serialization speedup from the Cython serializers module. The bind loop is split into three paths: 1. Column encryption policy path (unchanged behavior) 2. Cython serializers path (new fast path) 3. Plain Python path (no CE, no Cython -- removes per-value ColDesc/CE check) Depends on PR scylladb#748 (Cython serializers module) and PR scylladb#630 (CE-policy bind split).
… repeated .append() growth
- Chain original exception in _raise_bind_serialize_error (raise from exc)
- Change _check_int32_range to raise struct.error instead of OverflowError,
matching the behaviour of struct.pack('>i', value)
- Clarify docstrings for _check_float_range/_check_int32_range
- Expand _raise_bind_serialize_error docstring with specific exception types
- Document __getitem__ requirement in vector serialize methods
- Move io and uvint_pack imports to module scope in serializers.pyx
- Add struct import to serializers.pyx for struct.error
- Fix test_plain_path_overflow_error_wrapped docstring (struct.error, not OverflowError)
- Update OverflowSerializer stub to raise struct.error
- Replace name-mangled __serializers with _cached_serializers
bb267e6 to
5d684d1
Compare
There was a problem hiding this comment.
Pull request overview
Adds an optimized binding path that leverages Cython-backed per-column Serializer objects (when available) to reduce Python overhead in BoundStatement.bind(), while keeping existing behavior for column encryption policy and providing a streamlined pure-Python fallback.
Changes:
- Add lazy, cached Cython serializer lookup on
PreparedStatementand routeBoundStatement.bind()through a 3-way path: CE-policy, Cython fast path, pure-Python fallback. - Introduce Cython serializer implementations (
Serializer, scalar serializers,SerVectorType, and factories). - Extend unit tests to exercise the Cython fast-path behavior (via injected stub serializers) and UNSET handling scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
cassandra/query.py |
Adds optional Cython serializer import, caches per-column serializers on PreparedStatement, and refactors BoundStatement.bind() into CE/Cython/Python paths with pre-allocation. |
cassandra/serializers.pyx |
Introduces Cython-optimized serializers (including VectorType) and factory functions for per-column serializer creation. |
cassandra/serializers.pxd |
Declares the Cython Serializer interface for cross-module typing/cimports. |
tests/unit/test_parameter_binding.py |
Adds tests that inject stub serializers to validate the Cython bind path and UNSET behavior across paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cdef inline bytes _serialize_float(self, object values): | ||
| """Serialize a sequence of floats into a contiguous big-endian buffer. | ||
|
|
||
| Note: uses index-based access (values[i]) rather than iteration, so | ||
| the input must support __getitem__ (e.g. list or tuple). | ||
| """ |
There was a problem hiding this comment.
SerVectorType’s optimized _serialize_float/_serialize_double/_serialize_int32 paths require getitem (values[i]) even though the pure-Python VectorType.serialize() only requires len + iteration. This means some inputs that currently serialize fine in Python (any iterable with len but without indexing) will fail only when the Cython fast path is selected. Consider using PySequence_Fast/sequence coercion (or falling back to the generic element-iteration path) so accepted input types are consistent across paths.
cassandra/query.py
Outdated
| # Fill remaining unbound columns with UNSET_VALUE (v4+ feature). | ||
| # The pre-allocated list already has slots for these, so index | ||
| # assignment works directly without trimming first. | ||
| for i in range(idx, col_meta_len): |
There was a problem hiding this comment.
In the v4+ UNSET fill loop, the loop variable i is unused (for i in range(idx, col_meta_len):). Consider using while idx < col_meta_len: (or for _ in range(...)) to avoid an unused variable and make the intent clearer.
| for i in range(idx, col_meta_len): | |
| while idx < col_meta_len: |
| message = ( | ||
| 'Received an argument of invalid type for column "%s". ' | ||
| "Expected: %s, Got: %s; (%s)" % (col_spec.name, col_spec.type, actual_type, exc) | ||
| ) |
There was a problem hiding this comment.
_raise_bind_serialize_error() wraps struct.error/OverflowError as well as TypeError, but the emitted message always says "invalid type". For out-of-range numeric values this is misleading (the type may be correct but the value is invalid). Consider adjusting the message (or branching on exception type) to mention an invalid value/range overflow while still including the column context.
| message = ( | |
| 'Received an argument of invalid type for column "%s". ' | |
| "Expected: %s, Got: %s; (%s)" % (col_spec.name, col_spec.type, actual_type, exc) | |
| ) | |
| if isinstance(exc, TypeError): | |
| # True type mismatch: the Python type of the bound value is not compatible | |
| message = ( | |
| 'Received an argument of invalid type for column "%s". ' | |
| "Expected: %s, Got: %s; (%s)" | |
| % (col_spec.name, col_spec.type, actual_type, exc) | |
| ) | |
| elif isinstance(exc, (struct.error, OverflowError)): | |
| # Value is of the right general type but cannot be represented | |
| # in the target CQL type (e.g. numeric value out of range). | |
| message = ( | |
| 'Received an out-of-range or otherwise invalid value for column "%s". ' | |
| "Expected CQL type: %s, Got value: %r (Python type: %s); (%s)" | |
| % (col_spec.name, col_spec.type, value, actual_type, exc) | |
| ) | |
| else: | |
| # Fallback for any unexpected exception types that might be passed in | |
| message = ( | |
| 'Error binding value for column "%s". Expected: %s, Got: %s; (%s)' | |
| % (col_spec.name, col_spec.type, actual_type, exc) | |
| ) |
- Replace malloc/free with PyBytes_FromStringAndSize(NULL, ...) pattern in vector fast-paths, eliminating extra buffer copy and malloc(0) edge case - Change _check_int32_range to raise OverflowError instead of struct.error, consistent with _check_float_range - Remove unused imports (struct, malloc, free) - Differentiate error messages in _raise_bind_serialize_error: 'invalid type' for TypeError vs 'value out of range' for OverflowError/struct.error - Replace unused loop variable with while loop in UNSET_VALUE fill - Expand __getitem__ docstrings explaining performance rationale - Fix copyright header in test_parameter_binding.py
Summary
cassandra.serializersfrom PR (improvement) serializers: add Cython-optimized serialization for VectorType #748) are available and no column encryption policy is active,BoundStatement.bind()uses pre-builtSerializerobjects cached on thePreparedStatementinstead of calling cqltype classmethodsColDescconstruction andce_policycheck that was previously done unconditionallyDependencies
cassandra/serializers.pyx) — providesmake_serializers()and theSerializerclasses used by the new fast pathPerformance
End-to-end
BoundStatement.bind()benchmarksMeasured on a single CPU core (pinned), Python 3.14, comparing the Cython serializers path vs the plain Python path:
Vector<float, 128>(1 column)Vector<float, 768>(1 column)Vector<float, 1536>(1 column)Vector<float, 1536>+ 3 scalars (realistic INSERT)Key observations:
io.BytesIOloop with a pre-allocated C buffer and inline byte-swapColDescnamedtuple construction andce_policychecksMemory savings
The old bind loop created a
ColDescnamedtuple (~72 bytes) for every bound value, even when no column encryption policy was set. Both new non-CE paths eliminate this entirely.Per
bind()call, for N columns:ColDescnamedtuplesce_policy and ...evaluationsce_policy.column_type(...)ternaryPer VectorType column (D-dimensional float/double/int32), additionally:
VectorType.serialize()SerVectorTypeio.BytesIO()instancebytesper elementbuf.getvalue()final copymalloc)subtype.serialize+buf.write)Concrete example: A prepared INSERT with 5 scalar columns + 1
Vector<float, 1536>column eliminates ~1545 transient Python objects and ~88 KB of transient allocations perbind()call, plus ~3076 Python method calls.Design
PreparedStatement._serializers(lazy cached property)On first access, calls
make_serializers([col.type for col in self.column_metadata])which returns a list of CythonSerializerobjects (one per bind column). ReturnsNoneif:The cache is safe because
column_metadatais immutable afterPreparedStatementcreation. Thread safety is guaranteed by the benign-race pattern (idempotent computation + atomic attribute assignment).Three-way bind loop
Path 2 uses
zip(serializers, values, col_meta)to iterate all three in lockstep without index overhead. Path 3 removes the per-valueColDescconstruction andce_policycheck from the original code.Testing
test_query.py,test_types.py)_HAVE_CYTHON_SERIALIZERS = False)