High-throughput byte-pattern replacement on GPU. Python's bytes.replace() semantics without leaving device memory.
Performs leftmost, non-overlapping pattern replacement entirely on GPU. Designed for large-scale text processing, tokenization preprocessing, and streaming transformations.
Use case: Processing multi-GB files with repeated pattern replacement without CPU/GPU transfer overhead.
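For reference, the semantics match CPython's `bytes.replace()` exactly: matches are found left to right and never overlap. A quick CPU illustration:

```python
# Leftmost, non-overlapping semantics, identical to bytes.replace():
# once a match is consumed, bytes inside it cannot start another match.
print(b"aaaa".replace(b"aa", b"b"))    # b"bb"  (two matches, not three)

# Scanning is greedy left to right: the earliest match wins.
print(b"abab".replace(b"aba", b"X"))   # b"Xb"
```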
- Streaming mode: Process unlimited file sizes with fixed GPU memory
- Session reuse: Amortize allocation costs across multiple operations
- Fused pipeline: Single-pass mark + suppress + scatter
- Thread-safe: Per-handle mutex allows safe concurrent access
Pre-built binaries: See bin/ directory
Build from source:
```shell
# Windows
nvcc -O3 -std=c++17 -m64 -Xcompiler "/MD" -shared -o cuda_replace.dll cuda_replace.cu

# Linux
nvcc -O3 -std=c++17 -shared -Xcompiler -fPIC -o libcuda_replace.so cuda_replace.cu
```

One-shot replacement:

```python
from cuda_replace_wrapper import CudaReplaceLib

lib = CudaReplaceLib('./cuda_replace.dll')  # or ./libcuda_replace.so on Linux
data = b"hello world hello universe"
result = lib.unified(data, b"hello", b"goodbye")
print(result)  # b"goodbye world goodbye universe"
```

Session reuse:

```python
from cuda_replace_wrapper import Session

with Session(lib, data) as sess:
    sess.apply(b"hello", b"hi")
    sess.apply(b"world", b"GPU")
    result = sess.result()
```

Streaming:

```python
from cuda_replace_wrapper import gpu_replace_streaming

# Process multi-GB files with limited GPU memory
with open('huge_file.txt', 'rb') as f:
    data = f.read()

pairs = [(b"pattern1", b"replacement1"), (b"pattern2", b"replacement2")]
result = gpu_replace_streaming(lib, data, pairs, chunk_bytes=256*1024*1024)
```

Batch replacement:

```python
with Session(lib, data) as sess:
    pairs = [
        (b"old1", b"new1"),
        (b"old2", b"new2"),
        (b"old3", b"new3"),
    ]
    new_len = sess.apply_batch(pairs)
    result = sess.result()
```

CudaReplaceLib: Loads the CUDA replace library.
Methods:
- `unified(src: bytes, pat: bytes, rep: bytes) -> bytes` - One-shot replacement
Session: Persistent GPU session for efficient multi-operation workflows.
Methods:
- `apply(pat: bytes, rep: bytes) -> int` - Apply replacement, returns new length
- `apply_batch(pairs: List[Tuple[bytes, bytes]]) -> int` - Apply multiple patterns sequentially
- `apply_seeded(pat, rep, prev_last: int) -> (int, int)` - Apply with carry-in state for streaming
- `reset(src: bytes)` - Replace buffer contents (O(1) if it fits capacity)
- `query_prefix(limit: int) -> (int, int)` - Count matches before position
- `result() -> bytes` - Retrieve result from GPU
- `length() -> int` - Get current buffer length
- `close()` - Free GPU resources
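The reason streaming needs carry-in state at all: a match can straddle a chunk boundary, so the tail of each chunk cannot be finalized until the next chunk arrives. A minimal CPU reference of that idea (illustrative only; the library's `apply_seeded` threads a numeric seed rather than raw bytes, which is a simplification here):

```python
def replace_chunked(chunks, pat, rep):
    """Leftmost non-overlapping replacement over a stream of chunks.
    Trailing bytes that could begin a match spanning into the next
    chunk are carried forward instead of being emitted immediately."""
    carry = b""
    out = []
    for chunk in chunks:
        buf = carry + chunk
        pieces = []
        i = 0
        last_start = len(buf) - len(pat)   # last index where a match can complete
        while i <= last_start:
            if buf[i:i + len(pat)] == pat:
                pieces.append(rep)
                i += len(pat)
            else:
                pieces.append(buf[i:i + 1])
                i += 1
        carry = buf[i:]          # fewer than len(pat) bytes: may match later
        out.append(b"".join(pieces))
    out.append(carry)            # flush: too short to contain pat
    return b"".join(out)

# A match split across the chunk boundary is still found:
print(replace_chunked([b"hel", b"lo world"], b"hello", b"hi"))  # b"hi world"
```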
gpu_replace_streaming: Memory-bounded streaming processor. Handles files larger than GPU memory.
- Mark Phase: SIMD pattern matching across input (AVX2-style on GPU)
- Suppress Phase: Leftmost greedy selection (non-overlapping semantics)
- Scatter Phase:
  - Shrink: Compact output when `len(replacement) <= len(pattern)`
  - Expand: Prefix-sum scatter when `len(replacement) > len(pattern)`
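The three phases can be sketched sequentially on the CPU (an illustrative reference only; on the GPU each phase is data-parallel):

```python
from itertools import accumulate

def replace_three_phase(src: bytes, pat: bytes, rep: bytes) -> bytes:
    """CPU reference of the mark/suppress/scatter pipeline."""
    n, m = len(src), len(pat)
    # Mark: flag every candidate match start (embarrassingly parallel on GPU).
    marks = [src[i:i + m] == pat for i in range(n)]
    # Suppress: leftmost greedy selection of non-overlapping matches.
    keep = [False] * n
    i = 0
    while i < n:
        if marks[i]:
            keep[i] = True
            i += m
        else:
            i += 1
    # Scatter: each position emits len(rep), 1, or 0 bytes; an exclusive
    # prefix sum over those sizes gives every piece's output offset.
    sizes = [0] * n
    i = 0
    while i < n:
        if keep[i]:
            sizes[i] = len(rep)
            i += m              # interior bytes of a match emit nothing
        else:
            sizes[i] = 1
            i += 1
    offsets = [0] + list(accumulate(sizes))[:-1]   # exclusive scan
    out = bytearray(sum(sizes))
    for i in range(n):
        if sizes[i] == 0:
            continue
        piece = rep if keep[i] else src[i:i + 1]
        out[offsets[i]:offsets[i] + len(piece)] = piece
    return bytes(out)
```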
Optimizations:
- Fused mark+suppress in shared memory for small patterns
- Tiled prefix computation (Blelloch scan)
- Range-sliced expansion to avoid GPU TDR timeouts
- Presence index for sparse patterns (optional)
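The Blelloch scan mentioned above is the work-efficient exclusive prefix sum the expand path uses to turn per-position output sizes into output offsets. A sequential sketch of the algorithm (the kernel runs it tiled and in parallel; this version pads the input to a power of two):

```python
def blelloch_exclusive_scan(a):
    """Work-efficient exclusive prefix sum (Blelloch 1990)."""
    n = 1
    while n < len(a):
        n *= 2
    t = list(a) + [0] * (n - len(a))
    # Up-sweep (reduce): build partial sums up a binary tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            t[i] += t[i - d]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    t[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t[i - d], t[i] = t[i], t[i] + t[i - d]
        d //= 2
    return t[:len(a)]

print(blelloch_exclusive_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```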
- Handles embedded NULs (binary-safe)
- Denormal floats (FTZ/DAZ enabled)
- Thread-safe sessions (internal mutex per handle)
- Overflow protection (streaming mode prevents 32-bit expansion overflow)
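Binary safety mirrors CPython, where embedded NUL bytes are ordinary data:

```python
# NULs neither terminate the scan nor interfere with matching.
blob = b"\x00abc\x00abc"
print(blob.replace(b"abc", b"x"))  # b"\x00x\x00x"
```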
- `cuda_replace.cu` - CUDA kernel implementation
- `cuda_replace_wrapper.py` - Python ctypes interface
- `bin/` - Pre-compiled libraries
- `test/` - Usage examples
```python
# Replace special tokens in bulk before tokenization
text = load_huge_corpus()  # 10GB text file
replacements = [
    (b"<br>", b"\n"),
    (b"&nbsp;", b" "),
    (b"&amp;", b"&"),
    # ... 100+ HTML entities
]
cleaned = gpu_replace_streaming(lib, text, replacements)
```

Benchmark:

```python
import time

data = b"a" * (1024 * 1024 * 1024)  # 1 GB of 'a's
pat = b"aa"
rep = b"X"

# CPU baseline
start = time.perf_counter()
cpu_result = data.replace(pat, rep)
cpu_time = time.perf_counter() - start

# GPU
start = time.perf_counter()
gpu_result = lib.unified(data, pat, rep)
gpu_time = time.perf_counter() - start

print(f"CPU: {cpu_time:.3f}s")
print(f"GPU: {gpu_time:.3f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
```

Requirements:

- CUDA-capable GPU (compute capability 3.5+)
- CUDA Toolkit 11.0+ (for building)
- Python 3.7+ (for wrapper)
License: MIT