High-throughput byte-pattern replacement on GPU. Python's bytes.replace() semantics without leaving device memory.
Performs leftmost, non-overlapping pattern replacement entirely on GPU. Designed for large-scale text processing, tokenization preprocessing, and streaming transformations.
Use case: Processing multi-GB files with repeated pattern replacement without CPU/GPU transfer overhead.
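For reference, the semantics match CPython's `bytes.replace()` exactly: matches are found left to right and never overlap. A quick CPU illustration:

```python
# Leftmost, non-overlapping semantics, identical to bytes.replace():
# once a match is consumed, bytes inside it cannot start another match.
print(b"aaaa".replace(b"aa", b"b"))    # b"bb"  (two matches, not three)

# Scanning is greedy left to right: the earliest match wins.
print(b"abab".replace(b"aba", b"X"))   # b"Xb"
```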
- Streaming mode: Process unlimited file sizes with fixed GPU memory
- Session reuse: Amortize allocation costs across multiple operations
- Fused pipeline: Single-pass mark + suppress + scatter
- Thread-safe: Per-handle mutex allows safe concurrent access
Pre-built binaries: See bin/ directory
Build from source:
```shell
# Windows
nvcc -O3 -std=c++17 -m64 -Xcompiler "/MD" -shared -o cuda_replace.dll cuda_replace.cu

# Linux
nvcc -O3 -std=c++17 -shared -Xcompiler -fPIC -o libcuda_replace.so cuda_replace.cu
```

One-shot replacement:

```python
from cuda_replace_wrapper import CudaReplaceLib

lib = CudaReplaceLib('./cuda_replace.dll')  # or ./libcuda_replace.so on Linux
data = b"hello world hello universe"
result = lib.unified(data, b"hello", b"goodbye")
print(result)  # b"goodbye world goodbye universe"
```

Session reuse:

```python
from cuda_replace_wrapper import Session

with Session(lib, data) as sess:
    sess.apply(b"hello", b"hi")
    sess.apply(b"world", b"GPU")
    result = sess.result()
```

Streaming:

```python
from cuda_replace_wrapper import gpu_replace_streaming

# Process multi-GB files with limited GPU memory
with open('huge_file.txt', 'rb') as f:
    data = f.read()

pairs = [(b"pattern1", b"replacement1"), (b"pattern2", b"replacement2")]
result = gpu_replace_streaming(lib, data, pairs, chunk_bytes=256*1024*1024)
```

Batch replacement:

```python
with Session(lib, data) as sess:
    pairs = [
        (b"old1", b"new1"),
        (b"old2", b"new2"),
        (b"old3", b"new3"),
    ]
    new_len = sess.apply_batch(pairs)
    result = sess.result()
```

CudaReplaceLib: Loads the CUDA replace library.
Methods:
- `unified(src: bytes, pat: bytes, rep: bytes) -> bytes` - One-shot replacement
Session: Persistent GPU session for efficient multi-operation workflows.
Methods:
- `apply(pat: bytes, rep: bytes) -> int` - Apply replacement, returns new length
- `apply_batch(pairs: List[Tuple[bytes, bytes]]) -> int` - Apply multiple patterns sequentially
- `apply_seeded(pat, rep, prev_last: int) -> (int, int)` - Apply with carry-in state for streaming
- `reset(src: bytes)` - Replace buffer contents (O(1) if it fits capacity)
- `query_prefix(limit: int) -> (int, int)` - Count matches before position
- `result() -> bytes` - Retrieve result from GPU
- `length() -> int` - Get current buffer length
- `close()` - Free GPU resources
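The reason streaming needs carry-in state at all: a match can straddle a chunk boundary, so the tail of each chunk cannot be finalized until the next chunk arrives. A minimal CPU reference of that idea (illustrative only; the library's `apply_seeded` threads a numeric seed rather than raw bytes, which is a simplification here):

```python
def replace_chunked(chunks, pat, rep):
    """Leftmost non-overlapping replacement over a stream of chunks.
    Trailing bytes that could begin a match spanning into the next
    chunk are carried forward instead of being emitted immediately."""
    carry = b""
    out = []
    for chunk in chunks:
        buf = carry + chunk
        pieces = []
        i = 0
        last_start = len(buf) - len(pat)   # last index where a match can complete
        while i <= last_start:
            if buf[i:i + len(pat)] == pat:
                pieces.append(rep)
                i += len(pat)
            else:
                pieces.append(buf[i:i + 1])
                i += 1
        carry = buf[i:]          # fewer than len(pat) bytes: may match later
        out.append(b"".join(pieces))
    out.append(carry)            # flush: too short to contain pat
    return b"".join(out)

# A match split across the chunk boundary is still found:
print(replace_chunked([b"hel", b"lo world"], b"hello", b"hi"))  # b"hi world"
```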
gpu_replace_streaming: Memory-bounded streaming processor. Handles files larger than GPU memory.
- Mark Phase: SIMD pattern matching across input (AVX2-style on GPU)
- Suppress Phase: Leftmost greedy selection (non-overlapping semantics)
- Scatter Phase:
  - Shrink: Compact output when `len(replacement) <= len(pattern)`
  - Expand: Prefix-sum scatter when `len(replacement) > len(pattern)`
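The three phases can be sketched sequentially on the CPU (an illustrative reference only; on the GPU each phase is data-parallel):

```python
from itertools import accumulate

def replace_three_phase(src: bytes, pat: bytes, rep: bytes) -> bytes:
    """CPU reference of the mark/suppress/scatter pipeline."""
    n, m = len(src), len(pat)
    # Mark: flag every candidate match start (embarrassingly parallel on GPU).
    marks = [src[i:i + m] == pat for i in range(n)]
    # Suppress: leftmost greedy selection of non-overlapping matches.
    keep = [False] * n
    i = 0
    while i < n:
        if marks[i]:
            keep[i] = True
            i += m
        else:
            i += 1
    # Scatter: each position emits len(rep), 1, or 0 bytes; an exclusive
    # prefix sum over those sizes gives every piece's output offset.
    sizes = [0] * n
    i = 0
    while i < n:
        if keep[i]:
            sizes[i] = len(rep)
            i += m              # interior bytes of a match emit nothing
        else:
            sizes[i] = 1
            i += 1
    offsets = [0] + list(accumulate(sizes))[:-1]   # exclusive scan
    out = bytearray(sum(sizes))
    for i in range(n):
        if sizes[i] == 0:
            continue
        piece = rep if keep[i] else src[i:i + 1]
        out[offsets[i]:offsets[i] + len(piece)] = piece
    return bytes(out)
```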
Optimizations:
- Fused mark+suppress in shared memory for small patterns
- Tiled prefix computation (Blelloch scan)
- Range-sliced expansion to avoid GPU TDR timeouts
- Presence index for sparse patterns (optional)
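The Blelloch scan mentioned above is the work-efficient exclusive prefix sum the expand path uses to turn per-position output sizes into output offsets. A sequential sketch of the algorithm (the kernel runs it tiled and in parallel; this version pads the input to a power of two):

```python
def blelloch_exclusive_scan(a):
    """Work-efficient exclusive prefix sum (Blelloch 1990)."""
    n = 1
    while n < len(a):
        n *= 2
    t = list(a) + [0] * (n - len(a))
    # Up-sweep (reduce): build partial sums up a binary tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            t[i] += t[i - d]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    t[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t[i - d], t[i] = t[i], t[i] + t[i - d]
        d //= 2
    return t[:len(a)]

print(blelloch_exclusive_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```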
- Handles embedded NULs (binary-safe)
- Denormal floats (FTZ/DAZ enabled)
- Thread-safe sessions (internal mutex per handle)
- Overflow protection (streaming mode prevents 32-bit expansion overflow)
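Binary safety mirrors CPython, where embedded NUL bytes are ordinary data:

```python
# NULs neither terminate the scan nor interfere with matching.
blob = b"\x00abc\x00abc"
print(blob.replace(b"abc", b"x"))  # b"\x00x\x00x"
```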
- `cuda_replace.cu` - CUDA kernel implementation
- `cuda_replace_wrapper.py` - Python ctypes interface
- `bin/` - Pre-compiled libraries
- `test/` - Usage examples
```python
# Replace special tokens in bulk before tokenization
text = load_huge_corpus()  # 10GB text file
replacements = [
    (b"<br>", b"\n"),
    (b"&nbsp;", b" "),
    (b"&amp;", b"&"),
    # ... 100+ HTML entities
]
cleaned = gpu_replace_streaming(lib, text, replacements)
```

Benchmark:

```python
import time

data = b"a" * (1024 * 1024 * 1024)  # 1 GB of 'a's
pat = b"aa"
rep = b"X"

# CPU baseline
start = time.perf_counter()
cpu_result = data.replace(pat, rep)
cpu_time = time.perf_counter() - start

# GPU
start = time.perf_counter()
gpu_result = lib.unified(data, pat, rep)
gpu_time = time.perf_counter() - start

print(f"CPU: {cpu_time:.3f}s")
print(f"GPU: {gpu_time:.3f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
```

Requirements:

- CUDA-capable GPU (compute capability 3.5+)
- CUDA Toolkit 11.0+ (for building)
- Python 3.7+ (for wrapper)
License: MIT