---
layout: default
title: LLM-Speed
description: CUDA kernel library for LLM inference with FlashAttention forward, Tensor Core GEMM, and PyTorch bindings. Optimized for Ampere and newer architectures.
lang: en
---
A focused CUDA kernel library implementing FlashAttention forward, Tensor Core GEMM acceleration, and seamless PyTorch integration. Designed for efficient LLM inference on modern GPUs.
Optimized CUDA kernels for modern LLM inference with memory-efficient algorithms and hardware acceleration:

- **FlashAttention forward**: O(N) memory complexity via an online softmax algorithm, with causal masking for autoregressive models.
- **Tensor Core GEMM**: hardware-accelerated matrix multiplication using the WMMA API, with FP16 inputs and FP32 accumulation.
- **Pipelined execution**: compute/memory overlap, using async copy on Ampere and newer architectures.
- **Shared memory layout**: carefully designed layouts with padding to eliminate bank conflicts.
FlashAttention uses an online softmax to reduce attention memory from the O(N²) needed to materialize the full score matrix down to O(N) in sequence length.
| Sequence Length | Standard Attention | FlashAttention | Memory Savings |
|---|---|---|---|
| 1024 | 4 MB (full attention matrix) | 0.25 MB (streaming) | 16× |
| 4096 | 64 MB (full attention matrix) | 1 MB (streaming) | 64× |
| 8192 | 256 MB (full attention matrix) | 2 MB (streaming) | 128× |
Figures are per attention head (head dimension 64), FP32 accumulation, batch size 1. Exact savings depend on hardware and kernel implementation.
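The savings come from never materializing the full score matrix: each query row carries a running maximum, a running softmax denominator, and a running weighted sum of values while the kernel streams over key/value blocks. The sketch below is a readability-oriented PyTorch reference of that recurrence, not the library's fused CUDA kernel; the function name and `block_size` default are illustrative, and causal masking is omitted for brevity.

```python
import torch

def online_softmax_attention(q, k, v, block_size=128):
    """O(N)-memory reference: stream over key/value blocks, keeping a running
    max, softmax denominator, and partial output per query row."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q, dtype=torch.float32)
    row_max = torch.full(q.shape[:-1], float('-inf'), device=q.device)
    row_sum = torch.zeros(q.shape[:-1], device=q.device)

    for start in range(0, k.shape[-2], block_size):
        k_blk = k[..., start:start + block_size, :].float()
        v_blk = v[..., start:start + block_size, :].float()
        scores = q.float() @ k_blk.transpose(-2, -1) * scale   # (..., N, block)

        new_max = torch.maximum(row_max, scores.amax(dim=-1))
        correction = torch.exp(row_max - new_max)               # rescale old state
        p = torch.exp(scores - new_max.unsqueeze(-1))           # block probabilities

        row_sum = row_sum * correction + p.sum(dim=-1)
        out = out * correction.unsqueeze(-1) + p @ v_blk
        row_max = new_max

    return (out / row_sum.unsqueeze(-1)).to(q.dtype)
```

Only one (seq_len × block_size) score tile plus a few seq_len-length running buffers are live at any time, which is where the per-head numbers in the table above come from.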
```python
import torch
from cuda_llm_ops import flash_attention

batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

output = flash_attention(q, k, v, is_causal=True)
```
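As a sanity check, the result can be compared against PyTorch's built-in scaled dot-product attention. This assumes `flash_attention` returns the same `(batch, heads, seq_len, head_dim)` layout as its inputs; small differences on the order of FP16 rounding error are expected.

```python
import torch.nn.functional as F

# Built-in PyTorch reference over the same q, k, v as above.
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print((output.float() - ref.float()).abs().max())  # expect FP16-level differences
```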
```python
import torch
from cuda_llm_ops import tensor_core_gemm

a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)

c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
```
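Continuing from the snippet above, the output can be checked against a plain FP32 matmul. The tolerance below is a rough expectation for FP16 inputs with FP32 accumulation at this problem size, not a guarantee made by the library.

```python
# FP32 reference via torch.matmul; differences come from rounding the
# inputs to FP16, since both paths accumulate in FP32.
ref = a.float() @ b.float()
print((c - ref).abs().max())
torch.testing.assert_close(c, ref, rtol=1e-2, atol=1e-2)
```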
Optimized for Ampere (A100, RTX 30) and newer. Forward compatibility with Hopper and future architectures.
| Architecture | Tensor Core Support | Status |
|---|---|---|
| Ampere (A100, RTX 30/40) | WMMA with FP16, BF16, TF32 | ✅ Primary target |
| Hopper (H100) | WMMA with FP16, BF16, FP8 | ✅ Supported |
| Volta (V100) | WMMA with FP16 | |
| Turing (T4, RTX 20) | WMMA with FP16, INT8 | |
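A quick way to confirm a GPU falls in the supported range is to query its compute capability with standard PyTorch calls (Ampere reports sm_80/sm_86, Ada sm_89, Hopper sm_90):

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: sm_{major}{minor}")

# The library's primary target is Ampere (sm_80) and newer.
if (major, minor) < (8, 0):
    print("Warning: pre-Ampere GPU; Tensor Core paths may be limited or unavailable.")
```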
Get up and running in 5 minutes with installation and basic usage examples.
Complete API documentation with parameters, examples, and error handling.
Technical deep dive into CUDA kernels, optimization strategies, and implementation details.
Optimization tips, benchmarking tools, and best practices for maximum performance.
Three clear paths to get value from this project: