---
layout: default
title: LLM-Speed
description: CUDA kernel library for LLM inference with FlashAttention forward, Tensor Core GEMM, and PyTorch bindings. Optimized for Ampere and newer architectures.
lang: en
---
v{{ site.current_version }} — CUDA kernels for LLM inference

# LLM-Speed

A focused CUDA kernel library implementing FlashAttention forward, Tensor Core GEMM acceleration, and seamless PyTorch integration. Designed for efficient LLM inference on modern GPUs.


## Key Features

Optimized CUDA kernels for modern LLM inference, with memory-efficient algorithms and hardware acceleration.

- **FlashAttention:** O(N) memory complexity with an online softmax algorithm. Supports causal masking for autoregressive models.
- 🔢 **Tensor Core GEMM:** Hardware-accelerated matrix multiplication using the WMMA API. FP16 inputs with FP32 accumulation.
- 🐍 **PyTorch Integration:** Seamless integration with PyTorch via pybind11. Native CUDA tensor support.
- 🔄 **Double Buffering:** Compute/memory overlap with pipelined execution. Async copy on Ampere and newer architectures.
- 🏦 **Bank Conflict Free:** Carefully designed shared memory layouts with padding to eliminate bank conflicts.
- 📊 **Property Testing:** Comprehensive Hypothesis tests verifying correctness across edge cases (a minimal sketch follows this list).
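As an illustration of what such a property test can look like, here is a minimal Hypothesis-based sketch that compares `flash_attention` against a naive PyTorch reference. The shapes, tolerances, and `reference_attention` helper are assumptions for illustration, not the library's actual test suite.

```python
import torch
from hypothesis import given, settings, strategies as st

from cuda_llm_ops import flash_attention


def reference_attention(q, k, v):
    # Naive O(N^2) attention in FP32, used only as a correctness oracle.
    scale = q.shape[-1] ** -0.5
    scores = (q.float() @ k.float().transpose(-2, -1)) * scale
    return torch.softmax(scores, dim=-1) @ v.float()


@settings(deadline=None, max_examples=25)
@given(
    seq_len=st.integers(min_value=16, max_value=512),
    head_dim=st.sampled_from([32, 64, 128]),
)
def test_flash_attention_matches_reference(seq_len, head_dim):
    q = torch.randn(1, 4, seq_len, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    out = flash_attention(q, k, v, is_causal=False)
    ref = reference_attention(q, k, v)
    # FP16 kernel vs. FP32 reference: compare with a loose, illustrative tolerance.
    torch.testing.assert_close(out.float(), ref, rtol=2e-2, atol=2e-2)
```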

## Memory-Efficient Design

FlashAttention implements online softmax with O(N) memory complexity, instead of the O(N²) of standard attention.

| Sequence Length | Standard Attention | FlashAttention | Memory Savings |
| --- | --- | --- | --- |
| 1024 | 4 MB (full attention matrix) | 0.25 MB (streaming) | 16× |
| 4096 | 64 MB (full attention matrix) | 1 MB (streaming) | 64× |
| 8192 | 256 MB (full attention matrix) | 2 MB (streaming) | 128× |

Assumes 8 attention heads, FP32 accumulation, batch size 1. Exact savings depend on hardware and kernel implementation.
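To make the streaming idea concrete, here is a minimal single-head online-softmax sketch in PyTorch. It is a readability-oriented reference, not the CUDA kernel; the `block_size` and function name are illustrative assumptions.

```python
import torch

def online_softmax_attention(q, k, v, block_size=256):
    """Single-head attention over 2D tensors (seq_len, head_dim) that streams
    over K/V blocks, keeping only per-row running statistics instead of the
    full N x N score matrix."""
    seq_len, head_dim = k.shape
    scale = head_dim ** -0.5

    # Running max, running softmax denominator, and running weighted sum per query row.
    m = torch.full((q.shape[0],), float("-inf"), device=q.device)
    l = torch.zeros(q.shape[0], device=q.device)
    acc = torch.zeros(q.shape[0], head_dim, device=q.device)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        s = (q @ k_blk.T) * scale                      # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1).values)
        p = torch.exp(s - m_new[:, None])
        correction = torch.exp(m - m_new)              # rescale previously accumulated results
        l = l * correction + p.sum(dim=-1)
        acc = acc * correction[:, None] + p @ v_blk
        m = m_new

    return acc / l[:, None]
```

For small FP32 inputs this matches `torch.softmax(q @ k.T * scale, dim=-1) @ v` up to floating-point error, while only ever materializing one block of scores at a time.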

## Quick Example

Get started with just a few lines of code.

**flash_attention.py**

```python
import torch
from cuda_llm_ops import flash_attention

# Create inputs
batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# O(N) memory attention!
output = flash_attention(q, k, v, is_causal=True)
```

**tensor_core_gemm.py**

```python
import torch
from cuda_llm_ops import tensor_core_gemm

# Matrix multiplication inputs
a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)

# Hardware-accelerated GEMM: FP16 input → FP32 output
c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
```
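A quick way to sanity-check the FP32 accumulation is to compare against a matmul computed entirely in FP32. The tolerances below are illustrative assumptions; FP16 Tensor Core results will not match a full-precision reference bit for bit.

```python
# Illustrative check against a full-FP32 reference (not part of the library API).
ref = a.float() @ b.float()
torch.testing.assert_close(c, ref, rtol=1e-2, atol=1e-2)
```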

## GPU Architecture Support

Optimized for Ampere (A100, RTX 30) and newer, with forward compatibility with Hopper and future architectures.

| Architecture | Tensor Core Support | Status |
| --- | --- | --- |
| Ampere (A100, RTX 30/40) | WMMA with FP16, BF16, TF32 | ✅ Primary target |
| Hopper (H100) | WMMA with FP16, BF16, FP8 | ✅ Supported |
| Volta (V100) | WMMA with FP16 | ⚠️ Limited |
| Turing (T4, RTX 20) | WMMA with FP16, INT8 | ⚠️ Limited |
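If you are unsure which generation your GPU belongs to, PyTorch can report its compute capability; the 8.0 threshold for Ampere below follows standard CUDA compute-capability numbering.

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if major >= 8:
    print("Ampere or newer: primary target (full Tensor Core and async-copy paths)")
elif major == 7:
    print("Volta/Turing: limited support")
else:
    print("Pre-Volta GPUs lack Tensor Cores and are not supported")
```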

## Documentation

Comprehensive guides in English and Chinese:

- 🚀 Get up and running in 5 minutes with installation and basic usage examples.
- 📚 Complete API documentation with parameters, examples, and error handling.
- 🏗️ A technical deep dive into CUDA kernels, optimization strategies, and implementation details.
- Optimization tips, benchmarking tools, and best practices for maximum performance.

## Start using LLM-Speed

Three clear paths to get value from this project:

- 🚀 Install and run your first FlashAttention or Tensor Core GEMM example in 5 minutes.
- 🏗️ Explore kernel design, memory-layout optimization, and Tensor Core utilization patterns.
- 📊 Run performance benchmarks on your own GPU and inspect memory usage and speedups with the provided tools (a rough timing sketch follows).
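The repository's own benchmarking tools are not shown here, but as a rough sketch of how to time the kernels yourself, CUDA events give reliable GPU-side latencies; the shapes and iteration counts below are arbitrary assumptions.

```python
import torch
from cuda_llm_ops import flash_attention

q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm-up so compilation/caching does not distort the measurement.
for _ in range(10):
    flash_attention(q, k, v, is_causal=True)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    flash_attention(q, k, v, is_causal=True)
end.record()
torch.cuda.synchronize()

print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```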