---
layout: default
title: LLM-Speed
description: CUDA kernel library for LLM inference with FlashAttention forward, Tensor Core GEMM, and PyTorch bindings. Optimized for Ampere and newer architectures.
lang: en
---
v{{ site.current_version }} — CUDA kernels for LLM inference

# LLM-Speed

A focused CUDA kernel library implementing FlashAttention forward, Tensor Core GEMM acceleration, and seamless PyTorch integration. Designed for efficient LLM inference on modern GPUs.


## Key Features

Optimized CUDA kernels for modern LLM inference, with memory-efficient algorithms and hardware acceleration.

- **FlashAttention:** O(N) memory complexity with an online softmax algorithm. Supports causal masking for autoregressive models.
- 🔢 **Tensor Core GEMM:** Hardware-accelerated matrix multiplication using the WMMA API. FP16 inputs with FP32 accumulation.
- 🐍 **PyTorch Integration:** Seamless integration with PyTorch via pybind11. Native CUDA tensor support.
- 🔄 **Double Buffering:** Compute/memory overlap with pipelined execution. Async copy on Ampere and newer architectures.
- 🏦 **Bank Conflict Free:** Carefully designed shared memory layouts with padding to eliminate bank conflicts.
- 📊 **Property Testing:** Comprehensive Hypothesis tests verifying correctness across edge cases (a minimal sketch follows this list).
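As an illustration of what such a property test can look like, here is a minimal Hypothesis-based sketch that compares `flash_attention` against a naive PyTorch reference. The shapes, tolerances, and `reference_attention` helper are assumptions for illustration, not the library's actual test suite.

```python
import torch
from hypothesis import given, settings, strategies as st

from cuda_llm_ops import flash_attention


def reference_attention(q, k, v):
    # Naive O(N^2) attention in FP32, used only as a correctness oracle.
    scale = q.shape[-1] ** -0.5
    scores = (q.float() @ k.float().transpose(-2, -1)) * scale
    return torch.softmax(scores, dim=-1) @ v.float()


@settings(deadline=None, max_examples=25)
@given(
    seq_len=st.integers(min_value=16, max_value=512),
    head_dim=st.sampled_from([32, 64, 128]),
)
def test_flash_attention_matches_reference(seq_len, head_dim):
    q = torch.randn(1, 4, seq_len, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    out = flash_attention(q, k, v, is_causal=False)
    ref = reference_attention(q, k, v)
    # FP16 kernel vs. FP32 reference: compare with a loose, illustrative tolerance.
    torch.testing.assert_close(out.float(), ref, rtol=2e-2, atol=2e-2)
```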

## Memory-Efficient Design

FlashAttention implements online softmax with O(N) memory complexity, instead of the O(N²) of standard attention.

| Sequence Length | Standard Attention | FlashAttention | Memory Savings |
| --- | --- | --- | --- |
| 1024 | 4 MB (full attention matrix) | 0.25 MB (streaming) | 16× |
| 4096 | 64 MB (full attention matrix) | 1 MB (streaming) | 64× |
| 8192 | 256 MB (full attention matrix) | 2 MB (streaming) | 128× |

Assumes 8 attention heads, FP32 accumulation, batch size 1. Exact savings depend on hardware and kernel implementation.
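To make the streaming idea concrete, here is a minimal single-head online-softmax sketch in PyTorch. It is a readability-oriented reference, not the CUDA kernel; the `block_size` and function name are illustrative assumptions.

```python
import torch

def online_softmax_attention(q, k, v, block_size=256):
    """Single-head attention over 2D tensors (seq_len, head_dim) that streams
    over K/V blocks, keeping only per-row running statistics instead of the
    full N x N score matrix."""
    seq_len, head_dim = k.shape
    scale = head_dim ** -0.5

    # Running max, running softmax denominator, and running weighted sum per query row.
    m = torch.full((q.shape[0],), float("-inf"), device=q.device)
    l = torch.zeros(q.shape[0], device=q.device)
    acc = torch.zeros(q.shape[0], head_dim, device=q.device)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        s = (q @ k_blk.T) * scale                      # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1).values)
        p = torch.exp(s - m_new[:, None])
        correction = torch.exp(m - m_new)              # rescale previously accumulated results
        l = l * correction + p.sum(dim=-1)
        acc = acc * correction[:, None] + p @ v_blk
        m = m_new

    return acc / l[:, None]
```

For small FP32 inputs this matches `torch.softmax(q @ k.T * scale, dim=-1) @ v` up to floating-point error, while only ever materializing one block of scores at a time.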

## Quick Example

Get started with just a few lines of code.

**flash_attention.py**

```python
import torch
from cuda_llm_ops import flash_attention

# Create inputs
batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# O(N) memory attention!
output = flash_attention(q, k, v, is_causal=True)
```

**tensor_core_gemm.py**

```python
import torch
from cuda_llm_ops import tensor_core_gemm

# Matrix multiplication inputs
a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)

# Hardware-accelerated GEMM: FP16 input → FP32 output
c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
```
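A quick way to sanity-check the FP32 accumulation is to compare against a matmul computed entirely in FP32. The tolerances below are illustrative assumptions; FP16 Tensor Core results will not match a full-precision reference bit for bit.

```python
# Illustrative check against a full-FP32 reference (not part of the library API).
ref = a.float() @ b.float()
torch.testing.assert_close(c, ref, rtol=1e-2, atol=1e-2)
```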

## GPU Architecture Support

Optimized for Ampere (A100, RTX 30) and newer, with forward compatibility with Hopper and future architectures.

| Architecture | Tensor Core Support | Status |
| --- | --- | --- |
| Ampere (A100, RTX 30/40) | WMMA with FP16, BF16, TF32 | ✅ Primary target |
| Hopper (H100) | WMMA with FP16, BF16, FP8 | ✅ Supported |
| Volta (V100) | WMMA with FP16 | ⚠️ Limited |
| Turing (T4, RTX 20) | WMMA with FP16, INT8 | ⚠️ Limited |
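If you are unsure which generation your GPU belongs to, PyTorch can report its compute capability; the 8.0 threshold for Ampere below follows standard CUDA compute-capability numbering.

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if major >= 8:
    print("Ampere or newer: primary target (full Tensor Core and async-copy paths)")
elif major == 7:
    print("Volta/Turing: limited support")
else:
    print("Pre-Volta GPUs lack Tensor Cores and are not supported")
```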

## Documentation

Comprehensive guides in English and Chinese:

- 🚀 Get up and running in 5 minutes with installation and basic usage examples.
- 📚 Complete API documentation with parameters, examples, and error handling.
- 🏗️ A technical deep dive into CUDA kernels, optimization strategies, and implementation details.
- Optimization tips, benchmarking tools, and best practices for maximum performance.

## Start using LLM-Speed

Three clear paths to get value from this project:

- 🚀 Install and run your first FlashAttention or Tensor Core GEMM example in 5 minutes.
- 🏗️ Explore kernel design, memory-layout optimization, and Tensor Core utilization patterns.
- 📊 Run performance benchmarks on your own GPU and inspect memory usage and speedups with the provided tools (a rough timing sketch follows).
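The repository's own benchmarking tools are not shown here, but as a rough sketch of how to time the kernels yourself, CUDA events give reliable GPU-side latencies; the shapes and iteration counts below are arbitrary assumptions.

```python
import torch
from cuda_llm_ops import flash_attention

q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm-up so compilation/caching does not distort the measurement.
for _ in range(10):
    flash_attention(q, k, v, is_causal=True)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    flash_attention(q, k, v, is_causal=True)
end.record()
torch.cuda.synchronize()

print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```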