Hetero-Paged-Infer

A High-Performance LLM Inference Engine with PagedAttention & Continuous Batching

Overview

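Hetero-Paged-Infer is an inference engine for Large Language Models (LLMs) written in Rust. It implements PagedAttention and continuous batching, the techniques popularized by vLLM, behind a modular, testable, trait-based architecture aimed at production deployment. Real GPU execution through CUDA kernels is still planned (see the feature table below).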

Feature	Description	Status
PagedAttention KV Cache	Block-based memory management, <5% waste	✅
Continuous Batching	Dynamic prefill/decode scheduling	✅
Memory Pressure Awareness	Configurable OOM prevention	✅
Modular Architecture	Trait-based abstractions	✅
Comprehensive Testing	135 tests (unit, property, integration)	✅
CUDA Kernels	Real GPU execution	🚧 Planned

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        InferenceEngine (CPU)                          │
├──────────────────────────────────────────────────────────────────────┤
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────────────┐  │
│  │ Tokenizer  │  │ Scheduler  │  │      KV Cache Manager          │  │
│  │            │  │            │  │   BlockPool + PageTable        │  │
│  └─────┬──────┘  └─────┬──────┘  └───────────────┬────────────────┘  │
│        │               │                         │                    │
│        │        ┌──────▼──────┐                  │                    │
│        │        │Batch Builder│◄─────────────────┘                    │
│        │        └──────┬──────┘                                       │
├────────┼───────────────┼─────────────────────────────────────────────┤
│        │        ┌──────▼──────┐                                       │
│        │        │ GPU Executor│  (CUDA / Mock)                        │
│        │        └──────┬──────┘                                       │
│        │        ┌──────▼──────┐                                       │
│        └───────►│  KV Cache   │  (GPU Memory)                         │
│                 └─────────────┘                                       │
└──────────────────────────────────────────────────────────────────────┘
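To make the Scheduler and Batch Builder boxes in the diagram more concrete, here is a minimal sketch of continuous batching: at every step, waiting sequences are admitted into free batch slots, prefill and decode sequences share the same batch, and finished sequences are retired immediately. The names `Phase`, `Sequence`, `StepReport`, and `Scheduler` are hypothetical and chosen for illustration; they are not taken from the crate's API, and the "execution" is simulated by incrementing a counter.

```rust
use std::collections::VecDeque;

/// Phase of a sequence (hypothetical type, not the crate's own).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Prefill, // prompt not processed yet
    Decode,  // producing one token per step
}

#[derive(Debug)]
struct Sequence {
    id: u64,
    phase: Phase,
    generated: usize,
    max_tokens: usize,
}

/// What happened during one step.
#[derive(Debug, Default)]
struct StepReport {
    prefill: usize,
    decode: usize,
    finished: Vec<u64>,
}

struct Scheduler {
    waiting: VecDeque<Sequence>,
    running: Vec<Sequence>,
    max_batch_size: usize,
}

impl Scheduler {
    fn new(max_batch_size: usize) -> Self {
        Self { waiting: VecDeque::new(), running: Vec::new(), max_batch_size }
    }

    fn submit(&mut self, id: u64, max_tokens: usize) {
        self.waiting.push_back(Sequence { id, phase: Phase::Prefill, generated: 0, max_tokens });
    }

    fn step(&mut self) -> StepReport {
        // Admit waiting sequences while batch slots are free.
        while self.running.len() < self.max_batch_size {
            match self.waiting.pop_front() {
                Some(seq) => self.running.push(seq),
                None => break,
            }
        }
        // Simulated execution: every sequence in the batch emits one token.
        let mut report = StepReport::default();
        for seq in &mut self.running {
            match seq.phase {
                Phase::Prefill => {
                    report.prefill += 1;
                    seq.phase = Phase::Decode;
                }
                Phase::Decode => report.decode += 1,
            }
            seq.generated += 1;
        }
        // Retire finished sequences so their slots are free next step.
        let finished = &mut report.finished;
        self.running.retain(|s| {
            let done = s.generated >= s.max_tokens;
            if done {
                finished.push(s.id);
            }
            !done
        });
        report
    }
}

fn main() {
    let mut sched = Scheduler::new(2);
    sched.submit(1, 1);
    sched.submit(2, 3);
    sched.submit(3, 2);
    let mut step = 0;
    while !sched.running.is_empty() || !sched.waiting.is_empty() {
        step += 1;
        println!("step {step}: {:?}", sched.step());
    }
}
```

With a batch size of 2, the third request waits only until the first one finishes, then joins the batch while the second is still decoding. A real scheduler would also consult the KV Cache Manager before admitting a sequence, which this sketch leaves out.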

Quick Start

Prerequisites

Rust 1.70+ (2021 edition)
Linux (Ubuntu 20.04+ recommended) or macOS

Installation

# Clone the repository
git clone https://github.com/LessUp/hetero-paged-infer.git
cd hetero-paged-infer

# Build in release mode
cargo build --release

# Run the test suite (135 tests)
cargo test

CLI Usage

# Basic usage
./target/release/hetero-infer --input "Hello, world!" --max-tokens 50

# With custom parameters
./target/release/hetero-infer \
  --input "Explain quantum computing" \
  --max-tokens 100 \
  --temperature 0.8 \
  --top-p 0.95

Library Usage

use hetero_infer::{EngineConfig, GenerationParams, InferenceEngine};

// `?` needs a function that returns `Result`. `Box<dyn Error>` is used here on the
// assumption that the crate's error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an engine with the default configuration
    let mut engine = InferenceEngine::new(EngineConfig::default())?;

    // Submit a generation request
    let _request_id = engine.submit_request(
        "Hello, world!",
        GenerationParams {
            max_tokens: 100,
            temperature: 0.8,
            top_p: 0.95,
        },
    )?;

    // Run inference and print each result
    for result in engine.run() {
        println!("Generated: {}", result.output_text);
    }
    Ok(())
}

Configuration

Option	Default	Description
`--block-size`	16	Tokens per physical block
`--max-num-blocks`	1024	Total physical blocks
`--max-batch-size`	32	Max sequences per batch
`--memory-threshold`	0.9	Memory pressure threshold
`--temperature`	1.0	Sampling temperature
`--top-p`	0.9	Nucleus sampling threshold
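
As an illustration of what the `--temperature` and `--top-p` options control, the following is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering. The function names `softmax_with_temperature` and `top_p_filter` are hypothetical, and this is the textbook form of the technique, not necessarily how the engine implements sampling. It assumes `temperature > 0` and a non-empty logits slice.

```rust
/// Divide logits by the temperature, then apply a numerically stable softmax.
/// Higher temperatures flatten the distribution; lower ones sharpen it.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Keep the smallest set of most-probable tokens whose cumulative probability
/// reaches `top_p`, and renormalize the kept probabilities to sum to 1.
/// Returns `(token_id, probability)` pairs in descending probability order.
fn top_p_filter(probs: &[f32], top_p: f32) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));

    let mut cumulative = 0.0f32;
    let mut kept = Vec::new();
    for (id, p) in indexed {
        kept.push((id, p));
        cumulative += p;
        if cumulative >= top_p {
            break;
        }
    }

    let total: f32 = kept.iter().map(|(_, p)| p).sum();
    kept.into_iter().map(|(id, p)| (id, p / total)).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.0];
    let probs = softmax_with_temperature(&logits, 1.0);
    let candidates = top_p_filter(&probs, 0.9);
    println!("probabilities: {probs:?}");
    println!("candidates after top-p: {candidates:?}");
}
```

A sampler would then draw the next token from the returned candidates according to their renormalized probabilities.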

Config file (config.json):

{
  "block_size": 16,
  "max_num_blocks": 1024,
  "max_batch_size": 32,
  "memory_threshold": 0.9
}

Load it with: ./target/release/hetero-infer --config config.json
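
To show how these settings relate to each other, here is a small sketch that mirrors the four config keys in a plain struct and derives the total token capacity and a memory-pressure check from them. The `CacheConfig` type and its methods are hypothetical, not the crate's `EngineConfig`, and reading `memory_threshold` as "fraction of blocks in use" is an assumption made for illustration.

```rust
/// Hypothetical mirror of the config keys shown above.
#[derive(Debug, Clone, PartialEq)]
struct CacheConfig {
    block_size: usize,
    max_num_blocks: usize,
    max_batch_size: usize,
    memory_threshold: f64,
}

impl Default for CacheConfig {
    /// Defaults copied from the options table.
    fn default() -> Self {
        Self { block_size: 16, max_num_blocks: 1024, max_batch_size: 32, memory_threshold: 0.9 }
    }
}

impl CacheConfig {
    /// Reject values that cannot describe a usable cache.
    fn validate(&self) -> Result<(), String> {
        if self.block_size == 0 || self.max_num_blocks == 0 || self.max_batch_size == 0 {
            return Err("block_size, max_num_blocks and max_batch_size must be > 0".into());
        }
        if !(self.memory_threshold > 0.0 && self.memory_threshold <= 1.0) {
            return Err("memory_threshold must be in (0, 1]".into());
        }
        Ok(())
    }

    /// Total number of tokens the KV cache can hold across all sequences.
    fn token_capacity(&self) -> usize {
        self.block_size * self.max_num_blocks
    }

    /// Blocks needed to hold `num_tokens` tokens (rounded up).
    fn blocks_for(&self, num_tokens: usize) -> usize {
        (num_tokens + self.block_size - 1) / self.block_size
    }

    /// Assumed meaning of the threshold: pressure once this fraction of blocks is used.
    fn under_pressure(&self, used_blocks: usize) -> bool {
        used_blocks as f64 / self.max_num_blocks as f64 >= self.memory_threshold
    }
}

fn main() {
    let cfg = CacheConfig::default();
    cfg.validate().expect("default config should be valid");
    println!("token capacity: {}", cfg.token_capacity());
    println!("blocks for a 100-token sequence: {}", cfg.blocks_for(100));
    println!("under pressure at 922 used blocks: {}", cfg.under_pressure(922));
}
```

With the defaults, the cache holds 16 × 1024 = 16384 tokens in total, and a 100-token sequence occupies 7 blocks.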

Documentation

Resource	Link
GitHub Pages	https://lessup.github.io/hetero-paged-infer/
API Reference (docs.rs)	https://docs.rs/hetero-infer
Architecture Guide	docs/en/architecture/overview.md
Contributing Guide	CONTRIBUTING.md
Changelog	CHANGELOG.md

Local Documentation

# Build and open API documentation
cargo doc --open

# Build documentation site locally
pip install mkdocs-material mkdocs-static-i18n
mkdocs serve -f mkdocs.yml

Performance

Approach	Memory Waste	Throughput (vs. static allocation)	Description
Static Allocation	~40-60%	Baseline	Pre-allocate max context for each request
Dynamic Allocation	~20-30%	+20%	Resize per request but still fragmented
PagedAttention	<5%	+50%	Block-based sharing with copy-on-write
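
The gap in the "Memory Waste" column comes down to simple arithmetic: static allocation reserves the maximum context for every request, while block-based allocation wastes at most the unused tail of each sequence's last block. The sketch below computes both for a made-up workload; the sequence lengths and the 2048-token maximum context are assumptions chosen for illustration, not measurements from this project, and the function names are hypothetical. Both functions assume a non-empty list of sequence lengths.

```rust
/// Fraction of reserved KV-cache slots left unused when every request
/// reserves `max_context` slots up front.
fn static_waste(seq_lens: &[usize], max_context: usize) -> f64 {
    let used: usize = seq_lens.iter().sum();
    let reserved = seq_lens.len() * max_context;
    1.0 - used as f64 / reserved as f64
}

/// Fraction of reserved slots left unused when each request holds only as
/// many fixed-size blocks as it needs (lengths are rounded up to a whole block).
fn paged_waste(seq_lens: &[usize], block_size: usize) -> f64 {
    let used: usize = seq_lens.iter().sum();
    let reserved: usize = seq_lens
        .iter()
        .map(|&len| (len + block_size - 1) / block_size * block_size)
        .sum();
    1.0 - used as f64 / reserved as f64
}

fn main() {
    // Illustrative workload: four requests of different lengths.
    let seq_lens = [100, 350, 900, 1200];
    println!("static waste: {:.1}%", 100.0 * static_waste(&seq_lens, 2048));
    println!("paged waste:  {:.1}%", 100.0 * paged_waste(&seq_lens, 16));
}
```

Per sequence, the paged scheme wastes at most `block_size - 1` slots, so the relative waste shrinks as sequences get longer.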

Why PagedAttention?

Traditional LLM serving allocates contiguous memory blocks for each request's KV cache, leading to significant memory fragmentation and waste. PagedAttention solves this by:

Block-based allocation: Split KV cache into fixed-size blocks
On-demand paging: Allocate blocks only when needed
Copy-on-write: Share blocks across sequences for efficient beam search
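
The three mechanisms above can be sketched in a few dozen lines: a pool of reference-counted physical blocks, a per-sequence block list (the page table), on-demand allocation when a sequence crosses a block boundary, and copy-on-write when a forked sequence writes into a shared block. All types and method names here (`BlockPool`, `SeqBlocks`, `KvCache`, `fork`, and so on) are hypothetical and only track block bookkeeping; a real implementation would also copy the block's tensor data on the device during copy-on-write.

```rust
use std::collections::HashMap;

type BlockId = usize;

/// Fixed pool of physical blocks with reference counts.
struct BlockPool {
    free: Vec<BlockId>,
    ref_counts: Vec<u32>,
}

impl BlockPool {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect(), ref_counts: vec![0; num_blocks] }
    }
    fn allocate(&mut self) -> Option<BlockId> {
        let id = self.free.pop()?;
        self.ref_counts[id] = 1;
        Some(id)
    }
    fn share(&mut self, id: BlockId) {
        self.ref_counts[id] += 1;
    }
    fn release(&mut self, id: BlockId) {
        self.ref_counts[id] -= 1;
        if self.ref_counts[id] == 0 {
            self.free.push(id);
        }
    }
}

/// Logical-to-physical block mapping for one sequence.
#[derive(Clone, Default)]
struct SeqBlocks {
    blocks: Vec<BlockId>,
    len: usize, // tokens stored so far
}

struct KvCache {
    pool: BlockPool,
    block_size: usize,
    seqs: HashMap<u64, SeqBlocks>,
}

impl KvCache {
    fn new(num_blocks: usize, block_size: usize) -> Self {
        Self { pool: BlockPool::new(num_blocks), block_size, seqs: HashMap::new() }
    }

    /// Reserve room for one more token; returns `None` if the pool is exhausted.
    fn append_token(&mut self, seq: u64) -> Option<()> {
        let entry = self.seqs.entry(seq).or_default();
        if entry.len % self.block_size == 0 {
            // On-demand paging: the last block is full (or there is none yet).
            entry.blocks.push(self.pool.allocate()?);
        } else {
            let last = *entry.blocks.last()?;
            if self.pool.ref_counts[last] > 1 {
                // Copy-on-write: the last block is shared, so switch to a private one.
                let fresh = self.pool.allocate()?;
                self.pool.release(last);
                *entry.blocks.last_mut()? = fresh;
            }
        }
        entry.len += 1;
        Some(())
    }

    /// Create `child` as a copy of `parent` that shares all of its blocks.
    fn fork(&mut self, parent: u64, child: u64) -> Option<()> {
        let entry = self.seqs.get(&parent)?.clone();
        for &block in &entry.blocks {
            self.pool.share(block);
        }
        self.seqs.insert(child, entry);
        Some(())
    }

    /// Drop a sequence and return blocks nobody else references to the pool.
    fn free(&mut self, seq: u64) {
        if let Some(entry) = self.seqs.remove(&seq) {
            for block in entry.blocks {
                self.pool.release(block);
            }
        }
    }

    fn free_blocks(&self) -> usize {
        self.pool.free.len()
    }
}

fn main() {
    let mut cache = KvCache::new(8, 4);
    for _ in 0..6 {
        cache.append_token(1).expect("pool has room");
    }
    cache.fork(1, 2).expect("parent exists");
    cache.append_token(2).expect("pool has room"); // triggers copy-on-write
    println!("free blocks after fork + write: {}", cache.free_blocks());
}
```

Forking costs no extra blocks until one of the sequences writes into a shared, partially filled block, which is what makes beam search cheap in this scheme.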

Testing

# Run all tests
cargo test

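# Run with coverage (requires the cargo-llvm-cov subcommand: cargo install cargo-llvm-cov)
cargo llvm-cov --html

# Run the suite on a single thread, e.g. for deterministic ordering when debugging property-based tests
cargo test -- --test-threads=1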

Type	Count	Description
Unit Tests	78	Core functionality tests
Property Tests	15	Invariant verification with proptest
Integration Tests	13	End-to-end workflow tests
Doc Tests	29	Documentation examples
Total	135

Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.

# Run all checks before submitting
cargo test && cargo fmt --check && cargo clippy

Roadmap

License

MIT License - See LICENSE.

Acknowledgments

vLLM - PagedAttention concept and inspiration
Rust - Systems programming language
Criterion - Statistical benchmarking

Made with ❤️ by LessUp
