NEW: TinyGPT2-SFT is here! Our 95M-parameter model is now instruction fine-tuned on Stanford Alpaca. It can follow instructions, answer questions, write poems, and more, all trained on a single RTX 3070 Ti. Try it out!
TinyGPT is an educational implementation of the GPT (Generative Pre-trained Transformer) architecture, featuring five model variants, from simple story generators to an instruction-following model built with RoPE, GQA, and RMSNorm. Built from the ground up with modern PyTorch, TinyGPT demonstrates how state-of-the-art language models can be both accessible and performant.
Quick Links:
- HuggingFace Repository
- Live Demo
- Training Notebooks
TinyGPT represents a carefully crafted balance between accessibility and performance in language model design. The project progresses through five model variants: from a standard GPT, through Mixture-of-Experts architectures, to an instruction fine-tuned TinyGPT2 model built with cutting-edge techniques.
- Educational: Provide a clear, well-documented implementation of GPT architecture
- Production-Ready: Deliver robust, efficient models suitable for real-world applications
- Efficient: Optimized for running on consumer GPUs and edge devices with minimal latency
- Accessible: Make it easy to run, train, fine-tune, and deploy on various platforms
TinyGPT comes in five variants:
**TinyGPT**
- 8 transformer blocks
- 8 attention heads
- 512 embedding dimensions
- Vocabulary size of 50,304 tokens
- Context window of 512 tokens
- Parameters: ~51M
- Training data: TinyStories dataset
**TinyGPT-MoE**
- 8 transformer blocks with MoE layers
- 8 attention heads
- 512 embedding dimensions
- 4 experts per MoE layer with top-2 routing
- Vocabulary size of 50,304 tokens
- Context window of 512 tokens
- Parameters: ~85M
- Training data: TinyStories dataset
- Enhanced storytelling capabilities through expert specialization
**Wikipedia-MoE**
- 8 transformer blocks with MoE layers
- 16 attention heads
- 512 embedding dimensions
- 8 experts per MoE layer with top-2 routing
- Vocabulary size of 50,304 tokens
- Context window of 512 tokens
- Parameters: ~135M
- Training data: Wikipedia (C4 dataset)
- Enhanced knowledge representation with more experts and attention heads
**TinyGPT2**
- 12 transformer blocks
- 12 attention heads with Grouped Query Attention (4 KV groups)
- 768 embedding dimensions
- 2048 FFN hidden size
- RoPE (Rotary Position Embeddings) for position encoding
- RMSNorm for layer normalization
- KV cache for efficient autoregressive generation
- Weight tying between token embeddings and output head
- Vocabulary size of 50,304 tokens
- Context window of 512 tokens
- Parameters: ~95M
- Training data: OpenWebText (over 6.5B tokens)
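RoPE replaces learned position embeddings with a position-dependent rotation of each query/key feature pair. A minimal NumPy sketch of the idea (illustrative only, not TinyGPT's actual implementation):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of one head vector by angles
    that depend on the token position `pos`."""
    d = x.shape[-1]                                  # head dim, must be even
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per pair
    theta = pos * freqs                              # rotation angle per pair
    x1, x2 = x[..., 0::2], x[..., 1::2]              # split pair components
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[..., 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.random.randn(64)                              # one 64-dim attention head
q_rot = rope(q, pos=7)
assert np.allclose(np.linalg.norm(q_rot), np.linalg.norm(q))  # rotation preserves norm
assert np.allclose(rope(q, pos=0), q)                # position 0 is the identity
```

Because each pair is only rotated, the dot product between a query at position m and a key at position n depends solely on the offset m - n, which is what gives RoPE its relative-position behavior.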
**TinyGPT2-SFT**
- Base model: TinyGPT2 (~95M parameters)
- Fine-tuning data: Stanford Alpaca (52K instruction-response pairs)
- Training: 3 epochs with response-only loss masking
- Prompt format: `### Instruction: ... ### Response: ...`
- Capabilities: Follows instructions, answers questions, writes creatively
- Hardware: Single NVIDIA RTX 3070 Ti (8GB VRAM), ~85 minutes total
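The Alpaca-style template can be assembled in a few lines of Python. This sketch follows the standard Alpaca header strings; the repo's own formatting code may differ slightly:

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble an Alpaca-style prompt; the model is trained to
    continue generating after the '### Response:' header."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:                                  # optional context field
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    return prompt

p = build_prompt("What is the capital of France?")
assert p.startswith("### Instruction:")
assert p.endswith("### Response:\n")
```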
| Model | Dataset | Tokens |
|---|---|---|
| TinyGPT | TinyStories | ~300M |
| TinyGPT-MoE | TinyStories | ~300M |
| Wikipedia-MoE | Wikipedia (C4) | ~500M |
| TinyGPT2 | OpenWebText | ~6.7B |
| TinyGPT2-SFT | Stanford Alpaca (52K) | ~72M |
- Scale: TinyGPT2 is pretrained on over 6.5B tokens from OpenWebText, significantly enhancing its general language understanding.
- Data Processing: Efficient data loading with HuggingFace `datasets` and tiktoken tokenization for fast throughput.
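The 50,304 vocabulary size shared by all variants is presumably tiktoken's 50,257-token GPT-2 vocabulary padded up to the next multiple of 64, which keeps the embedding and output matrices friendly to GPU tensor cores (the rationale is an assumption, but the arithmetic checks out):

```python
GPT2_VOCAB = 50_257   # size of tiktoken's gpt2 encoding
MULTIPLE = 64         # pad to a hardware-friendly multiple

# round up to the next multiple of 64
padded = ((GPT2_VOCAB + MULTIPLE - 1) // MULTIPLE) * MULTIPLE
assert padded == 50_304   # matches the vocab size quoted above
```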
To install TinyGPT, follow these steps:
```bash
# Clone the repository
git clone https://github.com/NotShrirang/tinygpt.git

# Navigate to the project directory
cd tinygpt

# Install the required packages
pip install -r requirements.txt

# Download the model weights
mkdir -p tinygpt/weights
```

For optimal training performance with liger-kernel (used by TinyGPT2 and the MoE models), you need:
- Linux operating system (POSIX-compliant)
- NVIDIA GPU with CUDA support
- liger-kernel
```bash
# Install liger-kernel for training optimizations (Linux + CUDA only)
pip install liger-kernel
```

Note: On Windows or CPU-only environments, all models automatically fall back to pure PyTorch implementations without liger-kernel optimizations. The models still work, but training may be slower.
TinyGPT fully supports Docker for easy deployment and development:
```bash
# Production deployment
docker-compose up --build

# Development with hot reload
docker-compose --profile dev up tinygpt-dev --build
```

The Docker setup includes:
- Multi-model support: All five model variants
- Hot reload: Automatic code updates during development
- Cross-platform: Works seamlessly on Windows, macOS, and Linux
- Persistent storage: Model weights are cached between container restarts
For detailed Docker usage, see DOCKER.md.
Choose from five model variants:
- TinyGPT: Standard 51M parameter model for story generation
- TinyGPT-MoE: 85M parameter MoE model with enhanced storytelling
- Wikipedia-MoE: 135M parameter MoE model trained on Wikipedia
- TinyGPT2: 95M parameter modern GPT with RoPE, GQA, and RMSNorm
- TinyGPT2-SFT: TinyGPT2 instruction fine-tuned on Alpaca; follows instructions and answers questions
```bash
streamlit run main.py
```

This launches a web application where you can:
- Select among all five model variants
- Adjust generation parameters (temperature, top-k, top-p, max tokens)
- Input text prompts and see real-time streaming responses
- Download models automatically from Hugging Face
```bash
# SFT model (default): wraps prompts in the instruction template
python inference.py --checkpoint checkpoints_sft/sft_epoch2.pth

# Pretrained model: raw text completion
python inference.py --checkpoint checkpoints/ckpt_step25500.pth --raw

# Single prompt
python inference.py --checkpoint checkpoints_sft/sft_epoch2.pth --prompt "What is the capital of France?"

# With custom settings
python inference.py --checkpoint checkpoints_sft/sft_epoch2.pth --max_tokens 200 --temperature 0.7 --top_k 40
```

Features:
- KV cache for fast autoregressive generation
- Streaming token-by-token output with EOS detection
- Interactive REPL mode
- Instruction template (default) or raw mode (`--raw`)
- Checkpoint info display (step, loss, tokens seen)
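The KV cache speeds up generation because keys and values of already-decoded tokens never change: each step computes K/V only for the new token and appends them to the cache. A toy NumPy sketch, with random projections standing in for the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []          # grows by one entry per decoded token

def decode_step(x: np.ndarray) -> np.ndarray:
    """Attend the new token's query over all cached keys/values,
    computing K/V only for this single new token."""
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (x @ Wq) @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                   # softmax over the whole cached prefix
    return w @ V                   # attention output for the new token

for _ in range(5):                 # five decode steps
    out = decode_step(rng.standard_normal(d))

assert len(k_cache) == 5           # one cached K/V per generated token
assert out.shape == (d,)
```

Without the cache, every step would recompute K and V for the entire prefix, making generation quadratic in sequence length per token instead of linear.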
```bash
# Start FastAPI server directly
python app.py

# Or use Docker
docker-compose up tinygpt-api --build
```

Features:
- REST API endpoints for text generation
- Multi-model support (TinyGPT, TinyGPT-MoE, TinyGPT2)
- Interactive Swagger docs at http://localhost:8000/docs
- Health monitoring and model management
For detailed API documentation, see docs/API.md.
```bash
# Production deployment
docker-compose up --build

# Development mode with hot reload
docker-compose --profile dev up tinygpt-dev --build
```

Access the application at http://localhost:8501
TinyGPT runs smoothly on:
- Windows ✅ (with automatic fallback for liger-kernel)
- macOS ✅ (with automatic fallback for liger-kernel)
- Linux ✅ (full liger-kernel optimization support)
- Docker ✅ (all platforms)
All models are trained with PyTorch on their respective datasets. See the training notebooks in the notebooks/ directory.
TinyGPT2 is pretrained on OpenWebText using train_liger.py:
```bash
# Start training from scratch
python train_liger.py

# Resume from checkpoint
python train_liger.py --resume
```
Training configuration:
- Hardware: Single NVIDIA RTX 3070 Ti (8GB VRAM)
- Effective batch size: 262K tokens/step (batch 8 × grad accum 64 × block size 512)
- Optimizer: AdamW with cosine decay schedule and warmup
- Mixed precision: bfloat16 with `torch.compile` for speed
- Evaluation: Periodic validation with sample text generation
- Checkpointing: Automatic saves with train/val loss tracking
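The warmup-plus-cosine-decay schedule mentioned above is a standard recipe; here is a sketch with illustrative hyperparameters (the repo's actual values may differ). Note also that 8 × 64 × 512 = 262,144, matching the quoted 262K tokens per step:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=2000, max_steps=25_500):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = min((step - warmup) / (max_steps - warmup), 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * t))    # goes 1 -> 0 over training
    return min_lr + coeff * (max_lr - min_lr)

assert 8 * 64 * 512 == 262_144            # tokens per optimizer step
assert lr_at(0) < lr_at(1999)             # warmup is increasing
assert abs(lr_at(25_500) - 6e-5) < 1e-9   # decays to min_lr by the end
```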
Fine-tune TinyGPT2 on instruction-following tasks using the Stanford Alpaca dataset:
```bash
# Fine-tune from a pretrained checkpoint
python sft.py --checkpoint checkpoints/ckpt_step25500.pth

# Resume SFT training
python sft.py --checkpoint checkpoints/ckpt_step25500.pth --resume
```

SFT configuration:
- Hardware: Single NVIDIA RTX 3070 Ti (8GB VRAM)
- Dataset: Stanford Alpaca (52K instruction-response pairs, 90/10 train/val split)
- Epochs: 3 (~85 minutes total)
- Effective batch size: 16K tokens/step (batch 4 × grad accum 8 × block size 512)
- Response-only loss masking: Only trains on the response portion, not the instruction prompt
- Prompt template: `### Instruction: ... ### Input: ... ### Response: ...`
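Response-only masking is typically implemented by setting the label of every prompt token to -100, the index that PyTorch's cross-entropy ignores by default; a framework-free sketch of the idea:

```python
IGNORE = -100   # default ignore_index of PyTorch's cross_entropy

def mask_labels(token_ids, response_start):
    """Copy token ids as labels, but ignore everything before the
    response so the loss covers only the model's answer."""
    return [IGNORE if i < response_start else tok
            for i, tok in enumerate(token_ids)]

# Hypothetical token ids: 6 prompt tokens followed by 3 response tokens.
ids = [101, 102, 103, 104, 105, 106, 7, 8, 9]
labels = mask_labels(ids, response_start=6)
assert labels == [-100, -100, -100, -100, -100, -100, 7, 8, 9]
```

This keeps the gradient signal focused on producing good responses rather than on regurgitating the instruction text.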
| Epoch | Train Loss | Val Loss | Train PPL | Val PPL |
|---|---|---|---|---|
| 1 | 2.13 | 2.01 | 8.45 | 7.44 |
| 2 | 1.97 | 1.98 | 7.17 | 7.27 |
| 3 | 1.91 | 1.98 | 6.77 | 7.26 |
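The perplexity columns are simply exp(loss), so the table can be sanity-checked from the loss columns (small discrepancies are expected because the losses are rounded to two decimals):

```python
import math

# (loss, reported perplexity) pairs from the SFT table above
for loss, ppl in [(2.13, 8.45), (1.97, 7.17), (1.91, 6.77),
                  (2.01, 7.44), (1.98, 7.27), (1.98, 7.26)]:
    # allow ~1% slack for the rounding of the loss values
    assert abs(math.exp(loss) - ppl) / ppl < 0.01
```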
- Kernel Fusion: Implemented to reduce memory bandwidth bottlenecks and speed up training operations
- Mixed Precision Training: Utilizes bfloat16 format for significantly faster training while maintaining numerical stability
- Gradient Accumulation: Applied to improve training stability and allow effective training with larger batch sizes
- Cosine Scheduler: Implements variable learning rate throughout training for better convergence
- PyTorch's Multi-Head Attention: Uses standard PyTorch implementations for Multi-Head Attention layers to boost training speed
- torch.compile: Full model compilation for fused kernel execution
- Grouped Query Attention (GQA): 12 query heads sharing 4 KV groups, reducing memory while maintaining quality
- Rotary Position Embeddings (RoPE): Efficient relative position encoding without learned position embeddings
- RMSNorm: Faster and more stable alternative to LayerNorm
- KV Cache: Efficient autoregressive generation that only computes attention for the new token
- Weight Tying: Shares weights between token embeddings and output projection to reduce parameters
- Fused AdamW: Uses CUDA-fused optimizer when available
- liger-kernel Integration: Uses optimized SwiGLU implementations for enhanced performance on Linux + CUDA
- Expert Routing: Dynamic routing of tokens to specialized experts for improved capabilities
- Sparse Activation: Only activates top-2 experts per token, maintaining efficiency while increasing model capacity
- Automatic Fallback: Gracefully falls back to PyTorch-native implementations on non-CUDA or Windows systems
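Top-2 expert routing can be sketched in a few lines: a learned gate scores every expert for each token, only the two best-scoring experts are evaluated, and their outputs are mixed by the renormalized gate weights. Illustrative shapes only, not the repo's actual module:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy experts
W_gate = rng.standard_normal((d, n_experts))                       # router weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-2 experts."""
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]          # indices of the 2 best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # renormalize over chosen experts
    # only the selected experts run: sparse activation
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))
assert y.shape == (d,)
```

Because only 2 of the 4 (or 8) experts run per token, total parameter count grows with the number of experts while per-token compute stays close to that of a dense FFN.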
```
tinygpt/
├── tinygpt/              # Core package
│   ├── __init__.py       # Exports all models, configs, tokenizer
│   ├── model.py          # GPTLanguageModel, MoEGPTLanguageModel, WikipediaMoEGPTLanguageModel, TinyGPT2
│   ├── layers.py         # DecoderBlock, MoE blocks, GQA, RoPE, RMSNorm, TinyGPT2Block
│   ├── config.py         # GPTConfig, MoEGPTConfig, WikipediaMoEGPTConfig, TinyGPT2Config
│   ├── tokenizer.py      # Tiktoken-based tokenizer
│   ├── utils.py          # Generation utilities, mask helpers
│   └── weights/          # Model weight files
├── train_liger.py        # TinyGPT2 pretraining script (OpenWebText)
├── sft.py                # Supervised fine-tuning on Stanford Alpaca
├── inference.py          # TinyGPT2 CLI inference with KV cache & instruction template
├── main.py               # Streamlit web UI (all models)
├── app.py                # FastAPI REST API service
├── notebooks/            # Training notebooks
├── docs/                 # API documentation, Docker guide
├── docker-compose.yml    # Docker deployment
└── requirements.txt      # Python dependencies
```
- CPU: Any modern multi-core processor
- RAM: 4GB+ (8GB recommended)
- Storage: 1GB for model weights and dependencies
- Python: 3.8 or higher
- OS: Linux (Ubuntu 20.04+ recommended)
- GPU: NVIDIA GPU with 8GB+ VRAM and CUDA 11.0+
- RAM: 16GB+
- Additional: liger-kernel for fused kernels
```bash
# Standard Python environment
pip install -r requirements.txt
streamlit run main.py
```

```bash
# Production deployment
docker-compose up --build

# Development with auto-reload
docker-compose --profile dev up tinygpt-dev --build
```

- Streamlit Cloud: Fully supported ✅
- Heroku: Supported with Docker ✅
- AWS/GCP/Azure: Supported with containerization ✅
- Hugging Face Spaces: Supported ✅
Prompt: One day, a dragon
Output:
One day, a dragon named Bobo was walking in the forest when he saw a little bunny. The bunny was sad because he had no friends. Bobo wanted to help the bunny, so he asked the bunny to give him a hug. The bunny said yes, and the bunny gave the bunny a hug.
Bobo was very happy and thanked the bunny. He named the bunny, and they became good friends. The bunny was always grateful for Bobo's help. They became good friends, and they always shared their toys and treats!
Prompt: The meaning of life
Output:
The meaning of life is more complex than its meanings. The two most common forms of human love are love and affection.
What is Love?
Love is both good and bad; it is one of love's most enduring possessions.
Love is the most fundamental, at times, measure of humanity's capacity for love. Love is an object of a man's desire and a desire. The desire of the man is the most important attribute of love.
Love is a self-awareness. It is a way of feeling out and doing something.
>>> Explain what machine learning is in simple terms.
Machine learning is a branch of computer science that focuses on using machine learning algorithms to identify patterns in data and identify patterns in data. It is a branch of computer science that focuses on creating computer systems that can perform tasks such as image recognition, image classification, and natural language processing. Machine learning algorithms are used to develop algorithms that can be used to generate and classify data in order to identify patterns in data. These algorithms are used to analyze large amounts of data and make predictions about future trends.
>>> What is the capital of France?
The capital of France is Paris.
>>> Write a motivational quote.
"The only way to make a difference is to be bold and courageous."
This project is licensed under the GPL-3.0 license - see the LICENSE file for details.
Contributions are welcome! Feel free to submit pull requests, create issues, or suggest improvements to the model or codebase.
If you find TinyGPT useful, please consider starring the repository!