
🎈 A series of lightweight GPT models featuring TinyGPT (~51M params), TinyGPT-MoE (~85M params), and TinyGPT2 (~95M params). Fast, creative text generation trained on whimsical stories.


TinyGPT Banner

TinyGPT 🤖

NEW: TinyGPT2-SFT is here! 🎉 Our 95M parameter model is now instruction fine-tuned on Stanford Alpaca. It can follow instructions, answer questions, write poems, and more — all trained on a single RTX 3070 Ti. Try it out!

GitHub stars GitHub forks Streamlit App

TinyGPT is an educational implementation of the GPT (Generative Pre-trained Transformer) architecture, featuring five model variants — from simple story generators to an instruction-following model built with RoPE, GQA, and RMSNorm. Built from the ground up with modern PyTorch, TinyGPT demonstrates how state-of-the-art language models can be both accessible and performant. ✨


Overview 🔍

TinyGPT represents a carefully crafted balance between accessibility and performance in language model design. The project progresses through five model variants — from a standard GPT, to Mixture-of-Experts architectures, to an instruction fine-tuned TinyGPT2 model with cutting-edge techniques.

🎯 Project Goals

  • Educational: Provide a clear, well-documented implementation of GPT architecture
  • Production-Ready: Deliver robust, efficient models suitable for real-world applications
  • Efficient: Optimized for running on consumer GPUs and edge devices with minimal latency
  • Accessible: Make it easy to run, train, fine-tune, and deploy on various platforms

Model Architecture 🏗️

TinyGPT comes in five variants:

TinyGPT (Standard) 🤖

TinyGPT 51M
  • 8 transformer blocks 🧱
  • 8 attention heads 👁️
  • 512 embedding dimensions 📊
  • Vocabulary size of 50,304 tokens 📚
  • Context window of 512 tokens 🪟
  • Parameters: ~51M
  • Training data: TinyStories dataset
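As a rough sketch, the hyperparameters above can be collected into a config object and used to estimate the parameter count. The field names here are illustrative; the project's actual config class lives in tinygpt/config.py and may differ.

```python
from dataclasses import dataclass

# Illustrative config mirroring the hyperparameters listed above
@dataclass
class TinyGPTConfig:
    n_layers: int = 8
    n_heads: int = 8
    d_model: int = 512
    vocab_size: int = 50304
    block_size: int = 512

cfg = TinyGPTConfig()

# Rough parameter estimate: embedding table + transformer blocks
emb_params = cfg.vocab_size * cfg.d_model   # token embedding table
block_params = 12 * cfg.d_model ** 2        # ~4 attention + ~8 MLP matrices per block
total = emb_params + cfg.n_layers * block_params
print(f"~{total / 1e6:.0f}M parameters")    # -> ~51M parameters
```

The estimate lands on the advertised ~51M, with roughly half the parameters in the embedding table at this scale.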

TinyGPT-MoE (Mixture of Experts) 🧠

TinyGPT MoE 85M
  • 8 transformer blocks with MoE layers 🧱
  • 8 attention heads 👁️
  • 512 embedding dimensions 📊
  • 4 experts per MoE layer with top-2 routing 🔀
  • Vocabulary size of 50,304 tokens 📚
  • Context window of 512 tokens 🪟
  • Parameters: ~85M
  • Training data: TinyStories dataset
  • Enhanced storytelling capabilities through expert specialization
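The top-2 routing listed above can be sketched as follows: a router scores each token against every expert, and only the two best-scoring experts process that token. This is an illustrative sketch of the standard mechanism, not the project's implementation in tinygpt/layers.py.

```python
import torch
import torch.nn.functional as F

def route_top2(x, router_weight):
    """x: (tokens, d_model); returns chosen expert ids and normalized gate weights."""
    logits = x @ router_weight                        # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gates, experts = probs.topk(2, dim=-1)            # keep the two best experts per token
    gates = gates / gates.sum(dim=-1, keepdim=True)   # renormalize over the chosen pair
    return experts, gates

x = torch.randn(5, 512)        # 5 tokens, 512-dim embeddings
w = torch.randn(512, 4)        # router over 4 experts
experts, gates = route_top2(x, w)
print(experts.shape, gates.shape)  # torch.Size([5, 2]) torch.Size([5, 2])
```

Each token's output is then the gate-weighted sum of its two experts' outputs, which is what keeps compute sparse while total capacity grows.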

Wikipedia-MoE 🌐

  • 8 transformer blocks with MoE layers 🧱
  • 16 attention heads 👁️
  • 512 embedding dimensions 📊
  • 8 experts per MoE layer with top-2 routing 🔀
  • Vocabulary size of 50,304 tokens 📚
  • Context window of 512 tokens 🪟
  • Parameters: ~135M
  • Training data: Wikipedia (C4 dataset)
  • Enhanced knowledge representation with more experts and attention heads

TinyGPT2 ⚡

  • 12 transformer blocks 🧱
  • 12 attention heads with Grouped Query Attention (4 KV groups) 👁️
  • 768 embedding dimensions 📊
  • 2048 FFN hidden size 🔧
  • RoPE (Rotary Position Embeddings) for position encoding 🔄
  • RMSNorm for layer normalization 📏
  • KV Cache for efficient autoregressive generation 🚀
  • Weight tying between token embeddings and output head 🔗
  • Vocabulary size of 50,304 tokens 📚
  • Context window of 512 tokens 🪟
  • Parameters: ~95M
  • Training data: OpenWebText (~6.7B tokens)
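The GQA configuration above (12 query heads sharing 4 KV groups) can be illustrated with tensor shapes. This is a minimal sketch of the standard grouped-query mechanism, not the project's attention code.

```python
import torch

# 12 query heads, 4 KV heads, head dim 64 (= 768 / 12), toy sequence of 8 tokens
B, T, n_q, n_kv, d_head = 1, 8, 12, 4, 64
q = torch.randn(B, n_q, T, d_head)
k = torch.randn(B, n_kv, T, d_head)   # KV cache stores only 4 heads, not 12
v = torch.randn(B, n_kv, T, d_head)

# Each KV group serves n_q // n_kv = 3 query heads
k = k.repeat_interleave(n_q // n_kv, dim=1)   # (B, 12, T, d_head)
v = v.repeat_interleave(n_q // n_kv, dim=1)

att = (q @ k.transpose(-2, -1)) / d_head ** 0.5
out = att.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([1, 12, 8, 64])
```

The memory win comes from the KV cache: it holds 4 heads' worth of keys and values instead of 12, a 3x reduction, while attention itself still runs with 12 query heads.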

TinyGPT2-SFT (Instruction Fine-Tuned) 💬

  • Base model: TinyGPT2 (~95M parameters)
  • Fine-tuning data: Stanford Alpaca (52K instruction-response pairs)
  • Training: 3 epochs with response-only loss masking
  • Prompt format: ### Instruction: ... ### Response: ...
  • Capabilities: Follows instructions, answers questions, writes creatively
  • Hardware: Single NVIDIA RTX 3070 Ti (8GB VRAM), ~85 minutes total
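The prompt format above can be assembled with a small helper. This is an illustrative sketch; the exact template string and whitespace used by sft.py may differ.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Build an Alpaca-style prompt in the ### Instruction / ### Response format."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:  # optional ### Input section for tasks with extra context
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    return prompt

print(build_prompt("What is the capital of France?"))
```

At inference time the model's generation is appended after the final `### Response:` marker, so everything before it acts as the conditioning context.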

Datasets 📖

Model           Dataset                  Tokens
TinyGPT         TinyStories              ~300M
TinyGPT-MoE     TinyStories              ~300M
Wikipedia-MoE   Wikipedia (C4)           ~500M
TinyGPT2        OpenWebText              ~6.7B
TinyGPT2-SFT    Stanford Alpaca (52K)    ~72M

Training Data Improvements 📈

  • Scale: TinyGPT2 is trained on ~6.7B tokens from OpenWebText, significantly enhancing its general language understanding.
  • Data Processing: Efficient data loading with HuggingFace datasets and tiktoken tokenization for fast throughput.

Installation 💿

To install TinyGPT, follow these steps:

# Clone the repository
git clone https://github.com/NotShrirang/tinygpt.git

# Navigate to the project directory
cd tinygpt

# Install the required packages
pip install -r requirements.txt

# Create the weights directory (model weights can be downloaded automatically
# via the Streamlit app, or placed here manually)
mkdir -p tinygpt/weights

Liger Kernel Dependencies 🔧

For optimal training performance with liger-kernel (used by TinyGPT2 and MoE models), you need:

  • Linux operating system (POSIX-compliant)
  • NVIDIA GPU with CUDA support
  • liger-kernel
# Install liger-kernel for training optimizations (Linux + CUDA only)
pip install liger-kernel

Note: On Windows or CPU-only environments, all models automatically fall back to pure PyTorch implementations without liger-kernel optimizations. The models will still work but training may be slower.
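The fallback described in the note can be sketched with a guarded import. LigerSwiGLUMLP is one of liger-kernel's fused modules; the project's actual import logic and the module it picks may differ.

```python
# Prefer liger-kernel's fused ops when importable (Linux + CUDA),
# otherwise fall back to plain PyTorch implementations.
try:
    from liger_kernel.transformers import LigerSwiGLUMLP
    HAS_LIGER = True
except ImportError:
    HAS_LIGER = False

print("using liger-kernel" if HAS_LIGER else "falling back to pure PyTorch")
```

Because the check happens once at import time, the rest of the model code can branch on a single flag instead of sprinkling platform checks everywhere.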

Docker Support 🐳

TinyGPT fully supports Docker for easy deployment and development:

# Production deployment
docker-compose up --build

# Development with hot reload
docker-compose --profile dev up tinygpt-dev --build

The Docker setup includes:

  • Multi-model support: All four model variants
  • Hot reload: Automatic code updates during development
  • Cross-platform: Works seamlessly on Windows, macOS, and Linux
  • Persistent storage: Model weights are cached between container restarts

For detailed Docker usage, see DOCKER.md.

Usage 🚀

Model Selection 🎯

Choose from five model variants:

  • TinyGPT: Standard 51M parameter model for story generation
  • TinyGPT-MoE: 85M parameter MoE model with enhanced storytelling
  • Wikipedia-MoE: 135M parameter MoE model trained on Wikipedia
  • TinyGPT2: 95M parameter modern GPT with RoPE, GQA, and RMSNorm
  • TinyGPT2-SFT: TinyGPT2 instruction fine-tuned on Alpaca — follows instructions and answers questions

Quick Start Options

Option 1: Streamlit Interface (Recommended for beginners)

streamlit run main.py

This launches a web application where you can:

  • Select between all four model variants
  • Adjust generation parameters (temperature, top-k, top-p, max tokens)
  • Input text prompts and see real-time streaming responses
  • Download models automatically from Hugging Face

Option 2: CLI Inference (TinyGPT2)

# SFT model (default) — wraps prompts in instruction template
python inference.py --checkpoint checkpoints_sft/sft_epoch2.pth

# Pretrained model — raw text completion
python inference.py --checkpoint checkpoints/ckpt_step25500.pth --raw

# Single prompt
python inference.py --checkpoint checkpoints_sft/sft_epoch2.pth --prompt "What is the capital of France?"

# With custom settings
python inference.py --checkpoint checkpoints_sft/sft_epoch2.pth --max_tokens 200 --temperature 0.7 --top_k 40

Features:

  • KV cache for fast autoregressive generation
  • Streaming token-by-token output with EOS detection
  • Interactive REPL mode
  • Instruction template (default) or raw mode (--raw)
  • Checkpoint info display (step, loss, tokens seen)
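The --temperature and --top_k flags map onto a standard sampling routine, sketched below over the model's 50,304-token vocabulary. This is illustrative; inference.py's implementation may differ.

```python
import torch

def sample_next(logits, temperature=0.7, top_k=40):
    """Sample one token id from a (batch, vocab) logits tensor."""
    logits = logits / max(temperature, 1e-8)          # sharpen or flatten the distribution
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # drop all but top-k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(1, 50304)   # stand-in for the model's output logits
token = sample_next(logits)
print(token.shape)  # torch.Size([1, 1])
```

Lower temperature and smaller top-k make output more deterministic; raising either increases diversity at the cost of coherence.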

Option 3: FastAPI Service (Production REST API)

# Start FastAPI server directly
python app.py

# Or use Docker
docker-compose up tinygpt-api --build

Features:

  • REST API endpoints for text generation
  • Multi-model support (TinyGPT, TinyGPT-MoE, TinyGPT2)
  • Interactive Swagger docs at http://localhost:8000/docs
  • Health monitoring and model management

For detailed API documentation, see docs/API.md.

Option 4: Docker (Recommended for production)

# Production deployment
docker-compose up --build

# Development mode with hot reload
docker-compose --profile dev up tinygpt-dev --build

Access the application at http://localhost:8501

Cross-Platform Compatibility 🌐

TinyGPT runs smoothly on:

  • Windows ✅ (with automatic fallback for liger-kernel)
  • macOS ✅ (with automatic fallback for liger-kernel)
  • Linux ✅ (full liger-kernel optimization support)
  • Docker ✅ (all platforms)

Training ⚙️

TinyGPT / TinyGPT-MoE / Wikipedia-MoE

Trained using PyTorch on their respective datasets. See the training notebooks in the notebooks/ directory.

Loss Curve

TinyGPT2 Pretraining

TinyGPT2 is pretrained on OpenWebText using train_liger.py:

# Start training from scratch
python train_liger.py

# Resume from checkpoint
python train_liger.py --resume

Training loss curve

Training configuration:

  • Hardware: Single NVIDIA RTX 3070 Ti (8GB VRAM)
  • Effective batch size: 262K tokens/step (batch 8 × grad accum 64 × block size 512)
  • Optimizer: AdamW with cosine decay schedule and warmup
  • Mixed precision: bfloat16 with torch.compile for speed
  • Evaluation: Periodic validation with sample text generation
  • Checkpointing: Automatic saves with train/val loss tracking
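The cosine-decay-with-warmup schedule can be sketched as a pure function of the training step. The peak/floor learning rates and step counts below are illustrative, not the values used by train_liger.py.

```python
import math

def lr_at(step, warmup=2000, max_steps=25500, peak=6e-4, floor=6e-5):
    """Linear warmup to peak, then cosine decay down to floor."""
    if step < warmup:
        return peak * (step + 1) / warmup                 # linear warmup
    progress = min((step - warmup) / max(max_steps - warmup, 1), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(f"{lr_at(0):.2e} -> {lr_at(2000):.2e} -> {lr_at(25500):.2e}")
# 3.00e-07 -> 6.00e-04 -> 6.00e-05
```

Warmup avoids large, destabilizing updates while Adam's moment estimates are still noisy; the cosine tail anneals the step size for a cleaner final loss.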

Supervised Fine-Tuning (SFT)

Fine-tune TinyGPT2 on instruction-following tasks using the Stanford Alpaca dataset:

# Fine-tune from a pretrained checkpoint
python sft.py --checkpoint checkpoints/ckpt_step25500.pth

# Resume SFT training
python sft.py --checkpoint checkpoints/ckpt_step25500.pth --resume

SFT configuration:

  • Hardware: Single NVIDIA RTX 3070 Ti (8GB VRAM)
  • Dataset: Stanford Alpaca (52K instruction-response pairs, 90/10 train/val split)
  • Epochs: 3 (~85 minutes total)
  • Effective batch size: 16K tokens/step (batch 4 × grad accum 8 × block size 512)
  • Response-only loss masking: Only trains on the response portion, not the instruction prompt
  • Prompt template: ### Instruction: ... ### Input: ... ### Response: ...
Epoch   Train Loss   Val Loss   Train PPL   Val PPL
1       2.13         2.01       8.45        7.44
2       1.97         1.98       7.17        7.27
3       1.91         1.98       6.77        7.26
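Response-only loss masking is conventionally implemented by setting prompt-token labels to -100 so that cross-entropy ignores them. A minimal sketch with illustrative token ids follows; sft.py's implementation may differ in detail.

```python
import torch

# Illustrative token ids for one training example
prompt_ids = [10, 11, 12]          # "### Instruction: ..." portion
response_ids = [20, 21, 22, 23]    # "### Response: ..." answer portion

input_ids = torch.tensor(prompt_ids + response_ids)
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100   # ignored by cross_entropy(ignore_index=-100)
print(labels.tolist())  # [-100, -100, -100, 20, 21, 22, 23]
```

The model still attends over the full prompt, but gradients only flow from the response tokens, so training teaches "how to answer" rather than "how to restate the instruction".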

Training Optimizations 🚀

Standard TinyGPT Optimizations

  • Kernel Fusion: Implemented to reduce memory bandwidth bottlenecks and speed up training operations
  • Mixed Precision Training: Utilizes bfloat16 format for significantly faster training while maintaining numerical stability
  • Gradient Accumulation: Applied to improve training stability and allow effective training with larger batch sizes
  • Cosine Scheduler: Implements variable learning rate throughout training for better convergence
  • PyTorch's Multi-Head Attention: Uses standard PyTorch implementations for Multi-Head Attention layers to boost training speed
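Gradient accumulation, listed above, can be sketched with a toy model: scaled gradients are summed over several micro-batches before a single optimizer step. This is illustrative, not the project's training loop.

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4   # effective batch = accum_steps x micro-batch size

opt.zero_grad()
for micro_step in range(accum_steps):
    x, y = torch.randn(2, 4), torch.randn(2, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()   # scale so summed grads average the micro-batches
opt.step()                            # one update after accum_steps micro-batches
print("stepped once after", accum_steps, "micro-batches")
```

Dividing the loss by `accum_steps` before `backward()` keeps the accumulated gradient equal in scale to one large-batch gradient, so the learning rate needs no adjustment.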

TinyGPT2 Specific Optimizations

  • torch.compile: Full model compilation for fused kernel execution
  • Grouped Query Attention (GQA): 12 query heads sharing 4 KV groups — reduces memory while maintaining quality
  • Rotary Position Embeddings (RoPE): Efficient relative position encoding without learned position embeddings
  • RMSNorm: Faster and more stable alternative to LayerNorm
  • KV Cache: Efficient autoregressive generation — only computes attention for the new token
  • Weight Tying: Shares weights between token embeddings and output projection to reduce parameters
  • Fused AdamW: Uses CUDA-fused optimizer when available
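Weight tying, from the list above, is essentially a one-line trick: the output head reuses the token-embedding matrix, saving a vocab_size × d_model parameter block. A minimal sketch at TinyGPT2's dimensions:

```python
import torch

vocab_size, d_model = 50304, 768
tok_emb = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)

lm_head.weight = tok_emb.weight   # both layers now share one (50304, 768) matrix

print(lm_head.weight.data_ptr() == tok_emb.weight.data_ptr())  # True
```

At these dimensions the shared matrix is ~38.6M parameters, a large fraction of the ~95M total, which is why tying is worthwhile for small models.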

MoE Specific Optimizations

  • liger-kernel Integration: Uses optimized SwiGLU implementations for enhanced performance on Linux + CUDA
  • Expert Routing: Dynamic routing of tokens to specialized experts for improved capabilities
  • Sparse Activation: Only activates top-2 experts per token, maintaining efficiency while increasing model capacity
  • Automatic Fallback: Gracefully falls back to PyTorch-native implementations on non-CUDA or Windows systems

Project Structure 📁

tinygpt/
├── tinygpt/                  # Core package
│   ├── __init__.py           # Exports all models, configs, tokenizer
│   ├── model.py              # GPTLanguageModel, MoEGPTLanguageModel, WikipediaMoEGPTLanguageModel, TinyGPT2
│   ├── layers.py             # DecoderBlock, MoE blocks, GQA, RoPE, RMSNorm, TinyGPT2Block
│   ├── config.py             # GPTConfig, MoEGPTConfig, WikipediaMoEGPTConfig, TinyGPT2Config
│   ├── tokenizer.py          # Tiktoken-based tokenizer
│   ├── utils.py              # Generation utilities, mask helpers
│   └── weights/              # Model weight files
├── train_liger.py            # TinyGPT2 pretraining script (OpenWebText)
├── sft.py                    # Supervised fine-tuning on Stanford Alpaca
├── inference.py              # TinyGPT2 CLI inference with KV cache & instruction template
├── main.py                   # Streamlit web UI (all models)
├── app.py                    # FastAPI REST API service
├── notebooks/                # Training notebooks
├── docs/                     # API documentation, Docker guide
├── docker-compose.yml        # Docker deployment
└── requirements.txt          # Python dependencies

Deployment & System Requirements 💻

Minimum Requirements (Inference)

  • CPU: Any modern multi-core processor
  • RAM: 4GB+ (8GB recommended)
  • Storage: 1GB for model weights and dependencies
  • Python: 3.8 or higher

Optimal Performance (Training)

  • OS: Linux (Ubuntu 20.04+ recommended)
  • GPU: NVIDIA GPU with 8GB+ VRAM and CUDA 11.0+
  • RAM: 16GB+
  • Additional: liger-kernel for fused kernels

Deployment Options

Local Development

# Standard Python environment
pip install -r requirements.txt
streamlit run main.py

Docker (Recommended)

# Production deployment
docker-compose up --build

# Development with auto-reload
docker-compose --profile dev up tinygpt-dev --build

Cloud Deployment

  • Streamlit Cloud: Fully supported ✅
  • Heroku: Supported with Docker ✅
  • AWS/GCP/Azure: Supported with containerization ✅
  • Hugging Face Spaces: Supported ✅

Sample Outputs 📝

TinyGPT (Standard Model)

Prompt: One day, a dragon

Output:
One day, a dragon named Bobo was walking in the forest when he saw a little bunny. The bunny was sad because he had no friends. Bobo wanted to help the bunny, so he asked the bunny to give him a hug. The bunny said yes, and the bunny gave the bunny a hug.

Bobo was very happy and thanked the bunny. He named the bunny, and they became good friends. The bunny was always grateful for Bobo's help. They became good friends, and they always shared their toys and treats!

TinyGPT2

Prompt: The meaning of life

Output:
The meaning of life is more complex than its meanings. The two most common forms of human love are love and affection.

What is Love?

Love is both good and bad; it is one of love's most enduring possessions.

Love is the most fundamental, at times, measure of humanity's capacity for love. Love is an object of a man's desire and a desire. The desire of the man is the most important attribute of love.

Love is a self-awareness. It is a way of feeling out and doing something.

TinyGPT2-SFT (Instruction Fine-Tuned)

>>> Explain what machine learning is in simple terms.
Machine learning is a branch of computer science that focuses on using machine learning algorithms to identify patterns in data and identify patterns in data. It is a branch of computer science that focuses on creating computer systems that can perform tasks such as image recognition, image classification, and natural language processing. Machine learning algorithms are used to develop algorithms that can be used to generate and classify data in order to identify patterns in data. These algorithms are used to analyze large amounts of data and make predictions about future trends.

>>> What is the capital of France?
The capital of France is Paris.

>>> Write a motivational quote.
"The only way to make a difference is to be bold and courageous."

License 📜

This project is licensed under the GPL-3.0 license - see the LICENSE file for details.

Contributing 👥

Contributions are welcome! Feel free to submit pull requests, create issues, or suggest improvements to the model or codebase.

Support ❤️

If you find TinyGPT useful, please consider starring the repository ⭐
