# NeoLLM

A decoder-only Transformer language model built entirely from scratch in PyTorch. It implements the same modern architecture used by LLaMA-3 and Mistral: Grouped Query Attention, RoPE, RMSNorm, and a SwiGLU feed-forward network. It also ships a full training pipeline with mixed precision, gradient accumulation, and checkpoint resume.

## Architecture

| Component         | Implementation                     | Reference            |
|-------------------|------------------------------------|----------------------|
| Tokenizer         | BPE (Byte-Pair Encoding)           | GPT-2 / LLaMA        |
| Normalization     | RMSNorm (pre-norm)                 | LLaMA                |
| Position Encoding | Rotary Embeddings (RoPE)           | Su et al., 2021      |
| Attention         | Grouped Query Attention + KV-cache | Ainslie et al., 2023 |
| Feed-Forward      | SwiGLU                             | Shazeer, 2020        |
| LR Schedule       | Warmup-Stable-Decay (WSD)          | LLaMA-3, Falcon      |
| Precision         | fp16 / bf16 (auto-detected)        |                      |
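
For reference, here is a minimal PyTorch sketch of two of these components. NeoLLM's own implementations live in `neollm/model/` and may differ in details such as the epsilon and hidden-size choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal RMS of its last dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated feed-forward: down(SiLU(gate(x)) * up(x)), all projections bias-free."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```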

## Model Sizes

| Preset | Params | Layers | Dim  | Heads | VRAM   | Config                         |
|--------|--------|--------|------|-------|--------|--------------------------------|
| nano   | 22M    | 6      | 512  | 8     | ~2 GB  | `--preset nano`                |
| small  | 125M   | 12     | 768  | 12    | ~4 GB  | `--config configs/small.yaml`  |
| medium | 370M   | 24     | 1024 | 16    | ~10 GB | `--config configs/medium.yaml` |

## Requirements

- Python 3.10+
- PyTorch 2.3+ with CUDA (recommended) or CPU
- 4 GB+ of GPU VRAM for the small model, ~2 GB for nano (the snippet below checks yours)
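
A quick way to check your GPU against these numbers:

```python
import torch

# Report the name and total VRAM of the first CUDA device, if any.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA device found; training will fall back to CPU.")
```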

## Installation

```bash
git clone https://github.com/yourusername/NeoLLM.git
cd NeoLLM

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate      # Linux / macOS
.venv\Scripts\activate         # Windows

# Install dependencies
pip install -r requirements.txt
```

For CUDA-enabled PyTorch (recommended), install it before the rest of `requirements.txt`:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu126
```

## Quick Start

### Step 0: Verify Installation

```bash
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
```

### Step 1: Train the Tokenizer

Trains a 32,000-token BPE vocabulary on OpenWebText. Run once.

```bash
python scripts/train_tokenizer.py --samples 200000 --vocab-size 32000
```

Output: `data/tokenizer.json`
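
To sanity-check the result, you can try loading the file with the Hugging Face `tokenizers` library. This assumes NeoLLM serializes to that library's JSON format, which the filename suggests but the BPE implementation in `neollm/data/` may not guarantee:

```python
from tokenizers import Tokenizer  # pip install tokenizers

tok = Tokenizer.from_file("data/tokenizer.json")
ids = tok.encode("Hello, world!").ids
print(ids)                    # token ids
print(tok.decode(ids))        # should round-trip to the original text
print(tok.get_vocab_size())   # should report 32000
```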


### Step 2: Prepare Training Data

Downloads and tokenizes documents into fast binary shards. Run once (or rerun with a larger `--max-docs` to get more data).

```bash
# Recommended starting point (~500 MB, good for testing)
python scripts/prepare_data.py --max-docs 50000

# For serious training (~5 GB, much better model quality)
python scripts/prepare_data.py --max-docs 500000
```

Output: `data/processed/train_????.bin`
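
The exact binary layout is defined by `prepare_data.py`, but shards like this are conventionally a flat stream of token ids. A sketch of inspecting one, assuming `uint16` ids (sufficient for a 32,000-token vocabulary) and a hypothetical shard name:

```python
import numpy as np

# Memory-map the shard so even multi-GB files open instantly.
shard = np.memmap("data/processed/train_0000.bin", dtype=np.uint16, mode="r")
print(f"{len(shard):,} tokens")
print(shard[:16])  # the first few token ids
```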


### Step 3: Train a Model

**Nano (22M params)** is a sanity check that runs in minutes:

```bash
python scripts/train.py --preset nano --max-steps 500
```

Use this to verify your setup works before committing to a long run.

**Small (125M params)** is the recommended, GPT-2-scale run:

```bash
python scripts/train.py --config configs/small.yaml
```

**Medium (370M params)** needs 10 GB+ of VRAM:

```bash
python scripts/train.py --config configs/medium.yaml
```

Training auto-resumes from the latest checkpoint. Just restart the command to continue.
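
Under the hood, resuming follows the standard PyTorch pattern sketched below; the checkpoint key names here are assumptions, and NeoLLM's Trainer may use different ones:

```python
import os
import torch

def maybe_resume(model: torch.nn.Module,
                 optimizer: torch.optim.Optimizer,
                 ckpt_path: str = "checkpoints/small/latest.pt") -> int:
    """Return the step to resume from, restoring state if a checkpoint exists."""
    if not os.path.exists(ckpt_path):
        return 0
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])          # assumed key names
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1
```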


### Step 4: Monitor Training

Open a second terminal while training runs:

```bash
tensorboard --logdir runs/
```

Then open http://localhost:6006 in your browser for live graphs of loss, perplexity, learning rate, and tokens/sec.
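
Those scalars are written with the standard `torch.utils.tensorboard` API. A minimal sketch of the pattern, with hypothetical tag names and values (NeoLLM's logger in `neollm/utils/` chooses its own):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/small")  # matches the runs/ layout below
writer.add_scalar("train/loss", 3.21, global_step=1000)  # hypothetical values
writer.add_scalar("train/lr", 3.0e-4, global_step=1000)
writer.close()
```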


### Step 5: Chat with Your Model

```bash
# Chat with the small model
python scripts/chat.py --checkpoint checkpoints/small/latest.pt

# Chat with nano (after a quick test run)
python scripts/chat.py --checkpoint checkpoints/nano/latest.pt

# Single prompt (non-interactive)
python scripts/chat.py --checkpoint checkpoints/small/latest.pt \
    --prompt "The future of artificial intelligence" --max-tokens 200
```

### Step 6: Use in Your Application

```python
from neollm.inference.engine import InferenceEngine

engine = InferenceEngine.from_checkpoint(
    checkpoint_path="checkpoints/small/latest.pt",
    tokenizer_path="data/tokenizer.json",
)

response = engine.generate(
    prompt="Once upon a time,",
    max_new_tokens=100,
    temperature=0.8,
    top_p=0.9,
)
print(response)
```
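
For intuition about what `temperature` and `top_p` do, here is a reference implementation of one nucleus-sampling step. It is the standard algorithm, not NeoLLM's actual Sampler (which lives in `neollm/inference/`):

```python
import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 0.8,
                 top_p: float = 0.9) -> int:
    """Sample one token id from a (vocab_size,) logits vector."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose probability mass reaches top_p.
    cutoff = int(torch.searchsorted(cumulative, top_p)) + 1
    kept = sorted_probs[:cutoff]
    choice = torch.multinomial(kept / kept.sum(), num_samples=1)
    return int(sorted_ids[choice])
```

Lower temperature sharpens the distribution before the cutoff; lower `top_p` truncates it more aggressively.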

Training on Free Cloud GPUs (Kaggle)

The project ships with a ready-to-use Kaggle notebook (kaggle_neollm_small.ipynb). Kaggle provides free GPU sessions of up to 12 hours with 30 GPU-hours per week.

  1. Upload this repository as a Kaggle Dataset
  2. Upload kaggle_neollm_small.ipynb as a new notebook
  3. Set Accelerator → GPU T4 and enable Internet
  4. Click Run All
  5. Download checkpoints/small/latest.pt from the Output tab when done

Re-upload the checkpoint next session — training resumes automatically from where it left off.

## Checkpoint Layout

```text
checkpoints/
├── nano/
│   ├── step_000100.pt
│   └── latest.pt          ← always the most recent
├── small/
│   ├── step_002000.pt
│   ├── step_004000.pt
│   └── latest.pt
└── medium/
    └── latest.pt

runs/                       ← TensorBoard logs
├── nano/
└── small/
```
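
To see what a `.pt` file actually contains before wiring it into anything, load it on CPU and list its top-level keys (the exact names are NeoLLM-specific):

```python
import torch

ckpt = torch.load("checkpoints/small/latest.pt", map_location="cpu")
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        print(key, type(value).__name__)
```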

## Configuration

All model and training settings are stored in YAML files under `configs/`. Key options:

```yaml
# configs/small.yaml (excerpt)
train:
  batch_size: 4          # increase if you have more VRAM
  grad_accum_steps: 8    # effective batch = batch_size × grad_accum_steps
  lr: 3.0e-4
  max_steps: 200000
  fp16: true             # use bf16: true on Ampere+ GPUs (RTX 3000+)
  gradient_checkpointing: false  # set true to reduce VRAM at ~20% speed cost
  save_every: 2000
```
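
With this excerpt, the effective batch is 4 × 8 = 32 sequences per optimizer step, and `lr` is the peak rate fed to the WSD schedule from the Architecture table. Below is a minimal sketch of WSD, assuming linear warmup and a linear final decay; the warmup length and decay fraction are hypothetical, and NeoLLM's exact phase shapes may differ.

```python
def wsd_lr(step: int, peak_lr: float, max_steps: int,
           warmup_steps: int = 2000, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: ramp up, hold at the peak, anneal at the end."""
    decay_start = int(max_steps * (1.0 - decay_frac))
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                        # stable phase at peak LR
        return peak_lr
    # Linear decay to zero over the final `decay_frac` of training.
    remaining = (max_steps - step) / (max_steps - decay_start)
    return peak_lr * max(remaining, 0.0)

# Example with the config above: lr=3e-4, max_steps=200000
assert wsd_lr(100_000, 3e-4, 200_000) == 3e-4   # stable phase
assert wsd_lr(200_000, 3e-4, 200_000) == 0.0    # fully decayed
```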

## Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Batch too large | Reduce `batch_size` or set `gradient_checkpointing: true` |
| `FileNotFoundError: tokenizer.json` | Step 1 skipped | Run `train_tokenizer.py` first |
| No shard files found | Step 2 skipped | Run `prepare_data.py` first |
| Loss not decreasing | Wrong LR or data issue | Run `--preset nano` first to verify the setup |
| `No module named torch` | Wrong Python environment | Activate your virtual environment |

## Running Tests

```bash
python -m pytest tests/ -v
```

## Project Structure

```text
NeoLLM/
├── neollm/
│   ├── config/          # ModelConfig, TrainConfig, DataConfig dataclasses
│   ├── model/           # RMSNorm, RoPE, Attention, FFN, Transformer, NeoLLM
│   ├── data/            # BPE tokenizer, dataset sharding, DataLoader
│   ├── training/        # Trainer, AdamW optimizer, WSD LR schedule, ModelEMA
│   ├── inference/       # Sampler (greedy/top-k/top-p), InferenceEngine
│   └── utils/           # Logging, metrics
├── scripts/
│   ├── train_tokenizer.py
│   ├── prepare_data.py
│   ├── train.py
│   └── chat.py
├── configs/
│   ├── nano.yaml         # 22M params
│   ├── small.yaml        # 125M params
│   └── medium.yaml       # 370M params
├── tests/
├── kaggle_neollm_small.ipynb
├── inference_example.py
├── requirements.txt
└── pyproject.toml
```

## License

MIT
