♟️ Add comprehensive chess training example with domain tokenizer #27

dariocazzani · 2025-09-25T19:27:42Z

What does this PR do?

This PR adds a complete chess move prediction example that demonstrates training transformers on domain-specific sequential data. It includes a Lichess database downloader, a PGN parser, a chess-specific tokenizer, and a training pipeline that learns chess patterns from grandmaster games without being given the rules.

Details

* Add `examples/chess.py`: complete chess training pipeline with Lichess database downloading, PGN parsing, and a move generation demo
* Add `examples/chess_tokenizer.py`: domain-specific tokenizer for Standard Algebraic Notation (SAN) with a comprehensive chess vocabulary (~10k tokens)
* Add a test suite for `ChessTokenizer` covering save/load, move encoding, castling, and promotion
* Update the README with instructions for installing the optional example dependencies
* Add `zstandard` as an optional dependency for decompressing the chess database

Highlights

The chess tokenizer creates a deterministic vocabulary covering all valid chess moves:

@staticmethod
def _create_vocabulary() -> list[str]:
    """Generates the complete, deterministic vocabulary for chess."""
    tokens = {"[PAD]", "[UNK]", "[BOS]", "[EOS]", "*", "+", "#"}
    tokens.add("O-O")    # Kingside castling
    tokens.add("O-O-O")  # Queenside castling

    # All 64 board squares, a1 through h8 (assumed definition; not shown in this excerpt)
    squares = [f + r for f in "abcdefgh" for r in "12345678"]

    # Generate all possible piece moves, captures, and promotions
    for piece in ["N", "B", "R", "Q", "K"]:
        for square in squares:
            tokens.add(piece + square)  # e.g., Nf3
            tokens.add(piece + "x" + square)  # e.g., Nxd4
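
For anyone who wants to try the idea outside the repo, here is a minimal, self-contained sketch of how such a vocabulary could be enumerated in a reproducible order. It is not the code from this PR: `build_san_vocabulary`, the pawn handling, and the choice to sort the token set are assumptions, and disambiguated moves such as `Nbd2` and move numbers are left out, so it produces far fewer tokens than the real tokenizer.

```python
from itertools import product

FILES = "abcdefgh"
RANKS = "12345678"
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]


def build_san_vocabulary() -> list[str]:
    """Enumerate a simplified SAN vocabulary in a reproducible order."""
    squares = [f + r for f, r in product(FILES, RANKS)]
    tokens = {"*", "+", "#", "O-O", "O-O-O"}

    # Piece moves and captures, e.g. Nf3 and Nxd4.
    for piece, square in product("NBRQK", squares):
        tokens.add(piece + square)
        tokens.add(piece + "x" + square)

    # Pawn pushes (e4) and captures from an adjacent file (dxe4).
    for file_idx, file in enumerate(FILES):
        adjacent = [FILES[i] for i in (file_idx - 1, file_idx + 1) if 0 <= i < 8]
        for rank in RANKS:
            moves = [file + rank] + [src + "x" + file + rank for src in adjacent]
            if rank in "18":
                # Reaching the last rank always promotes, e.g. e8=Q or dxe8=N.
                tokens.update(m + "=" + p for m in moves for p in "NBRQ")
            else:
                tokens.update(moves)

    # Sets have no stable order, so sort before assigning ids; special
    # tokens go first so their ids stay fixed.
    return SPECIAL_TOKENS + sorted(tokens)


vocab = build_san_vocabulary()
token_to_id = {token: idx for idx, token in enumerate(vocab)}
```

Sorting is what makes the result independent of set iteration order, which matters if a saved model is later reloaded with a freshly generated vocabulary.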

The training pipeline demonstrates domain adaptation with chess-optimized configuration:

def create_chess_config(tokenizer_vocab_size: int) -> ScratchGPTConfig:
    architecture = ScratchGPTArchitecture(
        block_size=256,        # Longer context for chess games
        embedding_size=384,    # Balanced for chess vocabulary
        num_heads=8,           # Good attention for chess patterns
        num_blocks=6,          # Sufficient depth for chess understanding
        vocab_size=tokenizer_vocab_size,
    )
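
As a rough sanity check on these numbers, the sketch below derives the per-head width and an approximate parameter count. `ArchitectureSketch` is a hypothetical stand-in, not `ScratchGPTArchitecture`; the estimate assumes a standard GPT-style block (four attention projections plus a 4x-wide MLP, no biases or layer norms counted) and uses 10,000 as a placeholder for the "~10k tokens" vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchitectureSketch:
    """Hypothetical stand-in for the architecture settings shown above."""

    block_size: int = 256
    embedding_size: int = 384
    num_heads: int = 8
    num_blocks: int = 6
    vocab_size: int = 10_000  # placeholder for the "~10k tokens" figure

    def __post_init__(self) -> None:
        # Multi-head attention usually splits the embedding evenly across heads.
        if self.embedding_size % self.num_heads != 0:
            raise ValueError("embedding_size must be divisible by num_heads")

    @property
    def head_size(self) -> int:
        return self.embedding_size // self.num_heads

    def approx_parameters(self) -> int:
        """Rough count: attention (4*d*d) + MLP (8*d*d) per block, plus embeddings."""
        d = self.embedding_size
        per_block = 4 * d * d + 8 * d * d
        embeddings = (self.vocab_size + self.block_size) * d
        return self.num_blocks * per_block + embeddings


arch = ArchitectureSketch()
print(arch.head_size, arch.approx_parameters())
```

Under those assumptions each head works with 48 dimensions and the model lands at roughly 14.6M parameters, before counting an untied output projection if the real model has one.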

Add chess training example with Lichess database parser

* Add examples/chess.py: downloads and parses Lichess PGN databases for transformer training
* Streams large PGN files, removes metadata/annotations, filters games with 2+ moves
* Uses temporary directory, outputs clean move sequences ready for training
* Add zstandard dependency in examples-dependencies optional group for PGN decompression
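
The cleaning step described in this commit can be sketched with the standard library alone. This is an illustration, not the parser from `examples/chess.py`: the function names and regular expressions are assumptions, "filters games with 2+ moves" is read here as "keep games with at least two moves", and nested variations are not handled.

```python
import re
from collections.abc import Iterable, Iterator

_COMMENT = re.compile(r"\{[^}]*\}")            # {...} comments, incl. clock/eval tags
_BLACK_MOVE_NUMBER = re.compile(r"\d+\.\.\.")  # "1..." markers that follow a comment
_ANNOTATION = re.compile(r"\$\d+|[!?]+")       # NAGs ($1) and !, ?, ?! style marks
_RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def clean_movetext(raw: str) -> str:
    """Strip comments and annotations, keeping move numbers, moves, and result."""
    text = _COMMENT.sub(" ", raw)
    text = _BLACK_MOVE_NUMBER.sub(" ", text)
    text = _ANNOTATION.sub("", text)
    return " ".join(text.split())


def count_moves(movetext: str) -> int:
    """Count SAN moves, ignoring move numbers ("1.") and the result marker."""
    return sum(
        1
        for token in movetext.split()
        if not token.endswith(".") and token not in _RESULTS
    )


def iter_clean_games(lines: Iterable[str], min_moves: int = 2) -> Iterator[str]:
    """Yield one cleaned game per PGN record, reading line by line."""
    buffer: list[str] = []

    def finish() -> Iterator[str]:
        game = clean_movetext(" ".join(buffer))
        if game and count_moves(game) >= min_moves:
            yield game

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("["):  # header tag: a new game record begins
            yield from finish()
            buffer.clear()
        elif stripped:
            buffer.append(stripped)
    yield from finish()


sample = [
    '[Event "Rated Blitz game"]',
    '[Result "1-0"]',
    "",
    "1. e4 { [%clk 0:03:00] } 1... e5?! { [%clk 0:03:00] } 2. Nf3 $1 Nc6 1-0",
    "",
    '[Event "Abandoned"]',
    "",
    "1. d4 0-1",
]
games = list(iter_clean_games(sample))
```

Because the function only needs an iterable of lines, it could be fed from a lazily decompressed stream, for example `io.TextIOWrapper(zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8")` if the `zstandard` package is installed, so the full dump never has to sit in memory.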

feat: Add deterministic SAN ChessTokenizer

* Introduces `ChessTokenizer`, a serializable tokenizer with a pre-generated vocabulary for Standard Algebraic Notation (SAN).
* The vocabulary includes all valid moves, promotions, castling, move numbers, check/mate symbols, and game termination markers.
* Implements a robust `encode` method to correctly parse PGN strings into distinct tokens.
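
One way to split movetext into the token kinds listed above is a single regular expression with ordered alternatives. The sketch below is an assumption about how such parsing might look, not the `encode` method from this PR; `tokenize`, `encode`, and the tiny vocabulary are hypothetical, and text that matches no alternative is silently skipped.

```python
import re

_TOKEN = re.compile(
    r"\d+\."                                          # move number, e.g. "12."
    r"|O-O-O|O-O"                                     # castling, long form first
    r"|[NBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:=[NBRQ])?"  # piece or pawn move
    r"|1-0|0-1|1/2-1/2|\*"                            # game termination markers
    r"|[+#]"                                          # check / mate as own tokens
)


def tokenize(movetext: str) -> list[str]:
    """Split PGN movetext into SAN tokens, detaching check and mate symbols."""
    return _TOKEN.findall(movetext)


def encode(movetext: str, vocab: dict[str, int]) -> list[int]:
    """Map tokens to ids, falling back to the [UNK] id for unknown tokens."""
    unk = vocab["[UNK]"]
    return [vocab.get(token, unk) for token in tokenize(movetext)]


tokens = tokenize("1. e4 e5 2. Nf3 Nc6 3. Bb5+ a6 4. O-O-O Qxf7# 1-0")
ids = encode("1. e4 e5", {"[UNK]": 0, "1.": 1, "e4": 2})
```

Listing `O-O-O` before `O-O` matters: regex alternation takes the first alternative that matches, so the shorter form would otherwise consume the start of a queenside castle.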

Complete chess training example with ChessTokenizer

* Add examples/chess_tokenizer.py: deterministic tokenizer for chess moves with comprehensive SAN vocabulary
* Transform chess.py into full training pipeline with ChessDataLoader, model training, and move generation demo
* Add tests/examples/test_chess_tokenizer.py: comprehensive test suite for chess tokenizer functionality
* Include chess-optimized model config (larger context, sliding windows) and famous opening position demos
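
The "sliding windows" mentioned in the last bullet can be illustrated as follows. This is a generic next-token windowing sketch, not the `ChessDataLoader` from this PR; the function name, the `stride` parameter, and the choice to drop sequences shorter than a full window are assumptions.

```python
def sliding_windows(
    token_ids: list[int], block_size: int, stride: int
) -> list[tuple[list[int], list[int]]]:
    """Return (inputs, targets) pairs where targets are inputs shifted by one."""
    if block_size < 1 or stride < 1:
        raise ValueError("block_size and stride must be positive")

    pairs = []
    # Each window needs block_size inputs plus one extra token for the last target.
    last_start = len(token_ids) - block_size - 1
    for start in range(0, last_start + 1, stride):
        window = token_ids[start : start + block_size + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs


pairs = sliding_windows(list(range(10)), block_size=4, stride=3)
```

A stride smaller than `block_size` makes windows overlap, which yields more training examples from each game at the cost of correlated batches.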

ayeganov

Looks good, needs a few touch-ups.


Refactor chess.py: remove duplication and magic numbers

* Replace manual token generation loop with model.generate() method to eliminate code duplication
* Generate chess moves one at a time instead of batch generation for clearer logic and easier debugging
* Extract magic number 80 to GAME_PREVIEW_MAX_LENGTH constant and add type hints to module-level constants
* Remove unused torch.nn.functional import and add missing return type hints
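
The "one move at a time" approach from the second bullet might look roughly like the loop below. The real signature of `model.generate()` is not shown in this conversation, so the sketch takes a hypothetical `generate(ids, max_new_tokens)` callable that returns the prompt followed by the new tokens; the stub used at the end is only there to make the example runnable.

```python
from collections.abc import Callable, Sequence

# Hypothetical signature: (token ids, max_new_tokens) -> prompt plus new tokens.
GenerateFn = Callable[[list[int], int], Sequence[int]]


def generate_moves(
    generate: GenerateFn, prompt_ids: Sequence[int], num_moves: int, eos_id: int
) -> list[int]:
    """Ask for a single new token per step and stop early at end-of-sequence."""
    ids = list(prompt_ids)
    new_ids: list[int] = []
    for _ in range(num_moves):
        ids = list(generate(ids, 1))  # extend the running game by one token
        new_ids.append(ids[-1])
        if ids[-1] == eos_id:
            break
    return new_ids


def stub_generate(ids: list[int], max_new_tokens: int) -> list[int]:
    """Stand-in for a model: appends the current sequence length as the next id."""
    return ids + [len(ids) + i for i in range(max_new_tokens)]


moves = generate_moves(stub_generate, [7, 8], num_moves=3, eos_id=99)
```

Stepping one token at a time makes it easy to inspect or decode each move as it is produced, which fits the debugging motivation given in the commit message.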


* Add chess training example with Lichess database parser
* feat: Add deterministic SAN ChessTokenizer
* Complete chess training example with ChessTokenizer
* Explicit boolean
* Refactor chess.py: remove duplication and magic numbers

Co-authored-by: Aleksandr V Yeganov <ayeganov@gmail.com>
Co-authored-by: Aleks <ayeganov@users.noreply.github.com>

dariocazzani and others added 3 commits September 25, 2025 12:20

dariocazzani requested a review from ayeganov September 25, 2025 19:50

ayeganov requested changes Sep 26, 2025

examples/chess.py: 5 review comments (outdated, resolved)

Explicit boolean

40874ea

Co-authored-by: Aleks <ayeganov@users.noreply.github.com>

dariocazzani force-pushed the examples/chess branch from a5fb57f to fd8003b Compare October 7, 2025 19:54

dariocazzani requested a review from ayeganov October 7, 2025 19:54

dariocazzani force-pushed the examples/chess branch from fd8003b to b4441db Compare October 8, 2025 12:58

ayeganov approved these changes Oct 8, 2025

examples/chess.py: 1 review comment (outdated, resolved)

ayeganov merged commit 541dd31 into main Oct 8, 2025
2 checks passed

ayeganov deleted the examples/chess branch October 8, 2025 14:57
