@dariocazzani

What does this PR do?

This PR adds a complete chess move-prediction example that demonstrates training a transformer on domain-specific sequential data. It includes a Lichess database downloader, a PGN parser, a chess-specific tokenizer, and a training pipeline in which the model learns chess patterns from grandmaster games without being given the rules of chess.

Details

  • Add examples/chess.py: Complete chess training pipeline with Lichess database downloading, PGN parsing, and move generation demo
  • Add examples/chess_tokenizer.py: Domain-specific tokenizer for Standard Algebraic Notation with comprehensive chess vocabulary (~10k tokens)
  • Add tests/examples/test_chess_tokenizer.py: test suite for ChessTokenizer covering save/load, move encoding, castling, and promotion
  • Update README with instructions for installing optional example dependencies
  • Add zstandard as optional dependency for chess database decompression
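
For readers who want a feel for the PGN cleaning step, here is a minimal stdlib-only sketch. It is not the code in examples/chess.py: the function name `clean_pgn_games`, the regular expressions, and the reading of "2+ moves" as "keep games with at least two moves" are all assumptions made for illustration. It also assumes the input has already been decompressed (the PR uses zstandard for that step).

```python
import io
import re
from typing import Iterator, TextIO

# Hypothetical patterns, chosen for illustration only.
_COMMENT = re.compile(r"\{[^}]*\}")          # {annotations such as clock times}
_NAG = re.compile(r"\$\d+")                  # numeric annotation glyphs, e.g. $1
_MOVE_NUMBER = re.compile(r"\d+\.(\.\.)?")   # "1." and "1..."
_RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def clean_pgn_games(stream: TextIO, min_moves: int = 2) -> Iterator[str]:
    """Yield one space-separated SAN move sequence per game in a PGN stream."""
    movetext: list[str] = []

    def flush() -> Iterator[str]:
        text = " ".join(movetext)
        movetext.clear()
        # Strip comments first so digits inside them are not mistaken for move numbers.
        text = _COMMENT.sub(" ", text)
        text = _NAG.sub(" ", text)
        text = _MOVE_NUMBER.sub(" ", text)
        moves = [tok.rstrip("?!") for tok in text.split() if tok not in _RESULTS]
        moves = [m for m in moves if m]
        if len(moves) >= min_moves:
            yield " ".join(moves)

    for line in stream:  # line-by-line, so the whole file is never held in memory
        line = line.strip()
        if line.startswith("["):
            # A tag pair; pending movetext means the previous game has ended.
            if movetext:
                yield from flush()
            continue
        if line:
            movetext.append(line)
    if movetext:
        yield from flush()


if __name__ == "__main__":
    demo = io.StringIO('[Event "Demo"]\n\n1. e4 e5 2. Nf3 Nc6 1-0\n')
    for game in clean_pgn_games(demo):
        print(game)
```

The sketch does not handle comments that span several lines or contain nested braces, so real Lichess dumps may need a more careful parser.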

Highlights

The chess tokenizer creates a deterministic vocabulary covering all valid chess moves:

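@staticmethod
def _create_vocabulary() -> list[str]:
    """Generates the complete, deterministic vocabulary for chess."""
    tokens = {"[PAD]", "[UNK]", "[BOS]", "[EOS]", "*", "+", "#"}
    tokens.add("O-O")    # Kingside castling
    tokens.add("O-O-O")  # Queenside castling

    # Assumed definition (not shown in this excerpt): all 64 squares, a1..h8
    squares = [file + rank for file in "abcdefgh" for rank in "12345678"]

    # Generate piece moves and captures (excerpt; pawn moves and promotions are not shown)
    for piece in ["N", "B", "R", "Q", "K"]:
        for square in squares:
            tokens.add(piece + square)  # e.g., Nf3
            tokens.add(piece + "x" + square)  # e.g., Nxd4

    # Assumed ending: sort so that token ids are identical on every run
    return sorted(tokens)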
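
Because "+", "#", and "*" are separate vocabulary entries, an encoder has to split a string such as "Nxe4+" into the move and its suffix before looking up ids. The following is a rough sketch of one way to do that with a regular expression. It is not the PR's `encode` method; `split_san`, `encode`, and the pattern itself are hypothetical names and choices.

```python
import re

# Alternatives are tried left to right, so longer literals come first.
_TOKEN = re.compile(
    r"O-O-O|O-O"                                       # castling, longest first
    r"|1-0|0-1|1/2-1/2|\*"                             # game termination markers
    r"|\d+\."                                          # move numbers such as "12."
    r"|[NBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:=[NBRQ])?"   # piece and pawn moves
    r"|[+#]"                                           # check and mate suffixes
)


def split_san(text: str) -> list[str]:
    """Split a PGN move string into tokens; unmatched characters are skipped."""
    return _TOKEN.findall(text)


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map tokens to ids, falling back to [UNK] for anything outside the vocabulary."""
    unk = vocab["[UNK]"]
    return [vocab.get(tok, unk) for tok in split_san(text)]


if __name__ == "__main__":
    print(split_san("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. O-O Nxe4+"))
```

Keeping the check and mate symbols as their own tokens means the vocabulary does not need a separate "Nxe4+" and "Nxe4#" entry for every move.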

The training pipeline demonstrates domain adaptation with a chess-optimized configuration:

def create_chess_config(tokenizer_vocab_size: int) -> ScratchGPTConfig:
    architecture = ScratchGPTArchitecture(
        block_size=256,        # Longer context for chess games
        embedding_size=384,    # Balanced for chess vocabulary
        num_heads=8,           # Good attention for chess patterns
        num_blocks=6,          # Sufficient depth for chess understanding
        vocab_size=tokenizer_vocab_size,
    )
    # Excerpt ends here; the construction and return of ScratchGPTConfig are not shown.
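
To put those numbers in perspective, the arithmetic below estimates the per-head dimension and a rough parameter count. It assumes a conventional GPT-style block (attention projections plus an MLP with a 4x hidden layer, biases and layer norms ignored) and learned position embeddings. The actual ScratchGPT layers may differ, and `estimate_parameters` is a hypothetical helper, not part of the library.

```python
def estimate_parameters(
    vocab_size: int,
    block_size: int = 256,
    embedding_size: int = 384,
    num_heads: int = 8,
    num_blocks: int = 6,
) -> dict[str, int]:
    """Back-of-the-envelope size estimate for a GPT-style model (assumed layout)."""
    if embedding_size % num_heads != 0:
        raise ValueError("embedding_size must be divisible by num_heads")
    d = embedding_size
    # Assumed per block: 4*d*d for Q, K, V and output projections,
    # plus 8*d*d for an MLP that expands to 4*d and projects back.
    per_block = 12 * d * d
    blocks = num_blocks * per_block
    token_embedding = vocab_size * d
    position_embedding = block_size * d
    return {
        "head_dim": d // num_heads,
        "blocks": blocks,
        "token_embedding": token_embedding,
        "position_embedding": position_embedding,
        # Output projection is excluded (equivalent to assuming tied weights).
        "total": blocks + token_embedding + position_embedding,
    }


if __name__ == "__main__":
    for name, value in estimate_parameters(vocab_size=10_000).items():
        print(f"{name}: {value:,}")
```

With 384 embedding dimensions split across 8 heads, each head works with 48 dimensions.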

dariocazzani and others added 3 commits September 25, 2025 12:20

Add chess training example with Lichess database parser
* Add examples/chess.py: downloads and parses Lichess PGN databases for transformer training
* Streams large PGN files, removes metadata/annotations, filters games with 2+ moves
* Uses temporary directory, outputs clean move sequences ready for training
* Add zstandard dependency in examples-dependencies optional group for PGN decompression

feat: Add deterministic SAN ChessTokenizer
* Introduces `ChessTokenizer`, a serializable tokenizer with a pre-generated vocabulary for Standard Algebraic Notation (SAN).
* The vocabulary includes all valid moves, promotions, castling, move numbers, check/mate symbols, and game termination markers.
* Implements a robust `encode` method to correctly parse PGN strings into distinct tokens.

Complete chess training example with ChessTokenizer
* Add examples/chess_tokenizer.py: deterministic tokenizer for chess moves with comprehensive SAN vocabulary
* Transform chess.py into full training pipeline with ChessDataLoader, model training, and move generation demo
* Add tests/examples/test_chess_tokenizer.py: comprehensive test suite for chess tokenizer functionality
* Include chess-optimized model config (larger context, sliding windows) and famous opening position demos
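
The "sliding windows" mentioned in the last commit can be illustrated as follows. How ChessDataLoader actually builds its samples is not shown in this conversation, so the function name `sliding_windows` and the `stride` parameter are assumptions; the sketch only shows the general technique of cutting a token stream into overlapping input/target pairs.

```python
def sliding_windows(
    token_ids: list[int], block_size: int, stride: int
) -> list[tuple[list[int], list[int]]]:
    """Cut a token sequence into (input, target) pairs for next-token prediction.

    Each target is its input shifted left by one position.
    """
    if block_size < 1 or stride < 1:
        raise ValueError("block_size and stride must be positive")
    samples: list[tuple[list[int], list[int]]] = []
    # A window needs block_size inputs plus one extra token for the final target.
    last_start = len(token_ids) - block_size - 1
    for start in range(0, last_start + 1, stride):
        window = token_ids[start : start + block_size + 1]
        samples.append((window[:-1], window[1:]))
    return samples


if __name__ == "__main__":
    for inputs, targets in sliding_windows(list(range(10)), block_size=4, stride=3):
        print(inputs, "->", targets)
```

A stride smaller than the block size makes consecutive windows overlap, which yields more training samples from the same games at the cost of repeated tokens.
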

@ayeganov ayeganov left a comment

Looks good; needs a few touch-ups.

Explicit boolean

Co-authored-by: Aleks <ayeganov@users.noreply.github.com>

Refactor chess.py: remove duplication and magic numbers
* Replace manual token generation loop with model.generate() method to eliminate code duplication
* Generate chess moves one at a time instead of batch generation for clearer logic and easier debugging
* Extract magic number 80 to GAME_PREVIEW_MAX_LENGTH constant and add type hints to module-level constants
* Remove unused torch.nn.functional import and add missing return type hints
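
Generating one move at a time, as the refactor describes, amounts to a simple loop around a single-step predictor. The sketch below uses a hypothetical `next_token` callable as a stand-in for one step of the model; the real `model.generate()` signature is not shown in this conversation, so nothing here should be read as its API.

```python
from typing import Callable, Sequence


def generate_moves(
    prompt_ids: Sequence[int],
    next_token: Callable[[Sequence[int]], int],  # hypothetical single-step predictor
    eos_id: int,
    max_new_tokens: int,
    block_size: int = 256,
) -> list[int]:
    """Extend a prompt one token at a time until EOS or the token budget is hit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Only the most recent block_size tokens fit in the model's context.
        context = ids[-block_size:]
        token = next_token(context)
        if token == eos_id:
            break
        ids.append(token)
    return ids


if __name__ == "__main__":
    # Toy predictor: always proposes the previous id plus one.
    print(generate_moves([1, 2], lambda ctx: ctx[-1] + 1, eos_id=5, max_new_tokens=10))
```

Stepping one token at a time makes it easy to inspect or log each predicted move, which matches the debugging motivation given in the commit message.
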
@ayeganov ayeganov merged commit 541dd31 into main Oct 8, 2025
2 checks passed
@ayeganov ayeganov deleted the examples/chess branch October 8, 2025 14:57
dariocazzani added a commit that referenced this pull request Oct 8, 2025
* Add chess training example with Lichess database parser


* feat: Add deterministic SAN ChessTokenizer


* Complete chess training example with ChessTokenizer


* Explicit boolean

Co-authored-by: Aleks <ayeganov@users.noreply.github.com>

* Refactor chess.py: remove duplication and magic numbers


---------

Co-authored-by: Aleksandr V Yeganov <ayeganov@gmail.com>
Co-authored-by: Aleks <ayeganov@users.noreply.github.com>