♟️ Add comprehensive chess training example with domain tokenizer #27
Merged
* Add examples/chess.py: downloads and parses Lichess PGN databases for transformer training
* Streams large PGN files, removes metadata/annotations, filters games with 2+ moves
* Uses temporary directory, outputs clean move sequences ready for training
* Add zstandard dependency in examples-dependencies optional group for PGN decompression
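The cleaning and filtering steps described in this commit can be illustrated with a small standard-library sketch. It is not the code from `examples/chess.py`: the helper names (`clean_movetext`, `count_moves`, `iter_games`) and the regular expressions are assumptions made for illustration, and parenthesised variations are not handled.

```python
import re
from typing import Iterable, Iterator, List

# Hypothetical helpers; the parser in examples/chess.py may work differently.
HEADER_RE = re.compile(r"^\[.*\]\s*$")    # tag pairs such as [Event "..."]
COMMENT_RE = re.compile(r"\{[^}]*\}")     # brace comments, e.g. { [%clk 0:03:00] }
NAG_RE = re.compile(r"\$\d+")             # numeric annotation glyphs, e.g. $1
ANNOTATION_RE = re.compile(r"[!?]+")      # suffix annotations such as ?! or !!
CONTINUATION_RE = re.compile(r"\d+\.\.\.")  # "1..." written after a comment
MOVE_NUMBER_RE = re.compile(r"^\d+\.+$")
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def clean_movetext(movetext: str) -> str:
    """Strip comments and annotations, then collapse whitespace."""
    text = COMMENT_RE.sub(" ", movetext)
    text = NAG_RE.sub(" ", text)
    text = ANNOTATION_RE.sub("", text)
    text = CONTINUATION_RE.sub(" ", text)
    return " ".join(text.split())


def count_moves(movetext: str) -> int:
    """Count half-moves, ignoring move numbers and the result marker."""
    return sum(
        1
        for tok in movetext.split()
        if not MOVE_NUMBER_RE.match(tok) and tok not in RESULTS
    )


def iter_games(lines: Iterable[str], min_moves: int = 2) -> Iterator[str]:
    """Stream cleaned games line by line, keeping those with enough moves."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.strip()
        if HEADER_RE.match(line):
            # A header after collected movetext marks the start of the next game.
            if buffer:
                game = clean_movetext(" ".join(buffer))
                buffer = []
                if count_moves(game) >= min_moves:
                    yield game
        elif line:
            buffer.append(line)
    if buffer:
        game = clean_movetext(" ".join(buffer))
        if count_moves(game) >= min_moves:
            yield game


sample = """
[Event "Rated Blitz game"]

1. e4 { [%clk 0:03:00] } 1... e5?! { [%clk 0:02:58] } 2. Nf3 $1 Nc6 1-0

[Event "Aborted"]

1. d4 0-1
""".splitlines()

games = list(iter_games(sample))
print(games)
```

Because `iter_games` accepts any iterable of lines, it could be fed from a text stream wrapped around a decompressor instead of an in-memory list, which keeps memory use flat on large database dumps.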
* Introduces `ChessTokenizer`, a serializable tokenizer with a pre-generated vocabulary for Standard Algebraic Notation (SAN).
* The vocabulary includes all valid moves, promotions, castling, move numbers, check/mate symbols, and game termination markers.
* Implements a robust `encode` method to correctly parse PGN strings into distinct tokens.
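As a rough illustration of splitting a PGN string into distinct tokens, the sketch below uses a single regular expression. It assumes that check and mate symbols are separate tokens from the move they follow; the function names, the pattern, and the tiny demo vocabulary are hypothetical and are not taken from `ChessTokenizer`.

```python
import re
from typing import Dict, List

# Hypothetical pattern; alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
    (?P<result>1-0|0-1|1/2-1/2|\*)      # game termination markers
  | (?P<number>\d+\.)                   # move numbers
  | (?P<castle>O-O-O|O-O)               # castling, long form first
  | (?P<move>[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)  # moves, optional promotion
  | (?P<check>[+#])                     # check and mate symbols
    """,
    re.VERBOSE,
)


def tokenize(pgn: str) -> List[str]:
    """Split a PGN move string into SAN-level tokens."""
    return [m.group(0) for m in TOKEN_RE.finditer(pgn)]


def encode(pgn: str, vocab: Dict[str, int]) -> List[int]:
    """Map each token to its id; raises KeyError on out-of-vocabulary tokens."""
    return [vocab[tok] for tok in tokenize(pgn)]


tokens = tokenize("1.e4 e5 2.Qh5 Nc6 3.Bc4 Nf6 4.Qxf7# 1-0")

# Demo-only vocabulary built from the tokens themselves.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = encode("1.e4 e5 2.Qh5 Nc6 3.Bc4 Nf6 4.Qxf7# 1-0", vocab)
print(tokens)
print(ids)
```

Keeping `+` and `#` as their own tokens means the vocabulary does not need three variants of every move, at the cost of slightly longer sequences.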
* Add examples/chess_tokenizer.py: deterministic tokenizer for chess moves with comprehensive SAN vocabulary
* Transform chess.py into full training pipeline with ChessDataLoader, model training, and move generation demo
* Add tests/examples/test_chess_tokenizer.py: comprehensive test suite for chess tokenizer functionality
* Include chess-optimized model config (larger context, sliding windows) and famous opening position demos
ayeganov (Contributor) requested changes on Sep 26, 2025 and left a comment:
Looks good, needs a few touch ups.
Co-authored-by: Aleks <ayeganov@users.noreply.github.com>
a5fb57f to fd8003b
* Replace manual token generation loop with model.generate() method to eliminate code duplication
* Generate chess moves one at a time instead of batch generation for clearer logic and easier debugging
* Extract magic number 80 to GAME_PREVIEW_MAX_LENGTH constant and add type hints to module-level constants
* Remove unused torch.nn.functional import and add missing return type hints
fd8003b to b4441db
ayeganov approved these changes on Oct 8, 2025
dariocazzani added a commit that referenced this pull request on Oct 8, 2025
* Add chess training example with Lichess database parser
  * Add examples/chess.py: downloads and parses Lichess PGN databases for transformer training
  * Streams large PGN files, removes metadata/annotations, filters games with 2+ moves
  * Uses temporary directory, outputs clean move sequences ready for training
  * Add zstandard dependency in examples-dependencies optional group for PGN decompression
* feat: Add deterministic SAN ChessTokenizer
  * Introduces `ChessTokenizer`, a serializable tokenizer with a pre-generated vocabulary for Standard Algebraic Notation (SAN).
  * The vocabulary includes all valid moves, promotions, castling, move numbers, check/mate symbols, and game termination markers.
  * Implements a robust `encode` method to correctly parse PGN strings into distinct tokens.
* Complete chess training example with ChessTokenizer
  * Add examples/chess_tokenizer.py: deterministic tokenizer for chess moves with comprehensive SAN vocabulary
  * Transform chess.py into full training pipeline with ChessDataLoader, model training, and move generation demo
  * Add tests/examples/test_chess_tokenizer.py: comprehensive test suite for chess tokenizer functionality
  * Include chess-optimized model config (larger context, sliding windows) and famous opening position demos
* Explicit boolean
  Co-authored-by: Aleks <ayeganov@users.noreply.github.com>
* Refactor chess.py: remove duplication and magic numbers
  * Replace manual token generation loop with model.generate() method to eliminate code duplication
  * Generate chess moves one at a time instead of batch generation for clearer logic and easier debugging
  * Extract magic number 80 to GAME_PREVIEW_MAX_LENGTH constant and add type hints to module-level constants
  * Remove unused torch.nn.functional import and add missing return type hints

---------

Co-authored-by: Aleksandr V Yeganov <ayeganov@gmail.com>
Co-authored-by: Aleks <ayeganov@users.noreply.github.com>
What does this PR do?
This PR adds a complete chess move prediction example that demonstrates training transformers on domain-specific sequential data. It includes a Lichess database downloader, PGN parser, chess-specific tokenizer, and training pipeline that learns chess patterns from grandmaster games without knowing the rules.
Details
* `examples/chess.py`: Complete chess training pipeline with Lichess database downloading, PGN parsing, and move generation demo
* `examples/chess_tokenizer.py`: Domain-specific tokenizer for Standard Algebraic Notation with comprehensive chess vocabulary (~10k tokens)
* `ChessTokenizer` with `save/load`, move encoding, castling, and promotion testing
* `README` with instructions for installing optional example dependencies
* `zstandard` as optional dependency for chess database decompression

Highlights
The chess tokenizer creates a deterministic vocabulary covering all valid chess moves:
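One way such a deterministic vocabulary could be generated is by enumerating SAN strings in a fixed order, as in the sketch below. This is an assumption-laden illustration rather than the PR's generator: `build_san_vocab`, the special tokens, and the move-number limit are hypothetical, the enumeration over-generates (it includes syntactically plausible strings that can never occur in a legal game), and it omits double disambiguation such as `Qa1b2`.

```python
from itertools import product
from typing import Dict, List

FILES = "abcdefgh"
RANKS = "12345678"
SQUARES = [f + r for f, r in product(FILES, RANKS)]


def build_san_vocab(max_move_number: int = 200) -> List[str]:
    """Enumerate SAN-like tokens in a fixed, reproducible order (hypothetical helper)."""
    tokens: List[str] = ["<pad>", "<unk>"]  # assumed special tokens
    tokens += ["O-O", "O-O-O", "+", "#", "1-0", "0-1", "1/2-1/2", "*"]
    tokens += [f"{n}." for n in range(1, max_move_number + 1)]

    # Piece moves, with optional capture and single-character disambiguation.
    for piece in "KQRBN":
        for capture in ("", "x"):
            for sq in SQUARES:
                tokens.append(piece + capture + sq)
                if piece != "K":  # only one king per side, so no disambiguation
                    for hint in FILES + RANKS:
                        tokens.append(piece + hint + capture + sq)

    # Pawn pushes and captures; moves onto rank 1 or 8 must promote.
    for i, file in enumerate(FILES):
        for rank in RANKS:
            target = file + rank
            prefixes = [""] + [FILES[j] + "x" for j in (i - 1, i + 1) if 0 <= j < 8]
            for prefix in prefixes:
                if rank in "18":
                    tokens += [prefix + target + "=" + p for p in "QRBN"]
                else:
                    tokens.append(prefix + target)
    return tokens


vocab = build_san_vocab()
token_to_id: Dict[str, int] = {tok: i for i, tok in enumerate(vocab)}
print(len(vocab), token_to_id["Nf3"], token_to_id["exd8=Q"])
```

Because the token order depends only on the nested loops and not on any training data, two runs produce identical ids, which is what makes saved models and saved tokenizers interchangeable across machines.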
The training pipeline demonstrates domain adaptation with chess-optimized configuration:
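The commit messages mention a larger context and sliding windows. Assuming "sliding windows" refers to sliding-window (local) causal attention, the idea can be sketched in plain Python as a boolean mask. The function name and the sizes used here are illustrative only and do not reflect the configuration values used in `examples/chess.py`.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    A position sees itself and at most the (window - 1) positions before it,
    and never any later position.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_causal_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a window shorter than the context, attention cost per token stays bounded while stacked layers can still propagate information from earlier moves in the game.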