Byte-Pair Encoding (BPE), a subword-based tokenization algorithm, implemented from scratch in Python
Updated Jan 30, 2023 - Python
An educational Python project for learning tokenization step by step by building character-level, byte-level, and BPE tokenizers from scratch.
LLM-inspired BiLSTM pipeline for real-time, multi-label toxicity inference across adversarial discourse modalities.
A clean, educational implementation of the Byte Pair Encoding algorithm used in modern language models like GPT.
Implementation of Deep Averaging Networks (DAN) for sentiment classification with experiments on GloVe embeddings and subword tokenization using Byte Pair Encoding (BPE).
Custom BPE tokenizer built from scratch on WikiText-2 (30k vocab). Covers data cleaning, deduplication, HuggingFace tokenizers training, evaluation (compression ratio, UNK-free coverage, consistency), and save/reload as PreTrainedTokenizerFast.
Paper: A Comparison of Different Tokenization Methods for the Georgian Language
This repository hosts our comprehensive study on text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to discussions on multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on LLMs.
BPE & Unigram Vocab Training library
A minimal Python implementation of Byte Pair Encoding (BPE) with step-by-step visualization of merge operations and vocabulary updates.
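Several of the projects above build BPE from scratch. As a rough illustration of what the merge loop in such an implementation might look like, here is a minimal sketch. It is not taken from any of the listed repositories; the function name `train_bpe`, its parameters, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (illustrative sketch)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties are broken by taking the lexicographically largest
        # pair, purely to keep the result deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        # Rewrite every word, replacing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
        merges.append(best)
    return merges, vocab
```

Calling `train_bpe(["low", "low", "lower", "lowest"], 2)` returns the learned merges in order together with the re-segmented word counts. Production tokenizers typically add byte-level fallback, end-of-word markers, and faster pair counting, none of which are shown here.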