subword-tokenization

Here are 10 public repositories matching this topic...

DolbyUUU / byte_pair_encoding_BPE_subword_tokenization_implementation_python

Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementations from scratch in Python

python nlp natural-language-processing tokenizer data-preprocessing data-cleaning bpe byte-pair-encoding subword-tokenization

Updated Jan 30, 2023
Python

Burakkylmz / tokenizer-workshop

An educational Python project for learning tokenization step by step by building character-level, byte-level, and BPE tokenizers from scratch.

python nlp tokenizer text-processing tokenization educational-project bpe byte-pair-encoding regextokenizer llm subword-tokenization bpe-tokenizer byte-pair-tokenizer char-tokeneizer

Updated Apr 27, 2026
Python

SD7Campeon / Comment-Toxicity-Detection-and-Classification

LLM-inspired BiLSTM pipeline for real-time, multi-label toxicity inference across adversarial discourse modalities.

nlp sklearn transformer discourse-analysis multi-label-classification affective-computing keras-tensorflow text-vectorization bilstm nlp-pipeline deep-sequential-model toxicity-analysis toxicity-prediction toxicity-detection toxicity-classification llm subword-tokenization real-time-inference contextual-nlp

Updated May 15, 2025
Jupyter Notebook

moralesangel / BPE-tokenizer

A clean, educational implementation of the Byte Pair Encoding algorithm used in modern language models like GPT.

nlp machine-learning natural-language-processing deep-learning tokenizer transformers text-processing gpt tokenization bpe byte-pair-encoding large-language-models llm generative-ai subword-tokenization

Updated Nov 11, 2025
Jupyter Notebook

rashi-bhansali / subword-dan-sentiment-analysis

Implementation of Deep Averaging Networks (DAN) for sentiment classification with experiments on GloVe embeddings and subword tokenization using Byte Pair Encoding (BPE).

nlp deep-learning sentiment-analysis text-classification pytorch neural-networks dan glove-embeddings byte-pair-encoding subword-tokenization

Updated Jan 31, 2026
Python

SANJAI-s0 / Wikitext_2-BPE-Tokenizer

Custom BPE tokenizer built from scratch on WikiText-2 (30k vocab). Covers data cleaning, deduplication, HuggingFace tokenizers training, evaluation (compression ratio, UNK-free coverage, consistency), and save/reload as PreTrainedTokenizerFast.

python nlp jupyter-notebook language-modeling text-preprocessing byte-pair-encoding huggingface tokenizers subword-tokenization bpe-tokenizer wikitext2

Updated Apr 14, 2026
Jupyter Notebook

bmikaberidze / tokenizers-for-georgian

Paper: A Comparison of Different Tokenization Methods for the Georgian Language

nlp tokenization low-resource-languages georgian-language subword-tokenization

Updated Jan 9, 2026
Python

TDRH-Undergraduate-Students / tokenization

This repository hosts our study of text tokenization methods. It covers word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on LLMs.

tokenization subword-tokenization