This repository contains automatically generated word frequency lists for all languages available in the Wikimedia Wikipedia dataset.
The lists are created using a unified, reproducible pipeline that performs:
- streaming data extraction
- reservoir sampling
- wiki markup cleaning
- script detection
- tokenization
- frequency analysis
The goal is to provide clean, comparable lexical resources for multilingual NLP, linguistics, and language modeling.
- Supports all languages in the `wikimedia/wikipedia` dataset
- Streaming mode: no full dumps are downloaded
- Reservoir sampling: unbiased random selection
- Seeded randomness (`seed = 42`): fully reproducible
- Automatic script detection (Latin, Cyrillic, Devanagari, Hangul, CJK, Arabic, etc.)
- Cleans wiki markup using `mwparserfromhell`
- Generates up to 1,000,000 most frequent words per language
- Saves results in simple `.txt` files
Each language produces a file:
`freq_<language_code>.txt`
Example:
```
freq_en.txt
freq_ru.txt
freq_uz.txt
freq_hi.txt
freq_ko.txt
```
Each file contains tab-separated pairs:

```
word<TAB>frequency
```
Example:
```
the	123456
and	98765
to	87654
```
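This tab-separated format is trivial to consume downstream. A minimal loader sketch (the helper name `load_freq_list` is illustrative, not part of the repository):

```python
def load_freq_list(path):
    """Load a freq_<lang>.txt file (one "word<TAB>count" pair per line)
    into a dict mapping word -> count."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            freqs[word] = int(count)
    return freqs
```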
The script automatically identifies the latest dump date for each language:
`get_dataset_config_names("wikimedia/wikipedia")`

It selects the newest `<date>.<lang>` configuration for every language.
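The selection logic can be sketched as a pure function over the config names. Only `get_dataset_config_names` comes from the `datasets` library; the helper below is hypothetical, and the names are passed in directly so the sketch needs no network access:

```python
def newest_configs(config_names):
    """Given config names of the form "<date>.<lang>" (e.g. "20231101.en"),
    return the newest configuration for each language.
    Dates are YYYYMMDD, so plain string comparison orders them correctly."""
    latest = {}
    for name in config_names:
        date, lang = name.split(".", 1)
        if lang not in latest or date > latest[lang]:
            latest[lang] = date
    return {lang: f"{date}.{lang}" for lang, date in latest.items()}
```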
For each language:
- The dataset is streamed (no full download)
- Up to 200,000 articles are scanned
- Exactly 100 articles are selected using reservoir sampling
- Randomness is controlled with `random.seed(42)`

This ensures reproducibility.
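The sampling step above is classic reservoir sampling (Algorithm R), which keeps a uniform random sample of `k` items from a stream without knowing its length in advance. A minimal sketch under the same seed convention (the function name is illustrative):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniformly sample up to k items from an iterable of unknown length."""
    random.seed(seed)  # seeded for reproducibility, as in the pipeline
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen reservoir slot.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```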
Each article’s text field is cleaned using:
`mwparserfromhell.parse(raw).strip_code()`

This removes:
- templates
- tables
- categories
- HTML tags
- wiki links
- formatting
A sample of the collected articles is analyzed to determine the dominant writing system.
The script detector supports:
- Latin
- Cyrillic
- Arabic
- Devanagari
- Hangul
- CJK
- Hebrew
- Greek
- Armenian
- and many more
The detector selects the regex with the highest match score.
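A simplified sketch of that highest-match-score detection, using a reduced subset of the supported scripts (the Unicode ranges and names below are illustrative, not the script's exact tables):

```python
import re

# One character-class regex per writing system (simplified subset).
SCRIPT_PATTERNS = {
    "latin": re.compile(r"[A-Za-z]"),
    "cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "arabic": re.compile(r"[\u0600-\u06FF]"),
    "devanagari": re.compile(r"[\u0900-\u097F]"),
    "hangul": re.compile(r"[\uAC00-\uD7AF]"),
    "cjk": re.compile(r"[\u4E00-\u9FFF]"),
    "greek": re.compile(r"[\u0370-\u03FF]"),
}

def detect_script(text):
    """Count matches for each script's regex and return the best-scoring one."""
    scores = {name: len(pat.findall(text)) for name, pat in SCRIPT_PATTERNS.items()}
    return max(scores, key=scores.get)
```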
Tokens are extracted using the detected script regex.
The pipeline:
- lowercases all text
- removes digits
- removes punctuation
- removes URLs
- keeps only alphabetic tokens in the native script
- does not perform lemmatization
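The steps above can be sketched for a Latin-script language as follows (the function name is illustrative; digits and punctuation fall away automatically because the final regex keeps only letters of the detected script):

```python
import re

def tokenize(text, script_pattern=r"[a-z]+"):
    """Lowercase, strip URLs, and keep only alphabetic tokens
    matching the detected script's regex (Latin by default)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    return re.findall(script_pattern, text)    # drops digits and punctuation
```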
All tokens are counted using collections.Counter.
The top 1,000,000 words (or fewer if the language has fewer unique tokens) are saved.
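The counting and truncation step is a direct use of `collections.Counter`; a minimal sketch (the helper name is illustrative):

```python
from collections import Counter

def top_words(tokens, n=1_000_000):
    """Count tokens and return the n most frequent (word, count) pairs,
    or fewer if the vocabulary is smaller than n."""
    return Counter(tokens).most_common(n)
```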
```
.
├── freq_en.txt
├── freq_ru.txt
├── freq_uz.txt
├── freq_hi.txt
├── freq_ko.txt
├── freq_<lang>.txt
└── script.py
```
- Python 3.9+
- `datasets`
- `mwparserfromhell`
Install:

`pip install datasets mwparserfromhell`

Run:

`python script.py`

The script will:
- detect all available Wikipedia languages
- sample 100 articles per language
- generate frequency lists
- save them to `freq_<lang>.txt`