This is the official repository for the ICML 2026 paper "Dispersion loss counteracts embedding condensation and improves generalization in small language models".
Please raise issues here.
We encourage you to read the illustrated walkthrough of the paper on the project website.
This paper presents an observation-driven improvement to language model training.
We observe a geometric phenomenon, which we term embedding condensation: in smaller language models, token embeddings collapse into a narrow cone-like subspace. We then design a training objective, the dispersion loss, to counteract this effect.
Feature 1: Larger models, less condensation.
Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.
The effect is also robust to the choice of input dataset.
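As a rough illustration of how condensation can be quantified, the sketch below computes the mean pairwise cosine similarity of the rows of an embedding matrix. This is a minimal stand-in written for this README; the estimator actually used by `compute_embedding_cossim.py` and the paper may differ.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def mean_pairwise_cossim(emb: torch.Tensor) -> float:
    """Mean pairwise cosine similarity of the rows of an embedding matrix.
    Values near 1 indicate condensation (rows share a narrow cone);
    values near 0 indicate well-dispersed directions."""
    e = F.normalize(emb, dim=-1)      # unit-norm rows
    sim = e @ e.T                     # (V, V) cosine-similarity matrix
    v = sim.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    return ((sim.sum() - v) / (v * (v - 1))).item()

# Condensed toy embedding: all rows close to one shared direction.
base = torch.randn(1, 64)
condensed = base + 0.05 * torch.randn(512, 64)
# Dispersed toy embedding: independent random directions.
dispersed = torch.randn(512, 64)
```

On these toy matrices, the condensed embedding scores close to 1 and the dispersed one close to 0, which is the contrast the per-model-family plots visualize.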
Feature 2: Reproducible when controlling for confounders.
To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.
Feature 3: Condensation occurs early on.
The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.
Feature 4: Distillation is not a solution.
Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.
Dispersion loss
Embedding condensation reduces the expressivity of Transformers by collapsing token embedding vectors into narrow cones, under-utilizing the representation space. We hypothesize that by dispersing embeddings during training, smaller models can achieve representational qualities more similar to larger models, thus narrowing the performance gap without increasing the number of parameters.
Our dispersion loss is inspired by the "Diffuse and Disperse" paper with practical modifications.
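As a hypothetical reading of the `angular_spread` option used below (the exact objective lives in `LM_dispersion` and the paper, and may differ), a dispersion penalty can be sketched as the mean pairwise cosine similarity of a batch of representations, added to the language-modeling loss with a small coefficient:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def angular_spread_dispersion(h: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity of a set of representations.
    Minimizing this term pushes vectors apart on the unit sphere."""
    h = F.normalize(h, dim=-1)
    sim = h @ h.T
    n = sim.shape[0]
    return (sim.sum() - n) / (n * (n - 1))  # drop the diagonal of ones

# Combined-objective sketch, mirroring the --dispersion_coeff 0.1 flag:
hidden = torch.randn(128, 768, requires_grad=True)  # stand-in representations
lm_loss = torch.tensor(0.0)                         # stand-in for the CE loss
loss = lm_loss + 0.1 * angular_spread_dispersion(hidden)
loss.backward()                                     # gradients reach `hidden`
```

For fully condensed inputs (all vectors parallel) the penalty equals 1, and it shrinks as the vectors spread out, so gradient descent on the combined loss actively disperses the representations.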
Dispersion loss counteracts the embedding condensation effect during mid-training and pre-training. A qualitative result is shown below, while more quantitative results can be found in the paper.
Please see our project website for disclaimers and some future directions we suggest.
Run the following under `key_observations`.
- Compute the embeddings.
# NOTE: Some runs omit `--gpu` because it would cause CUDA OOM on our device. If your hardware allows, you can enable the `--gpu` flag for those runs as well.
python compute_embedding_cossim.py --model-id gpt2 --gpu && \
python compute_embedding_cossim.py --model-id gpt2-medium --gpu && \
python compute_embedding_cossim.py --model-id gpt2-large --gpu && \
python compute_embedding_cossim.py --model-id gpt2-xl --gpu
python compute_embedding_cossim.py --model-id Qwen/Qwen-1_8B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen-7B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen-14B && \
python compute_embedding_cossim.py --model-id Qwen/Qwen-72B
python compute_embedding_cossim.py --model-id Qwen/Qwen2.5-0.5B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen2.5-1.5B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen2.5-3B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen2.5-7B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen2.5-14B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen2.5-32B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen2.5-72B --gpu
python compute_embedding_cossim.py --model-id Qwen/Qwen3-0.6B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen3-1.7B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen3-4B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen3-8B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen3-14B --gpu && \
python compute_embedding_cossim.py --model-id Qwen/Qwen3-32B --gpu
python compute_embedding_cossim.py --model-id bigscience/bloom-560m --gpu && \
python compute_embedding_cossim.py --model-id bigscience/bloom-1b1 --gpu && \
python compute_embedding_cossim.py --model-id bigscience/bloom-1b7 --gpu && \
python compute_embedding_cossim.py --model-id bigscience/bloom-3b --gpu && \
python compute_embedding_cossim.py --model-id bigscience/bloom-7b1 --gpu
python compute_embedding_cossim.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --gpu && \
python compute_embedding_cossim.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --gpu && \
python compute_embedding_cossim.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --gpu && \
python compute_embedding_cossim.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --gpu
- Plot the embeddings.
python plot_trend.py --model-id gpt2 gpt2-medium gpt2-large gpt2-xl --model-family gpt2
python plot_trend.py --model-id Qwen-Qwen-1_8B Qwen-Qwen-7B Qwen-Qwen-14B Qwen-Qwen-72B --model-family Qwen1
python plot_trend.py --model-id Qwen-Qwen2.5-0.5B Qwen-Qwen2.5-1.5B Qwen-Qwen2.5-3B Qwen-Qwen2.5-7B Qwen-Qwen2.5-14B Qwen-Qwen2.5-32B Qwen-Qwen2.5-72B --model-family Qwen2.5
python plot_trend.py --model-id Qwen-Qwen3-0.6B Qwen-Qwen3-1.7B Qwen-Qwen3-4B Qwen-Qwen3-8B Qwen-Qwen3-14B Qwen-Qwen3-32B --model-family Qwen3
python plot_trend.py --model-id bigscience-bloom-560m bigscience-bloom-1b1 bigscience-bloom-1b7 bigscience-bloom-3b bigscience-bloom-7b1 --model-family bloom
python plot_trend.py --paired --model-id Qwen-Qwen2.5-Math-1.5B Qwen-Qwen2.5-Math-7B Qwen-Qwen2.5-14B Qwen-Qwen2.5-32B deepseek-ai-DeepSeek-R1-Distill-Qwen-1.5B deepseek-ai-DeepSeek-R1-Distill-Qwen-7B deepseek-ai-DeepSeek-R1-Distill-Qwen-14B deepseek-ai-DeepSeek-R1-Distill-Qwen-32B --model-family Qwen2.5-distill
- Try different metrics.
python plot_trend.py --model-id gpt2 gpt2-medium gpt2-large gpt2-xl --model-family gpt2 --last-n
- Try different input datasets.
python compute_embedding_cossim.py --model-id gpt2 --gpu --dataset pubmed && \
python compute_embedding_cossim.py --model-id gpt2-medium --gpu --dataset pubmed && \
python compute_embedding_cossim.py --model-id gpt2-large --gpu --dataset pubmed && \
python compute_embedding_cossim.py --model-id gpt2-xl --gpu --dataset pubmed
python compute_embedding_cossim.py --model-id gpt2 --gpu --dataset imdb && \
python compute_embedding_cossim.py --model-id gpt2-medium --gpu --dataset imdb && \
python compute_embedding_cossim.py --model-id gpt2-large --gpu --dataset imdb && \
python compute_embedding_cossim.py --model-id gpt2-xl --gpu --dataset imdb
python compute_embedding_cossim.py --model-id gpt2 --gpu --dataset squad && \
python compute_embedding_cossim.py --model-id gpt2-medium --gpu --dataset squad && \
python compute_embedding_cossim.py --model-id gpt2-large --gpu --dataset squad && \
python compute_embedding_cossim.py --model-id gpt2-xl --gpu --dataset squad
python plot_trend.py --model-id gpt2 gpt2-medium gpt2-large gpt2-xl --model-family gpt2 --dataset pubmed
python plot_trend.py --model-id gpt2 gpt2-medium gpt2-large gpt2-xl --model-family gpt2 --dataset imdb
python plot_trend.py --model-id gpt2 gpt2-medium gpt2-large gpt2-xl --model-family gpt2 --dataset squad
For example, run the following under `LM_dispersion/midtrain_gpt2_huggingface`.
- Default loss
accelerate launch midtrain_gpt2.py --lr 5e-5 --train_tokens 200_000_000 --hf_token $HUGGINGFACE_ACCESS_TOKEN --cache_dir $SCRATCH_DIR --seed $SEED --per_device_train_batch_size 32 --gradient_accumulation_steps 4
- Dispersion loss
accelerate launch midtrain_gpt2.py --lr 5e-5 --train_tokens 200_000_000 --dispersion 'angular_spread' --dispersion_loc 'all' --dispersion_coeff 0.1 --hf_token $HUGGINGFACE_ACCESS_TOKEN --cache_dir $SCRATCH_DIR --seed $SEED --per_device_train_batch_size 32 --gradient_accumulation_steps 4
- Baseline methods
accelerate launch midtrain_gpt2_other_counter_condensation.py --lr 5e-5 --train_tokens 200_000_000 --noisy_embedding --hf_token $HUGGINGFACE_ACCESS_TOKEN --cache_dir $SCRATCH_DIR --seed $SEED --per_device_train_batch_size 32 --gradient_accumulation_steps 4
accelerate launch midtrain_gpt2_other_counter_condensation.py --lr 5e-5 --train_tokens 200_000_000 --active_forgetting --hf_token $HUGGINGFACE_ACCESS_TOKEN --cache_dir $SCRATCH_DIR --seed $SEED --per_device_train_batch_size 32 --gradient_accumulation_steps 4
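One plausible reading of the `--noisy_embedding` baseline is a NEFTune-style perturbation, which adds bounded uniform noise to the input embeddings during training; the sketch below illustrates the idea, though the repository's actual implementation may differ.

```python
import torch

torch.manual_seed(0)

def add_embedding_noise(emb: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """NEFTune-style noisy embeddings: add uniform noise of magnitude
    alpha / sqrt(seq_len * dim) to the input embeddings. This is an
    illustrative stand-in for the `--noisy_embedding` baseline."""
    _, seq_len, dim = emb.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(emb).uniform_(-1.0, 1.0) * scale
    return emb + noise

x = torch.zeros(2, 16, 32)   # (batch, seq_len, dim) stand-in embeddings
y = add_embedding_noise(x)
```

Because the noise magnitude shrinks with sequence length and embedding width, the perturbation stays small relative to typical embedding norms while still discouraging embeddings from settling into a narrow cone.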
We used TorchTitan to perform the pre-training; see that repository for details.
We developed the codebase in a Miniconda environment. The conda environment was created as follows:
# Optional: Update to libmamba solver.
conda update -n base conda
conda install -n base conda-libmamba-solver
conda config --set solver libmamba
conda create --name dispersion pytorch==2.1.0 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -c anaconda -c conda-forge -y
conda activate dispersion
conda install scikit-learn scikit-image pandas matplotlib seaborn tqdm -c pytorch -c anaconda -c conda-forge -y
python -m pip install webdataset einops open-clip-torch
python -m pip install git+https://github.com/openai/CLIP.git
python -m pip install diffusers["torch"]==0.21.4 transformers huggingface_hub==0.25.2
python -m pip install datasets sentencepiece
python -m pip install numpy==1.26
python -m pip install nltk
python -m pip install -U phate
python -m pip install trl bitsandbytes
python -m pip install "transformers==4.46.0"
python -m pip install -U transformers accelerate
python -m pip install lm-eval
If you receive this error:
libstdc++.so.6: version `GLIBCXX_3.4.29' not found
You can run:
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
@inproceedings{liu2026dispersion,
title={Dispersion loss counteracts embedding condensation and improves generalization in small language models},
author={Liu, Chen and Sun, Xingzhi and Xiao, Xi and Van Tassel, Alexandre and Xu, Ke and Reimann, Kristof and Liao, Danqi and Gerstein, Mark and Wang, Tianyang and Wang, Xiao and Krishnaswamy, Smita},
booktitle={International Conference on Machine Learning},
year={2026},
organization={PMLR}
}
- This work was initially motivated by the paper "A mathematical perspective on Transformers". We started this project in early April 2025, after watching a talk on that paper.
- The design of the dispersion loss was largely inspired by the paper "Diffuse and Disperse: Image Generation with Representation Regularization" by Runqian Wang and Kaiming He.