A collection of interpretability experiments and tools for analyzing the inner workings of transformer-based language models (e.g., GPT-2, DistilBERT).
- Activation Patching: Understand how specific layer activations affect output predictions (sketch below).
- Attribution Patching: Attribute model behavior to attention head inputs using gradients (sketch below).
- Logit Lens: Visualize token predictions across model layers (sketch below).
- Probing: Use learned linear classifiers to probe for linguistic knowledge in hidden states (sketch below).
- Dictionary Learning (via notebook): Decompose activations to discover interpretable directions (sketch below).
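A minimal activation-patching sketch using Hugging Face `transformers` and PyTorch forward hooks, assuming GPT-2. The prompts, the patched layer, and the token position are illustrative placeholders, not the repo's actual experiments.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Prompts chosen so they tokenize to the same length.
clean = tokenizer("The capital of France is", return_tensors="pt")
corrupt = tokenizer("The capital of Italy is", return_tensors="pt")
paris_id = tokenizer(" Paris")["input_ids"][0]

layer, position = 6, 3  # patch the residual stream after block 6 at the country token

# 1) Cache the clean activation at the chosen layer.
cache = {}
def save_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden states.
    cache["clean"] = output[0].detach()

handle = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Re-run the corrupted prompt, swapping in the clean activation at one position.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, position, :] = cache["clean"][:, position, :]
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits[0, -1]
handle.remove()

with torch.no_grad():
    corrupt_logits = model(**corrupt).logits[0, -1]

# How much does restoring this activation move the answer logit?
print("Δ logit(' Paris'):", (patched_logits[paris_id] - corrupt_logits[paris_id]).item())
```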
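A minimal attribution-patching sketch, simplified to the residual stream after one block rather than attention-head inputs: the effect of patching is approximated by a gradient × activation-difference term instead of a separate forward pass per patch. The prompts, layer, and metric are placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Prompts chosen so they tokenize to the same length.
clean = tokenizer("The capital of France is", return_tensors="pt")
corrupt = tokenizer("The capital of Italy is", return_tensors="pt")
paris_id = tokenizer(" Paris")["input_ids"][0]
layer = 6

# Cache the clean residual stream after the chosen block (no gradients needed).
cache = {}
handle = model.transformer.h[layer].register_forward_hook(
    lambda m, i, o: cache.update(clean=o[0].detach()))
with torch.no_grad():
    model(**clean)
handle.remove()

# Run the corrupted prompt with gradients and keep the corrupted activation.
def keep(module, inputs, output):
    output[0].retain_grad()          # keep .grad for this intermediate tensor
    cache["corrupt"] = output[0]

handle = model.transformer.h[layer].register_forward_hook(keep)
metric = model(**corrupt).logits[0, -1, paris_id]   # metric: logit of " Paris"
metric.backward()
handle.remove()

# Linear approximation of patching each position: grad · (clean - corrupt).
attribution = (cache["corrupt"].grad *
               (cache["clean"] - cache["corrupt"].detach())).sum(dim=-1)
print(attribution[0])  # one attribution score per token position
```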
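A minimal logit-lens sketch, assuming GPT-2 via Hugging Face `transformers`: each layer's residual stream is pushed through the final layer norm and the unembedding to see what the model "would predict" at that depth. The prompt is a placeholder.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors, each [batch, seq, d_model]
for layer, hidden in enumerate(outputs.hidden_states):
    # Project the residual stream at the last position through the final
    # layer norm and the unembedding to get intermediate logits.
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: {top_token!r}")
```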
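A minimal probing sketch: fit a scikit-learn logistic-regression probe on GPT-2 hidden states for a toy singular/plural task. The sentences, labels, and layer are placeholders standing in for a real probing dataset.

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

model = GPT2Model.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Toy task: is the final noun singular (0) or plural (1)?
sentences = ["The dog", "The dogs", "The car", "The cars", "The idea", "The ideas"]
labels = [0, 1, 0, 1, 0, 1]
layer = 6

features = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    features.append(hidden[0, -1].numpy())  # representation of the last token

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```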
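A minimal dictionary-learning sketch using scikit-learn's `DictionaryLearning` on residual-stream activations. The repo's notebook may use a different decomposition (e.g. a sparse autoencoder); the corpus, layer, and number of components here are placeholders.

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast
from sklearn.decomposition import DictionaryLearning

model = GPT2Model.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

texts = [
    "The cat sat on the mat.",
    "Stock prices rose sharply today.",
    "The theorem follows by induction.",
    "She opened the old wooden door.",
    "Rain fell steadily through the night.",
]
layer = 6

# Collect per-token activations from the chosen layer.
acts = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    acts.append(hidden[0])
acts = torch.cat(acts).numpy()  # [n_tokens, d_model]

# Learn a small dictionary; each token is encoded as a sparse combination
# of the learned directions.
dico = DictionaryLearning(n_components=16, alpha=1.0, max_iter=200, random_state=0)
codes = dico.fit_transform(acts)
print("dictionary shape:", dico.components_.shape)  # [n_components, d_model]
print("mean active components per token:", (codes != 0).sum(axis=1).mean())
```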