
LLMs Interpretability

A collection of interpretability experiments and tools for analyzing the inner workings of transformer-based language models (e.g., GPT-2, DistilBERT).

🔍 Features

  • Activation Patching: Understand how specific layer activations affect output predictions by splicing clean-run activations into a corrupted run (see the first sketch below).

  • Attribution Patching: Attribute model behavior to components such as attention head inputs using gradients, as a fast linear approximation to activation patching (sketch below).

  • Logit Lens: Visualize token predictions across model layers by decoding each layer's hidden states through the unembedding (sketch below).

  • Probing: Use learned linear classifiers to probe for linguistic knowledge encoded in hidden states (sketch below).

  • Dictionary Learning (via notebook): Decompose activations to discover interpretable directions, e.g. with a sparse autoencoder (sketch below).
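
🧪 Example Sketches

The sketches below are minimal illustrations of each technique, not code from this repository. They assume the Hugging Face transformers and scikit-learn APIs; the model (GPT-2), layer indices, prompts, and hyperparameters are all illustrative choices.

Activation patching: cache an activation from a "clean" prompt, splice it into a run on a "corrupted" prompt via a forward hook, and check how much of the clean prediction is restored.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # illustrative choice of layer to patch

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")

# 1. Cache the clean run's block output at the last token position.
cache = {}
def save_hook(module, inputs, output):
    cache["clean"] = output[0][:, -1, :].detach().clone()

handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2. Re-run on the corrupted prompt, overwriting that activation with the clean one.
def patch_hook(module, inputs, output):
    output[0][:, -1, :] = cache["clean"]
    return output

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt).logits[0, -1]
handle.remove()

paris = tok(" Paris").input_ids[0]
print("P(' Paris') after patching:", patched.softmax(-1)[paris].item())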
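```

Attribution patching: instead of one forward pass per patch, a single backward pass yields a linear estimate of every patch's effect, (a_clean − a_corrupt) · ∂metric/∂a. A sketch under the same illustrative assumptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # illustrative

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
paris = tok(" Paris").input_ids[0]

store = {}

# Cache the clean activation at the last token position (no gradients needed).
def cache_hook(module, inputs, output):
    store["clean"] = output[0][:, -1, :].detach().clone()

handle = model.transformer.h[LAYER].register_forward_hook(cache_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# Corrupted run: keep the live activation and capture its gradient.
def grad_hook(module, inputs, output):
    store["corrupt"] = output[0]
    output[0].register_hook(lambda g: store.update(grad=g))

handle = model.transformer.h[LAYER].register_forward_hook(grad_hook)
logits = model(**corrupt).logits
handle.remove()

metric = logits[0, -1, paris]  # logit of the clean answer on the corrupted run
metric.backward()

# Linear estimate of what patching in the clean activation would do to the metric.
delta = store["clean"] - store["corrupt"][:, -1, :].detach()
effect = (delta * store["grad"][:, -1, :]).sum().item()
print("estimated patching effect on the ' Paris' logit:", effect)
```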
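
Logit lens: decode each layer's residual stream through the model's final LayerNorm and unembedding to see how the prediction forms layer by layer. The prompt is illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

# Decode each layer's hidden state through the final LayerNorm + unembedding.
for layer, h in enumerate(hidden):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    top = logits.argmax(-1)
    print(f"layer {layer:2d} -> {tok.decode(top)!r}")
```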
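
Probing: fit a linear classifier on cached hidden states and see whether a linguistic property is linearly decodable. The toy dataset and the plural/singular task below are illustrative stand-ins:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # illustrative

# Tiny illustrative task: is the sentence's final noun plural?
texts = ["I saw the dogs", "I saw the dog", "She fed the cats",
         "She fed the cat", "He found the keys", "He found the key"]
labels = [1, 0, 1, 0, 1, 0]

feats = []
for t in texts:
    ids = tok(t, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    feats.append(h[0, -1].numpy())  # hidden state of the final token

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))
```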
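
Dictionary learning: one common approach is a sparse autoencoder that reconstructs activations from an overcomplete set of nonnegative features; the decoder columns are the learned dictionary directions. The sizes, L1 coefficient, and random stand-in data below are illustrative; real runs train on large corpora of cached model activations:

```python
import torch
import torch.nn as nn

D_MODEL, D_DICT, L1 = 768, 4096, 1e-3  # illustrative sizes and sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(D_MODEL, D_DICT)
        self.dec = nn.Linear(D_DICT, D_MODEL)  # columns = dictionary directions

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse, nonnegative feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(256, D_MODEL)  # stand-in for cached model activations
for _ in range(100):
    recon, f = sae(acts)
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    loss = (recon - acts).pow(2).mean() + L1 * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("final loss:", loss.item())
```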
