A collection of interpretability experiments and tools for analyzing the inner workings of transformer-based language models (e.g., GPT-2, DistilBERT).
- Activation Patching: Understand how specific layer activations affect output predictions (sketch below).
- Attribution Patching: Attribute model behavior to attention head inputs using gradients (sketch below).
- Logit Lens: Visualize token predictions across model layers (sketch below).
- Probing: Use learned linear classifiers to probe for linguistic knowledge in hidden states (sketch below).
- Dictionary Learning (via notebook): Decompose activations to discover interpretable directions (sketch below).
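A minimal activation-patching sketch using Hugging Face `transformers` and PyTorch forward hooks, assuming GPT-2. The prompts, the patched layer, and the token position are illustrative placeholders, not the repo's actual experiments.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Prompts chosen so they tokenize to the same length.
clean = tokenizer("The capital of France is", return_tensors="pt")
corrupt = tokenizer("The capital of Italy is", return_tensors="pt")
paris_id = tokenizer(" Paris")["input_ids"][0]

layer, position = 6, 3  # patch the residual stream after block 6 at the country token

# 1) Cache the clean activation at the chosen layer.
cache = {}
def save_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden states.
    cache["clean"] = output[0].detach()

handle = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Re-run the corrupted prompt, swapping in the clean activation at one position.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, position, :] = cache["clean"][:, position, :]
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits[0, -1]
handle.remove()

with torch.no_grad():
    corrupt_logits = model(**corrupt).logits[0, -1]

# How much does restoring this activation move the answer logit?
print("Δ logit(' Paris'):", (patched_logits[paris_id] - corrupt_logits[paris_id]).item())
```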
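A minimal attribution-patching sketch, simplified to the residual stream after one block rather than attention-head inputs: the effect of patching is approximated by a gradient × activation-difference term instead of a separate forward pass per patch. The prompts, layer, and metric are placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Prompts chosen so they tokenize to the same length.
clean = tokenizer("The capital of France is", return_tensors="pt")
corrupt = tokenizer("The capital of Italy is", return_tensors="pt")
paris_id = tokenizer(" Paris")["input_ids"][0]
layer = 6

# Cache the clean residual stream after the chosen block (no gradients needed).
cache = {}
handle = model.transformer.h[layer].register_forward_hook(
    lambda m, i, o: cache.update(clean=o[0].detach()))
with torch.no_grad():
    model(**clean)
handle.remove()

# Run the corrupted prompt with gradients and keep the corrupted activation.
def keep(module, inputs, output):
    output[0].retain_grad()          # keep .grad for this intermediate tensor
    cache["corrupt"] = output[0]

handle = model.transformer.h[layer].register_forward_hook(keep)
metric = model(**corrupt).logits[0, -1, paris_id]   # metric: logit of " Paris"
metric.backward()
handle.remove()

# Linear approximation of patching each position: grad · (clean - corrupt).
attribution = (cache["corrupt"].grad *
               (cache["clean"] - cache["corrupt"].detach())).sum(dim=-1)
print(attribution[0])  # one attribution score per token position
```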
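A minimal logit-lens sketch, assuming GPT-2 via Hugging Face `transformers`: each layer's residual stream is pushed through the final layer norm and the unembedding to see what the model "would predict" at that depth. The prompt is a placeholder.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors, each [batch, seq, d_model]
for layer, hidden in enumerate(outputs.hidden_states):
    # Project the residual stream at the last position through the final
    # layer norm and the unembedding to get intermediate logits.
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: {top_token!r}")
```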
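A minimal probing sketch: fit a scikit-learn logistic-regression probe on GPT-2 hidden states for a toy singular/plural task. The sentences, labels, and layer are placeholders standing in for a real probing dataset.

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

model = GPT2Model.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Toy task: is the final noun singular (0) or plural (1)?
sentences = ["The dog", "The dogs", "The car", "The cars", "The idea", "The ideas"]
labels = [0, 1, 0, 1, 0, 1]
layer = 6

features = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    features.append(hidden[0, -1].numpy())  # representation of the last token

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```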
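A minimal dictionary-learning sketch using scikit-learn's `DictionaryLearning` on residual-stream activations. The repo's notebook may use a different decomposition (e.g. a sparse autoencoder); the corpus, layer, and number of components here are placeholders.

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast
from sklearn.decomposition import DictionaryLearning

model = GPT2Model.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

texts = [
    "The cat sat on the mat.",
    "Stock prices rose sharply today.",
    "The theorem follows by induction.",
    "She opened the old wooden door.",
    "Rain fell steadily through the night.",
]
layer = 6

# Collect per-token activations from the chosen layer.
acts = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    acts.append(hidden[0])
acts = torch.cat(acts).numpy()  # [n_tokens, d_model]

# Learn a small dictionary; each token is encoded as a sparse combination
# of the learned directions.
dico = DictionaryLearning(n_components=16, alpha=1.0, max_iter=200, random_state=0)
codes = dico.fit_transform(acts)
print("dictionary shape:", dico.components_.shape)  # [n_components, d_model]
print("mean active components per token:", (codes != 0).sum(axis=1).mean())
```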