This is the code base for our NeurIPS 2025 paper "Revising and Falsifying Sparse Autoencoder Feature Explanations", by George Ma, Samuel Pfrommer, and Somayeh Sojoudi.
This repository provides tools and methods for automatically generating, evaluating, and iteratively improving explanations of Sparse Autoencoder (SAE) features in neural networks. The key contributions include:
- Automated explanation generation using both one-shot and iterative tree-based methods
- Simulation-based scoring that evaluates explanations by predicting feature activations
- Complementary negative examples for more robust explanation evaluation
- Support for multiple base models: Gemma-2-9B, GPT-2, and Llama-3.1-8B
- Python 3.9 or higher
- CUDA-capable GPU (recommended for running experiments)
- Virtual environment tool (venv, conda, etc.)
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate
- Run the setup script:
bash setup.sh
This will install system dependencies (pv, pbzip2) and the Python package with all required dependencies.
feature-interp/
├── featureinterp/ # Core library code
│ ├── explainer.py # OneShotExplainer and TreeExplainer implementations
│ ├── simulator.py # Feature activation simulation
│ ├── scoring.py # Explanation evaluation metrics
│ ├── record.py # Data structures for feature activation records
│ ├── core.py # Core data structures (StructuredExplanation, etc.)
│ ├── complexity.py # Explanation complexity analysis
│ └── ...
├── scripts/ # Experiment scripts
│ ├── generate_records.py # Generate activation records from models
│ ├── single_index_demo.py # Minimal demo for single feature
│ ├── complementary_sentences.py # Figure 3 experiments
│ ├── explainer_comparison.py # Figure 4 experiments
│ ├── polysemanticity_sweep.py # Figure 5 experiments
│ └── bash/ # Shell scripts for running experiments
├── paper_figs/ # Plotting scripts for paper figures
└── tests/ # Unit and integration tests
Run a minimal demonstration on a single SAE feature:
python scripts/single_index_demo.py
This will:
- Load a pre-computed SAE feature record
- Generate an explanation using an LLM
- Simulate feature activations based on the explanation
- Score the explanation quality
- Display results including training examples, explanation, and simulation results
Note: This requires pre-generated records (see Data Generation below) and access to the specified LLM models.
Before running experiments, you need to generate feature activation records from a base model and its corresponding SAE:
python scripts/generate_records.py \
--write-sae-acts \
--write-holistic-acts \
--write-records
Key parameters:
- --model-name: Base model (e.g., google/gemma-2-9b)
- --sae-name: SAE model name (e.g., gemma-scope-9b-pt-res-canonical)
- --dataset-name: Dataset for activation collection (default: monology/pile-uncopyrighted)
- --max-dataset-size: Number of sequences to process (default: 100000)
- --max-features: Number of SAE features to analyze (default: 50)
- --layers: Which layers to process (all, even, odd, or a comma-separated list)
This creates:
- SAE activations: Feature activation values for each token
- Holistic activations: Context-dependent activation attribution
- Records: Organized datasets including positive examples, negative examples, and similarity-based retrieval indices
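For illustration, here is a minimal, hypothetical parser for the --layers convention described above (all, even, odd, or a comma-separated list of indices); it is not the implementation in generate_records.py:

def parse_layers(spec: str, num_layers: int) -> list[int]:
    """Resolve a --layers spec into concrete layer indices (illustrative sketch only)."""
    if spec == "all":
        return list(range(num_layers))
    if spec == "even":
        return [i for i in range(num_layers) if i % 2 == 0]
    if spec == "odd":
        return [i for i in range(num_layers) if i % 2 == 1]
    return [int(i) for i in spec.split(",")]  # e.g. "3,7,11"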
The codebase supports two explanation generation methods:
The one-shot explainer generates explanations in a single LLM call using few-shot prompting:
from featureinterp.explainer import OneShotExplainer, OneShotExplainerParams
explainer = OneShotExplainer(
    model_name="meta-llama/llama-4-scout",
    params=OneShotExplainerParams(
        rule_cap=5,                          # Maximum number of rules
        include_holistic_expressions=False,  # Whether to include holistic context information
        structured_explanations=True,        # Use structured JSON format
    )
)
The tree explainer uses iterative refinement with simulation-based feedback:
from featureinterp.explainer import TreeExplainer, TreeExplainerParams
explainer = TreeExplainer(
    model_name="meta-llama/llama-4-scout",
    simulator_factory=simulator_factory,
    params=TreeExplainerParams(
        depth=3,     # Tree search depth
        width=3,     # Number of candidates to keep at each level
        rule_cap=5,  # Maximum number of rules
        structured_explanations=True,
    )
)
Explanations are evaluated by simulating feature activations and comparing to ground truth:
from featureinterp.scoring import simulate_and_score
simulator = simulator_factory(explanation)  # builds a simulator that predicts activations from this explanation
scored_simulation = await simulate_and_score(simulator, test_records)
score = scored_simulation.get_preferred_score()  # Correlation coefficient
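Conceptually, the score measures how well the simulated activations track the true ones. A minimal sketch of that comparison, using SciPy's Pearson correlation directly (the repo's scoring.py aggregates over records and may use a different correlation variant):

import numpy as np
from scipy.stats import pearsonr

# Per-token activations for one record (illustrative values).
true_acts = np.array([0.0, 0.0, 3.2, 0.1, 4.5, 0.0])
simulated_acts = np.array([0.2, 0.0, 2.8, 0.0, 4.0, 0.1])

# Correlation between ground-truth and simulated activations.
score, _ = pearsonr(true_acts, simulated_acts)
print(f"correlation score: {score:.3f}")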
The complementary sentences experiment (Figure 3) evaluates the impact of different complementary negative example strategies:
python scripts/complementary_sentences.py
Then generate figures:
python paper_figs/complementary_sentences_figs.py
The explainer comparison experiment (Figure 4) compares one-shot vs. tree explainers under various configurations:
bash scripts/bash/explainer_comparison.sh
Then generate figures:
python paper_figs/explainer_comparison_figs.py
The polysemanticity sweep (Figure 5) analyzes how explanation complexity (number of rules) affects explanation quality:
bash scripts/bash/polysemanticity_sweep.sh
Then generate figures:
python paper_figs/polysemanticity_figs.py
Explanations are represented as lists of rules, each with:
- activates_on (string): Pattern or content the feature responds to
- strength (int 0-5): Activation strength
Example:
[
    {"activates_on": "quotation marks at the beginning of quotes", "strength": 4},
    {"activates_on": "the start of dialogue in fiction", "strength": 3}
]
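Such a rule list can be handled as plain data. The sketch below uses a simple dataclass for illustration; it is not the repo's StructuredExplanation class (see featureinterp/core.py):

import json
from dataclasses import dataclass

@dataclass
class Rule:
    activates_on: str  # pattern or content the feature responds to
    strength: int      # activation strength on a 0-5 scale

raw = '[{"activates_on": "quotation marks at the beginning of quotes", "strength": 4}]'
rules = [Rule(**r) for r in json.loads(raw)]
print(rules[0].activates_on, rules[0].strength)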
The method uses various strategies for selecting negative examples:
- RANDOM: Random sequences from the dataset
- RANDOM_NEGATIVE: Random sequences with zero activation
- SIMILAR: Semantically similar sentences ranked by Sentence Transformer
- SIMILAR_NEGATIVE: Semantically similar sequences with zero activation (our method)
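A minimal sketch of the SIMILAR_NEGATIVE idea, assuming a generic sentence-transformers encoder and pre-computed per-sentence activations (the repo builds these retrieval indices in generate_records.py; the model name and example data below are illustrative):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

positive_text = '"Hello," she said, stepping into the room.'
candidates = [
    "'Stop right there,' he shouted.",   # semantically similar
    "The quarterly revenue grew by 4%.", # dissimilar
]
max_activation = [0.0, 0.0]  # pre-computed max feature activation per candidate (illustrative)

# Keep only zero-activation candidates, then rank them by semantic similarity.
emb_pos = encoder.encode(positive_text, convert_to_tensor=True)
emb_cand = encoder.encode(candidates, convert_to_tensor=True)
sims = util.cos_sim(emb_pos, emb_cand)[0]
ranked = sorted(
    (i for i, a in enumerate(max_activation) if a == 0.0),
    key=lambda i: -float(sims[i]),
)
complementary_negatives = [candidates[i] for i in ranked]
print(complementary_negatives)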
Beyond token-level activations, the system can analyze activation-causing tokens: earlier tokens in a sequence that cause later activations. This provides richer context for understanding feature behavior.
The experiments use several types of models:
- Explainer models: Generate natural language explanations (e.g., meta-llama/llama-4-scout, google/gemini-flash-1.5-8b)
- Simulator models: Predict activations from explanations (e.g., google/gemma-2-27b-it)
- Complexity analyzer: Evaluates explanation complexity (optional)
Models are loaded with quantization for efficient GPU usage (4-bit or 8-bit).
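A minimal sketch of 4-bit loading with Hugging Face transformers and bitsandbytes; the exact settings here are illustrative, and the experiment scripts configure this per model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # or load_in_8bit=True for 8-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "google/gemma-2-27b-it"  # simulator model listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)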
Key configuration options in experiment scripts:
# Dataset and model selection
dataset_path = 'data/pile-uncopyrighted_gemma-2-9b/records'
EXPLAINER_MODEL_NAME = "meta-llama/llama-4-scout"
SIMULATOR_MODEL_NAME = "google/gemma-2-27b-it"
# Record selection parameters
train_record_params = RecordSliceParams(
    positive_examples_per_split=10,
    complementary_examples_per_split=10,
    complementary_record_source=ComplementaryRecordSource.SIMILAR_NEGATIVE,
)
# Inference parameters
INFERENCE_BATCH_SIZE = 2
Generated data is organized as:
data/
└── {dataset}_{model}/
├── tokens.pt # Tokenized sequences
├── sae_acts/ # SAE activations per layer
│ └── {hook_name}.pt
├── holistic_acts/ # Holistic activations per layer
│ └── {hook_name}.pt
├── similarity_retriever/ # Embedding indices for retrieval
└── records/ # Organized feature records
└── {hook_name}/
└── {feature_idx}.json
Intermediate results are cached in cache/ and final results are saved to results/.
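A minimal sketch for inspecting generated data, assuming the layout above; the hook name and feature index are placeholders, and record field names are not documented here, so the snippet only prints top-level keys:

import json
import torch

base = "data/pile-uncopyrighted_gemma-2-9b"  # {dataset}_{model}

tokens = torch.load(f"{base}/tokens.pt")     # tokenized sequences
print("token tensor shape:", tuple(tokens.shape))

# Load one feature record (placeholder hook name and feature index).
with open(f"{base}/records/blocks.20.hook_resid_post/42.json") as f:
    record = json.load(f)
print("record keys:", list(record.keys()))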
The codebase supports uploading/downloading datasets via W&B:
# Upload generated data
python scripts/generate_records.py --upload-wandb
# Download pre-generated data
python scripts/generate_records.py --download-wandb
If you use our results or find our paper useful, please cite:
@inproceedings{ma2025revising,
title={{Revising and Falsifying Sparse Autoencoder Feature Explanations}},
author={Ma, George and Pfrommer, Samuel and Sojoudi, Somayeh},
booktitle={The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
year={2025}
}