A causal inference engine for deep learning training that provides structured explanations of neural network training failures. Understand why your model failed during training through semantic analysis and abductive reasoning, not raw tensor inspection.
NeuralDBG treats training as a semantic trace of learning dynamics rather than a black box. It extracts meaningful events and provides causal hypotheses about training failures, enabling researchers to:
- Identify gradient health transitions (stable -> vanishing/saturated)
- Detect activation regime shifts (normal -> saturated/dead)
- Detect optimizer instability (loss plateaus, spikes, divergence)
- Catch data anomalies (NaN, Inf, distribution shifts)
- Track propagation of instabilities through network layers
- Generate ranked causal explanations for training failures
Unlike traditional monitoring tools (TensorBoard, Weights & Biases), NeuralDBG focuses on causal inference rather than metric tracking.
| Feature | TensorBoard / W&B | NeuralDBG |
|---|---|---|
| What it shows | Graphs of loss/accuracy over time | Why the loss spiked or the gradients vanished |
| Diagnosis | Manual inspection of curves | Automated causal hypotheses |
| Actionable? | You guess the fix | Suggests root causes (LR, Init, Data) |
| Integration | Separate dashboard | One line of code in your loop |
| Privacy | W&B syncs runs to the cloud by default; TensorBoard is local | 100% local (unless you opt in) |
"TensorBoard tells you when it failed. NeuralDBG tells you why."
- Semantic Event Extraction: Detects meaningful transitions in training dynamics
- Causal Compression: Identifies first occurrences and propagation patterns
- Post-Mortem Reasoning: Provides ranked hypotheses about failure causes
- Optimizer Instability Detection: Tracks loss plateaus, spikes, and divergence
- Data Anomaly Detection: Catches NaN, Inf, and distribution shifts in inputs
- Event Collapsing: Merges sequential events into summary traces
- Compiler-Aware: Operates at module boundaries to survive torch.compile
- Non-Invasive: Wraps existing PyTorch training loops without code changes
- Minimal API: Focused on explanations, not raw data dumps
- Aquarium Export: JSON export for visualization in Aquarium IDE
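
To make the event-collapsing idea concrete, here is a minimal sketch in plain Python. It is not NeuralDBG's implementation: the `Event` and `Trace` classes and the `collapse_events` function are hypothetical names introduced only for illustration, and the rule used (merge consecutive events that share a layer and a kind) is an assumption about what "sequential" means.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical raw event: one observation at one training step."""
    step: int
    layer: str
    kind: str  # e.g. "vanishing_gradients"


@dataclass
class Trace:
    """Hypothetical summary of a run of identical consecutive events."""
    layer: str
    kind: str
    first_step: int
    last_step: int
    count: int


def collapse_events(events):
    """Merge runs of consecutive events that share the same layer and kind."""
    traces = []
    for ev in sorted(events, key=lambda e: e.step):
        last = traces[-1] if traces else None
        if last and last.layer == ev.layer and last.kind == ev.kind:
            # Same condition as the previous event: extend the current trace.
            last.last_step = ev.step
            last.count += 1
        else:
            # A different layer or kind starts a new trace.
            traces.append(Trace(ev.layer, ev.kind, ev.step, ev.step, 1))
    return traces


events = [
    Event(10, "linear1", "vanishing_gradients"),
    Event(11, "linear1", "vanishing_gradients"),
    Event(12, "linear1", "vanishing_gradients"),
    Event(13, "linear2", "dead_neurons"),
]
traces = collapse_events(events)
for t in traces:
    print(f"{t.kind} in {t.layer}: steps {t.first_step}-{t.last_step} ({t.count} events)")
```

Keeping `first_step` on each trace preserves the first occurrence of a condition, which is the piece of information a causal summary needs most.
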
pip install neuraldbg

import torch
import torch.nn as nn
from neuraldbg import NeuralDbg
# Your existing model and training setup
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
# Wrap your training loop
with NeuralDbg(model) as dbg:
    for step, (inputs, targets) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        dbg.record_loss(loss.item())
        optimizer.step()
# After training failure, query for explanations
explanations = dbg.explain_failure()
print(explanations[0])  # "Gradient vanishing originated in layer 'linear1' at step 234..."

# Get ranked causal hypotheses for the failure
hypotheses = dbg.get_causal_hypotheses()
# Query specific causal chains
chain = dbg.trace_causal_chain('vanishing_gradients')
# Check for coupled failures
couplings = dbg.detect_coupled_failures()
# Export to Aquarium (JSON)
dbg.export_aquarium_package('debug_session.json')

with NeuralDbg(model) as dbg:
    for step in range(num_steps):
        dbg.step = step
        optimizer.zero_grad()  # clear gradients left over from the previous step
        output = model(inputs)
        loss = criterion(output, targets)
        loss.backward()
        dbg.record_loss(loss.item())
        optimizer.step()
# Detect loss plateaus, spikes, or divergence
hypotheses = dbg.explain_failure("optimizer_instability")
for h in hypotheses:
    print(h.description)

Data anomalies (NaN, Inf, distribution shifts) are detected automatically from layer inputs during the forward pass:
with NeuralDbg(model) as dbg:
    # ... training loop ...
    pass
hypotheses = dbg.explain_failure("data_anomaly")
for h in hypotheses:
    print(h.description)  # "NaN values detected in input to layer 'linear1'..."

NeuralDBG has been validated across 9 architectures and runtime configurations:
| Architecture | Failure Modes Tested |
|---|---|
| Transformer (nanoGPT) | Attention collapse, NaN softmax, LR warmup |
| GANs (DCGAN) | Vanishing, exploding, NaN injection |
| LLM fine-tuning (LoRA) | Catastrophic forgetting, loss spikes |
| Diffusion (DDPM) | NaN UNet, exploding gradients |
| LSTM / Time Series | Vanishing recurrent gradients |
| GNN (GCN/GAT) | Oversmoothing, deep GNN |
| RL (PPO-style) | Policy collapse, value explosion |
| torch.compile | Dynamo graph compatibility |
| DataParallel | Multi-GPU hook integrity |
| Failure Type | Description |
|---|---|
| `vanishing_gradients` | Root cause + saturation coupling |
| `exploding_gradients` | First layer to explode |
| `dead_neurons` | Neuron death in activation layers |
| `saturated_activations` | Activation saturation patterns |
| `optimizer_instability` | Loss plateaus, spikes, divergence |
| `data_anomaly` | NaN/Inf/distribution shift in inputs |
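
As an illustration of what the `optimizer_instability` category covers, the sketch below labels the latest loss value as a plateau, spike, or divergence. It is a simplified heuristic, not the detector NeuralDBG ships: the function name `classify_loss`, the window size, and every threshold are assumptions chosen for the example, and it assumes positive loss values.

```python
import math
from statistics import mean


def classify_loss(history, window=5, spike_ratio=3.0, plateau_tol=1e-3):
    """Label the latest loss relative to the preceding `window` values.

    All thresholds are illustrative assumptions, not NeuralDBG defaults.
    """
    latest = history[-1]
    if math.isnan(latest) or math.isinf(latest):
        return "divergence"  # a non-finite loss is treated as divergence
    if len(history) <= window:
        return "stable"  # not enough history to judge yet
    previous = history[-window - 1:-1]
    recent = previous + [latest]
    if latest > spike_ratio * mean(previous):
        return "spike"  # sudden jump above the recent average
    if all(b > a for a, b in zip(recent, recent[1:])):
        return "divergence"  # loss rose at every step of the window
    if max(recent) - min(recent) < plateau_tol:
        return "plateau"  # loss has effectively stopped moving
    return "stable"


print(classify_loss([1.0, 0.9, 0.8, 0.7, 0.6, 0.5]))
print(classify_loss([0.5, 0.5, 0.5, 0.5, 0.5, 2.0]))
print(classify_loss([0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))
print(classify_loss([0.1, 0.2, 0.3, 0.4, 0.5, 0.6]))
```

In a training loop, a check like this would run on the list of values passed to `record_loss()`; how NeuralDBG itself stores and windows those values is not specified here.
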
- Semantic Event Extractor: Detects meaningful transitions in learning dynamics
- Causal Compressor: Identifies patterns and propagation in training failures
- Post-Mortem Reasoner: Generates ranked hypotheses about failure causes
- Compiler-Aware Monitor: Operates at safe boundaries for optimization compatibility
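
One way to picture the Post-Mortem Reasoner's output is as a sorted list of candidate causes. The sketch below ranks hypotheses by earliest onset, on the assumption that a root cause should appear before its downstream effects, and breaks ties by the number of supporting events. The `Hypothesis` fields other than `description` and the `rank_hypotheses` function are hypothetical, and the proprietary Engine may rank very differently.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical container for one candidate explanation."""
    description: str
    first_step: int  # step at which the condition was first observed
    support: int     # number of events backing the hypothesis


def rank_hypotheses(hypotheses):
    """Order hypotheses: earliest onset first, then strongest support."""
    return sorted(hypotheses, key=lambda h: (h.first_step, -h.support))


candidates = [
    Hypothesis("Loss divergence", first_step=250, support=4),
    Hypothesis("Saturated activations in 'linear1'", first_step=230, support=12),
    Hypothesis("Vanishing gradients in 'linear1'", first_step=234, support=9),
]
ranked = rank_hypotheses(candidates)
for rank, h in enumerate(ranked, start=1):
    print(f"{rank}. {h.description} (first seen at step {h.first_step})")
```

Ordering by onset is only a heuristic: temporal precedence is necessary for causation but does not establish it, which is why the output is a ranked list of hypotheses rather than a single verdict.
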
| Event Type | Source | Detects |
|---|---|---|
| `gradient_health_transition` | Backward hooks | Vanishing, exploding, saturated gradients |
| `activation_regime_shift` | Forward hooks | Dead neurons, saturated activations |
| `optimizer_instability` | `record_loss()` | Loss plateaus, spikes, divergence |
| `data_anomaly` | Forward hooks (inputs) | NaN, Inf, distribution shifts |
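
The difference between a semantic event and a raw metric can be sketched with gradient norms. In the example below, per-step gradient norms for one layer are mapped to a health state, and an event is emitted only when that state changes. The thresholds, the dictionary layout, and the helper names `health` and `extract_transitions` are assumptions made for illustration; in a real run the norms could be collected with PyTorch's `Module.register_full_backward_hook`, which is omitted here to keep the sketch dependency-free.

```python
def health(norm, vanish_below=1e-7, explode_above=1e3):
    """Map a gradient norm to a coarse health state (assumed thresholds)."""
    if norm < vanish_below:
        return "vanishing"
    if norm > explode_above:
        return "exploding"
    return "stable"


def extract_transitions(norms_by_step, layer):
    """Emit one event per change of health state, not one per step."""
    events = []
    previous = None
    for step, norm in enumerate(norms_by_step):
        state = health(norm)
        if previous is not None and state != previous:
            events.append({
                "type": "gradient_health_transition",
                "layer": layer,
                "step": step,
                "from": previous,
                "to": state,
            })
        previous = state
    return events


norms = [0.5, 0.3, 1e-8, 1e-9, 0.2, 5e3]
events = extract_transitions(norms, "linear1")
for ev in events:
    print(f"step {ev['step']}: {ev['layer']} {ev['from']} -> {ev['to']}")
```

Six recorded norms produce three events here, which is the point of working with transitions: the trace stays short even when training runs for many steps.
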
| Edition | Package | License | Features |
|---|---|---|---|
| Core | `pip install neuraldbg` | MIT | Hooks, events, export JSON, basic heuristics |
| Engine | `pip install neuraldbg-engine` | Proprietary | Full causal inference, detailed hypotheses, coupling detection |
The Core edition works standalone with basic heuristic fallbacks. Install the Engine for advanced causal reasoning.
- ML Researchers seeking causal explanations for training failures
- PhD Students analyzing learning dynamics in novel architectures
- Research Engineers understanding optimization instabilities
- PyTorch only
- Focus on semantic events, not tensor inspection
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
make bootstrap
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows

MIT License - see LICENSE.md for details.
- CHANGELOG.md - Version history and notable changes
- logic_graph.md - System architecture and data flow
- docs/PHASE2_DOGFOODING.md - Detailed dogfooding scenarios
If you use NeuralDBG in your research, please cite:
@misc{neuraldbg2026,
title={NeuralDBG: A Causal Inference Engine for Deep Learning Training Dynamics},
author={SENOUVO Jacques-Charles Gad},
year={2026},
url={https://github.com/LambdaSection/NeuralDBG}
}