23 changes: 17 additions & 6 deletions .github/workflows/ci.yaml
@@ -9,6 +9,10 @@ on:
branches:
- main
- develop
types:
- opened
- synchronize
- reopened

permissions:
contents: write
@@ -113,8 +117,13 @@ jobs:
fail_ci_if_error: true # Optional, ensures the CI fails if Codecov upload fails

docs:
if: github.ref == 'refs/heads/main' || (github.event_name == 'pull_request' && github.event.pull_request.base.ref == 'main')
name: Docs
runs-on: ubuntu-latest
permissions:
contents: write # needed to push to gh-pages

# Only build+deploy docs when main is updated (i.e., after PR merge)
if: github.event_name == 'push' && github.ref == 'refs/heads/main'

strategy:
matrix:
@@ -124,7 +133,7 @@
- name: Check out the repository
uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetch all history if necessary
fetch-depth: 0

- name: Set up Python
uses: actions/setup-python@v5
@@ -146,18 +155,20 @@
- name: Generate API Documentation with Sphinx
run: |
source .venv/bin/activate
sphinx-apidoc -o docs/ codes
mkdir -p docs/source/api
sphinx-apidoc -o docs/source/api codes

- name: Build HTML with Sphinx
run: |
source .venv/bin/activate
sphinx-build -b html docs/ docs/_build
sphinx-build -b html docs/source docs/_build/html

- name: Deploy Sphinx API docs to gh-pages
- name: Deploy docs to gh-pages
uses: peaceiris/actions-gh-pages@v4
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: docs/_build
publish_dir: docs/_build/html
publish_branch: gh-pages
user_name: "GitHub Actions"
user_email: "actions@github.com"
force_orphan: false
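
The updated docs job can be mirrored locally with essentially the same commands the workflow now runs — a sketch assuming the `.venv` activated in CI and the `docs/source` layout introduced in this diff:

```bash
# Reproduce the updated docs build locally (sketch; assumes .venv and the new docs/source layout)
source .venv/bin/activate
mkdir -p docs/source/api
sphinx-apidoc -o docs/source/api codes             # regenerate API stubs for the codes package
sphinx-build -b html docs/source docs/_build/html  # build into the directory published to gh-pages
```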
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,9 @@
repos:
- repo: https://github.com/psf/black
rev: 26.1.0
hooks:
- id: black
- repo: https://github.com/pycqa/isort
rev: 7.0.0
hooks:
- id: isort
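
To apply the same formatting locally, the hooks can be enabled with the standard pre-commit CLI — a sketch assuming `pre-commit` itself is installed separately (it is not pinned in this config):

```bash
# Install and run the new black/isort hooks locally (assumes pre-commit is available via pip)
pip install pre-commit
pre-commit install          # run the hooks automatically on every git commit
pre-commit run --all-files  # one-off pass over the entire repository
```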
204 changes: 50 additions & 154 deletions README.md
@@ -1,176 +1,72 @@
# CODES Benchmark

[![codecov](https://codecov.io/github/robin-janssen/CODES-Benchmark/branch/main/graph/badge.svg?token=TNF9ISCAJK)](https://codecov.io/github/robin-janssen/CODES-Benchmark)
![Static Badge](https://img.shields.io/badge/license-GPLv3-blue)
![Static Badge](https://img.shields.io/badge/NeurIPS-2024-green)
[![codecov](https://codecov.io/github/robin-janssen/CODES-Benchmark/branch/main/graph/badge.svg?token=TNF9ISCAJK)](https://codecov.io/github/robin-janssen/CODES-Benchmark) ![Static Badge](https://img.shields.io/badge/license-GPLv3-blue) ![Static Badge](https://img.shields.io/badge/NeurIPS-2024-green)

🎉 CODES was accepted to the ML4PS workshop @ NeurIPS2024 🎉
🎉 Accepted to the ML4PS workshop @ NeurIPS 2024

## Benchmarking Coupled ODE Surrogates
Benchmark coupled ODE surrogate models on curated datasets with reproducible training, evaluation, and visualization pipelines. CODES helps you answer: *Which surrogate architecture fits my data, accuracy target, and runtime budget?*

CODES is a benchmark for coupled ODE surrogate models.
## What you get

<picture>
<!-- Dark mode SVG -->
<source media="(prefers-color-scheme: dark)" srcset="docs/_static/file-alt-solid-white.svg">
<!-- Light mode SVG -->
<source media="(prefers-color-scheme: light)" srcset="docs/_static/file-alt-solid.svg">
<!-- Fallback image (light mode by default) -->
<img width="14" alt="Paper on arXiv" src="docs/_static/book-solid.svg">
</picture> CODES paper on <a href="https://arxiv.org/abs/2410.20886">arXiv</a>. <p></p>
- Baseline surrogates (MultiONet, FullyConnected, LatentNeuralODE, LatentPoly) with configurable hyperparameters
- Rich datasets spanning chemistry, astrophysics, and dynamical systems
- Optional studies for interpolation/extrapolation, sparse data regimes, uncertainty estimation, and batch scaling
- Automated reporting: accuracy tables, resource usage, gradient analyses, and dozens of diagnostic plots

<picture>
<source srcset="docs/_static/favicon-96x96.png">
<img width="15" alt="CODES Logo" src="docs/_static/favicon-96x96.png">
</picture> The main documentation can be found on the <a href="https://codes-docs.web.app/index.html">CODES website</a>. <p></p>
## Two-minute quickstart

<picture>
<!-- Dark mode SVG -->
<source media="(prefers-color-scheme: dark)" srcset="docs/_static/book-solid-white.svg">
<!-- Light mode SVG -->
<source media="(prefers-color-scheme: light)" srcset="docs/_static/book-solid.svg">
<!-- Fallback image (light mode by default) -->
<img width="14" alt="CODES API Docs" src="docs/_static/book-solid.svg">
</picture> The technical API documentation is hosted on this <a href="https://robin-janssen.github.io/CODES-Benchmark/">GitHub Page</a>.
**uv (recommended)**

## Motivation

There are many efforts to use machine learning models ("surrogates") to replace the costly numerics involved in solving coupled ODEs. But for the end user, it is not obvious how to choose the right surrogate for a given task. Usually, the best choice depends on both the dataset and the target application.

Dataset specifics - how "complex" is the dataset?

- How many samples are there?
- Are the trajectories very dynamic or are the developments rather slow?
- How dense is the distribution of initial conditions?
- Is the data domain of interest well-covered by the domain of the training set?

Task requirements:

- What is the required accuracy?
- How important is inference time? Is the training time limited?
- Are there computational constraints (memory or processing power)?
- Is uncertainty estimation required (e.g. to replace uncertain predictions by numerics)?
- How much predictive flexibility is required? Do we need to interpolate or extrapolate across time?

Besides these practical considerations, one overarching question is always: Does the model only learn the data, or does it "understand" something about the underlying dynamics?

## Goals

This benchmark aims to aid in choosing the best surrogate model for the task at hand and additionally to shed some light on the above questions.

To achieve this, a selection of surrogate models is implemented in this repository. They can be trained on one of the included datasets or a custom dataset and then benchmarked on the corresponding test dataset.

Some **metrics** included in the benchmark (but there are many more!):

- Absolute and relative error of the models.
- Inference time.
- Number of trainable parameters.
- Memory requirements (**WIP**).

Besides this, there are plenty of **plots and visualisations** providing insights into the models' behaviour:

- Error distributions - per model, across time or per quantity.
- Insights into interpolation and extrapolation across time.
- Behaviour when training with sparse data or varying batch size.
- Predictions with uncertainty and predictive uncertainty across time.
- Correlations between either the predictive uncertainty or the dynamics (gradients) of the data and the prediction error.

Some prime **use-cases** of the benchmark are:

- Finding the best-performing surrogate on a dataset. Here, best-performing could mean high accuracy, low inference times or any other metric of interest (e.g. most accurate uncertainty estimates, ...).
- Comparing performance of a novel surrogate architecture against the implemented baseline models.
- Gaining insights into a dataset or comparing datasets using the built-in dataset insights.

## Key Features

<details>
<summary><b>Baseline Surrogates</b></summary>

The following surrogate models are currently implemented to be benchmarked:

- Fully Connected Neural Network:
The vanilla neural network a.k.a. multilayer perceptron.
- DeepONet:
Two fully connected networks whose outputs are combined using a scalar product. In the current implementation, the surrogate comprises only one DeepONet with multiple outputs (hence the name MultiONet).
- Latent NeuralODE:
NeuralODE combined with an autoencoder that reduces the dimensionality of the dataset before solving the dynamics in the resulting latent space.
- Latent Polynomial:
Uses an autoencoder similar to Latent NeuralODE, but fits a polynomial to the trajectories in the resulting latent space.

</details>

<details>
<summary><b>Baseline Datasets</b></summary>

The following datasets are currently included in the benchmark:

</details>

<details>
<summary><b>Uncertainty Quantification (UQ)</b></summary>

To give an uncertainty estimate that does not rely too much on the specifics of the surrogate architecture, we use DeepEnsemble for UQ.

</details>

<details>
<summary><b>Parallel Training</b></summary>

To gain insights into the surrogates' behaviour, many models must be trained on varying subsets of the training data. This task is trivially parallelisable. In addition to utilising all specified devices, the benchmark features progress bars that show the current status of the training.

</details>

<details>
<summary><b>Plots, Plots, Plots</b></summary>

While hard metrics are crucial to compare the surrogates, performance cannot always be broken down to a set of numbers. Running the benchmark creates many plots that compare the surrogates against each other or provide insights into each individual surrogate.

</details>

<details>
<summary><b>Dataset Insights (WIP)</b></summary>

"Know your data" is one of the most important rules in machine learning. To aid in this, the benchmark provides plots and visualisations that should help to understand the dataset better.

</details>

<details>
<summary><b>Tabular Benchmark Results</b></summary>
```bash
git clone https://github.com/robin-janssen/CODES-Benchmark.git
cd CODES-Benchmark
uv sync # creates .venv from pyproject/uv.lock
source .venv/bin/activate
uv run python run_training.py --config configs/train_eval/config_minimal.yaml
uv run python run_eval.py --config configs/train_eval/config_minimal.yaml
```

At the end of the benchmark, the most important metrics are displayed in a table; additionally, all metrics generated during the benchmark are provided as a CSV file.
**pip alternative**

</details>
```bash
git clone https://github.com/robin-janssen/CODES-Benchmark.git
cd CODES-Benchmark
python -m venv .venv && source .venv/bin/activate
pip install -e .
pip install -r requirements.txt
python run_training.py --config configs/train_eval/config_minimal.yaml
python run_eval.py --config configs/train_eval/config_minimal.yaml
```

<details>
<summary><b>Reproducibility</b></summary>
Outputs land in `trained/<training_id>`, `results/<training_id>`, and `plots/<training_id>`. The `configs/` folder contains ready-to-use templates (`train_eval/config_minimal.yaml`, `config_full.yaml`, etc.). Copy a file there and adjust datasets/surrogates/modalities before running the CLIs.
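
A minimal sketch of that copy-and-edit workflow (the config name `my_study.yaml` is just a placeholder):

```bash
# Copy a template, adjust it, then run training and evaluation against it
cp configs/train_eval/config_minimal.yaml configs/train_eval/my_study.yaml
# ...edit datasets/surrogates/modalities in my_study.yaml...
uv run python run_training.py --config configs/train_eval/my_study.yaml
uv run python run_eval.py --config configs/train_eval/my_study.yaml
ls trained/ results/ plots/   # outputs are grouped by training_id
```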

Randomness is an important part of machine learning and even required in the context of UQ with DeepEnsemble, but reproducibility is key in benchmarking enterprises. The benchmark uses a custom seed that can be set by the user to ensure full reproducibility.
## Documentation

</details>
- [Main docs & tutorials](https://robin-janssen.github.io/CODES-Benchmark/)
- [API reference (Sphinx)](https://robin-janssen.github.io/CODES-Benchmark/modules.html)
- [Paper on arXiv](https://arxiv.org/abs/2410.20886)

<details>
<summary><b>Custom Datasets and Own Models</b></summary>
The GitHub Pages site now hosts the narrative guides, configuration reference, and interactive notebooks alongside the generated API docs.

To cover a wide variety of use-cases, the benchmark is designed such that adding your own datasets and models is explicitly supported.
## Repository map

</details>
| Path | Purpose |
| --- | --- |
| `configs/` | Ready-to-edit benchmark configs (`train_eval/`, `tuning/`, etc.) |
| `datasets/` | Bundled datasets + download helper (`data_sources.yaml`) |
| `codes/` | Python package with surrogates, training, tuning, and benchmarking utilities |
| `run_training.py`, `run_eval.py`, `run_tuning.py` | CLI entry points for the main workflows |
| `docs/` | Sphinx project powering the GitHub Pages site (guides, tutorials, API reference) |
| `scripts/` | Convenience tooling (dataset downloads, analysis utilities) |

## Quickstart
## Contributing

First, clone the [GitHub Repository](https://github.com/robin-janssen/CODES-Benchmark) with
Pull requests are welcome! Please include documentation updates, add or update tests when you touch executable code, and run:

```bash
uv pip install --group dev
pytest
sphinx-build -b html docs/source/ docs/_build/html
```
git clone ssh://git@github.com/robin-janssen/CODES-Benchmark
```

Optionally, you can set up a [virtual environment](https://docs.python.org/3/library/venv.html) (recommended).

Then, install the required packages with

```
pip install -r requirements.txt
```

The installation is now complete. To run and evaluate the benchmark, you first need to set up a configuration YAML file. A default one is provided, but it should be adapted to your needs. For more information, check the [configuration page](https://robin-janssen.github.io/CODES-Benchmark/documentation.html#config). There, we also offer an interactive Config-Generator tool with some explanations to help you set up your benchmark.

You can also add your own datasets and models to the benchmark to evaluate them against each other or some of our baseline models. For more information on how to do this, please refer to the [documentation](https://robin-janssen.github.io/CODES-Benchmark/documentation.html).
If you publish a new surrogate or dataset, document it under `docs/guides` / `docs/reference` so users can adopt it quickly. For questions, open an issue on GitHub.