SysBioChalmers · edkerk · May 30, 2026 · May 29, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,48 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+  workflow_dispatch:
+
+# A new push to the same ref cancels any run still in progress for that ref.
+concurrency:
+  group: ci-${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  lint:
+    name: ruff
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+          cache-dependency-path: pyproject.toml
+      - run: pip install --upgrade pip
+      - run: pip install ruff
+      - run: ruff check .
+
+  test:
+    name: pytest (py${{ matrix.python }})
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python: ["3.11", "3.12", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python }}
+          cache: pip
+          cache-dependency-path: pyproject.toml
+      - run: pip install --upgrade pip
+      # ``-e .[dev,plotting,excel]`` so every optional extra is exercised.
+      # Gurobi is not installable on free runners; the relevant tests
+      # skip themselves when ``optlang.gurobi_interface`` cannot import.
+      - run: pip install -e ".[dev,plotting,excel]"
+      - run: pytest -q --maxfail=5 --durations=20
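The self-skipping behaviour mentioned in the workflow comments could be implemented along these lines. This is a minimal stdlib-only sketch, not the project's actual test code: `can_import`, `HAS_GUROBI`, and `GurobiDependentTest` are hypothetical names, and the only identifier taken from the workflow is the module path `optlang.gurobi_interface`.

```python
import importlib
import unittest


def can_import(module_name: str) -> bool:
    """Return True if ``module_name`` imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so this also
        # covers the case where the package is simply not installed.
        return False
    return True


# Evaluated once at import time; expected to be False on runners without Gurobi.
HAS_GUROBI = can_import("optlang.gurobi_interface")


@unittest.skipUnless(HAS_GUROBI, "optlang.gurobi_interface cannot be imported")
class GurobiDependentTest(unittest.TestCase):  # hypothetical test case
    def test_solver_interface_is_importable(self):
        module = importlib.import_module("optlang.gurobi_interface")
        self.assertTrue(hasattr(module, "Model"))
```

pytest collects `unittest.TestCase` classes and honours `skipUnless`, so a class like this is reported as skipped rather than failed on a runner without Gurobi. In a pytest-only test module, `pytest.importorskip("optlang.gurobi_interface")` has the same effect in a single line.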
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -1,15 +1,82 @@
 # raven-python
 
-The Python counterpart of the
-[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB), built on
-[cobrapy](https://github.com/opencobra/cobrapy).
-
-`raven-python` covers de-novo reconstruction (KEGG + protein homology),
-context-specific model extraction (`tINIT` / `ftINIT`), metabolic-task
-validation, gap-filling, omics ingestion, sub-cellular localisation, model
-manipulation, and YAML / SIF / Excel I/O — preserving the established RAVEN
-workflows in a Python-native form.
-
-This `main` branch is intentionally empty. Development happens on the
-`develop` branch via a series of feature branches; see the open and merged
-pull requests for the current state of the port.
+[![CI](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml/badge.svg)](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml)
+
+**Reconstruction, Analysis and Visualisation of Metabolic Networks — in Python.**
+
+`raven-python` is the Python counterpart of the
+[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB). It builds on
+[**cobrapy**](https://github.com/opencobra/cobrapy) for everything cobrapy already does
+well (simulation, standard analyses, SBML I/O, model manipulation) and adds the
+functionality that's unique to RAVEN:
+
+* **De novo reconstruction** from KEGG and protein homology (BLAST / DIAMOND).
+* **Context-specific models** from omics data via **tINIT / ftINIT**, with task-aware
+  gap-filling and the linear-merge MILP reduction.
+* **Metabolic-task** validation (`check_tasks`, `fitTasks`).
+* **Connectivity gap-filling** against template models.
+* **Omics integration** — Human Protein Atlas (proteomics + RNA-seq) ingestion.
+* **Sub-cellular localisation** prediction by MILP, with partial-update mode and
+  pluggable predictors (WoLF PSORT, DeepLoc, …).
+* **N-model comparison**; **reporter metabolites**; **FSEOF**; **flux sampling**.
+* **YAML I/O** following the cobra standard, plus geckopy's `ec-*` enzyme-constrained
+  fields. **SIF** export. **RAVEN-style Excel** export.
+
+The status of every RAVEN function (ported, cheatsheet-mapped to cobra, or explicitly
+not ported) is documented function-by-function in
+**[docs/raven_migration.md](docs/raven_migration.md)**.
+
+## Design principle
+
+The canonical in-memory object is always a [`cobra.Model`](https://cobrapy.readthedocs.io).
+There is no parallel RAVEN struct, no `ravenCobraWrapper`-style adapter. RAVEN-specific
+fields that cobra doesn't model natively (`rxnMiriams`, `metDeltaG`,
+`rxnConfidenceScores`, …) live in cobra's `annotation` / `notes` dictionaries. This
+avoids duplicating cobra's data model and keeps raven-python interoperable with the wider
+COBRA ecosystem.
+
+## Status
+
+raven-python has been validated against MATLAB RAVEN on **Human-GEM** (5 Hart2015 cell-line
+models, Jaccard 0.975–0.980 — see [docs/humangem_validation.md](docs/humangem_validation.md)).
+The functional scope of the original RAVEN toolbox is covered with two principled
+omissions:
+
+* **MetaCyc-based reconstruction** is not implemented and is flagged for removal from
+  MATLAB RAVEN as well — see [IMPROVEMENTS.md](IMPROVEMENTS.md) under `R-MetaCyc`.
+* **Dynamic FBA** is not implemented — several maintained Python packages already cover
+  it ([`dfba`](https://pypi.org/project/dfba/), [`reframed`](https://pypi.org/project/reframed/),
+  [`mewpy`](https://pypi.org/project/mewpy/)).
+
+What's still open is catalogued in **[docs/todo.md](docs/todo.md)** (visualisation / Phase
+6 is the main item).
+
+## Installation (development)
+
+```bash
+git clone https://github.com/SysBioChalmers/raven-python
+cd raven-python
+pip install -e ".[dev]"
+```
+
+raven-python requires Python ≥ 3.11. Genome-scale (f)tINIT MILPs currently require **Gurobi**
+([details on solver portability](docs/init_solver_benchmark.md)); toy and unit-test work
+runs on the open-source GLPK.
+
+## Documentation
+
+See **[docs/README.md](docs/README.md)** for the documentation index.
+
+## Relationship to MATLAB RAVEN
+
+`raven-python` is a derivative work of MATLAB RAVEN and is released under the same
+**GPL-3.0-or-later** license. If you use it in scientific work, please cite the RAVEN 2 paper:
+
+> Wang H, Marcišauskas S, Sánchez BJ, Domenzain I, Hermansson D, Agren R, Nielsen J,
+> Kerkhoven EJ. (2018) RAVEN 2.0: A versatile toolbox for metabolic network
+> reconstruction and a case study on *Streptomyces coelicolor*. PLoS Comput Biol 14(10):
+> e1006541.
+
+## License
+
+[GPL-3.0-or-later](LICENSE)
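The Status section of the README quotes Jaccard values of 0.975–0.980 for the Human-GEM comparison. For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below computes it for two sets of reaction IDs. Which sets the validation actually compares is documented in `docs/humangem_validation.md` and is not assumed here; the function name and the example IDs are made up for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical reaction-ID sets from two extracted models.
python_model = {"R1", "R2", "R3", "R4"}
matlab_model = {"R2", "R3", "R4", "R5"}

print(jaccard(python_model, matlab_model))  # 3 shared / 5 total = 0.6
```

A value of 1.0 means the two sets are identical, so values near 0.98 correspond to sets that differ in only a small fraction of their members.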
diff --git a/docs/kegg_data_format.md b/docs/kegg_data_format.md
@@ -0,0 +1,72 @@
+# KEGG relational-table storage format
+
+This note records *why* raven-python stores its KEGG-derived relational tables as
+**compressed TSV** (gzip for the small tables, xz for the large one), and which
+alternatives we deliberately deferred. It applies to the maintainer-built KEGG artefacts
+described in PLAN.md §2.3b: the `ko_reaction`, `organism_gene_ko`, `ko_names`, and `rxn_flags` tables.
+
+The reference GEM itself is stored as **gzipped RAVEN/cobra YAML**
+(`reference_model.yml.gz`): RAVEN-native and MATLAB-readable, and gzipped like the
+small tables (the YAML I/O gzips transparently when the path ends in `.gz`). On the real
+KEGG dump this is ~1.1 MB (vs ~30 MB as SBML) for the full 12k-reaction gene-free model.
+
+End users do not build any of this: the published artefacts are fetched and cached
+under `~/.cache/raven-python/data/kegg-<version>/` by `ensure_data` (see
+`raven_python.data`), mirroring how binaries are provisioned.
+
+## Decision (current)
+
+- **Small tables** (`ko_reaction`, `ko_names`, `rxn_flags`): **gzipped TSV
+  (`.tsv.gz`)**. Each is well under 1 MB, so compression choice is irrelevant;
+  gzip keeps them MATLAB-native and dependency-free.
+- **The large `organism_gene_ko` table**: **xz-compressed TSV
+  (`organism_gene_ko.tsv.xz`), with rows sorted by `(organism, gene)`**.
+
+Why the large table differs. It carries KEGG's ~9M gene↔KO associations and
+dominates the artefact set (≈78 MB as unsorted gzipped TSV). Two cheap,
+stdlib-only changes cut that to ≈27 MB (2.9×):
+
+1. **Sort by `(organism, gene)`** before writing. Gene IDs from one organism
+   share long common prefixes (locus tags, numeric runs); sorting makes them
+   adjacent so the compressor can fold them. This alone takes 78 → 48 MB and
+   happens to match the by-organism query pattern in
+   `get_kegg_model_for_organism`. The sort is an external merge sort bounded to
+   `chunk_rows` in memory (see `stream_organism_gene_ko`), so it stays scalable.
+2. **xz instead of gzip** (Python stdlib `lzma`). Its larger dictionary captures
+   cross-row redundancy gzip's 32 KB window misses: sorted + xz reaches ≈27 MB.
+
+- **pandas reads/writes both with zero extra dependencies** — compression is
+  inferred from the `.gz`/`.xz` suffix; `lzma` and `gzip` are both stdlib, so
+  this works natively on Windows, macOS, and Linux with no external binary.
+- **MATLAB caveat:** `readtable` reads gzipped TSV after a `gunzip`, but MATLAB
+  has no built-in xz decompressor. The small tables stay MATLAB-native; the
+  large table needs an external `unxz` (or Java/`7-Zip`) before `readtable` on
+  the MATLAB side. The xz file is raven-python's (Python) primary artefact; this
+  trades a little MATLAB convenience on the one big file for a ~3× size cut.
+
+## Options considered
+
+| Format | Python cost | MATLAB cost | Notes |
+| --- | --- | --- | --- |
+| **Compressed TSV** (gzip; xz for the large table) ✅ | none (stdlib/pandas) | none for `.gz` (`gunzip` + `readtable`); external `unxz` for the `.xz` table | Universal, text, types re-specified on read. Chosen. |
+| Parquet | `pyarrow` or `fastparquet` (~40–60 MB wheel) as a `raven-python[kegg]` extra | needs ≥ R2019a (`parquetread`, native) | Smaller, faster, typed, columnar. Win mainly at scale / repeated random access. |
+| SQLite | none (stdlib `sqlite3`) | **needs Database Toolbox** | Rejected: the MATLAB-side toolbox requirement breaks the "same files, both languages, no extra deps" goal. |
+
+## When to revisit
+
+Reconsider Parquet (or SQLite) if any of these become true:
+
+- Load *time* for the `organism_gene_ko` table becomes a real bottleneck (on-disk
+  size is already addressed by the sort+xz change above). The remaining
+  inefficiency is that building one species' model still loads all ~9M rows;
+  sorted order makes a `searchsorted`/row-group by-organism read the natural
+  next step before reaching for Parquet.
+- We start doing repeated random-access / columnar reads rather than a single
+  load-once-per-run pattern.
+- A typed, self-describing schema becomes valuable (TSV loses dtypes; they are
+  re-specified on read).
+
+If revisited, prefer **Parquet** over SQLite (no MATLAB toolbox dependency; MATLAB
+reads Parquet natively from R2019a). It could be offered as an optional
+`raven-python[kegg]` extra (pyarrow) alongside the TSV default, rather than replacing
+it — keeping the dependency-free path intact for users who don't opt in.
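The two size-reduction steps described in `docs/kegg_data_format.md` (an external merge sort bounded to `chunk_rows`, followed by xz compression) can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the code behind `stream_organism_gene_ko`, whose implementation is not shown in this diff: `write_sorted_xz`, `_spill`, `_read_run`, and the three-column row layout are hypothetical, and the demo rows are made up.

```python
import heapq
import lzma
import tempfile
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator

Row = tuple[str, str, str]  # assumed layout: (organism, gene, ko)


def _spill(chunk: list[Row], tmp_dir: Path, index: int) -> Path:
    """Sort one in-memory chunk and write it to a temporary TSV run."""
    run = tmp_dir / f"run_{index:05d}.tsv"
    with run.open("w", encoding="utf-8", newline="") as handle:
        for row in sorted(chunk):
            handle.write("\t".join(row) + "\n")
    return run


def _read_run(run: Path) -> Iterator[Row]:
    """Stream the rows of one sorted run back from disk."""
    with run.open("r", encoding="utf-8", newline="") as handle:
        for line in handle:
            organism, gene, ko = line.rstrip("\n").split("\t")
            yield organism, gene, ko


def write_sorted_xz(rows: Iterable[Row], out_path: Path, chunk_rows: int = 100_000) -> None:
    """Write rows sorted by (organism, gene), with ko as tie-breaker, to an xz TSV."""
    rows = iter(rows)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        runs: list[Path] = []
        # Phase 1: at most ``chunk_rows`` rows are held in memory at a time.
        while chunk := list(islice(rows, chunk_rows)):
            runs.append(_spill(chunk, tmp_dir, len(runs)))
        # Phase 2: k-way merge of the sorted runs, streamed into the xz file.
        merged = heapq.merge(*(_read_run(run) for run in runs))
        with lzma.open(out_path, "wt", encoding="utf-8", newline="") as out:
            for row in merged:
                out.write("\t".join(row) + "\n")


# Tiny demonstration with made-up rows and a deliberately small chunk size.
demo_rows = [
    ("hsa", "10", "K00001"),
    ("eco", "b0002", "K00003"),
    ("sce", "YAL001C", "K00002"),
    ("eco", "b0001", "K00004"),
    ("hsa", "1", "K00005"),
]
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = Path(demo_dir) / "organism_gene_ko.tsv.xz"
    write_sorted_xz(demo_rows, demo_path, chunk_rows=2)
    with lzma.open(demo_path, "rt", encoding="utf-8") as handle:
        demo_sorted = [tuple(line.rstrip("\n").split("\t")) for line in handle]
```

Peak memory is governed by `chunk_rows`, and the merge keeps one open file per run, so a very small `chunk_rows` on a very large input trades memory for open file handles. Because rows from one organism end up adjacent in the output, the compressor sees the shared gene-ID prefixes back to back, which is the effect the note relies on.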