Merged
23 commits
0a9f4a1
Project scaffold: pyproject + package skeleton + README + LICENSE
edkerk May 29, 2026
b7b69ac
Add GitHub Actions CI and the maintainer-scripts README
edkerk May 29, 2026
50bea40
Add the foundation utilities: GPR, balance, parse, sort, validate
edkerk May 29, 2026
4bc0d6e
Add the model-manipulation layer (add, remove, transport, merge, etc.)
edkerk May 29, 2026
a1dc557
Add binary + data resolvers for external tools and published artefacts
edkerk May 29, 2026
6ef3357
Add YAML and SIF model I/O
edkerk May 29, 2026
7a9b69a
Add Excel export and the Standard-GEM git-layout export
edkerk May 29, 2026
cf199dc
Add BLAST and DIAMOND wrappers for protein-homology searches
edkerk May 29, 2026
eccce57
Add the homology-based draft model builder (getModelFromHomology port)
edkerk May 29, 2026
1b0df8d
Add KEGG download, dump parser and taxonomy parser
edkerk May 29, 2026
369f677
Add KEGG HMM-library build and HMM-based KO assignment
edkerk May 29, 2026
d9f100c
Add KEGG species-model assembly (per-organism reconstruction)
edkerk May 29, 2026
cf76698
Add KEGG artefact-build scripts and HMM-cutoff calibration docs
edkerk May 29, 2026
5e36aae
Add metabolic-task parsing and the check_tasks validator
edkerk May 29, 2026
2bbd4e6
Add connectivity gap-filling (MILP) against template models
edkerk May 29, 2026
b39f336
Add the tINIT (INIT) MILP and its supporting machinery
edkerk May 29, 2026
07f2fab
Add the ftINIT pipeline and task-aware gap-filling
edkerk May 29, 2026
9b04421
Add Human-GEM validation, parameter studies and cross-solver tests
edkerk May 29, 2026
2b32606
Add HPA omics ingestion (proteomics + RNA-seq)
edkerk May 29, 2026
4846909
Add FSEOF, reporter metabolites and flux sampling
edkerk May 29, 2026
f912525
Add N-model comparison (presence + Jaccard + optional task check)
edkerk May 29, 2026
9bfcd71
Add subcellular-localisation prediction (MILP) with pluggable predictors
edkerk May 29, 2026
3c12800
Add the yeast-GEM localization benchmark (real-data validation)
edkerk May 29, 2026
48 changes: 48 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,48 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:

# A push that obsoletes a previous run cancels it.
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    name: ruff
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
          cache-dependency-path: pyproject.toml
      - run: pip install --upgrade pip
      - run: pip install ruff
      - run: ruff check .

  test:
    name: pytest (py${{ matrix.python }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: pip
          cache-dependency-path: pyproject.toml
      - run: pip install --upgrade pip
      # ``-e .[dev,plotting,excel]`` so every optional extra is exercised.
      # Gurobi is not installable on free runners; the relevant tests
      # skip themselves when ``optlang.gurobi_interface`` cannot import.
      - run: pip install -e ".[dev,plotting,excel]"
      - run: pytest -q --maxfail=5 --durations=20
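
The workflow comment says the Gurobi-dependent tests skip themselves when `optlang.gurobi_interface` cannot import. A minimal standard-library sketch of that pattern is shown here; the helper `module_imports`, the test class and its placeholder body are hypothetical and are not taken from the raven-python test suite.

```python
import importlib
import unittest


def module_imports(name: str) -> bool:
    """Return True if `name` can be imported in this environment."""
    try:
        importlib.import_module(name)
    except Exception:
        # ImportError when the package is absent; solver bindings can also
        # fail in other ways (e.g. a missing shared library), so catch broadly.
        return False
    return True


# Evaluated once at import time; False on runners without Gurobi.
HAS_GUROBI = module_imports("optlang.gurobi_interface")


@unittest.skipUnless(HAS_GUROBI, "optlang.gurobi_interface is not importable")
class GenomeScaleSolveTest(unittest.TestCase):  # hypothetical test class
    def test_placeholder(self):
        # A real test would build and solve a genome-scale MILP here.
        self.assertTrue(HAS_GUROBI)
```

Under pytest, `pytest.importorskip("optlang.gurobi_interface")` gives the same skip behaviour in a single call.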
541 changes: 541 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

93 changes: 80 additions & 13 deletions README.md
@@ -1,15 +1,82 @@
# raven-python

The Python counterpart of the
[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB), built on
[cobrapy](https://github.com/opencobra/cobrapy).

`raven-python` covers de-novo reconstruction (KEGG + protein homology),
context-specific model extraction (`tINIT` / `ftINIT`), metabolic-task
validation, gap-filling, omics ingestion, sub-cellular localisation, model
manipulation, and YAML / SIF / Excel I/O — preserving the established RAVEN
workflows in a Python-native form.

This `main` branch is intentionally empty. Development happens on the
`develop` branch via a series of feature branches; see the open and merged
pull requests for the current state of the port.
[![CI](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml/badge.svg)](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml)

**Reconstruction, Analysis and Visualisation of Metabolic Networks — in Python.**

`raven-python` is the Python counterpart of the
[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB). It builds on
[**cobrapy**](https://github.com/opencobra/cobrapy) for everything cobrapy already does
well (simulation, standard analyses, SBML I/O, model manipulation) and adds the
functionality that's unique to RAVEN:

* **De novo reconstruction** from KEGG and protein homology (BLAST / DIAMOND).
* **Context-specific models** from omics data via **tINIT / ftINIT**, with task-aware
gap-filling and the linear-merge MILP reduction.
* **Metabolic-task** validation (`check_tasks`, `fitTasks`).
* **Connectivity gap-filling** against template models.
* **Omics integration** — Human Protein Atlas (proteomics + RNA-seq) ingestion.
* **Sub-cellular localisation** prediction by MILP, with partial-update mode and
pluggable predictors (WoLF PSORT, DeepLoc, …).
* **N-model comparison**; **reporter metabolites**; **FSEOF**; **flux sampling**.
* **YAML I/O** following the cobra standard, plus geckopy's `ec-*` enzyme-constrained
fields. **SIF** export. **RAVEN-style Excel** export.

The status of every RAVEN function (ported, cheatsheet-mapped to cobra, or explicitly
not ported) is documented function-by-function in
**[docs/raven_migration.md](docs/raven_migration.md)**.

## Design principle

The canonical in-memory object is always a [`cobra.Model`](https://cobrapy.readthedocs.io).
There is no parallel RAVEN struct, no `ravenCobraWrapper`-style adapter. RAVEN-specific
fields that cobra doesn't model natively (`rxnMiriams`, `metDeltaG`,
`rxnConfidenceScores`, …) live in cobra's `annotation` / `notes` dictionaries. This
avoids duplicating cobra's data model and keeps raven-python interoperable with the wider
COBRA ecosystem.

## Status

raven-python has been validated against MATLAB RAVEN on **Human-GEM** (5 Hart2015 cell-line
models, Jaccard 0.973–0.980 across the no-task and task-constrained runs; see [docs/humangem_validation.md](docs/humangem_validation.md)).
The functional scope of the original RAVEN toolbox is covered with two principled
omissions:

* **MetaCyc-based reconstruction** is not implemented and is flagged for removal from
MATLAB RAVEN as well — see [IMPROVEMENTS.md](IMPROVEMENTS.md) under `R-MetaCyc`.
* **Dynamic FBA** is not implemented — several maintained Python packages already cover
it ([`dfba`](https://pypi.org/project/dfba/), [`reframed`](https://pypi.org/project/reframed/),
[`mewpy`](https://pypi.org/project/mewpy/)).

What's still open is catalogued in **[docs/todo.md](docs/todo.md)** (visualisation / Phase
6 is the main item).

## Installation (development)

```bash
git clone https://github.com/SysBioChalmers/raven-python
cd raven-python
pip install -e ".[dev]"
```

raven-python requires Python ≥ 3.11. Genome-scale (f)tINIT MILPs currently require **Gurobi**
([details on solver portability](docs/init_solver_benchmark.md)); toy and unit-test work
runs on the open-source GLPK.

## Documentation

See **[docs/README.md](docs/README.md)** for the documentation index.

## Relationship to MATLAB RAVEN

`raven-python` is a derivative work and is released under the same **GPL-3.0-or-later**
license. If you use it in scientific work, please cite the RAVEN 2 paper:

> Wang H, Marcišauskas S, Sánchez BJ, Domenzain I, Hermansson D, Agren R, Nielsen J,
> Kerkhoven EJ. (2018) RAVEN 2.0: A versatile toolbox for metabolic network
> reconstruction and a case study on *Streptomyces coelicolor*. PLoS Comput Biol 14(10):
> e1006541.

## License

[GPL-3.0-or-later](LICENSE)
117 changes: 117 additions & 0 deletions docs/humangem_validation.md
@@ -0,0 +1,117 @@
# Human-GEM cell-type model validation: raven-python vs RAVEN

Validation of raven-python's tINIT/ftINIT against MATLAB RAVEN on a real genome-scale
reconstruction (Human-GEM) using the Hart2015 RNA-seq dataset (5 cell lines: DLD1,
GBM, HCT116, HELA, RPE1). The goal is functional equivalence — do raven-python and RAVEN
extract the *same* context-specific reaction sets from the same inputs?

## Method

* **Template & inputs.** RAVEN built the ftINIT reference model from Human-GEM
(`prepHumanModelForftINIT`: remove drug/exchange/artificial reactions, set
spontaneous/custom lists) and exported it as `raven_refModel.xml` (10198 reactions).
raven-python builds on that *same* exported model, so the candidate reaction universe is
identical and set comparison is exact.
* **Scoring.** Gene scores from `log2(TPM+1)`-style expression via
`gene_scores_from_expression`, mapped to reactions through the GPR
(`score_reactions_from_genes`), matching RAVEN's `getExprForRxnScore`.
* **ftINIT.** Series `1+1` (2 staged MILP steps). RAVEN run via `ftINIT.m` with Gurobi;
raven-python via `raven_python.init.ftinit` with Gurobi (`mip_gap=0.001`, `time_limit=600`).
* **tINIT.** raven-python `get_init_model` (classic single-MILP INIT) on HCT116, compared to
the ftINIT result for the same cell line.
* **Tasks.** Two raven-python ftINIT variants: *no-task* (expression only) and
*task-constrained* (essential metabolic tasks, `metabolicTasks_Essential.txt`, force
task-essential reactions to be kept). RAVEN's reference is task-constrained.
* **Solver.** Gurobi 13.0.1 for both tools.

## Engineering findings (raven-python tractability)

Getting ftINIT to run at genome scale surfaced three issues, all now fixed and matching
RAVEN's design:

1. **O(n²) constraint construction.** Building the steady-state balances with Python's
built-in `sum()` re-canonicalises a growing sympy expression at every term, so a
constraint with n terms costs O(n²). Hub metabolites (ATP/H⁺/H₂O, each in ~10³
reactions) made a single constraint take on the order of minutes (≈154 s total build;
benchmark: 1500-term `sum` = 59 s vs `optlang.symbolics.add` = 0.01 s). Fixed by
building flat term lists once per reaction and summing them with `optlang.symbolics.add`
(in both ftINIT and tINIT).
2. **Big-M too loose.** The on/off indicator constraints used each reaction's own bound
(±1000) as big-M; with `force_on=0.1` that is a ~10⁴ ratio, which gives a very weak LP
relaxation, so Gurobi never closed the gap. RAVEN uses a fixed big-M = 100;
raven-python now does the same.
3. **Stoichiometric rescaling.** A fixed big-M=100 is only valid if no reaction needs
flux ≫100; ported RAVEN's `rescaleModelForINIT` (cap each reaction's coefficient
dynamic range at 25×, normalise mean |coeff| to 1) into `prep_init_model`. Without it
the staged MILP is infeasible (step-1 caps transports that step-0 used freely).
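
The cost difference in finding 1 can be illustrated without optlang. The sketch below uses a toy immutable expression class, `ToyExpr`, as a stand-in for a symbolic sum (it is an assumption for illustration, not optlang's or sympy's API) and counts how many term copies each build strategy performs.

```python
from __future__ import annotations


class ToyExpr:
    """Immutable linear expression; a stand-in for a symbolic sum."""

    copies = 0  # class-wide counter of term copies performed

    def __init__(self, terms: dict[str, float]):
        self.terms = terms

    def __add__(self, other: "ToyExpr") -> "ToyExpr":
        # Every addition rebuilds the whole accumulated expression.
        merged = dict(self.terms)
        ToyExpr.copies += len(self.terms)
        for name, coeff in other.terms.items():
            merged[name] = merged.get(name, 0.0) + coeff
            ToyExpr.copies += 1
        return ToyExpr(merged)

    def __radd__(self, other):
        # Lets the built-in sum() start from its default integer 0.
        return self if other == 0 else NotImplemented

    @classmethod
    def add_flat(cls, exprs: list["ToyExpr"]) -> "ToyExpr":
        # Single pass over a flat term list: each term is touched once.
        merged: dict[str, float] = {}
        for expr in exprs:
            for name, coeff in expr.terms.items():
                merged[name] = merged.get(name, 0.0) + coeff
                cls.copies += 1
        return cls(merged)


def count_copies(build, n_terms: int) -> int:
    """Build an n-term sum with `build` and return the number of term copies."""
    terms = [ToyExpr({f"v{i}": 1.0}) for i in range(n_terms)]
    ToyExpr.copies = 0
    build(terms)
    return ToyExpr.copies


pairwise = count_copies(sum, 1500)               # grows quadratically with n
flat = count_copies(ToyExpr.add_flat, 1500)      # grows linearly with n
print(f"sum(): {pairwise} copies, flat add: {flat} copies")
```

The toy only models the copying; the real slowdown reported above also includes sympy's canonicalisation work at each step, which this sketch does not attempt to reproduce.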

Net effect: a full ftINIT cell-line solve went from *not finishing* to ~200 s,
comparable to RAVEN.
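
The big-M finding above can be made concrete with a little arithmetic. The sketch assumes indicator constraints of the common form |v| ≤ M·y with the binary y relaxed to [0, 1]; the exact constraint form used by RAVEN and raven-python is not stated here, and the function name is hypothetical.

```python
def min_relaxed_indicator(flux: float, big_m: float) -> float:
    """Smallest y in [0, 1] that satisfies |flux| <= big_m * y in the LP relaxation."""
    if big_m <= 0:
        raise ValueError("big_m must be positive")
    return min(1.0, abs(flux) / big_m)


FORCE_ON = 0.1  # minimum flux of an "on" reaction, as quoted in the findings

# Bound-derived big-M (reaction bound of 1000) versus the fixed big-M of 100.
loose = min_relaxed_indicator(FORCE_ON, big_m=1000.0)
tight = min_relaxed_indicator(FORCE_ON, big_m=100.0)

print(f"big-M=1000: y can relax to {loose:g}")
print(f"big-M=100:  y can relax to {tight:g}")
```

Under that assumed form, a smaller big-M keeps the relaxed indicator closer to a binary value, which is one way to read why the fixed big-M helps the solver close the gap.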

## Results

### Reaction counts

| cell line | RAVEN ftINIT | raven-python ftINIT (no-task) | raven-python ftINIT (task) |
|-----------|-------------:|--------------------------:|-----------------------:|
| DLD1 | 7782 | 7744 | 7774 |
| GBM | 7668 | 7667 | 7680 |
| HCT116 | 7780 | 7752 | 7776 |
| HELA | 7832 | 7789 | 7816 |
| RPE1 | 7569 | 7564 | 7570 |

Counts agree to within 0.6 % everywhere (largest gap: HELA no-task, 7789 vs 7832); the
task-constrained run is closest, within about 0.2 % (e.g. RPE1 7570 vs 7569, HCT116 7776
vs 7780). raven-python tINIT (HCT116) gives 6024 reactions, a smaller model, as expected
from the different (classic INIT) objective.

### Agreement — raven-python (no-task) ftINIT vs RAVEN ftINIT

| cell line | shared | only raven-python | only RAVEN | Jaccard |
|-----------|-------:|--------------:|-----------:|--------:|
| DLD1 | 7667 | 77 | 115 | 0.976 |
| GBM | 7562 | 105 | 106 | 0.973 |
| HCT116 | 7675 | 77 | 105 | 0.977 |
| HELA | 7707 | 82 | 125 | 0.974 |
| RPE1 | 7470 | 94 | 99 | 0.975 |

**~97.5 % of reactions are identical** between the two independent implementations, even
though this run is *expression-only* while RAVEN's reference is task-constrained. The
"only RAVEN" surplus (≈99–125) is expected to include task-essential reactions that the
task-constrained run (below) recovers.
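
The Jaccard column follows directly from the other three columns. A short sketch that recomputes it from the no-task table above (the function and dictionary names are illustrative, not part of raven-python):

```python
def jaccard_from_counts(shared: int, only_a: int, only_b: int) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| from shared and exclusive counts."""
    union = shared + only_a + only_b
    if union == 0:
        raise ValueError("both sets are empty")
    return shared / union


# cell line: (shared, only raven-python, only RAVEN), copied from the table
NO_TASK = {
    "DLD1": (7667, 77, 115),
    "GBM": (7562, 105, 106),
    "HCT116": (7675, 77, 105),
    "HELA": (7707, 82, 125),
    "RPE1": (7470, 94, 99),
}

for cell_line, counts in NO_TASK.items():
    print(cell_line, round(jaccard_from_counts(*counts), 3))
```

The same function applied to the task-constrained table gives that table's Jaccard column.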

### Agreement — raven-python (task-constrained) ftINIT vs RAVEN ftINIT

| cell line | shared | only raven-python | only RAVEN | Jaccard |
|-----------|-------:|--------------:|-----------:|--------:|
| DLD1 | 7699 | 75 | 83 | 0.980 |
| GBM | 7588 | 92 | 80 | 0.978 |
| HCT116 | 7696 | 80 | 84 | 0.979 |
| HELA | 7735 | 81 | 97 | 0.978 |
| RPE1 | 7493 | 77 | 76 | 0.980 |

Adding the essential metabolic tasks (same task list RAVEN uses) raises agreement to
**Jaccard 0.978–0.980** and makes the disagreement symmetric (only-raven-python ≈ only-RAVEN
≈ 80), confirming the prediction: the task constraints recover RAVEN's task-essential
reactions. The residual ≈80 reactions each way out of ~7700 is at the level expected from
MIP-gap tolerance (both accept near-optimal incumbents) and alternate optima.

### raven-python tINIT vs ftINIT (HCT116)

tINIT 6024 rxns vs ftINIT 7752 (the no-task run); shared 5957, Jaccard 0.762. tINIT is
nearly a subset of ftINIT (only 67 reactions are unique to it): the two methods agree on a
common core, with ftINIT keeping more (its staged formulation and task handling are less
aggressive at removal). This matches the expected tINIT/ftINIT relationship rather than
indicating a defect.
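
The figures quoted for HCT116 can be checked by hand. Writing T for the tINIT reaction set and F for the ftINIT reaction set (notation introduced here for the check only):

```latex
\[
J(T, F) = \frac{|T \cap F|}{|T| + |F| - |T \cap F|}
        = \frac{5957}{6024 + 7752 - 5957}
        = \frac{5957}{7819} \approx 0.762
\]
\[
|T \setminus F| = 6024 - 5957 = 67, \qquad
|F \setminus T| = 7752 - 5957 = 1795
\]
```

So nearly all of the disagreement comes from reactions that ftINIT keeps and tINIT removes.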

## Conclusions

From identical inputs on a genome-scale human reconstruction, raven-python reproduces RAVEN's
ftINIT reaction selection to **97.5 % (no-task) and 98 % (task-constrained) set identity**
across five cell lines, which is strong evidence of functional equivalence between the two
independent implementations. With task constraints the disagreement is symmetric, and the
residual (~80 reactions each way) is consistent with MIP near-optimality and alternate
optima rather than any systematic divergence.

Reaching genome-scale tractability required matching RAVEN's numerical-conditioning
choices and fixing optlang-specific construction costs (see *Engineering findings*):
fixed big-M = 100, `rescaleModelForINIT`, `optlang.symbolics.add` instead of Python
`sum()` in every MILP build (ftINIT, tINIT, and the gap-fill). With these, a
task-constrained cell-line model builds in ~15–25 min (dominated by the
essential-forced staged MILP) and a no-task one in ~3 min, comparable to RAVEN.