48 changes: 48 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,48 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:

# A push that obsoletes a previous run cancels it.
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    name: ruff
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
          cache-dependency-path: pyproject.toml
      - run: pip install --upgrade pip
      - run: pip install ruff
      - run: ruff check .

  test:
    name: pytest (py${{ matrix.python }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: pip
          cache-dependency-path: pyproject.toml
      - run: pip install --upgrade pip
      # ``-e .[dev,plotting,excel]`` so every optional extra is exercised.
      # Gurobi is not installable on free runners; the relevant tests
      # skip themselves when ``optlang.gurobi_interface`` cannot import.
      - run: pip install -e ".[dev,plotting,excel]"
      - run: pytest -q --maxfail=5 --durations=20
541 changes: 541 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

93 changes: 80 additions & 13 deletions README.md
@@ -1,15 +1,82 @@
# raven-python

The Python counterpart of the
[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB), built on
[cobrapy](https://github.com/opencobra/cobrapy).

`raven-python` covers de-novo reconstruction (KEGG + protein homology),
context-specific model extraction (`tINIT` / `ftINIT`), metabolic-task
validation, gap-filling, omics ingestion, sub-cellular localisation, model
manipulation, and YAML / SIF / Excel I/O — preserving the established RAVEN
workflows in a Python-native form.

This `main` branch is intentionally empty. Development happens on the
`develop` branch via a series of feature branches; see the open and merged
pull requests for the current state of the port.
[![CI](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml/badge.svg)](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml)

**Reconstruction, Analysis and Visualisation of Metabolic Networks — in Python.**

`raven-python` is the Python counterpart of the
[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB). It builds on
[**cobrapy**](https://github.com/opencobra/cobrapy) for everything cobrapy already does
well (simulation, standard analyses, SBML I/O, model manipulation) and adds the
functionality that's unique to RAVEN:

* **De novo reconstruction** from KEGG and protein homology (BLAST / DIAMOND).
* **Context-specific models** from omics data via **tINIT / ftINIT**, with task-aware
gap-filling and the linear-merge MILP reduction.
* **Metabolic-task** validation (`check_tasks`, `fitTasks`).
* **Connectivity gap-filling** against template models.
* **Omics integration** — Human Protein Atlas (proteomics + RNA-seq) ingestion.
* **Sub-cellular localisation** prediction by MILP, with partial-update mode and
pluggable predictors (WoLF PSORT, DeepLoc, …).
* **N-model comparison**; **reporter metabolites**; **FSEOF**; **flux sampling**.
* **YAML I/O** following the cobra standard, plus geckopy's `ec-*` enzyme-constrained
fields. **SIF** export. **RAVEN-style Excel** export.

The status of every RAVEN function (ported, cheatsheet-mapped to cobra, or explicitly
not ported) is documented function-by-function in
**[docs/raven_migration.md](docs/raven_migration.md)**.

## Design principle

The canonical in-memory object is always a [`cobra.Model`](https://cobrapy.readthedocs.io).
There is no parallel RAVEN struct, no `ravenCobraWrapper`-style adapter. RAVEN-specific
fields that cobra doesn't model natively (`rxnMiriams`, `metDeltaG`,
`rxnConfidenceScores`, …) live in cobra's `annotation` / `notes` dictionaries. This
avoids duplicating cobra's data model and keeps raven-python interoperable with the wider
COBRA ecosystem.

## Status

raven-python has been validated against MATLAB RAVEN on **Human-GEM** (5 Hart2015 cell-line
models, Jaccard 0.975–0.980 — see [docs/humangem_validation.md](docs/humangem_validation.md)).
The functional scope of the original RAVEN toolbox is covered with two principled
omissions:

* **MetaCyc-based reconstruction** is not implemented and is flagged for removal from
MATLAB RAVEN as well — see [IMPROVEMENTS.md](IMPROVEMENTS.md) under `R-MetaCyc`.
* **Dynamic FBA** is not implemented — several maintained Python packages already cover
it ([`dfba`](https://pypi.org/project/dfba/), [`reframed`](https://pypi.org/project/reframed/),
[`mewpy`](https://pypi.org/project/mewpy/)).

What's still open is catalogued in **[docs/todo.md](docs/todo.md)** (visualisation / Phase
6 is the main item).

## Installation (development)

```bash
git clone https://github.com/SysBioChalmers/raven-python
cd raven-python
pip install -e ".[dev]"
```

raven-python requires Python ≥ 3.11. Genome-scale (f)tINIT MILPs currently require **Gurobi**
([details on solver portability](docs/init_solver_benchmark.md)); toy and unit-test work
runs on the open-source GLPK.

## Documentation

See **[docs/README.md](docs/README.md)** for the documentation index.

## Relationship to MATLAB RAVEN

`raven-python` is a derivative work and is released under the same **GPL-3.0-or-later**
license. If you use it in scientific work, please cite the RAVEN 2 paper:

> Wang H, Marcišauskas S, Sánchez BJ, Domenzain I, Hermansson D, Agren R, Nielsen J,
> Kerkhoven EJ. (2018) RAVEN 2.0: A versatile toolbox for metabolic network
> reconstruction and a case study on *Streptomyces coelicolor*. PLoS Comput Biol 14(10):
> e1006541.

## License

[GPL-3.0-or-later](LICENSE)
72 changes: 72 additions & 0 deletions docs/kegg_data_format.md
@@ -0,0 +1,72 @@
# KEGG relational-table storage format

This note records *why* raven-python stores its KEGG-derived relational tables as
**compressed TSV** (gzip for the small tables, xz for the one large table), and
which other options we deliberately deferred. It applies to the maintainer-built
KEGG artefacts described in PLAN.md §2.3b: the `ko_reaction`, `organism_gene_ko`,
KO-name, and reaction-flag tables.

The reference GEM itself is stored as **gzipped RAVEN/cobra YAML**
(`reference_model.yml.gz`) — RAVEN-native and MATLAB-readable, gzipped to match the
tables (the YAML I/O transparently gzips on a `.gz` suffix). On the real KEGG dump
this is ~1.1 MB (vs ~30 MB as SBML) for the full 12k-reaction gene-free model.

End users do not build any of this: the published artefacts are fetched and cached
under `~/.cache/raven-python/data/kegg-<version>/` by `ensure_data` (see
`raven_python.data`), mirroring how binaries are provisioned.

## Decision (current)

- **Small tables** (`ko_reaction`, `ko_names`, `rxn_flags`): **gzipped TSV
(`.tsv.gz`)**. Each is well under 1 MB, so compression choice is irrelevant;
gzip keeps them MATLAB-native and dependency-free.
- **The large `organism_gene_ko` table**: **xz-compressed TSV
(`organism_gene_ko.tsv.xz`), with rows sorted by `(organism, gene)`**.

Why the large table differs. It carries KEGG's ~9M gene↔KO associations and
dominates the artefact set (≈78 MB as unsorted gzipped TSV). Two cheap,
stdlib-only changes cut that to ≈27 MB (2.9×):

1. **Sort by `(organism, gene)`** before writing. Gene IDs from one organism
share long common prefixes (locus tags, numeric runs); sorting makes them
adjacent so the compressor can fold them. This alone takes 78 → 48 MB and
happens to match the by-organism query pattern in
`get_kegg_model_for_organism`. The sort is an external merge sort bounded to
`chunk_rows` in memory (see `stream_organism_gene_ko`), so it stays scalable.
2. **xz instead of gzip** (Python stdlib `lzma`). Its larger dictionary captures
cross-row redundancy gzip's 32 KB window misses: sorted + xz reaches ≈27 MB.
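
The effect of the two steps above can be checked on synthetic data. The following is a minimal sketch, not the raven-python build code: the row generator, the column names, and the ID patterns are invented for illustration, and the sort is a plain in-memory `sorted` rather than the external merge sort in `stream_organism_gene_ko`. It prints the compressed size of the same rows as unsorted gzip, sorted gzip, and sorted xz; the exact numbers depend on the data, so none are quoted here.

```python
import gzip
import lzma
import random


def make_rows(n_organisms=20, genes_per_organism=200, seed=0):
    """Build synthetic (organism, gene, ko) rows in shuffled order.

    The ID patterns are made up; they only mimic gene IDs that share a
    long per-organism prefix.
    """
    rng = random.Random(seed)
    rows = [
        (f"org{o:03d}", f"ORG{o:03d}_{g:05d}", f"K{rng.randrange(100):05d}")
        for o in range(n_organisms)
        for g in range(genes_per_organism)
    ]
    rng.shuffle(rows)
    return rows


def to_tsv_bytes(rows):
    """Serialise rows as a UTF-8 TSV with a header line."""
    header = "organism\tgene\tko\n"
    body = "".join("\t".join(row) + "\n" for row in rows)
    return (header + body).encode("utf-8")


rows = make_rows()
unsorted_tsv = to_tsv_bytes(rows)
# Step 1: sort by (organism, gene) so similar IDs end up adjacent.
sorted_tsv = to_tsv_bytes(sorted(rows, key=lambda r: (r[0], r[1])))

# Step 2: compare gzip with xz (stdlib lzma) on the sorted bytes.
sizes = {
    "unsorted + gzip": len(gzip.compress(unsorted_tsv)),
    "sorted + gzip": len(gzip.compress(sorted_tsv)),
    "sorted + xz": len(lzma.compress(sorted_tsv)),
}
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

Running the same comparison on the real table is the only way to reproduce the figures quoted above; the synthetic rows only show how to measure it.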

- **pandas reads/writes both with zero extra dependencies** — compression is
inferred from the `.gz`/`.xz` suffix; `lzma` and `gzip` are both stdlib, so
this works natively on Windows, macOS, and Linux with no external binary.
- **MATLAB caveat:** `readtable` reads gzipped TSV after a `gunzip`, but MATLAB
has no built-in xz decompressor. The small tables stay MATLAB-native; the
large table needs an external `unxz` (or Java/`7-Zip`) before `readtable` on
the MATLAB side. The xz file is raven-python's (Python) primary artefact; this
trades a little MATLAB convenience on the one big file for a ~3× size cut.
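
Suffix-based codec selection, which pandas performs internally, can also be reproduced with the standard library alone. The sketch below is illustrative only: `open_text`, `write_tsv`, and `read_tsv` are hypothetical helpers, not part of raven-python, and the KO and reaction IDs are placeholder values rather than real KEGG associations.

```python
import csv
import gzip
import lzma
import tempfile
from pathlib import Path

# Map a file suffix to the stdlib opener that handles it.
_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def open_text(path, mode="rt"):
    """Open a text file, picking gzip, xz, or plain I/O from the suffix."""
    opener = _OPENERS.get(Path(path).suffix, open)
    return opener(path, mode, encoding="utf-8", newline="")


def write_tsv(path, header, rows):
    """Write a header and rows as tab-separated text."""
    with open_text(path, "wt") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv(path):
    """Read a TSV into a list of dicts keyed by the header row."""
    with open_text(path) as fh:
        return list(csv.DictReader(fh, delimiter="\t"))


header = ["ko", "reaction"]
rows = [["K00001", "R00001"], ["K00002", "R00002"]]  # placeholder IDs

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ko_reaction.tsv.gz", "ko_reaction.tsv.xz"):
        path = Path(tmp) / name
        write_tsv(path, header, rows)
        results[name] = read_tsv(path)
```

Both files decode to the same records, so callers only need to care about the file name, not the codec.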

## Options considered

| Format | Python cost | MATLAB cost | Notes |
| --- | --- | --- | --- |
| **Compressed TSV** (gzip; xz for `organism_gene_ko`) ✅ | none (stdlib/pandas) | none for `.gz` (`gunzip` + `readtable`); external `unxz` for the `.xz` file | Universal, text, types re-specified on read. Chosen. |
| Parquet | `pyarrow` or `fastparquet` (~40–60 MB wheel) as a `raven-python[kegg]` extra | needs ≥ R2019a (`parquetread`, native) | Smaller, faster, typed, columnar. Win mainly at scale / repeated random access. |
| SQLite | none (stdlib `sqlite3`) | **needs Database Toolbox** | Rejected: the MATLAB-side toolbox requirement breaks the "same files, both languages, no extra deps" goal. |

## When to revisit

Reconsider Parquet (or SQLite) if any of these become true:

- The `organism_gene_ko` table grows large enough that load *time* (not just
size — the sort+xz change above already addresses on-disk size) becomes a real
bottleneck. The remaining inefficiency is that building one species' model
still loads all ~9M rows; sorted order makes a `searchsorted`/row-group
by-organism read the natural next step before reaching for Parquet.
- We start doing repeated random-access / columnar reads rather than a single
load-once-per-run pattern.
- A typed, self-describing schema becomes valuable (TSV loses dtypes; they are
re-specified on read).
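
One inexpensive step short of Parquet is to exploit the sort order when reading. The sketch below is an assumption about how such a reader could look, not an existing raven-python function: `rows_for_organism` is a hypothetical name, the organism codes and IDs are synthetic, and the file is assumed to be sorted by organism in plain string order. It streams the xz file line by line, skips rows before the requested organism, and stops at the first row after it. It still decompresses everything that precedes the organism, so it bounds memory and skips the tail rather than giving true random access.

```python
import io
import lzma


def rows_for_organism(fh, organism):
    """Yield (organism, gene, ko) rows for one organism from a sorted TSV.

    Assumes `fh` is a text stream whose rows are sorted by the organism
    column in plain string order, with a single header line.
    """
    next(fh, None)  # skip the header, tolerating an empty file
    for line in fh:
        row = tuple(line.rstrip("\n").split("\t"))
        if row[0] < organism:
            continue  # not there yet
        if row[0] > organism:
            break  # sorted input: nothing further can match
        yield row


# Synthetic, already-sorted table compressed in memory for the demo.
table = [
    ("aaa", "AAA_0001", "K00001"),
    ("bbb", "BBB_0001", "K00002"),
    ("bbb", "BBB_0002", "K00003"),
    ("ccc", "CCC_0001", "K00001"),
]
tsv = "organism\tgene\tko\n" + "".join("\t".join(r) + "\n" for r in table)
blob = lzma.compress(tsv.encode("utf-8"))

with lzma.open(io.BytesIO(blob), "rt", encoding="utf-8") as fh:
    selected = list(rows_for_organism(fh, "bbb"))

with lzma.open(io.BytesIO(blob), "rt", encoding="utf-8") as fh:
    missing = list(rows_for_organism(fh, "zzz"))
```

If load time remains a bottleneck even with this kind of streaming filter, that is a signal that the random-access formats in the table above are worth the extra dependency.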

If revisited, prefer **Parquet** over SQLite (no MATLAB toolbox dependency; MATLAB
reads Parquet natively from R2019a). It could be offered as an optional
`raven-python[kegg]` extra (pyarrow) alongside the TSV default, rather than replacing
it — keeping the dependency-free path intact for users who don't opt in.