48 changes: 48 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,48 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:

# A push that obsoletes a previous run cancels it.
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    name: ruff
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
          cache-dependency-path: pyproject.toml
      - run: pip install --upgrade pip
      - run: pip install ruff
      - run: ruff check .

  test:
    name: pytest (py${{ matrix.python }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
          cache: pip
          cache-dependency-path: pyproject.toml
      - run: pip install --upgrade pip
      # ``-e .[dev,plotting,excel]`` so every optional extra is exercised.
      # Gurobi is not installable on free runners; the relevant tests
      # skip themselves when ``optlang.gurobi_interface`` cannot import.
      - run: pip install -e ".[dev,plotting,excel]"
      - run: pytest -q --maxfail=5 --durations=20
541 changes: 541 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

93 changes: 80 additions & 13 deletions README.md
@@ -1,15 +1,82 @@
# raven-python

The Python counterpart of the
[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB), built on
[cobrapy](https://github.com/opencobra/cobrapy).

`raven-python` covers de-novo reconstruction (KEGG + protein homology),
context-specific model extraction (`tINIT` / `ftINIT`), metabolic-task
validation, gap-filling, omics ingestion, sub-cellular localisation, model
manipulation, and YAML / SIF / Excel I/O — preserving the established RAVEN
workflows in a Python-native form.

This `main` branch is intentionally empty. Development happens on the
`develop` branch via a series of feature branches; see the open and merged
pull requests for the current state of the port.
[![CI](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml/badge.svg)](https://github.com/SysBioChalmers/raven-python/actions/workflows/ci.yml)

**Reconstruction, Analysis and Visualisation of Metabolic Networks — in Python.**

`raven-python` is the Python counterpart of the
[RAVEN Toolbox 2](https://github.com/SysBioChalmers/RAVEN) (MATLAB). It builds on
[**cobrapy**](https://github.com/opencobra/cobrapy) for everything cobrapy already does
well (simulation, standard analyses, SBML I/O, model manipulation) and adds the
functionality that's unique to RAVEN:

* **De novo reconstruction** from KEGG and protein homology (BLAST / DIAMOND).
* **Context-specific models** from omics data via **tINIT / ftINIT**, with task-aware
gap-filling and the linear-merge MILP reduction.
* **Metabolic-task** validation (`check_tasks`, `fitTasks`).
* **Connectivity gap-filling** against template models.
* **Omics integration** — Human Protein Atlas (proteomics + RNA-seq) ingestion.
* **Sub-cellular localisation** prediction by MILP, with partial-update mode and
pluggable predictors (WoLF PSORT, DeepLoc, …).
* **N-model comparison**; **reporter metabolites**; **FSEOF**; **flux sampling**.
* **YAML I/O** following the cobra standard, plus geckopy's `ec-*` enzyme-constrained
fields. **SIF** export. **RAVEN-style Excel** export.

The status of every RAVEN function (ported, cheatsheet-mapped to cobra, or explicitly
not ported) is documented function-by-function in
**[docs/raven_migration.md](docs/raven_migration.md)**.

## Design principle

The canonical in-memory object is always a [`cobra.Model`](https://cobrapy.readthedocs.io).
There is no parallel RAVEN struct, no `ravenCobraWrapper`-style adapter. RAVEN-specific
fields that cobra doesn't model natively (`rxnMiriams`, `metDeltaG`,
`rxnConfidenceScores`, …) live in cobra's `annotation` / `notes` dictionaries. This
avoids duplicating cobra's data model and keeps raven-python interoperable with the wider
COBRA ecosystem.

## Status

raven-python has been validated against MATLAB RAVEN on **Human-GEM** (5 Hart2015 cell-line
models, Jaccard 0.975–0.980 — see [docs/humangem_validation.md](docs/humangem_validation.md)).
The functional scope of the original RAVEN toolbox is covered with two principled
omissions:

* **MetaCyc-based reconstruction** is not implemented and is flagged for removal from
MATLAB RAVEN as well — see [IMPROVEMENTS.md](IMPROVEMENTS.md) under `R-MetaCyc`.
* **Dynamic FBA** is not implemented — several maintained Python packages already cover
it ([`dfba`](https://pypi.org/project/dfba/), [`reframed`](https://pypi.org/project/reframed/),
[`mewpy`](https://pypi.org/project/mewpy/)).

What's still open is catalogued in **[docs/todo.md](docs/todo.md)** (visualisation / Phase
6 is the main item).

## Installation (development)

```bash
git clone https://github.com/SysBioChalmers/raven-python
cd raven-python
pip install -e ".[dev]"
```

raven-python requires Python ≥ 3.11. Genome-scale (f)tINIT MILPs currently require **Gurobi**
([details on solver portability](docs/init_solver_benchmark.md)); toy and unit-test work
runs on the open-source GLPK.

## Documentation

See **[docs/README.md](docs/README.md)** for the documentation index.

## Relationship to MATLAB RAVEN

`raven-python` is a derivative work and is released under the same **GPL-3.0-or-later**
license. If you use it in scientific work, please cite the RAVEN 2 paper:

> Wang H, Marcišauskas S, Sánchez BJ, Domenzain I, Hermansson D, Agren R, Nielsen J,
> Kerkhoven EJ. (2018) RAVEN 2.0: A versatile toolbox for metabolic network
> reconstruction and a case study on *Streptomyces coelicolor*. PLoS Comput Biol 14(10):
> e1006541.

## License

[GPL-3.0-or-later](LICENSE)
236 changes: 236 additions & 0 deletions docs/maintaining_binaries.md
@@ -0,0 +1,236 @@
# Maintaining bundled binaries (BLAST+, DIAMOND, …)

Audience: **raven-python maintainers / the GitHub repo owner.** This explains how
raven-python ships external command-line tools, how to update their versions, and how
to build **minimal-footprint** ZIPs to attach to a GitHub release.

> End users never need to read this. They get a binary automatically via `ensure_binary`,
> or use their own (system/conda) install. This doc is only for whoever publishes
> the release assets.

---

## 1. How binary provisioning works

raven-python does **not** vendor binaries in the git repo or on PyPI. Instead:

1. For each tool we publish **version-pinned ZIPs as GitHub release assets**.
2. A **registry** (`src/raven_python/binaries_registry.json`) maps each *bundle* to its
version, the executables it provides, and per-platform `{asset, url, sha256}`.
3. At run time `raven_python.binaries.ensure_binary("blastp")` resolves a tool in this
order — and only reaches the download as a last resort:

```
explicit binary= arg → env var (RAVEN_PYTHON_BLASTP / RAVEN_PYTHON_DIAMOND / …)
→ shutil.which on PATH (system / conda / apt / brew)
→ ensure_binary: download the pinned ZIP → verify SHA256 → cache → return path
→ actionable error (with conda / manual instructions)
```

So a pre-installed binary always wins; the bundle is the zero-setup fallback.
Pinning the version makes reconstruction **reproducible**.

A *bundle* can provide several executables from one download (e.g. the `blast`
bundle provides both `blastp` and `makeblastdb`), so they are fetched once.
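
The resolution order above can be sketched in a few lines of Python. This is a minimal illustration, not the actual `raven_python.binaries` code: the function name `resolve_binary`, the `download` callable, and the rule that the environment variable is `RAVEN_PYTHON_` plus the upper-cased tool name are assumptions made for the example.

```python
import os
import shutil
from pathlib import Path


def resolve_binary(name, binary=None, download=None):
    """Return a path to `name`, trying the cheapest sources first.

    `download` is a hypothetical callable standing in for the pinned-ZIP
    fallback; it is only invoked when nothing is already installed.
    """
    # 1. An explicit argument always wins.
    if binary:
        return Path(binary)

    # 2. Environment variable, e.g. RAVEN_PYTHON_BLASTP for "blastp"
    #    (naming rule assumed from the two examples in the text).
    env_value = os.environ.get(f"RAVEN_PYTHON_{name.upper()}")
    if env_value:
        return Path(env_value)

    # 3. Anything already on PATH (system / conda / apt / brew).
    found = shutil.which(name)
    if found:
        return Path(found)

    # 4. Last resort: the download fallback.
    if download is not None:
        return Path(download(name))

    # 5. Nothing worked: fail with instructions the user can act on.
    raise FileNotFoundError(
        f"{name!r} not found; install it (e.g. via conda) or set "
        f"RAVEN_PYTHON_{name.upper()} to its location."
    )
```

Because the download is the fourth step, a machine with a working conda or system install never touches the network.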

---

## 2. What raven-python actually needs — ship only these

Distribute the **minimum** set of executables. Everything else (other suite
tools, docs, examples, changelogs) must be excluded.

| Bundle | Executables to include | Everything else |
|---|---|---|
| `diamond` | `diamond` | — (it is a single static binary) |
| `blast` | `blastp`, `makeblastdb` | **drop** `blastn`, `tblastn`, `psiblast`, `rpsblast`, `blast_formatter`, `*_vdb`, the `doc/`, `ChangeLog`, `README`, ~30 other tools |

(Confirmed against RAVEN `getBlast`/`getDiamond`: only `makeblastdb`+`blastp`, and
`diamond` for its `makedb`/`blastp` subcommands, are ever invoked.)

For BLAST+ this is the big win: the full NCBI suite runs to hundreds of MB, while the two
stripped binaries are a small fraction of that.

---

## 3. Asset & ZIP conventions

**Asset filename:** `<bundle>-<version>-<os>-<arch>.zip`

- `<os>` ∈ `linux`, `macos`, `windows`
- `<arch>` ∈ `x86_64`, `arm64`
- examples: `diamond-2.1.11-linux-x86_64.zip`, `blast-2.16.0-macos-arm64.zip`

**ZIP layout — flat, executables at the root, plus the upstream licence:**

```
diamond-2.1.11-linux-x86_64.zip
├── diamond
└── LICENSE

blast-2.16.0-linux-x86_64.zip
├── blastp
├── makeblastdb
└── LICENSE
```

No nested `bin/`, no extra files. `ensure_binary` extracts the ZIP into the cache
and expects the executable at the top level.
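
A maintainer could check both conventions before uploading. The sketch below is only an illustration: `asset_name` and `check_flat_layout` are hypothetical helpers, not part of raven-python, and the check assumes the executables carry no extension (Windows builds would need `.exe` names).

```python
import io
import zipfile

VALID_OS = {"linux", "macos", "windows"}
VALID_ARCH = {"x86_64", "arm64"}


def asset_name(bundle, version, os_name, arch):
    """Build `<bundle>-<version>-<os>-<arch>.zip`, rejecting unknown platforms."""
    if os_name not in VALID_OS or arch not in VALID_ARCH:
        raise ValueError(f"unsupported platform: {os_name}-{arch}")
    return f"{bundle}-{version}-{os_name}-{arch}.zip"


def check_flat_layout(zip_file, executables):
    """Return a list of problems; an empty list means the layout is as expected."""
    expected = set(executables) | {"LICENSE"}
    with zipfile.ZipFile(zip_file) as zf:
        names = set(zf.namelist())
    # Any "/" in an entry name means a folder or nested file.
    problems = [f"nested entry: {n}" for n in sorted(names) if "/" in n]
    problems += [f"missing: {n}" for n in sorted(expected - names)]
    problems += [
        f"unexpected: {n}" for n in sorted(names - expected) if "/" not in n
    ]
    return problems


# Demo on an in-memory ZIP with empty placeholder entries.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for entry in ("blastp", "makeblastdb", "LICENSE"):
        zf.writestr(entry, b"")

print(asset_name("blast", "2.16.0", "linux", "x86_64"))
print(check_flat_layout(buffer, ["blastp", "makeblastdb"]))
```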

---

## 4. Step-by-step: add or update a version

Example: bump DIAMOND to a new version for Linux x86-64. Repeat per `(os, arch)`.

1. **Download the official upstream build** (never rebuild from source unless you
must):
- DIAMOND → <https://github.com/bbuchfink/diamond/releases>
(`diamond-linux64.tar.gz`, `diamond-macos.tar.gz`)
- BLAST+ → <https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/> or a
pinned version dir (`ncbi-blast-<ver>+-x64-linux.tar.gz`,
`-x64-macosx.tar.gz`, `-aarch64-linux.tar.gz`, `-x64-win64.tar.gz`).
- Record the upstream URL **and** its published checksum for provenance.
2. **Extract only the needed executables** (see §2) to a clean staging dir.
3. **Strip debug symbols** to shrink (skip on Windows / signed macOS builds):
```bash
strip diamond # or: strip blastp makeblastdb
```
4. **Smoke-test the stripped binaries in a clean shell** (no other tools on PATH):
```bash
./diamond --version
./blastp -version && ./makeblastdb -version
```
If they fail for a missing shared library, add that `.so`/`.dylib` to the ZIP
(rare — NCBI/DIAMOND release builds are largely self-contained).
5. **Add the upstream licence file** as `LICENSE` (see §6).
6. **Zip with max compression, flat layout:**
```bash
zip -9 -j diamond-2.1.11-linux-x86_64.zip diamond LICENSE
# -j junks paths so entries sit at the ZIP root
```
7. **Compute the SHA256:**
```bash
sha256sum diamond-2.1.11-linux-x86_64.zip # shasum -a 256 on macOS
```
8. **Attach the ZIP to a raven-python GitHub release** (a release tagged for the binary
set, e.g. `binaries-2024.06`, keeps them independent of code releases).
9. **Update the registry** `src/raven_python/binaries_registry.json` — bump `version`
and set the per-platform `asset` + `sha256`:
```json
{
  "diamond": {
    "version": "2.1.11",
    "provides": ["diamond"],
    "platforms": {
      "linux-x86_64": {
        "asset": "diamond-2.1.11-linux-x86_64.zip",
        "url": "https://github.com/SysBioChalmers/raven-python/releases/download/binaries-2024.06/diamond-2.1.11-linux-x86_64.zip",
        "sha256": "<sha256>"
      }
    }
  },
  "blast": {
    "version": "2.16.0",
    "provides": ["blastp", "makeblastdb"],
    "platforms": { "linux-x86_64": { "asset": "...", "url": "...", "sha256": "..." } }
  }
}
```
10. **Commit the registry change**, run the homology tests, and (if you have the
binary) confirm `ensure_binary("diamond", version="2.1.11")` downloads,
verifies, and runs.
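
Steps 6 and 7 can also be done from Python when `zip` or `sha256sum` is not available (for example on Windows). The following is a minimal sketch using only the standard library; `build_flat_zip` and `sha256_of` are hypothetical helper names, not functions that raven-python ships.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_flat_zip(zip_path, files):
    """Write `files` at the ZIP root with maximum deflate compression.

    Roughly what `zip -9 -j` does: `arcname=file.name` drops the directory
    part, so every entry sits at the top level. Returns the archive's SHA256.
    """
    with zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as archive:
        for file in map(Path, files):
            archive.write(file, arcname=file.name)
    return sha256_of(zip_path)
```

For example, `build_flat_zip("diamond-2.1.11-linux-x86_64.zip", ["staging/diamond", "staging/LICENSE"])` returns the hex digest to paste into the registry's `sha256` field.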

---

## 5. Keeping the footprint minimal — checklist

- ✅ Only the executables in §2 (for BLAST+, exactly `blastp` + `makeblastdb`).
- ✅ `strip` the binaries (often halves their size).
- ✅ `zip -9 -j` (max compression, flat — no `bin/`, no folders).
- ✅ Exactly one extra file: `LICENSE`.
- ❌ No docs, examples, `ChangeLog`, `README`, man pages, test data, or sibling tools.
- ❌ No `.dSYM`/debug bundles; no duplicate static `.a` libraries.
- ➕ Only add a shared library if step-4 testing proves it is required.

---

## 6. Platform / architecture matrix & licensing

**Coverage = what you build.** Start with `linux-x86_64` (CI default), then add
`macos-arm64`, `macos-x86_64`, `linux-arm64`, `windows-x86_64` as capacity allows.
For any `(os, arch)` **not** in the registry, `ensure_binary` raises an actionable
error pointing to conda (`conda install -c bioconda diamond blast`) or a manual
install — that is the documented fallback, not a failure to fix urgently.
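
How a platform key such as `linux-x86_64` might be derived and looked up is sketched below. The helper names `platform_key` and `lookup_asset`, the alias tables, and the error wording are assumptions for illustration; only the key format and the registry shape come from this document.

```python
import platform

# Map what the standard library reports onto the names used in asset filenames.
_OS_NAMES = {"Linux": "linux", "Darwin": "macos", "Windows": "windows"}
_ARCH_NAMES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",   # Windows reports AMD64
    "arm64": "arm64",    # macOS on Apple silicon
    "aarch64": "arm64",  # Linux on ARM
}


def platform_key(system=None, machine=None):
    """Return e.g. 'linux-x86_64' for the current (or a given) platform."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    try:
        return f"{_OS_NAMES[system]}-{_ARCH_NAMES[machine]}"
    except KeyError:
        raise RuntimeError(f"unrecognised platform: {system}/{machine}") from None


def lookup_asset(registry, bundle, key):
    """Return the registry entry for `key`, or raise with install guidance."""
    entry = registry[bundle]["platforms"].get(key)
    if entry is None:
        raise LookupError(
            f"no bundled {bundle!r} for {key}; install it yourself, e.g. "
            "conda install -c bioconda diamond blast"
        )
    return entry
```

Raising a descriptive error rather than returning `None` keeps the "not in the registry" case visible to the user at the point where the tool is first needed.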

**Licensing (must comply when redistributing):**

- **BLAST+** — produced by NCBI (US Government); **public domain**, free to
redistribute. Include NCBI's `LICENSE` for courtesy/provenance.
- **DIAMOND** — **GPLv3**. Redistribution is allowed; you **must** include the
GPLv3 licence text in the ZIP and make the corresponding source available (for an
unmodified upstream build, typically by pointing to the upstream release).
- **HMMER** (future) — BSD-3-Clause; include its `LICENSE`.

Always ship the upstream licence in the ZIP, and keep a `BINARIES_PROVENANCE.md`
(or a note in the release body) recording, per asset: upstream URL, upstream
version, upstream checksum, and the SHA256 you published.

### Native OS support per tool

raven-python invokes each tool through `subprocess.run([resolved_path, …])` — that
call is itself cross-platform, so the real constraint is whether a given tool has
a binary that runs natively on each OS. It varies:

| Tool | Linux | macOS (incl. arm64) | Windows (native) |
|---|---|---|---|
| BLAST+ (`blastp`, `makeblastdb`) | ✅ | ✅ | ✅ (NCBI ships Windows builds) |
| DIAMOND | ✅ | ✅ | ⚠️ native build exists but Linux-first |
| HMMER (`hmmbuild`/`hmmpress`/`hmmsearch`/`hmmscan`) | ✅ | ✅ | ❌ no official native build |
| MAFFT | ✅ | ✅ | ⚠️ Windows package is a wrapper |
| CD-HIT | ✅ | ✅ | ❌ no Windows build exists |

Implications:

- **Linux / macOS** — everything works. `conda install -c bioconda hmmer mafft
cd-hit blast diamond`, or point the `RAVEN_PYTHON_*` env vars at your installs.
- **Native Windows** — the homology track (BLAST+/DIAMOND) works, but the **KEGG
HMM build (3b.3) and HMM query (3b.5) do not**: HMMER and CD-HIT have no Windows
binaries, and bioconda has no Windows packages for any of them. Bundling can't
fix this — there is no binary to bundle.
- **Windows users should run raven-python inside WSL2** (or a Linux container), where
every tool is native Linux. raven-python does **not** replicate RAVEN's
`getWSLpath`/`wsl …` path translation: it calls the resolved binary directly, so
mixing native-Windows Python with WSL binaries is unsupported — keep the whole
stack inside WSL2.
- The common end-user paths — homology reconstruction and the KEGG *species* model
(3b.4) — need no HMMER/MAFFT/CD-HIT, so they are fully cross-platform.

---

## 7. Emitting the registry entry

After building the per-platform ZIPs (named `<bundle>-<version>-<os>-<arch>.zip`)
and uploading them to the release, generate the `_REGISTRY` entry — checksums and
URLs — with [`scripts/make_registry_snippet.py`](../scripts/README.md):

```bash
python scripts/make_registry_snippet.py binary --bundle blast --version 2.16.0 \
--provides blastp makeblastdb --dir zips \
--base-url https://github.com/ORG/raven-python/releases/download/blast-2.16.0
```

It prints the ready-to-paste `_REGISTRY["blast"]` block; its SHA256 helper is the
same one `ensure_binary` verifies with, so the checksums always match. (Producing
the minimal ZIPs themselves — download upstream, `strip`, `zip -9 -j`, add
`LICENSE` per §3–§6 — is still a manual/per-tool step.)
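
For orientation, the work such a generator has to do is small. The sketch below is not the code of `scripts/make_registry_snippet.py`; it is a hypothetical stand-in (`registry_entry` is an invented name) that assumes the filename convention from §3 and the entry shape from the JSON example in §4.

```python
import hashlib
import json
import re
from pathlib import Path

# <bundle>-<version>-<os>-<arch>.zip, as defined in the asset conventions.
ASSET_RE = re.compile(
    r"^(?P<bundle>.+)-(?P<version>[^-]+)"
    r"-(?P<os>linux|macos|windows)-(?P<arch>x86_64|arm64)\.zip$"
)


def registry_entry(bundle, version, provides, zip_dir, base_url):
    """Build one registry block from the ZIPs found in `zip_dir`."""
    platforms = {}
    for zip_path in sorted(Path(zip_dir).glob(f"{bundle}-{version}-*.zip")):
        match = ASSET_RE.match(zip_path.name)
        if match is None:
            continue  # Ignore files that do not follow the naming convention.
        platforms[f"{match['os']}-{match['arch']}"] = {
            "asset": zip_path.name,
            "url": f"{base_url.rstrip('/')}/{zip_path.name}",
            "sha256": hashlib.sha256(zip_path.read_bytes()).hexdigest(),
        }
    return {
        bundle: {
            "version": version,
            "provides": list(provides),
            "platforms": platforms,
        }
    }


def as_json(entry):
    """Pretty-print an entry so it can be pasted into the registry file."""
    return json.dumps(entry, indent=2)
```

Deriving the platform key from the filename means a misnamed ZIP is skipped rather than silently registered under the wrong platform.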

---

## 8. Adding a new tool later (e.g. HMMER for KEGG reconstruction)

1. Decide the **minimal executable set** (e.g. HMMER → `hmmsearch`, `hmmscan`,
maybe `hmmbuild`/`hmmpress`).
2. Add a bundle entry to the registry with `provides` listing those executables.
3. Build/attach ZIPs per §3–§4; include the tool's licence (§6).
4. The wrappers call `ensure_binary("hmmsearch", …)` with the same resolution
order — no new provisioning code needed.