Merged
29 commits
0a9f4a1
Project scaffold: pyproject + package skeleton + README + LICENSE
edkerk May 29, 2026
b7b69ac
Add GitHub Actions CI and the maintainer-scripts README
edkerk May 29, 2026
50bea40
Add the foundation utilities: GPR, balance, parse, sort, validate
edkerk May 29, 2026
4bc0d6e
Add the model-manipulation layer (add, remove, transport, merge, etc.)
edkerk May 29, 2026
a1dc557
Add binary + data resolvers for external tools and published artefacts
edkerk May 29, 2026
6ef3357
Add YAML and SIF model I/O
edkerk May 29, 2026
7a9b69a
Add Excel export and the Standard-GEM git-layout export
edkerk May 29, 2026
cf199dc
Add BLAST and DIAMOND wrappers for protein-homology searches
edkerk May 29, 2026
eccce57
Add the homology-based draft model builder (getModelFromHomology port)
edkerk May 29, 2026
1b0df8d
Add KEGG download, dump parser and taxonomy parser
edkerk May 29, 2026
369f677
Add KEGG HMM-library build and HMM-based KO assignment
edkerk May 29, 2026
d9f100c
Add KEGG species-model assembly (per-organism reconstruction)
edkerk May 29, 2026
cf76698
Add KEGG artefact-build scripts and HMM-cutoff calibration docs
edkerk May 29, 2026
5e36aae
Add metabolic-task parsing and the check_tasks validator
edkerk May 29, 2026
2bbd4e6
Add connectivity gap-filling (MILP) against template models
edkerk May 29, 2026
b39f336
Add the tINIT (INIT) MILP and its supporting machinery
edkerk May 29, 2026
07f2fab
Add the ftINIT pipeline and task-aware gap-filling
edkerk May 29, 2026
9b04421
Add Human-GEM validation, parameter studies and cross-solver tests
edkerk May 29, 2026
2b32606
Add HPA omics ingestion (proteomics + RNA-seq)
edkerk May 29, 2026
4846909
Add FSEOF, reporter metabolites and flux sampling
edkerk May 29, 2026
f912525
Add N-model comparison (presence + Jaccard + optional task check)
edkerk May 29, 2026
9bfcd71
Add subcellular-localisation prediction (MILP) with pluggable predictors
edkerk May 29, 2026
3c12800
Add the yeast-GEM localization benchmark (real-data validation)
edkerk May 29, 2026
79ed77c
Add the documentation index, RAVEN migration map and CHANGELOG
edkerk May 29, 2026
6255ff0
Add known-issues catalogue with closed sweep A–F regression notes
edkerk May 29, 2026
0f3ec0d
Add the consolidated MATLAB RAVEN back-port proposals doc
edkerk May 29, 2026
b20a89e
feat(io.yaml): factor model_from_yaml_data out of read_yaml_model
edkerk May 29, 2026
43a5b16
fix(io.yaml): drop metaData/version/_yaml_sections from doc['notes']
edkerk May 29, 2026
948f4e3
Merge remote-tracking branch 'origin/develop' into feature/quality-an…
edkerk May 30, 2026
195 changes: 195 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,195 @@
# Changelog

Milestones in the raven-python port. For function-level status see
[docs/raven_migration.md](docs/raven_migration.md); for open work see
[docs/todo.md](docs/todo.md).

## Infrastructure

* **GitHub Actions CI** ([.github/workflows/ci.yml](.github/workflows/ci.yml)) —
ruff + pytest matrix over Python 3.11/3.12/3.13. Tests that require Gurobi
auto-skip (no Gurobi on free runners); the known HiGHS upstream blocker
(`hybrid_interface.Configuration` rejects `lp_method='primal'`) is marked
`xfail(strict=True)` so CI flips red when optlang fixes it.

## Quality sweep — known-issues section F (design-choice divergences)

Closed the five items in section F (the "design choices that differ from RAVEN"
backlog from the original review). Three docstring/comment fixes; two code
fixes with matching MATLAB back-port proposals in IMPROVEMENTS.md (FS4, B2).

* `run_init` docstring spells out the score-0 semantics divergence between
classic INIT and ftINIT.
* `get_init_model`: the inaccurate "same regime" comment is replaced with an accurate
description of the conservative pre-filter.
* `fseof` classifier now uses the slope of `|flux|` (`linregress(enforced, |flux|)`)
instead of first-vs-last endpoints. A track whose endpoints straddle a
peak/trough no longer ends up mislabelled.
* `reporter_metabolites` docstring documents the one-sided p-value + z-score
ordering vs RAVEN's two-tailed sort, and points at the up/down split via
`gene_fold_changes`.
* `get_elemental_balance` now reports `unknown` for empty-stoichiometry
reactions (previously vacuously `balanced`). The original review attributed the
bug to `check_model`; the actual code is in `balance.py`.

Two new regression tests (F3 in `test_analysis_fseof.py`, F5 in
`test_utils_balance.py`). [docs/known_issues.md](docs/known_issues.md) is now
fully closed (all sections A–F).
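
The empty-stoichiometry case closed in this section is a vacuous-truth trap: a check of the form "every element sums to zero" passes trivially when there are no elements to check. The sketch below illustrates the guard; `elemental_balance`, its argument layout, and the choice to also return `unknown` for a missing formula are assumptions for illustration, not the code in `balance.py`.

```python
from collections import defaultdict


def elemental_balance(stoichiometry, formulas, tol=1e-9):
    """Classify one reaction as 'balanced', 'unbalanced', or 'unknown'.

    stoichiometry: {metabolite_id: coefficient}
    formulas:      {metabolite_id: {element: count}}
    """
    # Without this guard, all(...) over an empty collection returns True
    # and an empty reaction would be reported as balanced.
    if not stoichiometry:
        return "unknown"
    # Assumption: a metabolite without a formula also makes the result unknown.
    if any(met not in formulas for met in stoichiometry):
        return "unknown"

    totals = defaultdict(float)
    for met, coeff in stoichiometry.items():
        for element, count in formulas[met].items():
            totals[element] += coeff * count
    if all(abs(total) < tol for total in totals.values()):
        return "balanced"
    return "unbalanced"


formulas = {
    "glc": {"C": 6, "H": 12, "O": 6},
    "fru": {"C": 6, "H": 12, "O": 6},
    "h2o": {"H": 2, "O": 1},
}
empty_result = elemental_balance({}, formulas)
isomerase_result = elemental_balance({"glc": -1, "fru": 1}, formulas)
leaky_result = elemental_balance({"glc": -1, "h2o": 1}, formulas)
```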

## Quality sweep — known-issues sections C / D / E

Closed all the robustness, efficiency, and dead-code items in one pass.

**Robustness (C):**
* `constrain_reversible_reactions` wraps FVA in try/except + NaN check; both
backend-raised `OptimizationError` and silent-NaN returns now surface as one
clear `RuntimeError` (the original `abs(NaN) < eps` silently no-op'd).
* `ensure_binary` downloads through `.part` + `os.replace`, matching `data.py` —
an interrupted download leaves a `.part`, never a half-complete `.zip`.
* `parse_task_list` (.xlsx) checks `wb.sheetnames` before lookup; missing
`TASKS` sheet now raises a clear `ValueError` instead of a bare `KeyError`.
* `parse_taxonomy` pads with explicit `""` when a depth level is skipped and
warns once.
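
The silent no-op described for `constrain_reversible_reactions` comes from NaN comparing false against everything, so `abs(NaN) < eps` never fires. A minimal sketch of the guard, assuming a hypothetical `run_fva` callable that returns `{reaction_id: (minimum, maximum)}`; the broad `except Exception` is a stand-in for the backend's `OptimizationError`, and none of these names come from the package.

```python
import math


def checked_fva(run_fva, reaction_ids):
    """Run FVA and turn both failure modes into one RuntimeError."""
    try:
        ranges = run_fva(reaction_ids)
    except Exception as exc:  # stand-in for the solver backend's error type
        raise RuntimeError(f"FVA failed for {len(reaction_ids)} reactions") from exc

    # NaN never satisfies abs(x) < eps, so it has to be detected explicitly.
    bad = [
        rid
        for rid, (low, high) in ranges.items()
        if math.isnan(low) or math.isnan(high)
    ]
    if bad:
        raise RuntimeError(f"FVA returned NaN for reactions: {sorted(bad)}")
    return ranges


nan_is_silently_ignored = not (abs(float("nan")) < 1e-9)
```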

**Efficiency (D):**
* `group_linear_reactions` rewritten with a metabolite worklist (re-enqueue
the mets touched by each merge); same observable result, O(n+m) work per
pass instead of restarting the full scan after every merge.
* `parse_kegg_reactions` now caches the parsed stoichiometry on each
`KeggReaction.stoichiometry`; `build_reference_model` reuses it instead of
re-parsing.
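
The worklist idea behind the `group_linear_reactions` rewrite can be sketched on a toy representation. Everything here is an assumption for illustration: reactions are plain `{metabolite: coefficient}` dicts, a metabolite counts as "linear" when exactly two reactions use it with opposite signs, and reversibility, bounds, and gene rules are ignored. It is not the package's algorithm, only an instance of the re-enqueue-what-changed pattern.

```python
from collections import deque
from fractions import Fraction


def merge_linear(reactions):
    """Merge reaction pairs joined by a metabolite that only those two use."""
    rxns = {
        rid: {met: Fraction(coeff) for met, coeff in stoich.items()}
        for rid, stoich in reactions.items()
    }
    users = {}  # metabolite -> ids of the reactions that use it
    for rid, stoich in rxns.items():
        for met in stoich:
            users.setdefault(met, set()).add(rid)

    work = deque(users)  # every metabolite is examined at least once
    while work:
        met = work.popleft()
        if len(users.get(met, ())) != 2:
            continue
        a, b = sorted(users[met])
        ca, cb = rxns[a][met], rxns[b][met]
        if ca * cb > 0:  # both produce or both consume: nothing to cancel
            continue
        scale = -ca / cb  # positive, so reaction b keeps its direction
        merged = dict(rxns[a])
        for m, coeff in rxns[b].items():
            merged[m] = merged.get(m, 0) + scale * coeff
        merged = {m: coeff for m, coeff in merged.items() if coeff != 0}

        new_id = f"{a}+{b}"
        for rid in (a, b):
            for m in rxns.pop(rid):
                users[m].discard(rid)
        rxns[new_id] = merged
        for m in merged:
            users[m].add(new_id)
            work.append(m)  # only metabolites touched by the merge are revisited
    return rxns


chain = {
    "R1": {"A": -1, "B": 1},
    "R2": {"B": -1, "C": 1},
    "R3": {"C": -1, "D": 1},
}
merged_chain = merge_linear(chain)
```

Re-enqueueing only the metabolites of the merged reaction is what avoids rescanning the whole model after each merge.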

**Dead code (E):**
* Dropped `KeggReaction.modules` and `.rhea` (parsed but never consumed).
* Dropped the vestigial `only_genes_in_models` parameter from `_ortholog_map`.

Six new regression tests; the only item without a test is the `.part` atomic
download (defensive, needs urlopen mocking).
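
For reference, the `.part` + `os.replace` pattern can be sketched as below. The function name and the injectable `opener` parameter are hypothetical (they are not the `ensure_binary` signature); the injection is shown only because it is one way to exercise the interrupted-download path without patching `urlopen`.

```python
import os
import shutil
import urllib.request
from pathlib import Path


def atomic_download(url, dest, opener=urllib.request.urlopen):
    """Stream `url` into `<dest>.part`, then rename onto `dest` in one step."""
    dest = Path(dest)
    part = dest.with_name(dest.name + ".part")
    with opener(url) as response, open(part, "wb") as out:
        shutil.copyfileobj(response, out)
    # os.replace is only reached after the whole body has been written,
    # so `dest` is either absent or complete, never half-written.
    os.replace(part, dest)
    return dest
```

If the transfer raises midway, the exception propagates before `os.replace` runs, which leaves only the `.part` file behind.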

## Quality sweep — known-issues section B

Closed all four "silent misbehaviour" items from [docs/known_issues.md](docs/known_issues.md):
* `merge_models` warns on `formula` / `charge` conflicts when two source models
share a name[comp] but disagree (used to silently keep the first-seen).
* `add_reactions_from_equations` warns when creating a metabolite in an
unregistered compartment — both the `mets_by="id"` and `mets_by="name"` paths
(id-mode used to skip the check entirely, an asymmetry).
* `parse_task_list` warns when continuation data appears before any task ID
has been seen (used to silently drop the orphan row).
* `export_model_to_sif` warns up front when a custom label map sends two
distinct ids to the same label (used to silently collapse nodes).

Four new regression tests cover them.
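
As an illustration of the SIF label-collision check, the sketch below inverts a label map and reports every label that more than one id maps to. The function names and the warning text are hypothetical and are not taken from `export_model_to_sif`.

```python
import warnings
from collections import defaultdict


def find_label_collisions(label_map):
    """Return {label: [ids]} for every label shared by two or more ids."""
    by_label = defaultdict(list)
    for node_id, label in label_map.items():
        by_label[label].append(node_id)
    return {label: ids for label, ids in by_label.items() if len(ids) > 1}


def warn_on_label_collisions(label_map):
    """Warn once per colliding label before any output is written."""
    collisions = find_label_collisions(label_map)
    for label, ids in collisions.items():
        warnings.warn(
            f"label {label!r} is shared by ids {sorted(ids)}; "
            "these nodes would collapse into one in the exported network",
            stacklevel=2,
        )
    return collisions


collisions = find_label_collisions({"m1": "glucose", "m2": "glucose", "m3": "atp"})
```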

## Quality sweep — known-issues section A

Closed all six "latent edge-case bug" items from [docs/known_issues.md](docs/known_issues.md):
* `add_reactions_from_equations` no longer misparses `"2 oxoglutarate"` (or any
leading-number metabolite name) — the resolver tries the full token before
splitting off a coefficient.
* `add_reactions_from_equations` warns when an equation's terms cancel to a
zero-metabolite reaction.
* `add_reactions_from_model` tracks ids minted within the batch so two source
metabolites whose ids both collide with the draft don't collapse onto the
same generated id.
* `add_transport_reactions` warns on duplicate metabolite names in the source
or target compartment instead of silently dropping all but one.
* `connect_blocked_reactions` membership-guards the FVA result before
`.at[]` lookup.
* `assign_kos` rejects `cutoff >= 1` up front; it would otherwise have crashed inside the
ratio filter at `log(best_evalue) == 0`.

Six new regression tests cover the user-reachable cases.
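
The "full token before coefficient" rule from the first item can be sketched as follows. `parse_term` and the `known_names` lookup are hypothetical; the real resolver in `add_reactions_from_equations` may differ in how it looks names up and what it does with unknown terms.

```python
def parse_term(term, known_names):
    """Split one equation term into (coefficient, metabolite name)."""
    term = term.strip()
    # 1. Try the whole token first, so "2 oxoglutarate" resolves as a name.
    if term in known_names:
        return 1.0, term
    # 2. Only then try to peel a leading number off as the coefficient.
    head, _, rest = term.partition(" ")
    rest = rest.strip()
    if rest:
        try:
            return float(head), rest
        except ValueError:
            pass
    # 3. Fall back to treating the term as a name with coefficient 1.
    return 1.0, term


known = {"2 oxoglutarate", "ATP"}
name_with_digit = parse_term("2 oxoglutarate", known)
with_coefficient = parse_term("3 2 oxoglutarate", known)
```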

## Phase 7 — Localization

* **Sub-cellular localisation by MILP.** [`localization.predict_localization`](src/raven_python/localization/predict.py)
and [`apply_localization`](src/raven_python/localization/predict.py). Deterministic (not simulated
annealing); caller-passed `reactions_to_relocate` set with everything else pinned;
incomplete-model tolerant (no silent reaction removal); `apply=False` returns a diff
preview; multi-compartment by default with primary-free, extras-penalised scoring.
* **Predictor loaders.** [`load_wolfpsort`, `load_deeploc`](src/raven_python/localization/scores.py),
with the `gene × compartment` DataFrame contract open for any predictor.
* **Compartment helpers** ([`manipulation/compartments.py`](src/raven_python/manipulation/compartments.py)):
`merge_compartments`, `copy_to_compartment` — useful standalone for model curation.
* **Real-data validation on yeast-GEM** ([docs/yeast_localization_benchmark.md](docs/yeast_localization_benchmark.md))
— accuracy 0.72 → 0.39 on 298 GPR'd reactions as confident predictor mis-scoring rises
from 0 % to 50 %; perfect on compartments with disjoint gene sets (c/g/lp/p/v/vm), and
surfaces a `transport_cost` calibration insight for soft-probability score tables.

## Phase 5 — Data integration & analysis

* **Reporter metabolites, FSEOF, random sampling** ([`analysis/`](src/raven_python/analysis/)).
* **HPA omics ingestion** ([`omics.parse_hpa`, `parse_hpa_rna`, `hpa_gene_scores`, `rna_gene_scores`](src/raven_python/omics/hpa.py))
— pandas-tidy DataFrames replace RAVEN's sparse-matrix layout; scoring adapters reuse the
existing GPR walk.
* **N-model comparison** ([`comparison.compare_models`](src/raven_python/comparison/compare.py)).
* **Dynamic FBA** is **not ported** — established Python packages cover it (`dfba`,
`reframed`, `mewpy`).

## Phase 4d — ftINIT

* **ftINIT pipeline** ([`init.ftinit`](src/raven_python/init/ftinit.py)) — staged MILP, linear merge,
task-aware gap-filling, gene pruning.
* **Validated against MATLAB RAVEN on Human-GEM.** 5 Hart2015 cell-line models;
Jaccard 0.973–0.977 (no-task) and 0.978–0.980 (task-constrained). See
[docs/humangem_validation.md](docs/humangem_validation.md).
* **Parameter calibration & input-robustness study** ([docs/init_param_calibration.md](docs/init_param_calibration.md))
— `mip_gap=0.01` is the genome-scale full-pipeline sweet spot (~37% faster than 0.001 at
Jaccard 0.995); pipeline is robust to expression noise (Jaccard 0.92–0.95) but sensitive
to sparsity (50–70% dropout → Jaccard 0.59–0.71); the task + gap-fill layer keeps the
essential-task pass-rate at 67–69/69 across the gradient, whereas tINIT-without-it passes
only 35/69 even on clean data.
* **Cross-solver portability** ([docs/init_solver_benchmark.md](docs/init_solver_benchmark.md))
and [`tests/test_init_solvers.py`](tests/test_init_solvers.py): Gurobi and GLPK pass at toy
scale; only Gurobi is viable at genome scale today (HiGHS hits an upstream optlang
`clone()` bug; GLPK ignores `configuration.timeout` on MIP).
* **Engineering wins surfaced by the genome-scale work:** `check_tasks` and
`fill_tasks._feasible` rewritten in-place (~12× each); MILP construction now builds
expressions with `optlang.symbolics.add` (the O(n²) sympy `sum()` blow-up was the original
genome-scale blocker); bounded gap-fill MILP; `rescaleModelForINIT` ported.
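
For readers unfamiliar with the Jaccard figures quoted in this section, the index is the size of the intersection of two sets divided by the size of their union. The sketch below assumes it is taken over reaction identifiers, which this changelog does not state, and the convention of returning 1.0 for two empty sets is also an assumption.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


python_model = {"R1", "R2", "R3"}
matlab_model = {"R2", "R3", "R4"}
similarity = jaccard(python_model, matlab_model)  # 2 shared out of 4 total
```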

## Phase 4c — tINIT

* **INIT MILP and the tINIT pipeline** ([`init.run_init`](src/raven_python/init/init.py),
[`init.get_init_model`](src/raven_python/init/build.py)). Clean optlang reformulation;
RNA-seq scoring via clamped `5·ln(level/ref)`.
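
A worked sketch of the RNA-seq scoring formula. The `5·ln(level/ref)` term comes from the entry above; the clamp bounds (`lower`, `upper`), the handling of non-positive levels, and the function name are assumptions chosen for illustration and may not match `run_init`.

```python
import math


def rna_score(level, ref, lower=-5.0, upper=10.0):
    """Score an expression level as 5*ln(level/ref), clamped to [lower, upper]."""
    if ref <= 0:
        raise ValueError("ref must be positive")
    if level <= 0:
        return lower  # ln is undefined here; treat as the most negative score
    return max(lower, min(upper, 5.0 * math.log(level / ref)))


at_reference = rna_score(1.0, 1.0)        # ln(1) = 0, so the score is 0
strongly_expressed = rna_score(1e6, 1.0)  # raw value is far above the upper clamp
not_detected = rna_score(0.0, 1.0)
```

A level equal to the reference scores zero, levels above it score positive, and the clamp keeps a handful of extreme genes from dominating the objective.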

## Phase 4b — Gap-filling

* **Connectivity gap-filling** ([`gapfilling.connect_blocked_reactions`](src/raven_python/gapfilling/fill.py))
— MILP. Targeted (toward objective) mode delegates to `cobra.gapfill`.

## Phase 4a — Metabolic tasks

* **Task list parsing + `check_tasks`** ([`tasks/`](src/raven_python/tasks/)).

## Phase 3 — Reconstruction

* **Homology-based draft** from a template GEM + BLAST/DIAMOND wrappers
([`reconstruction/homology/`](src/raven_python/reconstruction/homology/)) — with structured
improvements over RAVEN's `getModelFromHomology` (see IMPROVEMENTS H1–H6).
* **KEGG five-step pipeline** ([`reconstruction/kegg/`](src/raven_python/reconstruction/kegg/)):
dump → parser → HMM library builder → species model → HMM-query draft.
* **MetaCyc reconstruction** is **not ported** (and is flagged for removal from MATLAB RAVEN;
see IMPROVEMENTS R-MetaCyc).

## Phase 2 — I/O

* **YAML** aligned to cobra's `!!omap` writer + RAVEN-only fields preserved into `.notes`,
plus geckopy `ec-*` for enzyme-constrained models
([`io/yaml.py`](src/raven_python/io/yaml.py)).
* **SIF**, **Excel export**, and **Standard-GEM `model/<fmt>/…` git layout**
([`io/`](src/raven_python/io/)). Excel import intentionally excluded.

## Phase 1 — Foundation

* **GPR / balance / validation / parsing helpers** ([`utils/`](src/raven_python/utils/)) —
only the pieces cobra lacks; the rest are covered by a cheatsheet.
* **Manipulation ergonomic layer** ([`manipulation/`](src/raven_python/manipulation/)) —
add/change/remove/transport/transfer/merge/simplify/variance + adopted transforms.
* **External-binary resolver** ([`binaries.py`](src/raven_python/binaries.py)) — version-pinned
release-ZIP registry, SHA256-verified cache.
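
A minimal sketch of what SHA256 verification of a cached archive can look like. The function names and the error type are hypothetical and are not the `binaries.py` API; only `hashlib.sha256` is a real call.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_cached(path, expected_sha256):
    """Return `path` if its digest matches the pinned value, else raise."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return path
```

Pinning the digest next to the version in the registry means a corrupted or tampered cache entry is rejected before the binary is ever executed.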

## Phase 0 — Scaffold

* Project structure, packaging, pytest skeleton, license alignment with MATLAB RAVEN
(GPL-3.0-or-later).