Skip to content

Add yeast-GEM-derived shared modules (diff, annotation, conditions, biomass, curation)#17

Open
edkerk wants to merge 7 commits into
developfrom
feat/yeast-gem-shared
Open

Add yeast-GEM-derived shared modules (diff, annotation, conditions, biomass, curation)#17
edkerk wants to merge 7 commits into
developfrom
feat/yeast-gem-shared

Conversation

@edkerk
Copy link
Copy Markdown
Member

@edkerk edkerk commented May 30, 2026

Lands the pieces that yeast-GEM extracted upstream as part of its Python port. Four focused commits, all organism-agnostic by construction (yeast-GEM consumes them via thin wrappers with its own config / data files).

What lands

  • raven_python.comparison.diffdiff_models(a, b, *, stoichiometry_tol=1e-9, …) + DiffReport + a python -m raven_python.comparison.diff CLI. Strict two-model semantic-equality check complementing the existing N-model compare_models (presence-matrix overview). Used as a CI gate to verify two toolchains produce equivalent models.

  • raven_python.annotationadd_sbo_terms with a fully parameterised core (biomass-met / NGAM / pseudoreaction name sets, transport-detector callback, only_last_reaction_for_pseudo legacy-bug flag for yeast-GEM's lock-step migration). load_delta_g_csv / save_delta_g_csv for the standard "id, value" side-car format (column names + note key configurable).

  • raven_python.conditionsapply_condition(model, yaml_or_dict) driven by a YAML schema (prelude / cofactor pseudoreaction edits / biomass-stoichiometry delta / bounds / uptake-count check). Project-specific extensions like yeast-GEM's amino_acid_ratio are handled by callers; the upstream contract stays narrow. Also exposes set_reaction_bounds that bypasses cobra's lb<=ub validator for the legitimate cases where a condition lands on an infeasible state.

  • raven_python.biomassBiomassConfig + BiomassComponent dataclasses driving sum_biomass, scale_biomass, rescale_pseudoreaction, and set_gam. Four mass strategies (mw, mw_minus_2h, mw_minus_water, grams) cover the typical pseudoreaction setups.

  • raven_python.manipulation.find_duplicate_reactions — detection-only counterpart to the existing remove_duplicate_reactions. Returns groups instead of mutating; ignores bounds / GPRs (only stoichiometry); ignore_direction=True default matches yeast-GEM's findDuplicatedRxns.

  • raven_python.curationbatch_curate(model, *, mets_df=…, genes_df=…, rxns_df=…, rxns_coeffs_df=…) + batch_curate_from_tsv(...). DataFrame-first; the TSV path is a thin convenience. Match keys: (name, comp) for mets, gene id for genes, stoichiometric signature for reactions. "Everything after the listed core columns is MIRIAM" — column header becomes the annotation namespace key; matches the yeast-GEM convention exactly so existing TSVs work unchanged. Includes a CurationResult dataclass recording added vs updated.

  • raven_python.io.yaml — byte-compatible YAML round-trip with cobrapy and RAVEN MATLAB. The previous writer dropped !!omap tags (because _to_plain flattened cobra's OrderedDict to plain dict), so any file it produced was unreadable by RAVEN MATLAB's line-based reader. Also lost eccodes on round-trip (not in _RXN_FIELDS) and didn't recognise RAVEN MATLAB's legacy rxnNotes reaction key. Fixes:

    • _to_plain now returns OrderedDict, so ruamel re-emits the !!omap tag.
    • eccodes added to _RXN_FIELDS so it round-trips through .notes.
    • rxnNotes accepted as a read-time alias for notes; writes go out as notes (cobrapy-canonical).
    • Top-level layout reordered to match RAVEN MATLAB (metaData first, then metabolites/reactions/genes/compartments, then optional gecko_light + ec-rxns + ec-enzymes). id/name/version live inside metaData; the reader lifts them back to model.id / model.name / model.notes['version'] for cobra-shape accessors.
    • Empty-name genes no longer emit name: '' (matches RAVEN MATLAB).
    • Verified: round-tripping yeast-GEM through read_yaml_model + write_yaml_model preserves 2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes, 2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG, 1788 SMILES, 1443 reaction notes. cobra.io.load_yaml_model and RAVEN MATLAB readYAMLmodel both read the output cleanly.
    • New regression test tests/test_io_yaml_parity.py covers every RAVEN extension, the rxnNotes alias, the SMILES YAML-special-character case, and the metaData-first layout.

Tests

84 new tests against synthetic toy models (13 diff, 25 annotation, 14 conditions, 19 biomass, 6 find_duplicate, 13 curation). Full raven-python suite still passes (1 pre-existing openpyxl-missing failure unrelated to this PR).

Lock-step verification (against MATLAB R2024b + RAVEN)

yeast-GEM uses each of these modules end-to-end via yeastgem. The Python toolchain produces SBML semantically equal to MATLAB's commitYeastModel and metrics matching MATLAB's within the tolerances documented in yeast-GEM's porting plan. Details in each commit message.

Companion PRs

yeast-GEM's pyproject.toml pins this branch as a git+https://…@feat/yeast-gem-shared URL dep; once these modules land on a tagged release the pin will switch to a version constraint.

edkerk added 4 commits May 31, 2026 01:03
Lands the upstream-shareable pieces that yeast-GEM has been implementing
locally during its Python port (see yeast-GEM/code/python/PORTING_PLAN.md
and UPSTREAM_CANDIDATES.md). These are organism-agnostic; yeast-GEM
will consume them via a Python dependency on raven-python.

New modules
-----------
raven_python.comparison.diff
    diff_models(a, b, ...) -> DiffReport — strict two-model semantic-
    equality diff. Complements the existing compare_models (N-model
    presence-matrix overview). Used as a CI gate to verify that two
    toolchains (e.g. MATLAB RAVEN vs raven_python, pre/post refactor
    of one toolchain) produce equivalent models. Includes a
    python -m raven_python.comparison.diff CLI.

raven_python.annotation.sbo
    add_sbo_terms — SBO term assignment with "fill" semantic. Default
    parameter set reproduces yeast-GEM's behaviour; biomass metabolite
    names, biomass/NGAM reaction names, and pseudoreaction substrings
    are overridable. Transport detection is pluggable (default: same-
    met-name in two compartments). Includes an `only_last_reaction_
    for_pseudo` legacy bug-compat flag for yeast-GEM's lock-step
    migration; off by default for any new caller.

raven_python.annotation.delta_g
    load_delta_g_csv / save_delta_g_csv — generic side-car CSV
    persistence for scalar notes (ΔG by default, but the note key,
    column names, and id/value mapping are all configurable).

raven_python.conditions.apply
    apply_condition(model, yaml_or_dict) — generic "apply this YAML
    condition" loader. Schema: prelude (reset_exchanges),
    cofactor_pseudoreaction (remove_mets + charge_balance_met),
    biomass_stoichiometry_delta, per-rxn bounds, expected_uptake_count.
    Project-specific extensions (e.g. yeast-GEM's amino_acid_ratio)
    are handled by the caller before/after this function — kept
    upstream-narrow on purpose. Also exposes set_reaction_bounds
    helper that bypasses cobra's lb<=ub validator for the (legitimate)
    cases where a condition lands on an infeasible bound state.

Tests
-----
46 new tests across the three modules; full pre-existing raven-python
suite still passes (519 passed; 1 unrelated pre-existing openpyxl
ImportError in tests/test_io_git.py; 2 skipped). ruff clean.

Not in this commit
------------------
The biomass / GAM / chemostat / fit_gam modules are still tracked as
upstream candidates in yeast-GEM/code/python/UPSTREAM_CANDIDATES.md
and remain local in yeast-GEM until phase 4 of the port (which would
ideally land them directly here).
Generic biomass-equation manipulation, extracted from yeast-GEM's
sumBioMass / scaleBioMass / rescalePseudoReaction / changeGAM as
yeast-GEM moves to depend on raven-python (yeast-GEM phase 4 of the
porting plan).

Module layout
-------------
raven_python.biomass.config
    BiomassConfig — biomass_rxn id + proton_met id + ordered tuple
    of BiomassComponent (per-component pseudoreaction name + mass-
    computation strategy).

raven_python.biomass.scale
    sum_biomass(model, config) → {component: g/gDW, ..., "total": X}
    rescale_pseudoreaction(model, config, name, factor) — multiply
        the pseudoreaction's substrate coefs by factor and rebalance
        H+ to keep ionic charge at zero.
    scale_biomass(model, config, name, new_value, balance_out=None) —
        rescale a component to a target fraction, optionally balancing
        a second component so the total stays at 1 g/gDW.

raven_python.biomass.gam
    set_gam(model, value, *, biomass_rxn, cofactor_met_names,
            ngam_rxn=None, ngam_value=None) — scales every metabolite
    in the biomass pseudoreaction whose `name` is in the supplied
    cofactor set, preserving its sign; optionally fixes the NGAM rxn
    bounds.

Mass strategies (per BiomassComponent.mass_strategy):
    "mw"               plain MW from chemical formula (carbohydrate /
                       ion / cofactor)
    "mw_minus_2h"      MW − 2.016 g/mol per substrate (protein —
                       charged tRNAs release two protons)
    "mw_minus_water"   MW − 18.015 g/mol per substrate (RNA / DNA —
                       polymerisation releases one water)
    "grams"            stoichiometry already in g/gDW (lipid backbone)

Tests: 19 new tests over a synthetic toy model that exercises every
mass strategy, the H+ charge rebalance, scale_biomass with and
without balance_out, set_gam on cofactor mets (and the NGAM bound
path).
…iant)

Detection-only counterpart to remove_duplicate_reactions. Returns
duplicate groups instead of mutating the model. Ignores bounds /
GPR / objective — only stoichiometry is compared, mirroring the
typical curation use case ("find reactions that could be merged").

A new ``ignore_direction=True`` default (yeast-GEM convention)
treats A→B and B→A as duplicates. Set False to require identical
orientation.

Used by yeast-GEM's modelTests port (Tier 3 / phase 5) to flag
duplicate reactions during curation review.
Generic batch curation engine extracted from yeast-GEM's MATLAB
curateMetsRxnsGenes (yeast-GEM phase 6). Adds or updates
metabolites, reactions and genes from pandas DataFrames; a
batch_curate_from_tsv convenience wrapper reads the equivalent TSVs.

Schema (matches yeast-GEM's data/modelCuration/template/ layout):

  mets_df          metNames, comps, formula, charge, inchi, metNotes
                   + any number of MIRIAM-namespace columns
  genes_df         genes, geneShortNames + MIRIAM columns
  rxns_df          rxnNames, grRules, lb, ub, rev, subSystems,
                   eccodes, rxnNotes, rxnReferences,
                   rxnConfidenceScores + MIRIAM columns
  rxns_coeffs_df   rxnNames, metNames, comps, coefficient
                   (one row per (reaction, metabolite) pair)

Match keys:
  metabolites — (name, compartment) tuple
  genes       — gene id
  reactions   — stoichiometric signature

Existing entities get their annotations overwritten (warning emitted);
new entities are added with fresh ids generated from the supplied
``met_id_prefix`` / ``rxn_id_prefix`` (defaults M_ / R_ per the BiGG
convention; yeast-GEM passes s_ / r_). Width of the existing
zero-padded suffix is preserved so s_0001 → s_0002, not s_2.

"Everything after the core columns is MIRIAM" — the header of any
extra column becomes the annotation namespace key. Matches MATLAB
behaviour exactly so yeast-GEM's existing TSVs work unchanged, and
projects with different MIRIAM column sets need no code change.

CurationResult dataclass records what was added vs updated so
callers can verify in tests / CI.

Tests: 13 new (add/update mets, add/update genes, add/update rxns
by stoichiometry, miriam auto-detect, id-width preservation,
combined mets+rxns in one call, missing-metabolite error,
batch_curate_from_tsv round trip).
edkerk added 3 commits May 31, 2026 07:03
Three things this fixes:

  1. write_yaml_model dropped the !!omap tags entirely. _to_plain
     was flattening cobra's OrderedDict to plain dict, which causes
     ruamel to emit ordinary block mappings. RAVEN MATLAB's reader
     is a line-based parser keyed on !!omap and therefore could not
     load any file we wrote. _to_plain now returns OrderedDict so
     ruamel re-emits the !!omap tag.

  2. eccodes was lost on round-trip — it wasn't in _RXN_FIELDS, so
     read_yaml_model didn't capture it into .notes and
     write_yaml_model couldn't lift it back. Added.

  3. RAVEN MATLAB writes reaction notes as 'rxnNotes'; cobrapy and
     this writer use 'notes'. Added a read-time alias so existing
     yeast-GEM YAML files (which still say 'rxnNotes') load
     cleanly. Writes go out as 'notes' (cobrapy-canonical).

Top-level layout now matches RAVEN MATLAB: metaData first, then
metabolites / reactions / genes / compartments, then optional
gecko_light + ec-rxns + ec-enzymes. id/name/version live inside
metaData (RAVEN convention) — cobrapy reading these files still
works, but cobra_model.id ends up None because cobrapy doesn't
know about metaData. raven_python.read_yaml_model lifts both
metaData.id/name/version onto model.id / model.name /
model.notes['version'] so the rest of the codebase doesn't care
which layout the file used.

Empty-name genes are no longer emitted as  — that's a
cobrapy quirk that drifts yeast-GEM YAML files away from RAVEN
MATLAB's output.

Verified end-to-end:

  *  cobra.io.load_yaml_model reads every file the new writer
     produces (yeast-GEM and a synthetic fixture).
  *  RAVEN MATLAB readYAMLmodel reads every file the new writer
     produces.
  *  Round-tripping yeast-GEM through raven_python preserves
     2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes,
     2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG,
     1788 SMILES, 1443 rxn-notes — no semantic drift.

Tests
-----
  *  tests/test_io_yaml_parity.py is new: covers every RAVEN
     extension, the rxnNotes legacy alias, the SMILES YAML-special
     character case, metaData-first layout, and cobra readability.
  *  tests/test_io_yaml.py::test_output_is_cobra_readable adjusts
     for the metaData layout (cobra recovers mets/rxns/annotation
     but not model.id, by design).
PyYAML is not a project dependency; raven-python uses ruamel.yaml
(already pulled in via cobra) everywhere else. The conditions
module and its tests still imported PyYAML, which broke pytest
collection on clean CI runners with 'No module named yaml'.

Both apply.py and the test now use a YAML(typ='safe') instance
from ruamel.yaml — same plain-dict semantics as PyYAML's
safe_load / safe_dump, no new dependency.
Adds docs/reference/yaml_format.md as the canonical schema reference
for the cross-toolchain YAML format (cobrapy / raven-python / RAVEN
MATLAB). Covers the full document shape, per-entry field order,
RAVEN extensions, the GECKO ec-* sections, the metaData provenance
block, number / string / quoting rules, and the cross-reader
interoperability matrix. Linked from docs/reference/index.md and
the I/O guide.

Reader fix: pre-shim RAVEN MATLAB writes emitted GECKO models
with geckoLight: "true" inside the metaData block (not as a
top-level gecko_light). The reader now lifts that legacy key out
of metaData so model.ec.gecko_light is populated whichever
placement the file used. Round-trip writes always use the new
top-level form.

Regression tests:

  test_pre_shim_format_loads — synthetic fixture covering every
  legacy quirk we know about (--- doc marker, plain metaData,
  geckoLight inside metaData, top-level metabolite smiles,
  rxnNotes reaction key, integer bounds, double-quoted strings).
  Each quirk has its own assertion + comment.

  test_pre_shim_yeast_gem_loads_if_available — sanity-loads the
  real yeast-GEM.yml (2748 mets, 4102 rxns, 1143 genes) and
  asserts the documented preserved-counts table from the format
  reference. Skipped on CI runners where the working copy isn't
  mounted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant