Add yeast-GEM-derived shared modules (diff, annotation, conditions, biomass, curation) #17
Open
edkerk wants to merge 7 commits into
Lands the upstream-shareable pieces that yeast-GEM has been implementing
locally during its Python port (see yeast-GEM/code/python/PORTING_PLAN.md
and UPSTREAM_CANDIDATES.md). These are organism-agnostic; yeast-GEM
will consume them via a Python dependency on raven-python.
New modules
-----------
raven_python.comparison.diff
diff_models(a, b, ...) -> DiffReport — strict two-model semantic-
equality diff. Complements the existing compare_models (N-model
presence-matrix overview). Used as a CI gate to verify that two
toolchains (e.g. MATLAB RAVEN vs raven_python, pre/post refactor
of one toolchain) produce equivalent models. Includes a
python -m raven_python.comparison.diff CLI.
raven_python.annotation.sbo
add_sbo_terms — SBO term assignment with "fill" semantic. Default
parameter set reproduces yeast-GEM's behaviour; biomass metabolite
names, biomass/NGAM reaction names, and pseudoreaction substrings
are overridable. Transport detection is pluggable (default: same-
met-name in two compartments). Includes an
`only_last_reaction_for_pseudo` legacy bug-compat flag for yeast-GEM's
lock-step migration; off by default for any new caller.
raven_python.annotation.delta_g
load_delta_g_csv / save_delta_g_csv — generic side-car CSV
persistence for scalar notes (ΔG by default, but the note key,
column names, and id/value mapping are all configurable).
raven_python.conditions.apply
apply_condition(model, yaml_or_dict) — generic "apply this YAML
condition" loader. Schema: prelude (reset_exchanges),
cofactor_pseudoreaction (remove_mets + charge_balance_met),
biomass_stoichiometry_delta, per-rxn bounds, expected_uptake_count.
Project-specific extensions (e.g. yeast-GEM's amino_acid_ratio)
are handled by the caller before/after this function — kept
upstream-narrow on purpose. Also exposes a set_reaction_bounds
helper that bypasses cobra's lb<=ub validator for the (legitimate)
cases where a condition lands on an infeasible bound state.
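As a rough picture of what a strict two-model diff such as diff_models checks, the sketch below compares two toy models held as plain dicts (reaction id to stoichiometry). ToyDiff and toy_diff are hypothetical names for illustration only; the real DiffReport fields are not shown in this PR text, and the tolerance default simply mirrors the stoichiometry_tol=1e-9 keyword mentioned in the PR summary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDiff:
    """Hypothetical stand-in for a diff report (not the real DiffReport)."""
    only_in_a: list = field(default_factory=list)
    only_in_b: list = field(default_factory=list)
    changed: list = field(default_factory=list)

    @property
    def equal(self):
        # Two models are "equal" only if nothing differs at all.
        return not (self.only_in_a or self.only_in_b or self.changed)


def toy_diff(a, b, tol=1e-9):
    """Compare two models given as {reaction id: {metabolite id: coefficient}}."""
    report = ToyDiff()
    report.only_in_a = sorted(a.keys() - b.keys())
    report.only_in_b = sorted(b.keys() - a.keys())
    for rid in sorted(a.keys() & b.keys()):
        mets = a[rid].keys() | b[rid].keys()
        # A missing metabolite counts as coefficient 0.0.
        if any(abs(a[rid].get(m, 0.0) - b[rid].get(m, 0.0)) > tol for m in mets):
            report.changed.append(rid)
    return report


model_a = {"r1": {"A": -1.0, "B": 1.0}, "r2": {"B": -1.0, "C": 1.0}}
model_b = {
    "r1": {"A": -1.0, "B": 1.0 + 1e-12},  # within tolerance
    "r2": {"B": -1.0, "C": 2.0},          # real stoichiometry change
    "r3": {"C": -1.0},                    # extra reaction
}
result = toy_diff(model_a, model_b)
print(result.equal, result.only_in_b, result.changed)
```

Used as a CI gate, a check of this shape would fail the build whenever the report is not equal.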
Tests
-----
46 new tests across the three modules; full pre-existing raven-python
suite still passes (519 passed; 1 unrelated pre-existing openpyxl
ImportError in tests/test_io_git.py; 2 skipped). ruff clean.
Not in this commit
------------------
The biomass / GAM / chemostat / fit_gam modules are still tracked as
upstream candidates in yeast-GEM/code/python/UPSTREAM_CANDIDATES.md
and remain local in yeast-GEM until phase 4 of the port (which would
ideally land them directly here).
Generic biomass-equation manipulation, extracted from yeast-GEM's
sumBioMass / scaleBioMass / rescalePseudoReaction / changeGAM as
yeast-GEM moves to depend on raven-python (yeast-GEM phase 4 of the
porting plan).
Module layout
-------------
raven_python.biomass.config
BiomassConfig — biomass_rxn id + proton_met id + ordered tuple
of BiomassComponent (per-component pseudoreaction name + mass-
computation strategy).
raven_python.biomass.scale
sum_biomass(model, config) → {component: g/gDW, ..., "total": X}
rescale_pseudoreaction(model, config, name, factor) — multiply
the pseudoreaction's substrate coefs by factor and rebalance
H+ to keep ionic charge at zero.
scale_biomass(model, config, name, new_value, balance_out=None) —
rescale a component to a target fraction, optionally balancing
a second component so the total stays at 1 g/gDW.
raven_python.biomass.gam
set_gam(model, value, *, biomass_rxn, cofactor_met_names,
ngam_rxn=None, ngam_value=None) — scales every metabolite
in the biomass pseudoreaction whose `name` is in the supplied
cofactor set, preserving its sign; optionally fixes the NGAM rxn
bounds.
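The H+ rebalance described for rescale_pseudoreaction can be sketched on plain dicts. This is a minimal illustration under stated assumptions: coefficients are negative for substrates, the proton carries charge +1, and rescale_substrates, the metabolite ids, and the charges are all hypothetical (this is not the module's code).

```python
def rescale_substrates(coefs, charges, factor, proton="h"):
    """Scale substrate coefficients and recompute the proton coefficient.

    coefs:   {metabolite id: coefficient}, negative = substrate
    charges: {metabolite id: ionic charge}
    """
    scaled = {}
    for met, coef in coefs.items():
        if met == proton:
            continue  # the proton is recomputed below
        scaled[met] = coef * factor if coef < 0 else coef
    # Net charge of everything except the proton.
    net_charge = sum(coef * charges[met] for met, coef in scaled.items())
    # Choose the proton coefficient that brings the net charge back to zero.
    proton_coef = -net_charge / charges[proton]
    if abs(proton_coef) > 1e-12:
        scaled[proton] = proton_coef
    return scaled


coefs = {"aa1": -0.5, "aa2": -0.2, "protein": 1.0, "h": -0.3}
charges = {"aa1": -1, "aa2": 1, "protein": 0, "h": 1}
doubled = rescale_substrates(coefs, charges, 2.0)
print(doubled)
```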
Mass strategies (per BiomassComponent.mass_strategy):
  "mw"              plain MW from chemical formula (carbohydrate /
                    ion / cofactor)
  "mw_minus_2h"     MW − 2.016 g/mol per substrate (protein:
                    charged tRNAs release two protons)
  "mw_minus_water"  MW − 18.015 g/mol per substrate (RNA / DNA:
                    polymerisation releases one water)
  "grams"           stoichiometry already in g/gDW (lipid backbone)
Tests: 19 new tests over a synthetic toy model that exercises every
mass strategy, the H+ charge rebalance, scale_biomass with and
without balance_out, set_gam on cofactor mets (and the NGAM bound
path).
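To make the four mass strategies concrete, here is a small sketch of the per-substrate mass contribution. It assumes stoichiometric coefficients are in mmol/gDW (hence the division by 1000) except for "grams", where the coefficient is already in g/gDW; substrate_mass is a hypothetical helper and the molecular weights in the usage lines are made-up round numbers.

```python
H2O = 18.015    # g/mol, one water released per polymerised nucleotide
TWO_H = 2.016   # g/mol, two protons released per charged tRNA


def substrate_mass(strategy, coef, mw=None):
    """Return the g/gDW contribution of one pseudoreaction substrate."""
    amount = abs(coef)  # substrates carry negative coefficients
    if strategy == "grams":
        return amount   # coefficient is already a mass
    if mw is None:
        raise ValueError("a molecular weight is required for MW-based strategies")
    if strategy == "mw":
        per_mol = mw
    elif strategy == "mw_minus_2h":
        per_mol = mw - TWO_H
    elif strategy == "mw_minus_water":
        per_mol = mw - H2O
    else:
        raise ValueError(f"unknown mass strategy: {strategy!r}")
    return amount * per_mol / 1000.0  # mmol/gDW * g/mol -> g/gDW


carb = substrate_mass("mw", -2.0, 180.0)
protein = substrate_mass("mw_minus_2h", -1.0, 102.016)
rna = substrate_mass("mw_minus_water", -1.0, 118.015)
lipid = substrate_mass("grams", -0.05)
print(carb, protein, rna, lipid)
```

Summing such contributions per component and overall gives the kind of mapping sum_biomass is described as returning.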
…iant)
Detection-only counterpart to remove_duplicate_reactions. Returns
duplicate groups instead of mutating the model. Ignores bounds /
GPR / objective — only stoichiometry is compared, mirroring the
typical curation use case ("find reactions that could be merged").
A new ``ignore_direction=True`` default (yeast-GEM convention)
treats A→B and B→A as duplicates. Set False to require identical
orientation.
Used by yeast-GEM's modelTests port (Tier 3 / phase 5) to flag
duplicate reactions during curation review.
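The effect of ignore_direction can be illustrated with a direction-agnostic stoichiometric signature. The sketch below works on plain dicts and uses exact coefficient matching; signature and find_duplicates are hypothetical names, and the real function may normalise or tolerate differences in ways not described here.

```python
from collections import defaultdict


def signature(stoich, ignore_direction=True):
    """Hashable key for a reaction's stoichiometry {metabolite: coefficient}."""
    forward = tuple(sorted((m, c) for m, c in stoich.items() if c != 0))
    if not ignore_direction:
        return forward
    reverse = tuple(sorted((m, -c) for m, c in stoich.items() if c != 0))
    # Pick one canonical orientation so A->B and B->A share a key.
    return min(forward, reverse)


def find_duplicates(reactions, ignore_direction=True):
    """Return groups of reaction ids with identical signatures (no mutation)."""
    groups = defaultdict(list)
    for rid, stoich in reactions.items():
        groups[signature(stoich, ignore_direction)].append(rid)
    return [ids for ids in groups.values() if len(ids) > 1]


reactions = {
    "r1": {"A": -1, "B": 1},   # A -> B
    "r2": {"A": 1, "B": -1},   # B -> A
    "r3": {"A": -1, "C": 1},   # A -> C
}
print(find_duplicates(reactions))
print(find_duplicates(reactions, ignore_direction=False))
```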
Generic batch curation engine extracted from yeast-GEM's MATLAB
curateMetsRxnsGenes (yeast-GEM phase 6). Adds or updates
metabolites, reactions and genes from pandas DataFrames; a
batch_curate_from_tsv convenience wrapper reads the equivalent TSVs.
Schema (matches yeast-GEM's data/modelCuration/template/ layout):
  mets_df         metNames, comps, formula, charge, inchi, metNotes
                  + any number of MIRIAM-namespace columns
  genes_df        genes, geneShortNames + MIRIAM columns
  rxns_df         rxnNames, grRules, lb, ub, rev, subSystems,
                  eccodes, rxnNotes, rxnReferences,
                  rxnConfidenceScores + MIRIAM columns
  rxns_coeffs_df  rxnNames, metNames, comps, coefficient
                  (one row per (reaction, metabolite) pair)
Match keys:
metabolites — (name, compartment) tuple
genes — gene id
reactions — stoichiometric signature
Existing entities get their annotations overwritten (warning emitted);
new entities are added with fresh ids generated from the supplied
``met_id_prefix`` / ``rxn_id_prefix`` (defaults M_ / R_ per the BiGG
convention; yeast-GEM passes s_ / r_). Width of the existing
zero-padded suffix is preserved so s_0001 → s_0002, not s_2.
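The width-preserving id generation (s_0001 followed by s_0002, not s_2) might look roughly like this. next_id is a hypothetical helper; the fallback for a prefix with no existing ids is an arbitrary choice made for the sketch, not documented behaviour.

```python
import re


def next_id(existing_ids, prefix):
    """Return the next free id for `prefix`, keeping the zero-padded width."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    suffixes = [m.group(1) for m in map(pattern.match, existing_ids) if m]
    if not suffixes:
        return f"{prefix}1"  # assumption: no padding information available
    width = max(len(s) for s in suffixes)
    highest = max(int(s) for s in suffixes)
    # The format spec pads to at least `width` digits and grows if needed.
    return f"{prefix}{highest + 1:0{width}d}"


print(next_id(["s_0001"], "s_"))
print(next_id(["s_0009", "s_0010", "r_0001"], "s_"))
```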
"Everything after the core columns is MIRIAM" — the header of any
extra column becomes the annotation namespace key. Matches MATLAB
behaviour exactly so yeast-GEM's existing TSVs work unchanged, and
projects with different MIRIAM column sets need no code change.
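The "everything after the core columns is MIRIAM" rule can be sketched with the standard library alone. The core column names come from the schema above; split_row is a hypothetical helper, skipping empty cells is an assumption of this sketch, and the single TSV row is illustrative data only.

```python
import csv
import io

CORE_MET_COLUMNS = ("metNames", "comps", "formula", "charge", "inchi", "metNotes")


def split_row(row, core_columns=CORE_MET_COLUMNS):
    """Split one table row into core fields and annotation namespaces."""
    core = {k: row[k] for k in core_columns if k in row}
    # Any non-core header is treated as an annotation namespace key.
    annotation = {k: v for k, v in row.items() if k not in core_columns and v}
    return core, annotation


tsv = (
    "metNames\tcomps\tformula\tcharge\tinchi\tmetNotes\tkegg.compound\tchebi\n"
    "D-glucose\tc\tC6H12O6\t0\t\t\tC00031\tCHEBI:17634\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
core, annotation = split_row(rows[0])
print(core["metNames"], annotation)
```

Because the namespaces are read from the header, a project with a different set of MIRIAM columns would only change its TSV, not the code.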
CurationResult dataclass records what was added vs updated so
callers can verify in tests / CI.
Tests: 13 new (add/update mets, add/update genes, add/update rxns
by stoichiometry, miriam auto-detect, id-width preservation,
combined mets+rxns in one call, missing-metabolite error,
batch_curate_from_tsv round trip).
Three things this fixes:
1. write_yaml_model dropped the !!omap tags entirely. _to_plain
was flattening cobra's OrderedDict to plain dict, which causes
ruamel to emit ordinary block mappings. RAVEN MATLAB's reader
is a line-based parser keyed on !!omap and therefore could not
load any file we wrote. _to_plain now returns OrderedDict so
ruamel re-emits the !!omap tag.
2. eccodes was lost on round-trip — it wasn't in _RXN_FIELDS, so
read_yaml_model didn't capture it into .notes and
write_yaml_model couldn't lift it back. Added.
3. RAVEN MATLAB writes reaction notes as 'rxnNotes'; cobrapy and
this writer use 'notes'. Added a read-time alias so existing
yeast-GEM YAML files (which still say 'rxnNotes') load
cleanly. Writes go out as 'notes' (cobrapy-canonical).
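The read-time alias in item 3 amounts to a small key normalisation step. A minimal sketch on a plain dict follows; normalise_reaction_entry is a hypothetical name, and letting an existing 'notes' key win over 'rxnNotes' is an assumption rather than documented behaviour.

```python
def normalise_reaction_entry(entry):
    """Map the legacy 'rxnNotes' key onto 'notes' for one reaction entry."""
    out = dict(entry)  # leave the caller's mapping untouched
    if "rxnNotes" in out:
        legacy = out.pop("rxnNotes")
        # Assumption: an explicit 'notes' value takes precedence.
        out.setdefault("notes", legacy)
    return out


legacy_entry = {"id": "r_0001", "rxnNotes": "curated"}
print(normalise_reaction_entry(legacy_entry))
```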
Top-level layout now matches RAVEN MATLAB: metaData first, then
metabolites / reactions / genes / compartments, then optional
gecko_light + ec-rxns + ec-enzymes. id/name/version live inside
metaData (RAVEN convention) — cobrapy reading these files still
works, but cobra_model.id ends up None because cobrapy doesn't
know about metaData. raven_python.read_yaml_model lifts
metaData.id/name/version onto model.id / model.name /
model.notes['version'] so the rest of the codebase doesn't care
which layout the file used.
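The metaData lifting can be pictured as follows, with a SimpleNamespace standing in for the model object. lift_metadata is a hypothetical helper, the fallback to top-level id / name keys is an assumption about how the older layout would be handled, and the document values are toy data.

```python
from types import SimpleNamespace


def lift_metadata(doc, model):
    """Copy id / name / version from a parsed YAML document onto a model."""
    meta = doc.get("metaData") or {}
    # Prefer the RAVEN-style metaData block, fall back to top-level keys.
    model.id = meta.get("id", doc.get("id"))
    model.name = meta.get("name", doc.get("name"))
    if "version" in meta:
        model.notes["version"] = meta["version"]
    return model


raven_doc = {"metaData": {"id": "toy", "name": "Toy model", "version": "1.0"}}
flat_doc = {"id": "toy", "name": "Toy model"}

m1 = lift_metadata(raven_doc, SimpleNamespace(id=None, name=None, notes={}))
m2 = lift_metadata(flat_doc, SimpleNamespace(id=None, name=None, notes={}))
print(m1.id, m1.notes, m2.id, m2.notes)
```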
Empty-name genes are no longer emitted with name: ''. That is a
cobrapy quirk that drifts yeast-GEM YAML files away from RAVEN
MATLAB's output.
Verified end-to-end:
* cobra.io.load_yaml_model reads every file the new writer
produces (yeast-GEM and a synthetic fixture).
* RAVEN MATLAB readYAMLmodel reads every file the new writer
produces.
* Round-tripping yeast-GEM through raven_python preserves
2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes,
2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG,
1788 SMILES, 1443 rxn-notes — no semantic drift.
Tests
-----
* tests/test_io_yaml_parity.py is new: covers every RAVEN
extension, the rxnNotes legacy alias, the SMILES YAML-special
character case, metaData-first layout, and cobra readability.
* tests/test_io_yaml.py::test_output_is_cobra_readable adjusts
for the metaData layout (cobra recovers mets/rxns/annotation
but not model.id, by design).
PyYAML is not a project dependency; raven-python uses ruamel.yaml (already pulled in via cobra) everywhere else. The conditions module and its tests still imported PyYAML, which broke pytest collection on clean CI runners with 'No module named yaml'. Both apply.py and the test now use a YAML(typ='safe') instance from ruamel.yaml — same plain-dict semantics as PyYAML's safe_load / safe_dump, no new dependency.
Adds docs/reference/yaml_format.md as the canonical schema reference for the cross-toolchain YAML format (cobrapy / raven-python / RAVEN MATLAB). Covers the full document shape, per-entry field order, RAVEN extensions, the GECKO ec-* sections, the metaData provenance block, number / string / quoting rules, and the cross-reader interoperability matrix. Linked from docs/reference/index.md and the I/O guide.

Reader fix: pre-shim RAVEN MATLAB writes emitted GECKO models with geckoLight: "true" inside the metaData block (not as a top-level gecko_light). The reader now lifts that legacy key out of metaData so model.ec.gecko_light is populated whichever placement the file used. Round-trip writes always use the new top-level form.

Regression tests:
* test_pre_shim_format_loads: synthetic fixture covering every legacy quirk we know about (--- doc marker, plain metaData, geckoLight inside metaData, top-level metabolite smiles, rxnNotes reaction key, integer bounds, double-quoted strings). Each quirk has its own assertion + comment.
* test_pre_shim_yeast_gem_loads_if_available: sanity-loads the real yeast-GEM.yml (2748 mets, 4102 rxns, 1143 genes) and asserts the documented preserved-counts table from the format reference. Skipped on CI runners where the working copy isn't mounted.
Lands the pieces that yeast-GEM extracted upstream as part of its Python port. Four focused commits, all organism-agnostic by construction (yeast-GEM consumes them via thin wrappers with its own config / data files).
What lands
* raven_python.comparison.diff: diff_models(a, b, *, stoichiometry_tol=1e-9, …) + DiffReport + a python -m raven_python.comparison.diff CLI. Strict two-model semantic-equality check complementing the existing N-model compare_models (presence-matrix overview). Used as a CI gate to verify two toolchains produce equivalent models.
* raven_python.annotation: add_sbo_terms with a fully parameterised core (biomass-met / NGAM / pseudoreaction name sets, transport-detector callback, only_last_reaction_for_pseudo legacy-bug flag for yeast-GEM's lock-step migration). load_delta_g_csv / save_delta_g_csv for the standard "id, value" side-car format (column names + note key configurable).
* raven_python.conditions: apply_condition(model, yaml_or_dict) driven by a YAML schema (prelude / cofactor pseudoreaction edits / biomass-stoichiometry delta / bounds / uptake-count check). Project-specific extensions like yeast-GEM's amino_acid_ratio are handled by callers; the upstream contract stays narrow. Also exposes set_reaction_bounds that bypasses cobra's lb<=ub validator for the legitimate cases where a condition lands on an infeasible state.
* raven_python.biomass: BiomassConfig + BiomassComponent dataclasses driving sum_biomass, scale_biomass, rescale_pseudoreaction, and set_gam. Four mass strategies (mw, mw_minus_2h, mw_minus_water, grams) cover the typical pseudoreaction setups.
* raven_python.manipulation.find_duplicate_reactions: detection-only counterpart to the existing remove_duplicate_reactions. Returns groups instead of mutating; ignores bounds / GPRs (only stoichiometry); ignore_direction=True default matches yeast-GEM's findDuplicatedRxns.
* raven_python.curation: batch_curate(model, *, mets_df=…, genes_df=…, rxns_df=…, rxns_coeffs_df=…) + batch_curate_from_tsv(...). DataFrame-first; the TSV path is a thin convenience. Match keys: (name, comp) for mets, gene id for genes, stoichiometric signature for reactions. "Everything after the listed core columns is MIRIAM": the column header becomes the annotation namespace key; matches the yeast-GEM convention exactly so existing TSVs work unchanged. Includes a CurationResult dataclass recording added vs updated.
* raven_python.io.yaml: byte-compatible YAML round-trip with cobrapy and RAVEN MATLAB. The previous writer dropped !!omap tags (because _to_plain flattened cobra's OrderedDict to plain dict), so any file it produced was unreadable by RAVEN MATLAB's line-based reader. It also lost eccodes on round-trip (not in _RXN_FIELDS) and didn't recognise RAVEN MATLAB's legacy rxnNotes reaction key. Fixes:
  * _to_plain now returns OrderedDict, so ruamel re-emits the !!omap tag.
  * eccodes added to _RXN_FIELDS so it round-trips through .notes.
  * rxnNotes accepted as a read-time alias for notes; writes go out as notes (cobrapy-canonical).
  * metaData.id/name/version are lifted onto model.id / model.name / model.notes['version'] for cobra-shape accessors.
  * Empty-name genes are no longer emitted with name: '' (matches RAVEN MATLAB).
  Round-tripping yeast-GEM through read_yaml_model + write_yaml_model preserves 2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes, 2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG, 1788 SMILES, 1443 reaction notes. cobra.io.load_yaml_model and RAVEN MATLAB readYAMLmodel both read the output cleanly. tests/test_io_yaml_parity.py covers every RAVEN extension, the rxnNotes alias, the SMILES YAML-special-character case, and the metaData-first layout.

Tests
84 new tests against synthetic toy models (13 diff, 25 annotation, 14 conditions, 19 biomass, 6 find_duplicate, 13 curation). Full raven-python suite still passes (1 pre-existing openpyxl-missing failure unrelated to this PR).

Lock-step verification (against MATLAB R2024b + RAVEN)
yeast-GEM uses each of these modules end-to-end via yeastgem. The Python toolchain produces SBML semantically equal to MATLAB's commitYeastModel and metrics matching MATLAB's within the tolerances documented in yeast-GEM's porting plan. Details in each commit message.

Companion PRs
* parseYAML, applyCondition, biomass / SBO / deltaG / findDuplicateRxns / curateModelFromTables): Add shared yeast-GEM helpers: parseYAML, applyCondition, biomass, SBO, deltaG, findDuplicateRxns, curateModelFromTables RAVEN#609
* yeast-GEM's pyproject.toml pins this branch as a git+https://…@feat/yeast-gem-shared URL dep; once these modules land on a tagged release the pin will switch to a version constraint.