Add shared yeast-GEM helpers: parseYAML, applyCondition, biomass, SBO, deltaG, findDuplicateRxns, curateModelFromTables #609

Open
edkerk wants to merge 7 commits into main from feat/yeast-gem-shared

edkerk commented May 30, 2026

Generic helpers extracted from yeast-GEM's MATLAB code as part of its Python port. All organism-agnostic; yeast-GEM consumes them via thin shims with its own data files / id prefixes.

What lands

I/O & condition handling

  • io/parseYAML.m — generic YAML reader for arbitrary documents. Complements the existing readYAMLmodel, which only reads the model format. Delegates to py.yaml.safe_load via MATLAB's Python bridge; the Python-side return values are recursively converted to native MATLAB struct / cell. Used by yeast-GEM to read its data/conditions/*.yml and data/yeastgem/ids.yml.

  • core/applyCondition.m — apply a deterministic "condition" to a model from a YAML file or pre-parsed struct. Schema: prelude (reset_exchanges), cofactor_pseudoreaction (metabolite removals + automatic charge rebalance of an H+ met), biomass_stoichiometry_delta (combine with existing coefficients), per-reaction bounds diff, expected_uptake_count sanity check. Project-specific extensions (yeast-GEM's amino_acid_ratio step) are handled by the caller — upstream contract stays narrow.

  • core/curateModelFromTables.m — generic batch-curation engine extracted from yeast-GEM's curateMetsRxnsGenes. Adds or updates metabolites / reactions / genes from TSV inputs. New parameters metPrefix and rxnPrefix (default 'M_' / 'R_') let any GEM project use the engine with its own id conventions; yeast-GEM keeps a thin curateMetsRxnsGenes shim that pins 's_' / 'r_' and forwards here. Schema otherwise identical to the legacy yeast-GEM version, so existing yeast-GEM TSVs work unchanged.
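
As a rough illustration of the narrow applyCondition contract described above, here is a minimal Python sketch over a toy dict-based model (Python is used only for illustration; the helper in this PR is MATLAB). Only `prelude`, `reset_exchanges`, and `expected_uptake_count` are names taken from the description. The `bounds` key, the toy model layout, and the choice to close uptake by setting the lower bound to zero are assumptions, and the cofactor and biomass steps are left out.

```python
from typing import Any


def apply_condition(model: dict[str, Any], condition: dict[str, Any]) -> dict[str, Any]:
    """Apply a condition dict to a toy model and return a modified copy.

    The toy model maps reaction id -> {"lb": float, "ub": float, "exchange": bool}.
    """
    rxns = {rid: dict(r) for rid, r in model["reactions"].items()}

    # prelude.reset_exchanges: close uptake on every exchange reaction first
    # (assumed meaning: lower bound set to zero).
    if condition.get("prelude", {}).get("reset_exchanges", False):
        for rxn in rxns.values():
            if rxn["exchange"]:
                rxn["lb"] = 0.0

    # Per-reaction bounds diff: only the listed keys are overwritten.
    # "bounds" is a hypothetical key name.
    for rid, bounds in condition.get("bounds", {}).items():
        if rid not in rxns:
            raise KeyError(f"condition refers to unknown reaction {rid!r}")
        rxns[rid].update(bounds)

    # expected_uptake_count: sanity check on how many exchanges allow uptake.
    expected = condition.get("expected_uptake_count")
    if expected is not None:
        open_uptakes = sum(1 for r in rxns.values() if r["exchange"] and r["lb"] < 0)
        if open_uptakes != expected:
            raise ValueError(f"expected {expected} open uptakes, found {open_uptakes}")

    return {"reactions": rxns}


toy_model = {"reactions": {
    "EX_glc": {"lb": -10.0, "ub": 1000.0, "exchange": True},
    "EX_o2": {"lb": -1000.0, "ub": 1000.0, "exchange": True},
    "PGI": {"lb": -1000.0, "ub": 1000.0, "exchange": False},
}}
anaerobic = {
    "prelude": {"reset_exchanges": True},
    "bounds": {"EX_glc": {"lb": -1.0}},
    "expected_uptake_count": 1,
}
result = apply_condition(toy_model, anaerobic)
```

Keeping the function pure (it returns a copy) makes it easy for a caller to run project-specific steps before or after it, which is the division of labour the description assigns to the caller.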

Biomass helpers

  • core/getBiomassFractions.m — compute mass fraction (g/gDW) per biomass component plus the total, parameterised by a biomassConfig struct. Supports four mass_strategy modes: mw, mw_minus_2h (charged tRNAs), mw_minus_water (RNA / DNA polymerisation), grams (stoichiometry already in g/gDW). yeast-GEM's sumBioMass becomes a thin shim.

  • core/scaleBiomassFraction.m — rescale a component to a target g/gDW, optionally adjusting a second component so the total stays at 1 g/gDW. Replaces the body of yeast-GEM's scaleBioMass.

  • core/scaleBiomassPseudoreaction.m — multiply the substrate coefficients of one component pseudoreaction by a factor and rebalance H+ via biomassConfig.proton_met to preserve charge neutrality. Replaces the body of yeast-GEM's rescalePseudoReaction.

  • core/setGAM.m — set the GAM coefficient on the biomass pseudoreaction for a configurable list of cofactor metabolite names; optionally fix an NGAM reaction's bounds. Replaces the body of yeast-GEM's changeGAM.
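
The four mass_strategy modes can be sketched as follows. This is an illustrative Python sketch, not the MATLAB implementation: it assumes stoichiometric coefficients are in mmol/gDW and molecular weights in g/mol, and the function names, the toy component table, and its numbers are hypothetical.

```python
WATER_MW = 18.015  # g/mol
TWO_H_MW = 2.016   # g/mol, two hydrogen atoms


def substrate_mass(coeff_mmol: float, mw: float, strategy: str) -> float:
    """Return the g/gDW contributed by one substrate of a component pseudoreaction."""
    amount = abs(coeff_mmol)
    if strategy == "grams":
        # Stoichiometry is already expressed in g/gDW; the MW is ignored.
        return amount
    if strategy == "mw":
        effective_mw = mw
    elif strategy == "mw_minus_2h":
        effective_mw = mw - TWO_H_MW
    elif strategy == "mw_minus_water":
        effective_mw = mw - WATER_MW
    else:
        raise ValueError(f"unknown mass strategy: {strategy!r}")
    # mmol/gDW * g/mol = mg/gDW, so divide by 1000 to get g/gDW.
    return amount * effective_mw / 1000.0


def biomass_fractions(components: dict) -> tuple[dict, float]:
    """components maps name -> (strategy, [(coeff_mmol_per_gDW, mw_g_per_mol), ...])."""
    fractions = {
        name: sum(substrate_mass(coeff, mw, strategy) for coeff, mw in substrates)
        for name, (strategy, substrates) in components.items()
    }
    return fractions, sum(fractions.values())


# Toy numbers chosen so the fractions are easy to check by hand.
toy_components = {
    "protein": ("mw_minus_2h", [(-4.0, 102.016)]),
    "RNA": ("mw_minus_water", [(-0.2, 318.015)]),
    "carbohydrate": ("mw", [(-3.0, 180.0)]),
}
fractions, total = biomass_fractions(toy_components)
```

With these toy inputs the three components contribute about 0.4, 0.06, and 0.54 g/gDW, so the total is about 1 g/gDW, which is the invariant scaleBiomassFraction is described as preserving when it balances a second component.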

Annotation helpers

  • core/assignSBOterms.m — generic SBO term assignment ported from yeast-GEM's addSBOterms. Parameterised entirely by an opts struct (biomass met names / suffixes, biomass / NGAM reaction names, pseudoreaction substrings). Writes via editMiriam(..., 'fill') so pre-existing SBO annotations are preserved. Includes opts.onlyLastReactionForPseudo as a bug-compat flag — the legacy yeast-GEM loop used for i = numel(model.rxns) instead of 1:numel(...), tagging only the last reaction; default off, yeast-GEM's shim turns it on so its saveYeastModel output stays byte-stable.

  • io/loadDeltaGfromCSV.m / io/saveDeltaGtoCSV.m — generic load / save of metDeltaG and rxnDeltaG fields against two-column CSVs (id, deltaG). yeast-GEM's loadDeltaG / saveDeltaG become thin wrappers that supply the project paths.
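
A two-column (id, deltaG) round-trip might look like the Python sketch below. The header row, the function names, and the choice to return NaN for ids missing from the file are assumptions for illustration; the description only states that the CSVs have the two columns id and deltaG.

```python
import csv
import math
import os
import tempfile


def save_delta_g(path: str, ids: list[str], values: list[float]) -> None:
    """Write one (id, deltaG) row per entry, with a header row (assumed)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "deltaG"])
        for ident, value in zip(ids, values):
            # repr() of a float round-trips exactly through float().
            writer.writerow([ident, repr(float(value))])


def load_delta_g(path: str, ids: list[str]) -> list[float]:
    """Return deltaG values aligned to `ids`; ids missing from the file get NaN."""
    with open(path, newline="") as handle:
        table = {row["id"]: float(row["deltaG"]) for row in csv.DictReader(handle)}
    return [table.get(ident, math.nan) for ident in ids]


met_ids = ["m_0001", "m_0002", "m_0003"]
met_delta_g = [-237.18, 0.0, 12.5]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "metDeltaG.csv")
    save_delta_g(csv_path, met_ids, met_delta_g)
    reloaded = load_delta_g(csv_path, met_ids)
    with_extra = load_delta_g(csv_path, met_ids + ["m_9999"])
```

Aligning the loaded values to the model's id order, rather than trusting row order in the file, keeps the loader robust to CSVs that were sorted or edited by hand.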

YAML I/O alignment

  • io/writeYAMLmodel.m + io/readYAMLmodel.m — three-way alignment between RAVEN MATLAB, raven-python, and cobrapy YAML output. Cobrapy is treated as canonical for the core (id / name / metabolites / reactions / genes / compartments / annotation / bounds); RAVEN-only fields (inchis, deltaG, metFrom, eccodes, rxnFrom, references, confidence_score, protein, metaData, ec-rxns / ec-enzymes) are additive extensions kept in their familiar positions. Writer changes:

    • Field order now matches cobrapy (charge before formula; notes before annotation; objective_coefficient after gene_reaction_rule; subsystem before annotation).
    • preserveQuotes default flipped to false to match cobrapy's minimal-quoting style; values that need quoting (SMILES strings with [O-], booleans, leading flow indicators, embedded : ) are still quoted defensively via a new needsYamlQuoting helper.
    • Reaction notes are now written as notes (cobrapy-canonical), not rxnNotes. Reader accepts both for backward compat.
    • Metabolite smiles moves into annotation.smiles (cobrapy convention). Reader still accepts top-level smiles.
    • gecko_light is emitted at the top level (matches raven-python) instead of inside metaData.
    • Whole-number bounds emit as 1000.0 not 1000 (matches cobrapy / Python float repr).
    • Empty metabolites: blocks emit !!omap [] so the file is a valid YAML 1.2 document.
    • Document-start marker (---) dropped to match cobrapy's bare !!omap root.

    Reader changes: accepts cobrapy's root-level id / name / version / gecko_light; accepts the !!omap-tagged metaData header; accepts both notes and rxnNotes reaction keys; defensive against subSystems shapes produced by the reader's own post-processing flatten.

    Verified: cobra.io.load_yaml_model and raven_python.read_yaml_model both successfully read every file the new writer produces (cobra mini.yml and the 2748-met yeast-GEM.yml). Round-tripping yeast-GEM through writeYAMLmodel preserves 2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes, 2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG, 1788 SMILES, 1443 reaction notes — no semantic drift.
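
The defensive-quoting rule can be illustrated with a small Python sketch. It is a hypothetical stand-in for needsYamlQuoting, not a port of it: it covers only the cases the description names (leading flow indicators, embedded ": ", boolean and null tokens) plus empty or whitespace-padded strings, and it does not handle numeric-looking strings, leading "- " or "? ", or " #" comments.

```python
_RESERVED_TOKENS = {"true", "false", "yes", "no", "on", "off", "null", "~"}
_LEADING_INDICATORS = set("[]{},&*!|>'\"%@`#")


def needs_yaml_quoting(value: str) -> bool:
    """Heuristic: would this string be misparsed if emitted as a plain scalar?"""
    if value == "" or value != value.strip():
        return True
    if value.lower() in _RESERVED_TOKENS:
        return True  # would be read back as a boolean or null
    if value[0] in _LEADING_INDICATORS:
        return True  # e.g. '[' would open a flow sequence
    if ": " in value or value.endswith(":"):
        return True  # would be read as a mapping key
    return False


def emit_scalar(value: str) -> str:
    """Quote only when needed, using YAML single-quoted style."""
    if needs_yaml_quoting(value):
        return "'" + value.replace("'", "''") + "'"
    return value


examples = ["glucose", "[O-]C(=O)C", "true", "note: see ref", "CC(=O)O"]
emitted = [emit_scalar(text) for text in examples]
```

In this sketch a SMILES string is quoted only when a bracket is its first character; whether the real helper also quotes brackets in other positions is not stated in the description.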

Misc

  • core/findDuplicateRxns.m — return reaction pairs sharing identical stoichiometry, treating A->B and B->A as duplicates by default. Counterpart of raven_python.manipulation.find_duplicate_reactions and the upstream version of yeast-GEM's findDuplicatedRxns (which becomes a thin printing wrapper).
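
One way to detect such duplicates is to reduce each reaction to a direction-independent key and group on it. The Python sketch below shows the idea on a toy dict of stoichiometries; the function name, the `ignore_direction` flag, and the input layout are hypothetical, and only exact coefficient matches count (scaled multiples are not treated as duplicates here).

```python
from itertools import combinations


def find_duplicate_reactions(stoich: dict, ignore_direction: bool = True) -> list:
    """Return pairs of reaction ids whose stoichiometry is identical.

    stoich maps reaction id -> {metabolite id: coefficient}.
    """
    groups: dict = {}
    for rid, coeffs in stoich.items():
        key = frozenset((met, c) for met, c in coeffs.items() if c != 0)
        if ignore_direction:
            # Canonicalise: a reaction and its reverse map to the same key.
            reverse = frozenset((met, -c) for met, c in key)
            key = min(key, reverse, key=sorted)
        groups.setdefault(key, []).append(rid)

    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(ids, 2))
    return pairs


toy_stoich = {
    "R1": {"A": -1, "B": 1},   # A -> B
    "R2": {"A": 1, "B": -1},   # B -> A
    "R3": {"A": -1, "C": 1},   # A -> C
    "R4": {"A": -1, "B": 1},   # A -> B again
}
undirected = find_duplicate_reactions(toy_stoich)
directed = find_duplicate_reactions(toy_stoich, ignore_direction=False)
```

Grouping by a hashable key avoids comparing every pair of columns, which matters for models with thousands of reactions.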

Lock-step verification

A MATLAB equivalence harness loaded yeast-GEM and ran every legacy implementation (extracted from git show HEAD:...) against its refactored shim on the same model state, asserting bitwise match on the S matrix, bounds, and miriam annotations (or near-zero float diff for deltaG / sumBioMass scalars). 10 / 10 checks pass — sumBioMass, rescalePseudoReaction (all 9 components incl. lipid aggregation), scaleBioMass (with balance_out), changeGAM (with and without NGAM), addSBOterms (mets and rxns), loadDeltaG, saveDeltaGtoCSV round-trip, and findDuplicateRxns. Details in each commit message and in yeast-GEM's PORTING_PLAN.md.
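
The comparison rule used by such a harness (exact match for structural fields, a tolerance for floating-point scalars) can be sketched in Python as below. This is not the MATLAB harness itself; the snapshot layout, field names, and tolerances are assumptions for illustration.

```python
import math


def compare_states(legacy: dict, refactored: dict,
                   float_fields: tuple = (), rel_tol: float = 1e-9) -> list:
    """Return the names of fields that differ between two model snapshots."""
    mismatches = []
    for field in sorted(set(legacy) | set(refactored)):
        if field not in legacy or field not in refactored:
            mismatches.append(field)
            continue
        a, b = legacy[field], refactored[field]
        if field in float_fields:
            # Scalars such as a total biomass may differ by rounding noise.
            same = math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12)
        else:
            # Structural fields (S matrix, bounds, annotations) must match exactly.
            same = a == b
        if not same:
            mismatches.append(field)
    return mismatches


legacy_state = {"S": [[-1, 1], [0, -1]], "lb": [0.0, -1000.0], "total_biomass": 1.0}
shim_state = {"S": [[-1, 1], [0, -1]], "lb": [0.0, -1000.0], "total_biomass": 1.0 + 1e-15}
broken_state = {"S": [[-1, 1], [0, -2]], "lb": [0.0, -1000.0], "total_biomass": 1.0}

ok = compare_states(legacy_state, shim_state, float_fields=("total_biomass",))
bad = compare_states(legacy_state, broken_state, float_fields=("total_biomass",))
```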

Companion PRs

edkerk added 2 commits May 30, 2026 02:02
Two generic helpers extracted from yeast-GEM's MATLAB port (see the
yeast-GEM/code/python/PORTING_PLAN.md and UPSTREAM_CANDIDATES.md
documents for the broader migration). Both are organism-agnostic and
useful to any GEM project that wants to keep configuration / condition
presets as data rather than as code.

io/readYAML.m
    Read an arbitrary YAML document into a MATLAB struct / cell tree.
    Complements readYAMLmodel, which is specialised for the cobra model
    schema; readYAML is for free-form configuration files.
    Delegates to py.yaml.safe_load via MATLAB's Python bridge, with a
    recursive py.dict / py.list -> struct / cell converter and a
    matlabFieldName sanitiser for non-alphanumeric YAML keys.

core/applyCondition.m
    Apply a deterministic "condition" to a model. The schema is
    intentionally narrow (prelude.reset_exchanges, cofactor pseudo-
    reaction metabolite removals + charge rebalance, biomass
    stoichiometry delta, per-reaction bounds diff, expected_uptake_count
    sanity check). The function accepts either a YAML file path or a
    pre-parsed struct.
    Project-specific extensions (e.g. yeast-GEM's amino_acid_ratio step
    that rewrites a protein pseudoreaction from a side-car TSV) are
    handled by the *caller* before / after this function — the upstream
    contract is intentionally kept narrow.

Generic batch curation engine extracted from yeast-GEM's
curateMetsRxnsGenes (see yeast-GEM phase 6 of the porting plan).
Adds or updates metabolites, reactions and genes from TSV files;
the schemas match yeast-GEM's data/modelCuration/template/ layout
exactly so existing TSVs work unchanged.

Differences from the yeast-GEM original:

- Renamed to curateModelFromTables (the organism-agnostic name).
  yeast-GEM keeps a thin curateMetsRxnsGenes shim that adds the
  yeast prefixes and forwards here.
- metPrefix and rxnPrefix are now parameters with defaults 'M_' and
  'R_' (cobrapy / BiGG convention). Pass 's_' and 'r_' for yeast-GEM.
- Behaviour otherwise identical: match metabolites by name[comp],
  reactions by stoichiometry, genes by id; anything after the listed
  core columns becomes a MIRIAM annotation whose namespace key is the
  column header.

The original name, readYAML, shared a prefix with readYAMLmodel (which reads
cobra-format model YAMLs), so a typo or autocomplete slip could easily call
the wrong function. parseYAML uses a different verb to avoid that. Function
docstring, error identifiers, and the single call site in
core/applyCondition.m are updated.

edkerk force-pushed the feat/yeast-gem-shared branch from efbd30e to e71960b May 30, 2026 23:19
edkerk added 3 commits May 31, 2026 01:38
Moves yeast-GEM's biomass-pseudoreaction helpers upstream so any
project with a parameterisable biomass equation can share them.
Mirrors raven_python.biomass; the yeast-GEM-side shims (sumBioMass,
scaleBioMass, rescalePseudoReaction, changeGAM) become thin wrappers
that build a biomassConfig from data/yeastgem/ids.yml.

  getBiomassFractions(model, cfg)
      Compute g/gDW per component plus the total. Supports four
      mass strategies: mw, mw_minus_2h, mw_minus_water, grams.
  scaleBiomassFraction(model, cfg, name, value, balanceOut)
      Rescale a component to a target fraction; optionally balance
      a second component so the total stays at 1 g/gDW.
  scaleBiomassPseudoreaction(model, cfg, name, factor)
      Multiply substrate coefficients by factor, rebalance H+ via
      cfg.proton_met to keep charge neutrality.
  setGAM(model, value, biomassRxn, cofactorMetNames, ngamRxn, ngamValue)
      Set the GAM coefficient for the configured cofactors, and
      optionally fix the NGAM reaction's bounds.
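
The scale-then-rebalance step of scaleBiomassPseudoreaction can be sketched in Python as follows. This is an illustrative reading of the description, not the MATLAB code: the dict-based inputs and metabolite names are hypothetical, and it assumes the proton coefficient is recomputed from scratch so that the summed charge of the reaction is zero.

```python
def scale_pseudoreaction(coeffs: dict, charges: dict, factor: float, proton_met: str) -> dict:
    """Scale substrate coefficients by `factor`, then rebalance the proton.

    coeffs maps metabolite -> stoichiometric coefficient (substrates negative);
    charges maps metabolite -> charge.
    """
    scaled = {}
    for met, coeff in coeffs.items():
        if met == proton_met:
            continue  # recomputed below
        # Only substrates (negative coefficients) are scaled; products are kept.
        scaled[met] = coeff * factor if coeff < 0 else coeff

    # Net charge of everything except the proton.
    net_charge = sum(coeff * charges[met] for met, coeff in scaled.items())
    # Choose the proton coefficient that cancels it.
    proton_coeff = -net_charge / charges[proton_met]
    if proton_coeff != 0:
        scaled[proton_met] = proton_coeff
    return scaled


# Toy pseudoreaction: 2 A(-1) + 2 H+ -> 1 component (neutral).
toy_coeffs = {"A_minus": -2.0, "h": -2.0, "component": 1.0}
toy_charges = {"A_minus": -1, "h": 1, "component": 0}
rescaled = scale_pseudoreaction(toy_coeffs, toy_charges, 1.5, "h")
residual_charge = sum(coeff * toy_charges[met] for met, coeff in rescaled.items())
```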
assignSBOterms
  Generic SBO term assignment ported from yeast-GEM's addSBOterms.
  Parameterised entirely by an opts struct (biomass metabolite
  names / suffixes, biomass / NGAM reaction names, pseudoreaction
  substrings). Writes via editMiriam(..., 'fill') so existing
  SBO annotations are preserved.

  Includes opts.onlyLastReactionForPseudo as a bug-compat flag —
  the legacy yeast-GEM loop used 'for i = numel(model.rxns)'
  instead of '1:numel(...)', tagging only the last reaction.
  Default off; the yeast-GEM shim sets it to true to keep its
  saveYeastModel output byte-equivalent.
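
The effect of the flag can be illustrated in Python. In MATLAB, `for i = numel(x)` iterates over the single value `numel(x)`, so the loop body runs once on the last element. The sketch below mimics that with a hypothetical `only_last` parameter; the function name and the substring test are illustrative only and do not show the actual SBO terms assigned.

```python
def tag_pseudoreactions(rxn_names: list, substrings: list, only_last: bool = False) -> list:
    """Return the reaction names that would be tagged as pseudoreactions."""
    if not rxn_names:
        return []
    # only_last=True reproduces the legacy loop, which visited just the last index.
    indices = [len(rxn_names) - 1] if only_last else range(len(rxn_names))
    tagged = []
    for i in indices:
        if any(sub in rxn_names[i] for sub in substrings):
            tagged.append(rxn_names[i])
    return tagged


names = ["protein pseudoreaction", "glucose transport", "lipid pseudoreaction"]
fixed_behaviour = tag_pseudoreactions(names, ["pseudoreaction"])
legacy_behaviour = tag_pseudoreactions(names, ["pseudoreaction"], only_last=True)
```

Keeping the old behaviour behind an opt-in flag lets the generic helper default to the corrected loop while a downstream project can still reproduce its historical output exactly.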

loadDeltaGfromCSV / saveDeltaGtoCSV
  Generic load / save of metDeltaG and rxnDeltaG fields against
  any two-column CSV (id, deltaG). yeast-GEM's loadDeltaG /
  saveDeltaG become thin wrappers that supply the project paths.

Find reaction pairs that share identical stoichiometry, treating
A->B and B->A as duplicates by default. Counterpart of
raven_python.manipulation.find_duplicate_reactions and the
upstream version of yeast-GEM's findDuplicatedRxns (which becomes
a thin printing wrapper).

edkerk changed the title from "Add readYAML, applyCondition, and curateModelFromTables (shared with yeast-GEM)" to "Add shared yeast-GEM helpers: parseYAML, applyCondition, biomass, SBO, deltaG, findDuplicateRxns, curateModelFromTables" May 30, 2026

Tightens the YAML reader and writer so the three toolchains (RAVEN
MATLAB, raven-python, cobrapy) produce / consume the same file
shape. Cobrapy is the canonical core; RAVEN extras (inchis, deltaG,
metFrom, eccodes, rxnFrom, references, confidence_score, protein,
metaData, ec-rxns / ec-enzymes) are additive, namespaced extensions.

Writer changes
--------------
- Field order in metabolites and reactions now matches cobrapy
  (charge before formula; notes before annotation; objective_coefficient
  right after gene_reaction_rule; subsystem before annotation).
  RAVEN-only extras follow the cobrapy block, in a stable order.

- preserveQuotes default flipped to false. Strings now go out
  unquoted, matching cobrapy's minimal-quoting style. needsYamlQuoting
  is consulted per value so anything that would otherwise be
  misparsed (leading flow indicators, embedded ': ', booleans /
  null tokens) is still quoted defensively. SMILES strings — which
  routinely contain '[' / ']' — fall into this branch automatically.

- Reaction notes are now written as 'notes' (not 'rxnNotes').
  Cobrapy and raven_python use 'notes'; the reader (below) still
  accepts the old key as an alias so existing yeast-GEM YAML files
  load without churn.

- Metabolite SMILES is emitted inside the annotation block
  (annotation.smiles) rather than as a top-level metabolite key.
  Cobrapy's convention; raven_python already does it this way.
  Reader still accepts top-level smiles for backward compatibility.

- gecko_light is written as a top-level scalar (matches
  raven_python.io.yaml) instead of buried inside metaData. The
  reader continues to accept both spellings (gecko_light and
  geckoLight).

- Empty reaction.metabolites blocks are written 'metabolites:
  !!omap []' to stay a valid YAML 1.2 document (the previous
  output emitted an empty !!omap with no value, which both ruamel
  and PyYAML reject).

- writeMetadata respects preserveQuotes; emits the block as
  '!!omap'; preserves model.date if present (no longer stomps the
  date on every write); skips the 'name: blankName' / 'id: blankID'
  shape when those fields are actually populated.

- Numbers come out via formatNumber: whole-number floats become
  '1000.0' (cobrapy convention) rather than '1000' (Python int
  repr). Avoids ScalarInt vs ScalarFloat ruamel diff noise on
  round-trip.

- Document-start marker (---) dropped to match cobrapy's bare
  '!!omap' root.
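
The whole-number rule is easy to see from the Python side, where the float repr already carries the trailing ".0". The sketch below is a hypothetical Python counterpart of formatNumber, not the MATLAB helper, and it does not address infinities or NaN.

```python
def format_number(value) -> str:
    """Format a numeric bound the way Python's float repr does."""
    # repr(1000.0) is '1000.0', whereas repr(1000) is '1000'; forcing float
    # keeps a whole-number bound from being read back as an integer.
    return repr(float(value))


formatted = [format_number(v) for v in (1000, -1000.0, 0, 0.5)]
```

Emitting every bound as a float means a YAML loader sees the same scalar type on both sides of a round-trip, which is what removes the int-versus-float diff noise mentioned above.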

Reader changes
--------------
- Accepts cobrapy's root-level id / name / version / gecko_light
  keys. The earlier reader treated them as 'Unknown entry' once the
  top-level walker fell out of a known section, so cobra-written
  YAML files failed with '"id" field cannot be empty'; they can now
  be ingested directly.

- '- metaData: !!omap' (the !!omap-tagged variant emitted by
  cobrapy's CommentedMap dumper) is now recognized as the metaData
  section header; the bare '- metaData:' form still works.

- 'notes' is accepted as the canonical reaction-side key; 'rxnNotes'
  is kept as a backward-compatible alias.

- writeField now copes with subSystems that the reader's post-
  processing flattened from a Nx1 cell-of-cells to a char column
  (or to a shorter list than numel(rxns) when subsystem coverage
  is partial). Used to crash with an index-exceeds error on cobra-
  written models like mini.yml.
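
The backward-compatible key handling on the reader side amounts to "prefer the canonical key, fall back to the legacy one". A minimal Python sketch over already-parsed entries is shown below; the function names and the dict layout are hypothetical, and only the key names (notes, rxnNotes, annotation.smiles, smiles, gecko_light, geckoLight) come from the text above.

```python
def read_reaction_notes(entry: dict) -> str:
    """Prefer the canonical 'notes' key; accept legacy 'rxnNotes' as an alias."""
    return entry.get("notes", entry.get("rxnNotes", ""))


def read_metabolite_smiles(entry: dict) -> str:
    """Prefer annotation.smiles; accept a legacy top-level 'smiles' key."""
    annotation = entry.get("annotation", {})
    return annotation.get("smiles", entry.get("smiles", ""))


def read_gecko_light(root: dict) -> bool:
    """Accept both spellings of the flag; default to False when absent."""
    return bool(root.get("gecko_light", root.get("geckoLight", False)))


new_style = {"id": "r_0001", "notes": "curated"}
old_style = {"id": "r_0002", "rxnNotes": "legacy note"}
met_new = {"id": "s_0001", "annotation": {"smiles": "CC(=O)O"}}
met_old = {"id": "s_0002", "smiles": "O"}
```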

Verified
--------
- cobra.io.load_yaml_model successfully reads every file produced
  by the new writer (mini.yml, yeast-GEM.yml).
- raven_python.read_yaml_model successfully reads every file
  produced by the new writer.
- Round-tripping yeast-GEM.yml through writeYAMLmodel preserves
  2748 metabolites, 4102 reactions, 1143 genes, 2411 eccodes,
  3984 reaction deltaG entries, 2696 metabolite deltaG entries,
  1788 SMILES, 1443 reaction notes — no semantic drift, full
  byte-stability through the round-trip.