Add shared yeast-GEM helpers: parseYAML, applyCondition, biomass, SBO, deltaG, findDuplicateRxns, curateModelFromTables#609
Open
edkerk wants to merge 7 commits into
Two generic helpers extracted from yeast-GEM's MATLAB port (see the
yeast-GEM/code/python/PORTING_PLAN.md and UPSTREAM_CANDIDATES.md
documents for the broader migration). Both are organism-agnostic and
useful to any GEM project that wants to keep configuration / condition
presets as data rather than as code.
io/readYAML.m
Read an arbitrary YAML document into a MATLAB struct / cell tree.
Complements readYAMLmodel, which is specialised for the cobra model
schema; readYAML is for free-form configuration files.
Delegates to py.yaml.safe_load via MATLAB's Python bridge, with a
recursive py.dict / py.list -> struct / cell converter and a
matlabFieldName sanitiser for non-alphanumeric YAML keys.
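The key sanitising and recursive conversion described above can be sketched in Python. This is an illustration only, not the PR's MATLAB code: the exact rule used by matlabFieldName is not stated here, so the rule below (replace anything outside letters, digits and underscore, then prefix names that do not start with a letter) is an assumption, and both function names are hypothetical.

```python
import re


def matlab_field_name(key):
    """Hypothetical sanitiser: turn a YAML key into a valid MATLAB field name.

    Assumed rule: characters outside [A-Za-z0-9_] become '_', and a name
    that does not start with a letter gets an 'x' prefix.
    """
    name = re.sub(r"[^A-Za-z0-9_]", "_", str(key))
    if not re.match(r"[A-Za-z]", name):
        name = "x" + name
    return name


def sanitise_tree(node):
    """Recursively walk a parsed YAML tree, sanitising every mapping key."""
    if isinstance(node, dict):
        return {matlab_field_name(k): sanitise_tree(v) for k, v in node.items()}
    if isinstance(node, (list, tuple)):
        return [sanitise_tree(v) for v in node]
    return node  # scalars pass through unchanged


parsed = {"ec-rxns": [{"kcat value": 1.5}], "2h": True}
print(sanitise_tree(parsed))
```

Two distinct YAML keys can collapse to the same field name under a rule like this (for example `a-b` and `a_b`), so a real implementation would need to decide how to handle collisions.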
core/applyCondition.m
Apply a deterministic "condition" to a model. The schema is
intentionally narrow (prelude.reset_exchanges, cofactor pseudo-
reaction metabolite removals + charge rebalance, biomass
stoichiometry delta, per-reaction bounds diff, expected_uptake_count
sanity check). The function accepts either a YAML file path or a
pre-parsed struct.
Project-specific extensions (e.g. yeast-GEM's amino_acid_ratio step
that rewrites a protein pseudoreaction from a side-car TSV) are
handled by the *caller* before / after this function — the upstream
contract is intentionally kept narrow.
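As a rough illustration of the narrow contract, the sketch below applies three of the listed steps (reset exchanges, per-reaction bounds diff, expected_uptake_count check) to a plain dictionary of bounds. It is Python rather than MATLAB, and several details are assumptions rather than the PR's behaviour: the shape of the `bounds` block, the reading of reset_exchanges as "close all uptake", and the definition of an uptake as an exchange reaction with a negative lower bound.

```python
def apply_condition(bounds, exchanges, condition):
    """Return new bounds after applying a condition (hypothetical schema).

    bounds:    {rxn_id: (lower, upper)}
    exchanges: list of exchange reaction ids
    condition: pre-parsed dict, e.g. loaded from a YAML file
    """
    new_bounds = dict(bounds)  # leave the caller's model state untouched

    # Assumed meaning of prelude.reset_exchanges: block uptake everywhere.
    if condition.get("prelude", {}).get("reset_exchanges"):
        for rxn in exchanges:
            new_bounds[rxn] = (0.0, new_bounds[rxn][1])

    # Per-reaction bounds diff: only the listed reactions change.
    for rxn, (lower, upper) in condition.get("bounds", {}).items():
        if rxn not in new_bounds:
            raise KeyError(f"unknown reaction in condition: {rxn}")
        new_bounds[rxn] = (float(lower), float(upper))

    # Sanity check: the number of open uptakes must match the preset.
    expected = condition.get("expected_uptake_count")
    if expected is not None:
        open_uptakes = sum(1 for rxn in exchanges if new_bounds[rxn][0] < 0)
        if open_uptakes != expected:
            raise ValueError(
                f"expected {expected} open uptakes, found {open_uptakes}"
            )
    return new_bounds
```

Keeping the function deterministic and side-effect free makes the same preset reproducible across toolchains, which is the point of storing conditions as data.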
Generic batch curation engine extracted from yeast-GEM's
curateMetsRxnsGenes (see yeast-GEM phase 6 of the porting plan). Adds
or updates metabolites, reactions and genes from TSV files; the schemas
match yeast-GEM's data/modelCuration/template/ layout exactly so
existing TSVs work unchanged.
Differences from the yeast-GEM original:
- Renamed to curateModelFromTables (the organism-agnostic name).
  yeast-GEM keeps a thin curateMetsRxnsGenes shim that adds the yeast
  prefixes and forwards here.
- metPrefix and rxnPrefix are now parameters with defaults 'M_' and
  'R_' (cobrapy / BiGG convention). Pass 's_' and 'r_' for yeast-GEM.
- Behaviour otherwise identical: match metabolites by name[comp],
  reactions by stoichiometry, genes by id; anything after the listed
  core columns becomes a MIRIAM annotation whose namespace key is the
  column header.
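The rule that every column after the core columns becomes a MIRIAM annotation keyed by its header can be sketched as follows. This is a Python illustration, not the MATLAB implementation; the core column names and the annotation values in the example are placeholders, since the actual template schema is not reproduced here.

```python
import csv
import io


def read_curation_table(text, core_columns):
    """Split each TSV row into core fields and MIRIAM-style annotations.

    Any column not named in core_columns is treated as an annotation whose
    namespace is the column header; empty cells are skipped.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
        core = {col: record[col] for col in core_columns}
        miriam = {
            header: value
            for header, value in record.items()
            if header not in core_columns and value
        }
        rows.append({"core": core, "miriam": miriam})
    return rows


# Hypothetical table: 'name' and 'comp' stand in for the real core columns.
table = (
    "name\tcomp\tnamespace.a\tnamespace.b\n"
    "metabolite one\tc\tA:1\tB:1\n"
    "metabolite two\tm\t\tB:2\n"
)
for row in read_curation_table(table, ["name", "comp"]):
    print(row)
```

Deriving the namespace from the header means a new annotation source only requires a new column, with no code change.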
The original name, readYAML, shared a prefix with readYAMLmodel (which reads cobra-format model YAMLs), so a typo or an autocomplete slip could easily call the wrong function. parseYAML uses a different verb to avoid that. The function docstring, error identifiers, and the single call site in core/applyCondition.m are updated.
Moves yeast-GEM's biomass-pseudoreaction helpers upstream so any
project with a parameterisable biomass equation can share them.
Mirrors raven_python.biomass; the yeast-GEM-side shims (sumBioMass,
scaleBioMass, rescalePseudoReaction, changeGAM) become thin wrappers
that build a biomassConfig from data/yeastgem/ids.yml.
getBiomassFractions(model, cfg)
Compute g/gDW per component plus the total. Supports four
mass strategies: mw, mw_minus_2h, mw_minus_water, grams.
scaleBiomassFraction(model, cfg, name, value, balanceOut)
Rescale a component to a target fraction; optionally balance
a second component so the total stays at 1 g/gDW.
scaleBiomassPseudoreaction(model, cfg, name, factor)
Multiply substrate coefficients by factor, rebalance H+ via
cfg.proton_met to keep charge neutrality.
setGAM(model, value, biomassRxn, cofactorMetNames, ngamRxn, ngamValue)
Set the GAM coefficient for the configured cofactors, and
optionally fix the NGAM reaction's bounds.
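The scale-then-rebalance step of scaleBiomassPseudoreaction can be illustrated with a small Python sketch. It assumes the usual sign convention (substrates carry negative coefficients), assumes the proton itself is excluded from the scaling, and restores neutrality by adjusting only the proton coefficient; the function and argument names are hypothetical and the real MATLAB function may differ in these details.

```python
def scale_pseudoreaction(stoich, charges, factor, proton_met):
    """Scale substrate coefficients and rebalance charge with protons.

    stoich:  {metabolite: coefficient}, substrates negative
    charges: {metabolite: charge}
    """
    # Multiply every substrate except the proton by the factor.
    scaled = {
        met: coeff * factor if coeff < 0 and met != proton_met else coeff
        for met, coeff in stoich.items()
    }
    # Net charge of the reaction; zero means charge-balanced.
    imbalance = sum(coeff * charges[met] for met, coeff in scaled.items())
    # Shift the proton coefficient by exactly the amount that cancels it.
    scaled[proton_met] = scaled.get(proton_met, 0.0) - imbalance / charges[proton_met]
    return scaled


# A(-1) + H(+1) -> P(0), then double the substrate A.
stoich = {"A": -1.0, "P": 1.0, "h": -1.0}
charges = {"A": -1, "P": 0, "h": 1}
print(scale_pseudoreaction(stoich, charges, 2.0, "h"))
# {'A': -2.0, 'P': 1.0, 'h': -2.0}
```

In the example, doubling the anionic substrate leaves one unit of uncompensated charge, so one extra proton is consumed to bring the net charge back to zero.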
assignSBOterms
Generic SBO term assignment ported from yeast-GEM's addSBOterms.
Parameterised entirely by an opts struct (biomass metabolite names /
suffixes, biomass / NGAM reaction names, pseudoreaction substrings).
Writes via editMiriam(..., 'fill') so existing SBO annotations are
preserved.
Includes opts.onlyLastReactionForPseudo as a bug-compat flag: the
legacy yeast-GEM loop used 'for i = numel(model.rxns)' instead of
'1:numel(...)', tagging only the last reaction. Default off; the
yeast-GEM shim sets it to true to keep its saveYeastModel output
byte-equivalent.
loadDeltaGfromCSV / saveDeltaGtoCSV
Generic load / save of metDeltaG and rxnDeltaG fields against any
two-column CSV (id, deltaG). yeast-GEM's loadDeltaG / saveDeltaG
become thin wrappers that supply the project paths.
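The effect of the onlyLastReactionForPseudo flag can be shown with a short Python analogue. In MATLAB, `for i = numel(x)` runs the loop body once with `i` equal to the last index, whereas `for i = 1:numel(x)` visits every index. The sketch below reproduces both behaviours; the function name and the substring matching are illustrative assumptions, and it returns matching indices rather than assigning real SBO terms.

```python
def pseudoreaction_indices(rxn_names, substrings, only_last=False):
    """Return indices of reactions whose name contains any given substring.

    only_last=True mimics the legacy loop that visited only the final
    reaction; only_last=False visits every reaction.
    """
    count = len(rxn_names)
    if only_last:
        visited = [count - 1] if count else []
    else:
        visited = range(count)
    return [
        i for i in visited
        if any(sub in rxn_names[i] for sub in substrings)
    ]


names = ["protein pseudoreaction", "glucose transport", "lipid pseudoreaction"]
print(pseudoreaction_indices(names, ["pseudoreaction"]))                  # [0, 2]
print(pseudoreaction_indices(names, ["pseudoreaction"], only_last=True))  # [2]
```

Keeping the legacy behaviour behind an explicit flag lets the corrected loop be the default while the shim can still reproduce old output exactly.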
Find reaction pairs that share identical stoichiometry, treating A->B and B->A as duplicates by default. Counterpart of raven_python.manipulation.find_duplicate_reactions and the upstream version of yeast-GEM's findDuplicatedRxns (which becomes a thin printing wrapper).
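One way to detect duplicates while treating A->B and B->A as the same reaction is to map each reaction to a direction-independent key. The Python sketch below is a minimal version of that idea under stated assumptions: it is not the code of findDuplicateRxns or of raven_python's find_duplicate_reactions, the names are hypothetical, and it compares coefficients exactly, so scaled copies such as 2A->2B are not matched.

```python
from itertools import combinations


def find_duplicate_reactions(reactions, ignore_direction=True):
    """Return pairs of reaction ids that share identical stoichiometry.

    reactions: {rxn_id: {metabolite: coefficient}}
    """
    groups = {}
    for rxn_id, stoich in reactions.items():
        forward = frozenset((m, c) for m, c in stoich.items() if c != 0)
        key = forward
        if ignore_direction:
            reverse = frozenset((m, -c) for m, c in forward)
            # Pick one orientation deterministically so that a reaction and
            # its reverse map to the same key.
            key = min(forward, reverse, key=sorted)
        groups.setdefault(key, []).append(rxn_id)

    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(ids, 2))
    return pairs


model = {
    "r1": {"A": -1, "B": 1},
    "r2": {"A": 1, "B": -1},   # reverse of r1
    "r3": {"A": -1, "C": 1},
    "r4": {"A": -1, "B": 1},   # exact copy of r1
}
print(find_duplicate_reactions(model))
# [('r1', 'r2'), ('r1', 'r4'), ('r2', 'r4')]
print(find_duplicate_reactions(model, ignore_direction=False))
# [('r1', 'r4')]
```

Grouping by key avoids comparing every pair of reactions directly, which matters for models with thousands of reactions.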
Tightens the YAML reader and writer so the three toolchains (RAVEN MATLAB, raven-python, cobrapy) produce / consume the same file shape. Cobrapy is the canonical core; RAVEN extras (inchis, deltaG, metFrom, eccodes, rxnFrom, references, confidence_score, protein, metaData, ec-rxns / ec-enzymes) are additive, namespaced extensions.

Writer changes
--------------
- Field order in metabolites and reactions now matches cobrapy (charge before formula; notes before annotation; objective_coefficient right after gene_reaction_rule; subsystem before annotation). RAVEN-only extras follow the cobrapy block, in a stable order.
- preserveQuotes default flipped to false. Strings now go out unquoted, matching cobrapy's minimal-quoting style. needsYamlQuoting is consulted per value so anything that would otherwise be misparsed (leading flow indicators, embedded ': ', booleans / null tokens) is still quoted defensively. SMILES strings, which routinely contain '[' / ']', fall into this branch automatically.
- Reaction notes are now written as 'notes' (not 'rxnNotes'). Cobrapy and raven_python use 'notes'; the reader (below) still accepts the old key as an alias so existing yeast-GEM YAML files load without churn.
- Metabolite SMILES is emitted inside the annotation block (annotation.smiles) rather than as a top-level metabolite key. This is cobrapy's convention; raven_python already does it this way. Reader still accepts top-level smiles for backward compatibility.
- gecko_light is written as a top-level scalar (matches raven_python.io.yaml) instead of buried inside metaData. The reader continues to accept both spellings (gecko_light and geckoLight).
- Empty reaction.metabolites blocks are written 'metabolites: !!omap []' to stay a valid YAML 1.2 document (the previous output emitted an empty !!omap with no value, which both ruamel and PyYAML reject).
- writeMetadata respects preserveQuotes; emits the block as '!!omap'; preserves model.date if present (no longer stomps the date on every write); skips the 'name: blankName' / 'id: blankID' shape when those fields are actually populated.
- Numbers come out via formatNumber: whole-number floats become '1000.0' (cobrapy convention) rather than '1000' (Python int repr). Avoids ScalarInt vs ScalarFloat ruamel diff noise on round-trip.
- Document-start marker (---) dropped to match cobrapy's bare '!!omap' root.

Reader changes
--------------
- Accepts cobrapy's root-level id / name / version / gecko_light keys. The earlier reader treated them as 'Unknown entry' once the top-level walker fell out of a known section. This means cobra-written YAML files can be ingested directly (previously failed with '"id" field cannot be empty').
- '- metaData: !!omap' (the !!omap-tagged variant emitted by cobrapy's CommentedMap dumper) is now recognized as the metaData section header; the bare '- metaData:' form still works.
- 'notes' is accepted as the canonical reaction-side key; 'rxnNotes' is kept as a backward-compatible alias.
- writeField now copes with subSystems that the reader's post-processing flattened from a Nx1 cell-of-cells to a char column (or to a shorter list than numel(rxns) when subsystem coverage is partial). Used to crash with an index-exceeds error on cobra-written models like mini.yml.

Verified
--------
- cobra.io.load_yaml_model successfully reads every file produced by the new writer (mini.yml, yeast-GEM.yml).
- raven_python.read_yaml_model successfully reads every file produced by the new writer.
- Round-tripping yeast-GEM.yml through writeYAMLmodel preserves 2748 metabolites, 4102 reactions, 1143 genes, 2411 eccodes, 3984 reaction deltaG entries, 2696 metabolite deltaG entries, 1788 SMILES, 1443 reaction notes: no semantic drift, full byte-stability through the round-trip.
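The number formatting and per-value quoting rules above can be approximated in a few lines of Python. This is a sketch under assumptions, not the MATLAB formatNumber or needsYamlQuoting code: it covers only the three triggers named in the commit message (leading flow indicators, an embedded ': ', boolean / null tokens) plus the empty string, and the token list is a guess at what a YAML 1.1 parser such as PyYAML would reinterpret.

```python
# Tokens a YAML 1.1 parser may read as boolean or null instead of a string.
_RESERVED_TOKENS = {"true", "false", "yes", "no", "on", "off", "null", "~"}
_FLOW_INDICATORS = "[]{},"


def format_number(value):
    """Render numbers the way Python's float repr does, e.g. 1000 -> '1000.0'."""
    return repr(float(value))


def needs_yaml_quoting(text):
    """Return True if a plain (unquoted) scalar could be misparsed."""
    if text == "":
        return True
    if text.lower() in _RESERVED_TOKENS:
        return True
    if text[0] in _FLOW_INDICATORS:
        return True
    return ": " in text


def emit_scalar(text):
    """Quote only when necessary, escaping backslashes and double quotes."""
    if needs_yaml_quoting(text):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return text


print(format_number(1000))          # 1000.0
print(emit_scalar("glucose"))       # glucose
print(emit_scalar("[O-]C(=O)C"))    # "[O-]C(=O)C"
print(emit_scalar("true"))          # "true"
```

Quoting per value rather than globally keeps the output close to cobrapy's minimal style while still protecting strings that would otherwise change type on reload.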
Generic helpers extracted from yeast-GEM's MATLAB code as part of its Python port. All organism-agnostic; yeast-GEM consumes them via thin shims with its own data files / id prefixes.
What lands
I/O & condition handling
- io/parseYAML.m: generic YAML reader for arbitrary documents. Complements the existing model-format-only readYAMLmodel. Delegates to py.yaml.safe_load via MATLAB's Python bridge; py-side returns are recursively converted to native MATLAB struct / cell. Used by yeast-GEM to read its data/conditions/*.yml and data/yeastgem/ids.yml.
- core/applyCondition.m: apply a deterministic "condition" to a model from a YAML file or pre-parsed struct. Schema: prelude (reset_exchanges), cofactor_pseudoreaction (metabolite removals + automatic charge rebalance of an H+ met), biomass_stoichiometry_delta (combine with existing coefficients), per-reaction bounds diff, expected_uptake_count sanity check. Project-specific extensions (yeast-GEM's amino_acid_ratio step) are handled by the caller; the upstream contract stays narrow.
- core/curateModelFromTables.m: generic batch-curation engine extracted from yeast-GEM's curateMetsRxnsGenes. Adds or updates metabolites / reactions / genes from TSV inputs. New parameters metPrefix and rxnPrefix (default 'M_' / 'R_') let any GEM project use the engine with its own id conventions; yeast-GEM keeps a thin curateMetsRxnsGenes shim that pins 's_' / 'r_' and forwards here. Schema otherwise identical to the legacy yeast-GEM version, so existing yeast-GEM TSVs work unchanged.
Biomass helpers
- core/getBiomassFractions.m: compute mass fraction (g/gDW) per biomass component plus the total, parameterised by a biomassConfig struct. Supports four mass_strategy modes: mw, mw_minus_2h (charged tRNAs), mw_minus_water (RNA / DNA polymerisation), grams (stoichiometry already in g/gDW). yeast-GEM's sumBioMass becomes a thin shim.
- core/scaleBiomassFraction.m: rescale a component to a target g/gDW, optionally adjusting a second component so the total stays at 1 g/gDW. Replaces the body of yeast-GEM's scaleBioMass.
- core/scaleBiomassPseudoreaction.m: multiply the substrate coefficients of one component pseudoreaction by a factor and rebalance H+ via biomassConfig.proton_met to preserve charge neutrality. Replaces the body of yeast-GEM's rescalePseudoReaction.
- core/setGAM.m: set the GAM coefficient on the biomass pseudoreaction for a configurable list of cofactor metabolite names; optionally fix an NGAM reaction's bounds. Replaces the body of yeast-GEM's changeGAM.
Annotation helpers
- core/assignSBOterms.m: generic SBO term assignment ported from yeast-GEM's addSBOterms. Parameterised entirely by an opts struct (biomass met names / suffixes, biomass / NGAM reaction names, pseudoreaction substrings). Writes via editMiriam(..., 'fill') so pre-existing SBO annotations are preserved. Includes opts.onlyLastReactionForPseudo as a bug-compat flag: the legacy yeast-GEM loop used for i = numel(model.rxns) instead of 1:numel(...), tagging only the last reaction; default off, yeast-GEM's shim turns it on so its saveYeastModel output stays byte-stable.
- io/loadDeltaGfromCSV.m / io/saveDeltaGtoCSV.m: generic load / save of metDeltaG and rxnDeltaG fields against two-column CSVs (id, deltaG). yeast-GEM's loadDeltaG / saveDeltaG become thin wrappers that supply the project paths.
YAML I/O alignment
- io/writeYAMLmodel.m + io/readYAMLmodel.m: three-way alignment between RAVEN MATLAB, raven-python, and cobrapy YAML output. Cobrapy is treated as canonical for the core (id / name / metabolites / reactions / genes / compartments / annotation / bounds); RAVEN-only fields (inchis, deltaG, metFrom, eccodes, rxnFrom, references, confidence_score, protein, metaData, ec-rxns / ec-enzymes) are additive extensions kept in their familiar positions.
Writer changes:
- preserveQuotes default flipped to false to match cobrapy's minimal-quoting style; values that need quoting (SMILES strings with [O-], booleans, leading flow indicators, embedded :) are still quoted defensively via a new needsYamlQuoting helper.
- Reaction notes are written as notes (cobrapy-canonical), not rxnNotes. Reader accepts both for backward compat.
- Metabolite smiles moves into annotation.smiles (cobrapy convention). Reader still accepts top-level smiles.
- gecko_light is emitted at the top level (matches raven-python) instead of inside metaData.
- Whole-number floats are written as 1000.0, not 1000 (matches cobrapy / Python float repr).
- Empty metabolites: blocks emit !!omap [] so the file is a valid YAML 1.2 document.
- Document-start marker (---) dropped to match cobrapy's bare !!omap root.
Reader changes: accepts cobrapy's root-level id / name / version / gecko_light; accepts the !!omap-tagged metaData header; accepts both notes and rxnNotes reaction keys; defensive against subSystems shapes produced by the reader's own post-processing flatten.
Verified: cobra.io.load_yaml_model and raven_python.read_yaml_model both successfully read every file the new writer produces (cobra mini.yml and the 2748-met yeast-GEM.yml). Round-tripping yeast-GEM through writeYAMLmodel preserves 2748/2748 metabolites, 4102/4102 reactions, 1143/1143 genes, 2411 eccodes, 3984 reaction deltaG, 2696 metabolite deltaG, 1788 SMILES, 1443 reaction notes: no semantic drift.
Misc
- core/findDuplicateRxns.m: return reaction pairs sharing identical stoichiometry, treating A->B and B->A as duplicates by default. Counterpart of raven_python.manipulation.find_duplicate_reactions and the upstream version of yeast-GEM's findDuplicatedRxns (which becomes a thin printing wrapper).
Lock-step verification
A MATLAB equivalence harness loaded yeast-GEM and ran every legacy implementation (extracted from git show HEAD:...) against its refactored shim on the same model state, asserting bitwise match on the S matrix, bounds, and miriam annotations (or near-zero float diff for deltaG / sumBioMass scalars). 10 / 10 checks pass: sumBioMass, rescalePseudoReaction (all 9 components incl. lipid aggregation), scaleBioMass (with balance_out), changeGAM (with and without NGAM), addSBOterms (mets and rxns), loadDeltaG, saveDeltaGtoCSV round-trip, and findDuplicateRxns. Details in each commit message and in yeast-GEM's PORTING_PLAN.md.
Companion PRs