feat(io.yaml): typed model.ec via EcData; absorb legacy GECKO normalisations #12
Merged
Mirrors RAVEN MATLAB's readYAMLmodel.m / writeYAMLmodel.m, which
populate the model.ec struct whenever the YAML defines it. Downstream
consumers (geckopy / GECKO) operate on the populated struct rather
than re-parsing the YAML themselves.
New: src/raven_python/io/ec_data.py
- EcData dataclass with the MATLAB-GECKO field shape (per-rxn arrays:
rxns/kcat/source/notes/eccodes; per-enzyme arrays: genes/enzymes/mw/
sequence/concs; sparse rxn_enz_mat coupling; gecko_light flag).
- ec_data_from_yaml_sections: parses ec-rxns/ec-enzymes/gecko_light
into a typed EcData, validating that every enzyme referenced from
an ec-rxns row exists in ec-enzymes (catches the common authoring
bug where the two sections drift apart).
- ec_data_to_yaml_sections: serialises an EcData back to the
list-of-mappings YAML form. Empty source/notes/eccodes/sequence and
NaN mw/concs are omitted to keep files compact; kcat is always
written (0 == "no kcat assigned", matching MATLAB GECKO).
- _canonicalize_eccodes / _eccodes_to_yaml handle the scalar-or-list
YAML representation for EC numbers.
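A minimal sketch of the scalar-or-list handling, the omission rules, and the dangling-enzyme check described above. It is not the code in ec_data.py: the function names, the YAML key names (`id`, `gene`, `conc`, `enzymes`), and the choice of a list of strings as the in-memory form of eccodes are all assumptions made for illustration.

```python
import math


def canonicalize_eccodes(value):
    """Normalise a YAML eccodes value (missing, scalar, or list) to a list of strings."""
    if value is None or value == "":
        return []
    if isinstance(value, str):
        return [value]
    return [str(code) for code in value]


def eccodes_to_yaml(codes):
    """Return None for no codes, a scalar for one code, a list for several."""
    if not codes:
        return None
    return codes[0] if len(codes) == 1 else list(codes)


def rxn_row(rxn_id, kcat, source="", notes="", eccodes=()):
    """Build one ec-rxns mapping (hypothetical key names)."""
    # kcat is always written; 0 stands for "no kcat assigned".
    row = {"id": rxn_id, "kcat": float(kcat)}
    if source:
        row["source"] = source
    if notes:
        row["notes"] = notes
    codes = eccodes_to_yaml(list(eccodes))
    if codes is not None:
        row["eccodes"] = codes
    return row


def enzyme_row(enzyme, gene, mw=math.nan, sequence="", conc=math.nan):
    """Build one ec-enzymes mapping, skipping NaN numbers and empty strings."""
    row = {"id": enzyme, "gene": gene}
    if not math.isnan(mw):
        row["mw"] = float(mw)
    if sequence:
        row["sequence"] = sequence
    if not math.isnan(conc):
        row["conc"] = float(conc)
    return row


def dangling_enzymes(rxn_rows, enzyme_rows):
    """List (rxn id, enzyme id) pairs whose enzyme is absent from ec-enzymes."""
    known = {row["id"] for row in enzyme_rows}
    return [
        (rxn["id"], enz)
        for rxn in rxn_rows
        for enz in rxn.get("enzymes", {})
        if enz not in known
    ]
```

A reader built along these lines would raise an error whenever `dangling_enzymes` returns a non-empty list, which is how the drift between the two sections surfaces at load time.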
Extended: src/raven_python/io/yaml.py
- model_from_yaml_data now pulls ec-rxns / ec-enzymes / gecko_light
out of the foreign-keys stash, builds an EcData, and attaches it
as model.ec. Other unknown top-level keys still round-trip
opaquely via model.notes['_yaml_sections'].
- write_yaml_model now serialises model.ec to the top-level
ec-rxns / ec-enzymes / gecko_light sections when present, and
drops any stale ec-* in _yaml_sections so the file isn't ambiguous.
- read_yaml_model also accepts the very old RAVEN shape where the
document root is a bare `-` sequence of single-key mappings; the
reader merges them into one dict before parsing.
- model_from_yaml_data now also normalises two legacy ecModel
quirks in line with MATLAB GECKO behaviour:
* per-metabolite top-level `smiles` -> annotation['smiles']
(older writers placed SMILES at the metabolite top level);
* `usage_prot_*` / `prot_pool_exchange` reactions with negative
lower bound and swapped stoichiometry are flipped to the
forward convention (warns once per load).
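The three legacy normalisations above could look roughly like the sketch below. It operates on plain dicts rather than model objects, and the helper names, the dict layout (`metabolites`, `lower_bound`, `upper_bound`), and the detection rule for a reverse-direction reaction (negative lower bound with a non-positive upper bound) are assumptions, not the behaviour of yaml.py.

```python
import warnings


def merge_bare_sequence_root(doc):
    """Merge a root-level list of single-key mappings into one dict."""
    if isinstance(doc, dict):
        return doc  # modern shape: nothing to do
    merged = {}
    for item in doc:
        if not isinstance(item, dict) or len(item) != 1:
            raise ValueError("expected single-key mappings at the document root")
        merged.update(item)
    return merged


def lift_smiles(met):
    """Move a metabolite's top-level 'smiles' into its annotation mapping."""
    met = dict(met)
    if "smiles" in met:
        smiles = met.pop("smiles")
        annotation = dict(met.get("annotation") or {})
        # Assumption: an existing annotation['smiles'] takes precedence.
        annotation.setdefault("smiles", smiles)
        met["annotation"] = annotation
    return met


def _is_prot_rxn(rxn_id):
    return rxn_id.startswith("usage_prot_") or rxn_id == "prot_pool_exchange"


def flip_reversed_prot_rxns(rxns):
    """Flip reverse-direction protein reactions; warn once if any were flipped."""
    flipped, out = [], []
    for rxn in rxns:
        reverse = rxn["lower_bound"] < 0 and rxn["upper_bound"] <= 0
        if _is_prot_rxn(rxn["id"]) and reverse:
            rxn = {
                **rxn,
                # Negate every coefficient and mirror the bounds around zero.
                "metabolites": {m: -c for m, c in rxn["metabolites"].items()},
                "lower_bound": -rxn["upper_bound"],
                "upper_bound": -rxn["lower_bound"],
            }
            flipped.append(rxn["id"])
        out.append(rxn)
    if flipped:
        warnings.warn(f"flipped reverse-direction reactions: {flipped}", stacklevel=2)
    return out
```

Collecting the flipped ids and warning after the loop is what keeps the output to a single warning per load, however many reactions needed flipping.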
Tests
- tests/test_io_yaml_ec_data.py (new): 18 focused tests covering
load (model.ec population, sentinel handling for omitted optional
fields, gecko_light flag, eccodes scalar-or-list, no-ec models),
save (sections emitted, NaN/empty omission, numpy-scalar coercion,
stale _yaml_sections overridden), legacy quirks (top-level smiles
lifted, reverse-direction prot flip with warn, bare-sequence root
merge), and error paths (half-pair of ec-* sections, dangling
enzyme reference).
- tests/test_io_yaml.py (updated): RAVEN_DOC fixture grew a complete
ec-rxns/ec-enzymes pair so the round-trip test now verifies typed
EcData survives, not just an opaque _yaml_sections stash.
Two shape-management helpers that consumers (geckopy's pipeline, test fixtures) need on top of the raw dataclass:
- validate(): raise ValueError when per-rxn array lengths, per-enzyme array lengths, or the rxn_enz_mat shape drift from one another. Cheap; callable after each mutation in a builder pipeline.
- EcData.empty(n_rxns, n_enzymes, *, gecko_light=False): preallocate with the canonical sentinels (empty strings for the string fields, 0 for kcat, NaN for mw/concs, empty CSR matrix). Used by builders that allocate up-front and fill row by row.
Both methods are shape-level operations, not algorithms, so they live with the dataclass rather than on a downstream consumer.
Tests: 6 new EcData tests covering empty's sentinels, validate's three drift paths (per-rxn length, per-enzyme length, coupling-matrix shape), the empty -> validate round-trip, and the gecko_light flag on empty.
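To make the two helpers concrete, here is a dependency-free sketch of how empty and validate could fit together. It is a stand-in, not the real EcData: plain lists replace the array types, the coupling matrix is swapped for a declared shape plus a dict of entries instead of a CSR matrix, and treating eccodes as one of the string fields is an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EcDataSketch:
    """Illustrative stand-in for EcData built from stdlib types only."""

    # Per-reaction arrays.
    rxns: list[str] = field(default_factory=list)
    kcat: list[float] = field(default_factory=list)
    source: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    eccodes: list[str] = field(default_factory=list)
    # Per-enzyme arrays.
    genes: list[str] = field(default_factory=list)
    enzymes: list[str] = field(default_factory=list)
    mw: list[float] = field(default_factory=list)
    sequence: list[str] = field(default_factory=list)
    concs: list[float] = field(default_factory=list)
    # Coupling: declared (n_rxns, n_enzymes) shape plus {(rxn, enzyme): value}.
    rxn_enz_shape: tuple[int, int] = (0, 0)
    rxn_enz_mat: dict[tuple[int, int], float] = field(default_factory=dict)
    gecko_light: bool = False

    @classmethod
    def empty(cls, n_rxns, n_enzymes, *, gecko_light=False):
        """Preallocate every field with its sentinel value."""
        return cls(
            rxns=[""] * n_rxns,
            kcat=[0.0] * n_rxns,
            source=[""] * n_rxns,
            notes=[""] * n_rxns,
            eccodes=[""] * n_rxns,
            genes=[""] * n_enzymes,
            enzymes=[""] * n_enzymes,
            mw=[math.nan] * n_enzymes,
            sequence=[""] * n_enzymes,
            concs=[math.nan] * n_enzymes,
            rxn_enz_shape=(n_rxns, n_enzymes),
            gecko_light=gecko_light,
        )

    def validate(self):
        """Raise ValueError if any array length or the coupling shape has drifted."""
        n_rxns, n_enzymes = len(self.rxns), len(self.enzymes)
        for name in ("kcat", "source", "notes", "eccodes"):
            if len(getattr(self, name)) != n_rxns:
                raise ValueError(f"{name}: expected {n_rxns} per-rxn entries")
        for name in ("genes", "mw", "sequence", "concs"):
            if len(getattr(self, name)) != n_enzymes:
                raise ValueError(f"{name}: expected {n_enzymes} per-enzyme entries")
        if self.rxn_enz_shape != (n_rxns, n_enzymes):
            raise ValueError("rxn_enz_mat shape does not match rxns x enzymes")
```

With SciPy available, the empty coupling matrix would more naturally be `scipy.sparse.csr_matrix((n_rxns, n_enzymes))`, and validate would compare its `.shape` attribute instead of a stored tuple.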
- UP037: drop the string-quoted forward reference on EcData.empty's return annotation; the module already uses `from __future__ import annotations`, so the bare class name is fine.
- B905: zip(coo.row, coo.col, coo.data) now passes strict=True. The three arrays come from the same COO matrix and are guaranteed equal length; strict=True turns any future drift into a loud ValueError instead of silent truncation.
- I001: drop the stray blank line between the import block and the first section comment in tests/test_io_yaml_ec_data.py.
Summary
Mirrors RAVEN MATLAB's readYAMLmodel.m / writeYAMLmodel.m: when the YAML defines the ec-rxns / ec-enzymes / gecko_light top-level sections, the reader populates a typed model.ec; the writer serialises it back when present. Downstream consumers (geckopy / GECKO) then operate on the populated struct rather than re-parsing the YAML themselves.
This moves the schema knowledge for ec-models out of geckopy and into raven-python, where it belongs. The format is RAVEN's, yet geckopy's previous wrappers had to re-implement deserialisation, NaN/empty-field semantics, eccodes scalar/list normalisation, the smiles → annotation lift, the reverse-direction usage_prot_* flip, and bare `-` document-root handling, none of which are geckopy-specific algorithms.
New: src/raven_python/io/ec_data.py
- EcData dataclass with the MATLAB-GECKO field shape:
  * per-rxn arrays: rxns, kcat, source, notes, eccodes
  * per-enzyme arrays: genes, enzymes, mw, sequence, concs
  * rxn_enz_mat coupling matrix
  * gecko_light flag
- ec_data_from_yaml_sections(sections): parses the three YAML sections into a typed EcData. Validates that every enzyme referenced from an ec-rxns row exists in ec-enzymes (catches the common authoring bug where the two sections drifted apart).
- ec_data_to_yaml_sections(ec): serialises an EcData back to the list-of-mappings YAML form. Empty source/notes/eccodes/sequence and NaN mw/concs are omitted to keep files compact; kcat is always written (0 == "no kcat assigned", matching MATLAB GECKO).
- _canonicalize_eccodes / _eccodes_to_yaml handle the YAML representation for EC numbers (scalar string for one EC, list for multiple).
Extended: src/raven_python/io/yaml.py
- model_from_yaml_data now pulls ec-rxns / ec-enzymes / gecko_light out of the foreign-keys stash, builds an EcData, and attaches it as model.ec. Other unknown top-level keys still round-trip opaquely via model.notes['_yaml_sections'].
- write_yaml_model now serialises model.ec to the top-level sections when present, and drops any stale ec-* in _yaml_sections so the file isn't ambiguous.
- read_yaml_model also accepts the very old RAVEN shape where the document root is a bare `-` sequence of single-key mappings.
- model_from_yaml_data now also normalises two legacy ecModel quirks:
  * per-metabolite top-level smiles → annotation['smiles'] (older writers placed SMILES at the metabolite top level).
  * reverse-direction usage_prot_* / prot_pool_exchange (negative lower_bound, swapped stoichiometry signs) → flipped to the forward convention (warns once per load).
Tests
- tests/test_io_yaml_ec_data.py (new): 18 focused tests covering
  * load: model.ec population, sentinel handling for omitted optional fields, gecko_light flag, eccodes scalar-or-list, no-ec models leave model.ec unset.
  * save: sections emitted, NaN/empty omission, numpy-scalar coercion, stale _yaml_sections overridden by model.ec.
  * legacy quirks: top-level smiles lifted, reverse-direction prot flip with warn, bare-sequence root merge.
  * error paths: half-pair of ec-* sections, dangling enzyme reference.
- tests/test_io_yaml.py (updated): the RAVEN_DOC fixture gained a complete ec-rxns / ec-enzymes pair so the existing round-trip test now verifies typed EcData survives, not just an opaque _yaml_sections stash.
Full suite: 504 passed, 0 failures.
Downstream
Once this lands, geckopy can collapse save_ec_model.py and load_ec_model.py to thin dispatch wrappers (file-extension routing, adapter-aware path resolution, empty-model guards, provenance injection). That leaves roughly 30 LOC of code that's actually geckopy-specific, instead of the current ~620 LOC that re-implements the YAML schema.