feat(io.yaml): typed model.ec via EcData; absorb legacy GECKO normalisations by edkerk · Pull Request #12 · SysBioChalmers/raven-python

edkerk · 2026-05-30T17:50:10Z

Summary

Mirrors RAVEN MATLAB's readYAMLmodel.m / writeYAMLmodel.m: when the YAML defines the ec-rxns / ec-enzymes / gecko_light top-level sections, the reader populates a typed model.ec; the writer serialises it back when present. Downstream consumers (geckopy / GECKO) then operate on the populated struct rather than re-parsing the YAML themselves.

This moves the schema knowledge for ec-models out of geckopy and into raven-python, where it belongs. The format is RAVEN's: geckopy's previous wrappers had to re-implement deserialisation, NaN/empty-field semantics, eccodes scalar/list normalisation, the smiles → annotation lift, the reverse-direction usage_prot_* flip, and bare `-` document-root handling, none of which are geckopy-specific algorithms.

New: `src/raven_python/io/ec_data.py`

EcData dataclass with the MATLAB-GECKO field shape:
- per-rxn arrays: rxns, kcat, source, notes, eccodes
- per-enzyme arrays: genes, enzymes, mw, sequence, concs
- sparse rxn_enz_mat coupling matrix
- gecko_light flag
ec_data_from_yaml_sections(sections): parses the three YAML sections into a typed EcData. Validates that every enzyme referenced from an ec-rxns row exists in ec-enzymes (catches the common authoring bug where the two sections drifted apart).
ec_data_to_yaml_sections(ec): serialises an EcData back to the list-of-mappings YAML form. Empty source / notes / eccodes / sequence and NaN mw / concs are omitted to keep files compact; kcat is always written (0 == "no kcat assigned", matching MATLAB GECKO).
_canonicalize_eccodes / _eccodes_to_yaml handle the YAML representation for EC numbers (scalar string for one EC, list for multiple).
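The scalar-or-list handling for EC numbers and the "omit empty, always write kcat" rule can be illustrated with a minimal sketch. This is not the code from `ec_data.py`: the function names, the `id` row key, and the exact blank-value rules below are assumptions made for illustration.

```python
import math


def canonicalize_eccodes(value):
    """Accept None, a scalar string, or a list; always return a list of strings."""
    if value is None or value == "":
        return []
    if isinstance(value, str):
        return [value]
    return [str(code) for code in value if code]


def eccodes_to_yaml(codes):
    """One EC number becomes a scalar, several stay a list, none becomes None."""
    if not codes:
        return None
    return codes[0] if len(codes) == 1 else list(codes)


def is_blank(value):
    """True for the sentinels that should be left out of the YAML row."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, (str, list)) and len(value) == 0


def rxn_row(rxn_id, kcat, source="", notes="", eccodes=()):
    """Build one ec-rxns style mapping (the 'id' key name is hypothetical)."""
    row = {"id": rxn_id, "kcat": float(kcat)}  # kcat is written even when 0
    optional = {
        "source": source,
        "notes": notes,
        "eccodes": eccodes_to_yaml(list(eccodes)),
    }
    row.update({key: val for key, val in optional.items() if not is_blank(val)})
    return row
```

With these helpers, `rxn_row("r1", 0)` keeps only the identifier and a zero kcat, while a row with two EC numbers keeps them as a list.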

Extended: `src/raven_python/io/yaml.py`

model_from_yaml_data now pulls ec-rxns / ec-enzymes / gecko_light out of the foreign-keys stash, builds an EcData, and attaches it as model.ec. Other unknown top-level keys still round-trip opaquely via model.notes['_yaml_sections'].
write_yaml_model now serialises model.ec to the top-level sections when present, and drops any stale ec-* in _yaml_sections so the file isn't ambiguous.
read_yaml_model also accepts the very old RAVEN shape where the document root is a bare - sequence of single-key mappings.
model_from_yaml_data now also normalises two legacy ecModel quirks:
- per-metabolite top-level smiles → annotation['smiles'] (older writers placed SMILES at the metabolite top level).
- reverse-direction usage_prot_* / prot_pool_exchange (negative lower_bound, swapped stoichiometry signs) → flipped to the forward convention (warns once per load).
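The three legacy normalisations described above (bare-sequence root, top-level smiles, reverse-direction protein usage) can be sketched on plain dictionaries. The helper names, the dictionary layout, and the detection rule (a negative lower bound on a `usage_prot_*` or `prot_pool_exchange` reaction) are assumptions for illustration, not the implementation in `yaml.py`.

```python
import warnings


def merge_bare_root(doc):
    """Merge a root-level list of single-key mappings into one dict."""
    if isinstance(doc, list):
        merged = {}
        for item in doc:
            merged.update(item)
        return merged
    return doc


def lift_smiles(met):
    """Move a top-level 'smiles' entry into the metabolite's annotation dict."""
    if "smiles" in met:
        smiles = met.pop("smiles")
        met.setdefault("annotation", {})["smiles"] = smiles
    return met


def flip_reverse_usage(reactions):
    """Flip reverse-direction protein usage reactions; warn once per call."""
    flipped = 0
    for rxn in reactions:
        rid = rxn["id"]
        is_usage = rid.startswith("usage_prot_") or rid == "prot_pool_exchange"
        if not (is_usage and rxn["lower_bound"] < 0):
            continue
        # Negate the stoichiometry and mirror the bounds around zero.
        rxn["metabolites"] = {m: -c for m, c in rxn["metabolites"].items()}
        lower, upper = rxn["lower_bound"], rxn["upper_bound"]
        rxn["lower_bound"], rxn["upper_bound"] = -upper, -lower
        flipped += 1
    if flipped:
        warnings.warn(
            f"flipped {flipped} reverse-direction protein usage reaction(s)",
            stacklevel=2,
        )
    return flipped
```

Emitting the warning after the loop, rather than inside it, is one way to get a single warning per load regardless of how many reactions are flipped.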

Tests

tests/test_io_yaml_ec_data.py (new): 18 focused tests covering
- load: model.ec population, sentinel handling for omitted optional fields, gecko_light flag, eccodes scalar-or-list, no-ec models leave model.ec unset.
- save: sections emitted, NaN/empty omission, numpy-scalar coercion, stale _yaml_sections overridden by model.ec.
- legacy quirks: top-level smiles lifted, reverse-direction prot flip with warn, bare-sequence root merge.
- error paths: half-pair of ec-* sections, dangling enzyme reference.
tests/test_io_yaml.py (updated): the RAVEN_DOC fixture gained a complete ec-rxns/ec-enzymes pair so the existing round-trip test now verifies typed EcData survives, not just an opaque _yaml_sections stash.

Full suite: 504 passed, 0 failures.

Downstream

Once this lands, geckopy can collapse save_ec_model.py and load_ec_model.py to thin dispatch wrappers (file-extension routing, adapter-aware path resolution, empty-model guards, provenance injection): roughly 30 LOC that are actually geckopy-specific, instead of the current ~620 LOC that re-implement the YAML schema.

…sations

Mirrors RAVEN MATLAB's readYAMLmodel.m / writeYAMLmodel.m, which populate the model.ec struct whenever the YAML defines it. Downstream consumers (geckopy / GECKO) operate on the populated struct rather than re-parsing the YAML themselves.

New: src/raven_python/io/ec_data.py
- EcData dataclass with the MATLAB-GECKO field shape (per-rxn arrays: rxns/kcat/source/notes/eccodes; per-enzyme arrays: genes/enzymes/mw/sequence/concs; sparse rxn_enz_mat coupling; gecko_light flag).
- ec_data_from_yaml_sections: parses ec-rxns/ec-enzymes/gecko_light into a typed EcData, validating that every enzyme referenced from an ec-rxns row exists in ec-enzymes (catches the common authoring bug where the two sections drift apart).
- ec_data_to_yaml_sections: serialises an EcData back to the list-of-mappings YAML form. Empty source/notes/eccodes/sequence and NaN mw/concs are omitted to keep files compact; kcat is always written (0 == "no kcat assigned", matching MATLAB GECKO).
- _canonicalize_eccodes / _eccodes_to_yaml handle the scalar-or-list YAML representation for EC numbers.

Extended: src/raven_python/io/yaml.py
- model_from_yaml_data now pulls ec-rxns / ec-enzymes / gecko_light out of the foreign-keys stash, builds an EcData, and attaches it as model.ec. Other unknown top-level keys still round-trip opaquely via model.notes['_yaml_sections'].
- write_yaml_model now serialises model.ec to the top-level ec-rxns / ec-enzymes / gecko_light sections when present, and drops any stale ec-* in _yaml_sections so the file isn't ambiguous.
- read_yaml_model also accepts the very old RAVEN shape where the document root is a bare `-` sequence of single-key mappings; the reader merges them into one dict before parsing.
- model_from_yaml_data now also normalises two legacy ecModel quirks in line with MATLAB GECKO behaviour:
  * per-metabolite top-level `smiles` -> annotation['smiles'] (older writers placed SMILES at the metabolite top level);
  * `usage_prot_*` / `prot_pool_exchange` reactions with negative lower bound and swapped stoichiometry are flipped to the forward convention (warns once per load).

Tests
- tests/test_io_yaml_ec_data.py (new): 18 focused tests covering load (model.ec population, sentinel handling for omitted optional fields, gecko_light flag, eccodes scalar-or-list, no-ec models), save (sections emitted, NaN/empty omission, numpy-scalar coercion, stale _yaml_sections overridden), legacy quirks (top-level smiles lifted, reverse-direction prot flip with warn, bare-sequence root merge), and error paths (half-pair of ec-* sections, dangling enzyme reference).
- tests/test_io_yaml.py (updated): RAVEN_DOC fixture grew a complete ec-rxns/ec-enzymes pair so the round-trip test now verifies typed EcData survives, not just an opaque _yaml_sections stash.

Two shape-management helpers that consumers (geckopy's pipeline, test fixtures) need on top of the raw dataclass:
- validate(): raise ValueError when per-rxn array lengths, per-enzyme array lengths, or the rxn_enz_mat shape drift from one another. Cheap; callable after each mutation in a builder pipeline.
- EcData.empty(n_rxns, n_enzymes, *, gecko_light=False): preallocate with the canonical sentinels (empty strings for the string fields, 0 for kcat, NaN for mw/concs, empty CSR matrix). Used by builders that allocate up-front and fill row by row.

Both methods are shape-level operations, not algorithms, so they live with the dataclass rather than on a downstream consumer.

Tests: 6 new EcData tests covering empty's sentinels, validate's three drift paths (per-rxn length, per-enzyme length, coupling-matrix shape), the empty -> validate round-trip, and the gecko_light flag on empty.
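A dependency-free sketch of how `empty()` and `validate()` could fit together is shown below. It is an illustration only: `EcDataSketch` is a hypothetical stand-in, plain lists replace the arrays, and a coordinate dict plus an explicit `mat_shape` replaces the sparse CSR matrix that the real EcData is described as using.

```python
import math
from dataclasses import dataclass, field

PER_RXN = ("rxns", "kcat", "source", "notes", "eccodes")
PER_ENZYME = ("genes", "enzymes", "mw", "sequence", "concs")


@dataclass
class EcDataSketch:
    """Hypothetical stand-in for EcData using only standard-library types."""

    rxns: list = field(default_factory=list)
    kcat: list = field(default_factory=list)
    source: list = field(default_factory=list)
    notes: list = field(default_factory=list)
    eccodes: list = field(default_factory=list)
    genes: list = field(default_factory=list)
    enzymes: list = field(default_factory=list)
    mw: list = field(default_factory=list)
    sequence: list = field(default_factory=list)
    concs: list = field(default_factory=list)
    rxn_enz_mat: dict = field(default_factory=dict)  # (row, col) -> value
    mat_shape: tuple = (0, 0)  # assumption: shape tracked explicitly
    gecko_light: bool = False

    @classmethod
    def empty(cls, n_rxns, n_enzymes, *, gecko_light=False):
        """Preallocate every field with its sentinel value."""
        return cls(
            rxns=[""] * n_rxns,
            kcat=[0.0] * n_rxns,
            source=[""] * n_rxns,
            notes=[""] * n_rxns,
            eccodes=[""] * n_rxns,
            genes=[""] * n_enzymes,
            enzymes=[""] * n_enzymes,
            mw=[math.nan] * n_enzymes,
            sequence=[""] * n_enzymes,
            concs=[math.nan] * n_enzymes,
            mat_shape=(n_rxns, n_enzymes),
            gecko_light=gecko_light,
        )

    def validate(self):
        """Raise ValueError when field lengths or the matrix shape disagree."""
        for label, names in (("per-rxn", PER_RXN), ("per-enzyme", PER_ENZYME)):
            lengths = {name: len(getattr(self, name)) for name in names}
            if len(set(lengths.values())) > 1:
                raise ValueError(f"{label} lengths differ: {lengths}")
        expected = (len(self.rxns), len(self.enzymes))
        if self.mat_shape != expected:
            raise ValueError(f"rxn_enz_mat shape {self.mat_shape} != {expected}")
```

A builder could call `EcDataSketch.empty(n, m)`, fill rows in place, and call `validate()` after each step so that a length mismatch surfaces where it was introduced.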

- UP037: drop the string-quoted forward reference on EcData.empty's return annotation; the module already uses `from __future__ import annotations`, so the bare class name is fine.
- B905: zip(coo.row, coo.col, coo.data) now passes strict=True. The three arrays come from the same COO matrix and are guaranteed equal length; strict=True turns any future drift into a loud ValueError instead of silent truncation.
- I001: drop the stray blank line between the import block and the first section comment in tests/test_io_yaml_ec_data.py.

edkerk added 3 commits May 30, 2026 19:49

edkerk merged commit 41c2b72 into develop May 30, 2026
4 checks passed

edkerk deleted the feat/ec-model-yaml-io branch May 30, 2026 19:43

edkerk merged 3 commits into develop from feat/ec-model-yaml-io