
feat(io.yaml): typed model.ec via EcData; absorb legacy GECKO normalisations #12

Merged
edkerk merged 3 commits into develop from feat/ec-model-yaml-io on May 30, 2026

edkerk commented May 30, 2026

Summary

Mirrors RAVEN MATLAB's readYAMLmodel.m / writeYAMLmodel.m: when the YAML defines the ec-rxns / ec-enzymes / gecko_light top-level sections, the reader populates a typed model.ec; the writer serialises it back when present. Downstream consumers (geckopy / GECKO) then operate on the populated struct rather than re-parsing the YAML themselves.

This moves the schema knowledge for ec-models out of geckopy and into raven-python, where it belongs, because the format is RAVEN's. geckopy's previous wrappers had to re-implement deserialisation, NaN/empty-field semantics, eccodes scalar/list normalisation, the smiles → annotation lift, the reverse-direction usage_prot_* flip, and bare `-` document-root handling, none of which are geckopy-specific algorithms.

New: src/raven_python/io/ec_data.py

  • EcData dataclass with the MATLAB-GECKO field shape:
    • per-rxn arrays: rxns, kcat, source, notes, eccodes
    • per-enzyme arrays: genes, enzymes, mw, sequence, concs
    • sparse rxn_enz_mat coupling matrix
    • gecko_light flag
  • ec_data_from_yaml_sections(sections): parses the three YAML sections into a typed EcData. Validates that every enzyme referenced from an ec-rxns row exists in ec-enzymes (catches the common authoring bug where the two sections drifted apart).
  • ec_data_to_yaml_sections(ec): serialises an EcData back to the list-of-mappings YAML form. Empty source / notes / eccodes / sequence and NaN mw / concs are omitted to keep files compact; kcat is always written (0 == "no kcat assigned", matching MATLAB GECKO).
  • _canonicalize_eccodes / _eccodes_to_yaml handle the YAML representation for EC numbers (scalar string for one EC, list for multiple).
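
The scalar-or-list convention for eccodes can be sketched as follows. This is a minimal illustration, not the code in ec_data.py: the function names `canonicalize_eccodes` and `eccodes_to_yaml` are hypothetical stand-ins for the private helpers, and the in-memory form (a list of strings) is an assumption.

```python
from __future__ import annotations


def canonicalize_eccodes(value: object) -> list[str]:
    """Normalise the YAML value of an eccodes field to a list of strings."""
    if value is None or value == "":
        return []                       # key omitted or empty
    if isinstance(value, str):
        return [value]                  # scalar form: exactly one EC number
    return [str(v) for v in value]      # list form: several EC numbers


def eccodes_to_yaml(codes: list[str]) -> str | list[str] | None:
    """Inverse mapping; None means 'omit the key from the YAML row'."""
    if not codes:
        return None
    return codes[0] if len(codes) == 1 else list(codes)


# Both YAML forms (and the omitted key) round-trip unchanged.
for raw in ("1.1.1.1", ["1.1.1.1", "2.7.1.2"], None):
    assert eccodes_to_yaml(canonicalize_eccodes(raw)) == raw
```

Keeping one in-memory shape means consumers never branch on scalar versus list; only the reader and writer know about the compact YAML form.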

Extended: src/raven_python/io/yaml.py

  • model_from_yaml_data now pulls ec-rxns / ec-enzymes / gecko_light out of the foreign-keys stash, builds an EcData, and attaches it as model.ec. Other unknown top-level keys still round-trip opaquely via model.notes['_yaml_sections'].
  • write_yaml_model now serialises model.ec to the top-level sections when present, and drops any stale ec-* in _yaml_sections so the file isn't ambiguous.
  • read_yaml_model also accepts the very old RAVEN shape where the document root is a bare - sequence of single-key mappings.
  • model_from_yaml_data now also normalises two legacy ecModel quirks:
    • per-metabolite top-level smiles → annotation['smiles'] (older writers placed SMILES at the metabolite top level).
    • reverse-direction usage_prot_* / prot_pool_exchange (negative lower_bound, swapped stoichiometry signs) → flipped to the forward convention (warns once per load).
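
The bare-sequence root merge and the smiles lift described above could look roughly like this. It is a sketch on plain dicts as they would come out of a YAML parser; the function names and the section keys in the example (`metaData`, `metabolites`) are illustrative assumptions, and keeping an existing annotation['smiles'] over the top-level value is a choice made here, not something the PR states.

```python
def merge_bare_sequence_root(doc):
    """Merge a root like [{'metaData': ...}, {'metabolites': ...}] into one dict."""
    if isinstance(doc, dict):
        return doc                      # modern shape: root is already a mapping
    merged = {}
    for item in doc:
        if not isinstance(item, dict) or len(item) != 1:
            raise ValueError("expected a sequence of single-key mappings")
        merged.update(item)
    return merged


def lift_smiles(metabolite):
    """Return a copy with a top-level 'smiles' key moved under 'annotation'."""
    met = dict(metabolite)
    if "smiles" in met:
        annotation = dict(met.get("annotation") or {})
        # Assumption: an existing annotation['smiles'] wins over the legacy key.
        annotation.setdefault("smiles", met.pop("smiles"))
        met["annotation"] = annotation
    return met


legacy_root = [
    {"metaData": {"id": "example_model"}},
    {"metabolites": [{"id": "m1", "smiles": "O"}]},
]
doc = merge_bare_sequence_root(legacy_root)
doc["metabolites"] = [lift_smiles(m) for m in doc["metabolites"]]
```

After these two steps the rest of the reader only has to deal with the modern document shape.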

Tests

  • tests/test_io_yaml_ec_data.py (new): 18 focused tests covering
    • load: model.ec population, sentinel handling for omitted optional fields, gecko_light flag, eccodes scalar-or-list, no-ec models leave model.ec unset.
    • save: sections emitted, NaN/empty omission, numpy-scalar coercion, stale _yaml_sections overridden by model.ec.
    • legacy quirks: top-level smiles lifted, reverse-direction prot flip with warn, bare-sequence root merge.
    • error paths: half-pair of ec-* sections, dangling enzyme reference.
  • tests/test_io_yaml.py (updated): the RAVEN_DOC fixture gained a complete ec-rxns/ec-enzymes pair so the existing round-trip test now verifies typed EcData survives, not just an opaque _yaml_sections stash.
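
The two error paths named above (a half-pair of ec-* sections and a dangling enzyme reference) amount to a consistency check along these lines. This is a hedged sketch: `check_ec_sections` is a hypothetical name, the row keys (`id`, `enzymes`) are assumed, and the choice of ValueError is illustrative since the PR does not say which exception the reader raises.

```python
def check_ec_sections(sections):
    """Raise ValueError when the ec-* sections are inconsistent."""
    has_rxns = "ec-rxns" in sections
    has_enzymes = "ec-enzymes" in sections
    if has_rxns != has_enzymes:
        raise ValueError("ec-rxns and ec-enzymes must be given together")
    if not has_rxns:
        return                          # plain model without ec data
    known = {row["enzymes"] for row in sections["ec-enzymes"]}
    for row in sections["ec-rxns"]:
        # Each ec-rxns row is assumed to map enzyme ids to coefficients.
        missing = sorted(set(row.get("enzymes", {})) - known)
        if missing:
            raise ValueError(f"{row['id']}: unknown enzyme(s) {missing}")


consistent = {
    "ec-rxns": [{"id": "r1", "kcat": 10.0, "enzymes": {"P1": 1}}],
    "ec-enzymes": [{"enzymes": "P1", "genes": "g1"}],
}
check_ec_sections(consistent)           # passes silently
```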

Full suite: 504 passed, 0 failures.

Downstream

Once this lands, geckopy can collapse save_ec_model.py and load_ec_model.py to thin dispatch wrappers (file-extension routing, adapter-aware path resolution, empty-model guards, provenance injection): roughly 30 lines of code that are actually geckopy-specific, instead of the current ~620 LOC that re-implement the YAML schema.

edkerk added 3 commits May 30, 2026 19:49
…sations

Mirrors RAVEN MATLAB's readYAMLmodel.m / writeYAMLmodel.m, which
populate the model.ec struct whenever the YAML defines it. Downstream
consumers (geckopy / GECKO) operate on the populated struct rather
than re-parsing the YAML themselves.

New: src/raven_python/io/ec_data.py
- EcData dataclass with the MATLAB-GECKO field shape (per-rxn arrays:
  rxns/kcat/source/notes/eccodes; per-enzyme arrays: genes/enzymes/mw/
  sequence/concs; sparse rxn_enz_mat coupling; gecko_light flag).
- ec_data_from_yaml_sections: parses ec-rxns/ec-enzymes/gecko_light
  into a typed EcData, validating that every enzyme referenced from
  an ec-rxns row exists in ec-enzymes (catches the common authoring
  bug where the two sections drift apart).
- ec_data_to_yaml_sections: serialises an EcData back to the
  list-of-mappings YAML form. Empty source/notes/eccodes/sequence and
  NaN mw/concs are omitted to keep files compact; kcat is always
  written (0 == "no kcat assigned", matching MATLAB GECKO).
- _canonicalize_eccodes / _eccodes_to_yaml handle the scalar-or-list
  YAML representation for EC numbers.
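
The omission rules for the writer can be illustrated with two small row builders. This is only a sketch of the stated behaviour (empty strings and NaN dropped, kcat always written); the function names and the YAML key names are assumptions, not the actual ec_data_to_yaml_sections code.

```python
import math


def enzyme_row(enzyme, gene, mw, sequence, conc):
    """Build one ec-enzymes mapping, skipping sentinel values."""
    row = {"genes": gene, "enzymes": enzyme}
    if not math.isnan(mw):
        row["mw"] = float(mw)           # float() also coerces numpy scalars
    if sequence:
        row["sequence"] = sequence
    if not math.isnan(conc):
        row["concs"] = float(conc)
    return row


def rxn_row(rxn_id, kcat, source="", notes=""):
    """Build one ec-rxns mapping; kcat is written even when it is 0."""
    row = {"id": rxn_id, "kcat": float(kcat)}
    if source:
        row["source"] = source
    if notes:
        row["notes"] = notes
    return row


sparse_enzyme = enzyme_row("P1", "g1", math.nan, "", math.nan)
unassigned_rxn = rxn_row("r1", 0)
```

Here `sparse_enzyme` keeps only the two identifying keys, while `unassigned_rxn` still carries `kcat: 0.0`, so a reader can tell "no kcat assigned" apart from a missing row.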

Extended: src/raven_python/io/yaml.py
- model_from_yaml_data now pulls ec-rxns / ec-enzymes / gecko_light
  out of the foreign-keys stash, builds an EcData, and attaches it
  as model.ec. Other unknown top-level keys still round-trip
  opaquely via model.notes['_yaml_sections'].
- write_yaml_model now serialises model.ec to the top-level
  ec-rxns / ec-enzymes / gecko_light sections when present, and
  drops any stale ec-* in _yaml_sections so the file isn't ambiguous.
- read_yaml_model also accepts the very old RAVEN shape where the
  document root is a bare `-` sequence of single-key mappings; the
  reader merges them into one dict before parsing.
- model_from_yaml_data now also normalises two legacy ecModel
  quirks in line with MATLAB GECKO behaviour:
  * per-metabolite top-level `smiles` -> annotation['smiles']
    (older writers placed SMILES at the metabolite top level);
  * `usage_prot_*` / `prot_pool_exchange` reactions with negative
    lower bound and swapped stoichiometry are flipped to the
    forward convention (warns once per load).
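
The direction flip is the usual reaction reversal: negate every stoichiometric coefficient and map the flux interval [lb, ub] to [-ub, -lb]. The sketch below works on plain reaction dicts; the detection rule (negative lower bound together with a non-positive upper bound), the dict keys, and the function names are assumptions made for illustration.

```python
import warnings


def _is_reversed_prot_rxn(rxn):
    rid = rxn["id"]
    is_prot = rid.startswith("usage_prot_") or rid == "prot_pool_exchange"
    # Assumed detection rule: the whole flux range is non-positive.
    return is_prot and rxn["lower_bound"] < 0 and rxn["upper_bound"] <= 0


def flip_reversed_prot_rxns(reactions):
    """Rewrite reverse-direction protein reactions in place; return the count."""
    flipped = 0
    for rxn in reactions:
        if not _is_reversed_prot_rxn(rxn):
            continue
        rxn["metabolites"] = {m: -c for m, c in rxn["metabolites"].items()}
        lb, ub = rxn["lower_bound"], rxn["upper_bound"]
        rxn["lower_bound"], rxn["upper_bound"] = -ub, -lb
        flipped += 1
    if flipped:
        # One warning per load, not one per reaction.
        warnings.warn(f"flipped {flipped} reverse-direction protein reaction(s)",
                      stacklevel=2)
    return flipped


reactions = [
    {"id": "usage_prot_P1", "metabolites": {"prot_P1": 1.0, "prot_pool": -1.0},
     "lower_bound": -1000.0, "upper_bound": 0.0},
    {"id": "prot_pool_exchange", "metabolites": {"prot_pool": -1.0},
     "lower_bound": -100.0, "upper_bound": 0.0},
    {"id": "r_0001", "metabolites": {"a": -1.0, "b": 1.0},
     "lower_bound": -1000.0, "upper_bound": 1000.0},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    n_flipped = flip_reversed_prot_rxns(reactions)
```

Ordinary reversible reactions such as `r_0001` are left untouched because their id does not match and their upper bound is positive.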

Tests
- tests/test_io_yaml_ec_data.py (new): 18 focused tests covering
  load (model.ec population, sentinel handling for omitted optional
  fields, gecko_light flag, eccodes scalar-or-list, no-ec models),
  save (sections emitted, NaN/empty omission, numpy-scalar coercion,
  stale _yaml_sections overridden), legacy quirks (top-level smiles
  lifted, reverse-direction prot flip with warn, bare-sequence root
  merge), and error paths (half-pair of ec-* sections, dangling
  enzyme reference).
- tests/test_io_yaml.py (updated): RAVEN_DOC fixture grew a complete
  ec-rxns/ec-enzymes pair so the round-trip test now verifies typed
  EcData survives, not just an opaque _yaml_sections stash.

Two shape-management helpers that consumers (geckopy's pipeline,
test fixtures) need on top of the raw dataclass:

- validate(): raise ValueError when per-rxn array lengths, per-enzyme
  array lengths, or the rxn_enz_mat shape drift from one another.
  Cheap; callable after each mutation in a builder pipeline.
- EcData.empty(n_rxns, n_enzymes, *, gecko_light=False): preallocate
  with the canonical sentinels (empty strings for the string fields,
  0 for kcat, NaN for mw/concs, empty CSR matrix). Used by builders
  that allocate up-front and fill row by row.

Both methods are shape-level operations, not algorithm, so they live
with the dataclass rather than on a downstream consumer.
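
A stripped-down picture of the two helpers, assuming nothing beyond the description above. `EcDataSketch` is a hypothetical stand-in with only a few of the fields; it uses plain lists and a dict-of-keys coupling map so the example stays stdlib-only, whereas the real EcData is described as holding arrays and a sparse CSR matrix.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class EcDataSketch:
    rxns: list[str]
    kcat: list[float]
    enzymes: list[str]
    mw: list[float]
    shape: tuple[int, int]              # (n_rxns, n_enzymes) of the coupling
    coupling: dict[tuple[int, int], float] = field(default_factory=dict)
    gecko_light: bool = False

    @classmethod
    def empty(cls, n_rxns: int, n_enzymes: int, *,
              gecko_light: bool = False) -> EcDataSketch:
        """Preallocate with sentinels: '' for strings, 0 for kcat, NaN for mw."""
        return cls(
            rxns=[""] * n_rxns,
            kcat=[0.0] * n_rxns,
            enzymes=[""] * n_enzymes,
            mw=[math.nan] * n_enzymes,
            shape=(n_rxns, n_enzymes),
            gecko_light=gecko_light,
        )

    def validate(self) -> None:
        """Raise ValueError when array lengths or the coupling shape drift."""
        if len(self.kcat) != len(self.rxns):
            raise ValueError("per-rxn arrays differ in length")
        if len(self.mw) != len(self.enzymes):
            raise ValueError("per-enzyme arrays differ in length")
        if self.shape != (len(self.rxns), len(self.enzymes)):
            raise ValueError("coupling shape does not match the arrays")


ec = EcDataSketch.empty(2, 3, gecko_light=True)
ec.validate()                           # a freshly allocated object is consistent
ec.rxns[0], ec.kcat[0] = "r1", 12.5     # builders then fill row by row
ec.validate()
```

Because `validate()` only compares lengths and a shape tuple, calling it after every mutation in a builder pipeline costs next to nothing.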

Tests: 6 new EcData tests covering empty's sentinels, validate's
three drift paths (per-rxn length, per-enzyme length, coupling-matrix
shape), the empty -> validate round-trip, and the gecko_light flag
on empty.

- UP037: drop the string-quoted forward reference on EcData.empty's
  return annotation; the module already uses `from __future__ import
  annotations`, so the bare class name is fine.
- B905: zip(coo.row, coo.col, coo.data) now passes strict=True. The
  three arrays come from the same COO matrix and are guaranteed
  equal length; strict=True turns any future drift into a loud
  ValueError instead of silent truncation.
- I001: drop the stray blank line between the import block and the
  first section comment in tests/test_io_yaml_ec_data.py.
edkerk merged commit 41c2b72 into develop May 30, 2026
4 checks passed
edkerk deleted the feat/ec-model-yaml-io branch May 30, 2026 19:43