feat(data): shared download manifest for artefacts + binaries#16

Merged
edkerk merged 2 commits into develop from feature/data-manifest
May 30, 2026

Conversation

@edkerk edkerk commented May 30, 2026

Scaffolds a single, language-agnostic manifest for the heavy assets (KEGG tables/HMMs, external binaries) so they can live outside the code repo and be served — with SHA256 verification — to both raven-python and MATLAB RAVEN.
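As a rough illustration of the SHA256 verification mentioned above, here is a minimal sketch using only the standard library. The helper names (`sha256_of`, `make_entry`, `verify`) are hypothetical and not part of raven-python; the `url` / `sha256` / `bytes` fields are assumed to mirror what the `manifest` subcommand records.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA256 so large assets never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_entry(path: Path, url: str) -> dict:
    """Build a hypothetical manifest entry: where to fetch the file and how to check it."""
    return {"url": url, "sha256": sha256_of(path), "bytes": path.stat().st_size}


def verify(path: Path, entry: dict) -> None:
    """Raise ValueError if a downloaded file does not match its manifest entry."""
    size = path.stat().st_size
    if size != entry["bytes"]:
        raise ValueError(f"{path.name}: expected {entry['bytes']} bytes, got {size}")
    actual = sha256_of(path)
    if actual != entry["sha256"]:
        raise ValueError(f"{path.name}: SHA256 mismatch ({actual})")
```

Checking the byte count first is a cheap way to reject truncated downloads before hashing; the hash remains the actual integrity check.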

Format

The manifest is a superset of the two runtime registries: manifest["data"] -> raven_python.data._DATA_REGISTRY, manifest["binaries"] -> raven_python.binaries._REGISTRY.
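To make the mapping concrete, a minimal sketch of splitting one manifest dict into the two registries. This is not the actual `to_*_registry` code: `split_manifest`, the `version` key, and `SUPPORTED_VERSION` are assumptions for illustration, and the real schema lives in data/manifest.schema.json.

```python
from __future__ import annotations

SUPPORTED_VERSION = 1  # assumed schema version, not taken from the PR


def split_manifest(manifest: dict) -> tuple[dict, dict]:
    """Return (data_registry, binaries_registry) built from one manifest dict."""
    version = manifest.get("version")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported manifest version: {version!r}")
    # Copy each entry so later registry edits never mutate the manifest itself.
    data = {name: dict(entry) for name, entry in manifest.get("data", {}).items()}
    binaries = {name: dict(entry) for name, entry in manifest.get("binaries", {}).items()}
    return data, binaries
```

Because both registries come from the same document, a MATLAB consumer reading the same JSON would see identical URLs and hashes.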

Wiring

  • New raven_python.manifest: load_manifest / to_*_registry / load_into_registries.
  • Lazy autoload: data.ensure_* and binaries.ensure_binary populate from $RAVEN_PYTHON_MANIFEST on first use when the registry is still empty (guarded — no effect when a registry is passed explicitly or the env var is unset).
  • make_registry_snippet.py gains a manifest subcommand (computes url + sha256 + bytes, writes/updates the JSON).

Docs (data_manifest.md)

Format, Python + MATLAB consumer sketches, GitHub Releases vs Zenodo hosting, and the answer to "auto-upload via GitHub": the native GitHub↔Zenodo integration only archives the repo source zip (not attached release assets), so a release→Zenodo GitHub Action (template included) is the right GitHub-centric path for the binaries. Plus per-asset recommendations (DIAMOND GPL-3.0 caveat; KEGG redistribution-rights caveat; don't re-host template models).

Tests

tests/test_manifest.py — converters, version guard, lazy autoload (file:// URLs), and that the repo manifests are valid. Full suite: 517 passed; docs build clean under sphinx-build -W.
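For context on the file:// URLs: the standard library fetches them the same way as remote URLs, so download paths can be exercised without network access. A minimal sketch, where `fetch` is a hypothetical helper and not the downloader used by raven-python:

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch(url: str) -> bytes:
    """Read a URL with the standard library; file:// is handled like https://."""
    with urllib.request.urlopen(url) as response:
        return response.read()


with tempfile.TemporaryDirectory() as tmp:
    asset = Path(tmp) / "asset.bin"
    asset.write_bytes(b"payload")
    # as_uri() needs an absolute path and yields a file:// URL.
    local_url = asset.resolve().as_uri()
    assert fetch(local_url) == b"payload"
```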

No assets are published yet, so the live manifest is empty and runtime behaviour is unchanged until one is provided.

edkerk added 2 commits May 30, 2026 23:42
Introduce a single, language-agnostic manifest (data/manifest.schema.json) that lists every
downloadable data artefact and external-binary bundle with a SHA256, consumed by both
raven-python and (via the same JSON) MATLAB RAVEN. The manifest is a superset of the two
runtime registries:

* manifest["data"]     -> raven_python.data._DATA_REGISTRY
* manifest["binaries"] -> raven_python.binaries._REGISTRY

Added:
* data/manifest.schema.json (JSON Schema) + data/manifest.example.json (worked example) +
  data/manifest.json (empty, the live source of truth until assets are published).
* raven_python.manifest — load_manifest / to_*_registry / load_into_registries.
* Lazy autoload: data.ensure_* and binaries.ensure_binary populate themselves from
  $RAVEN_PYTHON_MANIFEST on first use when their registry is still empty (guarded; no effect
  when a registry is passed explicitly or the env var is unset).
* scripts/make_registry_snippet.py: a `manifest` subcommand that computes url+sha256+bytes
  and writes/updates manifest.json.
* tests/test_manifest.py (round-trip, converters, lazy autoload via file:// URLs, repo
  manifests valid).
* docs/maintenance/data_manifest.md — format, Python + MATLAB consumers, GitHub-Releases vs
  Zenodo hosting (incl. a release→Zenodo GitHub Action), and per-asset recommendations.
…n permitted

Reflect the chosen distribution model: GitHub release assets live outside the git tree, so a
separate data repository is optional — attach assets to dedicated tags (e.g. kegg-kegg116,
diamond-2.1.9) on an existing RAVEN repo and reuse the same URLs across raven-python and
MATLAB RAVEN. Use Zenodo only for DOIs or files >2 GB. KEGG artefacts are redistributed with
permission, so the prior 'confirm rights' caveat is removed. Example/schema URLs repointed
from a hypothetical raven-data repo to raven-python.
@edkerk edkerk merged commit a4bc86d into develop May 30, 2026
5 checks passed
@edkerk edkerk deleted the feature/data-manifest branch May 30, 2026 22:40