feat(data): shared download manifest for artefacts + binaries #16
Merged
Introduce a single, language-agnostic manifest (data/manifest.schema.json) that lists every downloadable data artefact and external-binary bundle with a SHA256, consumed by both raven-python and (via the same JSON) MATLAB RAVEN.

The manifest is a superset of the two runtime registries:
* manifest["data"] -> raven_python.data._DATA_REGISTRY
* manifest["binaries"] -> raven_python.binaries._REGISTRY

Added:
* data/manifest.schema.json (JSON Schema) + data/manifest.example.json (worked example) + data/manifest.json (the live source of truth; empty until assets are published).
* raven_python.manifest: load_manifest / to_*_registry / load_into_registries.
* Lazy autoload: data.ensure_* and binaries.ensure_binary populate themselves from $RAVEN_PYTHON_MANIFEST on first use when their registry is still empty (guarded; no effect when a registry is passed explicitly or the env var is unset).
* scripts/make_registry_snippet.py: a `manifest` subcommand that computes url+sha256+bytes and writes/updates manifest.json.
* tests/test_manifest.py (round-trip, converters, lazy autoload via file:// URLs, repo manifests valid).
* docs/maintenance/data_manifest.md: format, Python + MATLAB consumers, GitHub-Releases vs Zenodo hosting (incl. a release→Zenodo GitHub Action), and per-asset recommendations.
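As a rough illustration of the "superset of the two runtime registries" idea, the sketch below splits one manifest dictionary into per-section registries. Only the top-level "data" and "binaries" keys and the url/sha256/bytes trio come from the description above; the "version" field name, the example entry, and the helper `section_to_registry` are assumptions for illustration and are not the actual `to_*_registry` implementations.

```python
from typing import Any

SUPPORTED_VERSION = 1  # assumption: the real version guard may differ

# Hypothetical manifest; the entry name and URL are placeholders.
EXAMPLE_MANIFEST: dict[str, Any] = {
    "version": 1,  # assumed field name
    "data": {
        "kegg-example": {
            "url": "https://example.invalid/kegg-example.tar.gz",
            "sha256": "0" * 64,
            "bytes": 1024,
        }
    },
    "binaries": {},
}


def section_to_registry(manifest: dict[str, Any], section: str) -> dict[str, dict]:
    """Return a shallow copy of one manifest section, keyed by asset name."""
    version = manifest.get("version")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported manifest version: {version!r}")
    # Copy each entry so later registry edits do not mutate the manifest.
    return {name: dict(entry) for name, entry in manifest.get(section, {}).items()}


data_registry = section_to_registry(EXAMPLE_MANIFEST, "data")
binary_registry = section_to_registry(EXAMPLE_MANIFEST, "binaries")
```

Copying the entries keeps the loaded manifest and the runtime registries independent, which is one plausible way to let the same JSON feed more than one consumer.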
…n permitted
Reflect the chosen distribution model: GitHub release assets live outside the git tree, so a separate data repository is optional; attach assets to dedicated tags (e.g. kegg-kegg116, diamond-2.1.9) on an existing RAVEN repo and reuse the same URLs across raven-python and MATLAB RAVEN. Use Zenodo only for DOIs or files >2 GB. KEGG artefacts are redistributed with permission, so the prior 'confirm rights' caveat is removed. Example/schema URLs repointed from a hypothetical raven-data repo to raven-python.
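To make the "dedicated tag plus url+sha256+bytes" model concrete, here is a minimal sketch of building one manifest entry for a local file that will be attached to a release tag. The URL follows GitHub's standard release-asset download pattern; the owner `example-org`, the helper names, and the entry layout are assumptions and do not describe what `make_registry_snippet.py` actually does.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def release_asset_url(owner: str, repo: str, tag: str, asset_name: str) -> str:
    """Build the download URL for an asset attached to a GitHub release tag."""
    return f"https://github.com/{owner}/{repo}/releases/download/{tag}/{asset_name}"


def make_entry(path: Path, owner: str, repo: str, tag: str) -> dict:
    """Assemble a hypothetical manifest entry: url, sha256, and size in bytes."""
    return {
        "url": release_asset_url(owner, repo, tag, path.name),
        "sha256": sha256_of(path),
        "bytes": path.stat().st_size,
    }
```

A call such as `make_entry(Path("diamond-2.1.9.tar.gz"), "example-org", "raven-python", "diamond-2.1.9")` would yield an entry whose URL can be shared unchanged by both consumers.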
Scaffolds a single, language-agnostic manifest for the heavy assets (KEGG tables/HMMs, external binaries) so they can live outside the code repo and be served — with SHA256 verification — to both raven-python and MATLAB RAVEN.
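The SHA256 verification mentioned above could look roughly like the following. This is a sketch under assumptions: the entry is taken to carry `sha256` and an optional `bytes` field, and `verify_payload` is a hypothetical helper, not a function from raven-python.

```python
import hashlib


def verify_payload(payload: bytes, entry: dict) -> None:
    """Raise ValueError if the downloaded bytes do not match the manifest entry."""
    expected_size = entry.get("bytes")
    # Cheap check first: a wrong size means the download is truncated or wrong.
    if expected_size is not None and len(payload) != expected_size:
        raise ValueError(
            f"size mismatch: expected {expected_size}, got {len(payload)}"
        )
    actual = hashlib.sha256(payload).hexdigest()
    if actual != entry["sha256"].lower():
        raise ValueError(
            f"sha256 mismatch: expected {entry['sha256']}, got {actual}"
        )
```

Checking the size before hashing lets an obviously truncated download fail quickly, while the digest comparison catches corruption or a swapped file.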
Format
* data/manifest.schema.json: JSON Schema
* data/manifest.example.json: worked example
* data/manifest.json: live (empty until assets are published)

It's a superset of the two runtime registries: manifest["data"] → _DATA_REGISTRY, manifest["binaries"] → binaries._REGISTRY.
Wiring
* raven_python.manifest: load_manifest / to_*_registry / load_into_registries.
* data.ensure_* and binaries.ensure_binary populate from $RAVEN_PYTHON_MANIFEST on first use when the registry is still empty (guarded: no effect when a registry is passed explicitly or the env var is unset).
* make_registry_snippet.py gains a manifest subcommand (computes url + sha256 + bytes, writes/updates the JSON).
Docs (data_manifest.md)
Format, Python + MATLAB consumer sketches, GitHub Releases vs Zenodo hosting, and the answer to "auto-upload via GitHub": the native GitHub↔Zenodo integration only archives the repo source zip (not attached release assets), so a release→Zenodo GitHub Action (template included) is the right GitHub-centric path for the binaries. Plus per-asset recommendations (DIAMOND GPL-3.0 caveat; KEGG redistribution-rights caveat; don't re-host template models).
Tests
tests/test_manifest.py: converters, version guard, lazy autoload (file:// URLs), and a check that the repo manifests are valid. Full suite: 517 passed; docs build clean under sphinx-build -W.
No assets are published yet, so the live manifest is empty and runtime behaviour is unchanged until one is provided.