34 changes: 34 additions & 0 deletions data/manifest.example.json
@@ -0,0 +1,34 @@
{
"manifest_version": 1,
"generated": "2026-05-30",
"data": {
"kegg": {
"version": "kegg116",
"description": "KEGG reference model, KO/reaction tables, and prokaryote/eukaryote HMM libraries for getKEGGModelForOrganism.",
"license": "Derived from the KEGG database; redistributed with permission from KEGG.",
"doi": "10.5281/zenodo.0000000",
"source": "https://github.com/SysBioChalmers/raven-python/releases/tag/kegg-kegg116",
"files": {
"reference_model.yml.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/reference_model.yml.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
"ko_reaction.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_reaction.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
"ko_names.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_names.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
"organism_gene_ko.tsv.xz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/organism_gene_ko.tsv.xz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
"rxn_flags.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/rxn_flags.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
"prokaryotes.hmm": { "url": "https://zenodo.org/records/0000000/files/prokaryotes.hmm", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
}
}
},
"binaries": {
"diamond": {
"version": "2.1.9",
"provides": ["diamond"],
"description": "DIAMOND protein aligner (homology-based reconstruction).",
"license": "GPL-3.0-only — ship the upstream COPYING alongside each ZIP.",
"platforms": {
"linux-x86_64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-linux-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
"macos-arm64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-macos-arm64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
"windows-x86_64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-windows-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
}
}
}
}
6 changes: 6 additions & 0 deletions data/manifest.json
@@ -0,0 +1,6 @@
{
"manifest_version": 1,
"generated": "2026-05-30",
"data": {},
"binaries": {}
}
82 changes: 82 additions & 0 deletions data/manifest.schema.json
@@ -0,0 +1,82 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://github.com/SysBioChalmers/raven-python/manifest.schema.json",
"title": "RAVEN data/binary manifest",
"description": "Language-agnostic registry of downloadable raven-python / RAVEN data artefacts and external binary bundles. Consumed by the Python resolvers (raven_python.data / raven_python.binaries) and by MATLAB RAVEN. Every file carries a SHA256 so consumers verify integrity after download.",
"type": "object",
"required": ["manifest_version"],
"additionalProperties": false,
"properties": {
"manifest_version": {
"type": "integer",
"const": 1,
"description": "Format version of this manifest document."
},
"generated": {
"type": "string",
"description": "ISO-8601 date the manifest was generated (informational)."
},
"data": {
"type": "object",
"description": "Data-artefact datasets, keyed by dataset id (e.g. 'kegg'). Maps onto raven_python.data._DATA_REGISTRY.",
"additionalProperties": { "$ref": "#/$defs/dataset" }
},
"binaries": {
"type": "object",
"description": "External command-line tool bundles, keyed by bundle id (e.g. 'blast', 'diamond', 'hmmer'). Maps onto raven_python.binaries._REGISTRY.",
"additionalProperties": { "$ref": "#/$defs/bundle" }
}
},
"$defs": {
"file": {
"type": "object",
"required": ["url", "sha256"],
"additionalProperties": false,
"properties": {
"url": { "type": "string", "format": "uri", "description": "Direct download URL (GitHub release asset, Zenodo file, etc.)." },
"sha256": { "type": "string", "pattern": "^[0-9a-f]{64}$", "description": "Lowercase hex SHA256 of the file." },
"bytes": { "type": "integer", "minimum": 0, "description": "File size in bytes (informational; for progress bars / sanity checks)." }
}
},
"dataset": {
"type": "object",
"required": ["version", "files"],
"additionalProperties": false,
"properties": {
"version": { "type": "string", "description": "Dataset version tag, e.g. 'kegg116'. Used in the cache path." },
"description": { "type": "string" },
"license": { "type": "string", "description": "SPDX id or free text. NOTE: KEGG-derived artefacts are subject to KEGG's terms — confirm redistribution rights before publishing." },
"doi": { "type": "string", "description": "Zenodo (or other) DOI for this dataset version, if archived." },
"source": { "type": "string", "format": "uri", "description": "Human-facing page for the release/record (GitHub release or Zenodo landing page)." },
"files": {
"type": "object",
"minProperties": 1,
"description": "Artefact files keyed by filename.",
"additionalProperties": { "$ref": "#/$defs/file" }
}
}
},
"bundle": {
"type": "object",
"required": ["version", "provides", "platforms"],
"additionalProperties": false,
"properties": {
"version": { "type": "string", "description": "Upstream tool version, e.g. '2.16.0'." },
"provides": {
"type": "array",
"items": { "type": "string" },
"minItems": 1,
"description": "Executable names this bundle provides, e.g. ['blastp', 'makeblastdb']."
},
"description": { "type": "string" },
"license": { "type": "string", "description": "Upstream tool license (e.g. DIAMOND is GPL-3.0-only — ship its license text alongside the ZIP)." },
"platforms": {
"type": "object",
"minProperties": 1,
"description": "One entry per platform, keyed '<os>-<arch>' (e.g. 'linux-x86_64', 'macos-arm64', 'windows-x86_64'). Matches raven_python.binaries._platform_key().",
"additionalProperties": { "$ref": "#/$defs/file" }
}
}
}
}
}
155 changes: 155 additions & 0 deletions docs/maintenance/data_manifest.md
@@ -0,0 +1,155 @@
# Data & binary manifest

Large artefacts (KEGG tables / HMMs, template models) and external-binary bundles
(BLAST / DIAMOND / HMMER) are **not** committed to the code repository. They are published
as downloadable assets and described by a single, language-agnostic **manifest** that both
raven-python and MATLAB RAVEN read. Every file carries a **SHA256**, so consumers verify
integrity after download.

- Format: [`data/manifest.schema.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.schema.json) (JSON Schema)
- Worked example: [`data/manifest.example.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.example.json)
- Live manifest: [`data/manifest.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.json) (empty until assets are published)

The manifest is a superset of the two runtime registries:

| Manifest section | Runtime registry |
| --- | --- |
| `data` | {data}`raven_python.data._DATA_REGISTRY` |
| `binaries` | `raven_python.binaries._REGISTRY` |

```json
{
  "manifest_version": 1,
  "data": { "<dataset>": { "version": "...", "doi": "...", "files": { "<name>": {"url": "...", "sha256": "...", "bytes": 0} } } },
  "binaries": { "<bundle>": { "version": "...", "provides": ["..."], "platforms": { "<os>-<arch>": {"url": "...", "sha256": "...", "bytes": 0} } } }
}
```
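
The layout above can be walked with a few lines of Python. The following is a minimal sketch, not part of `raven_python`: the helper names `iter_manifest_files` and `check_manifest` are hypothetical, and the checks cover only a small subset of what `data/manifest.schema.json` enforces (the version constant, the presence of `url`, and the SHA256 pattern).

```python
import re

# Same pattern as the "sha256" property in data/manifest.schema.json.
SHA256_RE = re.compile(r"[0-9a-f]{64}")


def iter_manifest_files(manifest):
    """Yield (section, entry_id, key, file_entry) for every file in a manifest dict.

    Hypothetical helper: datasets keep their files under "files",
    binary bundles keep them under "platforms".
    """
    for dataset_id, dataset in manifest.get("data", {}).items():
        for name, entry in dataset["files"].items():
            yield "data", dataset_id, name, entry
    for bundle_id, bundle in manifest.get("binaries", {}).items():
        for platform, entry in bundle["platforms"].items():
            yield "binaries", bundle_id, platform, entry


def check_manifest(manifest):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    if manifest.get("manifest_version") != 1:
        problems.append("manifest_version must be 1")
    for section, entry_id, key, entry in iter_manifest_files(manifest):
        where = f"{section}.{entry_id}.{key}"
        if "url" not in entry:
            problems.append(f"{where}: missing url")
        if not SHA256_RE.fullmatch(str(entry.get("sha256", ""))):
            problems.append(f"{where}: sha256 is not 64 lowercase hex characters")
    return problems
```

For full validation, a JSON Schema validator run against `data/manifest.schema.json` is the more complete option; this sketch only illustrates how the two sections are shaped.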

## Consuming it — Python

Point raven-python at a manifest and the resolvers populate themselves on first use,
verifying each download's checksum:

```bash
export RAVEN_PYTHON_MANIFEST=https://github.com/SysBioChalmers/raven-python/releases/download/manifest-v1/manifest.json
```

```python
from raven_python import manifest
manifest.load_into_registries() # or load_into_registries("/path/or/url")
# now data.ensure_kegg_data() / binaries.ensure_binary("diamond") resolve from the manifest
```

If `RAVEN_PYTHON_MANIFEST` is set, `data.ensure_*` and `binaries.ensure_binary` load it
lazily — no explicit call needed.
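
For readers who want to see what "download, then verify the checksum" amounts to, here is a standard-library sketch of that step. It is an illustration under stated assumptions, not the resolver's actual code: `sha256_of` and `ensure_file` are hypothetical names, and the cache directory is simply whatever path the caller passes in.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB artefacts never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_file(entry, name, cache_dir):
    """Download entry["url"] into cache_dir/name if absent, then verify its SHA256.

    `entry` is one manifest file entry: {"url": ..., "sha256": ..., "bytes": ...}.
    Raises ValueError when the digest does not match the manifest.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        urllib.request.urlretrieve(entry["url"], target)
    actual = sha256_of(target)
    if actual != entry["sha256"]:
        raise ValueError(
            f"SHA256 mismatch for {name}: expected {entry['sha256']}, got {actual}"
        )
    return target
```

Verification runs on every call, including for files already in the cache, so a corrupted or partially downloaded file is reported rather than silently reused.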

## Consuming it — MATLAB

The same JSON is trivial to read from MATLAB (`webread` + `jsondecode`), download
(`websave`), and verify (Java's `MessageDigest`, always available in MATLAB):

```matlab
function file = ensureDataFile(manifestUrl, dataset, name, cacheDir)
    % Fetch the manifest as text and decode it into a struct.
    m = jsondecode(webread(manifestUrl, weboptions('ContentType','text')));
    % jsondecode rewrites keys that are not valid MATLAB identifiers
    % (e.g. 'ko_names.tsv.gz'), so look the field up by its sanitised name.
    entry = m.data.(dataset).files.(matlab.lang.makeValidName(name));
    file = fullfile(cacheDir, name);
    if ~isfile(file)
        websave(file, entry.url);
    end
    assert(strcmp(sha256(file), entry.sha256), 'SHA256 mismatch for %s', name);
end

function hex = sha256(file)
    fid = fopen(file, 'r'); raw = fread(fid, Inf, '*uint8'); fclose(fid);
    md = java.security.MessageDigest.getInstance('SHA-256');
    md.update(raw);
    % Pad every byte to two hex digits so the result is always 64 characters.
    hex = lower(reshape(dec2hex(typecast(md.digest(), 'uint8'), 2)', 1, []));
end
```

## Publishing — generating manifest entries

After uploading a release's files, add/update an entry with the maintainer script
([`scripts/make_registry_snippet.py`](https://github.com/SysBioChalmers/raven-python/blob/develop/scripts/make_registry_snippet.py)),
which computes each SHA256 and byte size:

```bash
python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \
--target data --dataset kegg --version kegg116 --dir artefacts \
--base-url https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116 \
--doi 10.5281/zenodo.0000000 --source https://zenodo.org/records/0000000

python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \
--target binary --bundle diamond --version 2.1.9 --provides diamond --dir zips \
--base-url https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9 \
--license GPL-3.0-only
```
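
To make the output of that step concrete, the sketch below builds a manifest `files` mapping (URL, SHA256, byte size) for every file in a directory. It is an independent illustration, not the contents of `scripts/make_registry_snippet.py`; the function name `build_file_entries` is hypothetical.

```python
import hashlib
from pathlib import Path


def build_file_entries(directory, base_url):
    """Return {filename: {"url", "sha256", "bytes"}} for each regular file in `directory`.

    Hypothetical helper: `base_url` is the release download prefix, e.g. the
    value passed as --base-url in the commands above.
    """
    entries = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            # Read in 1 MiB chunks so large artefacts are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        entries[path.name] = {
            "url": f"{base_url.rstrip('/')}/{path.name}",
            "sha256": digest.hexdigest(),
            "bytes": path.stat().st_size,
        }
    return entries
```

Passing the result through `json.dumps(entries, indent=2)` gives a block that can be compared by eye against the `files` section the maintainer script wrote.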

## Where to host

Release **assets are stored separately from the git tree** (GitHub keeps them in a blob
store), so attaching them to a release does **not** bloat the repository. A dedicated assets
repository is therefore **optional** — attach the assets to releases on an existing RAVEN
repo (this one, or MATLAB [RAVEN](https://github.com/SysBioChalmers/RAVEN)) and have **both
packages reuse the same release-asset URLs** via this manifest.

Use **dedicated tags** for the assets — e.g. `kegg-kegg116`, `diamond-2.1.9` — rather than
attaching them to code-milestone releases like `v0.1.0a1`. KEGG data updates roughly yearly
while the code changes often; dedicated tags keep the two cadences decoupled while still
living in one repository. The manifest's per-dataset `version` does the rest (it namespaces
the download cache).
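
The tag convention and the version-namespaced cache can be sketched as follows. The `<id>-<version>` tag and the download URL shape are taken from the example manifest; the cache layout `<root>/<id>/<version>/<file>` is an assumption made for illustration and may differ from what `raven_python` actually uses. All three function names are hypothetical.

```python
from pathlib import Path


def release_tag(asset_id, version):
    """Dedicated asset tag, e.g. ('kegg', 'kegg116') -> 'kegg-kegg116'."""
    return f"{asset_id}-{version}"


def asset_url(repo, asset_id, version, filename):
    """GitHub release-asset URL for a file attached to the dedicated tag."""
    tag = release_tag(asset_id, version)
    return f"https://github.com/{repo}/releases/download/{tag}/{filename}"


def cache_path(cache_root, asset_id, version, filename):
    """Assumed cache layout: one directory per dataset version, so versions never collide."""
    return Path(cache_root) / asset_id / version / filename
```

Because the version is part of the path, publishing a newer dataset under a new tag leaves files cached for the older version untouched.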

The manifest records only URLs, so consumers do not care whether a file lives on GitHub
Releases or Zenodo. Mix the two per file:

- **GitHub Releases** — simplest, free, language-agnostic, up to **~2 GB per file**. The
default home for the manifest and most assets.
- **Zenodo** — adds a citable **DOI**, long-term archival, and handles files **larger than
2 GB** (up to 50 GB/record). Use it for individual large HMM libraries or anything you want
citable; point just that file's `url` at the Zenodo record.
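
The rule of thumb in the list above fits in a small helper. This is a sketch only: `suggest_host` is a hypothetical function, the 2 GB ceiling is interpreted here as 2 GiB, and the Zenodo figure is a per-record limit, so checking a single file against it is necessary but not sufficient.

```python
# Thresholds taken from the hosting notes above; exact values are assumptions.
GITHUB_MAX_BYTES = 2 * 1024**3  # ~2 GB per release asset (interpreted as 2 GiB)
ZENODO_MAX_BYTES = 50 * 10**9   # 50 GB per Zenodo record


def suggest_host(size_bytes, needs_doi=False):
    """Return 'github' or 'zenodo' for one file, following the guidance above."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes > ZENODO_MAX_BYTES:
        raise ValueError("file exceeds the limits of both hosts; split it first")
    if needs_doi or size_bytes >= GITHUB_MAX_BYTES:
        return "zenodo"
    return "github"
```

Only the `url` of the affected file changes in the manifest; the other files of the same dataset can stay on GitHub Releases.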

### Auto-publishing to Zenodo from GitHub (only if you need DOIs / >2 GB files)

:::{important}
The **native GitHub↔Zenodo integration** (flip a switch, publish a Release → DOI) archives
the **repository source zipball** at the tag — it does **not** capture files attached to the
Release. So it only works for assets *committed into the repo*, which defeats the purpose for
multi-GB binaries. Use it for a *software* DOI, not for the data assets.
:::

If you do want Zenodo DOIs (or need to host files >2 GB), keep it GitHub-driven with a small
**GitHub Action** that, on release, uploads the assets to Zenodo via its REST API (e.g.
[`zenodraft`](https://github.com/zenodraft/zenodraft)). You cut a normal GitHub Release with
the files attached; the Action mirrors them to Zenodo and mints a new version DOI. Drop this
into whichever repo hosts the asset releases as `.github/workflows/zenodo.yml`:

```yaml
name: Mirror release assets to Zenodo
on:
  release:
    types: [published]
jobs:
  zenodo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "20" }
      - name: Download this release's assets
        run: gh release download "${{ github.event.release.tag_name }}" --dir assets
        env: { GH_TOKEN: "${{ github.token }}" }
      - name: Deposit a new version on Zenodo
        run: npx zenodraft@latest version create --publish ${{ vars.ZENODO_CONCEPT_DOI }} assets/*
        env: { ZENODO_ACCESS_TOKEN: "${{ secrets.ZENODO_TOKEN }}" }
```

Then record the resulting DOI in the manifest via the `--doi` flag above. Net result: you only
ever interact with GitHub Releases; Zenodo archiving and DOI minting happen automatically.

## Per-asset recommendations

| Asset | Home | Notes |
| --- | --- | --- |
| **Software binaries** (BLAST / DIAMOND / HMMER) | **bioconda** preferred; or release ZIPs via the resolver | DIAMOND is **GPL-3.0** — ship its license text in the ZIP; keep it as a separate asset, never bundled into the MIT wheel. |
| **KEGG HMMs / tables** | GitHub release (dedicated `kegg-*` tag); Zenodo for libraries >2 GB | Derived from the KEGG dump and **redistributed with permission from KEGG**. Note the provenance in the release notes / manifest `license`. |
| **Template models** (Human-GEM, yeast-GEM) | **Don't re-host** | Fetch from their canonical repos by pinned release tag — respects their licenses and avoids stale copies. |
4 changes: 4 additions & 0 deletions docs/maintenance/index.md
@@ -8,11 +8,15 @@ rebuild and release them.
artefact releases.
- **[Maintaining binaries](maintaining_binaries.md)** — building and publishing the
external-binary (BLAST / DIAMOND / HMMER) ZIP releases.
- **[Data & binary manifest](data_manifest.md)** — the shared manifest that lists every
published artefact / binary (consumed by raven-python and MATLAB RAVEN), where to host
assets (GitHub Releases vs Zenodo), and the GitHub→Zenodo auto-publish setup.

```{toctree}
:hidden:

kegg_data_format
maintaining_kegg_data
maintaining_binaries
data_manifest
```
10 changes: 10 additions & 0 deletions docs/reference/api/resolvers.md
@@ -20,3 +20,13 @@ Data-bundle resolver (KEGG artefacts and template-model data).
.. automodule:: raven_python.data
:members:
```

## `raven_python.manifest`

Loads a shared [data/binary manifest](../../maintenance/data_manifest.md) into the two
registries above (and is consulted lazily via `$RAVEN_PYTHON_MANIFEST`).

```{eval-rst}
.. automodule:: raven_python.manifest
:members:
```