This guide is for the package maintainer who rebuilds raven-python's KEGG
artefacts once per KEGG release. End users never do this — they download the
published, version-pinned artefacts. The build has two implemented steps so far:
3b.1 download (reconstruction/kegg/download.py) and 3b.2 parse
(reconstruction/kegg/parse.py); see PLAN.md §2.3b for the full pipeline.
The bulk KEGG dump is licensed. You need an active subscription to
ftp.kegg.net, which gives you a username and password.
The download reads your KEGG username and password from a ~/.netrc file — it
never takes them on the command line, so they stay out of your shell history and
out of ps output. Create the file (readable only by you) and add a machine
line for the KEGG host:

    touch ~/.netrc && chmod 600 ~/.netrc

Then add this single line to ~/.netrc, substituting your subscription
credentials:

    machine ftp.kegg.net login YOUR_KEGG_USER password YOUR_KEGG_PASSWORD

Notes:
- The host must be ftp.kegg.net: that is the machine name the downloader looks up. A machine line for any other host is ignored.
- The file must be mode 600 (owner read/write only). Python's netrc parser refuses a .netrc that other users can read.
- ~/.netrc is the same convention curl, wget and git use, so if you already have one, just add the ftp.kegg.net line to it.
If you keep secrets somewhere other than $HOME, point the downloader at a
different file with netrc_path=... (see below); the format is identical.
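
Before starting a long download it can help to confirm that the file parses and contains the KEGG entry. The following is a minimal sketch using only Python's standard netrc module; check_kegg_netrc is a hypothetical helper, not part of raven-python, and the explicit mode check is an assumption about what you would want to enforce for a file in a non-default location.

```python
import netrc
import os
import stat
from pathlib import Path

KEGG_HOST = "ftp.kegg.net"  # machine name given in this guide


def check_kegg_netrc(path=None):
    """Return the KEGG login found in `path`, or raise ValueError.

    Hypothetical helper for illustration; not a raven-python function.
    """
    path = Path(path) if path is not None else Path.home() / ".netrc"

    # On POSIX, reject files that group or others can access (expected mode: 600).
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            raise ValueError(f"{path} has mode {mode:o}; expected 600")

    # authenticators() returns (login, account, password), or None if the host
    # has no entry and the file has no default entry.
    entry = netrc.netrc(str(path)).authenticators(KEGG_HOST)
    if entry is None:
        raise ValueError(f"{path} has no 'machine {KEGG_HOST}' line")

    login, _account, password = entry
    if not login or not password:
        raise ValueError(f"{path}: the {KEGG_HOST} entry lacks a login or password")
    return login
```

Calling check_kegg_netrc() with no argument checks ~/.netrc; pass a path to check a file kept elsewhere.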
With ~/.netrc in place, no credentials need to be passed in code:

    from raven_python.reconstruction.kegg import download_kegg_dump

    # Reads ~/.netrc, fetches the KEGG archives, extracts and arranges them.
    download_kegg_dump("keggdb")

This fetches the reaction / compound / glycan / ko archives, the eukaryote and
prokaryote proteomes, and the taxonomy file; extracts them; and arranges them into the
flat layout the parser expects (reaction, reaction.lst,
reaction_mapformula.lst, compound = compound + glycan, compound.inchi,
ko, genes.pep = both proteomes, taxonomy).
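
A quick way to confirm the download finished is to check the output directory against the layout listed above. This is a sketch, not part of raven-python: missing_kegg_files is a hypothetical helper, and it assumes each listed name is a regular file directly under the download directory.

```python
from pathlib import Path

# File names taken from the flat layout described in this guide.
EXPECTED_FILES = (
    "reaction",
    "reaction.lst",
    "reaction_mapformula.lst",
    "compound",
    "compound.inchi",
    "ko",
    "genes.pep",
    "taxonomy",
)


def missing_kegg_files(keggdb):
    """Return the expected file names that are absent from `keggdb`.

    Hypothetical helper; assumes every entry is a plain file, not a directory.
    """
    root = Path(keggdb)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from missing_kegg_files("keggdb") suggests the layout is complete; it does not validate file contents.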
Credential alternatives:

    # A .netrc in a non-default location:
    download_kegg_dump("keggdb", netrc_path="/run/secrets/kegg_netrc")

    # Pass credentials explicitly (only when they come from a secret manager at
    # runtime; never hardcode literals in committed code):
    download_kegg_dump("keggdb", auth=("YOUR_KEGG_USER", "YOUR_KEGG_PASSWORD"))

Already-downloaded files are skipped; pass force=True to re-fetch (for a new
KEGG release).
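
One way to supply auth= without hardcoding literals is to read the values from environment variables that a secret manager injects at runtime. The sketch below assumes hypothetical variable names KEGG_USER and KEGG_PASSWORD; neither is defined by raven-python.

```python
import os


def kegg_auth_from_env(user_var="KEGG_USER", password_var="KEGG_PASSWORD"):
    """Build a (user, password) tuple from environment variables.

    The variable names are illustrative assumptions, not a raven-python convention.
    """
    try:
        return os.environ[user_var], os.environ[password_var]
    except KeyError as exc:
        # Name the missing variable without echoing any secret value.
        raise RuntimeError(f"environment variable {exc.args[0]} is not set") from None


# Intended use, per the auth= parameter shown above (not executed here):
#   download_kegg_dump("keggdb", auth=kegg_auth_from_env())
```

Failing loudly when a variable is missing avoids a confusing authentication error later in the download.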

    from raven_python.reconstruction.kegg import parse_kegg_dump

    parse_kegg_dump("keggdb", "artefacts")

This writes the gene-free reference model (reference_model.yml.gz, gzipped
RAVEN/cobra YAML) and the relational tables as gzipped TSV. See
kegg_data_format.md for what those tables contain and the
format rationale.
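
To spot-check a table after parsing, the standard library is enough. This is a sketch only: peek_table is a hypothetical helper (the package's own reader is read_kegg_table), it assumes the first row is a header, and it picks the decompressor from the file suffix because this guide mentions gzipped TSV while a later snippet reads a .tsv.xz path.

```python
import csv
import gzip
import lzma
from pathlib import Path


def peek_table(path, n=5):
    """Return (header, first n rows) of a gzip- or xz-compressed TSV.

    Hypothetical helper; assumes a header row and UTF-8 text.
    """
    path = Path(path)
    # Choose the decompressor from the suffix: .xz -> lzma, anything else -> gzip.
    opener = lzma.open if path.suffix == ".xz" else gzip.open
    with opener(path, "rt", encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader, [])
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows
```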
Build the per-domain profile-HMM libraries that the de-novo query path (3b.5)
searches. This needs HMMER (hmmbuild, hmmpress), MAFFT, and
CD-HIT on PATH (or point to them with RAVEN_PYTHON_HMMBUILD / RAVEN_PYTHON_MAFFT /
RAVEN_PYTHON_CDHIT, etc.). One way to install them: conda install -c bioconda hmmer mafft cd-hit.
OS note: these three tools run on Linux and macOS but not native Windows — on Windows, run this step inside WSL2. See the native-OS-support matrix in maintaining_binaries.md.
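
Since the build runs for hours, a preflight check for the external tools can save a failed run. The sketch below is an assumption-laden illustration: missing_tools is a hypothetical helper, it assumes each override variable holds a path or command name for the executable, and it looks up hmmpress on PATH only because this guide does not name its variable.

```python
import os
import shutil

# Tool -> override variable, as named in this guide. hmmpress is covered only
# by the guide's "etc.", so no variable name is assumed for it here.
TOOLS = {
    "hmmbuild": "RAVEN_PYTHON_HMMBUILD",
    "hmmpress": None,
    "mafft": "RAVEN_PYTHON_MAFFT",
    "cd-hit": "RAVEN_PYTHON_CDHIT",
}


def missing_tools(tools=TOOLS):
    """Return the tools that cannot be resolved to an executable.

    Hypothetical helper; how raven-python itself resolves binaries may differ.
    """
    missing = []
    for name, env_var in tools.items():
        # Prefer the override variable when it is set and non-empty.
        candidate = os.environ.get(env_var) if env_var else None
        if shutil.which(candidate or name) is None:
            missing.append(name)
    return missing
```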

    from raven_python.reconstruction.kegg import build_hmm_library, read_kegg_table

    organism_gene_ko = read_kegg_table("artefacts/organism_gene_ko.tsv.xz")
    for domain in ("prokaryotes", "eukaryotes"):
        build_hmm_library(
            organism_gene_ko,
            "keggdb/genes.pep",  # proteomes from 3b.1
            "keggdb/taxonomy",   # domain split, from 3b.1
            f"hmms/{domain}",
            domain=domain,
        )

For each KO in the domain it gathers the member sequences, dereplicates with
CD-HIT (~90 % identity), aligns with MAFFT, trains a profile with hmmbuild, and
finally concatenates and hmmpress-es them into a single library.hmm for fast
hmmscan querying. This is the slowest step (hours, once per KEGG release); it
skips KOs whose .hmm already exists, so it is resumable. The resulting
libraries are published as version-pinned artefacts alongside the reference model
and tables.
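
The per-KO steps described above can be pictured as a short command plan. This is a sketch of the general technique, not the code inside build_hmm_library: the function names, intermediate file names, and exact tool options are assumptions chosen for illustration. The flags used (cd-hit -i/-o/-c, mafft --auto, hmmbuild -n, hmmpress -f) are standard options of those tools.

```python
import subprocess
from pathlib import Path


def plan_ko_build(ko, members_fasta, out_dir, identity=0.9):
    """Return [(argv, stdout_path_or_None), ...] for one KO, or [] if already built.

    Hypothetical helper; file naming is illustrative only.
    """
    out_dir = Path(out_dir)
    hmm = out_dir / f"{ko}.hmm"
    if hmm.exists():
        return []  # profile already present: skipping makes the build resumable
    derep = out_dir / f"{ko}.derep.fasta"
    aln = out_dir / f"{ko}.aln.fasta"
    return [
        # 1. Dereplicate member sequences at the given identity threshold.
        (["cd-hit", "-i", str(members_fasta), "-o", str(derep), "-c", str(identity)], None),
        # 2. Align; MAFFT writes the alignment to stdout, so capture it in `aln`.
        (["mafft", "--auto", str(derep)], aln),
        # 3. Train a profile HMM named after the KO.
        (["hmmbuild", "-n", ko, str(hmm), str(aln)], None),
    ]


def run_plan(plan):
    """Execute a plan, redirecting stdout to a file where one is given."""
    for argv, stdout_path in plan:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as fh:
                subprocess.run(argv, check=True, stdout=fh)


def concatenate_profiles(out_dir, library_name="library.hmm"):
    """Concatenate per-KO profiles into one library; return the hmmpress command."""
    out_dir = Path(out_dir)
    # Collect inputs before creating the output so the library never includes itself.
    profiles = sorted(p for p in out_dir.glob("*.hmm") if p.name != library_name)
    library = out_dir / library_name
    with open(library, "wb") as dst:
        for profile in profiles:
            dst.write(profile.read_bytes())
    return ["hmmpress", "-f", str(library)]
```

Separating planning from execution keeps the resumability rule (skip when the .hmm exists) easy to test without the binaries installed.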
scripts/build_kegg_artefacts.py runs 3b.2 (+ 3b.3 with
--hmms) and lays the output out as publishable assets (<domain>.hmm named for
ensure_kegg_hmm_library):

    python scripts/build_kegg_artefacts.py --keggdb keggdb --out artefacts --hmms --threads 8

Upload the contents of artefacts/ to a release, then emit the registry entry for
raven_python.data._DATA_REGISTRY with scripts/make_registry_snippet.py:

    python scripts/make_registry_snippet.py data --dataset kegg --version kegg116 \
        --dir artefacts --base-url https://github.com/ORG/raven-python/releases/download/kegg-data-kegg116

Paste the printed block into _DATA_REGISTRY; from then on ensure_data fetches
and verifies the artefacts for end users automatically.
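
This guide does not say how ensure_data verifies a download. Assuming verification is checksum-based, a maintainer could cross-check uploaded files against local ones with a sketch like the following; sha256_of and checksum_manifest are hypothetical helpers, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(artefact_dir):
    """Map each regular file directly under `artefact_dir` to its SHA-256 digest."""
    root = Path(artefact_dir)
    return {p.name: sha256_of(p) for p in sorted(root.iterdir()) if p.is_file()}
```

Comparing checksum_manifest("artefacts") before upload with the same call on a fresh download is a simple way to confirm the release assets are intact.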
End users do not run the steps above — the published artefacts are fetched and
cached automatically by ensure_data (raven_python.data) under
~/.cache/raven-python/data/kegg-<version>/ on first use, so the entry points below
can be called with no local paths at all (pass an explicit artefact_dir=/
library= to use your own build instead). Two runtime entry points build a draft
model from the artefacts:
- 3b.4: species in KEGG (get_kegg_model_for_organism_from_artefacts): no binaries needed; uses the organism's KEGG gene↔KO annotations. Fully cross-platform. organism_id="prokaryotes" / "eukaryotes" builds a whole-domain model (pass taxonomy=).
- 3b.5: organism not in KEGG (get_kegg_model_from_sequences): hmmscan-es a proteome FASTA against the pressed library.hmm, so it needs HMMER (hmmscan) on Linux/macOS or WSL2 (see the OS matrix). Tune assignment with cutoff, min_score_ratio_ko, min_score_ratio_g.
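
To see which artefact versions are already cached locally, the cache directory named above can be listed directly. This is a sketch: cached_kegg_versions is a hypothetical helper, and it assumes the kegg-<version> directory naming given in this guide.

```python
from pathlib import Path


def cached_kegg_versions(cache_root=None):
    """Return the <version> part of each kegg-<version> directory in the cache.

    Hypothetical helper; defaults to the cache location named in this guide.
    """
    if cache_root is None:
        root = Path.home() / ".cache" / "raven-python" / "data"
    else:
        root = Path(cache_root)
    if not root.is_dir():
        return []  # nothing has been cached yet
    prefix = "kegg-"
    return sorted(p.name[len(prefix):] for p in root.glob(prefix + "*") if p.is_dir())
```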