feat(rpbp): add 8 rpbp modules + fasta_gtf_bam_rpbp subworkflow #11695
Open
pinin4fjords wants to merge 22 commits into
…_rpbp subworkflows
Adds the Rp-Bp Ribo-seq ORF caller (Malone et al. 2017, doi:10.1093/nar/gkw1141) as 8 split modules plus 2 orchestration subworkflows ported from nf-core/riboseq#174. Splitting per-tool (rather than wrapping rpbp's `predict-translated-orfs` umbrella command) gives independent caching on resume and lets the pipeline's own STAR alignment run instead of being re-done inside rpbp.
Modules:
- rpbp/buildconfig: render the Rp-Bp YAML config from pipeline-supplied fasta+gtf.
- rpbp/preparegenome: build the Rp-Bp genome index (STAR index, ribosomal index, ORF BEDs).
- rpbp/extractmetageneprofiles: per-read-length metagene profiles around starts.
- rpbp/estimatemetagenebayesfactors: periodicity Bayes factors per read length.
- rpbp/selectperiodicoffsets: pick a single P-site offset per high-quality length.
- rpbp/extractorfprofiles: per-ORF P-site profile matrix using selected offsets.
- rpbp/estimateorfbayesfactors: Bayesian translated-vs-untranslated model.
- rpbp/selectfinalpredictionset: filter to the final predicted-ORF BED + FASTAs.
Subworkflows:
- bam_rpbp_predictorfs: per-sample 6-step chain on a Ribo-seq BAM with the cohort-shared annotation outputs supplied by the caller.
- fasta_gtf_bam_rpbp: top-level end-to-end run; renders config, prepares the index once, then runs the per-sample chain.
Containers: Wave-built `community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb` (co-installs `bioconda::rpbp=4.0.1` + `bioconda::star=2.7.11b`) for all rpbp tools; `quay.io/biocontainers/coreutils:9.5` for rpbp/buildconfig (config templating only). Versions are emitted via the topic-channel pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…across the board
Bypass `prepare-rpbp-genome` umbrella; call rpbp's internal `get_orfs` function directly via a small inline Python wrapper. Skips the umbrella script's bowtie2-build (rRNA index) and STAR genome-generate steps entirely - none of the downstream Rp-Bp tools we wrap consume those indices, and upstream alignment is supplied as the BAM.
rpbp/buildconfig removed - the config dict is now built in-memory inside rpbp/preparegenome from the input fasta + gtf paths. fasta_gtf_bam_rpbp loses the corresponding setup step.
extractorfprofiles replicates rpbp's metagene-length filter (`ribo_utils.utils.get_periodic_lengths_and_offsets`) inline before calling extract-orf-profiles, so the upstream select-periodic-offsets output can drive --lengths/--offsets without going through rpbp's filename-driven config plumbing. Filter thresholds are exposed via ext.args2 (4 space-separated tokens, defaults match rpbp.defaults.metagene_options).
All 7 inner-module tests now run the upstream chain on real chr20 data (no stub-only tests). Subworkflow tests cover bam_rpbp_predictorfs and fasta_gtf_bam_rpbp end-to-end. The 3 inner modules that need the lower threshold (extractorfprofiles / estimateorfbayesfactors / selectfinalpredictionset) carry tests/nextflow.config setting ext.args2 = '10 1 None 0.0' so the chr20 BAM produces non-empty profiles.
Wave container unchanged (community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-sample 6-step chain was only ever wrapped by fasta_gtf_bam_rpbp; keeping it as a standalone subworkflow added a thin layer no realistic caller needs (rpbp's BED outputs only come from preparegenome). Inlined into fasta_gtf_bam_rpbp - now one subworkflow with 7 process invocations end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m/nf-core/modules into rpbp-add-modules-and-subworkflows
…nnel: column
Cleans up the file:
- Drops the multi-line header narrative recapping design choices and step ordering.
- Drops the numbered "1. ... 2. ... 3. ..." inline narrative blocks.
- Aligns the `// channel:` annotations in the emit block (12 of 13 lines were one column off the longest emit's reference position).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- M1/M2: meta.yml input pattern docs for `orfs_genomic_bed` and
`orfs_exons_bed` now match preparegenome's actual emit suffix
(`*.orfs-{genomic,exons}.annotated.bed.gz`). Doc-only drift before.
- M3: drop the hardcoded `--select-longest-by-stop --select-best-overlapping`
from `rpbp/selectfinalpredictionset/main.nf`; consistent with every other
rpbp module's `--num-cpus + \${args}` shape. Defaults moved to the
module's `tests/nextflow.config` and the subworkflow test config via
`ext.args`, so consuming pipelines configure them in `modules.config`.
- L1: drop `process_long` from `rpbp/estimateorfbayesfactors` so the
`process_high + process_long` lint clash goes away; pipelines that need
the extended walltime can set it via `modules.config` (`time = ...`).
- L2: narrow the `ext.args2` docstring in `rpbp/extractorfprofiles` to
match the actual filter behaviour - only the 3rd (max_bf_var) and 4th
(min_bf_likelihood) tokens accept "None"; the 1st and 2nd must be numeric.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent.yml
The previous `rpbp_star:247a8ae84a6babfb` container co-installed `star=2.7.11b` for users of `prepare-rpbp-genome` / `run-rpbp-pipeline` (which call STAR internally). The modules in this PR bypass those umbrellas entirely (BAMs are supplied from external alignment), so the STAR install was dead weight.
New container is built directly from each module's environment.yml (`bioconda::rpbp=4.0.1`), so the conda and docker/singularity codepaths now resolve to the same dependency set:
- docker: community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b
- singularity: https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…irectly
Drops the hand-ported filter logic in favour of calling rpbp.ribo_utils.utils.get_periodic_lengths_and_offsets directly. The prior reimplementation had three divergences from upstream flagged in review (all only material under non-default thresholds):
1. The hard `bf_mean` filter was applied unconditionally; upstream only applies it when `max_bf_var` is set or when both `var`/`likelihood` are None.
2. Boundary operators used >=/<= where upstream uses strict >/<.
3. Only slots 3 and 4 (`max_bf_var`, `min_bf_lik`) honoured "None"; the first two would have crashed with TypeError if "None" was passed.
The module now stages the input `periodic_offsets` CSV at the path rpbp.ribo_utils.filenames constructs, then calls the upstream filter function with a minimal config. Same `ext.args2` interface; any of the four tokens can be "None".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
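The `ext.args2` contract described in this commit (four space-separated tokens, any of which may be the literal `None`) can be sketched roughly as follows. This is an illustration only, not the module's template: `parse_thresholds`, `Thresholds` and the field names are hypothetical labels for the four slots, and the example string is the test-config value quoted in an earlier commit.

```python
from typing import NamedTuple, Optional


class Thresholds(NamedTuple):
    # Hypothetical field names for the four token slots, in order.
    min_count: Optional[float]
    min_bf_mean: Optional[float]
    max_bf_var: Optional[float]
    min_bf_likelihood: Optional[float]


def parse_thresholds(args: str) -> Thresholds:
    """Convert four space-separated tokens to floats, mapping "None" to None."""
    tokens = args.split()
    if len(tokens) != 4:
        raise ValueError(f"expected 4 tokens, got {len(tokens)}: {args!r}")
    return Thresholds(*(None if tok == "None" else float(tok) for tok in tokens))


print(parse_thresholds("10 1 None 0.0"))
```

Treating every slot the same way avoids the earlier asymmetry where only slots 3 and 4 accepted "None".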
- `dict(...)` keyword constructor instead of `{ "key": val, ... }`
- `pd.DataFrame.to_csv` instead of a manual TSV write loop
- Drop the two intermediate `lengths.txt`/`offsets.txt` files; extract the
bash-shell values directly from the just-written TSV with `tail | cut | tr`.
Heredoc body goes from 17 lines to 11; same behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two interleaved changes:
1. Split the periodic-length filter (`get_periodic_lengths_and_offsets`) out of `rpbp/extractorfprofiles` into its own `rpbp/getperiodiclengthsoffsets` module. `extractorfprofiles` now takes a `lengths_offsets` TSV input rather than computing the filter inline. The threshold args move from `ext.args2` on `extractorfprofiles` to `ext.args` on `getperiodiclengthsoffsets`. `fasta_gtf_bam_rpbp` chains the new module between `selectperiodicoffsets` and `extractorfprofiles`.
2. Module-level tests now fetch their single immediate-upstream input from `nf-core/test-datasets:modules` (under `data/genomics/homo_sapiens/riboseq_expression/rpbp/`, added in nf-core/test-datasets#2064) instead of chaining six upstream stages in setup. Each module test runs in well under a minute. The subworkflow integration test still runs the full chain end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fallback
Both Bayes-factor modules now accept Stan model lists as proper `tuple val(metaN), path(...)` inputs. The model lookup that previously ran unconditionally inside the shell block is now only triggered when the caller passes empty lists - all branching resolved in Groovy before the `"""`.
- `rpbp/estimatemetagenebayesfactors` gains `tuple val(meta2), path(periodic_models, stageAs: ...)` and `tuple val(meta3), path(nonperiodic_models, stageAs: ...)`.
- `rpbp/estimateorfbayesfactors` gains `tuple val(meta2), path(translated_models, ...)` and `tuple val(meta3), path(untranslated_models, ...)`.
- Pass `[[], []]` to fall back to rpbp's bundled `.stan` files inside the container; pass populated channels to override.
- `fasta_gtf_bam_rpbp` subworkflow wires empty placeholders by default.
- Pattern matches `ensemblvep/vep` (cache + fasta optional inputs with in-container default fallback).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural changes coupled into one commit:
1. All meta-less `path foo` inputs become `tuple val(metaN), path(foo)`, per nf-core convention that every file input carries a meta. Affects `transcript_bed` (extractmetageneprofiles), `orfs_genomic_bed` + `exons_bed` (extractorfprofiles), `orfs_genomic_bed` (estimateorfbayesfactors), `genome_fasta` (selectfinalpredictionset). Subworkflow updated to attach `[ id: 'reference' ]` metas.
2. The two Python heredocs (`rpbp/preparegenome` and `rpbp/getperiodiclengthsoffsets`) become `templates/*.py` files. The surrounding bash setup (mkdir + chrName.txt generation for preparegenome; mkdir + cp for getperiodiclengthsoffsets) folded into the Python via `os.makedirs` + `shutil.copy`. Script blocks end in `template '<name>.py'` with no remaining bash.
meta.yml updates document the new tuple inputs. Tests pass meta+file tuples instead of bare paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
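As a rough picture of what folding `mkdir` + `cp` into Python via `os.makedirs` + `shutil.copy` looks like, here is a minimal sketch. The helper name `stage_file` and the paths in the usage note are hypothetical; the real templates build their destination paths from rpbp's own filename helpers, which are not reproduced here.

```python
import os
import shutil


def stage_file(src: str, dest: str) -> str:
    """Copy src to dest, creating dest's parent directories first (mkdir -p + cp)."""
    parent = os.path.dirname(dest)
    if parent:
        # exist_ok=True mirrors `mkdir -p`: no error if the directory already exists.
        os.makedirs(parent, exist_ok=True)
    shutil.copy(src, dest)
    return dest
```

A call such as `stage_file("periodic_offsets.csv", "staged/sample/periodic_offsets.csv")` (illustrative paths) would replace the two bash lines that previously preceded the heredoc.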
Replaces the `need_bundled` / `bundled_setup` Groovy plumbing with a cleaner layer-separated pattern:
- Groovy stringifies user-supplied paths (or emits empty) - that's it.
- Bash handles all flow control: assigns from the Groovy-interpolated string, then runs the rpbp-bundled-models lookup only if either variable is empty.
No Groovy string contains a `$bash_var` literal awaiting expansion - which was a source-of-truth muddle in the previous version. Same behaviour, fewer moving parts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Move `pandas` / `rpbp` imports to the top of
`get_periodic_lengths_and_offsets.py` (E402: imports were after a
helper function definition).
- Add `# noqa: F821` on the `${task.process}` interpolation lines in
both templates. `task` is bound by Nextflow at template-resolution
time; ruff can't see it.
The `${task.process}` is Nextflow template interpolation, not Python.
Ruff treats the f-string as Python and complains about undefined `task`.
Removing the `f` prefix on the lines that only have Nextflow interp
(while keeping it on lines that actually interpolate Python expressions
like `{platform.python_version()}`) makes ruff see a plain string and
the F821 falls away naturally. Matches the existing precedent in
`variantextractor`, `cellranger/multi`, `sigprofiler` templates.
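The difference between the two kinds of line can be seen in a small sketch. The variable names and the exact line contents are illustrative, not copied from the templates. In plain Python, `f"${task.process}"` would evaluate `task.process` as an expression, which is why ruff reports F821; without the `f` prefix the text is inert until Nextflow substitutes it.

```python
import platform

# Plain string: Python never parses the ${...}, so a linter has nothing to flag.
# In a real template, Nextflow replaces this placeholder before Python runs.
process_line = '"${task.process}":'

# Genuine f-string: the braces hold a real Python expression.
python_line = f"    python: {platform.python_version()}"

print(process_line)
print(python_line)
```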
Replaces the three-line `f.write` block with a single `yaml.safe_dump` call. pyyaml is already a transitive dep of rpbp (used by prepare_rpbp_genome for config loading). Matches the nf-core/anndata/getsize and custom/orfcountmatrix precedents.
- Reorder imports (alphabetic third-party group; ruff considers `rpbp` and `yaml` peers of `pandas`).
- Remove `=`-alignment spaces.
- Multi-line `to_csv` call with one arg per line.
No semantic change.
… rpbp
Rewrite top-level descriptions on all 8 rpbp module meta.yml files plus the fasta_gtf_bam_rpbp subworkflow so a reader who knows generic NGS / Ribo-seq vocabulary but not Rp-Bp internals can understand what each step does: metagene profile construction, per-(length, offset) Bayesian periodicity scoring, P-site offset selection and filtering, per-ORF P-site count vectors, per-ORF translated-vs-untranslated Bayes factor scoring, and the final filtered prediction set with BED/DNA/protein outputs.
Citation audit: corrected the Rp-Bp DOI from 10.1093/nar/gkw1141 (unrelated E. coli 5'-UTR paper) to 10.1093/nar/gkw1350 (Malone et al. 2017, Nucleic Acids Research, "Bayesian prediction of RNA translation from ribosome profiling") across all 8 module meta.yml files. Homepage, documentation and tool_dev_url verified to resolve.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The optional Stan-model inputs on estimatemetagenebayesfactors and
estimateorfbayesfactors were non-functional for user-supplied files:
rpbp resolves models via CmdStanModel(exe_file=Path(pm).with_suffix(""))
and requires the pre-compiled binary to live next to the .stan file, so
staged user-supplied .stan files cannot be picked up. The umbrella
subworkflow always passed empties anyway. Both modules now always
resolve the bundled rpbp Python-package models; the remaining input
metas are renumbered to be sequential.
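The path arithmetic behind this limitation can be shown with `pathlib` alone. This is a sketch under assumptions: `expected_executable` is a hypothetical helper that only mirrors the `Path(pm).with_suffix("")` expression quoted above, and the staged path is invented for illustration.

```python
from pathlib import Path


def expected_executable(stan_file: str) -> Path:
    """Where the compiled model is looked up: the .stan path minus its suffix."""
    return Path(stan_file).with_suffix("")


# Hypothetical location of a user-supplied model staged into a task work directory.
staged = "work/ab/cdef/custom-model.stan"
exe = expected_executable(staged)

print(exe.as_posix())
# Only the .stan source is staged, so the sibling binary is normally absent.
print(exe.exists())
```

Because the lookup is anchored to the `.stan` file's own directory, staging the source without its pre-compiled sibling leaves nothing to run.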
Also adds an empty-lengths guard to getperiodiclengthsoffsets so the
template fails loudly with a clear message when no read lengths pass
the periodicity filters, rather than silently emitting an empty TSV
that propagates to all-zero ORF output. Adds a note to preparegenome's
meta.yml flagging that it emits the *.annotated.bed.gz filenames
produced by get_orfs rather than the *.bed.gz-renamed forms produced
by the upstream prepare-rpbp-genome umbrella script; downstream
consumers in this module set reference those names explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
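The empty-lengths guard mentioned above amounts to a check of roughly this shape. It is a minimal sketch, not the template's code: the function name `require_lengths` and the message wording are assumptions.

```python
def require_lengths(lengths, offsets):
    """Abort with a clear message if the periodicity filter kept no read lengths."""
    if not lengths:
        # SystemExit with a string prints it to stderr and exits non-zero,
        # so the task fails instead of writing an empty TSV.
        raise SystemExit(
            "ERROR: no read lengths passed the periodicity filters; "
            "consider relaxing the thresholds."
        )
    if len(lengths) != len(offsets):
        raise SystemExit("ERROR: lengths and offsets differ in size.")


require_lengths([29, 30], [12, 12])  # passes silently
```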
The local snapshots were captured under Nextflow 26.04.1, which serialises `path "versions.yml" ... topic: versions` outputs into the same flattened tuple form the eval-style emits produce. CI runs Nextflow 25.10.2, which keeps them as `versions.yml:md5,...` (the conventional path-emit form). Regenerate the two affected snapshots so they match the CI output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords added a commit to pinin4fjords/riboseq that referenced this pull request on May 21, 2026
Swap modules/local/gedi/{indexgenome,price} for the nf-core/modules
versions (git_sha 51af5cb84874647aa4733742b86142025705b042). The
upstream modules differ only by convention: configurable ext.prefix
on the indexgenome output directory and an additive versions_gedi
emit alongside the versions topic. Setting ext.prefix='price_index'
on GEDI_INDEXGENOME preserves the existing published path and the
${index}/reference.oml contract consumed by GEDI_PRICE, so behaviour
is unchanged.
ribocode/* and ribotish/* were already moved to modules/nf-core/
via earlier `nf-core modules install` work, so this commit only
handles gedi.
Follow-up housekeeping once these PRs merge:
- rpbp/* via nf-core/modules#11695
- concat_gtf and filter_gtf_class_code via nf-core/modules#11729
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 8 `rpbp/*` modules and the `fasta_gtf_bam_rpbp` subworkflow that wrap the Rp-Bp ribosome-profiling ORF predictor (Malone et al. 2017, doi:10.1093/nar/gkw1350).

What Rp-Bp does
Ribosome profiling (Ribo-seq) sequences the short mRNA fragments protected by translating ribosomes. Rp-Bp uses those reads to score, for every candidate open reading frame (ORF) in the annotation, whether it shows the 3-nucleotide periodicity that marks active translation. The output is a per-sample BED of predicted translated ORFs plus matching DNA / protein FASTAs.
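To make the 3-nucleotide periodicity idea concrete, here is a toy sketch that sums P-site counts by reading frame. It is not Rp-Bp's method (Rp-Bp fits Bayesian models and reports Bayes factors); the function name and the count vectors are invented for illustration.

```python
def frame_fractions(psite_counts):
    """Fraction of P-site counts falling in each reading frame (0, 1, 2)."""
    totals = [0, 0, 0]
    for position, count in enumerate(psite_counts):
        totals[position % 3] += count
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0, 0.0, 0.0]
    return [t / grand_total for t in totals]


# Toy ORF: most P-sites sit on the first nucleotide of each codon.
translated_like = [8, 1, 1, 9, 0, 1, 7, 2, 1]
# Toy ORF: counts spread evenly, no frame preference.
flat = [3, 3, 3, 3, 3, 3, 3, 3, 3]

print(frame_fractions(translated_like))
print(frame_fractions(flat))
```

A strong skew towards one frame is the kind of signal that marks active translation; a flat distribution is what background coverage tends to look like.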
Why not call the umbrella CLI
Rp-Bp ships two top-level scripts (`prepare-rpbp-genome`, `predict-translated-orfs`) that chain the internal steps together. Both also do work the modules don't need:
- `prepare-rpbp-genome` builds a bowtie2 rRNA-filter index and a STAR alignment index alongside the BED prep. These modules consume pre-aligned BAMs supplied by the caller, so those indexes are unnecessary and would force `STAR` and `bowtie2` into the module container.
- `predict-translated-orfs` internally re-runs flexbar + bowtie + STAR on raw FASTQs, duplicating any upstream alignment the caller already produced, before getting to the actual ORF prediction logic.
Splitting the underlying steps into 8 single-purpose modules:
- `-resume` caching.

Modules
Reference prep (run once per genome):
- `rpbp/preparegenome` - enumerates candidate ORFs from the GTF and writes the BEDs the per-sample steps consume.
Per-sample chain:
- `rpbp/extractmetageneprofiles` - per-read-length read coverage around annotated start codons (a "metagene profile").
- `rpbp/estimatemetagenebayesfactors` - Bayes-factor score per read length for "this length shows 3-nt periodicity vs not".
- `rpbp/selectperiodicoffsets` - chooses one P-site offset per read length (the canonical ribosome-position adjustment).
- `rpbp/getperiodiclengthsoffsets` - threshold-filters down to the read lengths that will drive ORF scoring. Thresholds are configurable via `ext.args` (4 space-separated tokens: min count, min BF mean, max BF var, min BF likelihood; defaults mirror `rpbp.defaults.metagene_options`).
- `rpbp/extractorfprofiles` - builds per-ORF P-site count vectors using the surviving read lengths and offsets.
- `rpbp/estimateorfbayesfactors` - Bayes-factor score per ORF for "translated vs untranslated", via Stan models bundled in the Rp-Bp Python package.
- `rpbp/selectfinalpredictionset` - applies score / length / overlap rules and emits the final predicted-ORF BED + DNA / protein FASTAs.

Subworkflow
`fasta_gtf_bam_rpbp` wires the modules together: `rpbp/preparegenome` once per genome, then the 7-step per-sample chain for each input BAM. Designed to be consumed by any caller that has pre-aligned Ribo-seq BAMs plus the matching genome FASTA / annotation GTF.

Container
All `rpbp/*` modules share a Wave-built container built directly from each module's `environment.yml` (`bioconda::rpbp=4.0.1`):
- docker: `community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b`
- singularity: `https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data`
Versions emitted via `topic: versions`.

Test plan
- Real + stub tests pass for each module and the subworkflow under `nf-core modules test --profile docker` / `nf-core subworkflows test --profile docker`.
- Module-level tests consume per-stage static fixtures from `nf-core/test-datasets:modules` under `data/genomics/homo_sapiens/riboseq_expression/rpbp/` (added in nf-core/test-datasets#2064) - one immediate-upstream artefact per module rather than chaining six upstream stages in setup. Each module test runs in well under a minute (the per-ORF Bayes-factor fit is the only multi-minute step). The full chain still runs end-to-end in the subworkflow test.