feat(rpbp): add 8 rpbp modules + fasta_gtf_bam_rpbp subworkflow by pinin4fjords · Pull Request #11695 · nf-core/modules

pinin4fjords · 2026-05-19T12:29:15Z

Adds 8 rpbp/* modules and the fasta_gtf_bam_rpbp subworkflow that wrap the Rp-Bp ribosome-profiling ORF predictor (Malone et al. 2017, doi:10.1093/nar/gkw1350).

What Rp-Bp does

Ribosome profiling (Ribo-seq) sequences the short mRNA fragments protected by translating ribosomes. Rp-Bp uses those reads to score, for every candidate open reading frame (ORF) in the annotation, whether it shows the 3-nucleotide periodicity that marks active translation. The output is a per-sample BED of predicted translated ORFs plus matching DNA / protein FASTAs.
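To make the periodicity signal concrete, here is a minimal sketch that counts P-site positions by reading frame within one ORF. It is not Rp-Bp's method (Rp-Bp scores periodicity with Bayesian models). The function name `frame_fractions` and the toy coordinates are hypothetical.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start):
    """Return the fraction of P-sites in frames 0, 1 and 2 of an ORF.

    psite_positions: iterable of 0-based P-site coordinates (toy input).
    orf_start: 0-based coordinate of the first nucleotide of the start codon.
    """
    counts = Counter(
        (pos - orf_start) % 3 for pos in psite_positions if pos >= orf_start
    )
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(counts.get(frame, 0) / total for frame in range(3))


# Toy example: most P-sites sit on the first nucleotide of a codon (frame 0).
positions = [100, 103, 106, 109, 110, 112, 115, 118]
fractions = frame_fractions(positions, orf_start=100)
print(fractions)
```

A strongly translated ORF would be expected to show one dominant frame, whereas background reads spread roughly evenly across the three.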

Why not call the umbrella CLI

Rp-Bp ships two top-level scripts (prepare-rpbp-genome, predict-translated-orfs) that chain the internal steps together. Both also do work the modules don't need:

- prepare-rpbp-genome builds a bowtie2 rRNA-filter index and a STAR alignment index alongside the BED prep. These modules consume pre-aligned BAMs supplied by the caller, so those indexes are unnecessary and would force STAR and bowtie2 into the module container.
- predict-translated-orfs internally re-runs flexbar + bowtie2 + STAR on raw FASTQs, duplicating any upstream alignment the caller already produced, before getting to the actual ORF prediction logic.

Splitting the underlying steps into 8 single-purpose modules:

- skips the alignment/index work,
- lets the modules ship a smaller Rp-Bp-only container,
- gives each step independent -resume caching.

Modules

Reference prep (run once per genome):

- rpbp/preparegenome: enumerates candidate ORFs from the GTF and writes the BEDs the per-sample steps consume.

Per-sample chain:

- rpbp/extractmetageneprofiles: per-read-length read coverage around annotated start codons (a "metagene profile").
- rpbp/estimatemetagenebayesfactors: Bayes-factor score per read length for "this length shows 3-nt periodicity vs not".
- rpbp/selectperiodicoffsets: chooses one P-site offset per read length (the canonical ribosome-position adjustment).
- rpbp/getperiodiclengthsoffsets: threshold-filters down to the read lengths that will drive ORF scoring. Thresholds are configurable via ext.args (4 space-separated tokens: min count, min BF mean, max BF var, min BF likelihood; defaults mirror rpbp.defaults.metagene_options).
- rpbp/extractorfprofiles: builds per-ORF P-site count vectors using the surviving read lengths and offsets.
- rpbp/estimateorfbayesfactors: Bayes-factor score per ORF for "translated vs untranslated", via Stan models bundled in the Rp-Bp Python package.
- rpbp/selectfinalpredictionset: applies score / length / overlap rules and emits the final predicted-ORF BED + DNA / protein FASTAs.
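The four threshold tokens for rpbp/getperiodiclengthsoffsets can be read as in the sketch below. This is an illustrative parser, not the module's code: the function name and dictionary keys are hypothetical, and treating the literal token "None" as "no threshold" is an assumption based on the commit history of this PR.

```python
def parse_threshold_tokens(args: str) -> dict:
    """Split an ext.args-style string into four named thresholds.

    Token order follows the PR description: min count, min BF mean,
    max BF var, min BF likelihood. The key names are hypothetical.
    """
    names = ("min_count", "min_bf_mean", "max_bf_var", "min_bf_likelihood")
    tokens = args.split()
    if len(tokens) != len(names):
        raise ValueError(
            f"expected {len(names)} space-separated tokens, got {len(tokens)}"
        )
    # Assumption: the literal string "None" disables a threshold.
    return {
        name: None if token == "None" else float(token)
        for name, token in zip(names, tokens)
    }


# Example value taken from the test configuration mentioned in the commit log.
thresholds = parse_threshold_tokens("10 1 None 0.0")
print(thresholds)
```

Rejecting a wrong token count up front keeps a malformed ext.args string from silently shifting values into the wrong slot.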

Subworkflow

fasta_gtf_bam_rpbp wires the modules together: rpbp/preparegenome once per genome, the 7-step per-sample chain for each input BAM. Designed to be consumed by any caller that has pre-aligned Ribo-seq BAMs plus the matching genome FASTA / annotation GTF.

Container

All rpbp/* modules share a Wave-built container built directly from each module's environment.yml (bioconda::rpbp=4.0.1):

docker: community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b
singularity: https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data

Versions emitted via topic: versions.

Test plan

Real + stub tests pass for each module and the subworkflow under nf-core modules test --profile docker / nf-core subworkflows test --profile docker.

Module-level tests consume per-stage static fixtures from nf-core/test-datasets:modules under data/genomics/homo_sapiens/riboseq_expression/rpbp/ (added in nf-core/test-datasets#2064): one immediate-upstream artefact per module rather than chaining six upstream stages in setup. Most module tests run in well under a minute; the per-ORF Bayes-factor fit is the only multi-minute step. The full chain still runs end-to-end in the subworkflow test.

…_rpbp subworkflows

Adds the Rp-Bp Ribo-seq ORF caller (Malone et al. 2017, doi:10.1093/nar/gkw1141) as 8 split modules plus 2 orchestration subworkflows ported from nf-core/riboseq#174. Splitting per-tool (rather than wrapping rpbp's `predict-translated-orfs` umbrella command) gives independent caching on resume and lets the pipeline's own STAR alignment run instead of being re-done inside rpbp.

Modules:
- rpbp/buildconfig: render the Rp-Bp YAML config from pipeline-supplied fasta+gtf.
- rpbp/preparegenome: build the Rp-Bp genome index (STAR index, ribosomal index, ORF BEDs).
- rpbp/extractmetageneprofiles: per-read-length metagene profiles around starts.
- rpbp/estimatemetagenebayesfactors: periodicity Bayes factors per read length.
- rpbp/selectperiodicoffsets: pick a single P-site offset per high-quality length.
- rpbp/extractorfprofiles: per-ORF P-site profile matrix using selected offsets.
- rpbp/estimateorfbayesfactors: Bayesian translated-vs-untranslated model.
- rpbp/selectfinalpredictionset: filter to the final predicted-ORF BED + FASTAs.

Subworkflows:
- bam_rpbp_predictorfs: per-sample 6-step chain on a Ribo-seq BAM with the cohort-shared annotation outputs supplied by the caller.
- fasta_gtf_bam_rpbp: top-level end-to-end run; renders config, prepares the index once, then runs the per-sample chain.

Containers: Wave-built `community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb` (co-installs `bioconda::rpbp=4.0.1` + `bioconda::star=2.7.11b`) for all rpbp tools; `quay.io/biocontainers/coreutils:9.5` for rpbp/buildconfig (config templating only).

Versions are emitted via the topic-channel pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…across the board

Bypass `prepare-rpbp-genome` umbrella; call rpbp's internal `get_orfs` function directly via a small inline Python wrapper. Skips the umbrella script's bowtie2-build (rRNA index) and STAR genome-generate steps entirely - none of the downstream Rp-Bp tools we wrap consume those indices, and upstream alignment is supplied as the BAM.

rpbp/buildconfig removed - the config dict is now built in-memory inside rpbp/preparegenome from the input fasta + gtf paths. fasta_gtf_bam_rpbp loses the corresponding setup step.

extractorfprofiles replicates rpbp's metagene-length filter (`ribo_utils.utils.get_periodic_lengths_and_offsets`) inline before calling extract-orf-profiles, so the upstream select-periodic-offsets output can drive --lengths/--offsets without going through rpbp's filename-driven config plumbing. Filter thresholds are exposed via ext.args2 (4 space-separated tokens, defaults match rpbp.defaults.metagene_options).

All 7 inner-module tests now run the upstream chain on real chr20 data (no stub-only tests). Subworkflow tests cover bam_rpbp_predictorfs and fasta_gtf_bam_rpbp end-to-end. The 3 inner modules that need the lower threshold (extractorfprofiles / estimateorfbayesfactors / selectfinalpredictionset) carry tests/nextflow.config setting ext.args2 = '10 1 None 0.0' so the chr20 BAM produces non-empty profiles.

Wave container unchanged (community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The per-sample 6-step chain was only ever wrapped by fasta_gtf_bam_rpbp; keeping it as a standalone subworkflow added a thin layer no realistic caller needs (rpbp's BED outputs only come from preparegenome). Inlined into fasta_gtf_bam_rpbp, which is now one subworkflow with 7 process invocations end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…m/nf-core/modules into rpbp-add-modules-and-subworkflows

…nnel: column

Cleans up the file:
- Drops the multi-line header narrative recapping design choices and step ordering.
- Drops the numbered "1. ... 2. ... 3. ..." inline narrative blocks.
- Aligns the `// channel:` annotations in the emit block (12 of 13 lines were one column off the longest emit's reference position).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- M1/M2: meta.yml input pattern docs for `orfs_genomic_bed` and `orfs_exons_bed` now match preparegenome's actual emit suffix (`*.orfs-{genomic,exons}.annotated.bed.gz`). Doc-only drift before.
- M3: drop the hardcoded `--select-longest-by-stop --select-best-overlapping` from `rpbp/selectfinalpredictionset/main.nf`; consistent with every other rpbp module's `--num-cpus + \${args}` shape. Defaults moved to the module's `tests/nextflow.config` and the subworkflow test config via `ext.args`, so consuming pipelines configure them in `modules.config`.
- L1: drop `process_long` from `rpbp/estimateorfbayesfactors` so the `process_high + process_long` lint clash goes away; pipelines that need the extended walltime can set it via `modules.config` (`time = ...`).
- L2: narrow the `ext.args2` docstring in `rpbp/extractorfprofiles` to match the actual filter behaviour - only the 3rd (max_bf_var) and 4th (min_bf_likelihood) tokens accept "None"; the 1st and 2nd must be numeric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ent.yml

The previous `rpbp_star:247a8ae84a6babfb` container co-installed `star=2.7.11b` for users of `prepare-rpbp-genome` / `run-rpbp-pipeline` (which call STAR internally). The modules in this PR bypass those umbrellas entirely (BAMs are supplied from external alignment), so the STAR install was dead weight.

New container is built directly from each module's environment.yml (`bioconda::rpbp=4.0.1`), so the conda and docker/singularity codepaths now resolve to the same dependency set:
- docker: community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b
- singularity: https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…irectly

Drops the hand-ported filter logic in favour of calling rpbp.ribo_utils.utils.get_periodic_lengths_and_offsets directly. The prior reimplementation had three divergences from upstream flagged in review (all only material under non-default thresholds):

1. The hard `bf_mean` filter was applied unconditionally; upstream only applies it when `max_bf_var` is set or when both `var`/`likelihood` are None.
2. Boundary operators used >=/<= where upstream uses strict >/<.
3. Only slots 3 and 4 (`max_bf_var`, `min_bf_lik`) honoured "None"; the first two would have crashed with TypeError if "None" was passed.

The module now stages the input `periodic_offsets` CSV at the path rpbp.ribo_utils.filenames constructs, then calls the upstream filter function with a minimal config. Same `ext.args2` interface; any of the four tokens can be "None".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- `dict(...)` keyword constructor instead of `{ "key": val, ... }`
- `pd.DataFrame.to_csv` instead of a manual TSV write loop
- Drop the two intermediate `lengths.txt`/`offsets.txt` files; extract the bash-shell values directly from the just-written TSV with `tail | cut | tr`.

Heredoc body goes from 17 lines to 11; same behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two interleaved changes:

1. Split the periodic-length filter (`get_periodic_lengths_and_offsets`) out of `rpbp/extractorfprofiles` into its own `rpbp/getperiodiclengthsoffsets` module. `extractorfprofiles` now takes a `lengths_offsets` TSV input rather than computing the filter inline. The threshold args move from `ext.args2` on `extractorfprofiles` to `ext.args` on `getperiodiclengthsoffsets`. `fasta_gtf_bam_rpbp` chains the new module between `selectperiodicoffsets` and `extractorfprofiles`.
2. Module-level tests now fetch their single immediate-upstream input from `nf-core/test-datasets:modules` (under `data/genomics/homo_sapiens/riboseq_expression/rpbp/`, added in nf-core/test-datasets#2064) instead of chaining six upstream stages in setup. Each module test runs in well under a minute. The subworkflow integration test still runs the full chain end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fallback

Both Bayes-factor modules now accept Stan model lists as proper `tuple val(metaN), path(...)` inputs. The model lookup that previously ran unconditionally inside the shell block is now only triggered when the caller passes empty lists - all branching resolved in Groovy before the `"""`.

- `rpbp/estimatemetagenebayesfactors` gains `tuple val(meta2), path(periodic_models, stageAs: ...)` and `tuple val(meta3), path(nonperiodic_models, stageAs: ...)`.
- `rpbp/estimateorfbayesfactors` gains `tuple val(meta2), path(translated_models, ...)` and `tuple val(meta3), path(untranslated_models, ...)`.
- Pass `[[], []]` to fall back to rpbp's bundled `.stan` files inside the container; pass populated channels to override.
- `fasta_gtf_bam_rpbp` subworkflow wires empty placeholders by default.
- Pattern matches `ensemblvep/vep` (cache + fasta optional inputs with in-container default fallback).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two structural changes coupled into one commit:

1. All meta-less `path foo` inputs become `tuple val(metaN), path(foo)`, per nf-core convention that every file input carries a meta. Affects `transcript_bed` (extractmetageneprofiles), `orfs_genomic_bed` + `exons_bed` (extractorfprofiles), `orfs_genomic_bed` (estimateorfbayesfactors), `genome_fasta` (selectfinalpredictionset). Subworkflow updated to attach `[ id: 'reference' ]` metas.
2. The two Python heredocs (`rpbp/preparegenome` and `rpbp/getperiodiclengthsoffsets`) become `templates/*.py` files. The surrounding bash setup (mkdir + chrName.txt generation for preparegenome; mkdir + cp for getperiodiclengthsoffsets) folded into the Python via `os.makedirs` + `shutil.copy`. Script blocks end in `template '<name>.py'` with no remaining bash.

meta.yml updates document the new tuple inputs. Tests pass meta+file tuples instead of bare paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
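The second change folds shell staging into Python. Here is a minimal sketch of that pattern, assuming hypothetical paths and a hypothetical helper name `stage_file`. It does not reproduce the actual templates.

```python
import os
import shutil
import tempfile


def stage_file(source_path: str, target_dir: str) -> str:
    """Copy source_path into target_dir, creating the directory first.

    Equivalent in spirit to `mkdir -p target_dir && cp source_path target_dir/`.
    """
    os.makedirs(target_dir, exist_ok=True)  # no error if it already exists
    return shutil.copy(source_path, target_dir)  # path of the new copy


# Toy usage in a throwaway directory; the file and directory names are illustrative.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "periodic-offsets.csv.gz")
with open(source, "w") as handle:
    handle.write("placeholder\n")

staged = stage_file(source, os.path.join(workdir, "metagene-profiles"))
print(staged)
```

Keeping the staging in the same Python file as the rpbp call means the script block needs no bash around the `template` directive.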

Replaces the `need_bundled` / `bundled_setup` Groovy plumbing with a cleaner layer-separated pattern:

- Groovy stringifies user-supplied paths (or emits empty) - that's it.
- Bash handles all flow control: assigns from the Groovy-interpolated string, then runs the rpbp-bundled-models lookup only if either variable is empty.

No Groovy string contains a `$bash_var` literal awaiting expansion - which was a source-of-truth muddle in the previous version. Same behaviour, fewer moving parts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Move `pandas` / `rpbp` imports to the top of `get_periodic_lengths_and_offsets.py` (E402: imports were after a helper function definition).
- Add `# noqa: F821` on the `${task.process}` interpolation lines in both templates. `task` is bound by Nextflow at template-resolution time; ruff can't see it.

The `${task.process}` is Nextflow template interpolation, not Python. Ruff treats the f-string as Python and complains about undefined `task`. Removing the `f` prefix on the lines that only have Nextflow interp (while keeping it on lines that actually interpolate Python expressions like `{platform.python_version()}`) makes ruff see a plain string and the F821 falls away naturally. Matches the existing precedent in `variantextractor`, `cellranger/multi`, `sigprofiler` templates.
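The difference is easy to reproduce outside Nextflow, which is how a linter sees the file. The sketch below is illustrative only: the variable names and the `fstring_form_needs_task` helper are hypothetical, and the strings are not copied from the actual templates.

```python
import platform

# Plain string: Python leaves "${task.process}" untouched, so Nextflow can
# substitute it when it resolves the template.
process_line = '"${task.process}":'

# f-string only where a genuine Python expression is interpolated.
python_line = f"    python: {platform.python_version()}"


def fstring_form_needs_task() -> bool:
    """Return True if the f-string form tries to evaluate `task` in Python."""
    try:
        # Inside an f-string, "$" is literal text and {task.process} is an expression.
        eval('f"${task.process}"')
    except NameError:
        return True
    return False


print(process_line)
print(fstring_form_needs_task())
```

In a real run Nextflow replaces the placeholder before Python executes, so the f-string form would still work at runtime. The plain string simply avoids the static undefined-name report.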

Replaces the three-line `f.write` block with a single `yaml.safe_dump` call. pyyaml is already a transitive dep of rpbp (used by prepare_rpbp_genome for config loading). Matches the nf-core/anndata/getsize and custom/orfcountmatrix precedents.

- Reorder imports (alphabetic third-party group; ruff considers `rpbp` and `yaml` peers of `pandas`).
- Remove `=`-alignment spaces.
- Multi-line `to_csv` call with one arg per line.

No semantic change.

… rpbp

Rewrite top-level descriptions on all 8 rpbp module meta.yml files plus the fasta_gtf_bam_rpbp subworkflow so a reader who knows generic NGS / Ribo-seq vocabulary but not Rp-Bp internals can understand what each step does: metagene profile construction, per-(length, offset) Bayesian periodicity scoring, P-site offset selection and filtering, per-ORF P-site count vectors, per-ORF translated-vs-untranslated Bayes factor scoring, and the final filtered prediction set with BED/DNA/protein outputs.

Citation audit: corrected the Rp-Bp DOI from 10.1093/nar/gkw1141 (unrelated E. coli 5'-UTR paper) to 10.1093/nar/gkw1350 (Malone et al. 2017, Nucleic Acids Research, "Bayesian prediction of RNA translation from ribosome profiling") across all 8 module meta.yml files. Homepage, documentation and tool_dev_url verified to resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The optional Stan-model inputs on estimatemetagenebayesfactors and estimateorfbayesfactors were non-functional for user-supplied files: rpbp resolves models via CmdStanModel(exe_file=Path(pm).with_suffix("")) and requires the pre-compiled binary to live next to the .stan file, so staged user-supplied .stan files cannot be picked up. The umbrella subworkflow always passed empties anyway. Both modules now always resolve the bundled rpbp Python-package models; the remaining input metas are renumbered to be sequential.

Also adds an empty-lengths guard to getperiodiclengthsoffsets so the template fails loudly with a clear message when no read lengths pass the periodicity filters, rather than silently emitting an empty TSV that propagates to all-zero ORF output.

Adds a note to preparegenome's meta.yml flagging that it emits the *.annotated.bed.gz filenames produced by get_orfs rather than the *.bed.gz-renamed forms produced by the upstream prepare-rpbp-genome umbrella script; downstream consumers in this module set reference those names explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
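The path arithmetic behind the Stan-model limitation can be checked with the standard library alone. This sketch shows only what `Path(...).with_suffix("")` yields and whether a file exists at that location. It makes no cmdstanpy calls, and the helper names and model file name are hypothetical.

```python
import tempfile
from pathlib import Path


def expected_executable(stan_file: str) -> Path:
    """Path that `Path(stan_file).with_suffix("")` resolves to."""
    return Path(stan_file).with_suffix("")


def has_compiled_sibling(stan_file: str) -> bool:
    """True only if a suffix-stripped file exists next to the .stan file."""
    return expected_executable(stan_file).is_file()


# A staged, user-supplied model with no compiled binary beside it.
stage_dir = Path(tempfile.mkdtemp())
user_model = stage_dir / "my-periodic-model.stan"
user_model.write_text("// placeholder Stan source\n")

print(expected_executable(str(user_model)).name)
print(has_compiled_sibling(str(user_model)))
```

Because staging brings in only the `.stan` source, the suffix-stripped sibling is absent in the work directory, which is consistent with the decision to always use the bundled models.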

The local snapshots were captured under Nextflow 26.04.1, which serialises `path "versions.yml" ... topic: versions` outputs into the same flattened tuple form the eval-style emits produce. CI runs Nextflow 25.10.2, which keeps them as `versions.yml:md5,...` (the conventional path-emit form). Regenerate the two affected snapshots so they match the CI output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Swap modules/local/gedi/{indexgenome,price} for the nf-core/modules versions (git_sha 51af5cb84874647aa4733742b86142025705b042). The upstream modules differ only by convention: configurable ext.prefix on the indexgenome output directory and an additive versions_gedi emit alongside the versions topic. Setting ext.prefix='price_index' on GEDI_INDEXGENOME preserves the existing published path and the ${index}/reference.oml contract consumed by GEDI_PRICE, so behaviour is unchanged.

ribocode/* and ribotish/* were already moved to modules/nf-core/ via earlier `nf-core modules install` work, so this commit only handles gedi.

Follow-up housekeeping once these PRs merge:
- rpbp/* via nf-core/modules#11695
- concat_gtf and filter_gtf_class_code via nf-core/modules#11729

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot added the size/xl label May 19, 2026

pinin4fjords changed the title ~~feat(rpbp): add 8 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows~~ feat(rpbp): add 7 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows May 19, 2026

pinin4fjords and others added 3 commits May 19, 2026 17:57

Merge branch 'master' into rpbp-add-modules-and-subworkflows

cce7bc2

Merge branch 'rpbp-add-modules-and-subworkflows' of https://github.co…

60401cd

…m/nf-core/modules into rpbp-add-modules-and-subworkflows

pinin4fjords changed the title ~~feat(rpbp): add 7 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows~~ feat(rpbp): add 7 rpbp modules + fasta_gtf_bam_rpbp subworkflow May 19, 2026

pinin4fjords and others added 5 commits May 19, 2026 18:47

pinin4fjords mentioned this pull request May 20, 2026

Add rpbp test data: per-stage intermediates for nf-core/modules#11695 nf-core/test-datasets#2064

Draft

2 tasks

pinin4fjords changed the title ~~feat(rpbp): add 7 rpbp modules + fasta_gtf_bam_rpbp subworkflow~~ feat(rpbp): add 8 rpbp modules + fasta_gtf_bam_rpbp subworkflow May 20, 2026

pinin4fjords and others added 11 commits May 20, 2026 13:32

style(rpbp): apply ruff check --fix + ruff format to templates

5da69a3

Merge branch 'master' into rpbp-add-modules-and-subworkflows

3e7851e

pinin4fjords marked this pull request as ready for review May 20, 2026 19:15

pinin4fjords added the Ready for Review label May 21, 2026

pinin4fjords mentioned this pull request May 22, 2026

feat: Rp-Bp opt-in ORF caller nf-core/riboseq#186

Draft

2 tasks

pinin4fjords wants to merge 22 commits into master from rpbp-add-modules-and-subworkflows
