
feat(rpbp): add 8 rpbp modules + fasta_gtf_bam_rpbp subworkflow #11695

Open
pinin4fjords wants to merge 22 commits into master from rpbp-add-modules-and-subworkflows

Conversation

Member

@pinin4fjords pinin4fjords commented May 19, 2026

Adds 8 rpbp/* modules and the fasta_gtf_bam_rpbp subworkflow that wrap the Rp-Bp ribosome-profiling ORF predictor (Malone et al. 2017, doi:10.1093/nar/gkw1350).

What Rp-Bp does

Ribosome profiling (Ribo-seq) sequences the short mRNA fragments protected by translating ribosomes. Rp-Bp uses those reads to score, for every candidate open reading frame (ORF) in the annotation, whether it shows the 3-nucleotide periodicity that marks active translation. The output is a per-sample BED of predicted translated ORFs plus matching DNA / protein FASTAs.
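To make the 3-nucleotide periodicity idea concrete, here is a minimal sketch that tallies which codon frame each P-site falls into for one ORF. It is a toy illustration only: Rp-Bp scores periodicity with Bayesian models, not with this raw fraction, and the function name and coordinates below are hypothetical.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start):
    """Fraction of P-sites in each codon frame (0, 1, 2) relative to orf_start."""
    frames = Counter((pos - orf_start) % 3 for pos in psite_positions)
    total = sum(frames.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [frames[frame] / total for frame in range(3)]


# Hypothetical P-site coordinates for an ORF that starts at position 100.
periodic = [100, 103, 106, 109, 112, 113, 115, 118]
print(frame_fractions(periodic, orf_start=100))  # [0.875, 0.125, 0.0]
```

A translated ORF tends to concentrate its P-sites in a single frame, as in the toy data above; an untranslated region would be expected to spread them more evenly across the three frames.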

Why not call the umbrella CLI

Rp-Bp ships two top-level scripts (prepare-rpbp-genome, predict-translated-orfs) that chain the internal steps together. Both also do work the modules don't need:

  • prepare-rpbp-genome builds a bowtie2 rRNA-filter index and a STAR alignment index alongside the BED prep. These modules consume pre-aligned BAMs supplied by the caller, so those indexes are unnecessary and would force STAR and bowtie2 into the module container.
  • predict-translated-orfs internally re-runs flexbar + bowtie2 + STAR on raw FASTQs, duplicating any upstream alignment the caller already produced, before getting to the actual ORF prediction logic.

Splitting the underlying steps into 8 single-purpose modules:

  • skips the alignment/index work,
  • lets the modules ship a smaller Rp-Bp-only container,
  • gives each step independent -resume caching.

Modules

Reference prep (run once per genome):

  • rpbp/preparegenome - enumerates candidate ORFs from the GTF and writes the BEDs the per-sample steps consume.

Per-sample chain:

  • rpbp/extractmetageneprofiles - per-read-length read coverage around annotated start codons (a "metagene profile").
  • rpbp/estimatemetagenebayesfactors - Bayes-factor score per read length for "this length shows 3-nt periodicity vs not".
  • rpbp/selectperiodicoffsets - chooses one P-site offset per read length (the canonical ribosome-position adjustment).
  • rpbp/getperiodiclengthsoffsets - threshold-filters down to the read lengths that will drive ORF scoring. Thresholds are configurable via ext.args (4 space-separated tokens: min count, min BF mean, max BF var, min BF likelihood; defaults mirror rpbp.defaults.metagene_options).
  • rpbp/extractorfprofiles - builds per-ORF P-site count vectors using the surviving read lengths and offsets.
  • rpbp/estimateorfbayesfactors - Bayes-factor score per ORF for "translated vs untranslated", via Stan models bundled in the Rp-Bp Python package.
  • rpbp/selectfinalpredictionset - applies score / length / overlap rules and emits the final predicted-ORF BED + DNA / protein FASTAs.
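
The role of the selected read lengths and P-site offsets in the per-sample chain can be sketched as follows. This is a simplified illustration, assuming forward-strand, unspliced coordinates and a hypothetical `{length: offset}` mapping; it is not the code Rp-Bp runs, and the function name is invented for the example.

```python
def psite_counts(reads, offsets, orf_start, orf_length):
    """Build a per-position P-site count vector for one ORF.

    reads   : iterable of (five_prime_position, read_length) tuples
    offsets : dict mapping each surviving read length to its P-site offset
    """
    counts = [0] * orf_length
    for five_prime, length in reads:
        offset = offsets.get(length)
        if offset is None:
            continue  # this read length did not survive the periodicity filter
        index = five_prime + offset - orf_start
        if 0 <= index < orf_length:
            counts[index] += 1
    return counts


# Hypothetical offsets and reads; length 31 is absent from the mapping.
offsets = {28: 12, 29: 12}
reads = [(88, 28), (91, 29), (91, 28), (90, 31), (94, 29)]
print(psite_counts(reads, offsets, orf_start=100, orf_length=9))
# [1, 0, 0, 2, 0, 0, 1, 0, 0]
```

Reads whose length is missing from the mapping are skipped, which mirrors the point that only the read lengths surviving the filter drive ORF scoring.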

Subworkflow

fasta_gtf_bam_rpbp wires the modules together: rpbp/preparegenome runs once per genome, then the 7-step per-sample chain runs for each input BAM. It is designed to be consumed by any caller that has pre-aligned Ribo-seq BAMs plus the matching genome FASTA / annotation GTF.

Container

All rpbp/* modules share a Wave container built directly from each module's environment.yml (bioconda::rpbp=4.0.1):

  • docker: community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b
  • singularity: https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data

Versions are emitted via topic: versions.

Test plan

Real + stub tests pass for each module and the subworkflow under nf-core modules test --profile docker / nf-core subworkflows test --profile docker.

Module-level tests consume per-stage static fixtures from nf-core/test-datasets:modules under data/genomics/homo_sapiens/riboseq_expression/rpbp/ (added in nf-core/test-datasets#2064): one immediate-upstream artefact per module rather than chaining six upstream stages in setup. Each module test runs in well under a minute, except the per-ORF Bayes-factor fit, which is the only multi-minute step. The full chain still runs end-to-end in the subworkflow test.

…_rpbp subworkflows

Adds the Rp-Bp Ribo-seq ORF caller (Malone et al. 2017, doi:10.1093/nar/gkw1141) as
8 split modules plus 2 orchestration subworkflows ported from nf-core/riboseq#174.
Splitting per-tool (rather than wrapping rpbp's `predict-translated-orfs` umbrella
command) gives independent caching on resume and lets the pipeline's own STAR
alignment run instead of being re-done inside rpbp.

Modules:
- rpbp/buildconfig: render the Rp-Bp YAML config from pipeline-supplied fasta+gtf.
- rpbp/preparegenome: build the Rp-Bp genome index (STAR index, ribosomal index,
  ORF BEDs).
- rpbp/extractmetageneprofiles: per-read-length metagene profiles around starts.
- rpbp/estimatemetagenebayesfactors: periodicity Bayes factors per read length.
- rpbp/selectperiodicoffsets: pick a single P-site offset per high-quality length.
- rpbp/extractorfprofiles: per-ORF P-site profile matrix using selected offsets.
- rpbp/estimateorfbayesfactors: Bayesian translated-vs-untranslated model.
- rpbp/selectfinalpredictionset: filter to the final predicted-ORF BED + FASTAs.

Subworkflows:
- bam_rpbp_predictorfs: per-sample 6-step chain on a Ribo-seq BAM with the
  cohort-shared annotation outputs supplied by the caller.
- fasta_gtf_bam_rpbp: top-level end-to-end run; renders config, prepares the
  index once, then runs the per-sample chain.

Containers: Wave-built `community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb`
(co-installs `bioconda::rpbp=4.0.1` + `bioconda::star=2.7.11b`) for all rpbp tools;
`quay.io/biocontainers/coreutils:9.5` for rpbp/buildconfig (config templating only).
Versions are emitted via the topic-channel pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…across the board

Bypass `prepare-rpbp-genome` umbrella; call rpbp's internal `get_orfs`
function directly via a small inline Python wrapper. Skips the umbrella
script's bowtie2-build (rRNA index) and STAR genome-generate steps
entirely - none of the downstream Rp-Bp tools we wrap consume those
indices, and upstream alignment is supplied as the BAM.

rpbp/buildconfig removed - the config dict is now built in-memory inside
rpbp/preparegenome from the input fasta + gtf paths. fasta_gtf_bam_rpbp
loses the corresponding setup step.

extractorfprofiles replicates rpbp's metagene-length filter
(`ribo_utils.utils.get_periodic_lengths_and_offsets`) inline before
calling extract-orf-profiles, so the upstream select-periodic-offsets
output can drive --lengths/--offsets without going through rpbp's
filename-driven config plumbing. Filter thresholds are exposed via
ext.args2 (4 space-separated tokens, defaults match
rpbp.defaults.metagene_options).

All 7 inner-module tests now run the upstream chain on real chr20 data
(no stub-only tests). Subworkflow tests cover bam_rpbp_predictorfs and
fasta_gtf_bam_rpbp end-to-end. The 3 inner modules that need the lower
threshold (extractorfprofiles / estimateorfbayesfactors /
selectfinalpredictionset) carry tests/nextflow.config setting
ext.args2 = '10 1 None 0.0' so the chr20 BAM produces non-empty profiles.

Wave container unchanged
(community.wave.seqera.io/library/rpbp_star:247a8ae84a6babfb).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords pinin4fjords changed the title from "feat(rpbp): add 8 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows" to "feat(rpbp): add 7 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows" May 19, 2026
pinin4fjords and others added 3 commits May 19, 2026 17:57
The per-sample 6-step chain was only ever wrapped by fasta_gtf_bam_rpbp;
keeping it as a standalone subworkflow added a thin layer no realistic
caller needs (rpbp's BED outputs only come from preparegenome). Inlined
into fasta_gtf_bam_rpbp — now one subworkflow with 7 process invocations
end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords pinin4fjords changed the title from "feat(rpbp): add 7 rpbp modules + bam_rpbp_predictorfs / fasta_gtf_bam_rpbp subworkflows" to "feat(rpbp): add 7 rpbp modules + fasta_gtf_bam_rpbp subworkflow" May 19, 2026
pinin4fjords and others added 5 commits May 19, 2026 18:47
…nnel: column

Cleans up the file:
- Drops the multi-line header narrative recapping design choices and step ordering.
- Drops the numbered "1. ... 2. ... 3. ..." inline narrative blocks.
- Aligns the `// channel:` annotations in the emit block (12 of 13 lines
  were one column off the longest emit's reference position).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- M1/M2: meta.yml input pattern docs for `orfs_genomic_bed` and
  `orfs_exons_bed` now match preparegenome's actual emit suffix
  (`*.orfs-{genomic,exons}.annotated.bed.gz`). Doc-only drift before.
- M3: drop the hardcoded `--select-longest-by-stop --select-best-overlapping`
  from `rpbp/selectfinalpredictionset/main.nf`; consistent with every other
  rpbp module's `--num-cpus + \${args}` shape. Defaults moved to the
  module's `tests/nextflow.config` and the subworkflow test config via
  `ext.args`, so consuming pipelines configure them in `modules.config`.
- L1: drop `process_long` from `rpbp/estimateorfbayesfactors` so the
  `process_high + process_long` lint clash goes away; pipelines that need
  the extended walltime can set it via `modules.config` (`time = ...`).
- L2: narrow the `ext.args2` docstring in `rpbp/extractorfprofiles` to
  match the actual filter behaviour - only the 3rd (max_bf_var) and 4th
  (min_bf_likelihood) tokens accept "None"; the 1st and 2nd must be numeric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent.yml

The previous `rpbp_star:247a8ae84a6babfb` container co-installed
`star=2.7.11b` for users of `prepare-rpbp-genome` / `run-rpbp-pipeline`
(which call STAR internally). The modules in this PR bypass those
umbrellas entirely (BAMs are supplied from external alignment), so the
STAR install was dead weight.

New container is built directly from each module's environment.yml
(`bioconda::rpbp=4.0.1`), so the conda and docker/singularity codepaths
now resolve to the same dependency set:

- docker:      community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b
- singularity: https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…irectly

Drops the hand-ported filter logic in favour of calling
rpbp.ribo_utils.utils.get_periodic_lengths_and_offsets directly. The
prior reimplementation had three divergences from upstream flagged in
review (all only material under non-default thresholds):

1. The hard `bf_mean` filter was applied unconditionally; upstream only
   applies it when `max_bf_var` is set or when both `var`/`likelihood`
   are None.
2. Boundary operators used >=/<= where upstream uses strict >/<.
3. Only slots 3 and 4 (`max_bf_var`, `min_bf_lik`) honoured "None"; the
   first two would have crashed with TypeError if "None" was passed.

The module now stages the input `periodic_offsets` CSV at the path
rpbp.ribo_utils.filenames constructs, then calls the upstream filter
function with a minimal config. Same `ext.args2` interface; any of the
four tokens can be "None".
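
As an illustration of the `ext.args2` token handling described above, here is a minimal sketch of parsing four space-separated tokens where any of them may be the literal "None", plus a strict comparison helper. The dictionary keys and function names are hypothetical, and the sketch does not reproduce the conditions under which the upstream function applies each filter.

```python
def parse_thresholds(arg_string):
    """Parse 'min_count min_bf_mean max_bf_var min_bf_likelihood' tokens."""
    names = ("min_count", "min_bf_mean", "max_bf_var", "min_bf_likelihood")
    tokens = arg_string.split()
    if len(tokens) != len(names):
        raise ValueError(f"expected {len(names)} tokens, got {len(tokens)}")
    # The literal string "None" disables a threshold; anything else must be numeric.
    return {
        name: None if token == "None" else float(token)
        for name, token in zip(names, tokens)
    }


def exceeds(value, threshold):
    """Strict '>' comparison; a None threshold disables the check."""
    return threshold is None or value > threshold


thresholds = parse_thresholds("10 1 None 0.0")
print(thresholds)
print(exceeds(1.0, thresholds["min_bf_mean"]))  # False: the boundary value fails a strict test
```

The last line shows why the `>=` versus `>` divergence only matters at the boundary: a value exactly equal to the threshold passes the non-strict test but fails the strict one.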

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `dict(...)` keyword constructor instead of `{ "key": val, ... }`
- `pd.DataFrame.to_csv` instead of a manual TSV write loop
- Drop the two intermediate `lengths.txt`/`offsets.txt` files; extract the
  bash-shell values directly from the just-written TSV with `tail | cut | tr`.

Heredoc body goes from 17 lines to 11; same behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two interleaved changes:

1. Split the periodic-length filter (`get_periodic_lengths_and_offsets`)
   out of `rpbp/extractorfprofiles` into its own
   `rpbp/getperiodiclengthsoffsets` module. `extractorfprofiles` now
   takes a `lengths_offsets` TSV input rather than computing the filter
   inline. The threshold args move from `ext.args2` on
   `extractorfprofiles` to `ext.args` on `getperiodiclengthsoffsets`.
   `fasta_gtf_bam_rpbp` chains the new module between
   `selectperiodicoffsets` and `extractorfprofiles`.

2. Module-level tests now fetch their single immediate-upstream input
   from `nf-core/test-datasets:modules` (under
   `data/genomics/homo_sapiens/riboseq_expression/rpbp/`, added in
   nf-core/test-datasets#2064) instead of chaining six upstream stages
   in setup. Each module test runs in well under a minute. The
   subworkflow integration test still runs the full chain end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords pinin4fjords changed the title feat(rpbp): add 7 rpbp modules + fasta_gtf_bam_rpbp subworkflow feat(rpbp): add 8 rpbp modules + fasta_gtf_bam_rpbp subworkflow May 20, 2026
pinin4fjords and others added 11 commits May 20, 2026 13:32
…fallback

Both Bayes-factor modules now accept Stan model lists as proper
`tuple val(metaN), path(...)` inputs. The model lookup that previously
ran unconditionally inside the shell block is now only triggered when
the caller passes empty lists - all branching resolved in Groovy
before the `"""`.

- `rpbp/estimatemetagenebayesfactors` gains `tuple val(meta2), path(periodic_models, stageAs: ...)` and `tuple val(meta3), path(nonperiodic_models, stageAs: ...)`.
- `rpbp/estimateorfbayesfactors` gains `tuple val(meta2), path(translated_models, ...)` and `tuple val(meta3), path(untranslated_models, ...)`.
- Pass `[[], []]` to fall back to rpbp's bundled `.stan` files inside the container; pass populated channels to override.
- `fasta_gtf_bam_rpbp` subworkflow wires empty placeholders by default.
- Pattern matches `ensemblvep/vep` (cache + fasta optional inputs with in-container default fallback).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural changes coupled into one commit:

1. All meta-less `path foo` inputs become `tuple val(metaN), path(foo)`,
   per nf-core convention that every file input carries a meta. Affects
   `transcript_bed` (extractmetageneprofiles), `orfs_genomic_bed` +
   `exons_bed` (extractorfprofiles), `orfs_genomic_bed`
   (estimateorfbayesfactors), `genome_fasta` (selectfinalpredictionset).
   Subworkflow updated to attach `[ id: 'reference' ]` metas.

2. The two Python heredocs (`rpbp/preparegenome` and
   `rpbp/getperiodiclengthsoffsets`) become `templates/*.py` files.
   The surrounding bash setup (mkdir + chrName.txt generation for
   preparegenome; mkdir + cp for getperiodiclengthsoffsets) folded
   into the Python via `os.makedirs` + `shutil.copy`. Script blocks
   end in `template '<name>.py'` with no remaining bash.
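
The "bash folded into Python" change can be pictured with a small sketch that creates a directory and copies an input into it using only the standard library. The helper name, directory name, and file names below are hypothetical and are not taken from the actual templates.

```python
import os
import shutil
import tempfile


def stage_input(src, dest_dir, dest_name):
    """Replace a shell 'mkdir -p' + 'cp' pair with os.makedirs + shutil.copy."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, dest_name)
    shutil.copy(src, dest)
    return dest


with tempfile.TemporaryDirectory() as work:
    # Hypothetical input file standing in for a staged module input.
    src = os.path.join(work, "offsets.csv")
    with open(src, "w") as handle:
        handle.write("length,offset\n29,12\n")

    staged = stage_input(src, os.path.join(work, "expected-dir"), "expected-name.csv")
    with open(staged) as handle:
        staged_text = handle.read()

print(staged_text)
```

Keeping the setup in the same Python file means the template needs no surrounding shell at all, which is the shape the commit describes.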

meta.yml updates document the new tuple inputs. Tests pass meta+file
tuples instead of bare paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the `need_bundled` / `bundled_setup` Groovy plumbing with a
cleaner layer-separated pattern:

- Groovy stringifies user-supplied paths (or emits empty) - that's it.
- Bash handles all flow control: assigns from the Groovy-interpolated
  string, then runs the rpbp-bundled-models lookup only if either
  variable is empty.

No Groovy string contains a `$bash_var` literal awaiting expansion -
which was a source-of-truth muddle in the previous version. Same
behaviour, fewer moving parts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Move `pandas` / `rpbp` imports to the top of
  `get_periodic_lengths_and_offsets.py` (E402: imports were after a
  helper function definition).
- Add `# noqa: F821` on the `${task.process}` interpolation lines in
  both templates. `task` is bound by Nextflow at template-resolution
  time; ruff can't see it.

The `${task.process}` is Nextflow template interpolation, not Python.
Ruff treats the f-string as Python and complains about undefined `task`.
Removing the `f` prefix on the lines that only have Nextflow interp
(while keeping it on lines that actually interpolate Python expressions
like `{platform.python_version()}`) makes ruff see a plain string and
the F821 falls away naturally. Matches the existing precedent in
`variantextractor`, `cellranger/multi`, `sigprofiler` templates.
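
A small sketch of why dropping the `f` prefix helps: in the unrendered template, Python tooling reads `{task.process}` inside an f-string as a Python expression, whereas a plain string leaves the placeholder untouched for Nextflow. The `str.replace` call is only a stand-in for Nextflow's rendering, and the process name is hypothetical.

```python
# The template line as it sits on disk, before Nextflow renders it.
plain_line = 'process = "${task.process}"'

# Stand-in for template rendering (assumption: a simple textual substitution).
rendered = plain_line.replace("${task.process}", "RPBP_PREPAREGENOME")
print(rendered)  # process = "RPBP_PREPAREGENOME"

# With an f prefix, Python parses {task.process} as an expression. In the
# unrendered file `task` is undefined, which is what a linter reports as F821.
error_message = None
try:
    eval('f"${task.process}"')
except NameError as err:
    error_message = str(err)
print(error_message)
```

At run time Nextflow substitutes the placeholder before Python ever sees the file, so the issue is purely one of static analysis on the template source.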

Replaces the three-line `f.write` block with a single `yaml.safe_dump`
call. pyyaml is already a transitive dep of rpbp (used by
prepare_rpbp_genome for config loading). Matches the
nf-core/anndata/getsize and custom/orfcountmatrix precedents.

- Reorder imports (alphabetic third-party group; ruff considers
  `rpbp` and `yaml` peers of `pandas`).
- Remove `=`-alignment spaces.
- Multi-line `to_csv` call with one arg per line.

No semantic change.
… rpbp

Rewrite top-level descriptions on all 8 rpbp module meta.yml files plus
the fasta_gtf_bam_rpbp subworkflow so a reader who knows generic NGS /
Ribo-seq vocabulary but not Rp-Bp internals can understand what each
step does: metagene profile construction, per-(length, offset) Bayesian
periodicity scoring, P-site offset selection and filtering, per-ORF
P-site count vectors, per-ORF translated-vs-untranslated Bayes factor
scoring, and the final filtered prediction set with BED/DNA/protein
outputs.

Citation audit: corrected the Rp-Bp DOI from 10.1093/nar/gkw1141
(unrelated E. coli 5'-UTR paper) to 10.1093/nar/gkw1350 (Malone et al.
2017, Nucleic Acids Research, "Bayesian prediction of RNA translation
from ribosome profiling") across all 8 module meta.yml files. Homepage,
documentation and tool_dev_url verified to resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The optional Stan-model inputs on estimatemetagenebayesfactors and
estimateorfbayesfactors were non-functional for user-supplied files:
rpbp resolves models via CmdStanModel(exe_file=Path(pm).with_suffix(""))
and requires the pre-compiled binary to live next to the .stan file, so
staged user-supplied .stan files cannot be picked up. The umbrella
subworkflow always passed empties anyway. Both modules now always
resolve the bundled rpbp Python-package models; the remaining input
metas are renumbered to be sequential.
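
The path logic behind this limitation can be sketched with `pathlib` alone: stripping the suffix from a `.stan` path gives the location where a compiled binary would have to exist. The directory layout and file names below are hypothetical, and the sketch does not call CmdStanPy.

```python
import tempfile
from pathlib import Path


def expected_executable(stan_file):
    """Where a pre-compiled binary would need to sit: the .stan path minus its suffix."""
    return Path(stan_file).with_suffix("")


def has_compiled_binary(stan_file):
    return expected_executable(stan_file).exists()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical bundled model: source and binary side by side.
    bundled = root / "bundled" / "model.stan"
    bundled.parent.mkdir()
    bundled.write_text("// stan source")
    bundled.with_suffix("").write_text("binary placeholder")

    # Hypothetical staged user model: only the .stan source is present.
    staged = root / "staged" / "custom.stan"
    staged.parent.mkdir()
    staged.write_text("// stan source")

    bundled_ok = has_compiled_binary(bundled)
    staged_ok = has_compiled_binary(staged)

print(bundled_ok, staged_ok)  # True False
```

A staged `.stan` file without its sibling binary fails the lookup, which is consistent with the decision to always resolve the bundled models.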

Also adds an empty-lengths guard to getperiodiclengthsoffsets so the
template fails loudly with a clear message when no read lengths pass
the periodicity filters, rather than silently emitting an empty TSV
that propagates to all-zero ORF output. Adds a note to preparegenome's
meta.yml flagging that it emits the *.annotated.bed.gz filenames
produced by get_orfs rather than the *.bed.gz-renamed forms produced
by the upstream prepare-rpbp-genome umbrella script; downstream
consumers in this module set reference those names explicitly.
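
The empty-lengths guard can be pictured with a minimal sketch like the one below. Raising `ValueError` and the wording of the message are assumptions made for illustration; the actual template may fail in a different way.

```python
def require_periodic_lengths(lengths, offsets, sample_id):
    """Return (length, offset) pairs, or fail loudly if the filter left nothing."""
    if len(lengths) != len(offsets):
        raise ValueError("lengths and offsets must have the same number of entries")
    if not lengths:
        raise ValueError(
            f"No read lengths passed the periodicity filters for {sample_id}; "
            "refusing to write an empty lengths/offsets table."
        )
    return list(zip(lengths, offsets))


print(require_periodic_lengths([28, 29], [12, 12], "sample1"))  # [(28, 12), (29, 12)]

try:
    require_periodic_lengths([], [], "sample1")
except ValueError as err:
    print(f"error: {err}")
```

Failing at this step keeps the error next to its cause, rather than letting an empty table surface later as all-zero ORF output.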

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local snapshots were captured under Nextflow 26.04.1, which serialises
`path "versions.yml" ... topic: versions` outputs into the same flattened
tuple form the eval-style emits produce. CI runs Nextflow 25.10.2, which
keeps them as `versions.yml:md5,...` (the conventional path-emit form).
Regenerate the two affected snapshots so they match the CI output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pinin4fjords pinin4fjords marked this pull request as ready for review May 20, 2026 19:15
pinin4fjords added a commit to pinin4fjords/riboseq that referenced this pull request May 21, 2026
Swap modules/local/gedi/{indexgenome,price} for the nf-core/modules
versions (git_sha 51af5cb84874647aa4733742b86142025705b042). The
upstream modules differ only by convention: configurable ext.prefix
on the indexgenome output directory and an additive versions_gedi
emit alongside the versions topic. Setting ext.prefix='price_index'
on GEDI_INDEXGENOME preserves the existing published path and the
${index}/reference.oml contract consumed by GEDI_PRICE, so behaviour
is unchanged.

ribocode/* and ribotish/* were already moved to modules/nf-core/
via earlier `nf-core modules install` work, so this commit only
handles gedi.

Follow-up housekeeping once these PRs merge:
- rpbp/* via nf-core/modules#11695
- concat_gtf and filter_gtf_class_code via nf-core/modules#11729

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 participant