Add dotseq/dotseq by pinin4fjords · Pull Request #11742 · nf-core/modules

pinin4fjords · 2026-05-21T22:41:45Z

Adds a dotseq/dotseq module wrapping DOTSeq, a Bioconductor package for detecting differential ORF usage (DOU) and ORF-level differential translation efficiency (DTE) from Ribo-seq with matched RNA-seq.

Status

Bioconda recipe (bioconda/bioconda-recipes#65677) merged; bioconductor-dotseq=1.0.0 live on linux-64 + osx-64.
Container: Wave community build pulling bioconductor-dotseq + tidyverse + plotting deps.
Test data (nf-core/test-datasets#2072) open, awaiting maintainer review. The module's tests/nextflow.config points at the PR's fork branch until that merges.
nf-core modules lint dotseq/dotseq clean.
Local nf-test passes against the Wave container in ~4 min on AWS c5.9xlarge.

What it does

Inputs: contrast tuple (variable, reference, target) + (samplesheet, count_matrix, annotation_tsv). The shape mirrors anota2seq/anota2seqrun and deltate (samplesheet + counts) with one extra per-ORF annotation TSV that supplies the parent-gene id the DOU beta-binomial fit needs.
- samplesheet: CSV/TSV with run, strategy, replicate, condition columns (overridable via task.ext.args).
- count_matrix: TSV with orf_id as the first column and per-sample count columns. Both Ribo-seq and RNA-seq samples share the matrix; the sample sheet's strategy column distinguishes them.
- annotation_tsv: per-ORF rows with orf_id + gene_id (required), and optional orf_type (mORF/uORF/dORF) and coordinate columns. Genomic coordinates are stored on the resulting GRanges for downstream inspection only; the fit itself only uses gene_id (to group child ORFs per parent gene) and orf_type (to bucket the heatmap).
Calls DOTSeqDataSetsFromSummarizeOverlaps() -> DOTSeq() -> getContrasts(). The R template uses optparse for --key value overrides via task.ext.args and tidyverse syntax throughout.
Emits only what DOTSeq produces natively:
- translation.dotseq.results.tsv - DTE (DESeq2 + ashr) interaction results
- dou.dotseq.results.tsv - DOU (beta-binomial glmmTMB + ashr) interaction results
- dou_strategy / dte_strategy TSVs when present (per-condition Ribo-vs-RNA contrasts)
- volcano.png, composite.png, venn.png, heatmap.png from plotDOT()
- interaction_p_distribution.png - histogram of DTE padj (plain ggplot on the package's own statistic)
- DOTSeqDataSets.rds, R_sessionInfo.log, versions.yml
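
The input contract above can be illustrated with a small consistency check. This is a minimal Python sketch, not the module's R template: the inline tables, the sample and ORF ids, the strategy labels `ribo`/`rna`, and the helper names `read_table` and `check_inputs` are all hypothetical. It assumes only the column names described above (run, strategy, condition, orf_id, gene_id).

```python
import csv
import io

# Hypothetical miniature inputs; all ids and counts are invented.
SAMPLESHEET = (
    "run,strategy,replicate,condition\n"
    "ribo_a1,ribo,1,A\n"
    "ribo_b1,ribo,1,B\n"
    "rna_a1,rna,1,A\n"
    "rna_b1,rna,1,B\n"
)
COUNTS = (
    "orf_id\tribo_a1\tribo_b1\trna_a1\trna_b1\n"
    "orf1\t10\t25\t100\t110\n"
    "orf2\t3\t1\t100\t110\n"
)
ANNOTATION = (
    "orf_id\tgene_id\torf_type\n"
    "orf1\tgeneX\tmORF\n"
    "orf2\tgeneX\tuORF\n"
)


def read_table(text, delimiter):
    """Parse delimited text into a list of row dicts keyed by the header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def check_inputs(samplesheet, counts, annotation):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    # Every count column other than orf_id should be a run in the samplesheet.
    runs = {row["run"] for row in samplesheet}
    count_columns = [c for c in (counts[0] if counts else {}) if c != "orf_id"]
    for column in count_columns:
        if column not in runs:
            problems.append(f"count column {column} has no samplesheet row")
    # Every counted ORF needs a parent gene id from the annotation.
    gene_of = {row["orf_id"]: row.get("gene_id") for row in annotation}
    for row in counts:
        if not gene_of.get(row["orf_id"]):
            problems.append(f"{row['orf_id']} has no gene_id in the annotation")
    return problems


problems = check_inputs(
    read_table(SAMPLESHEET, ","),
    read_table(COUNTS, "\t"),
    read_table(ANNOTATION, "\t"),
)
print(problems)  # an empty list when the three tables agree
```

The point of the sketch is that the three tables are joined on only two keys: sample columns against the samplesheet's run column, and orf_id against the annotation.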

Why this input shape

The DOTSeq vignette uses DOTSeqDataSetsFromFeatureCounts() because the cell_cycle example was generated with featureCounts -f -O against a flattened ORF GTF; the constructor parses that featureCounts-format table and pairs it with a flattened GTF + BED to build the internal GRanges. That is not intrinsic to the DOTSeq model. The package also exports DOTSeqDataSetsFromSummarizeOverlaps(), which takes a plain per-ORF count matrix plus a GRanges annotation. The latter is a much cleaner contract for upstream consumers (nf-core/riboseq's ORF-level pipeline already produces a per-ORF P-site count matrix and a per-ORF annotation TSV directly, with no need to synthesize a featureCounts header + flattened GTF + flattened BED). The module wraps FromSummarizeOverlaps and builds the GRanges in-process from the annotation TSV.
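
To make concrete how little the annotation TSV has to carry, here is a hedged Python sketch of grouping child ORFs under their parent gene. It does not build a GRanges and is not the module's code; the function name `group_orfs_by_gene` and the example rows are hypothetical, and only the orf_id, gene_id and optional orf_type columns come from the description above.

```python
import csv
import io
from collections import defaultdict

REQUIRED = ("orf_id", "gene_id")


def group_orfs_by_gene(annotation_text):
    """Group (orf_id, orf_type) pairs by gene_id from a tab-separated table."""
    reader = csv.DictReader(io.StringIO(annotation_text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"annotation is missing required columns: {missing}")
    groups = defaultdict(list)
    for row in reader:
        # orf_type is optional, so an absent or empty value becomes None.
        groups[row["gene_id"]].append((row["orf_id"], row.get("orf_type") or None))
    return dict(groups)


# Hypothetical annotation with two genes.
example = (
    "orf_id\tgene_id\torf_type\n"
    "orf1\tgeneX\tmORF\n"
    "orf2\tgeneX\tuORF\n"
    "orf3\tgeneY\tmORF\n"
)
groups = group_orfs_by_gene(example)
print(groups)
```

A per-gene grouping like this is the information the description says the DOU fit needs from the annotation; coordinates, when present, would ride along for inspection only.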

Cross-tool naming

The DTE interaction table is written as translation.dotseq.results.tsv to match anota2seq/anota2seqrun's translation.anota2seq.results.tsv - same biological quantity (differential translation efficiency), measured per-ORF here vs per-gene there. This keeps downstream pipelines portable across the three differential translation methods (anota2seq, deltate, DOTSeq) while each module still emits only what its underlying package supports natively. DOTSeq's DOU output is unique to this tool and has no anota2seq/deltate counterpart.
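
A downstream consumer could rely on this naming convention roughly as follows. This is an illustrative Python sketch only: the helper `translation_results_name` is hypothetical, and the deltate file name is an assumption extrapolated from the pattern, since only the anota2seq and DOTSeq names are stated above.

```python
# Method keys as named in the text; the deltate file name is assumed
# to follow the same pattern and is not confirmed by the description.
SUPPORTED_METHODS = ("anota2seq", "deltate", "dotseq")


def translation_results_name(method):
    """Return the shared DTE results file name for a given method key."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(
            f"unknown method {method!r}; expected one of {SUPPORTED_METHODS}"
        )
    return f"translation.{method}.results.tsv"


names = {method: translation_results_name(method) for method in SUPPORTED_METHODS}
print(names["dotseq"])  # translation.dotseq.results.tsv
```

Because the DOU table has no counterpart in the other tools, a consumer would have to treat dou.dotseq.results.tsv as a DOTSeq-only extra rather than resolve it through a shared pattern.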

Test

nextflow_process test using DOTSeq's own cell_cycle_subset bundled data (~150KB total, MIT licence, derived from inst/extdata), restricted to the Mitotic_Cycling vs Interphase contrast on chx-treatment samples. Snapshots filenames + versions topic + versions.yml md5; DOTSeq's Bayesian + glmmTMB outputs are stochastic so file content is not snapshotted.
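
The snapshot strategy (file names for stochastic outputs, a checksum only for deterministic ones) can be sketched in Python as below. This is not nf-test code; the `snapshot` helper, the file contents, and the simulated second run are hypothetical and serve only to show why name-only entries stay stable when numeric output drifts.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(outdir, hashed=("versions.yml",)):
    """Map each file name to an md5 if listed in `hashed`, else to None."""
    entries = {}
    for path in sorted(Path(outdir).iterdir()):
        if path.name in hashed:
            entries[path.name] = hashlib.md5(path.read_bytes()).hexdigest()
        else:
            entries[path.name] = None  # name only: content may vary run to run
    return entries


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Invented file contents standing in for real module outputs.
    (out / "versions.yml").write_text("dotseq: 1.0.0\n")
    (out / "dou.dotseq.results.tsv").write_text("orf_id\tpadj\norf1\t0.0412\n")
    first = snapshot(out)
    # Simulate a second run whose numeric output differs slightly.
    (out / "dou.dotseq.results.tsv").write_text("orf_id\tpadj\norf1\t0.0409\n")
    second = snapshot(out)

print(first == second)  # True: only versions.yml content is compared
```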

PR checklist

This comment contains a description of changes (with reason).
Add a test for the new module/subworkflow.
Snapshot the test output.
If you've added a new tool - have you followed the module conventions in the contribution docs?
If necessary, include test data in your PR. (Sibling test-datasets PR linked above.)
Remove all TODO statements.
Emit the versions.yml file.

DOTSeq is a Bioconductor package for detecting differential ORF usage (DOU) and ORF-level differential translation efficiency (DTE) from Ribo-seq with matched RNA-seq. Module wraps DOTSeqDataSetsFromFeatureCounts + DOTSeq() + getContrasts() and emits per-ORF TSVs for the DOU and DTE interaction contrasts plus the serialised DOTSeqDataSets object.
Pre-requisites (in flight):
- Bioconda recipe: bioconda/bioconda-recipes#65677
- Test data: nf-core/test-datasets#2072

Bioconda recipe (bioconda/bioconda-recipes#65677) merged; biocontainer image is not yet built so swap the placeholder quay.io/depot URLs for a Wave community container built from the now-merged bioconda package. Also widen the singularity guard to include 'apptainer' and add the versions topic block in meta.yml (via nf-core modules lint --fix).

- Restructure the R template around optparse + readr + dplyr + purrr + ggplot2; drop the homemade parse_args / read_delim_flexible helpers in favour of the standard package idioms and native pipe.
- Output set is now what DOTSeq itself emits natively: per-ORF DTE contrasts (translation.dotseq.results.tsv), DOU contrasts (dou.dotseq.results.tsv), optional dou_strategy / dte_strategy per-condition Ribo-vs-RNA contrasts, plus the four plotDOT() PNGs (volcano / composite / venn / heatmap) and a DTE p-value distribution histogram drawn directly from DOTSeq's padj column.
- Container picks up r-eulerr + r-ggsignif (required for plotDOT venn) and explicit r-ggplot2 so the histogram has a stable ggplot version.
- plotDOT() default of force_new_device=TRUE was killing our png() device on each call; pass FALSE so the PNGs land where Nextflow expects them.

- Drop the homemade read_delim_flexible() and write_results_tsv() wrappers in favour of read_tsv() / read_csv() / write_tsv() directly. The earlier to_orf_tibble() conditional is also gone now that we know getContrasts() always returns a frame with orf_id as a column (per the DOTSeq source in posthoc.R + main.R).
- plotDOT(heatmap) requires gene-paired mORF + sorf entries; try uORF first (the package default) and fall back to dORF when no significant gene has both. tryCatch in safe_plot_dot still makes either a no-op when neither succeeds.
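
The uORF-then-dORF fallback can be sketched as follows. This is a Python illustration of the control flow only, not the R template: `safe_plot`, `heatmap_with_fallback` and `fake_draw` are hypothetical names, and `fake_draw` merely simulates a plotting call that fails for uORF after writing a partial file.

```python
import os
import tempfile


def safe_plot(draw, path):
    """Call draw(path); on error delete any partial file and return False."""
    try:
        draw(path)
    except Exception:
        if os.path.exists(path):
            os.unlink(path)
        return False
    return True


def heatmap_with_fallback(draw_for, path, orf_types=("uORF", "dORF")):
    """Try each ORF type in order; return the one that drew, or None."""
    for orf_type in orf_types:
        if safe_plot(lambda p, t=orf_type: draw_for(t, p), path):
            return orf_type
    return None


def fake_draw(orf_type, path):
    """Hypothetical plotting stand-in: only dORF has enough data to draw."""
    with open(path, "wb") as handle:
        handle.write(b"partial")
        if orf_type != "dORF":
            raise RuntimeError(f"no significant gene pairs mORF with {orf_type}")
        handle.write(b" complete")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "heatmap.png")
    used = heatmap_with_fallback(fake_draw, target)
    with open(target, "rb") as handle:
        content = handle.read()

print(used, content)  # dORF b'partial complete'
```

Returning an explicit success flag, rather than checking whether the output file exists, is what keeps a half-written file from the first attempt from being mistaken for a finished heatmap.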

…fallback robustness
- Add stub: block to main.nf matching the proteus/readproteingroups precedent.
- Read sample sheet with read_delim() picking comma/tab from the file extension so the meta.yml-advertised TSV variant actually works.
- Refuse to clobber an existing canonical column (e.g. an existing 'condition' column when --contrast_variable=treatment is supplied).
- Dedupe multi-lane sample sheets and validate that both Ribo and RNA strategies are present (DOTSeq's interaction design is unestimable otherwise).
- Add an is_set() predicate that catches NULL / empty stringent + required options before the tri-state switch silently returns NULL.
- safe_plot_dot now unlinks the partially-written PNG on plotDOT error and returns success so the heatmap fallback (uORF then dORF) keys off whether the first call actually drew, not file.exists() of a stale handle.
- getContrasts(type='interaction') errors propagate (headline outputs); type='strategy' stays tryCatch'd because absence is legitimate.
- Cache getDOU(d) / getDTE(d) once and share across contrasts + plotDOT.
- Drop redundant file.exists() walk - Nextflow's path staging already guarantees the inputs exist.
- Expand the test to assert volcano / composite / venn plot emission and add a -stub test.
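
The sample sheet handling described in this commit can be approximated in Python as below. This is a sketch under stated assumptions, not the R template: all function names are hypothetical, the `.tsv`-means-tab rule, the dedupe key columns (run, strategy, condition) and the lowercase strategy labels `ribo`/`rna` are assumptions, and the example sheet is invented.

```python
import csv
import io


def pick_delimiter(filename):
    """Tab for .tsv files, comma otherwise (assumed extension rule)."""
    return "\t" if filename.lower().endswith(".tsv") else ","


def is_set(value):
    """True unless the option is None or an empty/blank string."""
    return value is not None and str(value).strip() != ""


def rename_column(rows, source, target):
    """Rename source to target, refusing to overwrite an existing target."""
    if source == target:
        return rows
    if rows and target in rows[0]:
        raise ValueError(f"refusing to overwrite existing column {target!r}")
    return [{(target if k == source else k): v for k, v in r.items()} for r in rows]


def dedupe(rows, keys=("run", "strategy", "condition")):
    """Keep the first row for each combination of the key columns."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def require_both_strategies(rows, expected=("ribo", "rna")):
    """Raise if either sequencing strategy is absent from the sheet."""
    present = {row["strategy"].lower() for row in rows}
    missing = [s for s in expected if s not in present]
    if missing:
        raise ValueError(f"samplesheet lacks strategies: {missing}")


def load_samplesheet(filename, text, contrast_variable="condition"):
    """Parse, canonicalise, dedupe and validate a sample sheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter=pick_delimiter(filename))
    rows = rename_column(list(reader), contrast_variable, "condition")
    rows = dedupe(rows)
    require_both_strategies(rows)
    return rows


# Hypothetical multi-lane sheet: run r1 appears once per lane.
sheet = "run\tstrategy\ttreatment\tlane\nr1\tribo\tA\tL1\nr1\tribo\tA\tL2\nr2\trna\tA\tL1\n"
rows = load_samplesheet("sheet.tsv", sheet, contrast_variable="treatment")
print(len(rows))  # 2
```

Renaming happens before deduplication because the dedupe key includes the canonical condition column.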

Aligns the module's input contract with deltate / anota2seq so that consumers can dispatch between the three ORF-DTE methods without maintaining a separate prep step for dotseq. The four featureCounts/GTF/BED inputs collapse to a per-ORF count matrix (orf_id + sample columns) plus a per-ORF annotation TSV (orf_id + gene_id + optional orf_type/coords). The R template now calls DOTSeqDataSetsFromSummarizeOverlaps() and builds the required GRanges in-process from the annotation TSV; the model fit, contrast tables, and plotDOT outputs are unchanged. Test fixtures updated alongside in nf-core/test-datasets#2072 (commit 8c9b27c). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DOTSeq's parse_condition_table() requires a `replicate` column for stable ordering of samples within strategy+condition. Pipeline samplesheets often have a `pair` column (or none at all), so the R template now treats the column as optional: when present it is renamed to `replicate` as before; when absent the template assigns a per-(strategy, condition) row counter so the model fit is unaffected. This matches how anota2seq/deltate consume the same samplesheets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
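
The optional-replicate behaviour can be sketched in Python as follows. This is an illustration of the described logic rather than the R template: the function name `ensure_replicate`, its `replicate_column` parameter, and the example rows are hypothetical.

```python
from collections import defaultdict


def ensure_replicate(rows, replicate_column="replicate"):
    """Guarantee a `replicate` key on every row.

    If `replicate_column` exists it is renamed to `replicate`; otherwise a
    1-based counter is assigned within each (strategy, condition) group,
    in row order.
    """
    counters = defaultdict(int)
    result = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if replicate_column in row:
            row["replicate"] = row.pop(replicate_column)
        else:
            group = (row["strategy"], row["condition"])
            counters[group] += 1
            row["replicate"] = counters[group]
        result.append(row)
    return result


# Hypothetical sheet with no replicate-like column at all.
sheet = [
    {"run": "r1", "strategy": "ribo", "condition": "A"},
    {"run": "r2", "strategy": "ribo", "condition": "A"},
    {"run": "r3", "strategy": "rna", "condition": "A"},
    {"run": "r4", "strategy": "ribo", "condition": "B"},
]
replicates = [row["replicate"] for row in ensure_replicate(sheet)]
print(replicates)  # [1, 2, 1, 1]
```

When a sheet carries a `pair` column instead, calling `ensure_replicate(rows, replicate_column="pair")` would take the rename branch and keep the original values.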

Adds DOTSeq as a third option for per-ORF translational efficiency analysis alongside deltate and anota2seq. DOTSeq is ORF-only and produces both DTE (DESeq2 + ashr interaction) and DOU (beta-binomial glmmTMB) results. Module installed pre-merge from nf-core/modules#11742; modules.json carries it under a second-repo entry pointing at the PR branch on the user's fork. Also drops an unused alias import (RIBOTISH_QUALITY as RIBOTISH_QUALITY_TISEQ) surfaced during the workflow refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire DOTSEQ_DOTSEQ_ORF through DTE_COUNTS_PREP at ORF resolution. Drop the --run_dotseq placeholder; dotseq is now a third value for --translational_efficiency_method. Module installed from nf-core/modules#11742-pending (registered under https://github.com/pinin4fjords/nf-core-modules.git so nf-core lint doesn't hit an interactive prompt under CI's no-TTY shell). Adds withName blocks for the ORF-level DTE chain plus extra_orf_dte_args / extra_dotseq_args params, and brings tests/dotseq.nf.test + snapshot. [skip ci]

pinin4fjords added 2 commits May 21, 2026 23:41

pinin4fjords mentioned this pull request May 22, 2026

Add DOTSeq cell_cycle subset test data for modules nf-core/test-datasets#2072

Open

2 tasks

pinin4fjords added 5 commits May 22, 2026 09:55

TEMPORARY: point test at the pending test-datasets PR fork branch

164ddbc

Lets CI verify the module is actually green; revert this commit once nf-core/test-datasets#2072 merges and the canonical modules-branch URL resolves.

Merge branch 'master' into dotseq-add-module

dbd6a9b

pinin4fjords marked this pull request as ready for review May 22, 2026 10:43

pinin4fjords and others added 2 commits May 22, 2026 12:27

pinin4fjords mentioned this pull request May 22, 2026

feat: ORF-level differential translation nf-core/riboseq#189

Draft

pinin4fjords wants to merge 9 commits into nf-core:master from pinin4fjords:dotseq-add-module
