Add DOTSeq cell_cycle subset test data for modules #2072

Open
pinin4fjords wants to merge 5 commits into nf-core:modules from pinin4fjords:add-dotseq-testdata

@pinin4fjords pinin4fjords commented May 21, 2026

Summary

Test fixtures for nf-core/modules#11742 (dotseq/dotseq) under data/genomics/homo_sapiens/riboseq_expression/dotseq/.

DOTSeq is a Bioconductor package for differential ORF usage (DOU) + ORF-level differential translation efficiency (DTE). Its API takes a featureCounts-format ORF count table plus a flattened ORF GTF/BED pair, so the existing riboseq_expression/ salmon-gene-level fixtures don't satisfy the input contract. Rather than build new featureCounts + ORF annotations from the existing BAMs (which would require an ORF caller + featureCounts pass), this PR drops in the small cell_cycle_subset bundled with the DOTSeq package - DOTSeq's own example data, already in the exact shape its functions expect.

Files

| File | Size | Notes |
|------|------|-------|
| featureCounts.cell_cycle_subset.txt.gz | 117 KB | ORF-level featureCounts output, 6644 ORF rows × 12 sample columns (chx-treatment subset of Ly et al. 2024) |
| gencode.v47.orf_flattened_subset.gtf.gz | 81 KB | Flattened GENCODE v47 ORF annotation matching the count table, 6945 lines |
| gencode.v47.orf_flattened_subset.bed.gz | 53 KB | Matching BED (6642 lines) |
| metadata.txt.gz | 211 B | Headerless 24-sample metadata with columns run, strategy, replicate, treatment, condition |
| samplesheet.csv | 423 B | Headered, chx-only subset of metadata.txt.gz (12 rows), derived in this PR so the module can consume it without an extra header step |

Total: 252 KB across 5 files.

Why these specific files

DOTSeq's DOTSeqDataSetsFromFeatureCounts() requires four inputs:

  1. count_table with featureCounts annotation columns (Geneid, Chr, Start, End, Strand, Length) followed by per-sample counts.
  2. flattened_gtf with gene_id + exon_number attributes (DOTSeq uses these to name ORFs as gene_id:O###).
  3. flattened_bed matching the GTF.
  4. condition_table with run, strategy, replicate, condition columns.
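The input contract above can be checked before a fixture is handed to the module. The following is a minimal sketch, not part of DOTSeq or the module: `sample_columns` and `check_inputs` are hypothetical helper names, the example rows are invented, and it assumes the sample column names in the count table are identical to the `run` values of the condition table.

```python
import io

# The six annotation columns featureCounts writes before the per-sample counts.
FEATURECOUNTS_ANNOTATION = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def sample_columns(handle):
    """Return the per-sample column names of a featureCounts table (hypothetical helper)."""
    for line in handle:
        if line.startswith("#"):
            # featureCounts emits a leading comment line; skip it.
            continue
        header = line.rstrip("\n").split("\t")
        break
    else:
        raise ValueError("no header line found")
    n = len(FEATURECOUNTS_ANNOTATION)
    if header[:n] != FEATURECOUNTS_ANNOTATION:
        raise ValueError(f"unexpected annotation columns: {header[:n]}")
    return header[n:]


def check_inputs(count_handle, runs):
    """Compare count-table samples with condition-table runs.

    Assumption: sample column names equal the run identifiers.
    Returns (runs missing from the counts, count columns without a run).
    """
    samples = sample_columns(count_handle)
    missing = sorted(set(runs) - set(samples))
    extra = sorted(set(samples) - set(runs))
    return missing, extra


# Invented two-sample table, for illustration only.
example = io.StringIO(
    "# Program:featureCounts (illustrative comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\trun_a\trun_b\n"
    "ENSG_X:O001\tchr1\t1\t90\t+\t90\t12\t7\n"
)
print(check_inputs(example, ["run_a", "run_b"]))
```

For the real fixtures the handle would come from `gzip.open(path, "rt")`, and two empty lists would indicate that the 12 sample columns and the 12 samplesheet rows line up.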

The first four files are reused verbatim from DOTSeq's inst/extdata/. samplesheet.csv is the only derived file: metadata.txt.gz ships as a 5-column whitespace table with no header, so the module test would need to add a header before consuming it. To keep the nf-test minimal we ship the headered, chx-filtered version (12 rows matching the 12 sample columns in the featureCounts file) directly.
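The derivation of samplesheet.csv described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the script used in this PR: `derive_samplesheet` is a hypothetical name, the column order (run, strategy, replicate, treatment, condition) is taken from the file table, and the example rows and their values are invented.

```python
import csv
import io

# Header expected by the module's condition table.
HEADER = ["run", "strategy", "replicate", "condition"]


def derive_samplesheet(metadata_text, treatment="chx"):
    """Filter headerless whitespace metadata to one treatment and add a header.

    Assumes five whitespace-separated columns per row:
    run, strategy, replicate, treatment, condition.
    The treatment column is dropped from the output.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for line in metadata_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        run, strategy, replicate, row_treatment, condition = line.split()
        if row_treatment == treatment:
            writer.writerow([run, strategy, replicate, condition])
    return out.getvalue()


# Invented rows; the real file has 24 samples.
metadata = (
    "run_a ribo 1 chx cond_1\n"
    "run_b rna 1 chx cond_1\n"
    "run_c ribo 1 other cond_2\n"
)
print(derive_samplesheet(metadata))
```

With the real file, `metadata_text` would be read via `gzip.open("metadata.txt.gz", "rt").read()`, and the result should contain 12 data rows if the chx filter matches the count table.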

Source & licence

Bundled inst/extdata of compgenom/DOTSeq (Bioconductor 3.23 release). Upstream data is from Ly et al. 2024 (GSE231096 / SRR242304XX cell-cycle Ribo-seq + RNA-seq cohort), captured as the DOTSeq author's example dataset. Package licence: MIT.

Test plan

  • modules/nf-core/dotseq/dotseq/tests/main.nf.test in nf-core/modules#11742 consumes these fixtures via raw.githubusercontent.com URLs pinned to this PR's branch; the URLs will be switched to the modules branch once this PR merges
  • nf-core modules test --profile docker dotseq/dotseq runs DOTSeq's DOU + DTE + per-strategy DESeq2 fits + plots end-to-end in ~4 min on a c5.9xlarge against this fixture set
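
To make the URL switch in the first bullet concrete, here is a small sketch of the two raw.githubusercontent.com forms. The `raw_url` helper is hypothetical, and the fork's repository name (`test-datasets`) is an assumption; the branch names and fixture directory are taken from this PR.

```python
def raw_url(owner, repo, ref, path):
    """Build a raw.githubusercontent.com URL for a file at a given ref."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"


FIXTURE_DIR = "data/genomics/homo_sapiens/riboseq_expression/dotseq"

# While this PR is open: pinned to the PR branch on the fork
# (assumption: the fork is named test-datasets).
pinned = raw_url(
    "pinin4fjords", "test-datasets", "add-dotseq-testdata",
    f"{FIXTURE_DIR}/samplesheet.csv",
)

# After merge: the canonical modules branch of nf-core/test-datasets.
canonical = raw_url(
    "nf-core", "test-datasets", "modules",
    f"{FIXTURE_DIR}/samplesheet.csv",
)

print(pinned)
print(canonical)
```

Only the owner and ref segments differ between the two, so the post-merge update is a mechanical substitution.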

Bundled inst/extdata files from the DOTSeq Bioconductor package, copied
into homo_sapiens/riboseq_expression/dotseq for use by the nf-core/modules
dotseq/dotseq nf-test:

- featureCounts.cell_cycle_subset.txt.gz: ORF-level featureCounts output
- gencode.v47.orf_flattened_subset.gtf.gz: flattened ORF annotation
- gencode.v47.orf_flattened_subset.bed.gz: matching BED
- metadata.txt.gz: sample condition table (run/strategy/replicate/treatment/condition)

Source: https://github.com/compgenom/DOTSeq, MIT license.

samplesheet.csv: derived from metadata.txt.gz, filtered to treatment=chx
samples (which is what the bundled featureCounts table contains) and
headered as run,strategy,replicate,condition for direct consumption by the
dotseq/dotseq nf-test.
@pinin4fjords

Companion to nf-core/modules#11742 - the dotseq module nf-test 404s on this samplesheet URL until this PR merges, otherwise green.

Document the file set, derivation, and verification of the DOTSeq
cell_cycle_subset fixtures, matching the gedi/price test data PR
convention.
pinin4fjords added a commit to pinin4fjords/nf-core-modules that referenced this pull request May 22, 2026
Lets CI verify the module is actually green; revert this commit once
nf-core/test-datasets#2072 merges and the canonical modules-branch URL
resolves.
Replaces the upstream featureCounts table and flattened GTF/BED with a
tidy per-ORF count matrix (counts.tsv.gz) and per-ORF annotation TSV
(annotation.tsv.gz) so the dotseq module can call
DOTSeqDataSetsFromSummarizeOverlaps() directly. ORF identifiers and
sample columns are preserved; both files are derived from the same
DOTSeq cell_cycle_subset cohort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
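
The reshaping described in the commit message above could look roughly like the sketch below. It is not the derivation used for counts.tsv.gz and annotation.tsv.gz: `split_featurecounts` is a hypothetical name, the example row is invented, and it assumes the Geneid column already holds ORF identifiers of the form gene_id:O###, so that gene_id is everything before the last colon.

```python
import csv

ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def split_featurecounts(text):
    """Split a featureCounts-style table into a count matrix and an annotation table.

    Assumption: Geneid values are ORF identifiers shaped like gene_id:O###.
    Returns two lists of rows, each starting with its header row.
    """
    # Drop blank lines and the leading featureCounts comment line.
    lines = [l for l in text.splitlines() if l and not l.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]

    counts = [["orf_id", *samples]]
    annotation = [["orf_id", "gene_id"]]
    for row in reader:
        orf_id = row["Geneid"]
        gene_id = orf_id.rsplit(":", 1)[0]  # strip the trailing :O### suffix
        counts.append([orf_id, *(row[s] for s in samples)])
        annotation.append([orf_id, gene_id])
    return counts, annotation


# Invented single-ORF table, for illustration only.
table = (
    "# Program:featureCounts (illustrative comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\trun_a\trun_b\n"
    "ENSG_X:O001\tchr1\t1\t90\t+\t90\t12\t7\n"
)
counts, annotation = split_featurecounts(table)
print(counts)
print(annotation)
```

The optional orf_type and coordinate columns mentioned for the annotation TSV are left out here; they could be carried over from the Chr, Start, End, and Strand columns in the same loop.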
pinin4fjords added a commit to pinin4fjords/nf-core-modules that referenced this pull request May 22, 2026
Aligns the module's input contract with deltate / anota2seq so that
consumers can dispatch between the three ORF-DTE methods without
maintaining a separate prep step for dotseq. The four
featureCounts/GTF/BED inputs collapse to a per-ORF count matrix
(orf_id + sample columns) plus a per-ORF annotation TSV (orf_id +
gene_id + optional orf_type/coords). The R template now calls
DOTSeqDataSetsFromSummarizeOverlaps() and builds the required GRanges
in-process from the annotation TSV; the model fit, contrast tables,
and plotDOT outputs are unchanged.

Test fixtures updated alongside in
nf-core/test-datasets#2072 (commit 8c9b27c).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>