Add DOTSeq cell_cycle subset test data for modules by pinin4fjords · Pull Request #2072 · nf-core/test-datasets

pinin4fjords · 2026-05-21T22:16:14Z

Summary

Test fixtures for nf-core/modules#11742 (dotseq/dotseq) under data/genomics/homo_sapiens/riboseq_expression/dotseq/.

DOTSeq is a Bioconductor package for differential ORF usage (DOU) + ORF-level differential translation efficiency (DTE). Its API takes a featureCounts-format ORF count table plus a flattened ORF GTF/BED pair, so the existing riboseq_expression/ salmon-gene-level fixtures don't satisfy the input contract. Rather than build new featureCounts + ORF annotations from the existing BAMs (which would require an ORF caller + featureCounts pass), this PR drops in the small cell_cycle_subset bundled with the DOTSeq package - DOTSeq's own example data, already in the exact shape its functions expect.

Files

| File | Size | Notes |
|---|---|---|
| `featureCounts.cell_cycle_subset.txt.gz` | 117 KB | ORF-level featureCounts output, 6644 ORF rows × 12 sample columns (chx-treatment subset of Ly et al. 2024) |
| `gencode.v47.orf_flattened_subset.gtf.gz` | 81 KB | Flattened GENCODE v47 ORF annotation matching the count table, 6945 lines |
| `gencode.v47.orf_flattened_subset.bed.gz` | 53 KB | Matching BED (6642 lines) |
| `metadata.txt.gz` | 211 B | Headerless 24-sample metadata: `run strategy replicate treatment condition` |
| `samplesheet.csv` | 423 B | Headered, chx-only subset of `metadata.txt.gz` (12 rows), derived in this PR so the module can consume it without an extra header step |

Total: 252 KB across 5 files.

Why these specific files

DOTSeq's DOTSeqDataSetsFromFeatureCounts() requires four inputs:

- count_table with featureCounts annotation columns (Geneid, Chr, Start, End, Strand, Length) followed by per-sample counts.
- flattened_gtf with gene_id + exon_number attributes (DOTSeq uses these to name ORFs as gene_id:O###).
- flattened_bed matching the GTF.
- condition_table with run, strategy, replicate, condition columns.
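
As a rough illustration of the input contract listed above, the sketch below checks that a count-table header starts with the six featureCounts annotation columns and reports condition-table runs that have no matching sample column. It is not part of DOTSeq or of the module: the function names are hypothetical, the sample names are made up, and it assumes that sample column names are identical to the `run` values, which this PR does not state.

```python
# The six annotation columns featureCounts writes before the per-sample counts.
ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]
# Columns the condition table is described as carrying.
CONDITION_COLUMNS = ["run", "strategy", "replicate", "condition"]


def sample_columns(count_header):
    """Return the per-sample column names from a tab-separated header line."""
    columns = count_header.rstrip("\n").split("\t")
    n = len(ANNOTATION_COLUMNS)
    if columns[:n] != ANNOTATION_COLUMNS:
        raise ValueError(f"unexpected annotation columns: {columns[:n]}")
    return columns[n:]


def unmatched_runs(count_header, condition_rows):
    """List runs in the condition table that have no column in the count table.

    ASSUMPTION: sample column names are identical to the `run` values.
    """
    for row in condition_rows:
        missing = [c for c in CONDITION_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"condition row lacks columns: {missing}")
    available = set(sample_columns(count_header))
    return [row["run"] for row in condition_rows if row["run"] not in available]


# Made-up example: runC is in the condition table but not in the counts.
header = "Geneid\tChr\tStart\tEnd\tStrand\tLength\trunA\trunB\n"
rows = [
    {"run": "runA", "strategy": "ribo", "replicate": "1", "condition": "x"},
    {"run": "runC", "strategy": "rna", "replicate": "1", "condition": "x"},
]
print(unmatched_runs(header, rows))
```

A check of this kind would flag a mismatch between the 12 sample columns and the samplesheet rows before the R code is run.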

The first four files are reused verbatim from DOTSeq's inst/extdata/. samplesheet.csv is the only derived file: metadata.txt.gz ships as a 5-column whitespace table with no header, so the module test would need to add a header before consuming it. To keep the nf-test minimal we ship the headered, chx-filtered version (12 rows matching the 12 sample columns in the featureCounts file) directly.
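
The derivation described above can be sketched roughly as follows. This is not the script used to produce `samplesheet.csv`; it is a minimal illustration that assumes the five metadata columns appear in the order `run strategy replicate treatment condition` and that the `treatment` column is dropped on output, as the `run,strategy,replicate,condition` header implies. The function names and the example rows are hypothetical.

```python
import csv
import gzip
import io

# Column order of the headerless metadata table, per the file description.
METADATA_COLUMNS = ["run", "strategy", "replicate", "treatment", "condition"]
# Header of the derived samplesheet (treatment is dropped after filtering).
SAMPLESHEET_COLUMNS = ["run", "strategy", "replicate", "condition"]


def derive_samplesheet(metadata_text, treatment="chx"):
    """Turn headerless whitespace-separated metadata into a headered CSV string."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=SAMPLESHEET_COLUMNS,
        extrasaction="ignore",  # silently drop the treatment column
        lineterminator="\n",
    )
    writer.writeheader()
    for line in metadata_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(METADATA_COLUMNS):
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        row = dict(zip(METADATA_COLUMNS, fields))
        if row["treatment"] == treatment:
            writer.writerow(row)
    return out.getvalue()


def derive_from_gzip(path, treatment="chx"):
    """Same as derive_samplesheet, reading a gzip-compressed metadata file."""
    with gzip.open(path, "rt") as handle:
        return derive_samplesheet(handle.read(), treatment)


# Made-up rows; the real values live in metadata.txt.gz.
example = "r1 ribo 1 chx G1\nr2 rna 1 other G1\nr3 rna 2 chx M\n"
print(derive_samplesheet(example))
```

Applied to the real 24-sample file, a filter like this should leave the 12 chx rows that match the sample columns in the featureCounts table.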

Source & licence

Bundled inst/extdata of compgenom/DOTSeq (Bioconductor 3.23 release). Upstream data is from Ly et al. 2024 (GSE231096 / SRR242304XX cell-cycle Ribo-seq + RNA-seq cohort), captured as the DOTSeq author's example dataset. Package licence: MIT.

Test plan

- modules/nf-core/dotseq/dotseq/tests/main.nf.test in nf-core/modules#11742 consumes these fixtures via raw.githubusercontent.com URLs pinned to this PR's branch; the URLs will be switched to the modules branch once this PR merges.
- nf-core modules test --profile docker dotseq/dotseq runs DOTSeq's DOU + DTE + per-strategy DESeq2 fits + plots end-to-end in ~4 min on a c5.9xlarge against this fixture set.
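
To make the URL switch concrete, here is a small sketch of how the pinned and post-merge fixture URLs differ. It relies on the standard `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` layout; the helper name is hypothetical, and the fork being named `test-datasets` under `pinin4fjords` is an assumption based on the head branch `pinin4fjords:add-dotseq-testdata`.

```python
# Directory the fixtures live in, as given in the PR summary.
FIXTURE_DIR = "data/genomics/homo_sapiens/riboseq_expression/dotseq"


def raw_url(owner, repo, ref, filename):
    """Build a raw.githubusercontent.com URL for one fixture file."""
    return (
        f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/"
        f"{FIXTURE_DIR}/{filename}"
    )


# While the PR is open: pinned to the PR branch (fork repo name is assumed).
pinned = raw_url("pinin4fjords", "test-datasets", "add-dotseq-testdata", "samplesheet.csv")
# After the merge: the canonical modules branch of nf-core/test-datasets.
canonical = raw_url("nf-core", "test-datasets", "modules", "samplesheet.csv")

print(pinned)
print(canonical)
```

Only the owner and ref segments change between the two forms, so the switch after merge is a mechanical substitution in the nf-test file.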

Bundled inst/extdata files from the DOTSeq Bioconductor package, copied into homo_sapiens/riboseq_expression/dotseq for use by the nf-core/modules dotseq/dotseq nf-test:
- featureCounts.cell_cycle_subset.txt.gz: ORF-level featureCounts output
- gencode.v47.orf_flattened_subset.gtf.gz: flattened ORF annotation
- gencode.v47.orf_flattened_subset.bed.gz: matching BED
- metadata.txt.gz: sample condition table (run/strategy/replicate/treatment/condition)
Source: https://github.com/compgenom/DOTSeq, MIT license.

pinin4fjords · 2026-05-22T07:40:39Z

Companion to nf-core/modules#11742 - the dotseq module nf-test 404s on this samplesheet URL until this PR merges, otherwise green.

Lets CI verify the module is actually green; revert this commit once nf-core/test-datasets#2072 merges and the canonical modules-branch URL resolves.

Replaces the upstream featureCounts table and flattened GTF/BED with a tidy per-ORF count matrix (counts.tsv.gz) and per-ORF annotation TSV (annotation.tsv.gz) so the dotseq module can call DOTSeqDataSetsFromSummarizeOverlaps() directly. ORF identifiers and sample columns are preserved; both files are derived from the same DOTSeq cell_cycle_subset cohort.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns the module's input contract with deltate / anota2seq so that consumers can dispatch between the three ORF-DTE methods without maintaining a separate prep step for dotseq. The four featureCounts/GTF/BED inputs collapse to a per-ORF count matrix (orf_id + sample columns) plus a per-ORF annotation TSV (orf_id + gene_id + optional orf_type/coords). The R template now calls DOTSeqDataSetsFromSummarizeOverlaps() and builds the required GRanges in-process from the annotation TSV; the model fit, contrast tables, and plotDOT outputs are unchanged. Test fixtures updated alongside in nf-core/test-datasets#2072 (commit 8c9b27c).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pinin4fjords added 2 commits May 21, 2026 23:15

Add headered samplesheet.csv (chx subset) for DOTSeq tests

1779c8c

Derived from metadata.txt.gz, filtered to treatment=chx samples (which is what the bundled featureCounts table contains) and headered as run,strategy,replicate,condition for direct consumption by the dotseq/dotseq nf-test.

pinin4fjords mentioned this pull request May 21, 2026

Add dotseq/dotseq nf-core/modules#11742

Open

7 tasks

pinin4fjords added 2 commits May 22, 2026 09:51

Add README for DOTSeq cell_cycle test data

a59fbdb

Document the file set, derivation, and verification of the DOTSeq cell_cycle_subset fixtures, matching the gedi/price test data PR convention.

Document dotseq fixture set in top-level README

e25da10

pinin4fjords added the Ready to review label May 22, 2026

pinin4fjords wants to merge 5 commits into nf-core:modules from pinin4fjords:add-dotseq-testdata
