Add DOTSeq cell_cycle subset test data for modules #2072
Open
pinin4fjords wants to merge 5 commits into
Bundled inst/extdata files from the DOTSeq Bioconductor package, copied into homo_sapiens/riboseq_expression/dotseq for use by the nf-core/modules dotseq/dotseq nf-test:
- featureCounts.cell_cycle_subset.txt.gz: ORF-level featureCounts output
- gencode.v47.orf_flattened_subset.gtf.gz: flattened ORF annotation
- gencode.v47.orf_flattened_subset.bed.gz: matching BED
- metadata.txt.gz: sample condition table (run/strategy/replicate/treatment/condition)
Source: https://github.com/compgenom/DOTSeq, MIT license.
Derived from metadata.txt.gz, filtered to treatment=chx samples (which is what the bundled featureCounts table contains) and headered as run,strategy,replicate,condition for direct consumption by the dotseq/dotseq nf-test.
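The derivation described in this commit could be sketched roughly as follows. This is a minimal illustration, not the script used for the PR: the column order is assumed from the run/strategy/replicate/treatment/condition listing above, and the run IDs, the non-chx treatment value, and the condition names in the sample text are made up.

```python
import csv
import io

# Hypothetical stand-in for the decompressed metadata.txt.gz: five
# whitespace-separated columns, no header. All values here are made up.
METADATA_TEXT = """\
RUN001 ribo 1 chx mitotic
RUN002 rna 1 chx mitotic
RUN003 ribo 1 other mitotic
RUN004 ribo 2 chx interphase
"""

# Column order assumed from the commit message, not read from the file.
METADATA_COLUMNS = ["run", "strategy", "replicate", "treatment", "condition"]
SAMPLESHEET_COLUMNS = ["run", "strategy", "replicate", "condition"]


def derive_samplesheet(metadata_text: str, treatment: str = "chx") -> str:
    """Keep rows for one treatment and return them as headered CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=SAMPLESHEET_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for line in metadata_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(METADATA_COLUMNS):
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        row = dict(zip(METADATA_COLUMNS, fields))
        if row["treatment"] != treatment:
            continue  # drop samples from other treatments
        # The treatment column is dropped because every kept row shares it.
        writer.writerow({key: row[key] for key in SAMPLESHEET_COLUMNS})
    return out.getvalue()


samplesheet_csv = derive_samplesheet(METADATA_TEXT)
```

Working on text rather than file paths keeps the sketch easy to test; the real file would first be read with gzip.open(..., "rt").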
Companion to nf-core/modules#11742: the dotseq module nf-test 404s on this samplesheet URL until this PR merges; otherwise it is green.
Document the file set, derivation, and verification of the DOTSeq cell_cycle_subset fixtures, matching the gedi/price test data PR convention.
pinin4fjords added a commit to pinin4fjords/nf-core-modules that referenced this pull request on May 22, 2026
Lets CI verify the module is actually green; revert this commit once nf-core/test-datasets#2072 merges and the canonical modules-branch URL resolves.
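For reference, the difference between a branch-pinned URL and the canonical one can be sketched as below. The raw.githubusercontent.com layout (owner/repo/ref/path) is standard, and the fixture directory is the one named in the PR summary; the fork owner and branch name in the second call are placeholders, since neither is stated in this conversation.

```python
RAW_HOST = "https://raw.githubusercontent.com"
# Fixture directory as given in the PR summary.
FIXTURE_DIR = "data/genomics/homo_sapiens/riboseq_expression/dotseq"


def raw_fixture_url(owner: str, repo: str, ref: str, filename: str) -> str:
    """Build a raw-content URL for one fixture file at a given git ref."""
    return f"{RAW_HOST}/{owner}/{repo}/{ref}/{FIXTURE_DIR}/{filename}"


# Canonical location once the fixtures are on the modules branch.
canonical_url = raw_fixture_url("nf-core", "test-datasets", "modules", "samplesheet.csv")

# Temporary location while the PR is open. "example-fork" and
# "example-branch" are hypothetical placeholders, not real names.
pinned_url = raw_fixture_url("example-fork", "test-datasets", "example-branch", "samplesheet.csv")
```

Reverting the temporary commit then amounts to swapping the owner and ref back to the canonical pair.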
Replaces the upstream featureCounts table and flattened GTF/BED with a tidy per-ORF count matrix (counts.tsv.gz) and per-ORF annotation TSV (annotation.tsv.gz) so the dotseq module can call DOTSeqDataSetsFromSummarizeOverlaps() directly. ORF identifiers and sample columns are preserved; both files are derived from the same DOTSeq cell_cycle_subset cohort.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords added a commit to pinin4fjords/nf-core-modules that referenced this pull request on May 22, 2026
Aligns the module's input contract with deltate / anota2seq so that consumers can dispatch between the three ORF-DTE methods without maintaining a separate prep step for dotseq. The four featureCounts/GTF/BED inputs collapse to a per-ORF count matrix (orf_id + sample columns) plus a per-ORF annotation TSV (orf_id + gene_id + optional orf_type/coords). The R template now calls DOTSeqDataSetsFromSummarizeOverlaps() and builds the required GRanges in-process from the annotation TSV; the model fit, contrast tables, and plotDOT outputs are unchanged. Test fixtures updated alongside in nf-core/test-datasets#2072 (commit 8c9b27c).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
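The two-file input contract described in this commit lends itself to a small pre-flight check. The sketch below is an illustration only and is not part of the module: the column names orf_id and gene_id come from the commit message, while the helper names, the sample IDs, and the orf_type values in the example tables are hypothetical.

```python
import csv
import io

# Hypothetical example tables; IDs, counts, and orf_type values are made up.
COUNTS_TSV = "orf_id\tRUN001\tRUN002\nGENE1:O001\t10\t3\nGENE1:O002\t0\t7\n"
ANNOTATION_TSV = (
    "orf_id\tgene_id\torf_type\n"
    "GENE1:O001\tGENE1\ttypeA\n"
    "GENE1:O002\tGENE1\ttypeB\n"
)


def read_tsv(text):
    """Return (column names, rows as dicts) for tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    return reader.fieldnames or [], rows


def check_contract(counts_text, annotation_text):
    """Return a list of problems; an empty list means the pair looks consistent."""
    count_cols, count_rows = read_tsv(counts_text)
    annot_cols, annot_rows = read_tsv(annotation_text)
    problems = []
    if "orf_id" not in count_cols:
        problems.append("count matrix lacks an orf_id column")
    if len(count_cols) < 2:
        problems.append("count matrix has no sample columns")
    for required in ("orf_id", "gene_id"):
        if required not in annot_cols:
            problems.append(f"annotation lacks a {required} column")
    if problems:
        return problems  # row-level checks need the key columns
    annotated = {row["orf_id"] for row in annot_rows}
    missing = [row["orf_id"] for row in count_rows if row["orf_id"] not in annotated]
    if missing:
        problems.append(f"{len(missing)} ORF(s) have counts but no annotation")
    return problems


contract_problems = check_contract(COUNTS_TSV, ANNOTATION_TSV)
```

Optional columns such as orf_type or coordinates are deliberately ignored, since the commit message marks them as optional.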
Summary
Test fixtures for nf-core/modules#11742 (dotseq/dotseq) under data/genomics/homo_sapiens/riboseq_expression/dotseq/.
DOTSeq is a Bioconductor package for differential ORF usage (DOU) + ORF-level differential translation efficiency (DTE). Its API takes a featureCounts-format ORF count table plus a flattened ORF GTF/BED pair, so the existing riboseq_expression/salmon-gene-level fixtures don't satisfy the input contract. Rather than build new featureCounts + ORF annotations from the existing BAMs (which would require an ORF caller + featureCounts pass), this PR drops in the small cell_cycle_subset bundled with the DOTSeq package - DOTSeq's own example data, already in the exact shape its functions expect.
Files
- featureCounts.cell_cycle_subset.txt.gz
- gencode.v47.orf_flattened_subset.gtf.gz
- gencode.v47.orf_flattened_subset.bed.gz
- metadata.txt.gz: columns run, strategy, replicate, treatment, condition
- samplesheet.csv: from metadata.txt.gz (12 rows), derived in this PR so the module can consume it without an extra header step
Total: 252 KB across 5 files.
Why these specific files
DOTSeq's DOTSeqDataSetsFromFeatureCounts() requires four inputs:
- count_table with featureCounts annotation columns (Geneid, Chr, Start, End, Strand, Length) followed by per-sample counts.
- flattened_gtf with gene_id + exon_number attributes (DOTSeq uses these to name ORFs as gene_id:O###).
- flattened_bed matching the GTF.
- condition_table with run, strategy, replicate, condition columns.
The first four files are reused verbatim from DOTSeq's inst/extdata/. samplesheet.csv is the only derived file: metadata.txt.gz ships as a 5-column whitespace table with no header, so the module test would need to add a header before consuming it. To keep the nf-test minimal we ship the headered, chx-filtered version (12 rows matching the 12 sample columns in the featureCounts file) directly.
Source & licence
Bundled inst/extdata of compgenom/DOTSeq (Bioconductor 3.23 release). Upstream data is from Ly et al. 2024 (GSE231096 / SRR242304XX cell-cycle Ribo-seq + RNA-seq cohort), captured as the DOTSeq author's example dataset. Package licence: MIT.
Test plan
- modules/nf-core/dotseq/dotseq/tests/main.nf.test in nf-core/modules#11742 consumes these fixtures via raw.githubusercontent.com URLs pinned to this PR's branch; URLs to be updated to modules after this PR merges
- nf-core modules test --profile docker dotseq/dotseq runs DOTSeq's DOU + DTE + per-strategy DESeq2 fits + plots end-to-end in ~4 min on a c5.9xlarge against this fixture set
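Before running the full module test, the fixture set can be sanity-checked against the layout described above: six featureCounts annotation columns followed by one column per sample, with as many sample columns as samplesheet rows. The sketch below is an assumption-laden illustration rather than part of the test plan; the helper names and example contents are hypothetical, and it compares counts only, because the PR does not say whether the sample column names equal the run values.

```python
import csv
import io

# Annotation columns that precede the per-sample counts in featureCounts output.
FEATURECOUNTS_ANNOTATION = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]

# Hypothetical miniature inputs; the real table has 12 sample columns.
FEATURECOUNTS_TEXT = (
    "# Program:featureCounts; Command:hypothetical\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tRUN001\tRUN002\n"
    "ORF_A\tchr1\t100\t400\t+\t301\t10\t3\n"
)
SAMPLESHEET_TEXT = (
    "run,strategy,replicate,condition\n"
    "RUN001,ribo,1,mitotic\n"
    "RUN002,rna,1,mitotic\n"
)


def sample_columns(featurecounts_text):
    """Return the per-sample column names from a featureCounts table."""
    # featureCounts writes a leading comment line starting with '#'.
    lines = [ln for ln in featurecounts_text.splitlines() if ln and not ln.startswith("#")]
    if not lines:
        raise ValueError("no header line found")
    header = lines[0].split("\t")
    n = len(FEATURECOUNTS_ANNOTATION)
    if header[:n] != FEATURECOUNTS_ANNOTATION:
        raise ValueError(f"unexpected annotation columns: {header[:n]}")
    return header[n:]


def fixtures_agree(featurecounts_text, samplesheet_text):
    """True when the sample-column count equals the samplesheet row count."""
    rows = list(csv.DictReader(io.StringIO(samplesheet_text)))
    return len(sample_columns(featurecounts_text)) == len(rows)


agree = fixtures_agree(FEATURECOUNTS_TEXT, SAMPLESHEET_TEXT)
```

For the real fixtures both sides would be expected to report 12.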