-
Notifications
You must be signed in to change notification settings - Fork 955
Fix file extension for module PBMARKDUP #9784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refines the PBMARKDUP module’s handling of input file extensions so that outputs follow the tool’s convention when given multiple input types, including compressed FASTA/FASTQ, and updates the associated tests and snapshots accordingly.
Changes:
- Update
PBMARKDUPprocess logic to derive the output suffix from the actual input filenames, including correctly preserving compound extensions such as.fastq.gzand.fasta.gz. - Adjust nf-test specifications for
pbmarkdupto cover multiple input patterns (single FASTA, multiple BAMs, and mixed FASTA/FASTQ inputs with dupfile+log options). - Regenerate the nf-test snapshot file with new test names, expected outputs (including dupfile and log channels), and newer
nf-testand Nextflow version metadata.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
modules/nf-core/pbmarkdup/main.nf |
Changes suffix calculation to search the input list for FASTA/FASTQ filenames (including .gz), derive the appropriate extension fragment, and fall back to the first input’s extension, ensuring the module’s single-output filename matches the primary input type. |
modules/nf-core/pbmarkdup/tests/main.nf.test |
Renames and restructures tests to exercise multiple input types and the dupfile/log behaviour, including a new mixed FASTA/FASTQ test, while keeping stub and single-input FASTA coverage. |
modules/nf-core/pbmarkdup/tests/main.nf.test.snap |
Updates snapshot keys and expected outputs to align with the new test names and behaviours (including null.dup.bam dupfile outputs and test.pbmarkdup.log logs), and records the newer nf-test/Nextflow versions and timestamps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pbmarkdup_args = "--clobber --dup-file ${prefix}.dup.bam --log-level INFO" | ||
| } |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here pbmarkdup_args uses ${prefix} inside a Groovy double-quoted string, but prefix is not defined in this nf-test parameter scope, so it interpolates to the literal string null (reflected by the null.dup.bam filenames in the snapshot). To avoid this surprising behaviour and better reflect the intention of tying the duplicate file name to the module’s output prefix, consider either using an explicit fixed dup file name here or passing the desired prefix via task.ext.prefix instead of relying on ${prefix} in the params string.
| } | ||
|
|
||
| test("acropora cervicornis - bam - multiple tests remove duplicates") { | ||
| test("homo sapiens - Multiple input types - with dupfile logfile and remove duplicates") { |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test description "homo sapiens - Multiple input types" doesn’t match the actual test data, which still uses genomics/eukaryotes/acropora_cervicornis/... paths, so the species label in the name is misleading. To keep tests self-describing and easier to interpret, consider either renaming the test to reference Acropora cervicornis or switching the input files to a Homo sapiens dataset.
| test("homo sapiens - Multiple input types - with dupfile logfile and remove duplicates") { | |
| test("acropora cervicornis - Multiple input types - with dupfile logfile and remove duplicates") { |
| // To allow multiple input types/files: (compressed) fasta, fastq, bam; Determine suffix from input file names | ||
| suffix = | ||
| input.find { | ||
| it.name ==~ /.*\.(fasta|fa|fna)(\.gz)?$/ }?.with { f -> | ||
| f.name.tokenize('.').takeRight(f.name.endsWith('.gz') ? 2 : 1).join('.') | ||
| } ?: | ||
| input.find { it.name ==~ /.*\.(fastq|fq)(\.gz)?$/ }?.with { f -> | ||
| f.name.tokenize('.').takeRight(f.name.endsWith('.gz') ? 2 : 1).join('.') | ||
| } ?: | ||
| input[0].extension |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new suffix resolution logic adds explicit handling for compressed FASTA/FASTQ inputs (e.g. .fasta.gz, .fastq.gz), but the updated pbmarkdup tests only cover uncompressed fasta/fastq and BAM inputs, so the .gz branch isn’t exercised. Given this module already has nf-test coverage, it would be worthwhile to add at least one test using a .fasta.gz or .fastq.gz input to verify that the output file name preserves the full compressed extension as intended.
Fix file extension to:
PR checklist
Closes #XXX
topic: versions- See version_topicslabelnf-core modules test <MODULE> --profile dockernf-core modules test <MODULE> --profile singularitynf-core modules test <MODULE> --profile condanf-core subworkflows test <SUBWORKFLOW> --profile dockernf-core subworkflows test <SUBWORKFLOW> --profile singularitynf-core subworkflows test <SUBWORKFLOW> --profile conda