feat: accept gzip-compressed fasta input #505

Open
tpall wants to merge 1 commit into WrightonLabCSU:dev from tpall:gzip-fasta-input
tpall commented May 8, 2026

Summary

Accept gzip-compressed fasta inputs (*.fa.gz / *.fna.gz / *.fasta.gz) without requiring users to decompress first. Plain fastas keep working unchanged.

This was one of the changes bundled into the now-closed #472, split out per the maintainer-friendly path agreed when refiling #503 / #504.

What changed

f397b4a3 feat: accept gzip-compressed fasta input

+38 / -6 lines, three files:

  • modules/local/rename/decompress_fasta.nf (new, 20 lines): wraps reformat.sh from the existing bbmap container (no new dependencies). Tagged process_tiny.
  • workflows/dram.nf: channel branch on .gz suffix, decompress only the gz branch, mix both back. Sample-name stripping is unified so sample.fa and sample.fa.gz yield identical downstream names.
  • nextflow_schema.json: input_fasta and fasta_fmt descriptions updated to mention gz support.
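
For orientation, a module of roughly this shape could look like the sketch below. This is not the file from the PR: the input variable `fasta_gz`, the output file name, and the reading of "tagged process_tiny" as a `label` directive are all assumptions, and the container directive is left out because its path is not given here. The only parts taken from the description are the process name, the `decompressed_fasta` emit name, and the use of `reformat.sh`.

```groovy
// Hypothetical sketch of a decompression module; not the PR's actual file.
process DECOMPRESS_FASTA {
    tag "${name}"
    label 'process_tiny'   // assumption: "tagged process_tiny" means this label
    // container directive omitted; the PR reuses the existing bbmap container

    input:
    // assumption: same (name, file) tuple shape as the plain branch
    tuple val(name), path(fasta_gz)

    output:
    // assumption: output is written as <name>.fa
    tuple val(name), path("${name}.fa"), emit: decompressed_fasta

    script:
    """
    reformat.sh in=${fasta_gz} out=${name}.fa
    """
}
```

Emitting the same `(name, file)` tuple shape as the plain branch is what would let the two channels be mixed back together without further mapping.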

How it works

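// Derive the sample name: drop a trailing .gz, then one of .fa/.fna/.fasta
ch_fasta_named = ch_fasta_raw.map { f ->
    def name = f.name.replaceAll(/\.gz$/, '').replaceAll(/\.(fa|fna|fasta)$/, '')
    tuple(name, f)
}

// Route by suffix: .gz files go to the gz branch, everything else to plain
ch_fasta_branched = ch_fasta_named.branch { entry ->
    gz:    entry[1].name.endsWith('.gz')
    plain: true
}

// Decompress only the gz branch, then merge it back with the plain files
DECOMPRESS_FASTA( ch_fasta_branched.gz )
ch_fasta = DECOMPRESS_FASTA.out.decompressed_fasta.mix( ch_fasta_branched.plain )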
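
The naming and routing rules can be checked in isolation with plain Groovy, outside Nextflow. In the sketch below, `sampleName` and `isGz` are hypothetical helpers that copy the two expressions from the closures above; they are not functions in the pipeline.

```groovy
// Hypothetical helper: same regex chain as the map closure above.
def sampleName(String fileName) {
    fileName.replaceAll(/\.gz$/, '').replaceAll(/\.(fa|fna|fasta)$/, '')
}

// Hypothetical helper: same predicate as the gz branch above.
def isGz(String fileName) {
    fileName.endsWith('.gz')
}

// Show the derived name and the branch each example file would take.
['sample.fa', 'sample.fa.gz', 'sample.fna.gz', 'sample.fasta'].each { f ->
    println "${f} -> name=${sampleName(f)}, branch=${isGz(f) ? 'gz' : 'plain'}"
}

// Compressed and plain variants of the same file share one sample name.
assert sampleName('sample.fa') == sampleName('sample.fa.gz')
```

Because `.gz` is stripped before the fasta extension, a name with inner dots such as `bin.1.fna.gz` keeps them and comes out as `bin.1`.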

The default --fasta_fmt '*.f*' already matches both plain and .gz files, so users with a mixed directory don't need to change their launch.
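
The glob claim can be illustrated with Java's NIO path matcher, called here from plain Groovy. This assumes the pattern is applied to the file name with Java glob semantics; how the pipeline combines `--fasta_fmt` with the input directory is not shown in this PR, and the `matches` closure is a hypothetical helper.

```groovy
import java.nio.file.FileSystems
import java.nio.file.Paths

// Assumption: '*.f*' is evaluated against the bare file name using Java glob rules.
def matcher = FileSystems.getDefault().getPathMatcher('glob:*.f*')
def matches = { String name -> matcher.matches(Paths.get(name)) }

// Plain and gzip-compressed fastas both satisfy the default pattern.
assert matches('sample.fa')
assert matches('sample.fasta')
assert matches('sample.fa.gz')
assert matches('sample.fna.gz')

// A file with no ".f" segment does not.
assert !matches('notes.txt')
```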

Test plan

  • nextflow inspect parses cleanly.
  • JSON schema valid.
  • HPC run with a directory containing both *.fa and *.fa.gz: confirm both end up annotated identically and DECOMPRESS_FASTA only fires on the gz branch.

🤖 Generated with Claude Code

Adds a small DECOMPRESS_FASTA module (`reformat.sh` from the bbmap
container that other modules already use) and routes only `.gz`
inputs through it via a channel branch on the `.gz` suffix. Plain
fastas pass through unchanged.

Sample-name normalisation strips both the trailing `.gz` (if present)
and one of `.fa`/`.fna`/`.fasta` so `sample.fa` and `sample.fa.gz`
yield the same downstream name. Outputs are identical regardless of
input compression.

Default `--fasta_fmt '*.f*'` already matches both plain and `.gz`
files; schema description updated to mention this explicitly.

Files:
  modules/local/rename/decompress_fasta.nf  (new, 20 lines)
  workflows/dram.nf                          (channel branch + mix)
  nextflow_schema.json                       (description updates)

github-project-automation (bot) moved this to To Sort in DRAM on May 8, 2026