Closed
92 commits
976c683
updated config files, added array size parameter for cluster execution
tpall Nov 20, 2025
ef13eed
updated nextflow.config
tpall Nov 20, 2025
d9a3bc6
Swap output assignments for rRNA and tRNA collections
tpall Nov 21, 2025
a087cf4
Merge branch 'dev' of https://github.com/WrightonLabCSU/DRAM into dev
tpall Nov 21, 2025
f5697b2
Merge branch 'dev' of https://github.com/tpall/DRAM into dev
tpall Nov 21, 2025
63ea268
Refactor distill script and configuration for improved clarity and fu…
tpall Nov 25, 2025
7414979
Refactor input and output path definitions for consistency in the SUM…
tpall Nov 26, 2025
a77c29e
Fix conditional check for gene columns in genome summary export to pr…
tpall Nov 26, 2025
e8b0e95
Refactor channel usage for consistency across workflows and improve r…
tpall Nov 26, 2025
4418739
Update SUMMARIZE module to use parameterized fasta column for grouping
tpall Nov 27, 2025
cd3b7ac
Fix closure in QC workflow
tpall Nov 28, 2025
ed054bd
Fix closure in DB_SEARCH workflow
tpall Nov 28, 2025
d39ff14
Updated combine_annotations.py to fix binwise summary. TODO: getting…
tpall Dec 1, 2025
64ab39e
Add QC:COLLECT_RNA to array pattern
tpall Dec 18, 2025
1ee2aea
Merge branch 'dev' of https://github.com/WrightonLabCSU/DRAM into dev
tpall Dec 23, 2025
2c93e2a
Merge branch 'dev' of https://github.com/WrightonLabCSU/DRAM into dev
tpall Dec 23, 2025
702666f
Merge branch 'dev' of https://github.com/tpall/DRAM into dev
tpall Dec 23, 2025
69b23f2
Merge branch 'dev' of https://github.com/WrightonLabCSU/DRAM into dev
tpall Mar 3, 2026
23b1a18
feat: accept gzip-compressed fasta input
tpall Apr 24, 2026
3059c89
fix: register array_size in schema so it validates
tpall Apr 24, 2026
d00f417
chore: remove unused trees subsystem and DRAM-v1 legacy setup scripts
tpall Apr 24, 2026
8c7dcf8
chore: remove DRAM-v1 db_description_builder and db_utils
tpall Apr 24, 2026
721fcaa
fix(db_search): correct DB_channel_SETUP case mismatch
tpall Apr 25, 2026
f1a60ec
fix(db_search): correct formattedOutputchannels case mismatch
tpall Apr 25, 2026
051d70b
fix(annotate): drop MMSEQS_INDEX publishDir to save disk
tpall Apr 26, 2026
dffe5ff
fix(distill): call .keys() on dict, not list, in check_columns log
tpall Apr 27, 2026
9e8e899
fix(distill): use polars to read rrna/trna/quast TSVs
tpall Apr 27, 2026
5b951c1
fix(distill): convert rrna section in make_genome_stats to polars
tpall Apr 27, 2026
00993c7
fix(distill): write genome_stats.tsv via polars write_csv
tpall Apr 27, 2026
84dc9db
fix(distill): rewrite make_genome_summary on polars + rule_parser
tpall Apr 27, 2026
0e44754
chore(distill): drop pandas-era dead code
tpall Apr 27, 2026
b1d597f
feat(dramv): vendor amg_database.tsv from v1
tpall Apr 27, 2026
01f7628
feat(dramv): vendor v1 reference sets into utils.dramv_constants
tpall Apr 27, 2026
0984ec2
feat(dramv): add dramv_flags.py — compute amg_flags + is_transposon
tpall Apr 27, 2026
34d0320
feat(dramv): add DRAMV_FLAGS process and wire into ANNOTATE
tpall Apr 27, 2026
dab18df
feat(dramv): add --amg_only mode to distill.py
tpall Apr 27, 2026
f255dde
feat(dramv): viral-mode pipeline defaults and SUMMARIZE wiring
tpall Apr 27, 2026
0cfb561
test(dramv): unit tests for compute_flags + read_scaffold_lengths
tpall Apr 27, 2026
a51d1f3
chore(dramv): register use_dramv and amg_length_from_end in nf schema
tpall Apr 27, 2026
c40a961
fix(summarize): stageAs distinct names for optional rrna/trna/quast i…
tpall Apr 27, 2026
9724960
fix(summarize): drop ext.args groupby_column override
tpall Apr 27, 2026
5a6ff33
docs(dramv): README viral-mode example + CHANGELOG entry
tpall Apr 27, 2026
413ffa3
chore(dramv): publish annotations_with_flags.tsv under ANNOTATE/
tpall Apr 28, 2026
92a4af7
feat(dramv): auto-enable use_pfam under use_dramv
tpall Apr 28, 2026
9e8bf61
Merge pull request #1 from tpall/feature/dramv-phase1
tpall Apr 28, 2026
e20ce2f
docs(readme): surface DRAM-v Phase 1 viral mode in intro and Quick Links
tpall Apr 28, 2026
2424c2b
feat(input): accept single fasta file as --input_fasta
tpall Apr 28, 2026
c44616f
fix(mmseqs_index): bump resources to process_small + enable retry
tpall Apr 28, 2026
168b70b
fix(hmmsearch): allow OOM retry and bump KOFam to process_medium
tpall Apr 29, 2026
0e5e1c8
feat(hmmsearch): chunk KOFam/VOG queries for parallel array execution
tpall Apr 29, 2026
dc83519
fix(combine_annotations): allow OOM retry and bump to process_big
tpall Apr 29, 2026
3531e39
fix(config): inline manifest refs for V2 config parser
tpall May 5, 2026
b2175d6
fix(config): make CONSTANTS work under V2 config parser
tpall May 5, 2026
3f873e3
fix(config): inline groupby_column to bypass cross-file params ref
tpall May 5, 2026
0a0c9a3
fix(config): enable nf-schema lenientMode for CLI boolean flags
tpall May 5, 2026
8333d3a
fix(config): disable SLURM job arrays by default
tpall May 5, 2026
b97fda4
chore(gitignore): ignore misc/ for local reference material
tpall May 6, 2026
3405048
feat(dramv): flag essential viral function rows per Martin 2025
tpall May 6, 2026
bec99db
fix(dramv): correct malformed identifiers in amg_database.tsv
tpall May 6, 2026
b3037c4
feat(dramv): add N flag for essential viral function genes
tpall May 6, 2026
e3fb1ca
feat(distill): exclude N-flagged genes from --amg_only output
tpall May 6, 2026
5839d34
docs(readme): document N flag and updated --amg_only filter
tpall May 6, 2026
5c256d4
fix(validation): coerce integer CLI params before nf-schema check
tpall May 7, 2026
58ca2b7
fix(validation): relax integer schema types for CLI-overridable params
tpall May 7, 2026
4e4fbc4
chore(config): tighten vog_e_value default to 1e-10
tpall May 6, 2026
9e43dc7
feat(dramv): register vog_id/vog_ids in dramv_flags ID_EXPR_DICT
tpall May 6, 2026
6930099
feat(dramv): auto-enable use_vog under use_dramv
tpall May 6, 2026
6fb6407
fix(validation): widen boolean schema types and coerce CLI strings
tpall May 7, 2026
a748943
fix(validation): wrap CLI coercion in a function called from workflow
tpall May 7, 2026
874e530
feat(dramv): V flag in compute_flags (Phase 2b)
tpall May 7, 2026
3714e45
Merge pull request #3 from tpall/dramv-vogdb-plumbing
tpall May 7, 2026
58f9f50
Merge pull request #4 from tpall/dramv-vflag
tpall May 7, 2026
197b99d
feat(dramv): auxiliary_score column (Phase 2c)
tpall May 7, 2026
156e4eb
feat(dramv): geNomad adapter for auxiliary_score (Phase 2c follow-up)
tpall May 7, 2026
2073f4e
feat(dramv): accept multiple geNomad genes TSVs (multi-sample support)
tpall May 7, 2026
3530d2c
fix(schema): drop file-path/exists constraint on genomad_genes
tpall May 7, 2026
87b3229
fix(dramv): correct genomad_genes path interpolation in staged module
tpall May 8, 2026
cafd9ad
fix(dramv): adapt geNomad parser to modern schema (0/1 int, marker su…
tpall May 8, 2026
1a0656c
Merge pull request #5 from tpall/dramv-aux-score
tpall May 8, 2026
775c5c7
fix(dramv): provirus position-join for geNomad → DRAM gene id mapping
tpall May 8, 2026
7b0a4b4
feat(distill): --max_auxiliary_score filter on --amg_only (Phase 2d)
tpall May 8, 2026
5406731
fix(dramv): hybrid join (direct gene-id first, position-overlap fallb…
tpall May 8, 2026
4558df7
Merge pull request #6 from tpall/dramv-amg-only-filter
tpall May 8, 2026
6cfa83b
fix(dramv): robust NO_FILE sentinel filter (don't pass to script)
tpall May 8, 2026
ada8912
fix(distill): cast query_id to Utf8 in make_genome_summary join
tpall May 8, 2026
edb6728
fix(schema): relax max_auxiliary_score type to accept CLI string
tpall May 8, 2026
66d48d1
feat(dramv): catalog-mode geNomad adapter via filename-derived sample…
tpall May 8, 2026
105f457
Merge pull request #7 from tpall/dramv-catalog-prefix
tpall May 8, 2026
e9cd2e5
docs(readme): elevate DRAM-v guide to its own top-level chapter
tpall May 8, 2026
d578ba9
docs(readme): drop nf-virome catalog-builder pointer
tpall May 8, 2026
d5b00d9
docs(readme): point DRAM-v examples at tpall/DRAM dev (not upstream)
tpall May 8, 2026
1248bc1
Merge pull request #8 from tpall/docs-dramv-chapter
tpall May 8, 2026
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -56,3 +56,6 @@ nextflow-local.config

# scratch folder
scratch/

# local-only reference material (papers, notes)
misc/
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,23 @@

All notable changes to this project will be documented in this file.

## Unreleased (feature/dramv-phase1)

### Features

- **DRAM-v Phase 1: viral mode for geNomad+CheckV catalogs.** A new `--use_dramv` flag adds AMG flagging and viral-flavoured distillation on top of the bacterial annotate pipeline, no VirSorter affi-contigs required.
- New `DRAMV_FLAGS` process runs after `COMBINE_ANNOTATIONS` and appends `amg_flags` (M/K/E/A/P/T/F/B per DRAM v1 conventions, less the `V` flag pending VOGdb integration) and `is_transposon` columns to `raw-annotations.tsv`.
- `distill.py --amg_only` filters annotations to AMG candidates (`M` set, `A`/`P`/`T` clear), restricts the distillate form to `potential_amg=TRUE` rows, and collapses them into a single `AMG` Excel sheet.
- Viral mode forces `groupby_column=scaffold`, skips QUAST and the rRNA/tRNA collectors (none of which align with per-vMAG aggregation), and runs SUMMARIZE in `--amg_only` mode.
- Bundled assets: `bin/assets/amg_database.tsv` (ported verbatim from DRAM v1) and `bin/utils/dramv_constants.py` (TRANSPOSON_PFAMS, CELL_ENTRY_CAZYS, VIRAL_PEPTIDASES_MEROPS).
- Pytest unit suite at `tests/unit/test_dramv_flags.py` covering individual flag firing, B-flag scaffold boundaries, sub-3-gene scaffolds, K-forces-M, E (verified AMG), F window, and FASTA parsing.

### Bug Fixes

- `bin/distill.py`: rewrote the pandas-era summarisation path on top of polars + `rule_parser.evaluate_rules_on_anno`, dropped the broken pandas `write_summarized_genomes_to_xlsx` shadow, fixed `bin_taxnomy` typo, swapped `pd.read_csv` for `pl.read_csv` for rrna/trna/quast, and converted the rrna section of `make_genome_stats` to polars.
- `modules/local/distill/distill.nf`: stage rrna / trna / quast inputs under distinct names so the same `default_sheet` dummy in viral mode no longer triggers a Nextflow input-name collision.
- `conf/modules.config`: dropped the `task.ext.args = "--groupby_column …"` SUMMARIZE override that double-defined the flag and silently won the click race.

## 2.0.0-beta24 - 2026-02-03

[3659fda](https://github.com/WrightonLabCSU/DRAM/commit/3659fdaa0f9779108840e3bbf97c6d196b37a7d3)...[32d0527](https://github.com/WrightonLabCSU/DRAM/commit/32d05274be6eaeaed48de6bb5a047bd67f21fea1)
95 changes: 95 additions & 0 deletions README.md
@@ -8,6 +8,8 @@

DRAM v2 (Distilled and Refined Annotation of Metabolism Version 2) is a tool for annotating metagenomic and genomic assembled data (e.g. scaffolds or contigs) or called genes (e.g. nucleotide or amino acid format). DRAM annotates MAGs using [KEGG](https://www.kegg.jp/) (if provided by the user), [UniRef90](https://www.uniprot.org/), [PFAM](https://pfam.xfam.org/), [dbCAN](http://bcb.unl.edu/dbCAN2/), [RefSeq viral](https://www.ncbi.nlm.nih.gov/genome/viruses/), [VOGDB](http://vogdb.org/) and the [MEROPS](https://www.ebi.ac.uk/merops/) peptidase database as well as custom user databases.

Viral catalogs from a typical geNomad → CheckV pipeline are also supported via **DRAM-v Phase 1 viral mode** (`--use_dramv true`): per-vMAG (per-scaffold) AMG flagging (M/K/E/A/P/T/F/B per the v1 conventions) and a viral-flavoured distillate, no VirSorter affi-contigs file required. See the "Viral mode" example below.

DRAM is run in four stages:
1) Gene Calling Prodigal - genes are called on user-provided scaffolds or contigs
2) Gene Annotation - genes are annotated with a set of user-defined databases
@@ -26,6 +28,7 @@ For more detail on DRAM and how DRAM v2 works please see our DRAM products:
- [Usage Examples](https://dramit.readthedocs.io/en/latest/usage.html)
- [Parameter API](https://dramit.readthedocs.io/en/latest/params_doc.html)
- [Rules API](https://dramit.readthedocs.io/en/latest/rules_parser.html)
- [DRAM-v (Viral mode)](#dram-v-viral-mode) — per-sample / catalog launches, flag reference, `auxiliary_score`, geNomad adapter

## Example Usage

@@ -70,6 +73,98 @@ nextflow run -bg WrightonLabCSU/DRAM \
-profile singularity,full_mode
```

8) **Viral mode (DRAM-v) — AMG flags on geNomad+CheckV viral contigs.** See the dedicated [DRAM-v (Viral mode)](#dram-v-viral-mode) chapter below for the full guide (per-sample vs catalog, flag reference, `auxiliary_score`, and the geNomad adapter).

## DRAM-v (Viral mode)

DRAM-v adds three columns to the per-gene table and ships an AMG-filtered distillate. Designed for input from a geNomad → CheckV pipeline (no VirSorter affi-contigs file required).

> **Note.** DRAM-v Phase 2 (V flag, `auxiliary_score`, geNomad adapter, `--max_auxiliary_score`, `--genomad_filename_prefix`) currently lives on this fork's `dev` branch only — `nextflow run WrightonLabCSU/DRAM ...` will not recognise these flags. Launch with `nextflow run tpall/DRAM -r dev ...` until the changes are upstreamed.

Two run modes — pick whichever matches your input:

- **Per-sample mode** — DRAM runs separately on each sample's filtered viral fasta (`<sample>_filtered.fna`). Used during development, smoke testing, or when you want per-sample annotations without going through clustering.
- **Catalog mode** — DRAM runs once on a clustered vOTU catalog whose contigs were renamed `<sample>_<orig>` upstream. Standard MIUViG / Sullivan-lab production path, recommended for any multi-sample analysis.

### What `--use_dramv` produces

Three new columns in `ANNOTATE/raw-annotations.tsv` and `ANNOTATE/annotations_with_flags.tsv`:

| Column | Type | Meaning |
| --- | --- | --- |
| `amg_flags` | str | concatenated single-letter flags in v1 order: `M K E V A P T F B`, plus a non-v1 `N` (see below) |
| `is_transposon` | bool | the gene's Pfam hits intersect a curated transposon set |
| `auxiliary_score` | int 1-5 | v1 flank-confidence score; lower = stronger viral context |

The `--amg_only` distillate (a single `AMG` sheet in `metabolism_summary.xlsx`, one count column per scaffold) keeps rows where `amg_flags` contains `M`, lacks `A`/`P`/`T`/`N`, **and** `auxiliary_score ≤ --max_auxiliary_score` (default 3). Set `--max_auxiliary_score 5` to disable the score filter.
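
The row filter described above can be sketched in plain Python. This is a minimal illustration under the README's wording, not the code in `distill.py`; the function name `keep_for_amg_only` and the tuple layout of `rows` are hypothetical.

```python
def keep_for_amg_only(amg_flags: str, auxiliary_score: int,
                      max_auxiliary_score: int = 3) -> bool:
    """Return True if a row would pass the --amg_only filter as the README describes it.

    Hypothetical helper: amg_flags is assumed to be a string of single-letter flags.
    """
    flags = set(amg_flags)
    has_metabolism = "M" in flags
    # Any of A, P, T or N disqualifies the row.
    has_excluding_flag = bool(flags & {"A", "P", "T", "N"})
    return (has_metabolism
            and not has_excluding_flag
            and auxiliary_score <= max_auxiliary_score)


# (amg_flags, auxiliary_score) pairs; illustrative values only.
rows = [("MK", 2), ("MKN", 1), ("MF", 4), ("A", 1)]
kept = [row for row in rows if keep_for_amg_only(*row)]
```

Because scores only range from 1 to 5, passing `max_auxiliary_score=5` makes the score test always true, which is why that value disables the score filter.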

Viral mode forces `groupby_column=scaffold` and skips QUAST + rRNA/tRNA collection — none of those make sense per-scaffold.

### Per-sample mode

```bash
nextflow run tpall/DRAM -r dev \
--input_fasta path/to/filtered_fastas_dir \
--fasta_fmt "*.fna" \
--outdir results/dramv \
--call --annotate --summarize \
--use_kofam --use_dbcan --use_merops \
--use_dramv true \
--genomad_genes "path/to/genomad_genes/*_virus_genes.tsv" \
-profile singularity
```

Gene ids match between DRAM and geNomad because both call genes from scratch on the same per-sample fasta, so the geNomad → DRAM join succeeds by direct gene-id equality.

### Catalog mode

```bash
nextflow run tpall/DRAM -r dev \
--input_fasta path/to/votu_catalog.fa \
--outdir results/dramv \
--call --annotate --summarize \
--use_kofam --use_dbcan --use_merops \
--use_dramv true \
--genomad_genes "path/to/genomad_genes/*_virus_genes.tsv" \
--genomad_filename_prefix true \
-profile singularity
```

Catalog contigs are typically renamed `<sample>_<orig>` upstream so cross-sample names don't collide. Without `--genomad_filename_prefix true`, geNomad's per-sample gene ids (`<orig>_<num>`) won't match DRAM's catalog gene ids (`<sample>_<orig>_<num>`) and `auxiliary_score` collapses to 5 for every row — which the default `--max_auxiliary_score 3` filter would then drop entirely. With the flag on, each `<sample>_virus_genes.tsv` filename is parsed to derive the prefix and the join lines up.
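
The prefix derivation can be pictured with a short Python sketch. It assumes only what the paragraph above states (files named `<sample>_virus_genes.tsv`, geNomad ids of the form `<orig>_<num>`, catalog ids of the form `<sample>_<orig>_<num>`); the helper names are hypothetical and this is not the adapter's actual code.

```python
from pathlib import Path

# Filename suffix taken from the README's <sample>_virus_genes.tsv pattern.
GENES_SUFFIX = "_virus_genes.tsv"


def sample_prefix(genes_tsv: str) -> str:
    """Derive the sample name from a geNomad genes TSV filename (hypothetical helper)."""
    name = Path(genes_tsv).name
    if not name.endswith(GENES_SUFFIX):
        raise ValueError(f"unexpected geNomad genes filename: {name}")
    return name[: -len(GENES_SUFFIX)]


def catalog_gene_id(genes_tsv: str, genomad_gene_id: str) -> str:
    """Prefix a per-sample gene id so it matches the catalog naming scheme."""
    return f"{sample_prefix(genes_tsv)}_{genomad_gene_id}"


example = catalog_gene_id("path/to/S1_virus_genes.tsv", "contig_7_3")
```

Without the prefix step, the id `contig_7_3` would never equal the catalog id `S1_contig_7_3`, which is the mismatch that pushes every score to 5.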

### Flag reference

| Flag | Meaning |
| --- | --- |
| **`M`** | metabolism — gene matches a curated metabolic-gene set (KEGG/Pfam/CAZy/EC) |
| **`K`** | gene id appears in `bin/assets/amg_database.tsv`. `K` force-sets `M` per v1 |
| **`E`** | same as `K`, restricted to verified AMG rows |
| **`V`** | gene has a VOG hit whose VOGdb `FunctionalCategory` is `Xr` (replication) or `Xs` (structure). Requires `vog_annotations_latest.tsv`; `--use_dramv` auto-enables `--use_vog` |
| **`A`** | gene matches the curated cell-entry CAZy set |
| **`P`** | gene matches the curated viral peptidase MEROPS set |
| **`T`** | any gene on the same scaffold has `is_transposon=true` |
| **`F`** | gene is within `--amg_length_from_end` (default 5000) bp of either contig end |
| **`B`** | gene is part of a 3-consecutive-`M`-gene window on the scaffold |
| **`N`** *(not v1)* | gene id hits `amg_database.tsv` rows where `essential_viral_function=TRUE`, per [Martin et al. 2025](https://doi.org/10.1038/s41564-025-02095-4). Paper-cautioned genes (DsrC, QueC/QueF, folA/folB/folK, RNR, mazG, pur*, etc.) likely essential for viral processes rather than auxiliary metabolism. `N`-flagged rows are excluded from the strict-AMG distillate but stay in the full `raw-annotations.tsv` for review. `N` does **not** force `M`. |
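
The two positional flags, `F` and `B`, lend themselves to a small sketch. The code below follows the table's wording only. The 1-based inclusive coordinates, the exact boundary comparison, and the function names are assumptions for illustration, not taken from `dramv_flags.py`.

```python
def near_contig_end(start: int, end: int, scaffold_length: int,
                    amg_length_from_end: int = 5000) -> bool:
    """F flag sketch: gene lies within amg_length_from_end bp of either scaffold end.

    Assumes 1-based inclusive gene coordinates (an assumption, not confirmed by the source).
    """
    return start <= amg_length_from_end or end > scaffold_length - amg_length_from_end


def b_flags(is_metabolic: list, window: int = 3) -> list:
    """B flag sketch: mark every gene that sits in a run of `window` consecutive M genes.

    `is_metabolic` lists, in scaffold order, whether each gene carries the M flag.
    """
    flagged = [False] * len(is_metabolic)
    for first in range(len(is_metabolic) - window + 1):
        if all(is_metabolic[first:first + window]):
            for i in range(first, first + window):
                flagged[i] = True
    return flagged


# A scaffold with fewer than three genes can never receive B.
short_scaffold = b_flags([True, True])
mixed_scaffold = b_flags([True, True, True, False, True])
```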

### `auxiliary_score` reference

Verbatim port of v1's flank-confidence algorithm (`mag_annotator/annotate_vgfs.py:calculate_auxiliary_scores`, commit `6cd68f9`). Lower is better:

| Score | Condition |
| --- | --- |
| 1 | hallmark virus genes on **both** flanks |
| 2 | hallmark on one side, viral-like on the other |
| 3 | viral-like on both sides |
| 4 | hallmark/viral-like on at least one side, **or** self carries one |
| 5 | first/last on scaffold, **or** no viral context anywhere |

Then the `B`-flag downgrade fires: any gene with `B` in `amg_flags` and `auxiliary_score < 4` is bumped to 4 (a stretch of three metabolic genes is suspicious in a true viral region).
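
The table and the `B`-flag rule can be read as the following Python sketch. It is an interpretation of the text above, not the ported v1 function. Several points are assumptions: a "flank" is taken to mean any gene upstream or downstream on the same scaffold, the table rows are evaluated top to bottom after the first/last check, and `-1` stands for a gene with no viral category.

```python
HALLMARK, VIRAL_LIKE, NO_CATEGORY = 0, 1, -1  # -1 is an assumed placeholder value


def auxiliary_scores(categories: list, b_flagged: list) -> list:
    """Score each gene on one scaffold, given per-gene categories in scaffold order."""
    scores = []
    last = len(categories) - 1
    for i, own in enumerate(categories):
        left = set(categories[:i])
        right = set(categories[i + 1:])
        if i == 0 or i == last:
            score = 5  # first or last gene on the scaffold
        elif HALLMARK in left and HALLMARK in right:
            score = 1
        elif (HALLMARK in left and VIRAL_LIKE in right) or \
                (VIRAL_LIKE in left and HALLMARK in right):
            score = 2
        elif VIRAL_LIKE in left and VIRAL_LIKE in right:
            score = 3
        elif (left | right | {own}) & {HALLMARK, VIRAL_LIKE}:
            score = 4  # viral context on one side only, or on the gene itself
        else:
            score = 5  # no viral context anywhere
        # B-flag downgrade: scores better than 4 are capped at 4.
        if b_flagged[i] and score < 4:
            score = 4
        scores.append(score)
    return scores


plain = auxiliary_scores([0, -1, 0], [False, False, False])
downgraded = auxiliary_scores([0, -1, 0], [False, True, False])
```

In this reading, the middle gene scores 1 when flanked by hallmarks and drops to 4 once it carries `B`.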

Without `--genomad_genes`, every gene scores 5 (v1 fallback for genomes without VirSorter input). Provide it to populate the score: geNomad's `virus_hallmark = 1` rows and marker suffix `.VV` / `.Vv` map to VirSorter hallmark (cat `0`); marker suffix `.vV` / `.vv` maps to viral-like (cat `1`).
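
As a rough illustration of that mapping, the sketch below turns one geNomad row into a VirSorter-style category. It treats `virus_hallmark = 1` and the `.VV` / `.Vv` suffixes as alternative triggers for the hallmark category, which is one reading of the sentence above. The function name, the `-1` value for "no category", and the marker strings in the example are hypothetical.

```python
def virsorter_category(virus_hallmark: int, marker: str) -> int:
    """Map a geNomad gene row to a VirSorter-style category (sketch, see assumptions above)."""
    if virus_hallmark == 1 or marker.endswith((".VV", ".Vv")):
        return 0   # hallmark
    if marker.endswith((".vV", ".vv")):
        return 1   # viral-like
    return -1      # assumed placeholder for "no viral category"


# Marker names below are made up for the example.
categories = [
    virsorter_category(1, ""),
    virsorter_category(0, "marker_a.Vv"),
    virsorter_category(0, "marker_b.vV"),
    virsorter_category(0, "marker_c.CC"),
]
```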

The DRAM-v ↔ geNomad join is hybrid: direct gene-id equality first, position-overlap on the parent assembly contig as fallback. CheckV-trimmed provirus contigs (`<contig>|provirus_<start>_<end>`) are handled in both paths.
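
Handling of the `<contig>|provirus_<start>_<end>` naming can be sketched as follows. The regular expression mirrors the pattern quoted above. The helper names and the 1-based coordinate conversion are assumptions for illustration rather than the fork's implementation.

```python
import re

# Pattern from the README: <contig>|provirus_<start>_<end>
PROVIRUS_RE = re.compile(r"^(?P<contig>.+)\|provirus_(?P<start>\d+)_(?P<end>\d+)$")


def parse_provirus(name: str):
    """Split a contig name into (parent_contig, (start, end)); span is None if untrimmed."""
    match = PROVIRUS_RE.match(name)
    if match is None:
        return name, None
    return match["contig"], (int(match["start"]), int(match["end"]))


def to_parent_position(position: int, span) -> int:
    """Shift a position on the trimmed provirus back onto the parent contig.

    Assumes 1-based coordinates on both sequences (an assumption).
    """
    if span is None:
        return position
    return span[0] + position - 1


parent, span = parse_provirus("contig_7|provirus_1001_5000")
```

Once gene coordinates are expressed on the parent contig, a position-overlap comparison between the two gene sets becomes possible even when the gene ids differ.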

## Nextflow Tips and Tricks

The `-resume` option in Nextflow DSL2 allows you to efficiently manage and modify your workflow runs:
Empty file added assets/NO_FILE
Empty file.
86 changes: 0 additions & 86 deletions assets/internal/generate_sql_database.py

This file was deleted.
