50 changes: 45 additions & 5 deletions CITATIONS.md
@@ -10,6 +10,10 @@

## Sources of data and tools

- **Ensembl VEP**

> McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4.

- Nanoseq masks

> Abascal, F., Harvey, L.M.R., Mitchell, E. et al. Somatic mutation landscapes at single-molecule resolution. Nature 593, 405–410 (2021). https://doi.org/10.1038/s41586-021-03477-4
@@ -34,15 +38,51 @@

> Stefano Pellegrini, Olivia Dove-Estrella, Ferran Muiños, Nuria Lopez-Bigas, Abel Gonzalez-Perez, Oncodrive3D: fast and accurate detection of structural clusters of somatic mutations under positive selection, Nucleic Acids Research, Volume 53, Issue 15, 28 August 2025, gkaf776, https://doi.org/10.1093/nar/gkaf776

- **dNdScv (tool)**

> Martincorena I, Raine KM, Gerstung M, et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 2017. https://doi.org/10.1016/j.cell.2017.09.042

- **Omega (dN/dS)**

> Repository: https://github.com/bbglab/omega (see repository for citation details)

- **OncodriveFML**

> Repository: https://github.com/bbglab/oncodrivefml (see repository for citation details)

- **OncodriveCLUSTL**

> Repository: https://github.com/bbglab/oncodriveclustl (see repository for citation details)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- Python
- SigProfilerAssignment, MatrixGenerator
- HDP
- OncodriveFML
- OncodriveCLUSTL
- **SigProfilerAssignment / SigProfilerMatrixGenerator**

> SigProfilerAssignment repository: https://github.com/AlexandrovLab/SigProfilerAssignment
>
> SigProfilerMatrixGenerator repository: https://github.com/AlexandrovLab/SigProfilerMatrixGenerator

- **HDP / mSigHdp**

> Repository: https://github.com/Nik-Zainal-Group/msigHdp (see repository for citation details)

- **bgreference / bgdata**

> bgreference repository: https://github.com/bbglab/bgreference
>
> bgdata repository: https://github.com/bbglab/bgdata

- **SAMtools**

> Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-9. doi: 10.1093/bioinformatics/btp352.

- **BEDTools**

> Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-842. doi: 10.1093/bioinformatics/btq033.

- **HTSlib / Tabix**

> Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27(5):718-719. doi: 10.1093/bioinformatics/btq671.

## Software packaging/containerisation tools

68 changes: 68 additions & 0 deletions docs/file_formatting.md
@@ -216,16 +216,84 @@ params {

### cosmic_ref_signatures

Path to the COSMIC SBS signature reference file used by SigProfilerAssignment. Use the SBS 96 context file for your genome build (e.g., `COSMIC_v3.4_SBS_GRCh38.txt`). The file is a tab-delimited matrix where the first column encodes mutation context and each additional column corresponds to a signature.

### wgs_trinuc_counts

Tab-delimited file with two columns:

```text
CONTEXT COUNT
ACA 118979126
ACC 67570313
...
```

The file represents the **total number of occurrences of each trinucleotide** in the reference genome. The pipeline provides a default example in `assets/trinucleotide_counts/`.
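
As a rough illustration of the format above, the following sketch reads a trinucleotide count table into a dictionary and rejects malformed rows. The function name `read_trinuc_counts` is hypothetical and not part of the pipeline; the sketch assumes the file carries the `CONTEXT`/`COUNT` header shown in the example.

```python
import csv
import io

VALID_BASES = set("ACGT")


def read_trinuc_counts(handle):
    """Parse a tab-delimited CONTEXT/COUNT table into {context: count}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != ["CONTEXT", "COUNT"]:
        raise ValueError(f"unexpected header: {header}")
    counts = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        context, count = row
        # Each context must be exactly three unambiguous bases.
        if len(context) != 3 or not set(context) <= VALID_BASES:
            raise ValueError(f"not a trinucleotide: {context!r}")
        if context in counts:
            raise ValueError(f"duplicated context: {context}")
        counts[context] = int(count)
    return counts


# Two rows taken from the example above.
example = io.StringIO("CONTEXT\tCOUNT\nACA\t118979126\nACC\t67570313\n")
trinuc_counts = read_trinuc_counts(example)
```

For a real file, pass an open file handle instead of the in-memory example.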

### cadd_scores

Path to the CADD "All possible SNVs" file (BGZIP-compressed TSV). This file is used for OncodriveFML scoring.

Recommended download: [CADD downloads](https://cadd.gs.washington.edu/download) → "All possible SNVs of GRCh38/hg38".

### cadd_scores_ind

Tabix index (`.tbi`) for the `cadd_scores` file. If you need to generate it:

```bash
bgzip -c whole_genome_SNVs.tsv > whole_genome_SNVs.tsv.gz
tabix -s 1 -b 2 -e 2 whole_genome_SNVs.tsv.gz
```

### dnds_ref_transcripts

Reference transcript annotation for dNdScv. For human, this is typically `RefCDS_human_latest_intogen.rda` from the dNdScv reference bundle (IntOGen mirror).

### dnds_covariates

dNdScv covariates file, usually `covariates_hg19_hg38_epigenome_pcawg.rda`. This provides covariate regression terms for mutation rate modeling.

### datasets3d

Directory containing precomputed Oncodrive3D datasets (structure and mutation mapping information). Build using the [Oncodrive3D dataset builder](https://github.com/bbglab/oncodrive3d?tab=readme-ov-file#building-datasets).

### annotations3d

Directory containing Oncodrive3D annotation datasets (protein annotations, stability data, etc.). Use the same build process as `datasets3d` to ensure compatibility.

### gff3_file

Optional local GFF3 file used by the DNA2PROTEINMAPPING step. If not provided, the pipeline downloads the GFF3 from Ensembl. If provided, it must match the Ensembl release, species, and genome build you are using (compressed `.gff3.gz` files are supported).

## Examples

### Blacklist mutations

```text
chr1:11107296_C>CA
chr1:11107450_C>A
chr1:11108379_T>A
```
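
The identifiers above appear to follow a `CHROM:POS_REF>ALT` pattern. A minimal sketch for splitting such an identifier into its parts, assuming that pattern holds for every entry, could look like this. The names `Mutation` and `parse_mutation` are hypothetical, and accepting `N` and `-` in alleles is an assumption rather than documented behaviour.

```python
import re
from typing import NamedTuple

# Assumed pattern: CHROM:POS_REF>ALT, as suggested by the examples above.
MUTATION_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<pos>\d+)_(?P<ref>[ACGTN-]+)>(?P<alt>[ACGTN-]+)$"
)


class Mutation(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_mutation(text):
    """Split one blacklist entry into chromosome, position, ref and alt."""
    match = MUTATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised mutation identifier: {text!r}")
    return Mutation(
        match.group("chrom"),
        int(match.group("pos")),
        match.group("ref"),
        match.group("alt"),
    )


insertion = parse_mutation("chr1:11107296_C>CA")
```

Running each line of a blacklist file through `parse_mutation` before launching the pipeline is one way to catch typos early.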

### Gene grouping

```text
chr15q chr15q IDH2 SIN3A
chr17p chr17p MAP2K4 NCOR1 TP53 USP6
```

### Custom annotation

See `assets/example_inputs/custom_regions.example.tsv` for a full example of a custom region annotation file.

### Omega hotspots / subgenic regions

Provide a BED file with 3 or 4 columns (`CHROM`, `START`, `END`, optional `NAME`):

```text
chr7 55191765 55191840 EGFR_L858R_region
chr12 25245300 25245380 KRAS_G12_region
```

You can expand these regions with `hotspot_expansion` and optionally generate complements with `subgenic_regions_complement`.
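
The sketch below parses BED lines with 3 or 4 columns, as described above. The `pad_region` helper is hypothetical: it assumes that expansion means padding both ends of a region by a fixed number of bases, which may not match how `hotspot_expansion` is actually applied by the pipeline.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    name: Optional[str]


def parse_bed_line(line):
    """Parse one BED line with CHROM, START, END and an optional NAME."""
    fields = line.split()  # BED fields are whitespace/tab separated
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 columns, got {len(fields)}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Sanity check chosen for this sketch: non-empty, non-negative interval.
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval: {start}-{end}")
    name = fields[3] if len(fields) == 4 else None
    return Region(chrom, start, end, name)


def pad_region(region, padding):
    """Hypothetical expansion: extend both ends by `padding` bases."""
    return region._replace(
        start=max(0, region.start - padding),
        end=region.end + padding,
    )


egfr = parse_bed_line("chr7\t55191765\t55191840\tEGFR_L858R_region")
egfr_padded = pad_region(egfr, 10)
```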
29 changes: 29 additions & 0 deletions docs/output.md
@@ -108,6 +108,29 @@ work/
.nextflow.log
```

### Output directory cheat sheet

| Output directory/file | Description |
| --- | --- |
| `sumannotation/` | Aggregated mutation annotations (one row per mutation) after VEP annotation and preprocessing. |
| `germline_somatic/` | Mutations labeled as germline vs somatic before strict cohort filtering. |
| `clean_somatic/` | Filtered somatic mutations used in downstream analyses. |
| `clean_germline_somatic/` | Filtered mutations retaining germline/somatic labels. |
| `annotatedepths/` | Depth tables per genomic position (used for depth-aware metrics). |
| `depthssummary/` | Cohort depth summaries (TSV + PDF plots). |
| `computeprofile/` | Mutational profiles, proportions, and `*.profile_stability.tsv` metrics. |
| `mutrate/` | Mutation density tables (per sample/group, depth-normalized). |
| `omega/` | dN/dS selection results using per-sample profiles. |
| `omegagloballoc/` | dN/dS selection results using global cohort profiles. |
| `absolutemutabilities/` | Expected mutability per site for selection analyses. |
| `sitecomparison/` | Observed vs expected mutability comparisons per site/residue. |
| `oncodrivefmlsnvs/` | OncodriveFML results. |
| `oncodrive3d/` | Oncodrive3D clustering results and plots. |
| `signatures_hdp/`, `sigprofilerassignment/`, `sigprobs/` | Mutational signature extraction/assignment outputs. |
| `plotmaf/`, `plotneedles/`, `plotselection/`, `plotsomaticmaf/` | Standard plotting outputs for mutation and selection summaries. |
| `qc/metrics_vs_depth/` | QC plots/tables comparing depth vs mutation density and omega. |
| `pipeline_info/` | Pipeline metadata and software versions. |

## Input and configuration

See the Usage docs for a detailed explanation of the required inputs and their formats, including the parameters for the four suggested running modes.
@@ -177,6 +200,8 @@ Optional:
- clean_somatic
- clean_germline_somatic

**PMEAN/PSTD fields:** if the input VCF contains read-position statistics (PMN/PST from deepUMIcaller), deepCSA stores them as `PMEAN` and `PSTD` in the mutation tables. When not available, these columns are set to `-1`.
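
When loading these tables downstream, the `-1` sentinel should not be treated as a real measurement. A minimal sketch of one way to handle it, assuming a tab-delimited mutation table, is shown below. The `MUT_ID` column and the function name `read_position_stats` are hypothetical and used only for illustration.

```python
import csv
import io

MISSING = -1.0  # sentinel used when PMN/PST were absent from the input VCF


def read_position_stats(handle):
    """Read a mutation table and map the -1 sentinel to None."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        for column in ("PMEAN", "PSTD"):
            value = float(record[column])
            record[column] = None if value == MISSING else value
        rows.append(record)
    return rows


# Illustrative table; MUT_ID is a made-up column name.
table = io.StringIO("MUT_ID\tPMEAN\tPSTD\nm1\t42.5\t10.1\nm2\t-1\t-1\n")
position_stats = read_position_stats(table)
```

Mapping the sentinel to `None` keeps missing values out of any averages computed over `PMEAN` or `PSTD`.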

## Basic analysis

### Key role
@@ -193,6 +218,8 @@ Optional:
- computeprofile
- mutrate

`computeprofile` also emits `*.profile_stability.tsv` files, which quantify how sensitive each mutational profile is to the addition of a single mutation per channel (see [Tools](tools.md#mutational-profile-stability)).

## Intermediate outputs

### Key role
@@ -287,6 +314,8 @@ Optional:

- Optionally, consider adding more plots.

Plotting scope can be controlled with `plot_only_allsamples`: when `true`, only cohort-level plots are generated; when `false`, plots are also produced for each defined subgroup.

### Outputs

- plotmaf
69 changes: 69 additions & 0 deletions docs/tools.md
@@ -8,6 +8,51 @@

Here, you can find an explanation of the different computations, tools or metrics implemented in deepCSA.

## Interpreting outputs (sanity checks and key metrics)

### Sanity checks / QC

Use these outputs to assess overall data quality before interpreting biological signals:

- **Depth summaries** (`depthssummary/`): verify consistent coverage across samples and genes.
- **Mutation density vs depth** (`qc/metrics_vs_depth/`): check that mutation density does not collapse in low-depth samples.
- **Omega QC** (`qc/metrics_vs_depth/` + `qc/annotated_omegas`): highlights genes/samples with unstable omega estimates.
- **Mutational profile stability** (`computeprofile/*.profile_stability.tsv`): higher deviations indicate unstable mutational profiles (see below).

### Omega vs omegagloballoc

- **`omega/`** uses **per-sample mutational profiles** and per-sample synonymous rates to estimate selection.
- **`omegagloballoc/`** uses a **global cohort mutational profile** and global synonymous rates (shared across samples), which stabilizes estimates in low-burden samples and facilitates cohort-level comparisons.

Use `omega` for sample-specific selection signals and `omegagloballoc` for conservative cohort-level estimates.

### Site selection values

Outputs in `sitecomparison/` and `sitecomparisongloballoc/` compare observed vs expected mutations per site or residue:

- `OBSERVED_MUTS`: number of observed mutations.
- `EXPECTED_MUTS`: expected mutations from mutability models.
- `OBS/EXP`: selection enrichment ratio.
- `p_value`: Poisson p-value for observing at least `OBSERVED_MUTS` given `EXPECTED_MUTS`.

The resolution is controlled by `site_comparison_grouping` (`site`, `aminoacid`, or `aminoacid_change`).
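
To make the `p_value` definition concrete, the sketch below computes the Poisson upper tail, P(X >= `OBSERVED_MUTS`) for X ~ Poisson(`EXPECTED_MUTS`), using only the standard library. It illustrates the definition given above and is not necessarily how the pipeline computes the value; the function name `poisson_upper_tail` is hypothetical.

```python
import math


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed < 0 or expected < 0:
        raise ValueError("observed and expected must be non-negative")
    if observed == 0:
        return 1.0  # every outcome is >= 0
    # Accumulate P(X = i) for i = 0 .. observed - 1, then take the complement.
    term = math.exp(-expected)  # P(X = 0)
    cumulative = term
    for i in range(1, observed):
        term *= expected / i  # P(X = i) from P(X = i - 1)
        cumulative += term
    return max(0.0, 1.0 - cumulative)


# One observed mutation where one is expected.
p_single = poisson_upper_tail(1, 1.0)
```

Taking the complement of a cumulative sum loses precision for very small p-values. If SciPy is available, `scipy.stats.poisson.sf(observed - 1, expected)` gives the same quantity with better numerical behaviour.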

### Mutational signatures

- **`sigprofilerassignment/`**: assignments of known COSMIC signatures; includes activity tables and plots.
- **`signatures_hdp/`**: extracted signatures using a hierarchical Dirichlet process.
- **`sigprobs/` / `muts2sigs/`**: per-mutation signature probabilities (useful for downstream stratification).

Interpret signature results alongside mutation counts and profile stability to avoid over-interpreting low-burden samples.

### Mutational profile stability

The file `*.profile_stability.tsv` is generated by adding a single mutation to each of the 96 SBS channels and measuring the L1 deviation from the original profile. Reported statistics include:

- `mean_deviation`, `min_deviation`, `max_deviation`, `std_deviation`

Lower deviations indicate a more stable (less noisy) profile.
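
The perturbation described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: it assumes profiles are normalized to proportions before comparison and that `std_deviation` is a population standard deviation. The function name `profile_stability` is hypothetical.

```python
import statistics


def profile_stability(counts):
    """Summarize the L1 shift caused by adding one mutation to each channel."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("profile needs at least one mutation")
    original = [c / total for c in counts]
    deviations = []
    for channel in range(len(counts)):
        perturbed = list(counts)
        perturbed[channel] += 1  # add a single mutation to this channel
        new_total = total + 1
        # L1 distance between the perturbed and original proportions.
        deviations.append(
            sum(abs(p / new_total - q) for p, q in zip(perturbed, original))
        )
    return {
        "mean_deviation": statistics.mean(deviations),
        "min_deviation": min(deviations),
        "max_deviation": max(deviations),
        "std_deviation": statistics.pstdev(deviations),  # assumed population SD
    }


# Toy two-channel profile; real profiles have 96 SBS channels.
toy_stats = profile_stability([3, 1])
```

Under these assumptions, adding one mutation to channel j of a profile with N mutations shifts the profile by 2(N - c_j) / (N(N + 1)), so deviations shrink roughly as 1/N. This is why low-burden samples show higher values.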

## Publications with detailed explanation

We are in the process of completing the documentation, but in the meantime you can check the recently published [paper and its supplementary material for more details](https://www.nature.com/articles/s41586-025-09521-x).
@@ -123,3 +168,27 @@ We provide two different strategies for signature analysis.
- Using a Hierarchical Dirichlet Process algorithm developed by Nicola Roberts and packaged by the McGranahan lab into a wrapper.

Additionally, you can run SigProfilerExtractor on the data, but this must be done outside the pipeline.

## Containers and reproducibility

deepCSA defines container images directly in module files and `conf/modules.config`. For bbglab-maintained images (`bbglab/*`), Dockerfile recipes are tracked in the lab repository: https://github.com/bbglab/containers-recipes. External images (e.g., `ferriolcalvet/*`, `rblancomi/*`, `biocontainers/*`) should be mirrored locally if strict reproducibility is required.

Key images used by the pipeline:

| Component | Image |
| --- | --- |
| Core utilities | `docker.io/bbglab/deepcsa-core:0.1.0` |
| Panel BED tools | `docker.io/bbglab/deepcsa_bed:latest` |
| Omega | `docker.io/bbglab/omega:0.2.1` |
| Oncodrive3D | `docker.io/bbglab/oncodrive3d:1.0.5` |
| Oncodrive3D (ChimeraX plots) | `docker.io/spellegrini87/oncodrive3d_chimerax:latest` |
| OncodriveFML | `docker.io/ferriolcalvet/oncodrivefml:latest` |
| OncodriveCLUSTL | `docker.io/ferriolcalvet/oncodriveclustl:latest` |
| SigProfilerAssignment | `docker.io/ferriolcalvet/sigprofiler_assignment:1.1.3` |
| SigProfilerMatrixGenerator | `docker.io/ferriolcalvet/sigprofilermatrixgenerator:1.3.5` |
| mSigHdp (HDP) | `docker.io/ferriolcalvet/msighdp:latest` |
| bbgregressions | `docker.io/rblancomi/bbgregressions:dev` |
| Ensembl VEP | `biocontainers/ensembl-vep:111.0--pl5321h2a3209d_0` (version depends on `vep_cache_version`) |
| SAMtools | `biocontainers/samtools:1.18--h50ea8bc_1` |

To override any image, set `process.container` or the relevant module label in your `nextflow.config`.