This image is a modular pipeline that facilitates the use of g:Profiler, PANTHER and GSEA for streamlined enrichment analysis. It includes support features for input building and output processing.
The documentation is available at /html/index.html...
First, you need to have Docker installed in your computing environment. If you don't, follow the installation guidelines on the pegi3s Bioinformatics Docker Images Project website: http://bdip.i3s.up.pt/
To pull the docker image you should run the following command:
docker pull pegi3s/auto-enrich
To run an analysis, you must set up the pipeline configuration file, name it config, and place it in a folder together with one or more input data files (such as a gene expression matrix, gene lists, GSEA preranked lists, or gene sets). This folder is the /your/data/directory used in the command below, and the pipeline needs both the config file and the input files there to work properly.
Detailed instructions are given in the documentation (open /html/index.html), which describes the available modules and their configurable parameters.
After setting up the required files, adapt and run the following command:
docker run --rm -v /your/data/directory:/data pegi3s/auto-enrich
In this command, replace /your/data/directory with the path to the folder that contains the input files for the pipeline.
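The check for the config file and the run itself can be wrapped in a small shell helper. This is a minimal sketch, not part of the image: the function names check_data_dir and run_auto_enrich are hypothetical, and the only requirement taken from the text above is that the data folder must contain a file named config.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the folder exists and holds a file named "config".
check_data_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "error: '$dir' is not a directory" >&2
        return 1
    fi
    if [ ! -f "$dir/config" ]; then
        echo "error: no 'config' file found in '$dir'" >&2
        return 1
    fi
    return 0
}

# Hypothetical wrapper around the documented docker command.
# The container is only started when the pre-flight check passes.
run_auto_enrich() {
    check_data_dir "$1" || return 1
    docker run --rm -v "$1":/data pegi3s/auto-enrich
}
```

Usage: run_auto_enrich /your/data/directory (pass an absolute path, since a relative name given to -v is not treated as a host folder).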
In each of the following test datasets, the input files are pre-configured inside the /inputs directory and the expected output is inside the /outputs directory.
To run the pipeline, adapt and run the following command:
docker run --rm -v /your/data/directory:/data pegi3s/auto-enrich
In this command, replace /your/data/directory with the path to the directory that contains the input files for the pipeline.
This test demonstrates how the auto-Enrich pipeline can be run using all modules to perform Over-Representation Analysis (ORA; with g:Profiler and PANTHER) and Gene Set Enrichment Analysis (GSEA).
The input includes a Gene Expression matrix from Mus musculus, the necessary configuration files, and two gene set files from the Mouse Collections of the Molecular Signatures Database (MSigDB).
Test files: test_data.zip
Contents inside the inputs directory of test_data.zip:
- expression_matrix.tsv: Gene expression data matrix (gene IDs and the respective per-sample gene expression data, from Nogueira-Rodrigues et al. (2022) https://www.researchgate.net/publication/357595784).
- config: Pipeline configuration file, set up to run input file generation for ORA and GSEA methods (Modules 1, 2, and 5), enrichment analysis (Modules 3, 4, and 6) on pre-selected sources, and output processing (Module 7).
- /gene_sets: Folder holding the Gene Sets from the MSigDB to be provided to the GSEA runs (includes the Reactome Pathways gene set: m2.cp.reactome.v2026.1.Mm.symbols.gmt; and the Gene Ontology gene set: m5.go.v2026.1.Mm.symbols.gmt)
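Before a GSEA run it can be useful to confirm that a gene set file is readable. The sketch below assumes the standard GMT layout (tab-separated: set name, description, then the gene symbols); gmt_summary is a hypothetical helper, not a pipeline command.

```shell
#!/bin/sh
# Hypothetical helper: print "<set name><TAB><number of genes>" for each GMT line.
# Assumed GMT layout: column 1 = set name, column 2 = description, columns 3+ = genes.
gmt_summary() {
    awk -F '\t' 'NF >= 3 { printf "%s\t%d\n", $1, NF - 2 }' "$1"
}
```

Usage: gmt_summary gene_sets/m5.go.v2026.1.Mm.symbols.gmt | head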
Pipeline behaviour workflow (tools="1,2,3,4,5,6,7"):
1. The gene expression matrix is processed to build a gene list (taken from the column index set in the variable gene) of the pre-selected differentially expressed genes, which are marked with 1 in the column whose index is set in the variable selected [Module 1]
2. The given input gene identifiers are mapped to GeneID, UniProtKB, HUGO Gene Symbol and Full Name [Module 2]
3. Over-Representation Enrichment Analysis is performed on the mapped gene list using the g:Profiler g:GOSt tool on the selected annotation sources (set in the variable gprofiler_dbs); outputs are processed and the annotations of the enriched terms are mapped [Module 3]
4. Over-Representation Enrichment Analysis is performed on the mapped gene list using the PANTHER Overrepresentation test on the selected annotation sources (set in the variable panther_dbs); outputs are processed and the annotations of the enriched terms are mapped [Module 4]
5. The gene expression matrix is partitioned into scored Preranked Gene lists (.rnk) by calculating the configured fold changes (set in the variables preranked1 and preranked2) between the computed averages of the experimental groups (set with the variables number_groups, number_samples, samples, groups, calculate_averages, isoform) [Module 5]
6. Preranked Gene Set Enrichment Analysis (GSEA) is executed on the previously generated Preranked Gene lists with the set parameters (method, gene_set, nperm; the others are left at their defaults) [Module 6]
7. The enrichment results from the different tools are intersected to find common results (intersection boolean variable), and individual tool results are filtered by the variable max_annot (maximum allowed enriched term size) [Module 7]
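The ideas behind steps 1, 5 and 7 can be illustrated with a few lines of shell and awk. This is a rough sketch under stated assumptions, not the pipeline's own code: the function names are hypothetical, the matrix is assumed to be tab-separated with one header row, column indices are assumed to be 1-based, the fold change is assumed to be log2(mean of group 1 / mean of group 2), and step 7 is assumed to compare one term ID per line.

```shell
#!/bin/sh
# Step 1 sketch: print the gene column for rows whose "selected" column equals 1.
# Arguments: matrix file, gene column index, selected column index.
select_genes() {
    awk -F '\t' -v g="$2" -v s="$3" 'NR > 1 && $s == 1 { print $g }' "$1"
}

# Step 5 sketch: log2 fold change between the averages of two sample groups,
# printed as "<gene><TAB><score>" and sorted by decreasing score.
# Arguments: matrix file, gene column, comma-separated columns of group 1, of group 2.
# Rows where either average is not positive are skipped (log2 is undefined there).
log2fc_rank() {
    LC_ALL=C awk -F '\t' -v g="$2" -v c1="$3" -v c2="$4" '
        function mean(cols,    n, i, idx, sum) {
            n = split(cols, idx, ",")
            sum = 0
            for (i = 1; i <= n; i++) sum += $(idx[i])
            return sum / n
        }
        NR > 1 {
            a = mean(c1); b = mean(c2)
            if (a > 0 && b > 0) printf "%s\t%.4f\n", $g, log(a / b) / log(2)
        }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr
}

# Step 7 sketch: print the lines of the second file that also occur in the first.
# Arguments: two files holding one term ID per line.
common_terms() {
    awk 'NR == FNR { seen[$0] = 1; next } $0 in seen' "$1" "$2"
}
```

Usage: log2fc_rank expression_matrix.tsv 1 3,4 5,6 > example.rnk, where the column numbers are placeholders to adapt to the actual matrix layout.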
Output directories after the run:
# Generated inputs for Enrichment Analysis
/data/
├── annotations/* → Sources used to map gene identifiers and term annotations
├── preranked_gene_lists/
│   └── selected_genes_list → Selected genes list
├── mapped_gene_lists/
│   └── selected_genes_list_map → Mapped selected genes list
└── gsea/
    ├── parameters_log2FC_A_SCI_A_Sham → GSEA parameters run files (one per set run)
    └── preranked_gene_lists/ → Generated Preranked Gene lists
        ├── log2FC_A_SCI_A_Sham.rnk
        ├── log2FC_M_SCI_M_Sham.rnk
        └── log2FC_A_Sham_M_Sham.rnk
# Enrichment Analysis results
/data/
├── gprofiler/
│   └── selected_genes_list/
│       ├── enrichment_fields.tsv → Enrichment results fields provided by g:Profiler
│       ├── enrichmed_terms_annotations.tsv → Enriched terms annotations mapping summary
│       └── source/annotations/* → Raw enriched terms annotations sources
├── panther/
│   └── selected_genes_list/
│       ├── enrichment_fields.tsv → Enrichment results fields provided by PANTHER
│       ├── enrichmed_terms_annotations.tsv → Enriched terms annotations mapping summary
│       └── source/annotations/* → Raw enriched terms annotations sources
└── gsea/
    └── results/
        ├── log2FC_A_SCI_A_Sham.combined.GseaPreranked/*
        ├── log2FC_M_SCI_M_Sham.combined.GseaPreranked/*
        └── log2FC_A_Sham_M_Sham.combined.GseaPreranked/
            ├── enrichment_fields.tsv → Enrichment results fields provided by GSEA
            ├── enrichmed_terms_annotations.tsv → Enriched terms annotations mapping summary
            ├── source/annotations/* → Raw enriched terms annotations sources
            └── raw_GSEA_output.zip → Raw GSEA output report files
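After a run, a quick way to confirm that the top-level result directories listed above were created is sketched below. The helper name missing_outputs is hypothetical, and the check only looks for the gprofiler, panther and gsea/results directories; it does not inspect their contents.

```shell
#!/bin/sh
# Hypothetical helper: print each expected result directory that is missing
# under the given data folder (prints nothing when all three are present).
missing_outputs() {
    base="$1"
    for sub in gprofiler panther gsea/results; do
        [ -d "$base/$sub" ] || echo "$sub"
    done
}
```

Usage: missing_outputs /your/data/directory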