An R package for extracting transcription start site (TSS) groups from GTF annotation files and generating extended promoter regions for genomic analysis.
PromoterScan leverages the proActiv package to identify and group transcription start sites, then creates extended promoter regions based on configurable upstream and downstream boundaries. The package provides comprehensive outputs including CSV files, BED files, and distribution plots for downstream analysis.
- Extract TSS groups from GTF annotation or TxDb SQLite files
- Group transcripts sharing promoter regions
- Generate extended promoter regions with configurable boundaries
- Optional constraint to first exon boundaries
- Export to multiple formats (CSV, BED)
- Generate distribution plots for quality control and visualization
- Support for multiple genome versions (GRCh38, GRCh37)
- Annotate high-confidence external promoters (protein-coding with CCDS or Ensembl canonical tag, non-internal)
- Compute TSS distances between promoters of the same gene
- Compute distance from each promoter to the nearest CpG island with name and width annotation
PromoterScan requires R >= 4.0.0 and several Bioconductor packages. Install dependencies first:
# Install BiocManager if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# Install Bioconductor dependencies
BiocManager::install(c(
  "AnnotationDbi",
  "GenomeInfoDb",
  "GenomicRanges",
  "IRanges",
  "proActiv",
  "rtracklayer",
  "S4Vectors",
  "txdbmaker"
))

# Install CRAN dependencies
install.packages(c("data.table", "dplyr", "ggplot2", "magrittr", "scales"))

# Install from GitHub
devtools::install_github("lucianhu/PromoterScan")
# Or from local source
devtools::install_local("path/to/PromoterScan")

library(PromoterScan)
# Step 1: Extract TSS groups from GTF file
tss_results <- run_tss_group_extraction(
  gtf_file = "path/to/annotation.gtf",
  species = "Homo_sapiens",
  output_dir = "tss_analysis"
)
# Step 2: Create extended promoter regions
promoter_results <- run_promoter_regions_extraction(
  tss_groups = tss_results,
  output_dir = "promoter_regions",
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38"
)

For improved performance with large annotations, you can create a TxDb SQLite database:
library(txdbmaker)
library(AnnotationDbi)
# Create TxDb from GTF
txdb <- makeTxDbFromGFF(
  file = "annotation.gtf",
  format = "gtf",
  dataSource = "GENCODE",
  organism = "Homo sapiens"
)
# Save to SQLite file
saveDb(txdb, file = "annotation.sqlite")
# Use in PromoterScan (gtf_file is always required for transcript metadata)
tss_results <- run_tss_group_extraction(
  gtf_file = "annotation.gtf",
  sqlite_file = "annotation.sqlite",
  species = "Homo_sapiens",
  output_dir = "tss_analysis"
)

run_tss_group_extraction(): Complete workflow for extracting TSS groups using proActiv.
Parameters:
- gtf_file: Path to GTF annotation file (required for transcript metadata)
- sqlite_file: Path to TxDb SQLite file (optional). If provided, it is used for the proActiv annotation instead of the GTF
- species: Species name for proActiv (default: "Homo_sapiens")
- output_dir: Output directory for results (default: "tss_analysis")
Returns: Data frame with TSS groups
Outputs:
- tss_groups.csv: TSS groups with metadata
- plots/tss_group_width_distribution.png/pdf: Width distribution plot
run_promoter_regions_extraction(): Complete workflow for creating extended promoter regions from TSS groups.
Parameters:
- tss_groups: Data frame from run_tss_group_extraction()
- output_dir: Output directory (default: "promoter_regions")
- upstream: Base pairs to extend upstream of TSS (default: 300)
- downstream: Base pairs to extend downstream of TSS (default: 100)
- genome_version: Genome version for chromosome sizes (default: "GRCh38")
- limited_first_exon: Whether to limit promoter to first exon end (default: FALSE)
Returns: List containing promoter_regions and bed_data
Outputs:
- promoter_regions.csv: Extended promoter regions
- promoter_regions.bed: BED format file
- plots/: Distribution plots
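To make the upstream and downstream parameters concrete, here is a minimal base-R sketch of a strand-aware extension followed by conversion to 0-based BED coordinates. It is not the package's implementation: extend_tss() and to_bed() are hypothetical helpers, the coordinate convention (start = TSS - upstream, end = TSS + downstream - 1 on the plus strand, mirrored on the minus strand) is an assumption borrowed from GenomicRanges::promoters(), and the score of 0 is a placeholder. The package's BED file also carries a promoterId column that this sketch omits.

```r
# Hypothetical helper, not part of PromoterScan: strand-aware extension of
# 1-based TSS positions, clipped to the chromosome bounds.
extend_tss <- function(tss, strand, upstream = 300, downstream = 100,
                       chrom_size = Inf) {
  plus <- strand == "+"
  # Upstream lies at lower coordinates on "+" and at higher coordinates on "-".
  start <- ifelse(plus, tss - upstream, tss - downstream + 1)
  end <- ifelse(plus, tss + downstream - 1, tss + upstream)
  data.frame(
    start = pmax(start, 1),        # never run off the chromosome start
    end = pmin(end, chrom_size),   # never run past the chromosome end
    strand = strand
  )
}

# Hypothetical helper: BED intervals are 0-based and half-open, so the
# 1-based start is shifted down by one and the end is kept unchanged.
to_bed <- function(chr, regions, name) {
  data.frame(
    chr = chr,
    start = regions$start - 1,
    end = regions$end,
    name = name,
    score = 0,                     # placeholder score (assumption)
    strand = regions$strand
  )
}

# Invented example values
regions <- extend_tss(tss = c(1000, 5000, 150), strand = c("+", "-", "+"))
bed <- to_bed("chr1", regions, name = c("GENE_P1", "GENE_P2", "GENE_P3"))
print(bed)
```

The third promoter sits closer than 300 bp to the chromosome start, so its region is clipped to position 1. This is why genome_version is needed: the chromosome sizes give the upper clipping bound.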
extract_tss_groups(): Extract and group transcription start sites from annotation.
create_extended_tss_regions(): Create extended promoter regions with configurable boundaries.
- filter_gtf_and_export(): Filter GTF files by biotype and transcript class
- create_bed_file(): Convert promoter regions to BED format
- get_chrom_sizes(): Get chromosome sizes for genome version
- create_width_plot(): Generate width distribution histogram with optional zoom and reference lines
- create_distribution_plot(): Generate binned bar chart distributions
- create_pie_chart(): Generate pie chart for categorical distributions
- bin_data(): Bin a numeric column into labelled count categories (used internally by create_distribution_plot())
- get_base_colors(): Return the NPG colour palette used throughout the package
- save_plot(): Save a ggplot object to multiple file formats (PNG, PDF, SVG) with configurable size and DPI
- theme_custom(): Return a Nature journal-style ggplot2 theme with clean axes and customisable base font size
- generate_all_distribution_plots(): Convenience wrapper that generates all QC plots in one call. Each plot is produced for all promoters and again for external-only promoters where applicable. Steps 1–12 always run; step 13 is optional:
  1. Genes by promoter count pie chart
  2. Internal vs external promoter pie chart
  3. Promoters per gene distribution (all + external)
  4. Promoters per chromosome distribution (all + external)
  5. Transcripts per promoter distribution (all + external)
  6. Transcripts per gene distribution (all + external)
  7. Transcripts per chromosome distribution (all + external)
  8. Genes per chromosome distribution
  9. TSS group width distribution (all + external)
  10. Promoter region width distribution (all + external)
  11. Promoter overlap categories (all + external)
  12. Nearest HC external promoter distance (all + external)
  13. CpG island distance, only if the min_distance_tss_to_cpg_island column is present (all + external)
- annotate_hc_promoters(): Flags high-confidence external (HC) promoters and computes TSS distances to other HC promoters of the same gene. A promoter is defined as high-confidence external if: (1) internalPromoter == FALSE, (2) transcriptType contains "protein_coding", AND (3) tag contains "CCDS" (consensus CDS-validated coding transcript) or "Ensembl_canonical" (Ensembl-designated canonical isoform). Adds three columns: high_confident_external_promoter, abs_distance_tss_to_other_ex_hc_promoter, min_distance_tss_to_other_ex_hc_promoter
- annotate_cpg_distance(): Computes the distance from each promoter to the nearest CpG island using distanceToNearest(), in two versions: (1) from the TSS point, (2) from the extended promoter region. Accepts a pre-loaded CpG island data frame (UCSC 0-based coordinates converted to 1-based internally). Adds columns: min_distance_tss_to_cpg_island, min_distance_region_to_cpg_island (0 if overlapping), nearest_cpg_island_name, nearest_cpg_island_width, and Relation_to_Island (factor: Island / Shore / Shelf / Open Sea)
- overlap_category(): Categorise and visualise promoter overlaps (same gene, different gene, no overlap)
- plot_promoter_count_pie(): Pie chart of genes by number of promoters (1 vs ≥2)
- plot_internal_external_pie(): Pie chart of all promoters split by internal vs external (internalPromoter)
- plot_promoters_per_gene(): Promoter count distribution per gene
- plot_promoters_per_chromosome(): Promoter count per chromosome
- plot_transcripts_per_promoter(): Transcript count per promoter
- plot_transcripts_per_gene(): Transcript count per gene
- plot_transcripts_per_chromosome(): Transcript count per chromosome
- plot_genes_per_chromosome(): Gene count per chromosome
- plot_hc_promoter_min_distance(): Histogram of minimum TSS distance to nearest HC external promoter
- plot_cpg_distance(): Multi-panel figure of CpG island distance (TSS and region distances, with/without overlapping promoters) plus a pie chart of promoter CpG context (Island / Shore / Shelf / Open Sea). Returns a named list: cpg_island_width, tss, tss_exclude_0, region, region_exclude_0, combined, island_chart
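The high-confidence rule described for annotate_hc_promoters() can be illustrated with a few lines of base R. This is a sketch of the stated criteria, not the package's code: the toy table is invented, and the assumption that distances are computed for every promoter (not only HC ones) is taken from the column description rather than verified against the implementation.

```r
# Toy table using the documented column names; all values are invented.
promoters <- data.frame(
  geneId = c("G1", "G1", "G1", "G2"),
  TSS = c(1000, 1500, 4000, 200),
  internalPromoter = c(FALSE, FALSE, TRUE, FALSE),
  transcriptType = c("protein_coding", "protein_coding,lncRNA",
                     "protein_coding", NA),
  tag = c("CCDS,basic", "Ensembl_canonical", "CCDS", "basic"),
  stringsAsFactors = FALSE
)

# All three criteria must hold. grepl() returns FALSE for NA input, so rows
# with missing transcriptType or tag are never flagged.
is_hc <- promoters$internalPromoter %in% FALSE &
  grepl("protein_coding", promoters$transcriptType, fixed = TRUE) &
  grepl("CCDS|Ensembl_canonical", promoters$tag)

# Sorted TSS distances from promoter i to every *other* HC external promoter
# of the same gene (empty vector if there is none).
hc_distances <- function(i) {
  same_gene_hc <- which(is_hc & promoters$geneId == promoters$geneId[i])
  others <- setdiff(same_gene_hc, i)
  sort(abs(promoters$TSS[others] - promoters$TSS[i]))
}
dists <- lapply(seq_len(nrow(promoters)), hc_distances)

promoters$high_confident_external_promoter <- is_hc
promoters$abs_distance_tss_to_other_ex_hc_promoter <- vapply(
  dists,
  function(d) if (length(d)) paste(d, collapse = "|") else NA_character_,
  character(1)
)
promoters$min_distance_tss_to_other_ex_hc_promoter <- vapply(
  dists,
  function(d) if (length(d)) min(d) else NA_real_,
  numeric(1)
)
print(promoters[, c("geneId", "TSS", "high_confident_external_promoter",
                    "abs_distance_tss_to_other_ex_hc_promoter",
                    "min_distance_tss_to_other_ex_hc_promoter")])
```

In this toy data the internal promoter of G1 is not flagged, but it still receives distances to the two HC promoters of its gene, while the single promoter of G2 gets NA in both distance columns.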
tss_groups.csv contains grouped transcription start sites with the following columns:
- seqnames: Chromosome name
- start: Minimum TSS position in the group (start = min_tss)
- end: Maximum TSS position in the group (end = max_tss)
- strand: Strand direction (+ or -)
- TSS: Representative transcription start site position (equals min_tss)
- promoterId: Unique promoter identifier from proActiv
- promoterName: Gene-based promoter name (e.g., "GENE_P1")
- promoterEnsemblId: Ensembl-style promoter ID
- internalPromoter: Whether promoter is internal to a gene
- firstExonEnd: First exon boundary position
- intronId: Intron identifiers (comma-separated if multiple)
- geneId, geneName, geneType: Gene metadata
- transcript_count: Number of transcripts sharing this TSS group
- transcriptName, transcriptType: Transcript information (comma-separated if multiple)
- transcript_gr_starts: Start positions of transcripts in group (comma-separated)
- transcript_gr_ends: End positions of transcripts in group (comma-separated)
- tss_gr_positions: All TSS positions within the group (comma-separated)
- tss_group_width: Width of TSS group region
- tag: Additional annotation tags
- proteinId: Protein identifiers (comma-separated if multiple)
promoter_regions.csv contains the extended regions with these additional columns:
- promoter_width: Width of extended promoter region (bp)
- high_confident_external_promoter: TRUE if the promoter is external (internalPromoter == FALSE), transcriptType contains "protein_coding", and tag contains "CCDS" (consensus CDS-validated coding transcript) or "Ensembl_canonical" (Ensembl-designated canonical isoform). Rows with NA in transcriptType or tag are treated as FALSE
- abs_distance_tss_to_other_ex_hc_promoter: Pipe-separated (|) sorted TSS distances (bp) from this promoter to all other high-confidence external promoters of the same gene. NA if no other such promoter exists in the gene
- min_distance_tss_to_other_ex_hc_promoter: Minimum TSS distance (bp) to the nearest other high-confidence external promoter of the same gene. NA if no other such promoter exists
- min_distance_tss_to_cpg_island: Distance (bp) from the TSS point to the nearest CpG island. 0 if the TSS overlaps a CpG island. Added by annotate_cpg_distance()
- min_distance_region_to_cpg_island: Distance (bp) from the extended promoter region boundary to the nearest CpG island. 0 if the region overlaps a CpG island. Added by annotate_cpg_distance()
- nearest_cpg_island_name: Name of the CpG island nearest to the promoter TSS, taken from the name column of the input CpG-island table. Added by annotate_cpg_distance()
- nearest_cpg_island_width: Width (bp) of the CpG island nearest to the promoter TSS, taken from the length column of the input CpG-island table. Added by annotate_cpg_distance()
- Relation_to_Island: CpG context of the promoter region: "Island" (overlapping), "Shore" (<=2 kb), "Shelf" (<=4 kb), or "Open Sea" (>4 kb or no CpG island on same chromosome). Added by annotate_cpg_distance()
- All columns from TSS groups
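The Relation_to_Island thresholds above can be sketched in base R as follows. This is an illustration of the documented categories only: region_to_island_distance() and classify_cpg_context() are hypothetical helpers, the island table is invented, and the distance here is defined so that 0 means overlap and anything else is positive. The package derives its distances from distanceToNearest(), which may follow a slightly different convention for adjacent ranges.

```r
# Hypothetical helper: distance (bp) from a 1-based closed region to the
# nearest island on the same chromosome; NA if the chromosome has no island.
region_to_island_distance <- function(chr, start, end, islands) {
  same_chr <- islands[islands$seqnames == chr, ]
  if (nrow(same_chr) == 0) return(NA_real_)
  island_start <- same_chr$start + 1  # UCSC 0-based start -> 1-based
  # Positive gap on whichever side the island lies, 0 when the ranges overlap.
  gap <- pmax(island_start - end, start - same_chr$end, 0)
  min(gap)
}

# Hypothetical helper: map distances onto the four documented categories.
classify_cpg_context <- function(distance) {
  context <- ifelse(is.na(distance), "Open Sea",
             ifelse(distance == 0, "Island",
             ifelse(distance <= 2000, "Shore",
             ifelse(distance <= 4000, "Shelf", "Open Sea"))))
  factor(context, levels = c("Island", "Shore", "Shelf", "Open Sea"))
}

# Invented CpG islands in UCSC-style 0-based coordinates
islands <- data.frame(
  seqnames = c("chr1", "chr1"),
  start = c(999, 9999),
  end = c(2000, 11000)
)

# Invented promoter regions in 1-based coordinates
regions <- data.frame(
  seqnames = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  start = c(1500, 3000, 5000, 20000, 100),
  end = c(1900, 3400, 5400, 20400, 500)
)

regions$min_distance_region_to_cpg_island <- mapply(
  region_to_island_distance,
  regions$seqnames, regions$start, regions$end,
  MoreArgs = list(islands = islands),
  USE.NAMES = FALSE
)
regions$Relation_to_Island <-
  classify_cpg_context(regions$min_distance_region_to_cpg_island)
print(regions)
```

The chr2 region has no island on its chromosome, so its distance is NA and it falls into "Open Sea", matching the rule stated for Relation_to_Island.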
Standard 0-based BED format:
chr start end name score strand promoterId

# Filter GTF for protein-coding multi-exon transcripts
filter_gtf_and_export(
  gtf_file = "gencode.v46.annotation.gtf.gz",
  out_gtf = "gencode.v46.filtered.gtf",
  biotypes = "protein_coding",
  keep_class = "multi_exon"
)
tss_results <- run_tss_group_extraction(
  gtf_file = "gencode.v46.filtered.gtf",
  species = "Homo_sapiens",
  output_dir = "promoter_regions_300_100_tss"
)
promoter_results <- run_promoter_regions_extraction(
  tss_groups = tss_results,
  output_dir = "promoter_regions_300_100_tss",
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38",
  limited_first_exon = FALSE
)

# Extract TSS groups
tss_groups <- extract_tss_groups(
  gtf_file = "annotation.gtf",
  species = "Homo_sapiens",
  output_csv = "my_tss_groups.csv"
)
# Create extended regions
promoter_regions <- create_extended_tss_regions(
  tss_groups = tss_groups,
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38",
  output_csv = "my_promoter_regions.csv"
)
# Export to BED
bed_data <- create_bed_file(
  promoter_regions,
  output_bed = "my_promoters.bed"
)
# Annotate distance to nearest CpG island
cpg_island <- read.table("cpgIsland_hg38.txt", header = FALSE, sep = "\t")
colnames(cpg_island) <- c("id", "seqnames", "start", "end", "name",
                          "length", "cpg_count", "gc_count",
                          "pct_cpg", "pct_gc", "obs_exp")
promoter_regions <- annotate_cpg_distance(promoter_regions, cpg_island)

View documentation for any function:
?run_tss_group_extraction
?create_extended_tss_regions
?extract_tss_groups

This work was developed at the Neuropathology Department, Heidelberg University Hospital (UKHD) and the German Cancer Research Center (DKFZ). We thank Dr. med. Abigail Suwala, Dr. Isabell Bludau, Dr. Paul Kerbs, Quynh Nhu Nguyen, Temesvari-Nagy Levente, and the Bioinformatics team at the Neuropathology Department, Heidelberg University Hospital for their contributions and support.
MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit issues or pull requests.
For questions or issues, please open an issue on the GitHub repository.