PromoterScan

An R package for extracting transcription start site (TSS) groups from GTF annotation files and generating extended promoter regions for genomic analysis.

Overview

PromoterScan uses the proActiv package to identify and group transcription start sites, then builds extended promoter regions from configurable upstream and downstream boundaries. It writes its results as CSV files, BED files, and distribution plots for downstream analysis.

Features

Extract TSS groups from GTF annotation or TxDb SQLite files
Group transcripts sharing promoter regions
Generate extended promoter regions with configurable boundaries
Optionally constrain promoter regions to the first exon boundary
Export to multiple formats (CSV, BED)
Generate distribution plots for quality control and visualization
Support for multiple genome versions (GRCh38, GRCh37)
Annotate high-confidence external promoters (protein-coding with CCDS or Ensembl canonical tag, non-internal)
Compute TSS distances between promoters of the same gene
Compute distance from each promoter to the nearest CpG island with name and width annotation

Installation

Prerequisites

PromoterScan requires R >= 4.0.0 and several Bioconductor packages. Install dependencies first:

# Install BiocManager if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install Bioconductor dependencies
BiocManager::install(c(
    "AnnotationDbi",
    "GenomeInfoDb",
    "GenomicRanges",
    "IRanges",
    "proActiv",
    "rtracklayer",
    "S4Vectors",
    "txdbmaker"
))

# Install CRAN dependencies
install.packages(c("data.table", "dplyr", "ggplot2", "magrittr", "scales"))

Install PromoterScan

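# devtools is needed for both install routes below
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

# Install from GitHub
devtools::install_github("lucianhu/PromoterScan")

# Or from local source
devtools::install_local("path/to/PromoterScan")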

Quick Start

Basic Workflow

library(PromoterScan)

# Step 1: Extract TSS groups from GTF file
tss_results <- run_tss_group_extraction(
  gtf_file = "path/to/annotation.gtf",
  species = "Homo_sapiens",
  output_dir = "tss_analysis"
)

# Step 2: Create extended promoter regions
promoter_results <- run_promoter_regions_extraction(
  tss_groups = tss_results,
  output_dir = "promoter_regions",
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38"
)

Using TxDb SQLite Files

For improved performance with large annotations, you can create a TxDb SQLite database:

library(txdbmaker)
library(AnnotationDbi)

# Create TxDb from GTF
txdb <- makeTxDbFromGFF(
  file = "annotation.gtf",
  format = "gtf",
  dataSource = "GENCODE",
  organism = "Homo sapiens"
)

# Save to SQLite file
saveDb(txdb, file = "annotation.sqlite")

# Use in PromoterScan (gtf_file always required for transcript metadata)
tss_results <- run_tss_group_extraction(
  gtf_file = "annotation.gtf",
  sqlite_file = "annotation.sqlite",
  species = "Homo_sapiens",
  output_dir = "tss_analysis"
)

Main Functions

High-Level Workflows

`run_tss_group_extraction()`

Complete workflow for extracting TSS groups using proActiv.

Parameters:

gtf_file: Path to GTF annotation file (required for transcript metadata)
sqlite_file: Path to TxDb SQLite file. If provided, used for proActiv annotation instead of GTF (optional)
species: Species name for proActiv (default: "Homo_sapiens")
output_dir: Output directory for results (default: "tss_analysis")

Returns: Data frame with TSS groups

Outputs:

tss_groups.csv: TSS groups with metadata
plots/tss_group_width_distribution.png/pdf: Width distribution plot

`run_promoter_regions_extraction()`

Complete workflow for creating extended promoter regions from TSS groups.

Parameters:

tss_groups: Data frame from run_tss_group_extraction()
output_dir: Output directory (default: "promoter_regions")
upstream: Base pairs to extend upstream of TSS (default: 300)
downstream: Base pairs to extend downstream of TSS (default: 100)
genome_version: Genome version for chromosome sizes (default: "GRCh38")
limited_first_exon: Whether to limit promoter to first exon end (default: FALSE)

Returns: List containing promoter_regions and bed_data

Outputs:

promoter_regions.csv: Extended promoter regions
promoter_regions.bed: BED format file
plots/: Distribution plots
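
To make the upstream, downstream, and limited_first_exon parameters concrete, here is a minimal base-R sketch of strand-aware extension around a single TSS. The helper `extend_tss()` and its clamping rules are assumptions made for illustration; they are not taken from the PromoterScan source, and the package may treat edge cases (and the exact window width) differently.

```r
# Hypothetical helper (not part of PromoterScan): extend one TSS into a
# promoter window in 1-based, inclusive coordinates.
extend_tss <- function(tss, strand, upstream = 300, downstream = 100,
                       chrom_size = Inf, first_exon_end = NA) {
  if (strand == "+") {
    start <- tss - upstream
    end   <- tss + downstream
    # Optionally stop the window at the first exon boundary
    if (!is.na(first_exon_end)) end <- min(end, first_exon_end)
  } else {
    # On the minus strand, upstream lies at higher coordinates
    start <- tss - downstream
    end   <- tss + upstream
    if (!is.na(first_exon_end)) start <- max(start, first_exon_end)
  }
  # Clamp the window to the chromosome
  start <- max(start, 1)
  end   <- min(end, chrom_size)
  # This sketch counts the TSS base itself, so the default width is 401 bp
  c(start = start, end = end, width = end - start + 1)
}

plus  <- extend_tss(1000, "+")
minus <- extend_tss(1000, "-")
```

With the defaults, the plus-strand window runs from 700 to 1100 and the minus-strand window from 900 to 1300, which shows how the same two parameters mirror around the TSS depending on strand.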

Core Functions

`extract_tss_groups()`

Extract and group transcription start sites from annotation.

`create_extended_tss_regions()`

Create extended promoter regions with configurable boundaries.

Utility Functions

filter_gtf_and_export(): Filter GTF files by biotype and transcript class
create_bed_file(): Convert promoter regions to BED format
get_chrom_sizes(): Get chromosome sizes for genome version
create_width_plot(): Generate width distribution histogram with optional zoom and reference lines
create_distribution_plot(): Generate binned bar chart distributions
create_pie_chart(): Generate pie chart for categorical distributions
bin_data(): Bin a numeric column into labelled count categories (used internally by create_distribution_plot())
get_base_colors(): Return the NPG colour palette used throughout the package
save_plot(): Save a ggplot object to multiple file formats (PNG, PDF, SVG) with configurable size and DPI
theme_custom(): Return a Nature journal-style ggplot2 theme with clean axes and customisable base font size
generate_all_distribution_plots(): Convenience wrapper that generates all QC plots in one call. Each plot is produced for all promoters and again for external-only promoters where applicable. Steps 1–12 always run; step 13 is optional:
1. Genes by promoter count pie chart
2. Internal vs external promoter pie chart
3. Promoters per gene distribution (all + external)
4. Promoters per chromosome distribution (all + external)
5. Transcripts per promoter distribution (all + external)
6. Transcripts per gene distribution (all + external)
7. Transcripts per chromosome distribution (all + external)
8. Genes per chromosome distribution
9. TSS group width distribution (all + external)
10. Promoter region width distribution (all + external)
11. Promoter overlap categories (all + external)
12. Nearest HC external promoter distance (all + external)
13. CpG island distance — only if min_distance_tss_to_cpg_island column is present (all + external)
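
The binning step behind the distribution plots can be illustrated with base R. The sketch below is not the package's bin_data(); the function name `bin_counts()`, the break points, and the labels are assumptions chosen for illustration.

```r
# Hypothetical stand-in for a binning step: group per-gene counts into
# labelled bins and tabulate how many values fall into each bin.
bin_counts <- function(x, breaks = c(0, 1, 2, 5, Inf),
                       labels = c("1", "2", "3-5", ">5")) {
  # right = TRUE gives the intervals (0,1], (1,2], (2,5], (5,Inf]
  binned <- cut(x, breaks = breaks, labels = labels, right = TRUE)
  as.data.frame(table(bin = binned), responseName = "n")
}

# Invented example: number of promoters for eight genes
promoters_per_gene <- c(1, 1, 2, 3, 5, 8, 1, 2)
binned <- bin_counts(promoters_per_gene)
```

The resulting two-column table (bin, n) is the shape a bar chart layer such as ggplot2's geom_col() expects.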

Promoter Annotation Functions

annotate_hc_promoters(): Flags high-confidence external (HC) promoters and computes TSS distances to other HC promoters of the same gene. A promoter is defined as high-confidence external if: (1) internalPromoter == FALSE, (2) transcriptType contains "protein_coding", AND (3) tag contains "CCDS" (consensus CDS-validated coding transcript) or "Ensembl_canonical" (Ensembl-designated canonical isoform). Adds three columns: high_confident_external_promoter, abs_distance_tss_to_other_ex_hc_promoter, min_distance_tss_to_other_ex_hc_promoter
annotate_cpg_distance(): Computes the distance from each promoter to the nearest CpG island using distanceToNearest(), in two versions: (1) from the TSS point, (2) from the extended promoter region. Accepts a pre-loaded CpG island data frame (UCSC 0-based coordinates converted to 1-based internally). Adds columns: min_distance_tss_to_cpg_island, min_distance_region_to_cpg_island (0 if overlapping), nearest_cpg_island_name, nearest_cpg_island_width, and Relation_to_Island (factor: Island / Shore / Shelf / Open Sea)
overlap_category(): Categorise and visualise promoter overlaps (same gene, different gene, no overlap)
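
The three high-confidence criteria and the per-gene distance columns can be sketched in base R as follows. This is an illustration built from the description above, not the code of annotate_hc_promoters(); the toy values are invented, and computing distances for every promoter (including non-HC ones) is an assumption.

```r
# Toy promoter table; column names follow the README, values are invented.
promoters <- data.frame(
  geneId           = c("G1", "G1", "G1", "G2"),
  TSS              = c(1000, 1500, 4000, 9000),
  internalPromoter = c(FALSE, FALSE, TRUE, FALSE),
  transcriptType   = c("protein_coding", "protein_coding,lncRNA",
                       "protein_coding", NA),
  tag              = c("CCDS,basic", "Ensembl_canonical", "CCDS", "basic"),
  stringsAsFactors = FALSE
)

# The three criteria; grepl() returns FALSE for NA, so missing
# transcriptType or tag values end up as FALSE.
is_hc <- !promoters$internalPromoter &
  grepl("protein_coding", promoters$transcriptType, fixed = TRUE) &
  grepl("CCDS|Ensembl_canonical", promoters$tag)
promoters$high_confident_external_promoter <- is_hc

# Sorted distances from each promoter to the other HC promoters of its gene.
other_hc_dist <- lapply(seq_len(nrow(promoters)), function(i) {
  others <- is_hc & promoters$geneId == promoters$geneId[i] &
    seq_len(nrow(promoters)) != i
  sort(abs(promoters$TSS[others] - promoters$TSS[i]))
})
promoters$abs_distance_tss_to_other_ex_hc_promoter <- vapply(
  other_hc_dist,
  function(d) if (length(d)) paste(d, collapse = "|") else NA_character_,
  character(1)
)
promoters$min_distance_tss_to_other_ex_hc_promoter <- vapply(
  other_hc_dist,
  function(d) if (length(d)) min(d) else NA_real_,
  numeric(1)
)
```

In this toy table the first two G1 promoters are flagged as high-confidence, the internal G1 promoter gets the pipe-separated distances to both of them, and the single G2 promoter gets NA in both distance columns.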

Visualisation Functions

plot_promoter_count_pie(): Pie chart of genes by number of promoters (1 vs ≥2)
plot_internal_external_pie(): Pie chart of all promoters split by internal vs external (internalPromoter)
plot_promoters_per_gene(): Promoter count distribution per gene
plot_promoters_per_chromosome(): Promoter count per chromosome
plot_transcripts_per_promoter(): Transcript count per promoter
plot_transcripts_per_gene(): Transcript count per gene
plot_transcripts_per_chromosome(): Transcript count per chromosome
plot_genes_per_chromosome(): Gene count per chromosome
plot_hc_promoter_min_distance(): Histogram of minimum TSS distance to nearest HC external promoter
plot_cpg_distance(): Multi-panel figure of CpG island distance (TSS and region distances, with/without overlapping promoters) plus a pie chart of promoter CpG context (Island / Shore / Shelf / Open Sea). Returns a named list: cpg_island_width, tss, tss_exclude_0, region, region_exclude_0, combined, island_chart

Output Formats

TSS Groups CSV

Contains grouped transcription start sites with the following columns:

seqnames: Chromosome name
start: Minimum TSS position in the group (start = min_tss)
end: Maximum TSS position in the group (end = max_tss)
strand: Strand direction (+ or -)
TSS: Representative transcription start site position (equals min_tss)
promoterId: Unique promoter identifier from proActiv
promoterName: Gene-based promoter name (e.g., "GENE_P1")
promoterEnsemblId: Ensembl-style promoter ID
internalPromoter: Whether promoter is internal to a gene
firstExonEnd: First exon boundary position
intronId: Intron identifiers (comma-separated if multiple)
geneId, geneName, geneType: Gene metadata
transcript_count: Number of transcripts sharing this TSS group
transcriptName, transcriptType: Transcript information (comma-separated if multiple)
transcript_gr_starts: Start positions of transcripts in group (comma-separated)
transcript_gr_ends: End positions of transcripts in group (comma-separated)
tss_gr_positions: All TSS positions within the group (comma-separated)
tss_group_width: Width of TSS group region
tag: Additional annotation tags
proteinId: Protein identifiers (comma-separated if multiple)

Promoter Regions CSV

Extended regions with additional columns:

promoter_width: Width of extended promoter region (bp)
high_confident_external_promoter: TRUE if the promoter is external (internalPromoter == FALSE), transcriptType contains "protein_coding", and tag contains "CCDS" (consensus CDS-validated coding transcript) or "Ensembl_canonical" (Ensembl-designated canonical isoform). Rows with NA in transcriptType or tag are treated as FALSE
abs_distance_tss_to_other_ex_hc_promoter: Pipe-separated (|) sorted TSS distances (bp) from this promoter to all other high-confidence external promoters of the same gene. NA if no other such promoter exists in the gene
min_distance_tss_to_other_ex_hc_promoter: Minimum TSS distance (bp) to the nearest other high-confidence external promoter of the same gene. NA if no other such promoter exists
min_distance_tss_to_cpg_island: Distance (bp) from the TSS point to the nearest CpG island. 0 if the TSS overlaps a CpG island. Added by annotate_cpg_distance()
min_distance_region_to_cpg_island: Distance (bp) from the extended promoter region boundary to the nearest CpG island. 0 if the region overlaps a CpG island. Added by annotate_cpg_distance()
nearest_cpg_island_name: Name of the CpG island nearest to the promoter TSS, taken from the name column of the input CpG-island table. Added by annotate_cpg_distance()
nearest_cpg_island_width: Width (bp) of the CpG island nearest to the promoter TSS, taken from the length column of the input CpG-island table. Added by annotate_cpg_distance()
Relation_to_Island: CpG context of the promoter region — "Island" (overlapping), "Shore" (<=2 kb), "Shelf" (<=4 kb), or "Open Sea" (>4 kb or no CpG island on same chromosome). Added by annotate_cpg_distance()
All columns from TSS groups
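
The Relation_to_Island thresholds listed above map directly onto a call to cut(). The sketch below is an illustration only: `classify_cpg_context()` is a hypothetical name, and using the region-to-island distance as input (with NA meaning no island on the chromosome) is an assumption about how the package derives the category.

```r
# Hypothetical classifier applying the documented distance thresholds.
classify_cpg_context <- function(dist_bp) {
  context <- cut(dist_bp,
                 breaks = c(-Inf, 0, 2000, 4000, Inf),
                 labels = c("Island", "Shore", "Shelf", "Open Sea"),
                 right = TRUE)
  # No CpG island on the same chromosome: distance is NA, treat as Open Sea
  context[is.na(dist_bp)] <- "Open Sea"
  context
}

cpg_context <- classify_cpg_context(c(0, 1, 2000, 2001, 4000, 4001, NA))
```

Because the intervals are closed on the right, a distance of exactly 2000 bp is still a Shore and exactly 4000 bp is still a Shelf, matching the "<=" wording of the column description.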

BED File

Standard BED format (0-based start), with promoterId appended as a seventh column:

chr  start  end  name  score  strand  promoterId
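
For readers who want to build a comparable file by hand, the following base-R sketch converts 1-based, inclusive regions into BED rows with the seven columns shown above. It is not the implementation of create_bed_file(); using promoterName for the name column and 0 for the score are assumptions.

```r
# Toy promoter regions in 1-based, inclusive coordinates (values invented).
regions <- data.frame(
  seqnames     = c("chr1", "chr2"),
  start        = c(701L, 900L),
  end          = c(1100L, 1300L),
  strand       = c("+", "-"),
  promoterName = c("GENEA_P1", "GENEB_P1"),
  promoterId   = c(1L, 2L),
  stringsAsFactors = FALSE
)

# BED uses a 0-based start and an exclusive end, so only start shifts by one.
bed <- data.frame(
  chr        = regions$seqnames,
  start      = regions$start - 1L,
  end        = regions$end,
  name       = regions$promoterName,
  score      = 0L,
  strand     = regions$strand,
  promoterId = regions$promoterId,
  stringsAsFactors = FALSE
)

# Write tab-separated rows without a header or quotes.
bed_path <- tempfile(fileext = ".bed")
write.table(bed, bed_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

A quick consistency check is that end minus start in the BED rows equals the width of the original 1-based region.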

Examples

Example 1: Run the Whole Pipeline

# Filter GTF for protein-coding multi-exon transcripts
filter_gtf_and_export(
  gtf_file = "gencode.v46.annotation.gtf.gz",
  out_gtf = "gencode.v46.filtered.gtf",
  biotypes = "protein_coding",
  keep_class = "multi_exon"
)

tss_results <- run_tss_group_extraction(
  gtf_file = "gencode.v46.filtered.gtf",
  species = "Homo_sapiens",
  output_dir = "promoter_regions_300_100_tss"
)

promoter_results <- run_promoter_regions_extraction(
  tss_groups = tss_results,
  output_dir = "promoter_regions_300_100_tss",
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38",
  limited_first_exon = FALSE
)

Example 2: Step-by-Step Workflow

# Extract TSS groups
tss_groups <- extract_tss_groups(
  gtf_file = "annotation.gtf",
  species = "Homo_sapiens",
  output_csv = "my_tss_groups.csv"
)

# Create extended regions
promoter_regions <- create_extended_tss_regions(
  tss_groups = tss_groups,
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38",
  output_csv = "my_promoter_regions.csv"
)

# Export to BED
bed_data <- create_bed_file(
  promoter_regions,
  output_bed = "my_promoters.bed"
)

# Annotate distance to nearest CpG island
cpg_island <- read.table("cpgIsland_hg38.txt", header = FALSE, sep = "\t")
colnames(cpg_island) <- c("id", "seqnames", "start", "end", "name",
                           "length", "cpg_count", "gc_count",
                           "pct_cpg", "pct_gc", "obs_exp")
promoter_regions <- annotate_cpg_distance(promoter_regions, cpg_island)
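
The coordinate handling behind the CpG annotation can be illustrated without Bioconductor. In this sketch, `nearest_island()` is a hypothetical helper, the island rows are invented, and the distance is defined as the number of bases strictly between the TSS and the island (0 when the TSS lies inside or directly next to it). The package itself uses distanceToNearest() on genomic ranges, so treat this only as a rough model of the idea.

```r
# Toy CpG island table in UCSC coordinates (0-based start, exclusive end).
cpg_island <- data.frame(
  seqnames = c("chr1", "chr1"),
  start    = c(999L, 4999L),
  end      = c(1500L, 5200L),
  name     = c("CpG: 40", "CpG: 18"),
  stringsAsFactors = FALSE
)

# Convert to 1-based, inclusive coordinates: only the start shifts by one.
cpg_island$start_1based <- cpg_island$start + 1L

# Hypothetical helper: nearest island to a TSS on the same chromosome.
nearest_island <- function(tss, chrom, islands) {
  isl <- islands[islands$seqnames == chrom, ]
  if (nrow(isl) == 0) {
    return(list(distance = NA_integer_, name = NA_character_))
  }
  # Bases strictly between the TSS and each island; 0 if inside or adjacent
  gap <- pmax(isl$start_1based - tss - 1L, tss - isl$end - 1L, 0L)
  i <- which.min(gap)
  list(distance = gap[i], name = isl$name[i])
}

inside  <- nearest_island(1200L, "chr1", cpg_island)
between <- nearest_island(3000L, "chr1", cpg_island)
missing <- nearest_island(10L, "chrX", cpg_island)
```

A TSS inside the first island gets distance 0, a TSS between the two islands is assigned to the closer one, and a chromosome without islands yields NA.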

Documentation

View documentation for any function:

?run_tss_group_extraction
?create_extended_tss_regions
?extract_tss_groups

Acknowledgments

This work was developed at the Neuropathology Department, Heidelberg University Hospital (UKHD) and the German Cancer Research Center (DKFZ). We thank Dr. med. Abigail Suwala, Dr. Isabell Bludau, Dr. Paul Kerbs, Quynh Nhu Nguyen, Temesvari-Nagy Levente, and the Bioinformatics team at the Neuropathology Department, Heidelberg University Hospital for their contributions and support.

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Contact

For questions or issues, please open an issue on the GitHub repository.
