UKHD-NP/PromoterScan


PromoterScan

An R package for extracting transcription start site (TSS) groups from GTF annotation files and generating extended promoter regions for genomic analysis.

Overview

PromoterScan uses the proActiv package to identify and group transcription start sites, then builds extended promoter regions from configurable upstream and downstream boundaries. It writes CSV files, BED files, and distribution plots for downstream analysis.

Features

  • Extract TSS groups from GTF annotation or TxDb SQLite files
  • Group transcripts sharing promoter regions
  • Generate extended promoter regions with configurable boundaries
  • Optional constraint to first exon boundaries
  • Export to multiple formats (CSV, BED)
  • Generate distribution plots for quality control and visualization
  • Support for multiple genome versions (GRCh38, GRCh37)
  • Annotate high-confidence external promoters (protein-coding with CCDS or Ensembl canonical tag, non-internal)
  • Compute TSS distances between promoters of the same gene
  • Compute distance from each promoter to the nearest CpG island with name and width annotation

Installation

Prerequisites

PromoterScan requires R >= 4.0.0 and several Bioconductor packages. Install dependencies first:

# Install BiocManager if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install Bioconductor dependencies
BiocManager::install(c(
    "AnnotationDbi",
    "GenomeInfoDb",
    "GenomicRanges",
    "IRanges",
    "proActiv",
    "rtracklayer",
    "S4Vectors",
    "txdbmaker"
))

# Install CRAN dependencies
install.packages(c("data.table", "dplyr", "ggplot2", "magrittr", "scales"))
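
Before installing, it can help to see which of the dependencies listed above are still missing. The following is a minimal sketch in base R; the helper name `missing_packages()` is hypothetical and not part of PromoterScan.

```r
# Hypothetical helper (not part of PromoterScan): return the packages from
# `pkgs` whose namespace cannot be loaded from the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Dependency lists copied from the installation commands above.
bioc_deps <- c("AnnotationDbi", "GenomeInfoDb", "GenomicRanges", "IRanges",
               "proActiv", "rtracklayer", "S4Vectors", "txdbmaker")
cran_deps <- c("data.table", "dplyr", "ggplot2", "magrittr", "scales")

not_installed <- missing_packages(c(bioc_deps, cran_deps))
if (length(not_installed) > 0) {
  message("Still to install: ", paste(not_installed, collapse = ", "))
} else {
  message("All dependencies are available.")
}
```

Note that `requireNamespace()` loads each namespace it finds, so running this in a fresh session is slightly slower than a plain lookup in `installed.packages()`.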

Install PromoterScan

# Install from GitHub
devtools::install_github("lucianhu/PromoterScan")

# Or from local source
devtools::install_local("path/to/PromoterScan")

Quick Start

Basic Workflow

library(PromoterScan)

# Step 1: Extract TSS groups from GTF file
tss_results <- run_tss_group_extraction(
  gtf_file = "path/to/annotation.gtf",
  species = "Homo_sapiens",
  output_dir = "tss_analysis"
)

# Step 2: Create extended promoter regions
promoter_results <- run_promoter_regions_extraction(
  tss_groups = tss_results,
  output_dir = "promoter_regions",
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38"
)

Using TxDb SQLite Files

For improved performance with large annotations, you can create a TxDb SQLite database:

library(txdbmaker)
library(AnnotationDbi)

# Create TxDb from GTF
txdb <- makeTxDbFromGFF(
  file = "annotation.gtf",
  format = "gtf",
  dataSource = "GENCODE",
  organism = "Homo sapiens"
)

# Save to SQLite file
saveDb(txdb, file = "annotation.sqlite")

# Use in PromoterScan (gtf_file is still required for transcript metadata)
tss_results <- run_tss_group_extraction(
  gtf_file = "annotation.gtf",
  sqlite_file = "annotation.sqlite",
  species = "Homo_sapiens",
  output_dir = "tss_analysis"
)

Main Functions

High-Level Workflows

run_tss_group_extraction()

Complete workflow for extracting TSS groups using proActiv.

Parameters:

  • gtf_file: Path to GTF annotation file (required for transcript metadata)
  • sqlite_file: Path to TxDb SQLite file. If provided, used for proActiv annotation instead of GTF (optional)
  • species: Species name for proActiv (default: "Homo_sapiens")
  • output_dir: Output directory for results (default: "tss_analysis")

Returns: Data frame with TSS groups

Outputs:

  • tss_groups.csv: TSS groups with metadata
  • plots/tss_group_width_distribution.png and plots/tss_group_width_distribution.pdf: Width distribution plot

run_promoter_regions_extraction()

Complete workflow for creating extended promoter regions from TSS groups.

Parameters:

  • tss_groups: Data frame from run_tss_group_extraction()
  • output_dir: Output directory (default: "promoter_regions")
  • upstream: Base pairs to extend upstream of TSS (default: 300)
  • downstream: Base pairs to extend downstream of TSS (default: 100)
  • genome_version: Genome version for chromosome sizes (default: "GRCh38")
  • limited_first_exon: Whether to limit promoter to first exon end (default: FALSE)

Returns: List containing promoter_regions and bed_data

Outputs:

  • promoter_regions.csv: Extended promoter regions
  • promoter_regions.bed: BED format file
  • plots/: Distribution plots
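
To make the upstream, downstream, and limited_first_exon parameters concrete, here is a base R sketch of a strand-aware window around a TSS. It is not the package's implementation. The helper `extend_tss()` is hypothetical, and it assumes the window is [TSS - upstream, TSS + downstream] on the plus strand (mirrored on the minus strand), that coordinates are 1-based and inclusive, and that firstExonEnd is the genomic coordinate where the first exon ends in transcript orientation.

```r
# Hypothetical helper (not PromoterScan's implementation): build a
# strand-aware window around each TSS and clip it to the chromosome.
# Any strand other than "+" is treated as the minus strand.
extend_tss <- function(tss, strand, chrom_size,
                       upstream = 300, downstream = 100,
                       first_exon_end = NULL) {
  plus  <- strand == "+"
  start <- ifelse(plus, tss - upstream, tss - downstream)
  end   <- ifelse(plus, tss + downstream, tss + upstream)

  # Optionally stop the downstream side at the first exon boundary.
  if (!is.null(first_exon_end)) {
    end   <- ifelse(plus,  pmin(end, first_exon_end), end)
    start <- ifelse(!plus, pmax(start, first_exon_end), start)
  }

  # Keep coordinates inside [1, chromosome length].
  start <- pmax(start, 1)
  end   <- pmin(end, chrom_size)
  data.frame(start = start, end = end, width = end - start + 1)
}

regions <- extend_tss(
  tss        = c(1000, 5000, 150),
  strand     = c("+", "-", "+"),
  chrom_size = c(1e6, 1e6, 1e6)
)
print(regions)
```

Under these assumptions an unclipped window is upstream + downstream + 1 bases wide because the TSS base itself is included; the third toy row is clipped at position 1. The package may use a different convention, so check promoter_width in the real output.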

Core Functions

extract_tss_groups()

Extract and group transcription start sites from annotation.

create_extended_tss_regions()

Create extended promoter regions with configurable boundaries.

Utility Functions

  • filter_gtf_and_export(): Filter GTF files by biotype and transcript class
  • create_bed_file(): Convert promoter regions to BED format
  • get_chrom_sizes(): Get chromosome sizes for genome version
  • create_width_plot(): Generate width distribution histogram with optional zoom and reference lines
  • create_distribution_plot(): Generate binned bar chart distributions
  • create_pie_chart(): Generate pie chart for categorical distributions
  • bin_data(): Bin a numeric column into labelled count categories (used internally by create_distribution_plot())
  • get_base_colors(): Return the NPG colour palette used throughout the package
  • save_plot(): Save a ggplot object to multiple file formats (PNG, PDF, SVG) with configurable size and DPI
  • theme_custom(): Return a Nature journal-style ggplot2 theme with clean axes and customisable base font size
  • generate_all_distribution_plots(): Convenience wrapper that generates all QC plots in one call. Where applicable, each plot is produced once for all promoters and once for external-only promoters. Steps 1–12 always run; step 13 runs only when CpG island annotations are present:
    1. Genes by promoter count pie chart
    2. Internal vs external promoter pie chart
    3. Promoters per gene distribution (all + external)
    4. Promoters per chromosome distribution (all + external)
    5. Transcripts per promoter distribution (all + external)
    6. Transcripts per gene distribution (all + external)
    7. Transcripts per chromosome distribution (all + external)
    8. Genes per chromosome distribution
    9. TSS group width distribution (all + external)
    10. Promoter region width distribution (all + external)
    11. Promoter overlap categories (all + external)
    12. Nearest HC external promoter distance (all + external)
    13. CpG island distance — only if min_distance_tss_to_cpg_island column is present (all + external)
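
The binned distributions listed above rest on a simple idea: collapse a numeric count column into a few labelled categories and count the rows in each. The sketch below shows that idea in base R. It is not the package's bin_data(); the helper name `bin_counts()`, the "5+" cap, and the toy data are assumptions for illustration, and counts are assumed to be integers of at least 1.

```r
# Hypothetical helper (not the package's bin_data()): turn per-gene promoter
# counts into labelled bins and count how many genes fall into each bin.
bin_counts <- function(x, max_bin = 5) {
  top_label <- paste0(max_bin, "+")
  labels <- c(as.character(seq_len(max_bin - 1)), top_label)
  # Everything at or above max_bin is pooled into the last category.
  binned <- factor(ifelse(x >= max_bin, top_label, as.character(x)),
                   levels = labels)
  # Empty bins are kept because they are factor levels.
  as.data.frame(table(bin = binned), responseName = "n")
}

promoters_per_gene <- c(1, 1, 2, 3, 1, 7, 5, 2)  # toy data
binned <- bin_counts(promoters_per_gene)
print(binned)
```

Keeping empty bins as factor levels means a bar chart built from this table has a stable x-axis across data sets.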

Promoter Annotation Functions

  • annotate_hc_promoters(): Flags high-confidence external (HC) promoters and computes TSS distances to other HC promoters of the same gene. A promoter is defined as high-confidence external if: (1) internalPromoter == FALSE, (2) transcriptType contains "protein_coding", AND (3) tag contains "CCDS" (consensus CDS-validated coding transcript) or "Ensembl_canonical" (Ensembl-designated canonical isoform). Adds three columns: high_confident_external_promoter, abs_distance_tss_to_other_ex_hc_promoter, min_distance_tss_to_other_ex_hc_promoter
  • annotate_cpg_distance(): Computes the distance from each promoter to the nearest CpG island using distanceToNearest(), in two versions: (1) from the TSS point, (2) from the extended promoter region. Accepts a pre-loaded CpG island data frame (UCSC 0-based coordinates converted to 1-based internally). Adds columns: min_distance_tss_to_cpg_island, min_distance_region_to_cpg_island (0 if overlapping), nearest_cpg_island_name, nearest_cpg_island_width, and Relation_to_Island (factor: Island / Shore / Shelf / Open Sea)
  • overlap_category(): Categorise and visualise promoter overlaps (same gene, different gene, no overlap)
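
The high-confidence rule and the same-gene distance columns described for annotate_hc_promoters() can be sketched in base R as follows. This is an illustration of the stated rule on invented rows, not the package's code; the helper `other_hc_distances()` is hypothetical, and distances are assumed to be computed for every promoter against the other HC promoters of its gene.

```r
# Toy promoter table; column names follow the README, values are invented.
promoters <- data.frame(
  geneId           = c("G1", "G1", "G1", "G2", "G2"),
  TSS              = c(1000, 4000, 9000, 500, 700),
  internalPromoter = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  transcriptType   = c("protein_coding", "protein_coding,lncRNA",
                       "protein_coding", "protein_coding", NA),
  tag              = c("CCDS,basic", "Ensembl_canonical", "CCDS", "basic", "CCDS"),
  stringsAsFactors = FALSE
)

# Rule as described above; grepl() returns FALSE for NA input, and
# `%in% TRUE` turns any remaining NA into FALSE.
hc <- (!promoters$internalPromoter &
         grepl("protein_coding", promoters$transcriptType, fixed = TRUE) &
         grepl("CCDS|Ensembl_canonical", promoters$tag)) %in% TRUE
promoters$high_confident_external_promoter <- hc

# Hypothetical helper: sorted distances from row i to the *other* HC
# promoters of the same gene, or NA when there are none.
other_hc_distances <- function(i) {
  others <- hc & promoters$geneId == promoters$geneId[i]
  others[i] <- FALSE
  if (!any(others)) return(NA_real_)
  sort(abs(promoters$TSS[others] - promoters$TSS[i]))
}
dists <- lapply(seq_len(nrow(promoters)), other_hc_distances)

promoters$abs_distance_tss_to_other_ex_hc_promoter <- vapply(
  dists, function(d) if (all(is.na(d))) NA_character_ else paste(d, collapse = "|"),
  character(1))
promoters$min_distance_tss_to_other_ex_hc_promoter <- vapply(
  dists, function(d) if (all(is.na(d))) NA_real_ else min(d), numeric(1))
print(promoters[, c("geneId", "TSS", "high_confident_external_promoter",
                    "abs_distance_tss_to_other_ex_hc_promoter")])
```

To read the pipe-separated column back into numbers later, `strsplit(x, "|", fixed = TRUE)` followed by `as.numeric()` is enough.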

Visualisation Functions

  • plot_promoter_count_pie(): Pie chart of genes by number of promoters (1 vs ≥2)
  • plot_internal_external_pie(): Pie chart of all promoters split by internal vs external (internalPromoter)
  • plot_promoters_per_gene(): Promoter count distribution per gene
  • plot_promoters_per_chromosome(): Promoter count per chromosome
  • plot_transcripts_per_promoter(): Transcript count per promoter
  • plot_transcripts_per_gene(): Transcript count per gene
  • plot_transcripts_per_chromosome(): Transcript count per chromosome
  • plot_genes_per_chromosome(): Gene count per chromosome
  • plot_hc_promoter_min_distance(): Histogram of minimum TSS distance to nearest HC external promoter
  • plot_cpg_distance(): Multi-panel figure of CpG island distance (TSS and region distances, with/without overlapping promoters) plus a pie chart of promoter CpG context (Island / Shore / Shelf / Open Sea). Returns a named list: cpg_island_width, tss, tss_exclude_0, region, region_exclude_0, combined, island_chart

Output Formats

TSS Groups CSV

Contains grouped transcription start sites with the following columns:

  • seqnames: Chromosome name
  • start: Minimum TSS position in the group (start = min_tss)
  • end: Maximum TSS position in the group (end = max_tss)
  • strand: Strand direction (+ or -)
  • TSS: Representative transcription start site position (equals min_tss)
  • promoterId: Unique promoter identifier from proActiv
  • promoterName: Gene-based promoter name (e.g., "GENE_P1")
  • promoterEnsemblId: Ensembl-style promoter ID
  • internalPromoter: Whether promoter is internal to a gene
  • firstExonEnd: First exon boundary position
  • intronId: Intron identifiers (comma-separated if multiple)
  • geneId, geneName, geneType: Gene metadata
  • transcript_count: Number of transcripts sharing this TSS group
  • transcriptName, transcriptType: Transcript information (comma-separated if multiple)
  • transcript_gr_starts: Start positions of transcripts in group (comma-separated)
  • transcript_gr_ends: End positions of transcripts in group (comma-separated)
  • tss_gr_positions: All TSS positions within the group (comma-separated)
  • tss_group_width: Width of TSS group region
  • tag: Additional annotation tags
  • proteinId: Protein identifiers (comma-separated if multiple)

Promoter Regions CSV

Extended regions with additional columns:

  • promoter_width: Width of extended promoter region (bp)
  • high_confident_external_promoter: TRUE if the promoter is external (internalPromoter == FALSE), transcriptType contains "protein_coding", and tag contains "CCDS" (consensus CDS-validated coding transcript) or "Ensembl_canonical" (Ensembl-designated canonical isoform). Rows with NA in transcriptType or tag are treated as FALSE
  • abs_distance_tss_to_other_ex_hc_promoter: Pipe-separated (|) sorted TSS distances (bp) from this promoter to all other high-confidence external promoters of the same gene. NA if no other such promoter exists in the gene
  • min_distance_tss_to_other_ex_hc_promoter: Minimum TSS distance (bp) to the nearest other high-confidence external promoter of the same gene. NA if no other such promoter exists
  • min_distance_tss_to_cpg_island: Distance (bp) from the TSS point to the nearest CpG island. 0 if the TSS overlaps a CpG island. Added by annotate_cpg_distance()
  • min_distance_region_to_cpg_island: Distance (bp) from the extended promoter region boundary to the nearest CpG island. 0 if the region overlaps a CpG island. Added by annotate_cpg_distance()
  • nearest_cpg_island_name: Name of the CpG island nearest to the promoter TSS, taken from the name column of the input CpG-island table. Added by annotate_cpg_distance()
  • nearest_cpg_island_width: Width (bp) of the CpG island nearest to the promoter TSS, taken from the length column of the input CpG-island table. Added by annotate_cpg_distance()
  • Relation_to_Island: CpG context of the promoter region: "Island" (overlapping), "Shore" (within 2 kb), "Shelf" (more than 2 kb and up to 4 kb), or "Open Sea" (more than 4 kb away, or no CpG island on the same chromosome). Added by annotate_cpg_distance()
  • All columns from TSS groups
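
The Relation_to_Island categories can be reproduced from a distance column with `cut()`. This is a sketch rather than the package's code: the helper `classify_cpg_context()` is hypothetical, it assumes the category is derived from min_distance_region_to_cpg_island, and it reads "Shelf" as more than 2 kb and at most 4 kb.

```r
# Hypothetical helper: map a region-to-island distance (bp) to a CpG context.
# Intervals are right-closed: 0 -> Island, (0, 2000] -> Shore,
# (2000, 4000] -> Shelf, above 4000 -> Open Sea.
classify_cpg_context <- function(distance_bp) {
  context <- cut(distance_bp,
                 breaks = c(-Inf, 0, 2000, 4000, Inf),
                 labels = c("Island", "Shore", "Shelf", "Open Sea"))
  # A missing distance (no island on the chromosome) counts as Open Sea.
  context[is.na(context)] <- "Open Sea"
  context
}

context <- classify_cpg_context(c(0, 150, 2000, 2001, 4000, 12000, NA))
print(table(context))
```

The result is a factor with all four levels, so categories with zero promoters still appear in tables and pie charts.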

BED File

0-based, half-open BED coordinates. The six standard BED columns are followed by promoterId:

chr  start  end  name  score  strand  promoterId
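
Converting 1-based, inclusive promoter coordinates to this layout only changes the start column. The base R sketch below illustrates the conversion on invented rows. It is not the package's create_bed_file(); using promoterName for the name column and 0 for the score are assumptions.

```r
# Toy promoter regions with 1-based, inclusive coordinates (values invented).
regions <- data.frame(
  seqnames     = c("chr1", "chr2"),
  start        = c(701, 4900),
  end          = c(1101, 5300),
  strand       = c("+", "-"),
  promoterName = c("GENE1_P1", "GENE2_P1"),
  promoterId   = c(1, 2),
  stringsAsFactors = FALSE
)

# Integer columns avoid scientific notation (e.g. 1e+05) in the output file.
bed <- data.frame(
  chr        = regions$seqnames,
  start      = as.integer(regions$start - 1),  # 1-based -> 0-based start
  end        = as.integer(regions$end),        # unchanged: BED end is exclusive
  name       = regions$promoterName,           # assumption: name = promoterName
  score      = 0L,                             # assumption: placeholder score
  strand     = regions$strand,
  promoterId = as.integer(regions$promoterId)
)

bed_path <- tempfile(fileext = ".bed")
write.table(bed, bed_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
writeLines(readLines(bed_path))
```

Because the end coordinate is exclusive in BED, `end - start` in the file equals the promoter width in bases.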

Examples

Example 1: Run the Whole Pipeline

# Filter GTF for protein-coding multi-exon transcripts
filter_gtf_and_export(
  gtf_file = "gencode.v46.annotation.gtf.gz",
  out_gtf = "gencode.v46.filtered.gtf",
  biotypes = "protein_coding",
  keep_class = "multi_exon"
)

tss_results <- run_tss_group_extraction(
  gtf_file = "gencode.v46.filtered.gtf",
  species = "Homo_sapiens",
  output_dir = "promoter_regions_300_100_tss"
)

promoter_results <- run_promoter_regions_extraction(
  tss_groups = tss_results,
  output_dir = "promoter_regions_300_100_tss",
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38",
  limited_first_exon = FALSE
)

Example 2: Step-by-Step Workflow

# Extract TSS groups
tss_groups <- extract_tss_groups(
  gtf_file = "annotation.gtf",
  species = "Homo_sapiens",
  output_csv = "my_tss_groups.csv"
)

# Create extended regions
promoter_regions <- create_extended_tss_regions(
  tss_groups = tss_groups,
  upstream = 300,
  downstream = 100,
  genome_version = "GRCh38",
  output_csv = "my_promoter_regions.csv"
)

# Export to BED
bed_data <- create_bed_file(
  promoter_regions,
  output_bed = "my_promoters.bed"
)

# Annotate distance to nearest CpG island
cpg_island <- read.table("cpgIsland_hg38.txt", header = FALSE, sep = "\t")
colnames(cpg_island) <- c("id", "seqnames", "start", "end", "name",
                           "length", "cpg_count", "gc_count",
                           "pct_cpg", "pct_gc", "obs_exp")
promoter_regions <- annotate_cpg_distance(promoter_regions, cpg_island)
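
For intuition about what the TSS-to-island distance means, the following base R sketch converts UCSC-style 0-based starts to 1-based coordinates and finds the nearest island to a single position. It does not use or reproduce annotate_cpg_distance(). The helper `nearest_island()` and the toy table are hypothetical, and the sketch counts the bases strictly between the TSS and the island, so an overlapping or directly adjacent island gives 0. Treat that off-by-one convention as an assumption when comparing with the package's values.

```r
# Toy CpG island table in UCSC style (0-based start, exclusive end); invented.
cpg <- data.frame(
  seqnames = c("chr1", "chr1", "chr2"),
  start    = c(999, 9000, 100),
  end      = c(1500, 9800, 400),
  name     = c("CpG: 40", "CpG: 75", "CpG: 22"),
  stringsAsFactors = FALSE
)
cpg$start <- cpg$start + 1  # 0-based start -> 1-based inclusive start

# Hypothetical helper: gap in bases from a position to the nearest island
# on the same chromosome; NA if that chromosome has no islands.
nearest_island <- function(chrom, pos, islands) {
  cand <- islands[islands$seqnames == chrom, ]
  if (nrow(cand) == 0) {
    return(data.frame(distance = NA_real_, name = NA_character_))
  }
  # Positive only when the island lies entirely to one side of `pos`.
  gap  <- pmax(cand$start - pos - 1, pos - cand$end - 1, 0)
  best <- which.min(gap)
  data.frame(distance = gap[best], name = cand$name[best])
}

inside  <- nearest_island("chr1", 1200, cpg)  # TSS inside the first island
between <- nearest_island("chr1", 5000, cpg)  # TSS between two islands
absent  <- nearest_island("chr3", 10, cpg)    # no islands on this chromosome
print(rbind(inside, between, absent))
```

A linear scan per position is fine for a handful of promoters; for genome-wide tables an interval-based approach such as the distanceToNearest() call mentioned earlier is the better fit.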

Documentation

View documentation for any function:

?run_tss_group_extraction
?create_extended_tss_regions
?extract_tss_groups

Acknowledgments

This work was developed at the Neuropathology Department, Heidelberg University Hospital (UKHD) and the German Cancer Research Center (DKFZ). We thank Dr. med. Abigail Suwala, Dr. Isabell Bludau, Dr. Paul Kerbs, Quynh Nhu Nguyen, Temesvari-Nagy Levente, and the Bioinformatics team at the Neuropathology Department, Heidelberg University Hospital for their contributions and support.

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Contact

For questions or issues, please open an issue on the GitHub repository.
