EG-xry/DiffentialGenes_Pipeline

GEO Transcriptomic Data Analysis Pipeline

This repository provides a pipeline for analyzing gene expression data from the GEO datasets GSE206959 and GSE7305. The pipeline covers data processing, differential gene expression analysis, and several visualization techniques, including PCA, heatmaps, volcano plots, and Venn diagrams. A custom function for KEGG pathway enrichment visualization (by Bio Informatics Skilltree) is also included.

By Zhou Quan Eric Gao. Special thanks to Bio Informatics Skilltree.

Repository Structure

GSE206959 Pipeline

  • GSE206959.Rmd
    This R Markdown file handles the initial data processing steps:
    • Data Import and Preparation: Loads the gene count data from a gzipped TSV file and converts it into a usable expression matrix.
    • Clinical Metadata Processing: Retrieves clinical data via GEOquery (or falls back to generating minimal clinical data if a dedicated file is not available).
    • Gene Identifier Conversion: Uses the tinyarray package to convert gene IDs to standard formats (e.g., SYMBOL or ENSEMBL).
    • Gene Filtering: Filters out genes with low or no expression based on user-defined criteria.
    • Grouping and Saving: Extracts sample group information and saves the processed data (expression matrix, grouping info, and clinical data) into an RData file.
  • GSE206959_Figure.Rmd
    This R Markdown file performs differential expression analysis and produces visualizations:

    • Differential Expression Analysis: Runs DESeq2, edgeR, and limma-voom to identify upregulated and downregulated genes.
    • Result Visualization: Generates PCA plots, heatmaps, volcano plots, and Venn diagrams to compare differential expression results across methods.
    • Output Saving: Saves key outputs (plots and result objects) for further investigation or reporting.
  • Pipeline (GSE7305)/Workspace/kegg_plot_function.R
    This script contains the custom kegg_plot function:

    • KEGG Pathway Visualization: Processes KEGG enrichment results and creates a horizontal bar plot using ggplot2 and ggthemes. Bar lengths are based on log-transformed p-values, so the most significantly enriched pathways stand out.
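
The log-transformed p-values behind the KEGG bar plot can be illustrated with a short sketch. The repository's kegg_plot function is written in R; the Python below is not that code, only an illustration of the idea. The helper name `signed_log_p`, the toy pathways, and the choice to give downregulated pathways a negative score (so they would extend in the opposite direction on a horizontal bar plot) are all assumptions made for this example.

```python
import math
from typing import List, Tuple


def signed_log_p(p_value: float, direction: str) -> float:
    """Return -log10(p) for 'up' pathways and its negative for 'down' pathways.

    Hypothetical helper: the signing convention is an assumption, not
    something confirmed by the repository.
    """
    score = -math.log10(p_value)
    return score if direction == "up" else -score


# Made-up enrichment results: (pathway name, p-value, direction).
toy_pathways: List[Tuple[str, float, str]] = [
    ("Pathway A", 1e-6, "up"),
    ("Pathway B", 1e-3, "down"),
    ("Pathway C", 1e-2, "up"),
]

# Sort by score so bars would run from most "down" to most "up".
bars = sorted(
    ((name, signed_log_p(p, d)) for name, p, d in toy_pathways),
    key=lambda item: item[1],
)

for name, score in bars:
    print(f"{name:10s} {score:+.1f}")
```

Smaller p-values give longer bars, which is why a log scale is a common choice for this kind of plot.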

GSE7305 Pipeline

  • Pipeline (GSE7305)/Workspace/0_PackageInstall.R

    • Checks for and installs required CRAN and Bioconductor packages.
    • Updates and installs additional dependencies from GitHub (e.g., `tinyarray`, `deconstructSigs`).
  • Pipeline (GSE7305)/Workspace/1_StartGEO.R

    • Downloads the GEO dataset GSE7305 using `GEOquery`.
    • Extracts the expression matrix and clinical (phenotype) data.
    • Performs initial transformation (e.g., log2 transformation) and checks that the expression data and sample metadata are consistent.
    • Saves the preprocessed data as `step1output.Rdata`.
  • Pipeline (GSE7305)/Workspace/2_GroupIDs.R

    • Defines sample groups based on clinical metadata using string detection.
    • Annotates probe IDs to gene symbols using methods from the `tinyarray` and `AnnoProbe` packages (or alternative methods as commented).
    • Saves the annotated dataset as `step2output.Rdata`.
  • Pipeline (GSE7305)/Workspace/3_PCAheatmap.R

    • Performs exploratory data analysis:

      • Runs Principal Component Analysis (PCA) and visualizes the results.
      • Generates heatmaps (both direct and row-standardized) for the top 1000 genes with the highest standard deviation.
    • Saves visual outputs as PNG files.

  • Pipeline (GSE7305)/Workspace/4_DEG.R

    • Conducts differential expression analysis using the `limma` package.
    • Generates a volcano plot to display differentially expressed genes.
    • Adds gene annotation information (gene symbols and Entrez IDs) and labels genes as upregulated, downregulated, or stable.
    • Constructs a heatmap of differentially expressed genes.
    • Saves the results as `step4output.Rdata`.
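
The labeling of genes as upregulated, downregulated, or stable in 4_DEG.R can be sketched as a simple threshold rule. The pipeline itself is R; the Python below only illustrates the logic. The function name `label_gene`, the cutoffs (|logFC| > 1 and p < 0.05), and the toy values are assumptions, since this README does not state which thresholds or which p-value column the script uses.

```python
from typing import Dict, Tuple


def label_gene(
    log_fc: float,
    p_value: float,
    fc_cutoff: float = 1.0,   # assumed cutoff on |log2 fold change|
    p_cutoff: float = 0.05,   # assumed significance cutoff
) -> str:
    """Classify one gene from its log fold change and p-value."""
    if p_value < p_cutoff and log_fc > fc_cutoff:
        return "up"
    if p_value < p_cutoff and log_fc < -fc_cutoff:
        return "down"
    return "stable"


# Made-up results: gene -> (logFC, p-value).
toy_results: Dict[str, Tuple[float, float]] = {
    "GENE_A": (2.3, 0.001),    # large, significant increase
    "GENE_B": (-1.8, 0.01),    # large, significant decrease
    "GENE_C": (0.4, 0.0005),   # significant but small change
    "GENE_D": (3.0, 0.20),     # large but not significant
}

labels = {gene: label_gene(fc, p) for gene, (fc, p) in toy_results.items()}
print(labels)
```

A gene has to pass both cutoffs to be called up or down. In a volcano plot, these are the points in the upper left and upper right corners.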

Data Files

  • GSE206959_Gene_Counts.tsv.gz and GSE7305_series_matrix.txt.gz: The primary gene expression datasets, stored as gzipped tab-delimited text.
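
As a rough illustration of how a gzipped count table becomes an expression matrix, here is a minimal Python sketch. The pipeline does this step in R. The function name `read_count_table` is hypothetical, and the assumed layout (a header row, gene IDs in the first column, integer counts in the rest) has not been checked against the actual files. The example builds a tiny gzipped table in memory, so it runs without the real data.

```python
import csv
import gzip
import io
from typing import Dict, List, TextIO, Tuple


def read_count_table(handle: TextIO) -> Tuple[List[str], Dict[str, List[int]]]:
    """Parse a tab-delimited count table into (sample names, gene -> counts).

    Assumes a header row and gene IDs in the first column.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    matrix = {row[0]: [int(x) for x in row[1:]] for row in reader if row}
    return samples, matrix


# Tiny made-up table, gzipped in memory to stand in for a .tsv.gz file.
raw = "gene_id\tS1\tS2\nGENE_A\t5\t0\nGENE_B\t12\t30\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

# gzip.open accepts a file object as well as a path; "rt" decodes to text.
with gzip.open(buffer, mode="rt", encoding="utf-8", newline="") as handle:
    samples, matrix = read_count_table(handle)

print(samples)
print(matrix)
```

To read a file on disk, pass its path to `gzip.open` in place of the in-memory buffer.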
