GWASTutorial

This tutorial provides hands-on training in Complex Trait Genomics for the course Basic Seminar II at The Laboratory of Complex Trait Genomics, University of Tokyo. See About for details. Questions or suggestions? Please use the Issues section.

What is GWAS?

A Genome-Wide Association Study (GWAS) is a research approach that investigates the association between genetic variants (typically SNPs) and traits across the entire genome to discover genetic factors that contribute to complex traits and diseases.
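
A minimal sketch of the idea, assuming a quantitative trait and additive allele dosages (0/1/2): each variant is tested on its own by regressing the phenotype on its dosage. The variant names, the toy data, and the function `single_variant_test` are hypothetical, and the p-value uses a normal approximation instead of the t distribution. The tutorial itself runs association tests with PLINK using a logistic regression model for a binary trait.

```python
import math


def single_variant_test(dosages, phenotype):
    """Simple linear regression of a phenotype on allele dosage for one variant.

    Returns (beta, standard error, two-sided p-value).
    The p-value is a normal approximation, used here only for illustration.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    # Residual sum of squares around the fitted line
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p_value


# Hypothetical toy data: two variants genotyped in eight individuals
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp2": [0, 0, 1, 1, 2, 2, 1, 0],
}
phenotype = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1, 0.8, 2.0]

# "Genome-wide" here simply means repeating the same test for every variant
results = {snp: single_variant_test(g, phenotype) for snp, g in genotypes.items()}
for snp, (beta, se, p_value) in results.items():
    print(f"{snp}\tbeta={beta:.3f}\tse={se:.3f}\tp={p_value:.3g}")
```

In a real analysis the same loop runs over millions of variants, with covariates such as principal components added to the model and a stringent significance threshold applied to account for multiple testing.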

Why Study GWAS and Statistical Genetics?

GWAS and statistical genetics are revolutionizing our understanding of human biology and medicine. These fields are fundamental to modern genetics research, enabling the discovery of genetic risk factors for common diseases, uncovering biological mechanisms, advancing personalized medicine through polygenic risk prediction, and identifying novel drug targets.

As genetic datasets grow exponentially and precision medicine gains widespread adoption, expertise in GWAS and statistical genetics is increasingly essential for researchers across genomics, medicine, public health, and biotechnology.

Study Aim

This tutorial aims to provide comprehensive, hands-on training in genome-wide association studies (GWAS) and complex trait genomics. Through practical exercises and detailed explanations, students will learn to:

Understand the fundamental concepts and methodologies of GWAS
Perform data quality control and association testing
Interpret and visualize GWAS results
Apply post-GWAS analyses including heritability estimation, fine-mapping, and polygenic risk scoring
Develop proficiency in the computational tools and statistical methods essential for modern genetic research

Contents

Category	Topic	Description
Introduction	Introduction	Essential background knowledge for understanding genome-wide association studies (GWAS) and complex trait genomics.
Command Line Tools - Linux	Linux command line basics	For those who haven't used the command line, we will first introduce the basics of the Linux system and commonly used commands.
Pre-GWAS	1000 Genomes Project	Comprehensive catalog of human genetic variation providing reference data for GWAS and imputation.
	Sample Dataset	Sample dataset of 504 East Asian individuals from 1000 Genomes Project for tutorial exercises.
	Data formats	Before any analysis, the first thing is always to get familiar with your data. In this section, we will introduce some basic formats used to store sequence, genotype and dosage data.
	Data QC	Raw genotype data is usually "dirty": it contains errors and invalid or missing values. In this section, we will learn how to perform quality control on raw genotype data using PLINK.
	Principal component analysis (PCA)	In this section, we will cover how to perform Principal Component Analysis (PCA) to analyze the population structure.
	Phasing	Inferring haplotypes, that is, which alleles at nearby variants lie on the same parental chromosome.
	Imputation	Predicting ungenotyped variants using reference panels and LD patterns.
GWAS	Association tests	After QC, we will perform the very first association tests for a simulated binary trait (case-control trait) with a logistic regression model using PLINK.
	Visualization	To visualize the summary statistics generated from association tests, we will use a Python package called gwaslab to create Manhattan plots, quantile-quantile (Q-Q) plots, and regional plots.
	Linear mixed model (LMM)	Statistical framework to account for population structure, cryptic relatedness, and confounding in GWAS.
	Whole genome regression by REGENIE	Computationally efficient whole-genome regression method for large-scale GWAS with multiple phenotypes.
	Rare variant association tests	Methods for testing associations of rare variants by aggregating information across variants in genes or regions.
	Saddlepoint approximation (SAIGE)	Accurate p-value calculation for binary traits with unbalanced case-control ratios using saddlepoint approximation.
Post-GWAS	Variant Annotation by ANNOVAR/VEP	Annotating genetic variants with functional information including gene location, consequence, and population frequency.
	SNP-Heritability estimation by GCTA-GREML	Estimating the proportion of phenotypic variance explained by all SNPs using linear mixed models.
	LD score regression (univariate, cross-trait and partitioned) by LDSC	Method to estimate heritability, genetic correlation, and cell-type specificity from GWAS summary statistics.
	Gene / Gene-set analysis by MAGMA	Testing associations at the gene and gene-set level by aggregating variant-level signals within genes.
	Fine-mapping by SuSiE	Identifying the most likely causal variant(s) within a genomic region showing significant association.
	Meta-analysis	Combining evidence from multiple GWAS to increase statistical power and improve effect size estimation.
	Polygenic risk scores	Calculating an individual's genetic risk score by summing effect-allele dosages across trait-associated variants, each weighted by its estimated effect size.
	Mendelian randomization	Using genetic variants as instrumental variables to infer causal relationships between exposures and outcomes.
	Conditional analysis	Identifying independent association signals within a locus by conditioning on lead variants.
	Colocalization	Testing whether two traits share the same causal variant in a genomic region to support causal inference.
	TWAS	Transcriptome-wide association study to identify genes whose expression is associated with traits using expression imputation.
Topics	Linkage disequilibrium (LD)	Non-random association of alleles at different loci, fundamental concept for understanding GWAS results.
	Heritability Concepts	Understanding how much phenotypic variation can be explained by genetic variation (broad-sense and narrow-sense heritability).
	Power analysis for GWAS	Calculating statistical power to detect associations given sample size, effect size, allele frequency, and significance threshold.
	Winner's curse	Systematic overestimation of genetic effect sizes when variants are selected based on significance thresholds.
	Study design and phenotype definition	Study design principles for case/control selection, covariates, trait transformations, and phenotype QC.
	Relatedness and sample structure	Identifying related samples, handling duplicates, and choosing family-based vs population GWAS designs.
	Measure of effect	Understanding different measures of genetic effect including odds ratio, relative risk, and hazard ratio.
Others	Recommended reading	Curated list of textbooks, review articles, and topic-specific papers for further learning.
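
As a concrete illustration of the polygenic risk scores entry in the table above, the following is a minimal sketch of the weighted sum, assuming effect-allele dosages (0/1/2) and per-variant effect sizes taken from GWAS summary statistics. The variant IDs, weights, sample names, and the function `polygenic_score` are hypothetical; real pipelines also need allele alignment, variant selection, and LD handling, which are omitted here.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: dict mapping variant ID -> effect-allele dosage (0, 1 or 2)
    weights: dict mapping variant ID -> estimated effect size (e.g. log odds ratio)
    Variants without a weight are skipped.
    """
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


# Hypothetical effect sizes for three variants
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}

# Hypothetical genotype dosages for two individuals
individuals = {
    "sample1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "sample2": {"rs_a": 0, "rs_b": 2, "rs_c": 1},
}

scores = {iid: polygenic_score(d, weights) for iid, d in individuals.items()}
for iid, score in scores.items():
    print(iid, score)
```

The score is only meaningful relative to other individuals scored with the same weights, so in practice it is usually standardized within a reference population before interpretation.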

