title: "RNAseq Base Population Project Record & Notes" author: "Enoch Ng'oma (EN), and Elizabeth King (EK)" date: "7/9/2018"
- Locate these resources:
- Download reference files: wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/bdgp6_tran.tar.gz
- then: gunzip bdgp6_tran.tar.gz
- then: tar -xvf bdgp6_tran.tar
- Sequence assembly (ref. Pertea et al 2016)
/scripts/assembly_shortProtocol/SetUpArrays_short.Rmdfor use with the short pipeline leading to DESeq2/scripts/assembly_longProtocol/Set_up_arrays.Rmdfor use with the long pipeline leading to Ballgown
- See
/scripts/assembly_shortProtocol/SetUpArrays_short.html - A step-by-step workflow follows:
-
To test, align one sample to genome:
hisat2 --dta -q -x bdgp6_tran/genome_tran -U C-1_B_run1.fastq.gz -S C-1_B_run1_run1.sam >temp.txt 2>error.txt & -
Run Step 1 Align in
/base_pop/scripts/assembly_longProtocol/Set_up_arrays.Rmd -
Output are 3 text lists of hisat commands:
/base_pop/scripts/assembly_longProtocol/S01_Align_cmd_run1.txt/base_pop/scripts/assembly_longProtocol/S02_Align_cmd_run2.txt/base_pop/scripts/assembly_longProtocol/S03_Align_cmd_run3.txt
-
In the terminal,
cdto/scripts/assembly_longProtocol/ -
First run test alignment of two samples:
sbatch --array=1-2 sarray_setup_run1.sh -
this uses an sbatch script
/scripts/assembly_longProtocol/sarray_setup_run1.sh -
Second, run for all samples in a
/base_pop/samples/run1/:sbatch --array=1-54 sarray_setup_run1.sh -
sbatch scripts available:
/scripts/assembly_longProtocol/sarray_setup_run1.sh# 54 samples in run1/scripts/assembly_longProtocol/sarray_setup_run2.sh# 54 samples in run2/scripts/assembly_longProtocol/sarray_setup_run3.sh# 54 samples in run3 -
Output files are
.samand are located in/base_pop/processed/dotsams/ -
Note: can run all sbatch files at once - SLARM takes care of resources
-
Example from Pertea et al 2016:
samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam -
In R, do Samtools sort (step 2a in
Set_up_arrays.Rmd) -
Output1:
/base_pop/scripts/assembly_longProtocol/S02a_short_samtools.txt -
Next, run
sbatch --array=1-162 samtools_all.shfor all samples. This readsS02a_short_samtools.txtcreated above. -
Output2: .bam files located in
/base_pop/processed/dotbams/
- Do Samtools merge in R (step 2b in
Set_up_arrays.Rmd) - Output1:
/scripts/assembly_longProtocol/S02b_samtools_merge.txt - Run
sbatch --array=1-54 samtools_merge.shwhich readsS02b_samtools_merge.txtto merge like files - Output2: 54
sampleName_merged.bamlocated in/base_pop/processed/dotbams/
-
Novel transcripts not needed for analysis. Therefore we adopt the alternative pipeline for stringtie: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de
-
Instead of merging transcripts from all samples (step 4 in Pertea et al, 2016), we get expression estimates separately for each sample by applying the reference annotation (rather than merged sample .gtf files) to each sample separately.So we apply
stringtie -eBto the output of 2b (i.e.sampleName_merged.bam)
Quote from gpertea, June 11, 2018 gpertea/stringtie#170:
- "... if you run the alternate, faster protocol which does not attempt to assemble any novel genes, the output will only have expression level values for the known transcripts given in the reference annotation, and no merge step is needed, so there will be no internally generated IDs like MSTRG.# or STRG.# that could possibly join multiple reference gene IDs. So if you really do not care about any novel transcripts/isoforms and trust the completeness and correctness of your reference annotation, give this alternate protocol a try .... You could then use the prepDE script on the GTF files generated by the these stringtie runs (if you prefer using DESeq2 or other DE analysis tools instead of Ballgown)."
stringtie -e -B
-
Run StringTie transcript abundances in
Set_up_arrays.Rmd -
Output1:
/scripts/assembly_longProtocol/S03_abundances_short.txt -
Next, run
sbatch --array=1-54 stringt_short_abund.sh -
Output2: each of 54 sample directories (e.g. C-1_B, etc) are written to
/processed/shortProtocol/S03_short_ballG/. -
Procedure:
- Do Step 3: 'StringTie Assemble and quantify' in
Set_up_arrays.Rmdin R - Output:
S03_stringtie_assemble.txt - Next, run
sbatch --array=1-54 stringtie_assemble.sh - Output: 54 'sample_name_assembled.gtf' files located in /base_pop/processed/dotgtfs/
- Do Step 3: 'StringTie Assemble and quantify' in
Run mstrg_prep.pl to extract more gene ids from StringTie output (MSTRG names)
Ref: gpertea/stringtie#179
Script is here: https://gist.github.com/gpertea/b83f1b32435e166afa92a2d388527f4b
Command: perl mstrg_prep.pl stringtie_merged.gtf > stringtie_merged_pl.gtf
---- END OF READ MAPPING ----
- Do "Create a csv containing sample ids" in
Set_up_arrays.Rmd - Output:
/scripts/DESeq_scripts/describe_samples.csv
- Performed, but reported results not based on Ballgown.
- All scripts located in
/scripts/DESeq_scripts/assembly_longProtocol/ - Intermediate files located in
/processed/longProtocol/
- Obtain count-based gene and transcript table by running
/scripts/DESeq_scripts/prepDE.py - Instructions are located in
/scripts/DESeq_scripts/prepDEpy_instructions.txt - Input: StringTie output should be put in this directory structure:
/ballgown/sampleName/- it is required this directory be named 'ballgown' (case-sensitive)
- Output: two matrices:
/processed/shortProtocol/S03_short_ballG/transcript_count_matrix.csv, and/processed/shortProtocol/S03_short_ballG/gene_count_matrix.csv
batch_DESeq.Rmd- data prep for sva- Input1:
/processed/describe_samples_batch.csv - Input2:
/processed/shortProtocol/S03_short_matrices/gene_count_matrix.csv - writes: 1)
/processed/DESEQ/Expr_countData.csv, a DESeqDataSet based on design matrix~ tissue*treatmentwhich is input into SVASeq - writes: 2)
/processed/DESEQ/sampleDat_with_SV.csv. Surrogate variables are appened as additional columns - See
/scripts/DESeq_scripts/batch_DESeq.html
- Input1:
- Differential gene expression with DESeq2
- Gene set enrichment analysis with GAGE
- Hierarchical clustering with WGCNA
- Examination of eQTL in Stanley et al (2017)
/scripts/DESeq_scripts/DESeq_DExpr.Rmd
-
Takes the two outputs from step six as input files:
- Input 1:
/processed/DESEQ/Expr_countData.csv - Input2:
/processed/DESEQ/sampleDat_with_SV.csv
- Input 1:
-
Runs two DESeq models:
- dds.01: design = ~ SV1 + batch + tissue + treatment
- Output1
/processed/DESEQ/dds_deseq01.Rda - dds.02: design = ~ SV1 + batch + tissue*treatment
- Output2:
/processed/DESEQ/dds.02.Rda
-
Performs likelihood ratio test for several reduced models to identify DEGs
- Effect of treatment
reduced=~ SV1 + batch + tissue; saves/processed/DESEQ/lrtTreat_allPadj.Rda - Effect of tissue
reduced=~ SV1 + batch + treatment - Effect of interaction
reduced=~ SV1 + batch + tissue + treatment; saves/processed/DESEQ/lrtInt_allPadj.Rda
- Effect of treatment
-
Writes csv files filtered at given thresholds:
- Treatment effect
/processed/DESEQ/DEGs_lrt.treatment_0.05.csv - Tissue effect
/processed/DESEQ/DEGs_lrt.tissue_0.05.csv - Interaction effect
/processed/DESEQ/DEGs_lrt.int_0.05.csv
- Treatment effect
-
Writes a background list of genes for GO (i.e. all genes with >1 transcript)
/processed/DESEQ/gene_background_list.csv
/scripts/DESeq_scripts/compare_StringTie_method.Rmd
- Input1:
/processed/DESEQ/Expr_countData.csv - Input2:
/processed/longProtocol/Old-gene_count_matrix.csv - Input3:
/processedshortProtocol/S03_short_ballG/gene_count_matrix.csv - Input4:
/processed/describe_samples_batch.csv - Input4:
/processed/longProtocol/gene_count_matrix_pl3c.csv - Does not write output - QC only.
Without removal of batch effects:
/scripts/DESeq_scripts/short_protc_viz.Rmd
- Input1:
/processed/describe_samples.csv - Input2:
/processed/shortProtocol/S03_short_ballG/ballgown - Produces transcript expression comparisons across 54 samples (for each diet and tissue):
- Output1:
/plots/Expression_54_barplot.pdf - Output2:
/plots/Expression_54_barplot_b2.pdf - Input3:
/processed/DESEQ/dds_deseq01.Rda - Input4:
/processed/DESEQ/sampleDat_with_SV.csv - Output3: PCA plot
/plots/PCAplot_rlog_uncorrected_trt.pdf
- Output1:
After removal of batch effects (limma::removeBatchEffect)
- Output4: PCA plot
/plots/PCAplot_rlog_corrected_trt.pdf
Pairwise log2FoldChange comparisons
/scripts/DESeq_scripts/foldchange_DEseq.Rmd
- Input1:
/processed/DESEQ/Expr_countData.csv - Input2:
/processed/DESEQ/sampleDat_with_SV.csv - Saves
/processed/DESEQ/all_fc_dat.rda - Output1: foldchange comparison plots of HS vs DR relative to C
/plots/FC_all.pdf - Output2: volcano plots tissue-diet combinations
/plots/Voc_all.pdf
Interaction effects - QC
/scripts/DESeq_scripts/interaction_analysis.Rmd
- Input1:
/processed/DESEQ/lrtInt_allPadj.Rda - Input2:
/processed/DESEQ/dds_deseq.02.Rda - Output1:
/plots/Interaction_ex.pdf - Input3:
/processed/DESEQ/cand_genes.csv - Input4:
/processed/DESEQ/dds_deseq.02.Rda - Output2:
/plots/Interaction_cand1.pdf
- GSEA: KEEG pathways and GO on whole list DEGs for the effect of diet performed in GAGE package
/scripts/GOscripts/gsea_GO.Rmd
- Input:
/processed/DESEQ/all_fc_dat.rda - Writes table of enriched pathways and GO terms
/processed/DESEQ/GO/sig_terms.csv
/scripts/DESeq_scripts/wgcna_resampling.Rmd
-
This script resamples repeatedly from 54 samples to identify modules before analysis on the full data set. It creates 100 samples each from a random subset of 36 samples, and performs WGCNA with same settings as full data set.
- Input1:
/processed/DESEQ/sampleDat_with_SV.csv - Input2:
/processed/DESEQ/rlogtrt_batchCor.csv - saves:
/processed/DESEQ/Coexpression/Resample_WGCNA_mods.rda - saves:
/processed/DESEQ/Coexpression/Resample_WGCNA_eigen.rda - writes:
/processed/DESEQ/Coexpression/Resamp_droppedGenes.csv
- Input1:
/scripts/DESeq_scripts/wgcnaCoex.Rmd
- Input1:
/processed/DESEQ/sampleDat_with_SV.csv - Input2
/processed/DESEQ/dds_deseq01.Rda - Input3
/processed/DESEQ/dds_deseq.02.Rda - Performs rlog transformation of raw expression data, then removes batches with
limma::removeBatchEffectwrites/processed/DESEQ/rlogtrt_batchCor.csv - Implements the whole WGCNA workflow
- First, genes that are weakly assigned to modules based on resampling are moved to unassigned module
- writes modules sizes
/processed/DESEQ/Coexpression/mergedModules_Sizes.txt - writes
/processed/DESEQ/Coexpression/modueleMembership.csv - writes module eigenegenes
/processed/DESEQ/Coexpression/WCGNA_eigengenes.txt
/scripts/DESeq_scripts/wgcna_ANOVAs.Rmd
- Performs module-based ANOVA, and module-based GO term analysis
- Input1:
/processed/DESEQ/Coexpression/WCGNA_eigengenes.txt - Input2:
/processed/DESEQ/Coexpression/modules.RData- contains module color data - Input3:
/processed/DESEQ/rlogtrt_batchCor.csv- batch corrected expression data - Input4:
/processed/DESEQ/Coexpression/FlyAnnotation.csv- fly annotation: (ftp://ftp.flybase.net/releases/FB2018_05/precomputed_files/genes/) - writes modules
/processed/DESEQ/Coexpression/minMod30/- entrez ids;/Mods- FBGN & symbols - Output:
/processed/DESEQ/Coexpression/minMod30/GOEnrichmentTable.csv - Output:
/processed/DESEQ/Coexpression/minMod30/GOEnrichmentTable_simple.txt- a screen viewable table - Input5:
/processed/DESEQ/sampleDat_with_SV.csv
- Input1:
- Main output are ANOVA and adhoc test results
/scripts/DESeq_scripts/wgcna_ANOVAs.html
/scripts/DESeq_scripts/wgcna_eigengenExp_acrossDiets.Rmd
- Visualization of eigengene expression in all diets and tissues
- Input1:
/processed/DESEQ/sampleDat_with_SV.csv - Input2:
/processed/DESEQ/Coexpression/WCGNA_eigengenes.txt - Output:
/plots/eigengenes_all.pdf
- Input1:
/scripts/GOscripts/wgcna_resampled_GO.Rmd
- GO analysis on WGCNA modules after resampling
- Input1:
/processed/DESEQ/Coexpression/Resamp_droppedGenes.csv. Genes weakly assigned to modules to be removed. - Input2:
/processed/DESEQ/Coexpression/Mods/fbgn_names/*.txt. Module gene assignments before resampling - Input3:
/processed/DESEQ/Coexpression/modules.RData. Module color information - Input4:
/processed/DESEQ/Coexpression/FlyAnnotation.csv. Annotation file for mapping gene ids - Output1:
/processed/DESEQ/Coexpression/resampMods/GOEnrichTab_Resamp.csv. GO terms by module at P-value cut-off <0.01. - Output2:
/processed/DESEQ/Coexpression/resampMods/GOEnrichTab_Resamp_simple.csv. Same output but lesser details to enable viewing on the screen. - Notes on each module.
- Note:
/scripts/GOscripts/style.css- for controlling font size in a knitted output.
- Input1:
/scripts/DESeq_scripts/individual_genes_viz.Rmd
- Expression patterns of canonical nutrient sensing pathway genes
- Input1:
/processed/DESEQ/DEG_QTL/gene_map_table.csv - Input2:
/processed/DESEQ/dds_deseq.02.Rda - Output:
/plots/Individualgenes.pdf
- Input1:
/scripts/DESeq_scripts/genes_in_QTL.Rmd
- search DEGs in a median lifespan suggestive QTL from Stanley et al (2017)
- Input1:
/processed/DESEQ/all_fc_dat.rda - Input2:
/processed/DESEQ/DEG_QTL/gene_map_table.csv - Input3:
/processed/DESEQ/DEGs_lrt.treatment_0.05.csv - Input4:
/processed/DESEQ/DEGs_lrt.int_0.05.csv - Input5:
/processed/DESEQ/DEG_QTL/all_iisPeak.rda - Input6:
/processed/DESEQ/Coexpression/Obs_modMembership.csv - Input7:
/processed/DESEQ/DEG_QTL/iis_genelist_clean.txt - Input8:
/processed/DESEQ/Coexpression/Resamp_droppedGenes.csv - Input9:
/processed/DESEQ/DEG_QTL/automated_gene_summaries.tsv - Output1:
/processed/DESEQ/DEG_QTL/all_diffExpr_underQTL.csv - Output2:
/processed/DESEQ/DEG_QTL/qtl_degs.csv
- Input1:
Get more gene_ids from StringTie object with python: https://www.biostars.org/p/261128/
Ref: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de Ref p.92-93 Haddok & Dunn book Detailed instructions are in /scripts/DESeq_scripts/prepDEpy_instructions.txt
To split a .csv column with multiple id entries, follow: http://help.snapizzi.com/csv-files/split-csv-data-into-different-columns
DESeq vs Ballgown, see https://support.bioconductor.org/p/107011/
Precomputed FlyBase files: http://flybase.org/cgi-bin/get_static_page.pl?file=bulkdata7.html&title=Current%20Release
2018-12-27 (EN)
Wanted to compare results with a different approach: ENRICHMENTBROWSER
/scripts/GOscripts/enrichBrowse_GO.Rmd
Stuck on achieving a SummarizedExperiment class matrix of results
I have a DESeqResults class object
2018-12-22 (EN)
Gene set enrichment analysis using GAGE
/scripts/GOscripts/gsea_GO.Rmd
Load load("../../processed/DESEQ/GO/lrt.treatment.Rda")
Need libraries: gage, pathview and org.Dm.eg.db
Followed this tutorial alongside all 3 vignettes: http://www.gettinggeneticsdone.com/2015/12/tutorial-rna-seq-differential.html
Mapped FlyBase CG SYMBOLs in our results to fly annotation database org.Dm.eg.db
Converted CG SYMBOLs to ENTREZ
Performed pathway and GO analyses
2018-11-30 (EGK) Computed fold changes with shrinkage (method= normal). Made plots of fold changes in DR and HS relative to control for body, ovary & head. P-values are the adjusted p-values for the OVERALL effect of treatment, so could be driven by one diet or tissue. Script is foldchange_DEseq.Rmd, fold change data is in all_fc_dat.rda and plot is FC_all.pdf
2018-11-07 (EN)
Reduced number of scripts to 3, renamed as follows:
prep_DESeq.Rmd renamed batch_DESeq.Rmd - runs SVA
bg_shortP_viz.Rmd renamed short_protc_viz.Rmd - makes visualizations in ballgown and DESeq2
DESeq_DE.Rmd renamed DESeq_DExpr.Rmd - fits models for differential expression in DESeq2 based expression values from the short protocol of stringtie
2018-11-02 (EN)
PCA plots in DESeq2: DESeq_viz.Rmd
Input1 /processed/DESEQ/sampleDat_with_SV.csv (made in prep_DESe2.Rmd)
Input2: /processed/DESEQ/normExpr_countData.csv (sva batch-controlled short protocol expression made in prep_DESe2.Rmd)
Run deseq() on model without- and model with interaction
Transform DESeq results using 1) Variance Stabilization Transformation (vst),
2) Regularized Log Transformation (rlog)
Plot 2 PCA graphs for PC1/PC2, one for for each transformation; both for the interaction because PCA plots for both DESeq models not different
2018-10-26 (EN) Making PCA for DESeq data Get fpkm values with fpkm(). An error occurs. Solution is to use tximport package to normalize count values for gene size https://support.bioconductor.org/p/83607/ Starts from "stringtie -eB -G transcripts.gff <source_file.bam>" cd to /processed/S03_short_ballG/ballgown: find * -type d > ../sampNames.txt write $ in the search window i.e. matches end of line write _ballgown.gtf in the replace window write ^ in search window i.e. matches start of line write ../../processed/S03_short_ballG/ballgown
2018-10-25 (EN)
DEGs for treatment effect - 2475 genes
DEGs for interaction effect - 977 genes
Identified 3 DEGs under trans-eQTL (step and Ilp5) peaks. In script genes_in_QTL.Rmd, see object distt
- Ahcy (step QTL) is found in both treatment and interaction lists. Gene summary file associates it with: viable, aging, fecundity. Involved in NAD processes
- Cda4 (Ilp5 QTL) occurs only in treatment list. Involved in chitin and carbohydrate metabolism via NADH activity
- CG40486 in treatment list
slif was not DE in both treatment and interaction effects.
Scanned Stanley et al 2017 trans-eQTL and lifespan QTL for DEGs - see R object qtl_degs
Split prep_DESeq.Rmd script into:
prep_DESeq.Rmd: runs batch control of Stringtie output using SVA
BaseP_DESeq.Rmd: Differential expression models with DESeq2
BaseP_DESeq_Visuals.Rmd: plots to visualize expression data
genes_in_QTL.Rmd
Code to scan past QTL peaks (Stanley et al 2017) for DEGs (this study)
Reads and pulses "../../processed/gene_map_table_fb_2018_05.tsv" downloaded from ftp://ftp.flybase.net/releases/FB2018_05/precomputed_files/genes/.
A function, DEunderPeak() scans under QTL peaks.
A prepared data frame is saved to "../../processed/DESEQ/DEG_QTL/gene_map_table.csv").
Steps:
- merge a dataframe of DEGs (DEG_treat) and the gene map table (gmtable) on FBgn
- read in data with QTL peaks ../../processed/DESEQ/DEG_QTL/iis_table1.txt. This excludes peak that span the centromere
- scan through trans eQTL and lifespan peaks for DEGs
Download a precomputed gene annotation file for use in GO analysis Fbgn annotation file "Genes:fbgn_annotation_ID_fb_2018_05.tsv" Index of ftp://ftp.flybase.net/releases/FB2018_05/precomputed_files/genes/
Convert FBgn symbols to name and annotation symbols for use with GO analysis packages Go to FlyBase ID Converter tool: http://flybase.org/convert/id Paste list of FBgn.. and choose 'Validate and Convert' to 'Genes' On result page choose 'BatchDownload' Select Symbol, Name, Annotation Symbol, FlyBase ID. Push 'Get Field Data' Press 'Download as a file'
DAVID web tool for Gene Ontology Analysis. I used the Functional Annotation Tool to get GO terms in all 3 categories (BP, MF and CC). Also obtained annotation chart and table
969 genes sig differentiated (padj<0.05, LRT) from the interaction model, dds.02
GO_deseq.Rmd
GOplot to visualize: https://cran.r-project.org/web/packages/GOplot/vignettes/GOplot_vignette.html
- Use GOrilla for functional analysis of each list of DEGs a) as a whole and b) up- down- separately Discontinued.
Run a model in DESeq2 to test effect of tissue x treatment interaction on normalized batch-controlled gene-wise data: dds.02 <- DESeqDataSetFromMatrix(countData = countdata, colData = phenDat.sva, design = ~ SV1 + batch + tissue*treatment)
The following pairwise comparisions for treatment effect: rr1 <- results(dds_deseq.02, contrast=c('tissue', "H","B"), alpha = 0.05) rr2 <- results(dds_deseq.02, contrast=c('tissue', "O","B"), alpha = 0.05) rr3 <- results(dds_deseq.02, contrast=c('tissue', "O","H"), alpha = 0.05)
and,
tissue effect: ff1 <- results(dds_deseq.02, contrast=c('treatment', "C","DR"), alpha = 0.05) ff2 <- results(dds_deseq.02, contrast=c('treatment', "C","HS"), alpha = 0.05) ff3 <- results(dds_deseq.02, contrast=c('treatment', "DR","HS"), alpha = 0.05)
resulted in tables of DEGs.
- I compiled a list of 26 genes previously identified in QTL studies of diet treatments (Stanley et al, 2017):
/processed/DESEQ/DEG_QTL/iis_genelist.csv". I identify any of these which are also significantly expressed in this study: files written to/processed/DESEQ/DEG_QTL/
Genes identified are collected in one file: /processed/DESEQ/DEG_QTL/qtl_genes_all_FlyBase.csv. Also present in this file is FlyBase genome and functional data for each gene from a batch search.
Developed compare_StringTie_method script. Compares: 1) gene count table generated from stringtie run that includes estimating novel transcripts, 2) gene count table generated from stringtie run that includes estimating novel transcripts followed by perl script to get back FBgn names, 3) gene count table ignoring novel transcripts
General comparison:
Type 1 has very few FBgn names. When comparing type 2 and 3 (merging on gene names), those with differences are almost all a MSTRG that is split up in the perl script to assign gene names. So some reads are assigned to a gene and some just get an MSTRG number (both share the same string tie MSTRG number). Another set are genes with more than one assigned with the perl script. It makes sense these do not agree. Those that are left ~800 I am not sure why they don't agree. For one I looked at, the pattern matches what modEncode reports on flybase (low ovary expression) for type 3, while type 2 and 1 are both way higher for all tissues including ovaries. In conclusion, I feel good about using type 3 for GENE level analysis.
Only 200 gene names in type 3 are not represented in type 2. Not sure why for these.
Implementing an alternative workflow. A visual of the short protocol is here: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de
Run stringtie -eB on the output of samtools -sort (ie. step 2 of Pertea et al, 2016)
Gene count tables are written to /processed/S03_short_ballG/ gene_count_matrix and transcript_count_matrix
Attempting to generate novel transcripts and get an accurate gene level analysis is complex. We plan to perform separate pipelines for a transcript level and gene level analysis.
Gene level analysis: we will use the reference annotation rather than performing the merge step in StringTie followed by DEseq
Transcript level analysis: we will use the StringTie-Ballgown pipeline
Exploration of the gene count table generated from the data using the merge file before and after assigning a reference gene gave different results. Some MSTRGs are split up into those that are assigned to a gene and those that are not.
- Get more gene ids tranferred from StringTie by running a perl script
mstrg_prep.plexplained here: gpertea/stringtie#179 and documented here: https://gist.github.com/gpertea/b83f1b32435e166afa92a2d388527f4b - Command:
perl mstrg_prep.pl stringtie_merged.gtf > stringtie_merged_pl.gtf - Then, re-run step 6 (Pertea et al, 2016) to compute transcript abundances: sbatch
stringtie_abundances.sh - Results stored in
ballG_pl - Problem: duplicated MSTRG and Fbgn ids. Perl script appears to assign transcript if not sure, to the same MSTRG id creating duplicates.
- Implemented DESeq2 on svaseq-prepped read count data
- 22 rows did not converged when running
ddsDE.ttSS <- DESeq(dds.ttSS) - Dealing with autocorrelation: see https://support.bioconductor.org/p/65091/ and tried:
-
- Filter out normalized count of at least 10 reads in two or more samples.
filter <- rowSums(nc >= 10) >= 2- 8 rows did not converge
-
- Try increasing the 2 sample requirement to 3 or 4
filter <- rowSums(nc >= 10) >= 3- 6 rows did not converge
-
- omit non-converging rows from the results step
ddsDE_Clean <- ddsDE_filt10[which(mcols(ddsDE_filt10)$betaConv),]- didn't run, but also doesn't feel a good approach!
-
- increase the maximum iterations (maxit) instead of running DESeq()
filter <- rowSums(nc >= 10) >= 3plusnbinomWaldTest(dds, maxit=500)
-
filter <- rowSums(nc >= 5) >= 2plusnbinomWaldTest(maxit=500)
- 9 rows don't converge
- Looked through DEseq code to check
- Implemented svaseq to check for batch effects
- Will follow this: https://support.bioconductor.org/p/80755/
- Needed to remove all genes with sample counts that were all zeros. (gave this error: Error in density.default(x, adjust = adj) : 'x' contains missing values)
- sva identified 2 svas
- clearly 4 groupings when plotting svas- unknown sources
- Performed first DESeq model:
design = ~ treatment + tissue. Need to 1) do batch analysis and 2) include an interaction termtreatment*tissue
- Sub-structured the
scriptsdirectory intoBallgown_scriptsandDESeq_scripts - Renamed
/processed/ballG_all/as/processed/ballgown. The directory structure created bt StringTie is requiredprepDE.pyto compile counts (see http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de)
- Started a new, streamlined
DiffExpr_batch.Rmdscript. Script with all models tried up to now is pushed toscript_sink/anDiffExpr_batch_old.Rmd
- Move all ballgown stuff from
DiffExpr_batch.RmdtoDiffExpr.Rmd - clean redundant code (i.e. same code in multiple scripts)
- change column
groupinphenDattobatch
Plan for analysis: (EN)
Compare ballgown and sva differential expression analyses
- Ballgown done in
DiffExpr.Rmd - Prep data for sva -
dataPrep_batch.Rmd - Do sva -
DiffExpr_batch.Rmd - Compare sva & ballgown
- Decided: 1) check that known batches in our case have effect on expression (a. include batch in PCA, b. test with sva or linear model) 2) run svaseq without known bathes
- Tested for batch effects using SVA package. The "be" and "leek" methods applied ns.v() produced 11 and 0 SVs, respectively.Wondering why the difference we read:
- Leek_2014_ svaseq: removing batch effects and other unwanted noise from sequencing data,
- Leek_2011_Asymptotic Conditional Singular Value Decomposition for High-Dimensional Genomic Data,
- https://support.bioconductor.org/p/97469/,
- Jaffe et al_2015_ Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis
- Found that methods can produce diferent results