BasePop_RNAseq/projectNotes.md at master · nochet/BasePop_RNAseq

title: "RNAseq Base Population Project Record & Notes" author: "Enoch Ng'oma (EN), and Elizabeth King (EK)" date: "7/9/2018"

Project Record

RNA read mapping

Locate these resources:

Download reference files: wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/bdgp6_tran.tar.gz

then: gunzip bdgp6_tran.tar.gz
then: tar -xvf bdgp6_tran.tar

Sequence assembly (ref. Pertea et al 2016)
/scripts/assembly_shortProtocol/SetUpArrays_short.Rmd for use with the short pipeline leading to DESeq2
/scripts/assembly_longProtocol/Set_up_arrays.Rmd for use with the long pipeline leading to Ballgown

See /scripts/assembly_shortProtocol/SetUpArrays_short.html
A step-by-step workflow follows:

STEP 1: Align reads to reference - Hisat2

To test, align one sample to genome: hisat2 --dta -q -x bdgp6_tran/genome_tran -U C-1_B_run1.fastq.gz -S C-1_B_run1_run1.sam >temp.txt 2>error.txt &
Run Step 1 Align in /base_pop/scripts/assembly_longProtocol/Set_up_arrays.Rmd
Output are 3 text lists of hisat commands:

/base_pop/scripts/assembly_longProtocol/S01_Align_cmd_run1.txt
/base_pop/scripts/assembly_longProtocol/S02_Align_cmd_run2.txt
/base_pop/scripts/assembly_longProtocol/S03_Align_cmd_run3.txt

In the terminal, cd to /scripts/assembly_longProtocol/
First run test alignment of two samples: sbatch --array=1-2 sarray_setup_run1.sh
this uses an sbatch script /scripts/assembly_longProtocol/sarray_setup_run1.sh
Second, run for all samples in a /base_pop/samples/run1/: sbatch --array=1-54 sarray_setup_run1.sh
sbatch scripts available: /scripts/assembly_longProtocol/sarray_setup_run1.sh # 54 samples in run1 /scripts/assembly_longProtocol/sarray_setup_run2.sh # 54 samples in run2 /scripts/assembly_longProtocol/sarray_setup_run3.sh # 54 samples in run3
Output files are .sam and are located in /base_pop/processed/dotsams/
Note: can run all sbatch files at once - SLARM takes care of resources

STEP 2a: Sort and convert SAM files to BAM - Samtools

Example from Pertea et al 2016: samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam
In R, do Samtools sort (step 2a in Set_up_arrays.Rmd)
Output1: /base_pop/scripts/assembly_longProtocol/S02a_short_samtools.txt
Next, run sbatch --array=1-162 samtools_all.sh for all samples. This reads S02a_short_samtools.txt created above.
Output2: .bam files located in /base_pop/processed/dotbams/

STEP 2b: Merge to combine corresponding transcripts from each run - Samtools

Do Samtools merge in R (step 2b in Set_up_arrays.Rmd)
Output1: /scripts/assembly_longProtocol/S02b_samtools_merge.txt
Run sbatch --array=1-54 samtools_merge.sh which reads S02b_samtools_merge.txt to merge like files
Output2: 54 sampleName_merged.bam located in /base_pop/processed/dotbams/

We use the alternative workflow from here on

Novel transcripts not needed for analysis. Therefore we adopt the alternative pipeline for stringtie: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de
Instead of merging transcripts from all samples (step 4 in Pertea et al, 2016), we get expression estimates separately for each sample by applying the reference annotation (rather than merged sample .gtf files) to each sample separately.So we apply stringtie -eB to the output of 2b (i.e. sampleName_merged.bam)

Quote from gpertea, June 11, 2018 gpertea/stringtie#170:

"... if you run the alternate, faster protocol which does not attempt to assemble any novel genes, the output will only have expression level values for the known transcripts given in the reference annotation, and no merge step is needed, so there will be no internally generated IDs like MSTRG.# or STRG.# that could possibly join multiple reference gene IDs. So if you really do not care about any novel transcripts/isoforms and trust the completeness and correctness of your reference annotation, give this alternate protocol a try .... You could then use the prepDE script on the GTF files generated by the these stringtie runs (if you prefer using DESeq2 or other DE analysis tools instead of Ballgown)."

STEP 3: StringTie (Transcript abundances and table of counts)

stringtie -e -B

Run StringTie transcript abundances in Set_up_arrays.Rmd
Output1: /scripts/assembly_longProtocol/S03_abundances_short.txt
Next, run sbatch --array=1-54 stringt_short_abund.sh
Output2: each of 54 sample directories (e.g. C-1_B, etc) are written to /processed/shortProtocol/S03_short_ballG/.
Procedure:
- Do Step 3: 'StringTie Assemble and quantify' in Set_up_arrays.Rmd in R
- Output: S03_stringtie_assemble.txt
- Next, run sbatch --array=1-54 stringtie_assemble.sh
- Output: 54 'sample_name_assembled.gtf' files located in /base_pop/processed/dotgtfs/

Run mstrg_prep.pl to extract more gene ids from StringTie output (MSTRG names) Ref: gpertea/stringtie#179 Script is here: https://gist.github.com/gpertea/b83f1b32435e166afa92a2d388527f4b Command: perl mstrg_prep.pl stringtie_merged.gtf > stringtie_merged_pl.gtf

---- END OF READ MAPPING ----

STEP 4: Create a csv containing sample ids

Do "Create a csv containing sample ids" in Set_up_arrays.Rmd
Output: /scripts/DESeq_scripts/describe_samples.csv

STEP5 A: Differential gene expression with BallGown

Performed, but reported results not based on Ballgown.
All scripts located in /scripts/DESeq_scripts/assembly_longProtocol/
Intermediate files located in /processed/longProtocol/

STEP5B: Prep for DESeq (gene-level, read count-based differential expression )

Obtain count-based gene and transcript table by running /scripts/DESeq_scripts/prepDE.py
Instructions are located in /scripts/DESeq_scripts/prepDEpy_instructions.txt
Input: StringTie output should be put in this directory structure: /ballgown/sampleName/
- it is required this directory be named 'ballgown' (case-sensitive)
Output: two matrices: /processed/shortProtocol/S03_short_ballG/transcript_count_matrix.csv, and /processed/shortProtocol/S03_short_ballG/gene_count_matrix.csv

STEP 6: Control for batch effects using SVAseq package

batch_DESeq.Rmd - data prep for sva
- Input1: /processed/describe_samples_batch.csv
- Input2: /processed/shortProtocol/S03_short_matrices/gene_count_matrix.csv
- writes: 1) /processed/DESEQ/Expr_countData.csv, a DESeqDataSet based on design matrix ~ tissue*treatment which is input into SVASeq
- writes: 2) /processed/DESEQ/sampleDat_with_SV.csv. Surrogate variables are appened as additional columns
- See /scripts/DESeq_scripts/batch_DESeq.html

Differential expression and downstream analysis

Differential gene expression with DESeq2
Gene set enrichment analysis with GAGE
Hierarchical clustering with WGCNA
Examination of eQTL in Stanley et al (2017)

Differential gene expression with DESeq2

/scripts/DESeq_scripts/DESeq_DExpr.Rmd

Takes the two outputs from step six as input files:
- Input 1: /processed/DESEQ/Expr_countData.csv
- Input2: /processed/DESEQ/sampleDat_with_SV.csv
Runs two DESeq models:
- dds.01: design = ~ SV1 + batch + tissue + treatment
- Output1 /processed/DESEQ/dds_deseq01.Rda
- dds.02: design = ~ SV1 + batch + tissue*treatment
- Output2: /processed/DESEQ/dds.02.Rda
Performs likelihood ratio test for several reduced models to identify DEGs
- Effect of treatment reduced=~ SV1 + batch + tissue; saves /processed/DESEQ/lrtTreat_allPadj.Rda
- Effect of tissue reduced=~ SV1 + batch + treatment
- Effect of interaction reduced=~ SV1 + batch + tissue + treatment; saves /processed/DESEQ/lrtInt_allPadj.Rda
Writes csv files filtered at given thresholds:
- Treatment effect /processed/DESEQ/DEGs_lrt.treatment_0.05.csv
- Tissue effect /processed/DESEQ/DEGs_lrt.tissue_0.05.csv
- Interaction effect /processed/DESEQ/DEGs_lrt.int_0.05.csv
Writes a background list of genes for GO (i.e. all genes with >1 transcript) /processed/DESEQ/gene_background_list.csv

Compare numbers of DEGs obtained by DESeq2 vs Ballgown

/scripts/DESeq_scripts/compare_StringTie_method.Rmd

Input1: /processed/DESEQ/Expr_countData.csv
Input2: /processed/longProtocol/Old-gene_count_matrix.csv
Input3: /processedshortProtocol/S03_short_ballG/gene_count_matrix.csv
Input4: /processed/describe_samples_batch.csv
Input4: /processed/longProtocol/gene_count_matrix_pl3c.csv
Does not write output - QC only.

Visualization of global expression changes

Without removal of batch effects:

/scripts/DESeq_scripts/short_protc_viz.Rmd

Input1: /processed/describe_samples.csv
Input2: /processed/shortProtocol/S03_short_ballG/ballgown
Produces transcript expression comparisons across 54 samples (for each diet and tissue):
- Output1: /plots/Expression_54_barplot.pdf
- Output2: /plots/Expression_54_barplot_b2.pdf
- Input3: /processed/DESEQ/dds_deseq01.Rda
- Input4: /processed/DESEQ/sampleDat_with_SV.csv
- Output3: PCA plot /plots/PCAplot_rlog_uncorrected_trt.pdf

After removal of batch effects (limma::removeBatchEffect)

Output4: PCA plot /plots/PCAplot_rlog_corrected_trt.pdf

Pairwise log2FoldChange comparisons

/scripts/DESeq_scripts/foldchange_DEseq.Rmd

Input1: /processed/DESEQ/Expr_countData.csv
Input2: /processed/DESEQ/sampleDat_with_SV.csv
Saves /processed/DESEQ/all_fc_dat.rda
Output1: foldchange comparison plots of HS vs DR relative to C /plots/FC_all.pdf
Output2: volcano plots tissue-diet combinations /plots/Voc_all.pdf

Interaction effects - QC

/scripts/DESeq_scripts/interaction_analysis.Rmd

Input1: /processed/DESEQ/lrtInt_allPadj.Rda
Input2: /processed/DESEQ/dds_deseq.02.Rda
Output1: /plots/Interaction_ex.pdf
Input3: /processed/DESEQ/cand_genes.csv
Input4: /processed/DESEQ/dds_deseq.02.Rda
Output2: /plots/Interaction_cand1.pdf

Gene set enrichment analysis with GAGE

GSEA: KEEG pathways and GO on whole list DEGs for the effect of diet performed in GAGE package

/scripts/GOscripts/gsea_GO.Rmd

Input: /processed/DESEQ/all_fc_dat.rda
Writes table of enriched pathways and GO terms /processed/DESEQ/GO/sig_terms.csv

Hierarchical clustering with WGCNA

/scripts/DESeq_scripts/wgcna_resampling.Rmd

This script resamples repeatedly from 54 samples to identify modules before analysis on the full data set. It creates 100 samples each from a random subset of 36 samples, and performs WGCNA with same settings as full data set.
- Input1: /processed/DESEQ/sampleDat_with_SV.csv
- Input2: /processed/DESEQ/rlogtrt_batchCor.csv
- saves: /processed/DESEQ/Coexpression/Resample_WGCNA_mods.rda
- saves: /processed/DESEQ/Coexpression/Resample_WGCNA_eigen.rda
- writes: /processed/DESEQ/Coexpression/Resamp_droppedGenes.csv

/scripts/DESeq_scripts/wgcnaCoex.Rmd

Input1: /processed/DESEQ/sampleDat_with_SV.csv
Input2 /processed/DESEQ/dds_deseq01.Rda
Input3 /processed/DESEQ/dds_deseq.02.Rda
Performs rlog transformation of raw expression data, then removes batches with limma::removeBatchEffect writes /processed/DESEQ/rlogtrt_batchCor.csv
Implements the whole WGCNA workflow
- First, genes that are weakly assigned to modules based on resampling are moved to unassigned module
- writes modules sizes /processed/DESEQ/Coexpression/mergedModules_Sizes.txt
- writes /processed/DESEQ/Coexpression/modueleMembership.csv
- writes module eigenegenes /processed/DESEQ/Coexpression/WCGNA_eigengenes.txt

/scripts/DESeq_scripts/wgcna_ANOVAs.Rmd

Performs module-based ANOVA, and module-based GO term analysis
- Input1: /processed/DESEQ/Coexpression/WCGNA_eigengenes.txt
- Input2: /processed/DESEQ/Coexpression/modules.RData - contains module color data
- Input3: /processed/DESEQ/rlogtrt_batchCor.csv - batch corrected expression data
- Input4: /processed/DESEQ/Coexpression/FlyAnnotation.csv - fly annotation: (ftp://ftp.flybase.net/releases/FB2018_05/precomputed_files/genes/)
- writes modules /processed/DESEQ/Coexpression/minMod30/ - entrez ids; /Mods - FBGN & symbols
- Output: /processed/DESEQ/Coexpression/minMod30/GOEnrichmentTable.csv
- Output: /processed/DESEQ/Coexpression/minMod30/GOEnrichmentTable_simple.txt - a screen viewable table
- Input5: /processed/DESEQ/sampleDat_with_SV.csv
Main output are ANOVA and adhoc test results /scripts/DESeq_scripts/wgcna_ANOVAs.html

/scripts/DESeq_scripts/wgcna_eigengenExp_acrossDiets.Rmd

Visualization of eigengene expression in all diets and tissues
- Input1: /processed/DESEQ/sampleDat_with_SV.csv
- Input2: /processed/DESEQ/Coexpression/WCGNA_eigengenes.txt
- Output: /plots/eigengenes_all.pdf

/scripts/GOscripts/wgcna_resampled_GO.Rmd

GO analysis on WGCNA modules after resampling
- Input1: /processed/DESEQ/Coexpression/Resamp_droppedGenes.csv. Genes weakly assigned to modules to be removed.
- Input2: /processed/DESEQ/Coexpression/Mods/fbgn_names/*.txt. Module gene assignments before resampling
- Input3: /processed/DESEQ/Coexpression/modules.RData. Module color information
- Input4: /processed/DESEQ/Coexpression/FlyAnnotation.csv. Annotation file for mapping gene ids
- Output1: /processed/DESEQ/Coexpression/resampMods/GOEnrichTab_Resamp.csv. GO terms by module at P-value cut-off <0.01.
- Output2: /processed/DESEQ/Coexpression/resampMods/GOEnrichTab_Resamp_simple.csv. Same output but lesser details to enable viewing on the screen.
- Notes on each module.
- Note: /scripts/GOscripts/style.css - for controlling font size in a knitted output.

/scripts/DESeq_scripts/individual_genes_viz.Rmd

Expression patterns of canonical nutrient sensing pathway genes
- Input1: /processed/DESEQ/DEG_QTL/gene_map_table.csv
- Input2: /processed/DESEQ/dds_deseq.02.Rda
- Output: /plots/Individualgenes.pdf

/scripts/DESeq_scripts/genes_in_QTL.Rmd

search DEGs in a median lifespan suggestive QTL from Stanley et al (2017)
- Input1: /processed/DESEQ/all_fc_dat.rda
- Input2: /processed/DESEQ/DEG_QTL/gene_map_table.csv
- Input3: /processed/DESEQ/DEGs_lrt.treatment_0.05.csv
- Input4: /processed/DESEQ/DEGs_lrt.int_0.05.csv
- Input5: /processed/DESEQ/DEG_QTL/all_iisPeak.rda
- Input6: /processed/DESEQ/Coexpression/Obs_modMembership.csv
- Input7: /processed/DESEQ/DEG_QTL/iis_genelist_clean.txt
- Input8: /processed/DESEQ/Coexpression/Resamp_droppedGenes.csv
- Input9: /processed/DESEQ/DEG_QTL/automated_gene_summaries.tsv
- Output1: /processed/DESEQ/DEG_QTL/all_diffExpr_underQTL.csv
- Output2: /processed/DESEQ/DEG_QTL/qtl_degs.csv

Project Notes

General Notes & links

Get more gene_ids from StringTie object with python: https://www.biostars.org/p/261128/

Ref: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de Ref p.92-93 Haddok & Dunn book Detailed instructions are in /scripts/DESeq_scripts/prepDEpy_instructions.txt

To split a .csv column with multiple id entries, follow: http://help.snapizzi.com/csv-files/split-csv-data-into-different-columns

DESeq vs Ballgown, see https://support.bioconductor.org/p/107011/

Precomputed FlyBase files: http://flybase.org/cgi-bin/get_static_page.pl?file=bulkdata7.html&title=Current%20Release

2018-12-27 (EN) Wanted to compare results with a different approach: ENRICHMENTBROWSER /scripts/GOscripts/enrichBrowse_GO.Rmd Stuck on achieving a SummarizedExperiment class matrix of results I have a DESeqResults class object

2018-12-22 (EN) Gene set enrichment analysis using GAGE /scripts/GOscripts/gsea_GO.Rmd Load load("../../processed/DESEQ/GO/lrt.treatment.Rda") Need libraries: gage, pathview and org.Dm.eg.db Followed this tutorial alongside all 3 vignettes: http://www.gettinggeneticsdone.com/2015/12/tutorial-rna-seq-differential.html Mapped FlyBase CG SYMBOLs in our results to fly annotation database org.Dm.eg.db Converted CG SYMBOLs to ENTREZ Performed pathway and GO analyses

2018-11-30 (EGK) Computed fold changes with shrinkage (method= normal). Made plots of fold changes in DR and HS relative to control for body, ovary & head. P-values are the adjusted p-values for the OVERALL effect of treatment, so could be driven by one diet or tissue. Script is foldchange_DEseq.Rmd, fold change data is in all_fc_dat.rda and plot is FC_all.pdf

2018-11-07 (EN) Reduced number of scripts to 3, renamed as follows: prep_DESeq.Rmd renamed batch_DESeq.Rmd - runs SVA bg_shortP_viz.Rmd renamed short_protc_viz.Rmd - makes visualizations in ballgown and DESeq2 DESeq_DE.Rmd renamed DESeq_DExpr.Rmd - fits models for differential expression in DESeq2 based expression values from the short protocol of stringtie

2018-11-02 (EN) PCA plots in DESeq2: DESeq_viz.Rmd Input1 /processed/DESEQ/sampleDat_with_SV.csv (made in prep_DESe2.Rmd) Input2: /processed/DESEQ/normExpr_countData.csv (sva batch-controlled short protocol expression made in prep_DESe2.Rmd) Run deseq() on model without- and model with interaction Transform DESeq results using 1) Variance Stabilization Transformation (vst), 2) Regularized Log Transformation (rlog) Plot 2 PCA graphs for PC1/PC2, one for for each transformation; both for the interaction because PCA plots for both DESeq models not different

2018-10-26 (EN) Making PCA for DESeq data Get fpkm values with fpkm(). An error occurs. Solution is to use tximport package to normalize count values for gene size https://support.bioconductor.org/p/83607/ Starts from "stringtie -eB -G transcripts.gff <source_file.bam>" cd to /processed/S03_short_ballG/ballgown: find * -type d > ../sampNames.txt write $ in the search window i.e. matches end of line write _ballgown.gtf in the replace window write ^ in search window i.e. matches start of line write ../../processed/S03_short_ballG/ballgown

2018-10-25 (EN) DEGs for treatment effect - 2475 genes DEGs for interaction effect - 977 genes Identified 3 DEGs under trans-eQTL (step and Ilp5) peaks. In script genes_in_QTL.Rmd, see object distt - Ahcy (step QTL) is found in both treatment and interaction lists. Gene summary file associates it with: viable, aging, fecundity. Involved in NAD processes - Cda4 (Ilp5 QTL) occurs only in treatment list. Involved in chitin and carbohydrate metabolism via NADH activity - CG40486 in treatment list slif was not DE in both treatment and interaction effects.

2018-10-21 (EN)

Scanned Stanley et al 2017 trans-eQTL and lifespan QTL for DEGs - see R object qtl_degs Split prep_DESeq.Rmd script into: prep_DESeq.Rmd: runs batch control of Stringtie output using SVA BaseP_DESeq.Rmd: Differential expression models with DESeq2 BaseP_DESeq_Visuals.Rmd: plots to visualize expression data

2018-10-20

genes_in_QTL.Rmd Code to scan past QTL peaks (Stanley et al 2017) for DEGs (this study) Reads and pulses "../../processed/gene_map_table_fb_2018_05.tsv" downloaded from ftp://ftp.flybase.net/releases/FB2018_05/precomputed_files/genes/. A function, DEunderPeak() scans under QTL peaks. A prepared data frame is saved to "../../processed/DESEQ/DEG_QTL/gene_map_table.csv"). Steps: - merge a dataframe of DEGs (DEG_treat) and the gene map table (gmtable) on FBgn - read in data with QTL peaks ../../processed/DESEQ/DEG_QTL/iis_table1.txt. This excludes peak that span the centromere - scan through trans eQTL and lifespan peaks for DEGs

2018-10-10 (EN)

Download a precomputed gene annotation file for use in GO analysis Fbgn annotation file "Genes:fbgn_annotation_ID_fb_2018_05.tsv" Index of ftp://ftp.flybase.net/releases/FB2018_05/precomputed_files/genes/

2018-10-09

Convert FBgn symbols to name and annotation symbols for use with GO analysis packages Go to FlyBase ID Converter tool: http://flybase.org/convert/id Paste list of FBgn.. and choose 'Validate and Convert' to 'Genes' On result page choose 'BatchDownload' Select Symbol, Name, Annotation Symbol, FlyBase ID. Push 'Get Field Data' Press 'Download as a file'

2018-10-07 (EN)

DAVID web tool for Gene Ontology Analysis. I used the Functional Annotation Tool to get GO terms in all 3 categories (BP, MF and CC). Also obtained annotation chart and table 969 genes sig differentiated (padj<0.05, LRT) from the interaction model, dds.02

GO_deseq.Rmd GOplot to visualize: https://cran.r-project.org/web/packages/GOplot/vignettes/GOplot_vignette.html

2018-10-02

Use GOrilla for functional analysis of each list of DEGs a) as a whole and b) up- down- separately Discontinued.

2018-09-04 (EN)

Run a model in DESeq2 to test effect of tissue x treatment interaction on normalized batch-controlled gene-wise data: dds.02 <- DESeqDataSetFromMatrix(countData = countdata, colData = phenDat.sva, design = ~ SV1 + batch + tissue*treatment)

The following pairwise comparisions for treatment effect: rr1 <- results(dds_deseq.02, contrast=c('tissue', "H","B"), alpha = 0.05) rr2 <- results(dds_deseq.02, contrast=c('tissue', "O","B"), alpha = 0.05) rr3 <- results(dds_deseq.02, contrast=c('tissue', "O","H"), alpha = 0.05)

and,

tissue effect: ff1 <- results(dds_deseq.02, contrast=c('treatment', "C","DR"), alpha = 0.05) ff2 <- results(dds_deseq.02, contrast=c('treatment', "C","HS"), alpha = 0.05) ff3 <- results(dds_deseq.02, contrast=c('treatment', "DR","HS"), alpha = 0.05)

resulted in tables of DEGs.

I compiled a list of 26 genes previously identified in QTL studies of diet treatments (Stanley et al, 2017): /processed/DESEQ/DEG_QTL/iis_genelist.csv". I identify any of these which are also significantly expressed in this study: files written to /processed/DESEQ/DEG_QTL/

Genes identified are collected in one file: /processed/DESEQ/DEG_QTL/qtl_genes_all_FlyBase.csv. Also present in this file is FlyBase genome and functional data for each gene from a batch search.

2018-08-30 (EGK)

Developed compare_StringTie_method script. Compares: 1) gene count table generated from stringtie run that includes estimating novel transcripts, 2) gene count table generated from stringtie run that includes estimating novel transcripts followed by perl script to get back FBgn names, 3) gene count table ignoring novel transcripts

General comparison:

Type 1 has very few FBgn names. When comparing type 2 and 3 (merging on gene names), those with differences are almost all a MSTRG that is split up in the perl script to assign gene names. So some reads are assigned to a gene and some just get an MSTRG number (both share the same string tie MSTRG number). Another set are genes with more than one assigned with the perl script. It makes sense these do not agree. Those that are left ~800 I am not sure why they don't agree. For one I looked at, the pattern matches what modEncode reports on flybase (low ovary expression) for type 3, while type 2 and 1 are both way higher for all tissues including ovaries. In conclusion, I feel good about using type 3 for GENE level analysis.

Only 200 gene names in type 3 are not represented in type 2. Not sure why for these.

2018-08-27 (EN)

Implementing an alternative workflow. A visual of the short protocol is here: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de

Run stringtie -eB on the output of samtools -sort (ie. step 2 of Pertea et al, 2016)

Gene count tables are written to /processed/S03_short_ballG/ gene_count_matrix and transcript_count_matrix

2018-08-27 (EGK)

Attempting to generate novel transcripts and get an accurate gene level analysis is complex. We plan to perform separate pipelines for a transcript level and gene level analysis.

Gene level analysis: we will use the reference annotation rather than performing the merge step in StringTie followed by DEseq

Transcript level analysis: we will use the StringTie-Ballgown pipeline

Exploration of the gene count table generated from the data using the merge file before and after assigning a reference gene gave different results. Some MSTRGs are split up into those that are assigned to a gene and those that are not.

2018-08-20 (EN)

Get more gene ids tranferred from StringTie by running a perl script mstrg_prep.pl explained here: gpertea/stringtie#179 and documented here: https://gist.github.com/gpertea/b83f1b32435e166afa92a2d388527f4b
Command: perl mstrg_prep.pl stringtie_merged.gtf > stringtie_merged_pl.gtf
Then, re-run step 6 (Pertea et al, 2016) to compute transcript abundances: sbatch stringtie_abundances.sh
Results stored in ballG_pl
Problem: duplicated MSTRG and Fbgn ids. Perl script appears to assign transcript if not sure, to the same MSTRG id creating duplicates.

2018-07-30 (EN)

Implemented DESeq2 on svaseq-prepped read count data
22 rows did not converged when running ddsDE.ttSS <- DESeq(dds.ttSS)
Dealing with autocorrelation: see https://support.bioconductor.org/p/65091/ and tried:
1. Filter out normalized count of at least 10 reads in two or more samples.
- filter <- rowSums(nc >= 10) >= 2
- 8 rows did not converge
1. Try increasing the 2 sample requirement to 3 or 4
- filter <- rowSums(nc >= 10) >= 3
- 6 rows did not converge
1. omit non-converging rows from the results step
- ddsDE_Clean <- ddsDE_filt10[which(mcols(ddsDE_filt10)$betaConv),]
- didn't run, but also doesn't feel a good approach!
1. increase the maximum iterations (maxit) instead of running DESeq()
- filter <- rowSums(nc >= 10) >= 3 plusnbinomWaldTest(dds, maxit=500)
1. filter <- rowSums(nc >= 5) >= 2 plus nbinomWaldTest(maxit=500)
- 9 rows don't converge

2018-07-24 (EGK)

Looked through DEseq code to check
Implemented svaseq to check for batch effects
- Will follow this: https://support.bioconductor.org/p/80755/
- Needed to remove all genes with sample counts that were all zeros. (gave this error: Error in density.default(x, adjust = adj) : 'x' contains missing values)
- sva identified 2 svas
- clearly 4 groupings when plotting svas- unknown sources

2018-07-20 (EN)

Performed first DESeq model: design = ~ treatment + tissue. Need to 1) do batch analysis and 2) include an interaction term treatment*tissue

2018-07-19 (EN)

Sub-structured the scripts directory into Ballgown_scripts and DESeq_scripts
Renamed /processed/ballG_all/ as /processed/ballgown. The directory structure created bt StringTie is required prepDE.py to compile counts (see http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de)

2018-07-17 (EN)

Started a new, streamlined DiffExpr_batch.Rmd script. Script with all models tried up to now is pushed to script_sink/ an DiffExpr_batch_old.Rmd

2018-07-11 (EN)

Move all ballgown stuff from DiffExpr_batch.Rmd to DiffExpr.Rmd
clean redundant code (i.e. same code in multiple scripts)
change column group in phenDat to batch

2018-07-09

Plan for analysis: (EN)

Compare ballgown and sva differential expression analyses

Ballgown done in DiffExpr.Rmd
Prep data for sva - dataPrep_batch.Rmd
Do sva - DiffExpr_batch.Rmd
Compare sva & ballgown

2018-06-20 (EN & EGK)

Decided: 1) check that known batches in our case have effect on expression (a. include batch in PCA, b. test with sva or linear model) 2) run svaseq without known bathes

2018-06-19 (EN)

Tested for batch effects using SVA package. The "be" and "leek" methods applied ns.v() produced 11 and 0 SVs, respectively.Wondering why the difference we read:
- Leek_2014_ svaseq: removing batch effects and other unwanted noise from sequencing data,
- Leek_2011_Asymptotic Conditional Singular Value Decomposition for High-Dimensional Genomic Data,
- https://support.bioconductor.org/p/97469/,
- Jaffe et al_2015_ Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis
- Found that methods can produce diferent results

FilesExpand file tree

projectNotes.md

Latest commit

History