ProteinFunctionalAnnotationScripts

A set of scripts that parse GenBank (.gbff) files containing sequence information for Desulfovibrio vulgaris str. Hildenborough, along with files containing functional annotations of the protein-coding regions of this genome. The scripts extract the hypothetical proteins from the genome sequence and compile annotations of these proteins, drawn from several predictive methods, into .csv files. Each script in the repo has comments at the top describing its purpose.

The four scripts MakeHypoTable.py, MakeECtable.py, MakeSummaryTable.py, and MakeProbTable.py create the main output tables Hypo.csv, ECTable.csv, SummaryTable.csv, and ProbTable.csv, respectively. These scripts import a long chain of other files from the directory to build the tables. The primary method for constructing tables is to assemble a Python dictionary with hypothetical protein IDs as keys and lists of the information of interest as values. The dictionary is then converted to a pandas DataFrame and written to a .csv file.
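The dictionary-to-table pattern described above can be sketched as follows. This is a minimal illustration, not the code from the Make*Table.py scripts: the function name `table_to_csv`, the column names, and the protein IDs are hypothetical placeholders.

```python
import pandas as pd


def table_to_csv(rows_by_id, columns):
    """Convert {protein_id: [value, ...]} into CSV text, one row per protein.

    `rows_by_id` and `columns` are illustrative; the real scripts may use
    different column names and write straight to a file instead.
    """
    # orient="index" makes each dictionary key a row label.
    df = pd.DataFrame.from_dict(rows_by_id, orient="index", columns=columns)
    df.index.name = "ProteinID"  # hypothetical header for the ID column
    # With no path argument, to_csv returns the CSV as a string.
    return df.to_csv()


# Hypothetical example data: protein ID -> [locus tag, length]
example = {
    "PROT_A": ["TAG_1", 99],
    "PROT_B": ["TAG_2", 150],
}
csv_text = table_to_csv(example, ["LocusTag", "Length"])
```

Passing a file name to `DataFrame.to_csv` instead of calling it with no arguments writes the table to disk.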

The first files written, which are also the primary files of the pipeline (listed in pipeline order), include Hypo.py, DSVtags.py, DSVcheck.py, DSVtypes.py, MakeValues.py, MakeHypoTable.py, ResParse.py, and GoneList.py. These files extract basic information about the hypothetical proteins from GenBank files. The March version of the DSV genome annotation was the one primarily used; the October annotation is also available and was compared against the March annotation. Retrieve2014.py fetches the 2014 DSV chromosome genome and the 2016 DSV plasmid genome.
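As a rough illustration of extracting hypothetical proteins from a GenBank flat file, the sketch below scans the feature table for CDS features whose product is "hypothetical protein" and collects their protein IDs. It is a simplified standard-library parser written for this README, not the logic in Hypo.py: it assumes each qualifier fits on one line, and the names `hypothetical_protein_ids` and `SAMPLE`, along with the sample IDs, are made up.

```python
import re

# A feature key (e.g. "CDS") starts in column 6 of a GenBank feature table.
FEATURE_START = re.compile(r"^ {5}(\S+)\s+\S")
# A qualifier line looks like: /name="value" (single-line values only here).
QUALIFIER = re.compile(r'^\s+/(\w+)="?([^"]*)"?\s*$')


def hypothetical_protein_ids(gbff_text):
    """Return protein IDs of CDS features annotated as 'hypothetical protein'."""
    ids = []
    current = None  # qualifiers of the CDS being read, or None outside a CDS
    # The trailing sentinel line closes a CDS that ends the text.
    for line in gbff_text.splitlines() + ["//"]:
        start = FEATURE_START.match(line)
        if start or not line.startswith(" "):
            # A new feature or a new section ends the previous CDS.
            if (current is not None
                    and current.get("product") == "hypothetical protein"
                    and "protein_id" in current):
                ids.append(current["protein_id"])
            current = {} if (start and start.group(1) == "CDS") else None
            continue
        qualifier = QUALIFIER.match(line)
        if current is not None and qualifier:
            current[qualifier.group(1)] = qualifier.group(2)
    return ids


# Tiny hand-made feature table; the IDs are placeholders, not real accessions.
PAD = " " * 21
SAMPLE = "\n".join([
    "FEATURES             Location/Qualifiers",
    " " * 5 + "CDS" + " " * 13 + "1..300",
    PAD + '/locus_tag="TAG_0001"',
    PAD + '/product="hypothetical protein"',
    PAD + '/protein_id="PROT_0001.1"',
    " " * 5 + "CDS" + " " * 13 + "complement(400..900)",
    PAD + '/product="example kinase"',
    PAD + '/protein_id="PROT_0002.1"',
    "ORIGIN",
])
```

Real .gbff files wrap long qualifier values across lines, so a dedicated GenBank parser is the safer choice for anything beyond a quick check.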

The two directories SalsaPDB and BlitsPDB contain code that downloads all PDB files named in the output of SAdLSA and hhblits, respectively. These PDB files must be downloaded into those directories before SalsaGetUP.py and BlitsGetUP.py can run. Both SalsaGetUP.py and BlitsGetUP.py are required by Brenda.py and MakeECtable.py, so the download should be completed before running any of those scripts. The download is time-consuming and can take 30-40 minutes.
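Because the download is slow, it helps to skip files that are already present. The sketch below shows one way to do that; it is not the code in SalsaPDB or BlitsPDB. It assumes hit identifiers have the form `1ABC_A` (PDB ID, underscore, chain) and that files are fetched from the RCSB download URL shown in the code. The function names are hypothetical.

```python
import os
import urllib.request


def pdb_id_from_hit(hit_id):
    """Return the PDB ID from a hit such as '1abc_A' (assumed ID_chain format)."""
    return hit_id.split("_", 1)[0].upper()


def pdb_download_url(pdb_id):
    """Build the RCSB file-download URL for a PDB entry."""
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def download_missing(hit_ids, dest_dir):
    """Download each distinct PDB file not already in dest_dir.

    Returns the list of PDB IDs that were actually fetched.
    """
    os.makedirs(dest_dir, exist_ok=True)
    fetched = []
    # A set removes duplicate entries (several hits can share one PDB ID).
    for pdb_id in sorted({pdb_id_from_hit(h) for h in hit_ids}):
        target = os.path.join(dest_dir, f"{pdb_id}.pdb")
        if os.path.exists(target):
            continue  # already downloaded on an earlier run
        urllib.request.urlretrieve(pdb_download_url(pdb_id), target)
        fetched.append(pdb_id)
    return fetched
```

Rerunning `download_missing` after an interrupted run only fetches the files that are still missing.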

A couple of files were compressed into tar archives because of their size. ‘FileTypes’ is a collection of two FASTA files, an October .gbff file (also in the main directory as OctDSV.gbff), and a .gff file with information about the genes and proteins of DSV Hildenborough. Aside from the October .gbff file, none of these are needed to run any code in the main directory; they were added for possible future use. The other archive is ‘brenda_download’, a very large file containing lists of proteins for each Enzyme Commission (EC) number according to the BRENDA enzyme database. Brenda.py reads this file to find EC numbers, and Brenda.py is imported by MakeECtable.py, so the brenda_download file must be available before running either of those scripts.

MakeValues.py creates the MarProtIDList, which is used throughout most of the files as the list of protein IDs for the hypotheticals from the March annotation. MarProtIDList is used to match up the hypothetical protein IDs with information/predictions about them from DeepEC, SAdLSA, hhblits, etc.
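The matching step can be pictured as a lookup of each ID in the list against a method's predictions. The sketch below is illustrative only: `match_predictions`, the placeholder value "NA", and the sample IDs and EC numbers are assumptions, not taken from MakeValues.py.

```python
def match_predictions(protein_ids, predictions, missing="NA"):
    """Map every protein ID to its prediction, keeping the order of protein_ids.

    IDs with no prediction get the `missing` placeholder so that every
    hypothetical protein still appears in the final table.
    """
    return {pid: predictions.get(pid, missing) for pid in protein_ids}


# Hypothetical stand-ins for MarProtIDList and one method's predictions.
id_list = ["PROT_A", "PROT_B", "PROT_C"]
ec_predictions = {"PROT_A": "1.1.1", "PROT_C": "2.7.1", "PROT_X": "3.4.2"}
matched = match_predictions(id_list, ec_predictions)
```

Predictions for IDs outside the list (here `PROT_X`) are dropped, which keeps the output restricted to the hypothetical proteins.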

The files ResParse.py and CountRes.py are used on the output from hhblits and SAdLSA. The term ‘res’ refers to the name of the original directory with files from the Georgia Tech collaboration.

Files associated with the DeepEC method include 3digit_EC_prediction.txt, Enzyme_prediction.txt, 3digitHypos.txt, BinaryHypos.txt, CheckRelate.py, and ExtractEnzymes.py. ExtractEnzymes.py extracts the rows of 3digit_EC_prediction.txt and Enzyme_prediction.txt that correspond to hypothetical proteins and writes them to 3digitHypos.txt and BinaryHypos.txt. EnzymeTranslate.py translates EC numbers into descriptions of the enzymes' biological functions.
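The row extraction can be sketched as a filter on the first column of each prediction file. This is a guess at the general shape, not the contents of ExtractEnzymes.py: it assumes tab-separated rows with the protein ID in the first field and an optional header line, and the function name and sample rows are hypothetical.

```python
def extract_hypothetical_rows(lines, hypothetical_ids, sep="\t", has_header=True):
    """Keep the rows whose first field is a hypothetical protein ID.

    `sep` and `has_header` describe an assumed file layout; adjust them to
    match the actual prediction files.
    """
    wanted = set(hypothetical_ids)  # set lookup keeps the filter fast
    kept = []
    for index, line in enumerate(lines):
        if has_header and index == 0:
            kept.append(line)  # carry the header over unchanged
            continue
        first_field = line.split(sep, 1)[0].strip()
        if first_field in wanted:
            kept.append(line)
    return kept


# Hypothetical rows in the assumed layout.
sample_rows = [
    "Query ID\tPrediction",
    "PROT_A\tEC:1.1.1",
    "PROT_Z\tEC:2.7.1",
    "PROT_C\tEC:3.4.2",
]
filtered = extract_hypothetical_rows(sample_rows, ["PROT_A", "PROT_C"])
```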

The MultispeciesHypotheticals directory is a collection of files used to extract DeepEC results on hypotheticals of DSVH and separate them into Multispecies and Non-Multispecies proteins. This was a small assignment that’s not connected to any code in the main directory.

The two files SalsaGetUP.py and BlitsGetUP.py share a lot of repeated content and could be condensed into a single file.
