A set of code to parse GenBank (.gbff) files containing sequence information for Desulfovibrio vulgaris str. Hildenborough, as well as files with functional annotation of the protein-coding regions of this genome. The code extracts the hypothetical proteins from the genome sequence and compiles annotations of these proteins from several predictive methods into several .csv files. Each code file in the repo has comments at the top describing its purpose.
The four files MakeHypoTable.py, MakeECtable.py, MakeSummaryTable.py, and MakeProbTable.py create the main output tables Hypo.csv, ECtable.csv, SummaryTable.csv, and ProbTable.csv, respectively. These files import a long chain of other files throughout the directory to create the tables. The primary method for constructing tables is to assemble a Python dictionary with hypothetical protein IDs as keys and lists of the information of interest as values. The dictionary is then converted to a pandas DataFrame and written to a .csv file.
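The dictionary-to-table pattern described above can be sketched roughly as follows. This is a minimal illustration rather than the code in the repo: the protein IDs, column names, and values are invented placeholders.

```python
import pandas as pd

# Hypothetical protein IDs as keys; each value is a list of fields.
# The IDs, column names, and values are placeholders for illustration only.
hypo_info = {
    "PROT_0001": ["locus_A", 120],
    "PROT_0002": ["locus_B", 345],
}

# orient="index" turns each dictionary key into a row label.
table = pd.DataFrame.from_dict(
    hypo_info, orient="index", columns=["locus_tag", "length"]
)
table.index.name = "protein_id"

# With no path argument, to_csv returns the CSV text; pass a file name
# such as "Hypo.csv" to write the table to disk instead.
csv_text = table.to_csv()
print(csv_text)
```

Keeping the protein ID as the row index means tables built from different prediction methods can later be aligned on the same key.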
The first files written, which are also the primary files in pipeline order, are Hypo.py, DSVtags.py, DSVcheck.py, DSVtypes.py, MakeValues.py, MakeHypoTable.py, ResParse.py, and GoneList.py. These files extract basic information about the hypothetical proteins from GenBank files. The March version of the DSV genome annotation was primarily used, but the October annotation is also available and was compared to the March annotation. The Retrieve2014.py file fetches the 2014 DSV chromosome genome and the 2016 DSV plasmid genome.
The two directories SalsaPDB and BlitsPDB contain code that can be used to download all PDB files in the output of SAdLSA and hhblits, respectively. These PDB files must be downloaded into those directories for SalsaGetUP.py and BlitsGetUP.py to run. Both SalsaGetUP.py and BlitsGetUP.py are required by Brenda.py and MakeECtable.py, so the download should be completed before running any of these scripts. The download is time-consuming and can take 30-40 minutes.
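Because the scripts above depend on the downloaded PDB files, a quick pre-flight check can save a failed run. The sketch below is not part of the repo. It assumes the downloaded files end in ".pdb", which may not match the actual file names, and the function name is hypothetical.

```python
from pathlib import Path


def missing_pdb_dirs(base_dir, dir_names=("SalsaPDB", "BlitsPDB"), pattern="*.pdb"):
    """Return the directory names under base_dir that hold no files matching pattern."""
    missing = []
    for name in dir_names:
        folder = Path(base_dir) / name
        # A directory counts as missing if it is absent or has no matching file.
        if not folder.is_dir() or not any(folder.glob(pattern)):
            missing.append(name)
    return missing


# Example: check the current directory before running SalsaGetUP.py or BlitsGetUP.py.
missing = missing_pdb_dirs(".")
if missing:
    print("Download PDB files first for:", ", ".join(missing))
```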
A couple of files were compressed into tar files because of their large size. ‘FileTypes’ is a collection of two FASTA files, an October .gbff file (this file is also in the main directory as OctDSV.gbff), and a .gff file with information about genes and proteins of DSV Hildenborough. Aside from the October .gbff file, none of these are needed to run any code in the main directory, but they were added for possible future use. The other tarred file is ‘brenda_download’, a very large file containing lists of proteins for each Enzyme Commission (EC) number according to the BRENDA enzyme database. This file is read by Brenda.py to find EC numbers, and Brenda.py is imported by MakeECtable.py, so the brenda_download file should be extracted and available before running either of those files.
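The archives can be unpacked with any tar tool. As one option, here is a small sketch using Python's standard tarfile module. The archive file name in the example is a guess, since the exact names and extensions of the tar files are not given here, and extract_archive is a hypothetical helper.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Unpack a tar archive into dest_dir and return the names of its members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect gzip, bz2, or xz compression automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Hypothetical file name; adjust it to the actual archive in the repo.
archive = Path("brenda_download.tar")
if archive.exists():
    print(extract_archive(archive))
```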
MakeValues.py creates the MarProtIDList, which is used throughout most of the files as the list of protein IDs for the hypotheticals from the March annotation. MarProtIDList is used to match up the hypothetical protein IDs with information/predictions about them from DeepEC, SAdLSA, hhblits, etc.
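The matching step can be pictured as a lookup keyed on protein ID. The sketch below is only an illustration: the IDs, the prediction strings, and the match_predictions helper are invented, and the real files may handle missing predictions differently.

```python
def match_predictions(protein_ids, predictions, missing="NA"):
    """Return {protein_id: prediction} for every ID, keeping the order of protein_ids."""
    return {pid: predictions.get(pid, missing) for pid in protein_ids}


# Placeholder stand-in for MarProtIDList and for one method's predictions.
mar_prot_id_list = ["PROT_0001", "PROT_0002", "PROT_0003"]
method_calls = {"PROT_0002": "EC:1.1.1.-", "PROT_9999": "EC:2.7.-.-"}

matched = match_predictions(mar_prot_id_list, method_calls)
print(matched)
```

Driving the lookup from the ID list rather than from the prediction file means every hypothetical protein gets a row, even when a method made no call for it, and predictions for non-hypothetical proteins are dropped.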
The files ResParse.py and CountRes.py are used on the output from hhblits and SAdLSA. The term ‘res’ refers to the name of the original directory with files from the Georgia Tech collaboration.
Files associated with the DeepEC method include 3digit_EC_prediction.txt, Enzyme_prediction.txt, 3digitHypos.txt, BinaryHypos.txt, CheckRelate.py, and ExtractEnzymes.py. ExtractEnzymes.py extracts the rows of 3digit_EC_prediction.txt and Enzyme_prediction.txt that correspond to hypothetical proteins and writes these rows to 3digitHypos.txt and BinaryHypos.txt. EnzymeTranslate.py translates EC numbers to descriptions of the enzymes’ biological functions.
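The row extraction can be sketched as a simple filter. This is not the code in ExtractEnzymes.py: it assumes tab-separated rows with the protein ID in the first column, which may not match the actual DeepEC output layout, and the rows shown are placeholders.

```python
def filter_rows(lines, wanted_ids, sep="\t"):
    """Keep the lines whose first field is one of wanted_ids."""
    wanted = set(wanted_ids)  # set lookup keeps the filter fast on large files
    return [line for line in lines if line.split(sep, 1)[0].strip() in wanted]


# Placeholder rows standing in for lines read from a prediction file.
rows = [
    "PROT_0001\tEC:1.1.1\n",
    "PROT_0005\tEC:2.7.1\n",
    "PROT_0002\tEC:3.4.21\n",
]
kept = filter_rows(rows, ["PROT_0001", "PROT_0002"])
print("".join(kept), end="")
```

In practice the lines would come from iterating over an open prediction file, and the kept lines could be written out with a file object's writelines method.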
The MultispeciesHypotheticals directory is a collection of files used to extract DeepEC results on hypotheticals of DSVH and separate them into Multispecies and Non-Multispecies proteins. This was a small assignment that’s not connected to any code in the main directory.
The two files SalsaGetUP.py and BlitsGetUP.py share a lot of repeated content and could be condensed into a single file.