First of all, thanks for developing DeepBGC and making it available to the community.
I came across a bug in `HmmscanPfamRecordAnnotator` when it builds the `proteins_by_id` dictionary. The util function `get_proteins_by_id` currently loops through all the potential protein ids of a feature (e.g. `unique_protein_id`, `protein_id` and `locus_tag`). As a result, a feature whose id is based on its `protein_id` qualifier can be overwritten by another feature that shares the same `protein_id` but was deduplicated using `unique_protein_id`. This causes `PFAM_domain` features to be placed incorrectly in the genomic sequence: the `protein_id` used in the `hmmscan` output file matches a different feature, so the wrong feature location is picked.
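To make the overwrite concrete, here is a minimal, self-contained sketch. It is not DeepBGC's actual code: the `Feature` tuple, `candidate_ids`, `index_by_all_ids` and `index_by_primary_id` are hypothetical names I made up to mimic the behaviour described above, and the coordinates are invented. The second indexing function is only one possible way to avoid the collision, assuming each feature should be registered under a single id.

```python
from collections import namedtuple

# Hypothetical stand-in for a CDS feature; not DeepBGC's real data model.
Feature = namedtuple(
    "Feature", ["unique_protein_id", "protein_id", "locus_tag", "start", "end"]
)


def candidate_ids(feature):
    """Return the ids a feature could be known by, in priority order."""
    ids = (feature.unique_protein_id, feature.protein_id, feature.locus_tag)
    return [i for i in ids if i]


def index_by_all_ids(features):
    """Mimics the reported behaviour: every candidate id becomes a key,
    so a later feature can overwrite an earlier one that shares an id."""
    proteins_by_id = {}
    for feature in features:
        for protein_id in candidate_ids(feature):
            proteins_by_id[protein_id] = feature
    return proteins_by_id


def index_by_primary_id(features):
    """One possible alternative (an assumption, not the project's fix):
    register each feature only under its highest-priority id."""
    proteins_by_id = {}
    for feature in features:
        ids = candidate_ids(feature)
        if ids:
            proteins_by_id[ids[0]] = feature
    return proteins_by_id


# Feature a is identified by protein_id "P1"; feature b shares that
# protein_id but was deduplicated to unique_protein_id "P1_2".
a = Feature(None, "P1", "tagA", 100, 400)
b = Feature("P1_2", "P1", "tagB", 5000, 5300)

# hmmscan reports domains for "P1" (meaning feature a).
print(index_by_all_ids([a, b])["P1"].start)     # 5000: feature b, wrong location
print(index_by_primary_id([a, b])["P1"].start)  # 100: feature a, as intended
```

With the all-ids index, a lookup of `P1` returns the deduplicated feature, which is the misplacement described above.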