This project develops machine learning models that predict a species' biological kingdom from its codon usage frequencies. The dataset comprises over 13,000 samples, each described by the usage frequencies of the 64 codons, which supports multi-class classification across biological kingdoms.
Source: Codon Usage Dataset on Kaggle
- Logistic Regression (with multinomial classification)
- Support Vector Machines (SVM)
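
As a rough illustration of how the two model families above might be compared, here is a minimal sketch using scikit-learn. The library choice, the `StandardScaler` step, the hyperparameters, and the synthetic Dirichlet data (a stand-in for real codon frequencies, with 4 made-up classes) are all assumptions for the example, not this project's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_codons, n_per_class = 4, 64, 150  # hypothetical sizes

# Synthetic stand-in for the dataset: each class draws 64 codon
# frequencies (summing to 1) from its own Dirichlet distribution.
X_parts, y_parts = [], []
for k in range(n_classes):
    alpha = rng.uniform(0.5, 5.0, size=n_codons)
    X_parts.append(rng.dirichlet(alpha, size=n_per_class))
    y_parts.append(np.full(n_per_class, k))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)

# Stratified split keeps class proportions equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # With the default lbfgs solver, multi-class targets are fit
    # with a multinomial (softmax) loss.
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

On the real data, `X` would hold the 64 codon frequency columns and `y` the kingdom labels. Accuracy on this synthetic data says nothing about performance on the actual dataset.
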
conda env create -f env.yml
conda activate codon-classification

- Best Performing Model: SVM with RBF kernel
- Well-classified Kingdoms: Bacteria, Viruses, Vertebrates, Plants (F1-scores ~0.96)
- Challenging Categories: Low-sample classes (Archaea, Phage)
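
The per-kingdom F1-scores in the results above are a per-class metric. The sketch below shows one way such scores could be computed from true and predicted labels. The helper `per_class_f1` and the toy labels are hypothetical and chosen only to illustrate the formula; they are not results from this project.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was something else
            fn[t] += 1  # missed a true instance of t
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels (made up): a large class and a small one.
y_true = ["Bacteria"] * 4 + ["Archaea"] * 2
y_pred = ["Bacteria"] * 5 + ["Archaea"]

f1 = per_class_f1(y_true, y_pred)
for label, score in f1.items():
    print(f"{label}: F1 = {score:.3f}")
```

In this toy case a single misclassified Archaea sample pulls that class's F1 down to about 0.667, while Bacteria stays near 0.889. This illustrates how per-class scores for small classes are sensitive to individual errors.
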