- Project Overview
- Project Objective
- Data Sources
- Data Preprocessing
- Machine Learning Model
- Evaluation Metrics
- Key Insights
- Conclusion
The Diabetes Prediction using Machine Learning project aims to develop a robust model that identifies patients who have diabetes. By applying machine learning techniques, the project improves the accuracy of diabetes prediction, allowing for timely and targeted preventive measures.
The primary objective of this project is to build and train a robust machine learning model that can accurately predict the presence or absence of diabetes among patients.
The dataset used in this project was provided by 10Alytics, a company I have worked with for the past 6 months. It contains features extracted from patients' medical histories, including smoking history, BMI and blood glucose level.
Before feeding the data into the machine learning model, extensive data preprocessing was performed. This included handling missing or null values, checking for duplicates, data normalisation and standardisation. Additionally, feature engineering techniques were applied to extract relevant information from the raw data.
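The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the toy records, the column names (`bmi`, `blood_glucose_level`, `smoking_history`, `diabetes`), the median and `"unknown"` fill strategies, and the `preprocess` helper are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy records; the real dataset's schema may differ.
raw = pd.DataFrame({
    "bmi": [22.5, 31.0, None, 27.3, 31.0],
    "blood_glucose_level": [90, 160, 140, 110, 160],
    "smoking_history": ["never", "current", "former", None, "current"],
    "diabetes": [0, 1, 1, 0, 1],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, fill missing values, standardise and encode features."""
    out = df.drop_duplicates().copy()

    # Assumed strategy: median for numeric gaps, a placeholder for categorical gaps.
    out["bmi"] = out["bmi"].fillna(out["bmi"].median())
    out["smoking_history"] = out["smoking_history"].fillna("unknown")

    # Standardise numeric features to zero mean and unit variance.
    for col in ["bmi", "blood_glucose_level"]:
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)

    # One-hot encode the categorical feature so models can consume it.
    out = pd.get_dummies(out, columns=["smoking_history"])
    return out.reset_index(drop=True)


clean = preprocess(raw)
print(clean.shape)
```

In a real workflow, the scaling statistics would normally be computed on the training split only and then applied to the test split, so that no information from the test data leaks into training.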
The Stark Hospital diabetes prediction model was built using a supervised machine learning approach. The data was split into training and test sets in an 80:20 ratio. Several classification algorithms were tried, including but not limited to:
- Logistic Regression
- Random Forest
- K-Nearest Neighbors
- Support Vector Machine
- XGB Classifier
- Decision Tree

After extensive experimentation and hyperparameter tuning, the models were compared on prediction accuracy, alongside other performance metrics such as precision, recall and ROC score. The final model was selected based on its performance and generalisation capabilities.
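The split, comparison and tuning workflow described above might look something like the sketch below, using scikit-learn. Everything specific here is an assumption for illustration: the data is synthetic (generated with `make_classification`), the split is stratified, only four of the listed classifiers are included (the XGB Classifier is left out because it requires the separate `xgboost` package), and the `max_depth` grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the patient data (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# 80:20 train/test split; stratifying on y is an assumption, not stated in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each candidate and record its accuracy on the held-out test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 4))

# Hyperparameter tuning for one candidate via 5-fold cross-validated grid search.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_)
```

The grid search only ever sees the training split, so the test set stays untouched until the final comparison.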
To assess the performance of the machine learning model, the following evaluation metrics were used:
- Precision: The proportion of patients classified as diabetic who actually have diabetes.
- Recall: The proportion of all actual positives that were classified correctly as positives.
- Accuracy: The overall proportion of correctly predicted patients (both positive and negative).
- ROC: The trade-off between the true positive rate and the false positive rate across classification thresholds.
- After cross-validation, the model with the highest accuracy will be deployed. Accuracy is the primary selection metric in this project; because the target classes are significantly imbalanced, it is read alongside precision, recall and ROC score.
- The confusion matrices for two models (Random Forest and Logistic Regression) show each model's errors as False Positives (patients predicted to have diabetes who do not have it) and False Negatives (patients predicted not to have diabetes who do have it).
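To make the metric definitions above concrete, here is a small dependency-free sketch that derives precision, recall and accuracy from the four confusion-matrix counts. The labels are made up for illustration (1 = diabetic, 0 = not diabetic), and the helper names `confusion_counts` and `classification_metrics` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Compute precision, recall and accuracy from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        # Of the patients predicted diabetic, how many really are.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the patients who really are diabetic, how many were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of all patients classified correctly.
        "accuracy": (tp + tn) / len(y_true),
    }


# Made-up labels: 3 diabetic patients out of 10.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 6)
print(classification_metrics(y_true, y_pred))
```

In this toy case there is one False Positive and one False Negative, giving a precision and recall of 2/3 each and an accuracy of 0.8.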
The primary objective of this project was to apply different machine learning algorithms to predict the presence or absence of diabetes. Eight machine learning models were compared on prediction accuracy, alongside other performance metrics such as precision, recall and ROC score, to determine the most effective one. Of the models investigated, the XGB Classifier significantly outperformed the others, achieving an accuracy of 97.16%.
