Stroke Prediction Analysis 🚨
A comprehensive pipeline for predicting stroke risk using real-world clinical data, with an emphasis on handling severe class imbalance and optimizing clinically relevant metrics.
Predicting stroke risk in patients is critical for early intervention. This project:
- Utilizes a real-world Stroke Prediction dataset from Kaggle.
- Tackles severe class imbalance (~4% stroke cases).
- Compares multiple classifiers (e.g., Logistic Regression, XGBoost).
Source: Kaggle Stroke Prediction Dataset
| Feature Category | Columns |
|---|---|
| Demographics | gender, age |
| Health Metrics | avg_glucose_level, bmi |
| Medical History | hypertension, heart_disease |
| Lifestyle & Social | ever_married, work_type, residence_type, smoking_status |
| Target | stroke (0 = No, 1 = Yes) |
After cleaning (dropping missing BMI, removing outliers), 4,909 records remain.
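The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `clean_stroke_frame` and the four-row sample frame are hypothetical, and the outlier rule is left out because it is not specified here.

```python
import pandas as pd


def clean_stroke_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names and drop rows with a missing BMI."""
    out = df.copy()
    # Trim, lowercase, and snake_case the column names
    # (e.g. "Residence_type" becomes "residence_type").
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    # Drop records without a BMI value; outlier removal is not shown
    # because the criteria are not stated in this README.
    return out.dropna(subset=["bmi"]).reset_index(drop=True)


# Hypothetical sample standing in for the Kaggle CSV.
raw = pd.DataFrame({
    "Age": [67.0, 61.0, 80.0, 49.0],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban"],
    "bmi": [36.6, None, 32.5, 34.4],
    "stroke": [1, 1, 1, 0],
})

clean = clean_stroke_frame(raw)
print(clean.shape)
```

On the real dataset, the same function would be applied to the frame returned by `pd.read_csv` before any encoding or splitting.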
- Data Loading & Cleaning
  - Standardize column names
  - Handle missing BMI values
- Feature Engineering
  - Encoding for linear models
  - Label Encoding for tree-based models
- Train-Test Split
  - Stratified (80% train / 20% test)
- Scaling
  - `StandardScaler` for distance-based models
- Imbalance Handling
  - `SMOTE` in an `imblearn.Pipeline` (avoids data leakage)
- Model Training & Tuning
  - KNN, Logistic Regression, Decision Tree, Random Forest, XGBoost
  - `GridSearchCV` optimizing F1 on the stroke class
- Threshold Tuning
  - Adjust probability cut-offs for high-recall vs balanced use cases
- `GridSearchCV` for optimal hyperparameters and the SMOTE sampling ratio
- Custom thresholds (e.g., 0.20 vs. 0.50) to target recall or precision based on clinical needs
- Logistic Regression offers the best balance (highest F1 & ROC AUC).
- XGBoost at a 0.20 threshold excels in recall, which makes it well suited to screening scenarios.
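To make the threshold trade-off concrete, the sketch below compares precision and recall at the 0.20 and 0.50 cut-offs. The labels and probabilities are made up for illustration and are not results from this project; `precision_recall_at` is a hypothetical helper.

```python
import numpy as np


def precision_recall_at(y_true, probs, threshold):
    """Return (precision, recall) for the positive class at a cut-off."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(probs) >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted stroke probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
probs = [0.05, 0.10, 0.25, 0.30, 0.15, 0.02, 0.22, 0.60, 0.45, 0.40]

for t in (0.20, 0.50):
    p, r = precision_recall_at(y_true, probs, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy data the lower cut-off catches every positive case at the cost of more false alarms, while the higher cut-off is more precise but misses cases, which is the same trade-off the two use cases above are choosing between.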
```python
import joblib
import pandas as pd

# Load the trained model
model = joblib.load('gridsearch/xgb_model.pkl')

# Predict on new data
new_data = pd.read_csv('csv_files/sample_patient.csv')
pred_probs = model.predict_proba(new_data)[:, 1]

# Flag a patient as positive when the stroke probability is at least 0.20
pred_labels = (pred_probs >= 0.2).astype(int)
```

- Fork the repo
- Create a feature branch (`git checkout -b feature/YourFeature`)
- Commit your changes (`git commit -m "Add new analysis"`)
- Push (`git push origin feature/YourFeature`)
- Open a Pull Request
This project is licensed under the MIT License. See LICENSE for details.
- Email: michelemontalvo@outlook.com
- GitHub: github.com/Mic-dev-gif
- Email: jorgemmlrodrigues@gmail.com
- GitHub: github.com/JorgeMMLRodrigues
Feel free to open issues or discussions!