
Stroke Prediction Analysis 🚨

Python 3.8+ License: MIT

A comprehensive pipeline for predicting stroke risk using real-world clinical data, with an emphasis on handling severe class imbalance and optimizing clinically relevant metrics.


📋 Table of Contents

  1. Project Overview
  2. Dataset
  3. Pipeline & Workflow
  4. Insights
  5. Usage Examples
  6. Contributing
  7. License
  8. Contact

📌 Project Overview

Predicting stroke risk in patients is critical for early intervention. This project:

  • Utilizes a real-world Stroke Prediction dataset from Kaggle.
  • Tackles severe class imbalance (~4% stroke cases).
  • Compares multiple classifiers (e.g., Logistic Regression, XGBoost).

📊 Dataset

Source: Kaggle Stroke Prediction Dataset

Feature categories and columns:

  • Demographics: gender, age
  • Health Metrics: avg_glucose_level, bmi
  • Medical History: hypertension, heart_disease
  • Lifestyle & Social: ever_married, work_type, residence_type, smoking_status
  • Target: stroke (0 = No, 1 = Yes)

After cleaning (dropping missing BMI, removing outliers), 4,909 records remain.
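
A minimal sketch of the loading and cleaning step, assuming the raw Kaggle CSV is stored locally (the file path below is an assumption, not the repository's actual layout):

# Load the raw Kaggle CSV and apply the cleaning described above
import pandas as pd

df = pd.read_csv('csv_files/healthcare-dataset-stroke-data.csv')  # hypothetical path
df.columns = df.columns.str.strip().str.lower()  # standardize column names
df = df.dropna(subset=['bmi'])                   # drop records with missing BMI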


🔄 Pipeline & Workflow

  1. Data Loading & Cleaning

    • Standardize column names
    • Handle missing BMI values
  2. Feature Engineering

    • One-hot encoding for linear models
    • Label Encoding for tree-based models
  3. Train-Test Split

    • Stratified (80% train / 20% test)
  4. Scaling

    • StandardScaler for distance-based models
  5. Imbalance Handling

    • SMOTE in an imblearn.Pipeline (avoids data leakage)
  6. Model Training & Tuning

    • KNN, Logistic Regression, Decision Tree, Random Forest, XGBoost
    • GridSearchCV optimizing F1 on the stroke class (a combined sketch of steps 2 through 6 follows this list)
  7. Threshold Tuning

    • Adjust probability cut-offs for high-recall vs balanced use cases
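
A sketch of how steps 2 through 6 could be wired together with scikit-learn and imbalanced-learn, using Logistic Regression as the example classifier. Variable names, the parameter grid, and the one-hot encoding via pd.get_dummies are illustrative assumptions, not the project's exact configuration:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# df is the cleaned DataFrame from the loading step above
X = pd.get_dummies(df.drop(columns=['stroke']), drop_first=True)  # one-hot encoding for a linear model
y = df['stroke']

# Stratified 80/20 split preserves the ~4% stroke rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE sits inside the pipeline, so oversampling is applied only to the
# training folds during cross-validation (no leakage into validation folds)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])

param_grid = {
    'smote__sampling_strategy': [0.3, 0.5, 1.0],  # SMOTE sampling ratio is tuned as well
    'clf__C': [0.01, 0.1, 1, 10],
}

# F1 on the positive (stroke) class drives the hyperparameter search
grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)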

Hyperparameter & Threshold Tuning

  • GridSearchCV for optimal hyperparameters & SMOTE sampling ratio
  • Custom thresholds (e.g., 0.20 vs. 0.50) to target recall or precision based on clinical needs (see the sketch below)
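
Continuing the illustrative sketch above, threshold tuning can be as simple as sweeping cut-offs over the held-out probabilities and reading off recall and precision (variable names follow the pipeline sketch and are assumptions):

from sklearn.metrics import precision_score, recall_score

# Predicted stroke probabilities on the held-out test set
probs = grid.predict_proba(X_test)[:, 1]

for threshold in (0.50, 0.20):
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold:.2f}  "
          f"recall={recall_score(y_test, preds):.2f}  "
          f"precision={precision_score(y_test, preds):.2f}")

Lowering the cut-off from 0.50 to 0.20 trades precision for recall, which is the trade-off the screening use case calls for.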

📈 Insights

  • Logistic Regression offers the best balance (highest F1 & ROC AUC).
  • XGBoost with a 0.20 decision threshold excels in recall, making it well suited to screening scenarios.

💻 Usage Examples

# Load the trained model
import joblib
import pandas as pd

model = joblib.load('gridsearch/xgb_model.pkl')

# Predict on new data
new_data = pd.read_csv('csv_files/sample_patient.csv')
pred_probs = model.predict_proba(new_data)[:, 1]

# 0.20 is the high-recall screening threshold discussed above; raise it toward 0.50 for a more balanced trade-off
pred_labels = (pred_probs >= 0.2).astype(int)

🤝 Contributing

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature/YourFeature)
  3. Commit your changes (git commit -m "Add new analysis")
  4. Push (git push origin feature/YourFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. See LICENSE for details.


📬 Contact

Authors

Mic-dev-gif (Michele Montalvo)

Jorge M. M. L. Rodrigues

Feel free to open issues or discussions!
