
Stroke Prediction Analysis 🚨

Python 3.8+ License: MIT

A comprehensive pipeline for predicting stroke risk using real-world clinical data, with an emphasis on handling severe class imbalance and optimizing clinically relevant metrics.


📋 Table of Contents

  1. Project Overview
  2. Dataset
  3. Pipeline & Workflow
  4. Insights
  5. Usage Examples
  6. Contributing
  7. License
  8. Contact

📌 Project Overview

Predicting stroke risk in patients is critical for early intervention. This project:

  • Utilizes a real-world Stroke Prediction dataset from Kaggle.
  • Tackles severe class imbalance (~4% stroke cases).
  • Compares multiple classifiers (e.g., Logistic Regression, XGBoost).

📊 Dataset

Source: Kaggle Stroke Prediction Dataset

Feature categories and columns:

  • Demographics: gender, age
  • Health Metrics: avg_glucose_level, bmi
  • Medical History: hypertension, heart_disease
  • Lifestyle & Social: ever_married, work_type, residence_type, smoking_status
  • Target: stroke (0 = No, 1 = Yes)

After cleaning (dropping missing BMI, removing outliers), 4,909 records remain.
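
A minimal sketch of the loading and cleaning step, assuming the raw Kaggle CSV is stored locally (the file path below is an assumption, not the repository's actual layout):

# Load the raw Kaggle CSV and apply the cleaning described above
import pandas as pd

df = pd.read_csv('csv_files/healthcare-dataset-stroke-data.csv')  # hypothetical path
df.columns = df.columns.str.strip().str.lower()  # standardize column names
df = df.dropna(subset=['bmi'])                   # drop records with missing BMI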


🔄 Pipeline & Workflow

  1. Data Loading & Cleaning

    • Standardize column names
    • Handle missing BMI values
  2. Feature Engineering

    • One-hot encoding for linear models
    • Label Encoding for tree-based models
  3. Train-Test Split

    • Stratified (80% train / 20% test)
  4. Scaling

    • StandardScaler for distance-based models
  5. Imbalance Handling

    • SMOTE in an imblearn.Pipeline (avoids data leakage)
  6. Model Training & Tuning

    • KNN, Logistic Regression, Decision Tree, Random Forest, XGBoost
    • GridSearchCV optimizing F1 on the stroke class (a combined sketch of steps 2 through 6 follows this list)
  7. Threshold Tuning

    • Adjust probability cut-offs for high-recall vs balanced use cases
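
A sketch of how steps 2 through 6 could be wired together with scikit-learn and imbalanced-learn, using Logistic Regression as the example classifier. Variable names, the parameter grid, and the one-hot encoding via pd.get_dummies are illustrative assumptions, not the project's exact configuration:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# df is the cleaned DataFrame from the loading step above
X = pd.get_dummies(df.drop(columns=['stroke']), drop_first=True)  # one-hot encoding for a linear model
y = df['stroke']

# Stratified 80/20 split preserves the ~4% stroke rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE sits inside the pipeline, so oversampling is applied only to the
# training folds during cross-validation (no leakage into validation folds)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])

param_grid = {
    'smote__sampling_strategy': [0.3, 0.5, 1.0],  # SMOTE sampling ratio is tuned as well
    'clf__C': [0.01, 0.1, 1, 10],
}

# F1 on the positive (stroke) class drives the hyperparameter search
grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)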

Hyperparameter & Threshold Tuning

  • GridSearchCV for optimal hyperparameters & SMOTE sampling ratio
  • Custom thresholds (e.g., 0.20 vs. 0.50) to target recall or precision based on clinical needs (see the sketch below)
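
Continuing the illustrative sketch above, threshold tuning can be as simple as sweeping cut-offs over the held-out probabilities and reading off recall and precision (variable names follow the pipeline sketch and are assumptions):

from sklearn.metrics import precision_score, recall_score

# Predicted stroke probabilities on the held-out test set
probs = grid.predict_proba(X_test)[:, 1]

for threshold in (0.50, 0.20):
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold:.2f}  "
          f"recall={recall_score(y_test, preds):.2f}  "
          f"precision={precision_score(y_test, preds):.2f}")

Lowering the cut-off from 0.50 to 0.20 trades precision for recall, which is the trade-off the screening use case calls for.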

📈 Insights

  • Logistic Regression offers the best balance (highest F1 & ROC AUC).
  • XGBoost with a 0.20 decision threshold excels in recall, making it well suited to screening scenarios.

💻 Usage Examples

# Load the trained model
import joblib
import pandas as pd

model = joblib.load('gridsearch/xgb_model.pkl')

# Predict on new data
new_data = pd.read_csv('csv_files/sample_patient.csv')
pred_probs = model.predict_proba(new_data)[:, 1]

# 0.20 is the high-recall screening threshold discussed above; raise it toward 0.50 for a more balanced trade-off
pred_labels = (pred_probs >= 0.2).astype(int)

🤝 Contributing

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature/YourFeature)
  3. Commit your changes (git commit -m "Add new analysis")
  4. Push (git push origin feature/YourFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. See LICENSE for details.


📬 Contact

Authors

Mic-dev-gif (Michele Montalvo)

Jorge M. M. L. Rodrigues

Feel free to open issues or discussions!
