This project predicts whether water is safe (potable) or unsafe for drinking, using physicochemical features.
The dataset consists of 3,276 water samples with 9 features (e.g., pH, hardness, solids, sulfate).
Clean drinking water is essential for health. Laboratory testing is reliable but costly and time-consuming.
This project aims to build a data-driven screening tool for water safety classification.
- Source: [Water Potability dataset](https://www.kaggle.com/datasets/adityakadiwal/water-potability) (Kaggle)
- Features: ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, Turbidity
- Target: Potability (1 = drinkable, 0 = not drinkable)
- Data cleaning & handling missing values
- Outlier detection & mitigation
- Feature engineering (pH categories)
- Train-test split
- No leakage: the scaler is fit on the training set only, then applied to the test set
- SMOTE: applied to the training set only to handle class imbalance
- Train models separately:
  - Logistic Regression
  - Support Vector Machine
  - Decision Tree
  - Random Forest
  - XGBoost
- Model evaluation with accuracy, precision, recall, F1, ROC-AUC
- Feature importance & SHAP interpretability
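Two of the steps above, the pH categories and the train-only scaling, can be illustrated with a minimal, library-free sketch. This is not the notebook's code: the function names, category labels, and toy hardness values are hypothetical, and the bin edges simply reuse the 6.5 to 8.5 pH range cited in this README. The notebook itself presumably relies on pandas and scikit-learn for the same steps.

```python
from statistics import mean, pstdev


def ph_category(ph):
    """Map a pH reading to a coarse category (assumed bins: <6.5, 6.5-8.5, >8.5)."""
    if ph < 6.5:
        return "acidic"
    if ph <= 8.5:
        return "within_range"
    return "alkaline"


def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column) or 1.0  # fall back to 1.0 if the column is constant
    return mu, sigma


def apply_scaler(column, mu, sigma):
    """Standardize values using statistics that were learned elsewhere."""
    return [(x - mu) / sigma for x in column]


# Toy hardness values, made up for illustration (not taken from the dataset).
train_hardness = [150.0, 200.0, 250.0]
test_hardness = [175.0, 300.0]

# The statistics come from the training split only ...
mu, sigma = fit_scaler(train_hardness)
train_scaled = apply_scaler(train_hardness, mu, sigma)
# ... and the test split reuses them, so no test information leaks into the scaler.
test_scaled = apply_scaler(test_hardness, mu, sigma)

print(ph_category(7.1), ph_category(5.9), ph_category(9.2))
print(train_scaled)
print(test_scaled)
```

The same fit-on-train, transform-on-test discipline applies to oversampling: synthetic minority samples should be generated from the training split only, so the test set keeps its original class balance.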
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 0.50 | 0.40 | 0.56 | 0.47 |
| SVM | 0.62 | 0.52 | 0.44 | 0.48 |
| Decision Tree | 0.71 | 0.62 | 0.64 | 0.63 |
| Random Forest | 0.75 | 0.69 | 0.63 | 0.66 |
| XGBoost | 0.73 | 0.68 | 0.61 | 0.64 |
➡️ Random Forest achieved the best balance (F1 = 0.66, AUC ≈ 0.74).
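As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so each reported F1 score can be recomputed from the other two columns. The sketch below copies the precision, recall, and F1 values from the table; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "Logistic Regression": (0.40, 0.56, 0.47),
    "SVM": (0.52, 0.44, 0.48),
    "Decision Tree": (0.62, 0.64, 0.63),
    "Random Forest": (0.69, 0.63, 0.66),
    "XGBoost": (0.68, 0.61, 0.64),
}

for model, (p, r, f1) in reported.items():
    recomputed = f1_from(p, r)
    print(f"{model}: recomputed F1 = {recomputed:.2f}, reported = {f1:.2f}")

# Model with the highest recomputed F1.
best = max(reported, key=lambda m: f1_from(*reported[m][:2]))
print("Best by F1:", best)
```

The recomputed values agree with the reported ones to two decimal places, which suggests the table columns are internally consistent.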
- Safe water is more likely when pH is between 6.5 and 8.5 (WHO standard), but pH alone is not sufficient.
- Sulfate, Solids, and Conductivity were strong predictors in feature importance analysis.
- No single feature separates safe from unsafe water; ML models are necessary.
- Random Forest and XGBoost were the two strongest models; RF was slightly better on recall and F1.
- Confusion matrix
- ROC curve
- Feature importance (Random Forest)
- Model performance comparison chart
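For readers unfamiliar with the confusion matrix, the sketch below shows how its four cells are tallied from true and predicted labels, and how accuracy, precision, and recall follow from them. The labels are toy values made up for illustration, not model output from this project, and the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = potable)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In a screening setting, a false positive (unsafe water predicted as potable) is the costlier error, so precision on the potable class deserves particular attention.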
- Random Forest was the best performing model (F1 = 0.66).
- While this tool is not a replacement for laboratory testing, it provides a fast, low-cost, first-pass screening for water safety assessment.

git clone https://github.com/sherryyy00/water-potability-ml.git
cd water-potability-ml
pip install -r requirements.txt
jupyter notebook Water_Prediction.ipynb