🚰 Water Potability Prediction using Machine Learning

📌 Project Overview

This project predicts whether water is safe (potable) or unsafe for drinking, using physicochemical features.
The dataset consists of 3,276 water samples with 9 features (e.g., pH, hardness, solids, sulfate).

🔑 Problem

Clean drinking water is essential for health. Laboratory testing is reliable but costly and time-consuming.
This project aims to build a data-driven screening tool for water safety classification.

🧾 Dataset

Source: Water Potability dataset (https://www.kaggle.com/datasets/adityakadiwal/water-potability)
Features:
- ph
- Hardness
- Solids
- Chloramines
- Sulfate
- Conductivity
- Organic_carbon
- Trihalomethanes
- Turbidity
Target:
- Potability (1 = drinkable, 0 = not drinkable)
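
A minimal sketch of a first look at the data, assuming pandas is available and the CSV uses the column names listed above. The helper name `summarize` and the two sample rows are made up for illustration; in practice the frame would come from `pd.read_csv("water_potability.csv")`.

```python
import io

import pandas as pd

FEATURES = [
    "ph", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity",
]
TARGET = "Potability"


def summarize(df):
    """Return per-feature missing-value counts and the share of potable samples."""
    missing = df[FEATURES].isna().sum()
    potable_share = df[TARGET].mean()
    return missing, potable_share


# Made-up rows (not taken from the real dataset), only to show the call.
sample_csv = io.StringIO(
    "ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,"
    "Organic_carbon,Trihalomethanes,Turbidity,Potability\n"
    ",200.0,20000.0,7.0,350.0,400.0,14.0,60.0,4.0,0\n"
    "7.1,190.0,18000.0,6.5,,420.0,15.0,,3.5,1\n"
)
df = pd.read_csv(sample_csv)

missing, potable_share = summarize(df)
print(missing[missing > 0])
print(f"Share of potable samples: {potable_share:.2f}")
```

The missing counts show which columns need imputation, and the potable share gives a quick read on class imbalance before any resampling.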

⚙️ Workflow / Pipeline

Data cleaning & handling missing values
Outlier detection & mitigation
Feature engineering (pH categories)
Train-test split
No leakage: scaler fit on the training set only, then applied to the test set
SMOTE: handle class imbalance on train set
Train models separately:
- Logistic Regression
- Support Vector Machine
- Decision Tree
- Random Forest
- XGBoost
Model evaluation with accuracy, precision, recall, F1, ROC-AUC
Feature importance & SHAP interpretability
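
The split and scaling steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: it uses scikit-learn's `train_test_split` and `StandardScaler`, and the data is a synthetic stand-in with 9 feature columns. The split ratio and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9 features and the binary target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 9))
y = rng.integers(0, 2, size=200)

# Split first, so test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from train only
X_test_scaled = scaler.transform(X_test)        # test reuses the train statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

Any resampling such as SMOTE would then be applied to `X_train_scaled` and `y_train` only, so the test set keeps its original class distribution.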

📊 Results

Model	Accuracy	Precision	Recall	F1 Score
Logistic Regression	0.50	0.40	0.56	0.47
SVM	0.62	0.52	0.44	0.48
Decision Tree	0.71	0.62	0.64	0.63
Random Forest	0.75	0.69	0.63	0.66
XGBoost	0.73	0.68	0.61	0.64

➡️ Random Forest achieved the best balance (F1 = 0.66, AUC ≈ 0.74).
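
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported values, assuming the precision and recall columns refer to the same (positive) class. The function name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "Logistic Regression": (0.40, 0.56, 0.47),
    "SVM": (0.52, 0.44, 0.48),
    "Decision Tree": (0.62, 0.64, 0.63),
    "Random Forest": (0.69, 0.63, 0.66),
    "XGBoost": (0.68, 0.61, 0.64),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_from(p, r):.2f}, reported = {f1:.2f}")
```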

🔍 Key Insights

Safe water is more likely when pH is between 6.5–8.5 (WHO standard), but pH alone is not sufficient.
Sulfate, Solids, and Conductivity were strong predictors in feature importance analysis.
No single feature separates safe from unsafe water, so a multivariate model is needed.
Random Forest + XGBoost both performed strongly; RF slightly better on recall and F1.
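
The pH range mentioned above could be turned into a categorical feature along these lines. This is a hypothetical sketch: the band names and the handling of missing values are assumptions, and the notebook's actual pH categories may differ.

```python
def ph_band(ph):
    """Map a pH reading to a coarse band around the 6.5-8.5 range."""
    if ph is None or ph != ph:  # None or NaN (NaN is never equal to itself)
        return "unknown"
    if ph < 6.5:
        return "acidic"
    if ph <= 8.5:
        return "in_range"
    return "alkaline"


for value in (5.0, 6.5, 7.2, 8.5, 9.3, float("nan")):
    print(value, ph_band(value))
```

Both boundary values (6.5 and 8.5) are treated as in range here; that is a design choice for the sketch rather than something the source specifies.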

📈 Visuals

Confusion matrix
ROC curve
Feature importance (Random Forest)
Model performance comparison chart

🏆 Conclusion

Random Forest was the best performing model (F1 = 0.66).
While this tool is not a replacement for laboratory testing, it provides a fast, low-cost, first-pass screening for water safety assessment.

🚀 How to Run

git clone https://github.com/sherryyy00/water-potability-ml.git
cd water-potability-ml
pip install -r requirements.txt
jupyter notebook Water_Prediction.ipynb
