JorgeMMLRodrigues/IronKaggle_Sales

🛍️ Retail Sales Prediction with Random Forest

This project builds and evaluates several Random Forest models that predict retail sales from time, holiday, and store-specific features. It covers preprocessing, model training with hyperparameter tuning, evaluation, and prediction on new datasets.

📂 Files

  • csv_files/sales.csv: Main dataset used for model training.
  • csv_files/ironkaggle_notarget.csv: New data for predictions (without target).
  • csv_files/ironkaggle_solutions.csv: Comparison/solution dataset.
  • models/: Folder containing saved models (.pkl format).

🧠 Machine Learning Models

Three main Random Forest models were trained:

✅ model_rf01

  • Features: All features except sales
  • Normalization: StandardScaler
  • Hyperparameters: n_estimators=100, max_depth=20
  • Status: Trained and saved to models/model_rf01.pkl
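The model_rf01 configuration above can be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: the synthetic data stands in for csv_files/sales.csv, the column names other than sales and open are assumptions, and the model is written to a temporary folder rather than models/.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for csv_files/sales.csv (feature columns are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day_of_week": rng.integers(1, 8, 300),
    "open": rng.integers(0, 2, 300),
    "promotion": rng.integers(0, 2, 300),
})
df["sales"] = 1000 * df["open"] + 300 * df["promotion"] + rng.normal(0, 50, 300)

# model_rf01 uses every column except the target.
X = df.drop(columns=["sales"])
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_rf01 = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=42)
model_rf01.fit(X_train_scaled, y_train)

# Persist the fitted model (temporary folder here; the project uses models/).
model_path = os.path.join(tempfile.mkdtemp(), "model_rf01.pkl")
joblib.dump(model_rf01, model_path)
```

Tree-based models do not strictly need feature scaling, so the StandardScaler step mainly keeps preprocessing identical between training and prediction.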

✅ model_rf02

  • Features: Dropped open, is_weekend, school_holiday, year
  • Normalization: StandardScaler
  • Hyperparameters: n_estimators=100, max_depth=None
  • Status: Trained and saved to models/model_rf02.pkl

✅ model_rf03

  • Grid Search: mistakenly run with the classification metric scoring='f1'
  • Status: Trained, but improperly scored for regression
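For reference, a grid search over a regressor would normally use a regression scorer, since F1 is defined for class labels rather than continuous targets. The sketch below shows one possible correction using scikit-learn's "neg_root_mean_squared_error" scorer; the parameter grid and the synthetic data are assumptions, not the grid used for model_rf03.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the sales features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid; the values searched in the project are not documented.
param_grid = {"n_estimators": [20, 50], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # regression scorer instead of 'f1'
    cv=3,
)
search.fit(X, y)

# scikit-learn maximizes scores, so RMSE is reported negated; flip the sign.
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```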

⚙️ HalvingGridSearchCV (Experimental)

  • Feature engineering and time estimate logic for efficient hyperparameter search
  • Output model: models/best_rf_halving.pkl
  • Status: Time estimation and progress tracking using tqdm & threading
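A minimal sketch of a halving search is shown below. HalvingGridSearchCV is still experimental in scikit-learn and must be enabled with an explicit import. The parameter grid, the factor value, and the synthetic data are assumptions, and the tqdm/threading progress tracking mentioned above is left out for brevity.

```python
# The experimental import must come before HalvingGridSearchCV is imported.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import HalvingGridSearchCV

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid: 6 candidates in total.
param_grid = {"max_depth": [5, 10, None], "min_samples_leaf": [1, 5]}

halving = HalvingGridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    factor=3,  # keep roughly the best third of candidates each round
    resource="n_samples",  # later rounds train on more rows
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
halving.fit(X, y)
print(halving.best_params_, halving.n_iterations_)
```

Because weak candidates are discarded after being trained on small subsets, the search usually costs less than an exhaustive GridSearchCV over the same grid.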

🤖 Other Models Tried

Several other machine learning models were imported (e.g., XGBoost, GradientBoostingRegressor, LightGBM) and briefly tested. However, they performed significantly worse than Random Forest on this dataset.
As a result, we removed their implementations to keep the project clean and focused on the best-performing model.

📊 Evaluation Metrics

Each model is evaluated using:

  • MAE: Mean Absolute Error
  • RMSE: Root Mean Squared Error
  • R² Score: Coefficient of Determination
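The three metrics can be computed with scikit-learn as in the short sketch below. The y_true and y_pred arrays are made-up values for illustration, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mae = mean_absolute_error(y_true, y_pred)
# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

MAE and RMSE are in the same units as sales, while R² is unitless, so it is easier to compare across datasets.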

🔍 Feature Importance

Feature importances were plotted using model.feature_importances_ to understand which variables most influence predictions.
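A plot of this kind could be produced roughly as follows. The feature names and the synthetic data are placeholders, and the horizontal bar chart is one reasonable choice rather than the exact plot used in the project.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder feature names; the real columns come from the sales dataset.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort for a readable chart.
importances = pd.Series(model.feature_importances_, index=feature_names).sort_values()

fig, ax = plt.subplots()
ax.barh(importances.index, importances.values)
ax.set_xlabel("Importance")
ax.set_title("Random Forest feature importances")
fig.tight_layout()
```

These impurity-based importances sum to 1 and can overstate high-cardinality features, so they are best read as a rough ranking.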

📈 Correlation Matrix

A masked heatmap using seaborn highlights correlations between features.
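A masked heatmap typically hides the upper triangle of the correlation matrix, since it mirrors the lower one. The sketch below assumes that approach; the column names and random data are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder data; the project would use the preprocessed sales features.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["feature_a", "feature_b", "feature_c", "feature_d"],
)

corr = df.corr()

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr, mask=mask, annot=True, fmt=".2f",
    cmap="coolwarm", vmin=-1, vmax=1, ax=ax,
)
ax.set_title("Feature correlation (lower triangle)")
fig.tight_layout()
```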

📤 Prediction on New Data

The file ironkaggle_notarget.csv was:

  • Preprocessed with the same steps as training data
  • Normalized using the previously fitted StandardScaler
  • Predicted using model_rf02
  • Output stored in df01_with_predictions.csv
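The steps above can be sketched end to end as follows. This is an assumption-laden illustration: the data is synthetic, the kept feature names are hypothetical, the scaler is assumed to have been saved with joblib alongside the model, and files are written to a temporary folder instead of the project directories.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

DROPPED = ["open", "is_weekend", "school_holiday", "year"]  # dropped for model_rf02
FEATURES = ["day_of_week", "promotion", "month"]  # hypothetical kept features

rng = np.random.default_rng(1)


def make_frame(n):
    """Build a synthetic frame containing both kept and dropped columns."""
    return pd.DataFrame({
        "day_of_week": rng.integers(1, 8, n),
        "promotion": rng.integers(0, 2, n),
        "month": rng.integers(1, 13, n),
        "open": 1,
        "is_weekend": 0,
        "school_holiday": 0,
        "year": 2015,
    })


# Training side: fit scaler and model, then persist both.
train = make_frame(200)
train["sales"] = 500 + 300 * train["promotion"] + rng.normal(0, 20, 200)
scaler = StandardScaler().fit(train[FEATURES])
model = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=0)
model.fit(scaler.transform(train[FEATURES]), train["sales"])

workdir = tempfile.mkdtemp()
joblib.dump(scaler, os.path.join(workdir, "scaler.pkl"))
joblib.dump(model, os.path.join(workdir, "model_rf02.pkl"))

# Prediction side: reload the artifacts and apply the same preprocessing.
scaler_loaded = joblib.load(os.path.join(workdir, "scaler.pkl"))
model_loaded = joblib.load(os.path.join(workdir, "model_rf02.pkl"))

new_df = make_frame(50)  # stand-in for ironkaggle_notarget.csv
X_new = new_df.drop(columns=DROPPED)[FEATURES]
new_df["predicted_sales"] = model_loaded.predict(scaler_loaded.transform(X_new))

out_path = os.path.join(workdir, "df01_with_predictions.csv")
new_df.to_csv(out_path, index=False)
```

Reusing the fitted scaler with transform, rather than fitting a new one on the unseen data, keeps the new features on the same scale the model saw during training.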

🛠️ Tools Used

  • pandas, numpy, matplotlib, seaborn
  • scikit-learn: RandomForestRegressor, GridSearchCV, HalvingGridSearchCV
  • joblib: For model saving/loading
  • tqdm and threading: For progress monitoring

📝 How to Run

  1. Ensure required libraries are installed (requirements.txt or use `pip install -r requ

👥 Contributors

We welcome contributions from anyone interested in improving this project! Here are the current contributors:
