This project builds and evaluates various Random Forest models to predict retail sales based on time, holiday, and store-specific features. It includes preprocessing, model training with hyperparameter tuning, evaluation, and prediction on new datasets.
- `csv_files/sales.csv`: Main dataset used for model training.
- `csv_files/ironkaggle_notarget.csv`: New data for predictions (without target).
- `csv_files/ironkaggle_solutions.csv`: Comparison/solution dataset.
- `models/`: Folder containing saved models (`.pkl` format).
Three main Random Forest models were trained:
- Features: All features except `sales`
- Normalization: `StandardScaler`
- Hyperparameters: `n_estimators=100`, `max_depth=20`
- Status: Trained and saved to `models/model_rf01.pkl`

- Features: Dropped `open`, `is_weekend`, `school_holiday`, `year`
- Normalization: `StandardScaler`
- Hyperparameters: `n_estimators=100`, `max_depth=None`
- Status: Trained and saved to `models/model_rf02.pkl`
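Both configurations above follow the same scale, fit, and save pattern. A minimal sketch of that pattern, assuming synthetic data in place of `csv_files/sales.csv`; the `train_and_save` helper, the temporary output directory, and the choice to store the scaler next to the model are illustrative assumptions, not the project's actual code:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def train_and_save(X, y, path, max_depth=None, n_estimators=100, seed=42):
    """Scale features, fit a Random Forest regressor, and save both with joblib."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)  # fit on the training split only
    model = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=seed
    )
    model.fit(scaler.transform(X_train), y_train)
    # Assumption: keep the scaler with the model so prediction can reuse it.
    joblib.dump({"scaler": scaler, "model": model}, path)
    return model.score(scaler.transform(X_test), y_test)  # R^2 on held-out rows


# Synthetic stand-in for the sales data (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

out_dir = tempfile.mkdtemp()
path_rf01 = os.path.join(out_dir, "model_rf01.pkl")
path_rf02 = os.path.join(out_dir, "model_rf02.pkl")
r2_rf01 = train_and_save(X, y, path_rf01, max_depth=20)
r2_rf02 = train_and_save(X, y, path_rf02, max_depth=None)
```

Fitting the scaler on the training split only keeps test-set statistics out of the model, which matters when the held-out score is used to compare configurations.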
- Grid Search (classification mistake: used `scoring='f1'`)
- Status: Trained, but improperly scored for regression
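`f1` is a classification metric, so it is not meaningful for a continuous target such as sales. A hedged sketch of the same kind of search with a regression scorer instead; the parameter grid and the synthetic data are assumptions for illustration, not the grid used in the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (stand-in for the sales features and target).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Illustrative grid, kept small so the sketch runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # regression scorer, higher is better
    cv=3,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
best_params = search.best_params_
```

scikit-learn scorers follow a "higher is better" convention, which is why the error metric is negated and has to be flipped back before reporting.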
- Feature engineering and time estimate logic for efficient hyperparameter search
- Output model: `models/best_rf_halving.pkl`
- Status: Time estimation and progress tracking using `tqdm` & `threading`
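One way to approach the time estimate and progress tracking is to count the fits a search would run, multiply by a measured per-fit time, and run the search in a worker thread while the main thread reports elapsed time. The sketch below uses only the standard library; the helper names, the grid, and the per-fit time are hypothetical, and a `tqdm` bar could replace the `print` call. It counts fits as an exhaustive grid would, so for a halving search the figure is only a rough guide.

```python
import itertools
import threading
import time


def count_fits(param_grid, cv):
    """Number of model fits an exhaustive grid search would run (refit excluded)."""
    n_candidates = len(list(itertools.product(*param_grid.values())))
    return n_candidates * cv


def estimate_seconds(param_grid, cv, seconds_per_fit):
    """Rough estimate: every candidate on every fold at the same per-fit cost."""
    return count_fits(param_grid, cv) * seconds_per_fit


def run_with_progress(job, estimated_seconds, poll=0.05):
    """Run job() in a worker thread and report elapsed time against the estimate."""
    result = {}
    worker = threading.Thread(target=lambda: result.update(value=job()))
    start = time.perf_counter()
    worker.start()
    ticks = 0
    while worker.is_alive():
        worker.join(timeout=poll)  # wait briefly, then refresh the status line
        elapsed = time.perf_counter() - start
        pct = min(100.0, 100.0 * elapsed / max(estimated_seconds, 1e-9))
        print(f"\rsearching: {elapsed:5.2f}s (~{pct:3.0f}% of estimate)", end="")
        ticks += 1
    print()
    return result["value"], ticks


param_grid = {"n_estimators": [100, 200], "max_depth": [10, 20, None]}
total_fits = count_fits(param_grid, cv=3)  # 6 candidates x 3 folds
eta = estimate_seconds(param_grid, cv=3, seconds_per_fit=0.01)

# Placeholder job standing in for a call such as search.fit(X, y).
value, ticks = run_with_progress(lambda: time.sleep(0.2) or "done", eta)
```

In practice `seconds_per_fit` would come from timing a single fit on the real data before launching the full search.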
Several other machine learning models were imported (e.g., XGBoost, GradientBoostingRegressor, LightGBM) and briefly tested. However, they performed significantly worse than Random Forest on this dataset.
As a result, we removed their implementations to keep the project clean and focused on the best-performing model.
Each model is evaluated using:
- MAE: Mean Absolute Error
- RMSE: Root Mean Squared Error
- R² Score: Coefficient of Determination
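To make the three metrics concrete, here is a small worked example in plain Python. The sample values are made up for illustration; in a scikit-learn workflow the equivalent functions are `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics`.

```python
import math


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up sales figures and predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]

print(mae(y_true, y_pred))   # 15.0
print(rmse(y_true, y_pred))  # about 15.81
print(r2(y_true, y_pred))    # about 0.98
```

RMSE is larger than MAE here because squaring weights the two 20-unit errors more heavily than the two 10-unit errors.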
Feature importances were plotted using model.feature_importances_ to understand which variables most influence predictions.
A masked heatmap using seaborn highlights correlations between features.
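A masked heatmap hides one triangle of the symmetric correlation matrix so that each pair of features appears only once. A minimal sketch, assuming synthetic data and placeholder column names (`x1` to `x4`) rather than the project's actual features:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic numeric frame; the column names are placeholders.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
df["x4"] = 0.8 * df["x1"] + 0.2 * df["x4"]  # inject one strong correlation

corr = df.corr()
# True cells are hidden: mask the upper triangle, diagonal included.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlations (lower triangle)")
fig.tight_layout()
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable between plots made on different subsets of features.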
The file ironkaggle_notarget.csv was:
- Preprocessed with the same steps as the training data
- Normalized using the previously fitted `StandardScaler`
- Passed to `model_rf02` for prediction
- Saved with its predictions to `df01_with_predictions.csv`
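The key detail in these steps is that the scaler is reused, not refitted, on the new data. A hedged sketch of the flow with synthetic data; the feature columns and the `predicted_sales` column name are hypothetical, and the model is trained inline here, whereas the project would load its saved model instead:

```python
import io

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

feature_cols = ["store_id", "day_of_week", "promo"]  # hypothetical columns

# Training side: fit the scaler and the model once (synthetic data).
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.integers(0, 7, size=(300, 3)), columns=feature_cols)
train["sales"] = (
    50 * train["promo"] + 10 * train["day_of_week"] + rng.normal(size=300)
)

scaler = StandardScaler().fit(train[feature_cols])
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(scaler.transform(train[feature_cols]), train["sales"])

# Prediction side: the new rows have no target column.
new_data = pd.DataFrame(rng.integers(0, 7, size=(20, 3)), columns=feature_cols)
X_new = scaler.transform(new_data[feature_cols])  # transform only, never refit
new_data["predicted_sales"] = model.predict(X_new)

# Written to an in-memory buffer here; a file path would work the same way.
buffer = io.StringIO()
new_data.to_csv(buffer, index=False)
```

Refitting the scaler on the new file would shift the feature scale away from what the model saw during training, so predictions could drift even with an unchanged model.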
- `pandas`, `numpy`, `matplotlib`, `seaborn`
- `scikit-learn`: RandomForestRegressor, GridSearchCV, HalvingGridSearchCV
- `joblib`: For model saving/loading
- `tqdm` and `threading`: For progress monitoring
- Ensure required libraries are installed (requirements.txt or use `pip install -r requ
We welcome contributions from anyone interested in improving this project! Here are the current contributors:
- Author: JorgeMMLRodrigues
- Email: jorgemmlrodrigues@gmail.com
- Author: Mic-dev-gif
- Email: michelemontalvo@outlook.com
- GitHub: https://github.com/Mic-dev-gif
- Email: simiatawane@gmail.com
- Email: felipe.rocha@ironhack.com