To understand a model's functionality and find its underlying problems, we need to take care of the following things:
- Periodic training
- Figure out an optimal retraining strategy.
- Monitor model performance
- Analyse
- Clear visibility of the model helps us and guides the model performance.
In this section we will diagnose and fix problems in a production deployed code. To that end, we will:
- Use evidently and mlflow libraries.
- Deploy a machine learning model in Heroku.
- Calculate data drift for the model.
- Use mlflow Tracking for the training experiments indicating data drift.
- Explore the results using mlflow UI.
In this tutorial we will use the Bike Sharing Dataset.
You can use the following code snippet (from train.py) to analyze this data before starting this tutorial.
content = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip").content
with zipfile.ZipFile(io.BytesIO(content)) as arc:
raw_data = pd.read_csv(arc.open("day.csv"), header=0, sep=',', parse_dates=['dteday'], index_col='dteday')
# observe data structure
raw_data.tail()- Heroku account
- GitHub account
- Clone the Github repo
https://github.com/udacity/cd0583-diagnose-and-fix.git
- conda create -n mldrift "python=3.10"
- pip install -r requirements.txt: All the dependencies are listed in the
requirements.txtfile. You can setup a virtual environment using Anaconda and install the required dependencies there. runtime.txtcontains the python version that is used for this tutorial.
Follow the steps below for deploying this model:
- Ensure that all the dependencies listed in the
requirements.txtfile are installed. - Run the
train.pyfile to log experiments in mlflow - View the results in the mlflow webui: run
mlflow ui - open url that is printed, e.g.
http://127.0.0.1:5000
- If there is substantial data drift then you should reweigh samples in the training data, giving more importance to newer patterns.
- Identify new segments where the model fails, and create a different model for it. Consider using an ensemble of several models for different segments of the data.
- Change the prediction target. For example, switch from weekly to daily forecast or replace the regression model with classification into categories from "high" to "low."
- Pick a different model architecture to account for ongoing drift. You can consider incremental or online learning, where the model continuously adapts to new data.
- Apply domain adaptation strategies. There are a number of approaches to help the model better generalize to a new target domain.