This repository was archived by the owner on Jun 29, 2019. It is now read-only.
10 changes: 6 additions & 4 deletions CATelcoCustomerChurnModeling.py
@@ -1,7 +1,7 @@
# Customer Churn Prediction

import dataprep
from dataprep.Package import Package
# Use the Azure Machine Learning data preparation package
from azureml.dataprep import package
import pickle

from sklearn.naive_bayes import GaussianNB
@@ -19,8 +19,10 @@
run_logger = get_azureml_logger()
run_logger.log('amlrealworld.ChurnPrediction.CATelcoCustomerChurnModeling','true')

with Package.open_package('CATelcoCustomerChurnTrainingSample.dprep') as pkg:
    df = pkg.dataflows[0].get_dataframe(spark=False)
# This call will load the referenced package and return a DataFrame.
# If run in a PySpark environment, this call returns a
# Spark DataFrame. If not, it will return a Pandas DataFrame.
df = package.run('CATelcoCustomerChurnTrainingSample.dprep', dataflow_idx=0)

columns_to_encode = list(df.select_dtypes(include=['category','object']))
for column_to_encode in columns_to_encode:
78 changes: 52 additions & 26 deletions docs/ModelingAndEvaluation.md
@@ -1,24 +1,30 @@
# Churn Prediction using AMLWorkbench - Modeling and Evaluation
# Churn Prediction using AML - Modeling and Evaluation
## 1. Objectives

The aim of this lab is to use the .dprep file created from the previous lab to develop a churn classifier. More specifically, in this lab, we will use sklearn library’s Naïve Bayesian and Decision Tree algorithm to develop a churn classifier, evaluate, and compare.
The aim of this lab is to use the `.dprep` file created in the previous lab to develop a churn classifier. More specifically, we will use the scikit-learn (sklearn) library's Naïve Bayes and Decision Tree algorithms to build two churn classifiers, then evaluate and compare them.

## 2. Data Access Code
The steps of estimating and evaluating models are already reflected in the `CATelcoCustomerChurnModeling.py` script.

2.1. The previous lab showed how to create a .dprep file. Use the .dprep file to generate data access code file by going to File Explorer, selecting the .dprep file and then choosing Generate Data Access Code File via right-click drop-down menu.
## 2. Data Access Code File

2.1. The previous lab showed how to create a `.dprep` file. Use the `.dprep` file to generate a data access code file by going to File Explorer, right-clicking the `.dprep` file, and then choosing `Generate Data Access Code File`.

![GenerateDataAccessCode](Images/GenerateDataAccessCode.png)

2.2. A new python file is created with the "drep" snippet. This snippet (shown below) can be seen in lines 22 and 23 of CATelcoCustomerChurnModeling.py
2.2. This creates a new Python file that uses the appropriate modules to process this `.dprep` file and to load it as a Spark or Pandas DataFrame. This snippet (shown below) can be seen in lines 22-25 of `CATelcoCustomerChurnModeling.py`.

```
with Package.open_package('CATelcoCustomerChurnTrainingSample.dprep') as pkg:
    df = pkg.dataflows[0].get_dataframe()
# This call will load the referenced package and return a DataFrame.
# If run in a PySpark environment, this call returns a
# Spark DataFrame. If not, it will return a Pandas DataFrame.
df = package.run('CATelcoCustomerChurnTrainingSample.dprep', dataflow_idx=0)
```
The dataframe df can then be used in the code for advanced analytics.

The dataframe `df` can then be used in the code for advanced analytics.

## 3. One-hot encoding

The dataset imported from French Telecom company Orange consists of heterogeneous noisy data (numerical/categorical variables). One hot encoding transforms categorical features to a format that works better with classification and regression algorithms. Some algorithms, like random forests, handle categorical values natively. Then, one hot encoding is not necessary. The process of one hot encoding may seem tedious, but fortunately, most modern machine learning libraries (such as pandas) can take care of it.
The dataset imported from the French telecom company Orange consists of heterogeneous, noisy data (numerical and categorical variables). One-hot encoding transforms categorical features into a format that works better with classification and regression algorithms. Some algorithms, such as certain random forest implementations, handle categorical values natively; in that case, one-hot encoding is not necessary. The process of one-hot encoding may seem tedious, but fortunately, most modern data science libraries (such as pandas) can take care of it.

The following code is used to perform one-hot encoding:

@@ -35,15 +41,15 @@ for column_to_encode in columns_to_encode:
```
Code highlights

* list(df.select_dtypes(include=['category','object'])) identifies all categorical fields.
* get_dummies converts categorical variable into dummy/indicator variables.
* There can be more than one categorical variable containing the same values. Hence, column_to_encode + '_' + col_name is used to produce a unique column name.
* `list(df.select_dtypes(include=['category','object']))` identifies all categorical fields.
* `get_dummies()` converts a categorical variable into dummy/indicator variables.
* There can be more than one categorical variable containing the same values. Hence, `column_to_encode + '_' + col_name` is used to produce a unique column name.
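The highlights above can be illustrated on a toy DataFrame. This is a minimal sketch, not the lab's script: the column names `intl_plan`, `voicemail`, and `minutes` are made up for illustration, and it relies on the `prefix` argument of `get_dummies()` to build the `column_to_encode + '_' + col_name` style names.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns that share the values "yes"/"no".
df = pd.DataFrame({
    "intl_plan": pd.Series(["yes", "no", "no"], dtype="object"),
    "voicemail": pd.Series(["no", "no", "yes"], dtype="object"),
    "minutes": [120.5, 80.0, 200.25],
})

# Identify all categorical fields.
columns_to_encode = list(df.select_dtypes(include=["category", "object"]))

for column_to_encode in columns_to_encode:
    # prefix=... yields names such as "intl_plan_yes", keeping them unique
    # even though both columns contain the same values.
    dummies = pd.get_dummies(df[column_to_encode], prefix=column_to_encode)
    df = df.drop(column_to_encode, axis=1)
    df = df.join(dummies)

print(list(df.columns))
```

Without the prefix, both categorical columns would produce indicator columns named `no` and `yes`, and the join would fail on the name collision.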

## 4. Modeling and Evaluation

Naïve Bayes
### Naïve Bayes

In this lab, we will begin with Sklearn’s GaussianNB to build our model. GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
In this lab, we will begin with Sklearn’s `GaussianNB()` to build our model. `GaussianNB()` implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

![Algorithm](Images/Formula.png)
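As a rough illustration of the Gaussian assumption, the sketch below evaluates the standard normal density for a single feature value under two classes. The per-class means and variances are invented for the example; `GaussianNB()` estimates them from the training data.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Density of a normal distribution N(mean, variance) evaluated at x."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

# Hypothetical per-class statistics for one feature (not taken from the churn data).
likelihood_churn = gaussian_likelihood(x=5.0, mean=6.0, variance=4.0)
likelihood_stay = gaussian_likelihood(x=5.0, mean=2.0, variance=1.0)

print("P(x | churn) =", likelihood_churn)
print("P(x | stay)  =", likelihood_stay)
```

Naive Bayes multiplies such per-feature likelihoods together with the class prior and predicts the class with the largest product.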

@@ -69,32 +75,41 @@ test = test.drop('churn', 1)
predicted = model.predict(test)
```

The metrics module from sklearn implements functions assessing prediction error for specific purposes such as classification, clustering, regression, etc. The classification_report function from the metrics module produces a report of commonly used measures such as precision, recall, f-measure for the test data. In addition, accuracy_score is a straightforward function that you can leverage to get the accuracy of the classifier. accuracy_score takes in expected and predicted values as shown below:
The metrics module from sklearn implements functions assessing prediction error for specific purposes such as classification, clustering, regression, etc. The `classification_report()` function from the `metrics` module produces a report of commonly used measures such as precision, recall, f-measure for the test data. In addition, `accuracy_score()` is a straightforward function that you can leverage to get the accuracy of the classifier. `accuracy_score()` takes in expected and predicted values as shown below:

```
accuracy_score(expected, predicted)
```
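To make these measures concrete, here is a small pure-Python sketch of what accuracy, precision, and recall compute. It is not sklearn's implementation, and the label lists are made up; in the lab itself you would call `accuracy_score()` and `classification_report()` directly.

```python
def accuracy(expected, predicted):
    """Fraction of predictions that match the expected labels."""
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)

def precision_recall(expected, predicted, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(1 for e, p in pairs if e == positive and p == positive)
    false_pos = sum(1 for e, p in pairs if e != positive and p == positive)
    false_neg = sum(1 for e, p in pairs if e == positive and p != positive)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Hypothetical labels: 1 = churned, 0 = stayed.
expected = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]

print("accuracy:", accuracy(expected, predicted))
print("precision, recall:", precision_recall(expected, predicted))
```

Churn data is usually imbalanced, so precision and recall for the churn class tend to be more informative than accuracy alone.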
Decision Tree

### Decision Tree

Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the churn data features.

The train and test datasets created from the above section can be used to build a Decision Tree Classifier. The Decision Tree is initialized with two parameters: min_samples_split=20 requires 20 samples in a node for it to be split and random_state=99 to seed the random number generator. The below code can be used to build the tree and get the accuracy to compare with the Naïve Bayes classifier.
The train and test datasets created from the above section can be used to build a Decision Tree Classifier. The Decision Tree is initialized with two parameters:

- `min_samples_split=20` requires at least 20 samples in a node for it to be split.
- `random_state=99` seeds the random number generator.

The below code can be used to build the tree and get the accuracy to compare with the Naïve Bayes classifier.

```
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(train, target)
predicted = dt.predict(test)
print("Decision Tree Classification Accuracy", accuracy_score(expected, predicted))
```
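For readers who want to try the Decision Tree step without the churn data, the following sketch runs the same calls end to end on synthetic data. It assumes scikit-learn is installed; the dataset comes from `make_classification()`, and the split parameters are illustrative rather than the lab's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded churn features (assumption, not the lab data).
features, labels = make_classification(n_samples=500, n_features=8, random_state=0)
train, test, target, expected = train_test_split(
    features, labels, test_size=0.3, random_state=0
)

# Same parameters as in the lab: split only nodes holding at least 20 samples,
# and seed the random number generator for repeatable results.
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(train, target)
predicted = dt.predict(test)

accuracy = accuracy_score(expected, predicted)
print("Decision Tree Classification Accuracy", accuracy)
```

Raising `min_samples_split` generally yields a shallower tree, which can reduce overfitting at the cost of some training accuracy.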

## 5. Execution – Local Computer

Launch the Command Line Interface (CLI) window by clicking on File --> Open Command-Line Interface. When you launch the CLI window from AMLWorkbench, you will automatically be placed in the project folder. The CLI needs to authenticate and set the current subscription to the one your AMLWorkbench Team Account is in. Run the following az commands from the CLI window launched from AMLWorkbench.
Launch the Command Line Interface (CLI) window by clicking on **File --> Open Command-Line Interface**. When you launch the CLI window from AMLWorkbench, you will automatically be placed in the project folder. The CLI needs to authenticate and set the current subscription to the one your AMLWorkbench Team Account is in. Run the following az commands from the CLI window launched from AMLWorkbench.

```
az login
az account list -o table
az account set -s <subscription_id_where_your_AMLWorkbench_team_account_is_in>
```
The following command executes the CATelcoCustomerChurnModeling.py file locally. After the execution finishes, you should see the output in the CLI window. The classification report is printed out using the metrics module for both Naïve Bayes and Decision Tree classifiers.

The following command executes the `CATelcoCustomerChurnModeling.py` file locally. After the execution finishes, you should see the output in the CLI window. The classification report is printed out using the metrics module for both Naïve Bayes and Decision Tree classifiers.

```
az ml experiment submit -c local CATelcoCustomerChurnModeling.py
@@ -104,36 +119,47 @@ az ml experiment submit -c local CATelcoCustomerChurnModeling.py

## 6. Jobs

On successful run, you will also find entries in Jobs tab. On selecting the job, notice the evaluation metrics obtained.
On successful completion, you will also find entries in the Jobs tab. Select the job to see the evaluation metrics obtained.

![Evaluation_Metrics](Images/EvaluationMetrics.png)

## 7. Pickled Model

In the CATelcoCustomerChurnModeling.py script, we serialize the decision tree model using the popular object serialization package -- pickle, into a file named model.pkl on disk. The code snippet is as follows:
In the `CATelcoCustomerChurnModeling.py` script, we serialize the decision tree model using the popular object serialization package -- pickle, into a file named `model.pkl` on disk. The code snippet is as follows:

```
print ("Export the model to model.pkl")
f = open('./outputs/model.pkl', 'wb')
pickle.dump(dt, f)
f.close()
```
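The sketch below shows the same serialization as a round trip, including how the file can be read back later. It uses a plain dictionary as a stand-in for the trained model and a temporary directory instead of `./outputs`; both are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the trained decision tree (assumption: any picklable object works).
model = {"min_samples_split": 20, "random_state": 99}

# Temporary directory used here in place of the lab's ./outputs folder.
output_dir = tempfile.mkdtemp()
model_path = os.path.join(output_dir, "model.pkl")

# The with-statement closes the file even if pickle.dump raises.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, for example in a scoring script, the object can be restored.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print("Restored model:", restored)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.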
When you executed the CATelcoCustomerChurnModeling.py script using the az ml execute command, the model was written to the outputs folder with the name model.pkl. This folder is only accessible from the AMLWorkbench app. You can find it in the run history detail page and download this binary file by clicking on the download button next to the file name.

When you executed the `CATelcoCustomerChurnModeling.py` script using the `az ml experiment submit` command, the model was written to the outputs folder with the name `model.pkl`. This folder is a special directory in a storage account on Azure. You can find it in the run history detail page and download this binary file by clicking on the download button next to the file name.

![OutputSfolder](Images/OutputsFolder.png)

Download the model file model.pkl and save it to the root of your project folder. You need it in the later steps.
Download the model file `model.pkl` and save it to the root of your project folder. You need it in the later steps.

## 8. Execution – Local Docker Container

If you have a Docker engine running locally, in the CLI window, run the below command. Note the change the run configuration from local to docker. PrepareEnvironment must be set to true in aml_config/docker.runconfig before you can submit.
If you have a Docker engine running locally, run the command below in the CLI window. Note that the run configuration changes from local to docker. You need to do one of the following before you submit:

- Prepare the compute target explicitly via:

```
az ml experiment prepare -c docker
```

or

- Set the `PrepareEnvironment` variable to `true` in `aml_config/docker.runconfig`.

```
az ml experiment submit -c docker CATelcoCustomerChurnModeling.py
```

This command pulls down a base docker image, layers a conda environment on the base image based on the `conda_dependencies.yml` file in the project's _aml_config_ directory, and then starts a Docker container.
This command pulls down a base docker image, layers a conda environment on the base image based on the `conda_dependencies.yml` file in the project's `aml_config` directory, and then starts a Docker container.

The results will be very, very similar to the results that you got when you ran the results locally.
The results will be very similar to the results that you got when you ran the script locally.

[Go to next hands-on lab](https://github.com/Azure/MachineLearningSamples-ChurnPrediction/blob/master/docs/ModelingAndEvaluationWithoutDprep.md)
42 changes: 20 additions & 22 deletions docs/Operationalization.md
@@ -10,15 +10,7 @@ Local mode deployments run in docker containers on your local computer, whether
To prepare the operationalization environment, in the CLI window type the following to set up the environment for local operationalization:

```
az ml env setup -n <envname> -g <resourcegroup> -l <resourceslocation>
```

Follow the instructions to provision an Azure Container Registry (ACR) instance and a storage account in which to store the Docker image we are about to create. When finished, a file named .amlenvrc.cmd is created in your home directory (usually C:\Users<username>) which contains then names and credentials of the ACR and storage account.

To set the environment variables required for operationalization, execute the .amlenvrc.cmd file from the command line.

```
c:\Users\<username>\.amlenvrc.cmd
az ml env setup -n <ENV_NAME> -g <AZURE_RESOURCEGROUP> -l <AZURE_LOCATION>
```

To verify that you have properly configured your operationalization environment for local web service deployment, enter the following command:
@@ -29,8 +21,9 @@ az ml env local

## 3. Scoring and Schema Files

Open churn_schema_gen.py. Churn_schema_gen.py is responsible for generating the scoring and schema files necessary to operationalize Churn Prediction. Prepare the web service definition by authoring init() and run() functions.
The init() function loads the model (model.pkl) as shown below:
Open `churn_schema_gen.py`. `churn_schema_gen.py` is responsible for generating the scoring and schema files necessary to operationalize Churn Prediction. Prepare the web service definition by authoring `init()` and `run()` functions.
The `init()` function loads the model (`model.pkl`) as shown below:

```
def init():
from sklearn.externals import joblib
@@ -39,7 +32,9 @@ def init():
global model
model = joblib.load('model.pkl')
```
The run() function takes the input dataframe, input_df and performs one-hot encoding. Columns_encoded is the list of all columns after encoding from the modeling exercise and the encoded dataframe is passed to the model for prediction. The three columns year, month and churn (class) are also deleted to be consistent with the preprocessing performed with the modeling.

The `run()` function takes the input dataframe (`input_df`) and performs one-hot encoding. `columns_encoded` is the list of all columns after encoding from the modeling exercise and the encoded dataframe is passed to the model for prediction. The three columns `year`, `month`, and `churn` are also deleted to be consistent with the preprocessing performed with the modeling.

```
def run(input_df):
import json
@@ -93,17 +88,18 @@ def run(input_df):

To deploy the web service, you must have a model, a scoring script, and optionally a schema for the web service input data. The scoring script loads the model.pkl file from the current folder and uses it to produce a new predicted class. The input to the model is encoded features.

To generate the scoring and schema files, execute the churn_schema_gen.py file that comes with the sample project in the AMLWorkbench CLI command prompt using Python interpreter directly.
To generate the scoring and schema files, run the `churn_schema_gen.py` file that comes with the sample project from the AMLWorkbench CLI command prompt, using the Python interpreter directly.

```
python churn_schema_gen.py
```

This will create service_schema.json (this file contains the schema of the web service input)
This will create a file called `service_schema.json`. This file contains the schema of the web service input.

### Model Management

The real-time web service requires a modelmanagement account. This can be created using the following commands:

```
az group create -l <location> -n <name>
az ml account modelmanagement create -l <location> -g <resource group> -n <account name>
@@ -113,21 +109,23 @@ az ml account modelmanagement set -n <account name> -g <resource group>
To create the real-time web service, run the following command:

```
az ml service create realtime -f score.py --model-file model.pkl -s service_schema.json -n <name> -r python
az ml service create realtime -f score.py --model-file model.pkl -s service_schema.json -n <SERVICE_NAME> -r python
```



![AzureML_Service](Images/AzureMLService.png)

The different az ml service create realtime command parameters are as follows:
* -n: app name, must be lower case.
* -f: scoring script file name
* --model-file: model file, in this case it is the pickled sklearn model
* -r: type of model, in this case it is the scikit-learn model
The model and the scoring file are uploaded into an Azure service that we manage. As part of deployment process, the operationalization component uses the pickled model model.pkl and main.py to build a Docker image named <ACR_name>.azureacr.io/irisapp. It registers the image with your Azure Container Registry (ACR) service, pulls down that image locally to your computer, and starts a Docker container based on that image. As part of the deployment, an HTTP REST endpoint for the web service is created on your local machine.
The different `az ml service create realtime` command parameters are as follows:

* `-n <SERVICE_NAME>` (must be lower case)
* `-f <SCORING_SCRIPT_FILE_NAME>`
* `--model-file <MODEL_FILENAME>`
* `-r <RUNTIME>`

The model and the scoring file are uploaded into an Azure service that we manage. As part of the deployment process, the operationalization component uses the pickled model `model.pkl` and the script file `score.py` to build a Docker image named `<ACR_name>.azurecr.io/<SERVICE_NAME>`. It registers the image with your Azure Container Registry (ACR) service, pulls down that image locally to your computer, and starts a Docker container based on that image. As part of the deployment, an HTTP REST endpoint for the web service is created on your local machine.

Run docker ps to see the churn image as shown below:
Run `docker ps` to see the churn image as shown below:

![RunDocker](Images/RunDocker.png)
