34 changes: 34 additions & 0 deletions .github/workflows/python-package-conda.yml
@@ -0,0 +1,34 @@
name: Python Package using Conda

on: [push]

jobs:
  build-linux:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 5

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python 3.10
      uses: actions/setup-python@v3
      with:
        python-version: '3.10'
    - name: Add conda to system path
      run: |
        # $CONDA is an environment variable pointing to the root of the miniconda directory
        echo $CONDA/bin >> $GITHUB_PATH
    - name: Install dependencies
      run: |
        conda env update --file environment.yml --name base
    - name: Lint with flake8
      run: |
        conda install flake8
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        conda install pytest
        pytest
10 changes: 7 additions & 3 deletions README.md
@@ -1,3 +1,6 @@
https://github.com/mha2112/Deploying-a-Scalable-ML-Pipeline-with-FastAPI


Working in a command line environment is recommended for ease of use with git and dvc. If on Windows, WSL1 or 2 is recommended.

# Environment Set up (pip or conda)
@@ -9,10 +12,11 @@ Working in a command line environment is recommended for ease of use with git an
* As you work on the code, continually commit changes. Trained models you want to use in production must be committed to GitHub.
* Connect your local git repo to GitHub.
* Set up GitHub Actions on your repo. You can use one of the pre-made GitHub Actions if at a minimum it runs pytest and flake8 on push and requires both to pass without error.
* Make sure you set up the GitHub Action to have the same version of Python as you used in development.

# Data
* Download census.csv and commit it to dvc. (Per the project comments, committing the data to dvc is not required, and this iteration of the dataset appears to be pre-cleaned, so the cleaning steps below may not be needed.)
* This data is messy, try to open it in pandas and see what you get.
* To clean it, use your favorite text editor to remove all spaces.
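
For example, a minimal pandas sketch of that cleanup (the file name and the strip-all-spaces approach are assumptions based on the instructions above, not a required method):

```python
import pandas as pd

# skipinitialspace drops the space that follows each comma in the raw file.
df = pd.read_csv("census.csv", skipinitialspace=True)

# Strip any remaining whitespace inside string cells and save a clean copy.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()
df.to_csv("census_clean.csv", index=False)
```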

17 changes: 11 additions & 6 deletions local_api.py
@@ -3,12 +3,16 @@
import requests

# TODO: send a GET using the URL http://127.0.0.1:8000
r = None # Your code here


url = "http://127.0.0.1:8000"
r = requests.get(url)

# TODO: print the status code
# print()
print(r.status_code)
# TODO: print the welcome message
# print()
print(r.json())



@@ -30,9 +34,10 @@
}

# TODO: send a POST using the data above
r = None # Your code here
r = requests.post(f"{url}/predict/", json=data)

# TODO: print the status code
# print()
print(r.status_code)
# TODO: print the result
# print()
print(r.json())
30 changes: 23 additions & 7 deletions main.py
@@ -4,8 +4,10 @@
from fastapi import FastAPI
from pydantic import BaseModel, Field

from ml.model import save_model
from ml.data import apply_label, process_data
from ml.model import inference, load_model
from ml.model import train_and_save_final_model

# DO NOT MODIFY
class Data(BaseModel):
@@ -26,25 +28,30 @@ class Data(BaseModel):
    hours_per_week: int = Field(..., example=40, alias="hours-per-week")
    native_country: str = Field(..., example="United-States", alias="native-country")

path = None # TODO: enter the path for the saved encoder
encoder = load_model(path)
encoder_path = "model/encoder.pkl"
encoder = load_model(encoder_path)

path = None # TODO: enter the path for the saved model
model = load_model(path)
model_path = "model/model.pkl"
model = load_model(model_path)

# TODO: create a RESTful API using FastAPI
app = None # your code here
app = FastAPI()


# TODO: create a GET on the root giving a welcome message
@app.get("/")
async def get_root():
    """ Say hello!"""
    return {"message": "Hello!"}


# TODO: create a POST on a different path that does model inference
@app.post("/data/")
@app.post("/predict/")
async def post_inference(data: Data):
    # DO NOT MODIFY: turn the Pydantic model into a dict.
    data_dict = data.dict()
@@ -66,9 +73,18 @@ async def post_inference(data: Data):
    ]
    data_processed, _, _, _ = process_data(
        X=data,
        categorical_features=cat_features,
        label=None,
        training=False,
        encoder=encoder,
    )
_inference = None # your code here to predict the result using data_processed
    _inference = inference(model, data_processed)
    return {"result": apply_label(_inference)}



54 changes: 53 additions & 1 deletion ml/model.py
@@ -2,6 +2,11 @@
from sklearn.metrics import fbeta_score, precision_score, recall_score
from ml.data import process_data
# TODO: add necessary import
import os
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd


# Optional: implement hyperparameter tuning.
def train_model(X_train, y_train):
@@ -20,6 +25,9 @@ def train_model(X_train, y_train):
        Trained machine learning model.
    """
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    return model


@@ -60,6 +68,7 @@ def inference(model, X):
        Predictions from the model.
    """
    return model.predict(X)

def save_model(model, path):
@@ -73,11 +82,16 @@ def save_model(model, path):
        Path to save pickle file.
    """
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
""" Loads pickle file from `path` and returns it."""
# TODO: implement the function
print(f'loading model path {path}')
with open(path,"rb") as f:
return pickle.load(f)
pass


@@ -117,12 +131,50 @@ def performance_on_categorical_slice(
    fbeta : float

    """
    sliced_data = data[data[column_name] == slice_value]

    X_slice, y_slice, _, _ = process_data(
        sliced_data,
        categorical_features=categorical_features,
        label=label,
        training=False,
        encoder=encoder,
        lb=lb,
    )
preds = None # your code here to get prediction on X_slice using the inference function
    preds = inference(model, X_slice)
    precision, recall, fbeta = compute_model_metrics(y_slice, preds)
    return precision, recall, fbeta

# Added to train and persist the final model after k-fold evaluation.
def train_and_save_final_model(data, categorical_features, label, model_dir):
    X_all, y_all, encoder, lb = process_data(
        data,
        categorical_features=categorical_features,
        label=label,
        training=True,
    )

    model = train_model(X_all, y_all)

    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.pkl")
    encoder_path = os.path.join(model_dir, "encoder.pkl")
    lb_path = os.path.join(model_dir, "lb.pkl")

    save_model(model, model_path)
    save_model(encoder, encoder_path)
    save_model(lb, lb_path)

    # Confirm where the artifacts were written.
    print(f"Model saved to {model_path}")
    print(f"Encoder saved to {encoder_path}")
    print(f"Label binarizer saved to {lb_path}")

    return model, encoder, lb

Binary file added model/encoder.pkl
Binary file not shown.
Binary file added model/lb.pkl
Binary file not shown.
Binary file added model/model.pkl
Binary file not shown.
38 changes: 37 additions & 1 deletion model_card_template.md
@@ -3,16 +3,52 @@
For additional information see the Model Card paper: https://arxiv.org/pdf/1810.03993.pdf

## Model Details
Salary prediction from census data using a random forest classifier: an ensemble of decision trees whose combined votes make the prediction more accurate and robust than a single tree.
A OneHotEncoder is used for the categorical columns and a LabelBinarizer for the salary label.
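
As a rough sketch of how these pieces fit together (illustrative only; the project wires the encoders through its own `process_data` helper, and the file path and column names below are assumptions based on the census schema):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

df = pd.read_csv("data/census.csv")  # assumed path
cat_features = ["workclass", "education", "marital-status", "occupation",
                "relationship", "race", "sex", "native-country"]

# One-hot encode the categorical columns and binarize the salary label.
encoder = OneHotEncoder(handle_unknown="ignore")
lb = LabelBinarizer()
X_cat = encoder.fit_transform(df[cat_features]).toarray()
X_num = df.drop(columns=cat_features + ["salary"]).to_numpy()
X = np.concatenate([X_num, X_cat], axis=1)
y = lb.fit_transform(df["salary"]).ravel()

model = RandomForestClassifier(random_state=42).fit(X, y)
```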

## Intended Use
The model is intended to predict, from census attributes, whether an individual earns more or less than 50K a year, and to show which categories are most associated with each outcome.

## Training Data

The model is trained on the census data (census.csv). It was evaluated with k-fold cross-validation using the standard k = 5, and the per-fold metrics were compared to check that performance is consistent across the folds.
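
A sketch of that evaluation loop, assuming the project's `process_data`, `train_model`, `inference`, and `compute_model_metrics` helpers and a `salary` label column (the actual training script is not part of this diff):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

from ml.data import process_data
from ml.model import compute_model_metrics, inference, train_model

data = pd.read_csv("data/census.csv")  # assumed path
cat_features = ["workclass", "education", "marital-status", "occupation",
                "relationship", "race", "sex", "native-country"]

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(data):
    train, test = data.iloc[train_idx], data.iloc[test_idx]
    X_train, y_train, encoder, lb = process_data(
        train, categorical_features=cat_features, label="salary", training=True
    )
    X_test, y_test, _, _ = process_data(
        test, categorical_features=cat_features, label="salary",
        training=False, encoder=encoder, lb=lb,
    )
    model = train_model(X_train, y_train)
    preds = inference(model, X_test)
    scores.append(compute_model_metrics(y_test, preds))  # (precision, recall, fbeta)

print("avg precision, recall, f1:", np.mean(scores, axis=0))
```
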
## Evaluation Data
When the model predicts that a person earns over 50K, it is correct about 73% of the time on the held-out test data (precision ≈ 0.73).
The slice_output.txt file shows that well-represented categories produce metrics close to the overall scores, while under-represented categories with very few samples can score high on one metric and near zero on the others:

relationship: Own-child, Count: 1,019
Precision: 1.0000 | Recall: 0.1765 | F1: 0.3000

The Own-child relationship slice has a high count but low recall and F1.
A precision of 1.0 means that everyone the model flagged as earning over 50K in this slice really did, but the low recall shows it misses most of the actual high earners, even with 1,019 samples.

native-country: Cambodia, Count: 3
Precision: 1.0000 | Recall: 0.0000 | F1: 0.0000

The native-country slice for Cambodia contains only 3 people, so the metrics carry little information: the test data included one person from Cambodia who made over 50K, but the recall and F1 of zero show the model did not identify any of the positives in this slice, and a precision of 1.0 on such a tiny sample is not a confident result.
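
To make the precision/recall gap concrete, here is a worked example with hypothetical counts chosen only to reproduce the reported Own-child numbers (not the actual confusion matrix): if 17 people in a slice truly earn over 50K and the model flags just 3 of them, with no false positives, precision is 3/3 = 1.0 while recall is 3/17 ≈ 0.18 and F1 ≈ 0.30.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical slice: 17 actual positives; the model flags 3 of them and no one else.
y_true = [1] * 17 + [0] * 100
y_pred = [1] * 3 + [0] * 14 + [0] * 100

print(precision_score(y_true, y_pred))      # 1.0
print(recall_score(y_true, y_pred))         # ~0.1765
print(fbeta_score(y_true, y_pred, beta=1))  # 0.30
```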


## Metrics
Precision, recall, and F1 score were used to evaluate the model. The values for each of the five folds were:

Precision: 0.7451 | Recall: 0.6308 | F1: 0.6832
Precision: 0.7087 | Recall: 0.6164 | F1: 0.6593
Precision: 0.7346 | Recall: 0.6180 | F1: 0.6713
Precision: 0.7475 | Recall: 0.6222 | F1: 0.6791
Precision: 0.7453 | Recall: 0.6307 | F1: 0.6832

The averages of these scores are:
avg precision: 0.7363
avg recall: 0.6236
avg f1_score: 0.6752

Precision, the fraction of predicted positives that are actually correct (true positives), is about 73%.
Recall, or sensitivity, the fraction of actual positives the model finds rather than misses, is about 62%.
The F1 score, which balances precision and recall, is about 67%.
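
The averages can be reproduced directly from the per-fold values above (computed from these rounded values the precision average comes out to 0.7362; the reported 0.7363 presumably reflects the unrounded fold scores):

```python
import numpy as np

precision = [0.7451, 0.7087, 0.7346, 0.7475, 0.7453]
recall = [0.6308, 0.6164, 0.6180, 0.6222, 0.6307]
f1 = [0.6832, 0.6593, 0.6713, 0.6791, 0.6832]

print(f"avg precision: {np.mean(precision):.4f}")  # 0.7362
print(f"avg recall: {np.mean(recall):.4f}")        # 0.6236
print(f"avg f1_score: {np.mean(f1):.4f}")          # 0.6752
```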

## Ethical Considerations
The model predicts salary from the other demographic and work-related attributes, and performance slicing is used to identify possible bias across sensitive categories such as sex and race.

## Caveats and Recommendations
Missing values in the raw data appear as question marks rather than blanks. For example, the slice output reports occupation: ?, Count: 389 with Precision: 0.6923 | Recall: 0.4286 | F1: 0.5294, i.e. 389 rows whose occupation is recorded as "?". The data is therefore not fully clean, and these placeholder values can distort the metrics and distract from the true patterns in the data.
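
One way to surface and handle those placeholders before training (a minimal pandas sketch; the file path and the drop-versus-impute choice are assumptions, not project requirements):

```python
import pandas as pd

df = pd.read_csv("data/census.csv")  # assumed path

# Count "?" placeholders in the string columns.
obj_cols = df.select_dtypes(include="object")
print((obj_cols == "?").sum().sort_values(ascending=False).head())

# Treat "?" as missing, then either drop or impute those rows.
df = df.replace("?", pd.NA)
df = df.dropna()  # or, e.g.: df["occupation"] = df["occupation"].fillna("Unknown")
```
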
Empty file added python3
Empty file.
Binary file added screenshots/continuous_integration.png
Binary file added screenshots/local_api.png
Binary file added screenshots/unit_test.png