34 changes: 34 additions & 0 deletions .github/workflows/python-package-conda.yml
@@ -0,0 +1,34 @@
name: Python Package using Conda

on: [push]

jobs:
  build-linux:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 5

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python 3.10
      uses: actions/setup-python@v3
      with:
        python-version: '3.10'
    - name: Add conda to system path
      run: |
        # $CONDA is an environment variable pointing to the root of the miniconda directory
        echo $CONDA/bin >> $GITHUB_PATH
    - name: Install dependencies
      run: |
        conda env update --file environment.yml --name base
    - name: Lint with flake8
      run: |
        conda install flake8
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        conda install pytest
        pytest
10 changes: 7 additions & 3 deletions README.md
@@ -1,3 +1,6 @@
https://github.com/mha2112/Deploying-a-Scalable-ML-Pipeline-with-FastAPI


Working in a command line environment is recommended for ease of use with git and dvc. If on Windows, WSL1 or 2 is recommended.

# Environment Set up (pip or conda)
@@ -9,10 +12,11 @@ Working in a command line environment is recommended for ease of use with git an
* As you work on the code, continually commit changes. Trained models you want to use in production must be committed to GitHub.
* Connect your local git repo to GitHub.
* Set up GitHub Actions on your repo. You can use one of the pre-made GitHub Actions if at a minimum it runs pytest and flake8 on push and requires both to pass without error.
* Make sure you set up the GitHub Action to have the same version of Python as you used in development.

# Data
* Download census.csv and commit it to dvc. (Per the project comments, committing the data to dvc is not required, and this iteration of the dataset appears to be pre-cleaned, so the cleaning steps below may not be needed.)
* This data is messy, try to open it in pandas and see what you get.
* To clean it, use your favorite text editor to remove all spaces.
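
For example, a minimal pandas sketch of that cleanup (the file name and the strip-all-spaces approach are assumptions based on the instructions above, not a required method):

```python
import pandas as pd

# skipinitialspace drops the space that follows each comma in the raw file.
df = pd.read_csv("census.csv", skipinitialspace=True)

# Strip any remaining whitespace inside string cells and save a clean copy.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()
df.to_csv("census_clean.csv", index=False)
```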

17 changes: 11 additions & 6 deletions local_api.py
@@ -3,12 +3,16 @@
import requests

# TODO: send a GET using the URL http://127.0.0.1:8000
r = None # Your code here


url = "http://127.0.0.1:8000"
r = requests.get(url)

# TODO: print the status code
# print()
print(r.status_code)
# TODO: print the welcome message
# print()
print(r.json())



@@ -30,9 +34,10 @@
}

# TODO: send a POST using the data above
r = None # Your code here
r = requests.post(f"{url}/predict/", json=data)

# TODO: print the status code
# print()
print(r.status_code)
# TODO: print the result
# print()
print(r.json())
30 changes: 23 additions & 7 deletions main.py
@@ -4,8 +4,10 @@
from fastapi import FastAPI
from pydantic import BaseModel, Field

from ml.model import save_model
from ml.data import apply_label, process_data
from ml.model import inference, load_model
from ml.model import train_and_save_final_model

# DO NOT MODIFY
class Data(BaseModel):
@@ -26,25 +28,30 @@ class Data(BaseModel):
    hours_per_week: int = Field(..., example=40, alias="hours-per-week")
    native_country: str = Field(..., example="United-States", alias="native-country")

path = None # TODO: enter the path for the saved encoder
encoder = load_model(path)
encoder_path = "model/encoder.pkl"
encoder = load_model(encoder_path)

path = None # TODO: enter the path for the saved model
model = load_model(path)
model_path = "model/model.pkl"
model = load_model(model_path)

# TODO: create a RESTful API using FastAPI
app = None # your code here
app = FastAPI()


# TODO: create a GET on the root giving a welcome message
@app.get("/")
async def get_root():
    """ Say hello!"""
    return {"message": "Hello!"}


# TODO: create a POST on a different path that does model inference
@app.post("/data/")
@app.post("/predict/")
async def post_inference(data: Data):
    # DO NOT MODIFY: turn the Pydantic model into a dict.
    data_dict = data.dict()
@@ -66,9 +73,18 @@ async def post_inference(data: Data):
    ]
    data_processed, _, _, _ = process_data(
        X=data,
        categorical_features=cat_features,
        label=None,
        training=False,
        encoder=encoder,
    )
_inference = None # your code here to predict the result using data_processed
    _inference = inference(model, data_processed)
    return {"result": apply_label(_inference)}



54 changes: 53 additions & 1 deletion ml/model.py
@@ -2,6 +2,11 @@
from sklearn.metrics import fbeta_score, precision_score, recall_score
from ml.data import process_data
# TODO: add necessary import
import os
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd


# Optional: implement hyperparameter tuning.
def train_model(X_train, y_train):
@@ -20,6 +25,9 @@ def train_model(X_train, y_train):
        Trained machine learning model.
    """
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    return model


@@ -60,6 +68,7 @@ def inference(model, X):
        Predictions from the model.
    """
    return model.predict(X)

def save_model(model, path):
@@ -73,11 +82,16 @@ def save_model(model, path):
        Path to save pickle file.
    """
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
""" Loads pickle file from `path` and returns it."""
# TODO: implement the function
print(f'loading model path {path}')
with open(path,"rb") as f:
return pickle.load(f)
pass


@@ -117,12 +131,50 @@ def performance_on_categorical_slice(
    fbeta : float

    """
    sliced_data = data[data[column_name] == slice_value]

    X_slice, y_slice, _, _ = process_data(
        sliced_data,
        categorical_features=categorical_features,
        label=label,
        training=False,
        encoder=encoder,
        lb=lb,
    )
preds = None # your code here to get prediction on X_slice using the inference function
    preds = inference(model, X_slice)
    precision, recall, fbeta = compute_model_metrics(y_slice, preds)
    return precision, recall, fbeta

# Added to train and persist the final model after k-fold evaluation.
def train_and_save_final_model(data, categorical_features, label, model_dir):
    X_all, y_all, encoder, lb = process_data(
        data,
        categorical_features=categorical_features,
        label=label,
        training=True,
    )

    model = train_model(X_all, y_all)

    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.pkl")
    encoder_path = os.path.join(model_dir, "encoder.pkl")
    lb_path = os.path.join(model_dir, "lb.pkl")

    save_model(model, model_path)
    save_model(encoder, encoder_path)
    save_model(lb, lb_path)

    # Confirm where the artifacts were written.
    print(f"Model saved to {model_path}")
    print(f"Encoder saved to {encoder_path}")
    print(f"Label binarizer saved to {lb_path}")

    return model, encoder, lb

Binary file added model/encoder.pkl
Binary file not shown.
Binary file added model/lb.pkl
Binary file not shown.
Binary file added model/model.pkl
Binary file not shown.
38 changes: 37 additions & 1 deletion model_card_template.md
@@ -3,16 +3,52 @@
For additional information see the Model Card paper: https://arxiv.org/pdf/1810.03993.pdf

## Model Details
Salary prediction from census data using a random forest classifier: an ensemble of decision trees whose combined votes make the prediction more accurate and robust than a single tree.
A OneHotEncoder is used for the categorical columns and a LabelBinarizer for the salary label.
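
As a rough sketch of how these pieces fit together (illustrative only; the project wires the encoders through its own `process_data` helper, and the file path and column names below are assumptions based on the census schema):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

df = pd.read_csv("data/census.csv")  # assumed path
cat_features = ["workclass", "education", "marital-status", "occupation",
                "relationship", "race", "sex", "native-country"]

# One-hot encode the categorical columns and binarize the salary label.
encoder = OneHotEncoder(handle_unknown="ignore")
lb = LabelBinarizer()
X_cat = encoder.fit_transform(df[cat_features]).toarray()
X_num = df.drop(columns=cat_features + ["salary"]).to_numpy()
X = np.concatenate([X_num, X_cat], axis=1)
y = lb.fit_transform(df["salary"]).ravel()

model = RandomForestClassifier(random_state=42).fit(X, y)
```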

## Intended Use
The model is intended to predict, from census attributes, whether an individual earns more or less than 50K a year, and to show which categories are most associated with each outcome.

## Training Data

The model is trained on the census data (census.csv). It was evaluated with k-fold cross-validation using the standard k = 5, and the per-fold metrics were compared to check that performance is consistent across the folds.
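
A sketch of that evaluation loop, assuming the project's `process_data`, `train_model`, `inference`, and `compute_model_metrics` helpers and a `salary` label column (the actual training script is not part of this diff):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

from ml.data import process_data
from ml.model import compute_model_metrics, inference, train_model

data = pd.read_csv("data/census.csv")  # assumed path
cat_features = ["workclass", "education", "marital-status", "occupation",
                "relationship", "race", "sex", "native-country"]

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(data):
    train, test = data.iloc[train_idx], data.iloc[test_idx]
    X_train, y_train, encoder, lb = process_data(
        train, categorical_features=cat_features, label="salary", training=True
    )
    X_test, y_test, _, _ = process_data(
        test, categorical_features=cat_features, label="salary",
        training=False, encoder=encoder, lb=lb,
    )
    model = train_model(X_train, y_train)
    preds = inference(model, X_test)
    scores.append(compute_model_metrics(y_test, preds))  # (precision, recall, fbeta)

print("avg precision, recall, f1:", np.mean(scores, axis=0))
```
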
## Evaluation Data
When the model predicts that a person earns over 50K, it is correct about 73% of the time on the held-out test data (precision ≈ 0.73).
The slice_output.txt file shows that well-represented categories produce metrics close to the overall scores, while under-represented categories with very few samples can score high on one metric and near zero on the others:

relationship: Own-child, Count: 1,019
Precision: 1.0000 | Recall: 0.1765 | F1: 0.3000

The Own-child relationship slice has a high count but low recall and F1.
A precision of 1.0 means that everyone the model flagged as earning over 50K in this slice really did, but the low recall shows it misses most of the actual high earners, even with 1,019 samples.

native-country: Cambodia, Count: 3
Precision: 1.0000 | Recall: 0.0000 | F1: 0.0000

The native-country slice for Cambodia contains only 3 people, so the metrics carry little information: the test data included one person from Cambodia who made over 50K, but the recall and F1 of zero show the model did not identify any of the positives in this slice, and a precision of 1.0 on such a tiny sample is not a confident result.
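
To make the precision/recall gap concrete, here is a worked example with hypothetical counts chosen only to reproduce the reported Own-child numbers (not the actual confusion matrix): if 17 people in a slice truly earn over 50K and the model flags just 3 of them, with no false positives, precision is 3/3 = 1.0 while recall is 3/17 ≈ 0.18 and F1 ≈ 0.30.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical slice: 17 actual positives; the model flags 3 of them and no one else.
y_true = [1] * 17 + [0] * 100
y_pred = [1] * 3 + [0] * 14 + [0] * 100

print(precision_score(y_true, y_pred))      # 1.0
print(recall_score(y_true, y_pred))         # ~0.1765
print(fbeta_score(y_true, y_pred, beta=1))  # 0.30
```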


## Metrics
Precision, recall, and F1 score were used to evaluate the model. The values for each of the five folds were:

Precision: 0.7451 | Recall: 0.6308 | F1: 0.6832
Precision: 0.7087 | Recall: 0.6164 | F1: 0.6593
Precision: 0.7346 | Recall: 0.6180 | F1: 0.6713
Precision: 0.7475 | Recall: 0.6222 | F1: 0.6791
Precision: 0.7453 | Recall: 0.6307 | F1: 0.6832

The averages of these scores are:
avg precision: 0.7363
avg recall: 0.6236
avg f1_score: 0.6752

Precision, the fraction of predicted positives that are actually correct (true positives), is about 73%.
Recall, or sensitivity, the fraction of actual positives the model finds rather than misses, is about 62%.
The F1 score, which balances precision and recall, is about 67%.
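
The averages can be reproduced directly from the per-fold values above (computed from these rounded values the precision average comes out to 0.7362; the reported 0.7363 presumably reflects the unrounded fold scores):

```python
import numpy as np

precision = [0.7451, 0.7087, 0.7346, 0.7475, 0.7453]
recall = [0.6308, 0.6164, 0.6180, 0.6222, 0.6307]
f1 = [0.6832, 0.6593, 0.6713, 0.6791, 0.6832]

print(f"avg precision: {np.mean(precision):.4f}")  # 0.7362
print(f"avg recall: {np.mean(recall):.4f}")        # 0.6236
print(f"avg f1_score: {np.mean(f1):.4f}")          # 0.6752
```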

## Ethical Considerations
The model predicts salary from the other demographic and work-related attributes, and performance slicing is used to identify possible bias across sensitive categories such as sex and race.

## Caveats and Recommendations
Missing values in the raw data appear as question marks rather than blanks. For example, the slice output reports occupation: ?, Count: 389 with Precision: 0.6923 | Recall: 0.4286 | F1: 0.5294, i.e. 389 rows whose occupation is recorded as "?". The data is therefore not fully clean, and these placeholder values can distort the metrics and distract from the true patterns in the data.
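
One way to surface and handle those placeholders before training (a minimal pandas sketch; the file path and the drop-versus-impute choice are assumptions, not project requirements):

```python
import pandas as pd

df = pd.read_csv("data/census.csv")  # assumed path

# Count "?" placeholders in the string columns.
obj_cols = df.select_dtypes(include="object")
print((obj_cols == "?").sum().sort_values(ascending=False).head())

# Treat "?" as missing, then either drop or impute those rows.
df = df.replace("?", pd.NA)
df = df.dropna()  # or, e.g.: df["occupation"] = df["occupation"].fillna("Unknown")
```
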
Empty file added python3
Empty file.
Binary file added screenshots/continuous_integration.png
Binary file added screenshots/local_api.png
Binary file added screenshots/unit_test.png