docs/machine-learning/advanced-ml-topics/mlops/ci-cd.mdx (123 additions, 0 deletions)
---
title: "CI/CD/CT: Automated Pipelines for ML"
sidebar_label: CI/CD for ML
description: "Exploring Continuous Integration, Continuous Delivery, and Continuous Training in MLOps."
tags: [mlops, cicd, continuous-training, automation, jenkins, github-actions]
---

In traditional software, we have **CI** (Continuous Integration) and **CD** (Continuous Delivery). However, Machine Learning introduces a third dimension: **Data**. Because data changes over time, we need a third pillar: **CT** (Continuous Training).

## 1. The Three Pillars of MLOps Automation

To build a robust ML system, we must automate three distinct cycles:

### Continuous Integration (CI)
Beyond testing code, ML CI involves testing **data schemas** and **models**.
* **Code Testing:** Unit tests for feature engineering logic.
* **Data Testing:** Validating that incoming data matches the expected schemas and distributions (a test sketch follows this list).
* **Model Validation:** Ensuring the model architecture compiles and training runs without memory leaks.
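
As a concrete (if simplified) illustration, a CI job might run pytest checks like the following. The column names, value ranges, and the `load_incoming_batch` helper are hypothetical placeholders for your own data-loading code, not a specific library's API.

```python
# test_data_validation.py -- illustrative CI checks; schema and paths are hypothetical
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "amount", "country"}  # assumed schema for this example

def load_incoming_batch() -> pd.DataFrame:
    # Placeholder: in a real pipeline this would read the latest ingested batch.
    return pd.read_csv("data/incoming_batch.csv")

def test_schema_matches():
    df = load_incoming_batch()
    assert set(df.columns) == EXPECTED_COLUMNS, "Schema drifted from the expected columns"

def test_no_missing_values_in_required_column():
    df = load_incoming_batch()
    assert df["amount"].notna().all(), "Missing values in a required column"

def test_value_ranges():
    df = load_incoming_batch()
    assert (df["amount"] >= 0).all(), "Negative amounts violate the expected range"
```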

### Continuous Delivery (CD)
This is the automation of deploying the model as a service.
* **Artifact Packaging:** Wrapping the model in a [Docker container](./model-deployment#2-the-containerization-standard-docker).
* **Integration Testing:** Ensuring the API endpoint responds correctly to requests.
* **Deployment:** Moving the model to a staging or production environment using [Canary or Blue-Green strategies](./model-deployment#3-deployment-strategies).

### Continuous Training (CT)
This is unique to ML. It is a property of an ML system that automatically retrains and serves the model based on new data or [Model Drift](./monitoring#1-why-models-decay).
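
A minimal sketch of the CT trigger logic, assuming a drift score produced by your monitoring system and a pipeline-launch callable supplied by your orchestrator (the names and thresholds here are illustrative, not a specific library's API):

```python
from typing import Callable

DRIFT_THRESHOLD = 0.2  # assumed threshold, tuned per project

def maybe_retrain(
    drift_score: float,
    new_rows: int,
    launch_pipeline: Callable[[], None],
    min_new_rows: int = 10_000,
) -> bool:
    """Trigger retraining when drift is high or enough new data has arrived."""
    if drift_score > DRIFT_THRESHOLD or new_rows >= min_new_rows:
        launch_pipeline()  # e.g., submit an Airflow DAG run or a Kubeflow pipeline here
        return True
    return False

if __name__ == "__main__":
    # Toy usage: in production, drift_score would come from the monitoring system.
    maybe_retrain(drift_score=0.35, new_rows=1_200,
                  launch_pipeline=lambda: print("retraining..."))
```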

## 2. The MLOps Maturity Levels

Google defines the evolution of CI/CD in ML through three levels of maturity:

* **Level 0 (Manual):** Every step (data prep, training, deployment) is done manually in notebooks.
* **Level 1 (Automated Training):** The pipeline is automated. Whenever new data arrives, training and validation happen automatically (CT).
* **Level 2 (CI/CD Pipeline Automation):** The entire workflow, from code commits to model monitoring, is a fully automated CI/CD pipeline.

## 3. The Automated Workflow

The following diagram illustrates how a code change or a "Drift" alert triggers a sequence of automated events.

```mermaid
graph TD
Code[Code Commit / Data Drift Alert] --> CI[CI: Build & Test]

subgraph Pipeline [Automated ML Pipeline]
CI --> Train[Continuous Training]
Train --> Eval[Model Evaluation]
Eval --> Validate{Meets Threshold?}
end

Validate -- No --> Fail[Alert Developer]
Validate -- Yes --> Register[Model Registry]

Register --> CD[CD: Deploy to Prod]
CD --> Monitor[Monitoring & Observability]
Monitor -- Drift Detected --> Code

style Pipeline fill:#f0f4ff,stroke:#5c7aff,stroke-width:2px,color:#333
style Validate fill:#fff3e0,stroke:#ef6c00,color:#333
style Register fill:#c8e6c9,stroke:#2e7d32,color:#333

```

## 4. Key Components of the Pipeline

* **Feature Store:** A centralized repository where features are stored and shared, ensuring that the same feature logic is used in both training and serving.
* **Model Registry:** A "version control" system for models. It stores trained models, their metadata (hyperparameters, accuracy), and their environment dependencies (see the MLflow sketch after this list).
* **Metadata Store:** Records every execution of the pipeline, allowing you to trace a specific model version back to the exact dataset and code used to create it.
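
As one hedged example of a model registry in practice, the MLflow API below logs a trained model with its metadata and registers it under a versioned name. The model name, metric, and SQLite backend are illustrative choices, and exact arguments can vary between MLflow versions.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A database-backed store is needed for registry features; SQLite works locally.
mlflow.set_tracking_uri("sqlite:///mlflow.db")

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.log_param("C", 1.0)                          # hyperparameters -> metadata
    mlflow.log_metric("accuracy", model.score(X, y))    # evaluation metric
    mlflow.sklearn.log_model(model, "model")            # the serialized model artifact

# Register this run's model under a versioned name ("churn-classifier" is a placeholder).
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-classifier")
```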

## 5. Tools of the Trade

Depending on your cloud provider, the tools for CI/CD/CT vary:

| Component | Open Source | AWS | Google Cloud |
| --- | --- | --- | --- |
| **Orchestration** | Kubeflow / Airflow | Step Functions | Vertex AI Pipelines |
| **CI/CD** | GitHub Actions / GitLab | CodePipeline | Cloud Build |
| **Tracking** | MLflow | SageMaker Experiments | Vertex AI Metadata |
| **Storage** | DVC (Data Version Control) | S3 | GCS |

## 6. Implementation: A GitHub Actions Snippet

Below is a simple CI job that checks whether the model's accuracy meets a threshold before allowing a push to production.

```yaml
name: Model Training CI
on: [push]

jobs:
  train-and-validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run Training & Evaluation
        run: python train.py  # Script generates 'metrics.json'

      - name: Check Accuracy Threshold
        run: |
          ACCURACY=$(jq '.accuracy' metrics.json)
          if (( $(echo "$ACCURACY < 0.85" | bc -l) )); then
            echo "Accuracy too low ($ACCURACY). Failing the build."
            exit 1
          fi
```
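
The workflow above assumes that `train.py` writes a `metrics.json` file. A minimal sketch of such a script might look like the following; the dataset and model choice are stand-ins for your own training pipeline.

```python
# train.py -- illustrative; the dataset and model are placeholders
import json
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Write the metric that the CI job reads with jq in the next step.
with open("metrics.json", "w") as f:
    json.dump({"accuracy": accuracy}, f)

print(f"Test accuracy: {accuracy:.3f}")
```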

## References

* **Google Cloud:** [MLOps: Continuous delivery and automation pipelines](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
* **ThoughtWorks:** [Continuous Delivery for Machine Learning (CD4ML)](https://martinfowler.com/articles/cd4ml.html)
* **MLflow:** [Introduction to Model Registry](https://www.mlflow.org/docs/latest/model-registry.html)

---

**With CI/CD/CT, your model is now a living, breathing part of your infrastructure. But how do we ensure it remains ethical and unbiased throughout these cycles?**
docs/machine-learning/advanced-ml-topics/mlops/data-versioning.mdx (110 additions, 0 deletions)
---
title: "Data Versioning: The Git for Data"
sidebar_label: Data Versioning
description: "Understanding how to track changes in datasets to ensure reproducibility and auditability in ML experiments."
tags: [mlops, data-versioning, dvc, reproducibility, data-lake]
---

In traditional software development, versioning code with **Git** is enough to recreate any state of an application. In Machine Learning, code is only half the story. The resulting model depends on both the **Code** and the **Data**.

If you retrain your model today and get different results than yesterday, you need to know exactly which version of the dataset was used. **Data Versioning** provides the "undo button" for your data.

## 1. Why Git Isn't Enough for Data

Git is designed to track small text files. It struggles with the large data and binary files (CSV, Parquet, images, audio) typically used in ML, for several reasons:

* **Storage Limits:** Storing gigabytes of data in a Git repository slows down operations significantly.
* **Diffing:** Git cannot efficiently show differences between two 5GB binary files.
* **Cost:** Hosting large blobs in GitHub or GitLab is expensive and inefficient.

**Data Versioning tools** solve this by tracking "pointers" (metadata) in Git, while storing the actual data in external storage (S3, GCS, Azure Blob).

## 2. The Core Concept: Metadata vs. Storage

Data versioning works by creating a **hash** (unique ID) of your data files.

1. **The Data:** Stored in a scalable cloud bucket (e.g., AWS S3).
2. **The Metafile:** A tiny text file containing the hash and file path. This file **is** committed to Git.

<br />

<img className="rounded" src="/tutorial/img/tutorials/ml/git-dvc-s3.png" alt="The relationship between Git (tracking .dvc files) and S3 (tracking large datasets)" />
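
To make the "pointer" idea concrete, here is a minimal sketch (not DVC's actual implementation) of hashing a data file and writing a tiny metafile that could be committed to Git while the data itself lives in object storage. The file paths mirror the example later on this page.

```python
import hashlib
import json
from pathlib import Path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_path = Path("data/train_images.zip")        # the large file, pushed to S3/GCS
pointer = {"md5": file_md5(data_path), "path": str(data_path)}

# The small metafile is what gets committed to Git.
Path("data/train_images.zip.meta.json").write_text(json.dumps(pointer, indent=2))
```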


## 3. Workflow Logic

The following diagram illustrates how DVC (Data Version Control) interacts with Git and remote storage to maintain synchronization.

```mermaid
graph TD
subgraph Local_Machine [Local Workspace]
Code[script.py] -- "git commit" --> Git[(Git Repo)]
Data[data.csv] -- "dvc add" --> Meta[.dvc file]
Meta -- "git commit" --> Git
end

subgraph Storage [Remote Storage]
Data -- "dvc push" --> Cloud[(S3 / GCS Bucket)]
end

subgraph Collaborator [Team Member]
Git -- "git pull" --> NewMeta[.dvc file]
NewMeta -- "dvc pull" --> NewData[data.csv]
Cloud -- download --> NewData
end

style Storage fill:#f1f8e9,stroke:#558b2f,color:#333
style Git fill:#e1f5fe,stroke:#01579b,color:#333
style Data fill:#fff3e0,stroke:#ef6c00,color:#333

```

## 4. Popular Data Versioning Tools

| Tool | Focus | Best For |
| --- | --- | --- |
| **DVC (Data Version Control)** | Open-source, Git-like CLI. | Teams already comfortable with Git. |
| **Pachyderm** | Data lineage and pipelining. | Complex data pipelines on Kubernetes. |
| **LakeFS** | Git-like branches for Data Lakes. | Teams using S3/GCS as their primary data source. |
| **W&B Artifacts** | Integrated with experiment tracking. | Visualizing data lineage alongside model training. |

## 5. Implementation with DVC

DVC is the most popular tool because it integrates seamlessly with your existing Git workflow.

```bash
# 1. Initialize DVC in your project
dvc init

# 2. Add a large dataset (this creates train_images.zip.dvc)
dvc add data/train_images.zip

# 3. Track the metadata in Git
git add data/train_images.zip.dvc .gitignore
git commit -m "Add raw training images version 1.0"

# 4. Push the actual data to a remote (S3, GCS, etc.)
dvc remote add -d myremote s3://my-bucket/data
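# (-d sets the default remote; the config is saved in .dvc/config, which is committed to Git)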
dvc push

# 5. Switching versions
git checkout v2.0-experiment
dvc checkout # This physically swaps the data files in your folder

```

## 6. The Benefits of Versioning Data

* **Reproducibility:** You can recreate the exact environment of a model trained 6 months ago.
* **Compliance & Auditing:** In regulated industries (finance/healthcare), you must be able to show exactly what data was used to train a model to explain its decisions.
* **Collaboration:** Multiple researchers can work on different versions of the data without overwriting each other's work.
* **Data Lineage:** Tracking the "ancestry" of a dataset—knowing that `clean_data.csv` was generated from `raw_data.csv` using `clean.py`.

## References

* **DVC Documentation:** [Get Started with DVC](https://dvc.org/doc/start)
* **LakeFS:** [Git for Data Lakes](https://lakefs.io/)

---

**Data versioning is the foundation of a reproducible pipeline. Now that we can track our data and code, how do we track the experiments and hyperparameter results?**
docs/machine-learning/advanced-ml-topics/mlops/model-deployment.mdx (112 additions, 0 deletions)
---
title: "Model Deployment: Moving from Lab to Production"
sidebar_label: Deployment
description: "Strategies for serving machine learning models, including batch vs. real-time, containerization, and deployment patterns."
tags: [mlops, deployment, docker, kubernetes, api, serving]
---

**Model Deployment** is the process of integrating a machine learning model into an existing production environment where it can take in data and return predictions. It is the final stage of the ML pipeline, but it is also the beginning of the model's "life" where it provides actual value.

## 1. Deployment Modes

Before choosing a tool, you must decide how the users will consume the predictions.

| Mode | Description | Example |
| :--- | :--- | :--- |
| **Request-Response (Real-time)** | The model lives behind an API. Predictions are returned instantly (low latency). | **Fraud Detection** during a credit card swipe. |
| **Batch Scoring** | The model runs on a large set of data at scheduled intervals (e.g., every night). | **Recommendation Emails** sent to users once a day. |
| **Streaming** | The model consumes data from a queue (like Kafka) and outputs predictions continuously. | **Log Monitoring** for cybersecurity threats. |
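
To contrast with the real-time API shown later on this page, here is a hedged sketch of nightly batch scoring. The file paths, column names, and the `model_v1.pkl` artifact are illustrative placeholders, and the code assumes a binary classifier that exposes `predict_proba`.

```python
# batch_score.py -- run on a schedule (e.g., nightly via cron or an orchestrator)
import joblib
import pandas as pd

model = joblib.load("model_v1.pkl")                        # same artifact an API would serve

users = pd.read_parquet("data/users_snapshot.parquet")     # hypothetical nightly snapshot
features = users[["feature_1", "feature_2"]]

# Assumes a binary classifier; column 1 is the positive-class probability.
users["score"] = model.predict_proba(features)[:, 1]

# Downstream jobs (email campaigns, dashboards) read this output table.
users[["user_id", "score"]].to_parquet("output/scores.parquet")
```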

## 2. The Containerization Standard: Docker

In MLOps, we don't just deploy code; we deploy the **environment**. To avoid the "it works on my machine" problem, we use **Docker**.

A Docker container packages the model file, the Python runtime, and all dependencies (NumPy, Scikit-Learn, etc.) into a single image that runs identically on any server.

## 3. Deployment Strategies

Deploying a model isn't just about "overwriting" the old one. We use strategies to minimize risk.

* **Blue-Green Deployment:** You have two identical environments. You route traffic to "Green" (new model). If it fails, you instantly flip back to "Blue" (old model).
* **Canary Deployment:** You route 5% of traffic to the new model. If the metrics look good, you slowly increase it to 100% (see the sketch after this list).
* **A/B Testing:** You run two models simultaneously and compare their real-world performance (e.g., which one leads to more clicks?).
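
A minimal sketch of canary routing at the application layer; in practice this split is usually handled by the load balancer or service mesh rather than hand-rolled code, and the fraction here is just the 5% example from above.

```python
import random

CANARY_FRACTION = 0.05  # send 5% of traffic to the candidate model

def route_request(features, old_model, new_model):
    """Send a small, random slice of traffic to the new model; the rest stays on the old one."""
    model = new_model if random.random() < CANARY_FRACTION else old_model
    return model.predict([features])[0]
```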

## 4. Logical Workflow: The Deployment Pipeline

The following diagram illustrates the path from a trained model to a live API endpoint.

```mermaid
graph LR
Model[Trained Model File .pkl / .h5] --> Wrap[API Wrapper: Flask/FastAPI]
Wrap --> Docker[Docker Image]
Docker --> Registry[Container Registry]

subgraph Infrastructure [Production Environment]
Registry --> K8s[Kubernetes / Cloud Run]
K8s --> LoadBalancer[Load Balancer]
end

User((User)) --> LoadBalancer
LoadBalancer --> K8s

style Docker fill:#e1f5fe,stroke:#01579b,color:#333
style K8s fill:#fff3e0,stroke:#ef6c00,color:#333
style Model fill:#c8e6c9,stroke:#2e7d32,color:#333

```

## 5. Model Serving Frameworks

While you can write your own API using **FastAPI**, dedicated "Model Serving" tools handle scaling and versioning better:

1. **TensorFlow Serving:** Highly optimized for TF models.
2. **TorchServe:** The official serving library for PyTorch.
3. **KServe (formerly KFServing):** A serverless way to deploy models on Kubernetes.
4. **BentoML:** A framework that simplifies the packaging and deployment of any Python model.

## 6. Implementation Sketch (FastAPI + Uvicorn)

This is a minimal example of serving a Scikit-Learn model as a REST API.

```python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()

# 1. Load the pre-trained model
model = joblib.load("model_v1.pkl")

# 2. Define the input schema
class InputData(BaseModel):
    feature_1: float
    feature_2: float

# 3. Create the prediction endpoint
@app.post("/predict")
def predict(data: InputData):
    prediction = model.predict([[data.feature_1, data.feature_2]])
    return {"prediction": int(prediction[0])}

# Run with: uvicorn main:app --reload
```
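
Once the server is running, the endpoint can be exercised with a simple client call; the host and port below assume uvicorn's defaults.

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/predict",               # default uvicorn host/port
    json={"feature_1": 0.42, "feature_2": 1.37},   # must match the InputData schema
)
print(resp.json())  # e.g., {"prediction": 0}
```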

## 7. Post-Deployment: Monitoring

Once a model is live, its performance will likely decrease over time (**Model Drift**). We must monitor:

* **Latency:** How long does a prediction take?
* **Data Drift:** Is the incoming data statistically different from the training data? (see the sketch below)
* **Concept Drift:** Has the relationship between features and the target changed?
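
As a hedged illustration of a data drift check, a two-sample Kolmogorov-Smirnov test from SciPy can compare a feature's training distribution against recent production traffic. The synthetic data and the p-value threshold below are placeholders for your own feature and alerting policy.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # stand-in for training data
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # stand-in for recent requests

result = ks_2samp(train_feature, live_feature)

# A tiny p-value suggests the live distribution has shifted away from training.
if result.pvalue < 0.01:
    print(f"Possible data drift: KS statistic={result.statistic:.3f}, p={result.pvalue:.4f}")
```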

## References

* **Google Cloud:** [Practices for MLOps and CI/CD](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
* **FastAPI:** [Official Documentation](https://fastapi.tiangolo.com/)
* **MLOps.community:** [Deployment Patterns](https://mlops.community/)

---

**Deployment is just the beginning. How do we ensure our model stays accurate as the world changes?**