docs/machine-learning/advanced-ml-topics/mlops/ci-cd.mdx (123 additions, 0 deletions)
---
title: "CI/CD/CT: Automated Pipelines for ML"
sidebar_label: CI/CD for ML
description: "Exploring Continuous Integration, Continuous Delivery, and Continuous Training in MLOps."
tags: [mlops, cicd, continuous-training, automation, jenkins, github-actions]
---

In traditional software, we have **CI** (Continuous Integration) and **CD** (Continuous Delivery). However, Machine Learning introduces a third dimension: **Data**. Because data changes over time, we need a third pillar: **CT** (Continuous Training).

## 1. The Three Pillars of MLOps Automation

To build a robust ML system, we must automate three distinct cycles:

### Continuous Integration (CI)
Beyond testing code, ML CI involves testing **data schemas** and **models**.
* **Code Testing:** Unit tests for feature engineering logic.
* **Data Testing:** Validating that incoming data matches the expected schemas and distributions (a test sketch follows this list).
* **Model Validation:** Ensuring the model architecture compiles and training runs without memory leaks.
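
As a concrete (if simplified) illustration, a CI job might run pytest checks like the following. The column names, value ranges, and the `load_incoming_batch` helper are hypothetical placeholders for your own data-loading code, not a specific library's API.

```python
# test_data_validation.py -- illustrative CI checks; schema and paths are hypothetical
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "amount", "country"}  # assumed schema for this example

def load_incoming_batch() -> pd.DataFrame:
    # Placeholder: in a real pipeline this would read the latest ingested batch.
    return pd.read_csv("data/incoming_batch.csv")

def test_schema_matches():
    df = load_incoming_batch()
    assert set(df.columns) == EXPECTED_COLUMNS, "Schema drifted from the expected columns"

def test_no_missing_values_in_required_column():
    df = load_incoming_batch()
    assert df["amount"].notna().all(), "Missing values in a required column"

def test_value_ranges():
    df = load_incoming_batch()
    assert (df["amount"] >= 0).all(), "Negative amounts violate the expected range"
```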

### Continuous Delivery (CD)
This is the automation of deploying the model as a service.
* **Artifact Packaging:** Wrapping the model in a [Docker container](./model-deployment#2-the-containerization-standard-docker).
* **Integration Testing:** Ensuring the API endpoint responds correctly to requests.
* **Deployment:** Moving the model to a staging or production environment using [Canary or Blue-Green strategies](./model-deployment#3-deployment-strategies).

### Continuous Training (CT)
This is unique to ML. It is a property of an ML system that automatically retrains and serves the model based on new data or [Model Drift](./monitoring#1-why-models-decay).
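
A minimal sketch of the CT trigger logic, assuming a drift score produced by your monitoring system and a pipeline-launch callable supplied by your orchestrator (the names and thresholds here are illustrative, not a specific library's API):

```python
from typing import Callable

DRIFT_THRESHOLD = 0.2  # assumed threshold, tuned per project

def maybe_retrain(
    drift_score: float,
    new_rows: int,
    launch_pipeline: Callable[[], None],
    min_new_rows: int = 10_000,
) -> bool:
    """Trigger retraining when drift is high or enough new data has arrived."""
    if drift_score > DRIFT_THRESHOLD or new_rows >= min_new_rows:
        launch_pipeline()  # e.g., submit an Airflow DAG run or a Kubeflow pipeline here
        return True
    return False

if __name__ == "__main__":
    # Toy usage: in production, drift_score would come from the monitoring system.
    maybe_retrain(drift_score=0.35, new_rows=1_200,
                  launch_pipeline=lambda: print("retraining..."))
```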

## 2. The MLOps Maturity Levels

Google defines the evolution of CI/CD in ML through three levels of maturity:

* **Level 0 (Manual):** Every step (data prep, training, deployment) is done manually in notebooks.
* **Level 1 (Automated Training):** The pipeline is automated. Whenever new data arrives, training and validation happen automatically (CT).
* **Level 2 (CI/CD Pipeline Automation):** The entire workflow, from code commits to model monitoring, is a fully automated CI/CD pipeline.

## 3. The Automated Workflow

The following diagram illustrates how a code change or a "Drift" alert triggers a sequence of automated events.

```mermaid
graph TD
Code[Code Commit / Data Drift Alert] --> CI[CI: Build & Test]

subgraph Pipeline [Automated ML Pipeline]
CI --> Train[Continuous Training]
Train --> Eval[Model Evaluation]
Eval --> Validate{Meets Threshold?}
end

Validate -- No --> Fail[Alert Developer]
Validate -- Yes --> Register[Model Registry]

Register --> CD[CD: Deploy to Prod]
CD --> Monitor[Monitoring & Observability]
Monitor -- Drift Detected --> Code

style Pipeline fill:#f0f4ff,stroke:#5c7aff,stroke-width:2px,color:#333
style Validate fill:#fff3e0,stroke:#ef6c00,color:#333
style Register fill:#c8e6c9,stroke:#2e7d32,color:#333

```

## 4. Key Components of the Pipeline

* **Feature Store:** A centralized repository where features are stored and shared, ensuring that the same feature logic is used in both training and serving.
* **Model Registry:** A "version control" system for models. It stores trained models, their metadata (hyperparameters, accuracy), and their environment dependencies (see the MLflow sketch after this list).
* **Metadata Store:** Records every execution of the pipeline, allowing you to trace a specific model version back to the exact dataset and code used to create it.
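
As one hedged example of a model registry in practice, the MLflow API below logs a trained model with its metadata and registers it under a versioned name. The model name, metric, and SQLite backend are illustrative choices, and exact arguments can vary between MLflow versions.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A database-backed store is needed for registry features; SQLite works locally.
mlflow.set_tracking_uri("sqlite:///mlflow.db")

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.log_param("C", 1.0)                          # hyperparameters -> metadata
    mlflow.log_metric("accuracy", model.score(X, y))    # evaluation metric
    mlflow.sklearn.log_model(model, "model")            # the serialized model artifact

# Register this run's model under a versioned name ("churn-classifier" is a placeholder).
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-classifier")
```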

## 5. Tools of the Trade

Depending on your cloud provider, the tools for CI/CD/CT vary:

| Component | Open Source | AWS | Google Cloud |
| --- | --- | --- | --- |
| **Orchestration** | Kubeflow / Airflow | Step Functions | Vertex AI Pipelines |
| **CI/CD** | GitHub Actions / GitLab | CodePipeline | Cloud Build |
| **Tracking** | MLflow | SageMaker Experiments | Vertex AI Metadata |
| **Storage** | DVC (Data Version Control) | S3 | GCS |

## 6. Implementation: A GitHub Actions Snippet

Below is a simple CI job that checks whether the model's accuracy meets a threshold before allowing a push to production.

```yaml
name: Model Training CI
on: [push]

jobs:
  train-and-validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run Training & Evaluation
        run: python train.py  # Script generates 'metrics.json'

      - name: Check Accuracy Threshold
        run: |
          ACCURACY=$(jq '.accuracy' metrics.json)
          if (( $(echo "$ACCURACY < 0.85" | bc -l) )); then
            echo "Accuracy too low ($ACCURACY). Failing the build."
            exit 1
          fi
```
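
The workflow above assumes that `train.py` writes a `metrics.json` file. A minimal sketch of such a script might look like the following; the dataset and model choice are stand-ins for your own training pipeline.

```python
# train.py -- illustrative; the dataset and model are placeholders
import json
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Write the metric that the CI job reads with jq in the next step.
with open("metrics.json", "w") as f:
    json.dump({"accuracy": accuracy}, f)

print(f"Test accuracy: {accuracy:.3f}")
```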

## References

* **Google Cloud:** [MLOps: Continuous delivery and automation pipelines](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
* **ThoughtWorks:** [Continuous Delivery for Machine Learning (CD4ML)](https://martinfowler.com/articles/cd4ml.html)
* **MLflow:** [Introduction to Model Registry](https://www.mlflow.org/docs/latest/model-registry.html)

---

**With CI/CD/CT, your model is now a living, breathing part of your infrastructure. But how do we ensure it remains ethical and unbiased throughout these cycles?**
docs/machine-learning/advanced-ml-topics/mlops/data-versioning.mdx (110 additions, 0 deletions)
---
title: "Data Versioning: The Git for Data"
sidebar_label: Data Versioning
description: "Understanding how to track changes in datasets to ensure reproducibility and auditability in ML experiments."
tags: [mlops, data-versioning, dvc, reproducibility, data-lake]
---

In traditional software development, versioning code with **Git** is enough to recreate any state of an application. In Machine Learning, code is only half the story. The resulting model depends on both the **Code** and the **Data**.

If you retrain your model today and get different results than yesterday, you need to know exactly which version of the dataset was used. **Data Versioning** provides the "undo button" for your data.

## 1. Why Git Isn't Enough for Data

Git is designed to track small text files. It struggles with the large data and binary files (CSV, Parquet, images, audio) typically used in ML, for several reasons:

* **Storage Limits:** Storing gigabytes of data in a Git repository slows down operations significantly.
* **Diffing:** Git cannot efficiently show differences between two 5GB binary files.
* **Cost:** Hosting large blobs in GitHub or GitLab is expensive and inefficient.

**Data Versioning tools** solve this by tracking "pointers" (metadata) in Git, while storing the actual data in external storage (S3, GCS, Azure Blob).

## 2. The Core Concept: Metadata vs. Storage

Data versioning works by creating a **hash** (unique ID) of your data files.

1. **The Data:** Stored in a scalable cloud bucket (e.g., AWS S3).
2. **The Metafile:** A tiny text file containing the hash and file path. This file **is** committed to Git.

<br />

<img className="rounded" src="/tutorial/img/tutorials/ml/git-dvc-s3.png" alt="The relationship between Git (tracking .dvc files) and S3 (tracking large datasets)" />
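
To make the "pointer" idea concrete, here is a minimal sketch (not DVC's actual implementation) of hashing a data file and writing a tiny metafile that could be committed to Git while the data itself lives in object storage. The file paths mirror the example later on this page.

```python
import hashlib
import json
from pathlib import Path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_path = Path("data/train_images.zip")        # the large file, pushed to S3/GCS
pointer = {"md5": file_md5(data_path), "path": str(data_path)}

# The small metafile is what gets committed to Git.
Path("data/train_images.zip.meta.json").write_text(json.dumps(pointer, indent=2))
```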


## 3. Workflow Logic

The following diagram illustrates how DVC (Data Version Control) interacts with Git and remote storage to maintain synchronization.

```mermaid
graph TD
subgraph Local_Machine [Local Workspace]
Code[script.py] -- "git commit" --> Git[(Git Repo)]
Data[data.csv] -- "dvc add" --> Meta[.dvc file]
Meta -- "git commit" --> Git
end

subgraph Storage [Remote Storage]
Data -- "dvc push" --> Cloud[(S3 / GCS Bucket)]
end

subgraph Collaborator [Team Member]
Git -- "git pull" --> NewMeta[.dvc file]
NewMeta -- "dvc pull" --> NewData[data.csv]
Cloud -- download --> NewData
end

style Storage fill:#f1f8e9,stroke:#558b2f,color:#333
style Git fill:#e1f5fe,stroke:#01579b,color:#333
style Data fill:#fff3e0,stroke:#ef6c00,color:#333

```

## 4. Popular Data Versioning Tools

| Tool | Focus | Best For |
| --- | --- | --- |
| **DVC (Data Version Control)** | Open-source, Git-like CLI. | Teams already comfortable with Git. |
| **Pachyderm** | Data lineage and pipelining. | Complex data pipelines on Kubernetes. |
| **LakeFS** | Git-like branches for Data Lakes. | Teams using S3/GCS as their primary data source. |
| **W&B Artifacts** | Integrated with experiment tracking. | Visualizing data lineage alongside model training. |

## 5. Implementation with DVC

DVC is the most popular tool because it integrates seamlessly with your existing Git workflow.

```bash
# 1. Initialize DVC in your project
dvc init

# 2. Add a large dataset (this creates train_images.zip.dvc)
dvc add data/train_images.zip

# 3. Track the metadata in Git
git add data/train_images.zip.dvc .gitignore
git commit -m "Add raw training images version 1.0"

# 4. Push the actual data to a remote (S3, GCS, etc.)
dvc remote add -d myremote s3://my-bucket/data
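# (-d sets the default remote; the config is saved in .dvc/config, which is committed to Git)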
dvc push

# 5. Switching versions
git checkout v2.0-experiment
dvc checkout # This physically swaps the data files in your folder

```

## 6. The Benefits of Versioning Data

* **Reproducibility:** You can recreate the exact environment of a model trained 6 months ago.
* **Compliance & Auditing:** In regulated industries (finance/healthcare), you must be able to show exactly what data was used to train a model to explain its decisions.
* **Collaboration:** Multiple researchers can work on different versions of the data without overwriting each other's work.
* **Data Lineage:** Tracking the "ancestry" of a dataset—knowing that `clean_data.csv` was generated from `raw_data.csv` using `clean.py`.

## References

* **DVC Documentation:** [Get Started with DVC](https://dvc.org/doc/start)
* **LakeFS:** [Git for Data Lakes](https://lakefs.io/)

---

**Data versioning is the foundation of a reproducible pipeline. Now that we can track our data and code, how do we track the experiments and hyperparameter results?**
docs/machine-learning/advanced-ml-topics/mlops/model-deployment.mdx (112 additions, 0 deletions)
---
title: "Model Deployment: Moving from Lab to Production"
sidebar_label: Deployment
description: "Strategies for serving machine learning models, including batch vs. real-time, containerization, and deployment patterns."
tags: [mlops, deployment, docker, kubernetes, api, serving]
---

**Model Deployment** is the process of integrating a machine learning model into an existing production environment where it can take in data and return predictions. It is the final stage of the ML pipeline, but it is also the beginning of the model's "life" where it provides actual value.

## 1. Deployment Modes

Before choosing a tool, you must decide how the users will consume the predictions.

| Mode | Description | Example |
| :--- | :--- | :--- |
| **Request-Response (Real-time)** | The model lives behind an API. Predictions are returned instantly (low latency). | **Fraud Detection** during a credit card swipe. |
| **Batch Scoring** | The model runs on a large set of data at scheduled intervals (e.g., every night). | **Recommendation Emails** sent to users once a day. |
| **Streaming** | The model consumes data from a queue (like Kafka) and outputs predictions continuously. | **Log Monitoring** for cybersecurity threats. |
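
To contrast with the real-time API shown later on this page, here is a hedged sketch of nightly batch scoring. The file paths, column names, and the `model_v1.pkl` artifact are illustrative placeholders, and the code assumes a binary classifier that exposes `predict_proba`.

```python
# batch_score.py -- run on a schedule (e.g., nightly via cron or an orchestrator)
import joblib
import pandas as pd

model = joblib.load("model_v1.pkl")                        # same artifact an API would serve

users = pd.read_parquet("data/users_snapshot.parquet")     # hypothetical nightly snapshot
features = users[["feature_1", "feature_2"]]

# Assumes a binary classifier; column 1 is the positive-class probability.
users["score"] = model.predict_proba(features)[:, 1]

# Downstream jobs (email campaigns, dashboards) read this output table.
users[["user_id", "score"]].to_parquet("output/scores.parquet")
```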

## 2. The Containerization Standard: Docker

In MLOps, we don't just deploy code; we deploy the **environment**. To avoid the "it works on my machine" problem, we use **Docker**.

A Docker container packages the model file, the Python runtime, and all dependencies (NumPy, Scikit-Learn, etc.) into a single image that runs identically on any server.

## 3. Deployment Strategies

Deploying a model isn't just about "overwriting" the old one. We use strategies to minimize risk.

* **Blue-Green Deployment:** You have two identical environments. You route traffic to "Green" (new model). If it fails, you instantly flip back to "Blue" (old model).
* **Canary Deployment:** You route 5% of traffic to the new model. If the metrics look good, you slowly increase it to 100% (see the sketch after this list).
* **A/B Testing:** You run two models simultaneously and compare their real-world performance (e.g., which one leads to more clicks?).
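
A minimal sketch of canary routing at the application layer; in practice this split is usually handled by the load balancer or service mesh rather than hand-rolled code, and the fraction here is just the 5% example from above.

```python
import random

CANARY_FRACTION = 0.05  # send 5% of traffic to the candidate model

def route_request(features, old_model, new_model):
    """Send a small, random slice of traffic to the new model; the rest stays on the old one."""
    model = new_model if random.random() < CANARY_FRACTION else old_model
    return model.predict([features])[0]
```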

## 4. Logical Workflow: The Deployment Pipeline

The following diagram illustrates the path from a trained model to a live API endpoint.

```mermaid
graph LR
Model[Trained Model File .pkl / .h5] --> Wrap[API Wrapper: Flask/FastAPI]
Wrap --> Docker[Docker Image]
Docker --> Registry[Container Registry]

subgraph Infrastructure [Production Environment]
Registry --> K8s[Kubernetes / Cloud Run]
K8s --> LoadBalancer[Load Balancer]
end

User((User)) --> LoadBalancer
LoadBalancer --> K8s

style Docker fill:#e1f5fe,stroke:#01579b,color:#333
style K8s fill:#fff3e0,stroke:#ef6c00,color:#333
style Model fill:#c8e6c9,stroke:#2e7d32,color:#333

```

## 5. Model Serving Frameworks

While you can write your own API using **FastAPI**, dedicated "Model Serving" tools handle scaling and versioning better:

1. **TensorFlow Serving:** Highly optimized for TF models.
2. **TorchServe:** The official serving library for PyTorch.
3. **KServe (formerly KFServing):** A serverless way to deploy models on Kubernetes.
4. **BentoML:** A framework that simplifies the packaging and deployment of any Python model.

## 6. Implementation Sketch (FastAPI + Uvicorn)

This is a minimal example of serving a Scikit-Learn model as a REST API.

```python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()

# 1. Load the pre-trained model
model = joblib.load("model_v1.pkl")

# 2. Define the input schema
class InputData(BaseModel):
    feature_1: float
    feature_2: float

# 3. Create the prediction endpoint
@app.post("/predict")
def predict(data: InputData):
    prediction = model.predict([[data.feature_1, data.feature_2]])
    return {"prediction": int(prediction[0])}

# Run with: uvicorn main:app --reload
```
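
Once the server is running, the endpoint can be exercised with a simple client call; the host and port below assume uvicorn's defaults.

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/predict",               # default uvicorn host/port
    json={"feature_1": 0.42, "feature_2": 1.37},   # must match the InputData schema
)
print(resp.json())  # e.g., {"prediction": 0}
```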

## 7. Post-Deployment: Monitoring

Once a model is live, its performance will likely decrease over time (**Model Drift**). We must monitor:

* **Latency:** How long does a prediction take?
* **Data Drift:** Is the incoming data statistically different from the training data? (see the sketch below)
* **Concept Drift:** Has the relationship between features and the target changed?
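
As a hedged illustration of a data drift check, a two-sample Kolmogorov-Smirnov test from SciPy can compare a feature's training distribution against recent production traffic. The synthetic data and the p-value threshold below are placeholders for your own feature and alerting policy.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # stand-in for training data
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # stand-in for recent requests

result = ks_2samp(train_feature, live_feature)

# A tiny p-value suggests the live distribution has shifted away from training.
if result.pvalue < 0.01:
    print(f"Possible data drift: KS statistic={result.statistic:.3f}, p={result.pvalue:.4f}")
```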

## References

* **Google Cloud:** [Practices for MLOps and CI/CD](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
* **FastAPI:** [Official Documentation](https://fastapi.tiangolo.com/)
* **MLOps.community:** [Deployment Patterns](https://mlops.community/)

---

**Deployment is just the beginning. How do we ensure our model stays accurate as the world changes?**