
Machine Learning Operations Playbook Adoption Workshop – Phase 1: Core Architecture - Hands-On Workshop


Module Learning Objectives

By the end of this workshop, participants will be able to:

  • Compare and contrast AWS and Google Cloud global infrastructure architectures
  • Explore high-availability topologies across both platforms
  • Map AWS services to Google Cloud service equivalents for AI/ML pipeline workloads

Prerequisites

  • Completion of Module 1: Cost Optimization
  • AWS Management Console access with infrastructure permissions
  • Google Cloud Console access with project access rights
  • Billing account configured

Workshop Overview

This hands-on workshop builds upon the cost management foundation from Module 1 to establish the technical architecture knowledge required for successful AWS to Google Cloud ML migrations. Using console interfaces and CloudShell, participants will gain practical experience with infrastructure services, networking, and IAM configurations across both platforms.


Module 2: AWS Global Infrastructure and Core Resources

🧪 Lab 2.1: AWS Regions and Availability Zones Architecture Deep Dive

Duration: 45 minutes
Objective: Explore AWS global infrastructure and availability zone design using CLI and console-based inspection—without finalizing resource creation.


1. Prerequisites

  • AWS Management Console access with EC2 and CloudShell permissions

  • AWS CLI available via CloudShell or local environment

  • Familiarity with basic AWS terminology (Region, AZ, CLI)

  • No EC2 instance creation required


2. Theory Overview

  • AWS infrastructure is organized into regions and availability zones

  • Each region is a geographically isolated location with multiple AZs

  • Availability Zones are independent failure domains within a region

  • Opt-in regions must be manually enabled before use

  • High availability strategies use multiple AZs to ensure fault tolerance


3. Hands-On Exploration Steps (Do Not Finalize Resources)

10. Access AWS Console

  • Navigate to AWS Console

  • Launch CloudShell from the top navigation bar

11. Explore Available Regions

  • Run: aws ec2 describe-regions --output table

  • Run: aws ec2 describe-regions --query 'Regions[*].[RegionName,OptInStatus]' --output table

  • Identify which regions require opt-in

12. Explore Availability Zones

  • Run: aws ec2 describe-availability-zones --output table

  • Run: aws ec2 describe-availability-zones --region us-east-1 --output table

  • Observe zone names and states

13. Inspect Region Selector in EC2 Console

  • Navigate to EC2 > Instances > Launch Instance

  • Use the region dropdown to compare AZ counts

  • Cancel before launching any instance

14. Sketch Region-to-Zone Mapping

  • Identify 3 regions and list their AZs

  • Note differences in zone naming and availability
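The Deliverables below ask for a table of regions and their opt-in status. As an optional sketch, a few lines of Python can pull the opt-in regions out of saved CLI output (assuming you exported it with `aws ec2 describe-regions --output json > regions.json`; the sample data below is illustrative, not live output):

```python
import json

# Illustrative stand-in for regions.json; real data comes from:
#   aws ec2 describe-regions --output json > regions.json
sample = {
    "Regions": [
        {"RegionName": "us-east-1", "OptInStatus": "opt-in-not-required"},
        {"RegionName": "eu-west-1", "OptInStatus": "opt-in-not-required"},
        {"RegionName": "af-south-1", "OptInStatus": "not-opted-in"},
    ]
}

def opt_in_required(doc):
    """Return region names that must be manually enabled before use."""
    return [r["RegionName"] for r in doc["Regions"]
            if r["OptInStatus"] == "not-opted-in"]

print(opt_in_required(sample))  # ['af-south-1']
```

To run it against real output, replace `sample` with `json.load(open("regions.json"))`.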


4. Deliverables

  • Table of AWS Regions and Opt-In status

  • List of AZs for us-east-1 and two other regions

  • Notes on regional design considerations and zone distribution


5. Supplemental Materials

  • Runbook: runbooks/aws-region-az-exploration.md

  • Playbook: playbooks/aws-ha-topology-strategy.md


6. Notes and Warnings

  • Do not launch EC2 instances or other resources during this lab

  • AZ names (e.g., us-east-1a) are account-specific and may vary

  • Opt-in regions may require manual activation before use


7. Verification Source


🧪 Lab 2.2: AWS Edge Locations and CloudFront Global Network Exploration

Duration: 45 minutes
Objective: Explore AWS’s global content delivery infrastructure using CloudFront and edge location metadata—without deploying distributions or modifying resources.


1. Prerequisites

  • AWS Management Console access with CloudFront and CloudShell permissions

  • AWS CLI available via CloudShell or local environment

  • Basic understanding of CDN concepts (edge location, origin, cache)

  • No CloudFront distribution creation required


2. Theory Overview

  • AWS CloudFront is a content delivery network (CDN) that uses a global network of edge locations

  • Edge locations cache content closer to users to reduce latency

  • Regional edge caches act as mid-tier caches between origin and edge locations

  • CloudFront integrates with other AWS services like S3, EC2, and Lambda@Edge

  • Edge locations are distributed across major cities and regions worldwide
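The tiered caching described above can be pictured with a toy model (a simplification of our own, not AWS code): a request first checks the edge cache, then the regional edge cache, and only falls back to the origin on a double miss, populating both cache tiers on the way back.

```python
def fetch(path, edge, regional, origin):
    """Toy two-tier CDN lookup: returns which tier served the request."""
    if path in edge:
        return "edge"
    if path in regional:
        # promote to the edge cache so the next nearby request is faster
        edge[path] = regional[path]
        return "regional"
    # double miss: fetch from the origin and populate both cache tiers
    edge[path] = regional[path] = origin[path]
    return "origin"

edge, regional, origin = {}, {}, {"/model.bin": b"weights"}
print(fetch("/model.bin", edge, regional, origin))  # origin
print(fetch("/model.bin", edge, regional, origin))  # edge
```

Real CloudFront behavior additionally depends on TTLs, cache policies, and object popularity, but the tier ordering is the same.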


3. Hands-On Exploration Steps (Do Not Finalize Resources)

10. Access AWS Console

  • Navigate to AWS Console

  • Launch CloudShell from the top navigation bar

11. Explore Edge Location Metadata

  • Run: aws cloudfront list-distributions --output json

  • If no distributions exist, proceed to inspect global infrastructure

  • Run: aws cloudfront get-distribution-config --id <distribution-id> (only if read-only distributions exist)

12. Review Global Edge Network

  • Navigate to CloudFront > Locations in AWS Console

  • Observe map of edge locations and regional edge caches

  • Note geographic distribution and latency zones

13. Inspect CloudFront Console

  • Navigate to CloudFront > Distributions

  • Select any existing distribution (if available)

  • Review origin settings, cache behaviors, and edge associations

  • Do not create or modify any distributions

14. Compare Edge Location Coverage


4. Deliverables

  • List of global edge locations and regional edge caches

  • Summary of CloudFront distribution architecture (if read-only access available)

  • Notes on geographic distribution and latency optimization strategy


5. Supplemental Materials

  • Runbook: runbooks/aws-cloudfront-edge-location-inspection.md

  • Playbook: playbooks/aws-global-cdn-strategy.md


6. Notes and Warnings

  • Do not create or modify CloudFront distributions during this lab

  • Edge location availability may vary by region and account

  • CLI output may be empty if no distributions exist—this is expected


7. Verification Source


🧪 Lab 2.3: Google Cloud Regions and Zones Architecture Analysis

Duration: 45 minutes
Objective: Explore Google Cloud’s global infrastructure, focusing on regions, zones, and service availability—without deploying any resources.


1. Prerequisites

  • Google Cloud Console access with project-level permissions

  • Cloud Shell enabled

  • Basic understanding of cloud infrastructure concepts

  • Ensure the Compute Engine API is enabled for your project

  • No VM instance creation required


2. Theory Overview

  • Google Cloud has over 40 regions and 100+ zones globally

  • Each region is a geographic location containing multiple isolated zones

  • Zones are independent failure domains connected via Google’s private high-speed network

  • Most regions contain three or more zones housed in separate physical facilities

  • Service availability may vary by region and zone


3. Hands-On Exploration Steps (Do Not Finalize Resources)

11. Access Google Cloud Console

  • Navigate to the Google Cloud Console welcome page

12. Activate Cloud Shell or Use Cloud Workstations

  • Enter Workstations in the search bar.


  • Select Create workstation

  • Enter a unique display name

  • In the configuration field drop-down, select test-configuration

  • Select Create. Note: Creation may take several minutes to complete.

  • Select Start, located in the All workstations section, below the Quick actions column. Note: Startup may take several minutes to complete.


  • Select Launch. In the new workstation window, select the menu icon to access options, then select Terminal.


  • Review the terminal area.

13. Explore Regions and Zones via CLI

  • Run: gcloud auth login

  • Select the clickable link, then select Open; a new browser session will start. Follow the prompts in the new session to sign in and obtain a verification code.


  • Select Continue

  • Follow the prompts and provide username or password if required.

  • Select Copy. Note: The credential is a verification code.


  • Paste the verification code into the terminal

  • Run: gcloud config set project mfav2-374520

  • Run: gcloud compute regions list --format="table(name,status,zones.len():label=ZONES)"

  • Run: gcloud compute regions describe us-east1

  • Run: gcloud compute zones list --format="table(name,region,status)"

  • Run: gcloud compute zones list --filter="region:us-east1" --format="table(name,status)"

14. Inspect Region-Zone Mapping via Console

  • Navigate to Compute Engine > VM instances > Create Instance

  • Use the Region dropdown to view available zones

  • Cancel before deploying any instance

15. Check Service Availability

  • Run: gcloud ai models list --region=us-east1 2>/dev/null || echo "Vertex AI not available in this region"

  • Run: gcloud compute machine-types list --zones=us-east1-a --filter="name:n1-standard"
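For the region/zone availability matrix deliverable, note that zone names follow the `<region>-<letter>` pattern, so they can be grouped mechanically. A small sketch (the sample names are hardcoded for illustration; in practice feed it the output of `gcloud compute zones list --format="value(name)"`):

```python
from collections import defaultdict

def zones_by_region(zone_names):
    """Group zone names like 'us-east1-b' under their region ('us-east1')."""
    grouped = defaultdict(list)
    for zone in zone_names:
        region = zone.rsplit("-", 1)[0]  # strip the trailing zone letter
        grouped[region].append(zone)
    return dict(grouped)

sample = ["us-east1-b", "us-east1-c", "us-east1-d", "us-central1-a"]
print(zones_by_region(sample))
```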


4. Deliverables

  • Region and zone availability matrix

  • Notes on service availability for Vertex AI and machine types

  • Observations on zone distribution and naming conventions


5. Supplemental Materials

  • Runbook: runbooks/gcp-region-zone-exploration.md

  • Playbook: playbooks/gcp-multi-zone-deployment-strategy.md


6. Notes and Warnings

  • Do not finalize VM creation during this lab

  • Zone names (e.g., us-central1-a) may vary by region and project

  • Some services are region-specific—verify availability before planning deployments


7. Verification Source


🧪 Lab 2.4: Google Cloud Edge Network and Cloud CDN Exploration

Duration: 45 minutes
Objective: Explore Google Cloud’s edge infrastructure and understand how Cloud CDN accelerates content delivery using globally distributed edge locations—without finalizing resource creation.


1. Prerequisites

  • Google Cloud Console access with project-level permissions

  • Cloud Shell enabled or Cloud Workstations created and launched

  • Basic understanding of networking and CDN principles

  • Ensure the Cloud CDN API is enabled for your project

  • No load balancer or backend service creation required


2. Theory Overview

  • Google Cloud’s edge network includes over 200 edge locations globally

  • Edge PoPs (Points of Presence) cache and serve content closer to users

  • Cloud CDN integrates with HTTP(S) Load Balancing to deliver content via edge caches

  • CDN reduces latency, offloads origin servers, and improves user experience


3. Hands-On Exploration Steps (Do Not Finalize Resources)

10. Access Google Cloud Console

11. Activate Cloud Shell or Utilize Cloud Workstations

  • Note: If you cannot launch Cloud Shell or encounter warnings or errors, use Cloud Workstations instead.

  • Review the instructions for Lab 2.3, tasks 12 and 13. Note: If you did not delete the Cloud Workstation, select Start on the stopped workstation and then select Launch.

12. Review Edge Location Coverage

13. Navigate to Cloud CDN

  • Search for Cloud CDN

  • Click Add origin

  • In the Define your backend bucket section, select Browse in the Cloud Storage bucket field.

  • In the Select bucket side panel, select groundeddiabetes, then choose Select.

  • In the Origin name field, enter a unique name.

  • Select Next

  • Select the Create new load balancer for me button

  • Enter a unique name in the Load Balancer name field.

  • Select Next

  • Review Cache Performance Basic options

  • Review cache key and TTL settings

  • Cancel configuration before saving

14. Inspect Existing CDN-Enabled Backends (if available)

  • Run: gcloud compute backend-services list --filter="cdnPolicy.enable:true" --format="table(name,protocol,cdnPolicy.cacheMode)"

4. Deliverables

  • Notes on CDN activation and configuration

  • Observations on cache behavior and latency

  • Summary of edge location coverage and performance benefits

5. Supplemental Materials

  • Runbook: runbooks/gcp-cloud-cdn-exploration.md

  • Playbook: playbooks/gcp-edge-network-strategy.md

6. Notes and Warnings

  • Do not finalize load balancer or backend service creation during this lab

  • Cloud CDN only works with HTTP(S) load balancers

  • Edge locations are managed by Google and not directly configurable

  • Cache behavior may vary based on content type and headers

7. Verification Source

🧪 Lab 2.5: EC2 Instance Types and Sizing for ML Workloads

Objective: Explore EC2 instance families optimized for machine learning workloads, focusing on sizing strategies, accelerator options, and pricing—without launching any instances.


1. Prerequisites

  • AWS Console access with read-only permissions

  • Familiarity with EC2 concepts and ML workload characteristics

  • No EC2 instance launch or billing-incurring actions required

  • AWS CLI installed in CloudShell or local environment


2. Theory Overview

  • ML workloads vary in compute, memory, and accelerator needs

  • EC2 instance families include general purpose, compute optimized, memory optimized, and accelerated computing

  • GPU-based instances (e.g., p4, g5) are ideal for training deep learning models

  • Inferentia-based instances (e.g., inf1) are optimized for inference

  • Sizing depends on model complexity, batch size, and training duration


3. Hands-On Exploration Steps (Do Not Launch Instances)

10. Access AWS Console

  • Navigate to AWS EC2 Dashboard

  • Select Instances > Launch Instance

  • Select the area displaying an instance type. For example: t3.micro

  • In the search field, enter g5. Review the instance type details.

  • Do not launch an instance. Cancel before finalizing any configuration

11. Explore Instance Types

  • Search for EC2 if you cannot navigate back to the EC2 Dashboard

  • Select Instance Types

  • Select Instance type finder

  • Select the drop-down menu in the Workload type field, and then select Machine Learning.

  • Leave the rest of the fields at their default settings.

  • Select Get instance type advice

  • Review the Additional information

12. Use AWS CLI to List ML-Optimized Instances

  • Run:
    aws ec2 describe-instance-types \
      --query 'InstanceTypes[?GpuInfo != null].{InstanceType:InstanceType, GPU:GpuInfo.Gpus[0].Name, Count:GpuInfo.Gpus[0].Count}' \
      --output table
13. (Optional) Review Pricing and Regional Availability

  • Visit AWS Pricing Calculator

  • Compare hourly costs for GPU vs CPU instances

  • Note availability zones for ML-optimized types
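When comparing GPU vs CPU pricing, remember the comparison is per training run, not per hour: a pricier GPU instance that finishes much faster can still cost less overall. A back-of-the-envelope sketch (the hourly rates and runtimes below are made-up placeholders, not AWS prices—look up real rates in the AWS Pricing Calculator):

```python
def run_cost(hourly_rate_usd, hours):
    """Total cost of one training run at a given hourly rate."""
    return hourly_rate_usd * hours

# Placeholder figures for illustration only.
gpu = run_cost(hourly_rate_usd=1.20, hours=5)   # GPU instance, fast run
cpu = run_cost(hourly_rate_usd=0.70, hours=20)  # CPU instance, same job
print(f"GPU run: ${gpu:.2f}, CPU run: ${cpu:.2f}")
# Here the GPU run is cheaper overall despite its higher hourly rate.
```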

14. Examine Instance Limits

  • Go to Limits tab in EC2 Dashboard

  • Check quotas for GPU instances in your region

4. Deliverables

  • Summary of instance types suitable for ML training vs inference

  • Notes on pricing, GPU accelerator support, and regional availability

  • CLI output showing GPU-enabled instance types

5. Supplemental Materials

  • Runbook: runbooks/aws-ec2-ml-sizing.md

  • Playbook: playbooks/aws-ml-instance-selection.md

6. Notes and Warnings

  • Do not launch EC2 instances during this lab

  • GPU instances may have limited availability in some regions

  • Pricing varies significantly based on GPU accelerator type and tenancy

  • Always validate instance compatibility with ML frameworks (e.g., TensorFlow, PyTorch)

7. Verification Source

Verified against AWS EC2 Instance Types and GPU Accelerated Computing Guidance

🧪 Lab 2.6: S3 Storage Classes and Lifecycle Management

Duration: 30 minutes
Objective: Explore Amazon S3 storage tiers and lifecycle rule configurations to optimize cost for ML datasets—without finalizing resource creation.


1. Prerequisites

  • AWS Management Console access with S3 permissions

  • CloudShell or AWS CLI enabled

  • Sample bucket available or simulated for exploration

  • Familiarity with object storage concepts and cost optimization strategies

  • No lifecycle rule creation or bucket modification required


2. Theory Overview

  • Amazon S3 offers multiple storage classes tailored to access patterns and cost

  • Lifecycle rules automate transitions and deletions based on object age or tags

  • Storage Class Comparison:

| Storage Class | Use Case | Durability | Availability | Retrieval Time | Cost (per GB/month) |
| --- | --- | --- | --- | --- | --- |
| Standard | Frequent access | 99.999999999% | 99.99% | Immediate | High |
| Intelligent-Tiering | Unknown/changing access patterns | 99.999999999% | 99.9–99.99% | Immediate | Variable |
| Standard-IA | Infrequent access | 99.999999999% | 99.9% | Immediate | Lower |
| One Zone-IA | Infrequent, single AZ | 99.999999999% | 99.5% | Immediate | Lowest IA |
| Glacier | Archival, occasional retrieval | 99.999999999% | N/A | Minutes–hours | Very low |
| Glacier Deep Archive | Long-term archival | 99.999999999% | N/A | Hours | Lowest |

3. Hands-On Exploration Steps (Do Not Finalize Resources)

9. Access S3 Console

  • Navigate to S3 Buckets

  • Select bucket (e.g., theiiadiabetesaws)

10. Open the Management Tab

  • Click the bucket name

  • Select the Management tab and review Lifecycle rules

  • Click Create lifecycle rule

11. Explore Lifecycle Rule Configuration

  • Enter Rule name: ml-lifecycle-demo

  • Filter: scope the rule with an object prefix (for example: raw/)

  • Transitions:

    • Select Transition current versions of objects between storage classes

    • Note: Select I acknowledge that this lifecycle rule will incur a transition cost per request

    • Under Choose storage class transitions, select Standard-IA after 30 days

    • Select Add transition

    • Select Glacier Flexible Retrieval after 90 days

  • Cancel before saving
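Had the rule been saved, the equivalent configuration applied programmatically would look roughly like this—a sketch using boto3's `put_bucket_lifecycle_configuration` payload shape, with the bucket name left as a placeholder and the API call commented out so nothing is created:

```python
# The lifecycle rule explored above, expressed as the payload that
# boto3's put_bucket_lifecycle_configuration expects.
# Sketch only -- do not apply it during this lab.
lifecycle = {
    "Rules": [
        {
            "ID": "ml-lifecycle-demo",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="[YOUR_BUCKET_NAME]", LifecycleConfiguration=lifecycle)
```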

12. (Optional) Review Pricing

  • Visit S3 Pricing

  • Compare cost for storing data over 12 months across classes
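A 12-month comparison reduces to GB × price-per-GB-month × 12, storage cost only (transitions, retrievals, and requests add fees on top). The per-GB prices below are rough illustrative figures, not current AWS pricing—check the S3 pricing page for your region:

```python
# Rough illustrative per-GB-month prices (not authoritative).
PRICE_PER_GB_MONTH = {
    "Standard": 0.023,
    "Standard-IA": 0.0125,
    "Glacier Flexible Retrieval": 0.0036,
    "Glacier Deep Archive": 0.00099,
}

def annual_storage_cost(gb, storage_class):
    """Storage cost only -- ignores retrieval, transition, and request fees."""
    return gb * PRICE_PER_GB_MONTH[storage_class] * 12

for cls in PRICE_PER_GB_MONTH:
    print(f"{cls:>26}: ${annual_storage_cost(500, cls):.2f} for 500 GB/year")
```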

13. Inspect Existing Lifecycle Rules (Optional)

  • Run:

    aws s3api get-bucket-lifecycle-configuration --bucket [YOUR_BUCKET_NAME]

  • Note: This command is expected to fail, because you were instructed to cancel before saving your lifecycle rule.

4. Deliverables

  • Summary of lifecycle rule configuration explored

  • Pricing comparison table for storage classes

  • CLI output of existing lifecycle rules (if applicable)

5. Supplemental Materials

  • Runbook: runbooks/aws-s3-lifecycle-exploration.md

  • Playbook: playbooks/aws-storage-optimization-strategy.md

6. Notes and Warnings

  • Do not finalize lifecycle rule creation during this lab

  • Lifecycle transitions may incur retrieval fees—review pricing carefully

  • Use tags and prefixes to scope rules narrowly in production environments

7. Verification Source

  • Verified against Amazon S3 Lifecycle Configuration Documentation

🧪 Lab 2.7: Compute Engine Machine Types and Custom Configurations

Objective: Explore Google Cloud Compute Engine machine families, sizing strategies, and custom VM configurations for ML workloads—without finalizing resource creation.


1. Prerequisites

  • Google Cloud Console access with Compute Engine permissions
  • Cloud Shell enabled
  • Familiarity with VM sizing and ML workload characteristics
  • Billing enabled (for simulation only—no resource creation)
  • No VM instance creation required

2. Theory Overview

Machine Family Categories

  • Compute Engine offers predefined and custom machine types across multiple families
  • General-purpose families (E2, N2, N2D, N4, C3, C3D, C4, C4A, C4D) support custom configurations
  • Accelerator-optimized families (A3, A2, G2) provide fixed GPU configurations
  • Memory-optimized families (M1, M2, M3, M4) for memory-intensive workloads
  • Compute-optimized families (C2, C2D, H3) for compute-intensive tasks
  • Cost-optimized families (T2A, T2D, E2) for budget-conscious workloads

Key Features

  • Custom machine types allow fine-grained control over vCPU and memory
  • Extended memory is available for N4, N2, N2D, and N1 series (not E2 or G2)
  • Default Vertex AI configuration: e2-standard-4 (4 vCPUs, 16GB memory)
  • GPU availability varies by zone and requires specific machine series

Current GPU Options (2025)

  • NVIDIA L4: Cost-effective inference, video processing (G2 series)
  • NVIDIA A100: High-performance training and inference (A2 series)
  • NVIDIA H100/H200: Latest generation for demanding AI workloads (A3 series)
  • Legacy GPUs: T4, V100, P4, P100 (N1 series only)

3. Hands-On Exploration Steps (Do Not Finalize Resources)


  • Use Cloud Workstations

  • Enter Workstations in the search bar.


  • Select Create workstation

  • Enter a unique ID (e.g., your first name plus a few extra characters)

  • Enter a unique display name

  • In the configuration field drop-down, select test-configuration

  • Select Create. Note: Creation may take several minutes to complete.

  • Select Start, located in the All workstations section, below the Quick actions column. Note: Startup may take several minutes to complete.


  • Select Launch. In the new workstation window, select the menu icon to access options, then select Terminal.


  • Review the terminal area.

  • Run: gcloud auth login

  • Select the clickable link, then select Open; a new browser session will start. Follow the prompts in the new session to sign in and obtain a verification code.


  • Select Continue

  • Follow the prompts and provide username or password if required.

  • Select Copy. Note: The credential is a verification code.


  • Paste the verification code into the terminal

  • Run: gcloud config set project mfav2-374520

10. Access Compute Engine Console

11. Explore Machine Type Options

  • Under Machine configuration, review current series:

    • General-purpose: E2, N2, N2D, N4, C3, C3D, C4, etc.

    • Below Machine types for common workloads, optimized for cost and flexibility, locate the Machine type section

    • In the Machine type area review Preset types: shared-core, standard, highmem, highcpu

    • Click Customize to manually set vCPU and memory

    • Note pricing differences between series including Preset or Custom by reviewing Monthly estimate

  • Note: Logging, monitoring (metrics), and snapshots incur additional, varying costs.

12. Simulate ML Workload Sizing

Example Configuration 1: Training Workload

  • vCPU: 16
  • Memory: 128 GB
  • Family: N4 (latest general-purpose)
  • Observe estimated monthly cost

Challenge Activity Question:

Do you recommend using Preset to support your machine learning operation training workload system requirements? Yes or No. Please explain your answer using the Teams chat.

Example Configuration 2: Inference Workload

  • vCPU: 8
  • Memory: 32 GB
  • Family: E2 (cost-optimized)
  • Compare pricing with N4 equivalent

Challenge Activity Question:

What is the cost difference? Please explain your answer using the Teams chat.

13. Review GPU Options

A3 Series (Latest - 2025)

  • Under Machine configuration, select GPUs
  • Review: NVIDIA H100/H200 GPUs (fixed configurations)
  • Review the Machine type section drop-down options (e.g., a3-highgpu)
  • Observe Monthly estimate.

G2 Series (Cost-Effective)

  • Under Machine configuration, select GPUs
  • Review: NVIDIA L4 GPUs (1, 2, 4, or 8 GPUs)
  • Review Preset and Custom vCPU and memory range options.

A2 Series (Established)

  • Under Machine configuration, select GPUs
  • Review NVIDIA A100 GPUs
  • Make note of the Machine types. (e.g. a2-highgpu-1g)
  • Note: Fixed configurations (1, 2, 4, 8, or 16 GPUs)

Challenge Activity Question:

What is the machine type name for NVIDIA A100 GPUs with 1 GPU and 85 GB memory, and what is the name if you need 16 GPUs? Please explain your answer using the Teams chat.

  • Note: Machine type names differ between the console and the CLI/SDK. For example, the CLI expresses the accelerator as --accelerator=count=16,type=nvidia-tesla-a100. Knowing the console machine type name helps you locate the equivalent SDK syntax.
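Since A2 machine type names encode the GPU count (a2-highgpu-1g has one A100, a2-megagpu-16g has sixteen), a tiny helper of our own can pull the count out of the name—handy when translating console names into CLI accelerator flags:

```python
import re

def a2_gpu_count(machine_type):
    """Extract the GPU count encoded in an A2 machine type name.

    e.g. 'a2-highgpu-8g' -> 8, 'a2-megagpu-16g' -> 16
    """
    match = re.fullmatch(r"a2-\w+-(\d+)g", machine_type)
    if match is None:
        raise ValueError(f"not an A2 machine type: {machine_type}")
    return int(match.group(1))

print(a2_gpu_count("a2-highgpu-1g"))   # 1
print(a2_gpu_count("a2-megagpu-16g"))  # 16
```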

14. Explore Extended Memory Options

  • Select General purpose
  • Select N2 series
  • Under Machine type, select Custom
  • Under Cores and Memory, select Extend Memory
  • Note the significant increase in the available memory range (GB)

15. Check Regional GPU Availability

  • Select GPUs
  • In Machine configuration, change Region and Zone
  • Note how GPU availability changes by location
  • Example: Compare us-central1-a vs us-east1-b

16. Inspect via CLI (Optional)

List Available Machine Types

gcloud compute machine-types list --zones=us-central1-a --format="table(name,guestCpus,memoryMb)"

List GPU Types by Zone

gcloud compute accelerator-types list --filter="zone:us-central1"

Check Quotas

gcloud compute regions describe us-central1 --format="table(quotas.metric,quotas.limit,quotas.usage)" --flatten="quotas[]"

(Optional) Vertex AI Pipeline Machine Type

  • Presenter will demonstrate

4. Vertex AI Integration Notes

Automated CI/CD pipeline training and deployment job components (e.g., preprocess_data_op, train_model_op, evaluate_model_op, model_approved_op, register_model_op, deployment_model_op)

  • Serverless (Most flexible and supports varying use cases)
  • Default machine type: e2-standard-4 (4 vCPUs, 16GB memory)
  • CustomJob supports varying vCPUs and memory ranges
  • GPU support requires A2, N1, or G2 machine types

MLOps Levels of Optimization

Vertex AI Workbench (Less Flexible Server)

  • Supports configurable machine types for notebooks
  • Both CPU-only and GPU-enabled instances available
  • Automatic GPU driver installation option

Best Practices

  • Start with E2 for cost optimization and Pipelines for automation
  • Use N4 for balanced performance
  • Choose G2 for cost-effective AI inference
  • Select A3 for demanding training workloads
  • Use Vertex AI Workbench for exploratory data analysis during short-term experimentation, including sampling. Note: Do not run Vertex AI Workbench long term or persist disk usage; shut down or remove instances to control significant cost expenditures.

5. Deliverables

  • Summary of machine families and sizing options explored
  • Screenshots or notes from custom configuration simulations
  • Comparison table of pricing across different series
  • CLI output of available machine types and GPUs (if applicable)
  • Regional availability observations

6. Supplemental Materials


7. Notes and Warnings

  • Do not finalize VM creation during this lab
  • Custom configurations incur a 5% pricing premium over predefined types
  • Extended memory does not qualify for committed use discounts
  • GPU instances require quota approval and billing setup—explore only
  • GPU availability varies significantly by zone—always check before deployment
  • Some newer GPU types (H100, H200) may have limited availability

8. Verification Source

  • Verified against Google Cloud Compute Engine Documentation (Updated August 2025)
  • Cross-referenced with Vertex AI machine type specifications
  • GPU availability confirmed via Google Cloud Console and CLI

9. Quick Reference

Machine Series by Use Case (2025) - GCP vs AWS Comparison

Google Cloud Platform (GCP)

| Use Case | Recommended Series | GPU Options | Notes |
| --- | --- | --- | --- |
| Development/Testing | E2 | None | Most cost-effective |
| General ML Training | N4, N2 | Add via N1 | Balanced performance |
| Cost-Optimized Inference | G2 | NVIDIA L4 | Built-in GPUs |
| High-Performance Training | A3 | NVIDIA H100/H200 | Latest generation |
| Large-Scale Training | A2 | NVIDIA A100 | Established option |
| Memory-Intensive | N2D, N4 | Add via N1 | Extended memory support |

Amazon Web Services (AWS SageMaker)

| Use Case | Recommended Series | GPU Options | Notes | GCP Equivalent |
| --- | --- | --- | --- | --- |
| Development/Testing | ml.t3.medium | None | Default CPU instance, free tier | E2 |
| General ML Training | ml.m5, ml.c5 | Add ml.g4dn | Balanced compute/memory | N4, N2 |
| Cost-Optimized Inference | ml.g4dn, ml.g5 | NVIDIA T4, A10G | Built-in GPUs | G2 |
| High-Performance Training | ml.p4d, ml.p5 | NVIDIA A100, H100 | UltraCluster support | A3, A2 |
| Large-Scale Training | ml.p4d.24xlarge | 8x NVIDIA A100 | Multi-node capability | A2 |
| Memory-Intensive | ml.r5, ml.r6i | Add via ml.p* | High memory ratios | N2D, N4 |
| Custom Silicon Inference | ml.inf1, ml.inf2 | AWS Inferentia 1/2 | Cost-effective inference | No direct equivalent |
| Custom Silicon Training | ml.trn1, ml.trn2 | AWS Trainium | 50% cost savings | No direct equivalent |

Key Differences & Mapping:

Compute Families:

  • GCP E2 ↔ AWS ml.t3: Cost-optimized, burstable
  • GCP N4/N2 ↔ AWS ml.m5/ml.c5: General purpose, balanced resources
  • GCP G2 ↔ AWS ml.g4dn/ml.g5: GPU-optimized for inference
  • GCP A3 ↔ AWS ml.p5: Latest generation high-end training
  • GCP A2 ↔ AWS ml.p4d: Established high-performance training
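The series mapping above can be captured as a small lookup table for migration planning scripts. This sketch takes its pairings directly from the comparison in this section; where the document pairs two series at once (N4/N2 with ml.m5/ml.c5), the split shown here is a simplification of our own:

```python
# GCP machine series -> closest AWS SageMaker series, per the mapping above.
GCP_TO_SAGEMAKER = {
    "E2": "ml.t3",
    "N4": "ml.m5",
    "N2": "ml.c5",
    "G2": "ml.g4dn/ml.g5",
    "A3": "ml.p5",
    "A2": "ml.p4d",
}

def sagemaker_equivalent(gcp_series):
    """Look up the closest SageMaker series for a GCP machine series."""
    return GCP_TO_SAGEMAKER.get(gcp_series, "no direct equivalent")

print(sagemaker_equivalent("G2"))  # ml.g4dn/ml.g5
```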

GPU Comparison:

  • GCP NVIDIA L4 ↔ AWS NVIDIA T4: Cost-effective inference
  • GCP NVIDIA A100 ↔ AWS NVIDIA A100: Same GPU, different platforms
  • GCP NVIDIA H100/H200 ↔ AWS NVIDIA H100: Latest generation training

Unique to AWS:

  • ml.inf1/inf2: AWS Inferentia chips for cost-effective inference
  • ml.trn1/trn2: AWS Trainium chips for training cost savings
  • UltraCluster: Multi-node scaling up to thousands of GPUs

Unique to GCP:

  • Extended memory: Higher memory-to-CPU ratios on select series
  • Custom machine types: Fine-grained vCPU/memory control
  • Titanium integration: Hardware acceleration for networking

Pricing Model Differences:

  • GCP: Pay-per-second, custom configurations, sustained use discounts
  • AWS: Pay-per-hour, predefined sizes, Savings Plans and Spot instances

Default Configurations

  • Vertex AI CustomJob: e2-standard-4 (4 vCPUs, 16GB)
  • Minimum for GPU workloads: Avoid small instances (e.g., n1-highmem-2) with GPUs

🧪 Lab 2.8: Cloud Storage Classes and Object Lifecycle Management

Duration: 30 minutes
Objective: Explore Google Cloud Storage classes and lifecycle rule configurations to optimize cost and retention for ML datasets—without finalizing resource creation.


1. Prerequisites

  • Google Cloud Console access with Storage permissions

  • Cloud Shell enabled

  • Familiarity with object storage and data retention strategies

  • Billing enabled (for simulation only—no resource creation)

  • No lifecycle rule creation required


2. Theory Overview

  • Cloud Storage offers multiple classes based on access frequency and availability

  • Storage classes include: Standard, Nearline, Coldline, Archive

  • Each class has minimum storage durations and retrieval fees

  • Lifecycle rules automate transitions and deletions based on object age or conditions

  • Actions include: SetStorageClass, Delete, AbortIncompleteMultipartUpload


3. Hands-On Exploration Steps (Do Not Finalize Resources)

10. Access Cloud Storage Console

  • Navigate to Cloud Storage > Buckets

  • Alternatively, search for storage.

  • Select Buckets and review the available buckets (e.g., ml-datasets)

  • Select the bucket named groundeddiabetes

11. Open Lifecycle Rules Panel

  • Click the bucket name

  • Go to Lifecycle tab

  • Click + Add a rule

12. Explore Lifecycle Rule Configuration

  • In the Select an action panel, select Set storage class to Nearline

  • Select Continue

  • Set Condition: Age > 30 days

  • Select Create

  • Add a second rule: Age > 90 days → Set storage class to Coldline

  • Select Create

  • Add a third rule: Age > 365 days → Delete

  • Select Create

  • When finished, select Delete all and confirm to remove the demo rules

(Optional) BigQuery and GCS for ML

  • Presenter Demonstration

13. (Optional) Review Pricing

14. Inspect via CLI (Optional)

  • Run:
    gsutil lifecycle get gs://[YOUR_BUCKET_NAME]
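For reference, the three rules explored in step 12 correspond to the lifecycle JSON below, in the shape that `gsutil lifecycle set lifecycle.json gs://[YOUR_BUCKET_NAME]` accepts. This is a sketch only—do not apply it to the shared workshop bucket:

```python
import json

# GCS lifecycle config mirroring the three rules explored above:
# Nearline at 30 days, Coldline at 90 days, Delete at 365 days.
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}

print(json.dumps(lifecycle, indent=2))
```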

4. Deliverables

  • Summary of lifecycle rule configuration explored

  • Pricing comparison table for storage classes

  • CLI output of existing lifecycle rules (if applicable)

5. Supplemental Materials

  • Runbook: runbooks/gcp-storage-lifecycle-exploration.md

  • Playbook: playbooks/gcp-storage-optimization-strategy.md

6. Notes and Warnings

  • Do not finalize lifecycle rule creation during this lab

  • Minimum storage durations may incur early deletion fees

  • Lifecycle rules should be scoped carefully using conditions

7. Verification Source

  • Verified against Google Cloud Storage Lifecycle Documentation