Machine Learning Operations Playbook Adoption Workshop – Phase 1: Core Architecture Hands-On Workshop
By the end of this workshop, participants will be able to:
- Compare and contrast AWS and Google Cloud global infrastructure architectures
- Explore infrastructure and edge network topologies across both platforms
- Map AWS services to Google Cloud service equivalents for AI/ML pipeline workloads
Prerequisites:
- Completion of Module 1: Cost Optimization
- AWS Management Console access with infrastructure permissions
- Google Cloud Console access with project access rights
- Billing account configured
This hands-on workshop builds upon the cost management foundation from Module 1 to establish the technical architecture knowledge required for successful AWS to Google Cloud ML migrations. Using console interfaces and CloudShell, participants will gain practical experience with infrastructure services, networking, and IAM configurations across both platforms.
Duration: 45 minutes
Objective: Explore AWS global infrastructure and availability zone design using CLI and console-based inspection—without finalizing resource creation.
- AWS Management Console access with EC2 and CloudShell permissions
- AWS CLI available via CloudShell or local environment
- Familiarity with basic AWS terminology (Region, AZ, CLI)
- No EC2 instance creation required
- AWS infrastructure is organized into regions and availability zones
- Each region is a geographically isolated location with multiple AZs
- Availability Zones are independent failure domains within a region
- Opt-in regions must be manually enabled before use
- High availability strategies use multiple AZs to ensure fault tolerance
- Navigate to AWS Console
- Launch CloudShell from the top navigation bar
- Run: aws ec2 describe-regions --output table
- Run: aws ec2 describe-regions --query 'Regions[*].[RegionName,OptInStatus]' --output table
- Identify which regions require opt-in
- Run: aws ec2 describe-availability-zones --output table
- Run: aws ec2 describe-availability-zones --region us-east-1 --output table
- Observe zone names and states
- Navigate to EC2 > Instances > Launch Instance
- Use the region dropdown to compare AZ counts
- Cancel before launching any instance
- Identify 3 regions and list their AZs
- Note differences in zone naming and availability
- Deliverables
- Table of AWS Regions and Opt-In status
- List of AZs for us-east-1 and two other regions
- Notes on regional design considerations and zone distribution
- Supplemental Materials
- Runbook: runbooks/aws-region-az-exploration.md
- Playbook: playbooks/aws-ha-topology-strategy.md
- Notes and Warnings
- Do not launch EC2 instances or other resources during this lab
- AZ names (e.g., us-east-1a) are account-specific and may vary
- Opt-in regions may require manual activation before use
- Verification Source
- Validated against AWS EC2 Regions and AZs Documentation
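The opt-in check above can also be scripted against the JSON form of the CLI output. A minimal sketch, assuming output shaped like `aws ec2 describe-regions --output json` (the sample below is illustrative data, not live CLI output):

```python
import json

# Illustrative sample of `aws ec2 describe-regions --output json`;
# real output lists every region visible to the account.
sample = json.loads("""
{"Regions": [
  {"RegionName": "us-east-1", "OptInStatus": "opt-in-not-required"},
  {"RegionName": "eu-west-1", "OptInStatus": "opt-in-not-required"},
  {"RegionName": "af-south-1", "OptInStatus": "not-opted-in"},
  {"RegionName": "me-south-1", "OptInStatus": "opted-in"}
]}
""")

# Regions that must be (or already were) manually enabled.
opt_in_regions = sorted(
    r["RegionName"] for r in sample["Regions"]
    if r["OptInStatus"] != "opt-in-not-required"
)
print(opt_in_regions)  # ['af-south-1', 'me-south-1']
```

In practice you would pipe the real CLI output into the script instead of hardcoding the sample.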
Duration: 45 minutes
Objective: Explore AWS’s global content delivery infrastructure using CloudFront and edge location metadata—without deploying distributions or modifying resources.
- AWS Management Console access with CloudFront and CloudShell permissions
- AWS CLI available via CloudShell or local environment
- Basic understanding of CDN concepts (edge location, origin, cache)
- No CloudFront distribution creation required
- AWS CloudFront is a content delivery network (CDN) that uses a global network of edge locations
- Edge locations cache content closer to users to reduce latency
- Regional edge caches act as mid-tier caches between origin and edge locations
- CloudFront integrates with other AWS services such as S3, EC2, and Lambda@Edge
- Edge locations are distributed across major cities and regions worldwide
- Navigate to AWS Console
- Launch CloudShell from the top navigation bar
- Run: aws cloudfront list-distributions --output json
- If no distributions exist, proceed to inspect global infrastructure
- Run: aws cloudfront get-distribution-config --id <distribution-id> (only if read-only distributions exist)
- Navigate to CloudFront > Locations in the AWS Console
- Observe the map of edge locations and regional edge caches
- Note geographic distribution and latency zones
- Navigate to CloudFront > Distributions
- Select any existing distribution (if available)
- Review origin settings, cache behaviors, and edge associations
- Do not create or modify any distributions
- Identify 3 cities with edge locations from the AWS map: https://aws.amazon.com/about-aws/global-infrastructure/
- Note proximity to major user populations
- Record latency benefits and strategic placement rationale
- Deliverables
- List of global edge locations and regional edge caches
- Summary of CloudFront distribution architecture (if read-only access available)
- Notes on geographic distribution and latency optimization strategy
- Supplemental Materials
- Runbook: runbooks/aws-cloudfront-edge-location-inspection.md
- Playbook: playbooks/aws-global-cdn-strategy.md
- Notes and Warnings
- Do not create or modify CloudFront distributions during this lab
- Edge location availability may vary by region and account
- CLI output may be empty if no distributions exist—this is expected
- Verification Source
- Validated against AWS CloudFront Documentation
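If your account does have read-only distributions, the `list-distributions` output can be summarized programmatically. A minimal sketch, assuming JSON shaped like the CLI response (the sample below is illustrative, not live output; an account with no distributions returns an empty list):

```python
import json

# Illustrative sample of `aws cloudfront list-distributions --output json`,
# trimmed to the fields used here.
sample = json.loads("""
{"DistributionList": {"Quantity": 1, "Items": [
  {"Id": "EDFDVBD6EXAMPLE",
   "DomainName": "d111111abcdef8.cloudfront.net",
   "Origins": {"Items": [{"Id": "my-s3-origin",
                          "DomainName": "my-bucket.s3.amazonaws.com"}]}}
]}}
""")

# Map each distribution ID to the origin domains it serves from.
items = sample["DistributionList"].get("Items", [])
summary = [
    (d["Id"], [o["DomainName"] for o in d["Origins"]["Items"]])
    for d in items
]
print(summary)
```

This is the same origin-to-distribution mapping the console shows under Distributions, in a form you can diff between accounts.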
Duration: 45 minutes
Objective: Explore Google Cloud’s global infrastructure, focusing on regions, zones, and service availability—without deploying any resources.
- Google Cloud Console access with project-level permissions
- Cloud Shell enabled
- Basic understanding of cloud infrastructure concepts
- Ensure the Compute Engine API is enabled for your project
- No VM instance creation required
- Google Cloud has over 40 regions and 100+ zones globally
- Each region is a geographic location containing multiple isolated zones
- Zones are independent failure domains connected via Google’s private high-speed network
- Most regions contain three or more zones housed in separate physical facilities
- Service availability may vary by region and zone
- Navigate to Google Cloud Console
- Use the Project Picker to select your project
- Enter Workstations in the search bar
- Select Create workstation
- Enter a unique display name
- In the Configuration field drop-down, select test-configuration
- Select Create. Note: Creation may take several minutes to complete.
- Select Start, located in the All workstations section below the Quick actions column. Note: Startup may take several minutes to complete.
- Select Launch. In the new workstation, select the menu icon to access options, then select Terminal.
- Review the terminal area
- Run: gcloud auth login
- Select the clickable link, then select Open; a new browser session will start. Follow the prompts in the new session to log in and obtain a verification code.
- Select Continue
- Follow the prompts and provide a username and password if required
- Select Copy. Note: The credential is a verification code.
- Paste the verification code into the terminal
- Run: gcloud config set project mfav2-374520
- Run: gcloud compute regions list --format="table(name,status,zones.len():label=ZONES)"
- Run: gcloud compute regions describe us-east1
- Run: gcloud compute zones list --format="table(name,region,status)"
- Run: gcloud compute zones list --filter="region:us-east1" --format="table(name,status)"
- Navigate to Compute Engine > VM instances > Create Instance
- Use the Region dropdown to view available zones
- Cancel before deploying any instance
- Run: gcloud ai models list --region=us-east1 2>/dev/null || echo "Vertex AI not available in this region"
- Run: gcloud compute machine-types list --zones=us-east1-a --filter="name:n1-standard"
- Deliverables
- Region and zone availability matrix
- Notes on service availability for Vertex AI and machine types
- Observations on zone distribution and naming conventions
- Supplemental Materials
- Runbook: runbooks/gcp-region-zone-exploration.md
- Playbook: playbooks/gcp-multi-zone-deployment-strategy.md
- Notes and Warnings
- Do not finalize VM creation during this lab
- Zone names (e.g., us-central1-a) may vary by region and project
- Some services are region-specific—verify availability before planning deployments
- Verification Source
- Validated against Google Cloud Regions and Zones Documentation
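The region and zone availability matrix deliverable can be built from the JSON form of the zones listing. A minimal sketch, assuming output shaped like `gcloud compute zones list --format=json` (the sample below is illustrative; the `...` region prefixes stand in for full API URLs):

```python
import json
from collections import Counter

# Illustrative sample of `gcloud compute zones list --format=json`;
# each entry carries a `region` URL whose last path segment is the region name.
sample = json.loads("""
[{"name": "us-east1-b", "region": ".../regions/us-east1", "status": "UP"},
 {"name": "us-east1-c", "region": ".../regions/us-east1", "status": "UP"},
 {"name": "us-east1-d", "region": ".../regions/us-east1", "status": "UP"},
 {"name": "us-central1-a", "region": ".../regions/us-central1", "status": "UP"}]
""")

# Tally zones per region to see how many failure domains each region offers.
zones_per_region = Counter(z["region"].rsplit("/", 1)[-1] for z in sample)
print(dict(zones_per_region))  # {'us-east1': 3, 'us-central1': 1}
```

Running the same tally over the real listing gives the zone-count column of your matrix directly.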
Duration: 45 minutes
Objective: Explore Google Cloud’s edge infrastructure and understand how Cloud CDN accelerates content delivery using globally distributed edge locations—without finalizing resource creation.
- Google Cloud Console access with project-level permissions
- Cloud Shell enabled, or a Cloud Workstation created and launched
- Basic understanding of networking and CDN principles
- Ensure the Cloud CDN API is enabled for your project
- No load balancer or backend service creation required
- Google Cloud’s edge network includes over 200 edge locations globally
- Edge PoPs (Points of Presence) cache and serve content closer to users
- Cloud CDN integrates with HTTP(S) Load Balancing to deliver content via edge caches
- CDN reduces latency, offloads origin servers, and improves user experience
- Navigate to Google Cloud Console
- Select your project using the Project Picker
- Note: If you cannot launch Cloud Shell or encounter warnings or errors, use Cloud Workstations.
- Review the previous instructions for lab 2.3, tasks 12 and 13. Note: If you did not delete the Cloud Workstation, continue by selecting Start for the stopped workstation, then select Launch.
- Note geographic distribution and latency zones
- Search for Cloud CDN
- Click Add origin
- In the Define your backend bucket section, select Browse in the Cloud Storage bucket field
- In the Select bucket side panel, select groundeddiabetes, then choose Select
- In the Origin name field, enter a unique name
- Select Next
- Select the Create new load balancer for me button
- Enter a unique name in the Load Balancer name field
- Select Next
- Review Cache Performance Basic options
- Review cache key and TTL settings
- Cancel configuration before saving
- Run: gcloud compute backend-services list --filter="cdnPolicy.enable:true" --format="table(name,protocol,cdnPolicy.cacheMode)"
- Deliverables
- Notes on CDN activation and configuration
- Observations on cache behavior and latency
- Summary of edge location coverage and performance benefits
- Supplemental Materials
- Runbook: runbooks/gcp-cloud-cdn-exploration.md
- Playbook: playbooks/gcp-edge-network-strategy.md
- Notes and Warnings
- Do not finalize load balancer or backend service creation during this lab
- Cloud CDN only works with HTTP(S) load balancers
- Edge locations are managed by Google and not directly configurable
- Cache behavior may vary based on content type and headers
- Verification Source
- Validated against Google Cloud CDN Documentation: Google Cloud CDN & Edge
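To make the origin-offload benefit concrete, here is a back-of-envelope sketch. All numbers are illustrative assumptions for the exercise, not Google Cloud pricing, SLAs, or measured hit ratios:

```python
# Back-of-envelope sketch of why edge caching offloads the origin.
# total_requests and cache_hit_ratio are assumed example values.
total_requests = 1_000_000      # requests per day reaching the CDN edge
cache_hit_ratio = 0.90          # fraction served directly from edge caches

# Only cache misses travel back to the origin (load balancer backend).
origin_requests = round(total_requests * (1 - cache_hit_ratio))
offload_pct = round(100 * (total_requests - origin_requests) / total_requests, 1)
print(origin_requests, offload_pct)  # 100000 90.0
```

Even a modest hit ratio removes most traffic from the origin, which is the "offloads origin servers" point in the key concepts above.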
Objective: Explore EC2 instance families optimized for machine learning workloads, focusing on sizing strategies, accelerator options, and pricing—without launching any instances.
- AWS Console access with read-only permissions
- Familiarity with EC2 concepts and ML workload characteristics
- No EC2 instance launch or billing-incurring actions required
- AWS CLI installed in CloudShell or local environment
- ML workloads vary in compute, memory, and accelerator needs
- EC2 instance families include general purpose, compute optimized, memory optimized, and accelerated computing
- GPU-based instances (e.g., p4, g5) are ideal for training deep learning models
- Inferentia-based instances (e.g., inf1) are optimized for inference
- Sizing depends on model complexity, batch size, and training duration
- Navigate to the AWS EC2 Dashboard
- Select Instances > Launch Instance
- Select the area displaying an instance type (for example: t3.micro)
- In the search field, enter g5. Review the instance type details.
- Do not launch an instance. Cancel before finalizing any configuration.
- Search for EC2 if you cannot navigate back to the EC2 Dashboard
- Select Instance Types
- Select Instance type finder
- Select the drop-down menu in the Workload type field, then select Machine Learning
- Leave the rest of the fields at their default settings
- Select Get instance type advice
- Review the Additional information
- Run: aws ec2 describe-instance-types --query 'InstanceTypes[?GpuInfo != null].{InstanceType:InstanceType, GPU:GpuInfo.Gpus[0].Name, Count:GpuInfo.Gpus[0].Count}' --output table
- (Optional) Review Pricing and Regional Availability
- Visit the AWS Pricing Calculator
- Compare hourly costs for GPU vs CPU instances
- Note availability zones for ML-optimized types
- Examine Instance Limits
- Go to the Limits tab in the EC2 Dashboard
- Check quotas for GPU instances in your region
- Deliverables
- Summary of instance types suitable for ML training vs inference
- Notes on pricing, GPU accelerator support, and regional availability
- CLI output showing GPU-enabled instance types
- Supplemental Materials
- Runbook: runbooks/aws-ec2-ml-sizing.md
- Playbook: playbooks/aws-ml-instance-selection.md
- Notes and Warnings
- Do not launch EC2 instances during this lab
- GPU instances may have limited availability in some regions
- Pricing varies significantly based on GPU accelerator type and tenancy
- Always validate instance compatibility with ML frameworks (e.g., TensorFlow, PyTorch)
- Verification Source
- Verified against AWS EC2 Instance Types and GPU Accelerated Computing Guidance
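The GPU filter that the JMESPath query above performs can also be done in a few lines of Python on the JSON output. A minimal sketch, assuming output shaped like `aws ec2 describe-instance-types --output json` (the sample below is illustrative, trimmed to the fields used):

```python
import json

# Illustrative sample of `aws ec2 describe-instance-types --output json`;
# only GPU-bearing entries carry a GpuInfo field.
sample = json.loads("""
{"InstanceTypes": [
  {"InstanceType": "t3.micro"},
  {"InstanceType": "g5.xlarge",
   "GpuInfo": {"Gpus": [{"Name": "A10G", "Count": 1}]}},
  {"InstanceType": "p4d.24xlarge",
   "GpuInfo": {"Gpus": [{"Name": "A100", "Count": 8}]}}
]}
""")

# Keep only GPU-enabled types, mapping each to (GPU name, GPU count).
gpu_types = {
    t["InstanceType"]: (t["GpuInfo"]["Gpus"][0]["Name"],
                        t["GpuInfo"]["Gpus"][0]["Count"])
    for t in sample["InstanceTypes"] if "GpuInfo" in t
}
print(gpu_types)
```

The resulting dict is a convenient form for the "CLI output showing GPU-enabled instance types" deliverable.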
Duration: 30 minutes
Objective: Explore Amazon S3 storage tiers and lifecycle rule configurations to optimize cost for ML datasets—without finalizing resource creation.
- AWS Management Console access with S3 permissions
- CloudShell or AWS CLI enabled
- Sample bucket available or simulated for exploration
- Familiarity with object storage concepts and cost optimization strategies
- No lifecycle rule creation or bucket modification required
- Amazon S3 offers multiple storage classes tailored to access patterns and cost
- Lifecycle rules automate transitions and deletions based on object age or tags
- Storage Class Comparison:

| Storage Class | Use Case | Durability | Availability | Retrieval Time | Cost (per GB/month) |
|---|---|---|---|---|---|
| Standard | Frequent access | 99.999999999% | 99.99% | Immediate | High |
| Intelligent-Tiering | Unknown/changing access patterns | 99.999999999% | 99.9–99.99% | Immediate | Variable |
| Standard-IA | Infrequent access | 99.999999999% | 99.9% | Immediate | Lower |
| One Zone-IA | Infrequent, single AZ | 99.999999999% | 99.5% | Immediate | Lowest IA |
| Glacier | Archival, occasional retrieval | 99.999999999% | N/A | Minutes–hours | Very low |
| Glacier Deep Archive | Long-term archival | 99.999999999% | N/A | Hours | Lowest |

- Navigate to S3 > Buckets
- Select a bucket (e.g., theiiadiabetesaws)
- Click the bucket name
- Select the Management tab and review Lifecycle rules
- Click Create lifecycle rule
- Enter Rule name: ml-lifecycle-demo
- Filter: set a Prefix (for example: raw/)
- Transitions: select Transition current versions of objects between storage classes. Note: Select I acknowledge that this lifecycle rule will incur a transition cost per request.
- Under Transition current versions of objects between storage classes, set Choose storage class transitions to Standard-IA after 30 days
- Select Add transition
- Set Choose storage class transitions to Glacier Flexible Retrieval after 90 days
- Cancel before saving
- Visit S3 Pricing
- Compare cost for storing data over 12 months across classes
- Run: aws s3api get-bucket-lifecycle-configuration --bucket [YOUR_BUCKET_NAME]
- Note: The command above is expected to fail because you were instructed to cancel before saving your lifecycle rule.
- Deliverables
- Summary of lifecycle rule configuration explored
- Pricing comparison table for storage classes
- CLI output of existing lifecycle rules (if applicable)
- Supplemental Materials
- Runbook: runbooks/aws-s3-lifecycle-exploration.md
- Playbook: playbooks/aws-storage-optimization-strategy.md
- Notes and Warnings
- Do not finalize lifecycle rule creation during this lab
- Lifecycle transitions may incur retrieval fees—review pricing carefully
- Use tags and prefixes to scope rules narrowly in production environments
- Verification Source
- Verified against Amazon S3 Lifecycle Configuration Documentation
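The 12-month cost comparison can be sketched as simple arithmetic over the lifecycle explored above (Standard for the first 30 days, Standard-IA until day 90, Glacier thereafter). The per-GB prices below are placeholder assumptions, not real S3 rates; take actual figures from the S3 pricing page, and note that transition and retrieval fees are ignored here:

```python
# Placeholder per-GB-month prices (assumed for illustration only).
PRICE_PER_GB_MONTH = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER": 0.004}
SIZE_GB = 1024  # 1 TB dataset

# Months spent in each class under the 30-day / 90-day lifecycle (12 months total).
months_in_class = {"STANDARD": 1, "STANDARD_IA": 2, "GLACIER": 9}

cost = sum(SIZE_GB * PRICE_PER_GB_MONTH[c] * m for c, m in months_in_class.items())
print(round(cost, 2))  # 86.02
```

Swapping in real prices (and adding transition request costs) turns this into the pricing comparison table deliverable.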
Objective: Explore Google Cloud Compute Engine machine families, sizing strategies, and custom VM configurations for ML workloads—without finalizing resource creation.
- Google Cloud Console access with Compute Engine permissions
- Cloud Shell enabled
- Familiarity with VM sizing and ML workload characteristics
- Billing enabled (for simulation only—no resource creation)
- No VM instance creation required
- Compute Engine offers predefined and custom machine types across multiple families
- General-purpose families (E2, N2, N2D, N4, C3, C3D, C4, C4A, C4D) support custom configurations
- Accelerator-optimized families (A3, A2, G2) provide fixed GPU configurations
- Memory-optimized families (M1, M2, M3, M4) for memory-intensive workloads
- Compute-optimized families (C2, C2D, H3) for compute-intensive tasks
- Cost-optimized families (T2A, T2D, E2) for budget-conscious workloads
- Custom machine types allow fine-grained control over vCPU and memory
- Extended memory is available for N4, N2, N2D, and N1 series (not E2 or G2)
- Default Vertex AI configuration: e2-standard-4 (4 vCPUs, 16 GB memory)
- GPU availability varies by zone and requires specific machine series
- NVIDIA L4: cost-effective inference, video processing (G2 series)
- NVIDIA A100: high-performance training and inference (A2 series)
- NVIDIA H100/H200: latest generation for demanding AI workloads (A3 series)
- Legacy GPUs: T4, V100, P4, P100 (N1 series only)
- Access Google Cloud Console
- Navigate to Google Cloud Console: https://console.cloud.google.com
- Use the Project Picker to select your project
- Use Cloud Workstations
- Enter Workstations in the search bar
- Select Create workstation
- Enter a unique ID name (e.g., your first name plus a few variable characters)
- Enter a unique display name
- In the Configuration field drop-down, select test-configuration
- Select Create. Note: Creation may take several minutes to complete.
- Select Start, located in the All workstations section below the Quick actions column. Note: Startup may take several minutes to complete.
- Select Launch. In the new workstation, select the menu icon to access options, then select Terminal.
- Review the terminal area
- Run: gcloud auth login
- Select the clickable link, then select Open; a new browser session will start. Follow the prompts in the new session to log in and obtain a verification code.
- Select Continue
- Follow the prompts and provide a username and password if required
- Select Copy. Note: The credential is a verification code.
- Paste the verification code into the terminal
- Run: gcloud config set project mfav2-374520
- Navigate to Compute Engine > VM instances (alternatively, search for Compute Engine)
- Click Create Instance (do not complete creation)
- Under Machine configuration, review the current series:
- General-purpose: E2, N2, N2D, N4, C3, C3D, C4, etc.
- Navigate below Machine types for common workloads, optimized for cost and flexibility, and locate Machine type
- In the Machine type area, review the preset types: shared-core, standard, highmem, highcpu
- Click Customize to manually set vCPU and memory
- Note pricing differences between series, including Preset and Custom, by reviewing the Monthly estimate
- Note: Logging, monitoring (metrics), and snapshots are additional, varying costs.
- Simulate a custom training configuration:
- vCPU: 16
- Memory: 128 GB
- Family: N4 (latest general-purpose)
- Observe the estimated monthly cost
Challenge Activity Question:
Do you recommend using Preset to support your machine learning operations training workload system requirements? Yes or No. Please explain your answer in the Teams chat.
- Simulate a cost-optimized configuration:
- vCPU: 8
- Memory: 32 GB
- Family: E2 (cost-optimized)
- Compare pricing with the N4 equivalent
Challenge Activity Question:
What is the cost difference? Please explain your answer in the Teams chat.
- Under Machine configuration, select GPUs
- Review: NVIDIA H100/H200 GPUs (fixed configurations)
- Review the Machine type section and its drop-down options (e.g., a3-highgpu)
- Observe the Monthly estimate
- Under Machine configuration, select GPUs
- Review: NVIDIA L4 GPUs (1, 2, 4, or 8 GPUs)
- Review Preset and Custom vCPU and memory range options
- Under Machine configuration, select GPUs
- Review NVIDIA A100 GPUs
- Make note of the machine types (e.g., a2-highgpu-1g)
- Note: Fixed configurations (1, 2, 4, 8, or 16 GPUs)
Challenge Activity Question:
What is the machine type name for NVIDIA A100 GPUs with 1 GPU and 85 GB memory, and the name if I need 16 GPUs? Please explain your answer in the Teams chat.
- Note: In the CLI and SDK, the syntax for a machine type may differ. For example: --accelerator=count=16,type=nvidia-tesla-a100. It is important to know the machine type name to locate the SDK syntax version.
- Select General purpose
- Select the N2 series
- Under Machine type, select Custom
- Under Cores and Memory, select Extend Memory
- Note the significant increase in the memory range (GB)
- Select GPUs
- In Machine configuration, change the Region and Zone
- Note how GPU availability changes by location
- Example: Compare us-central1-a vs us-east1-b
- Run: gcloud compute machine-types list --zones=us-central1-a --format="table(name,guestCpus,memoryMb)"
- Run: gcloud compute accelerator-types list --filter="zone:us-central1"
- Run: gcloud compute regions describe us-central1 --format="table(quotas.metric,quotas.limit,quotas.usage)" --flatten="quotas[]"
- Presenter will demonstrate automated CI/CD pipeline training and deployment job components (e.g., preprocess_data_op, train_model_op, evaluate_model_op, model_approved_op, register_model_op, deployment_model_op)
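The `gcloud compute accelerator-types list` output above can be grouped by zone to see where each GPU type is offered. A minimal sketch, assuming JSON shaped like `--format=json` output (the sample below is illustrative, not a live listing):

```python
import json
from collections import defaultdict

# Illustrative sample of `gcloud compute accelerator-types list --format=json`,
# trimmed to name and zone.
sample = json.loads("""
[{"name": "nvidia-tesla-t4", "zone": "us-central1-a"},
 {"name": "nvidia-l4", "zone": "us-central1-a"},
 {"name": "nvidia-tesla-t4", "zone": "us-east1-b"}]
""")

# Group available accelerator types by zone.
gpus_by_zone = defaultdict(list)
for acc in sample:
    gpus_by_zone[acc["zone"]].append(acc["name"])
print(dict(gpus_by_zone))
```

Comparing the resulting zone lists is a quick way to record the "GPU availability changes by location" observation from the console steps.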
- Serverless (most flexible; supports varying use cases)
- Default machine type: e2-standard-4 (4 vCPUs, 16 GB memory)
- CustomJob supports varying vCPU and memory ranges
- GPU support requires A2, N1, or G2 machine types
- MLOps Level 0: Manual process
- Utilizes Vertex AI Workbench
- MLOps Level 1: ML pipeline automation
- Utilizes Vertex AI Pipelines
- MLOps Level 2: CI/CD pipeline automation
- Utilizes CI/CD orchestration with Vertex AI pipeline automation
- Best practice: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#mlops_level_2_cicd_pipeline_automation
- Supports configurable machine types for notebooks
- Both CPU-only and GPU-enabled instances available
- Automatic GPU driver installation option
- Start with E2 for cost optimization and Pipelines for automation
- Use N4 for balanced performance
- Choose G2 for cost-effective AI inference
- Select A3 for demanding training workloads
- Use Vertex AI Workbench for exploratory data analysis during short-term experimentation, including sampling. Note: Do not run Vertex AI Workbench long term or persist disk usage; shut it down or remove it to control significant cost expenditures.
- Deliverables
- Summary of machine families and sizing options explored
- Screenshots or notes from custom configuration simulations
- Comparison table of pricing across different series
- CLI output of available machine types and GPUs (if applicable)
- Regional availability observations
- Supplemental Materials
- Runbook: runbooks/gcp-compute-machine-types.md
- Playbook: playbooks/gcp-ml-instance-sizing-strategy.md
- Reference: GPU regions and zones availability
- Notes and Warnings
- Do not finalize VM creation during this lab
- Custom configurations incur a 5% pricing premium over predefined types
- Extended memory does not qualify for committed use discounts
- GPU instances require quota approval and billing setup—explore only
- GPU availability varies significantly by zone—always check before deployment
- Some newer GPU types (H100, H200) may have limited availability
- Verification Source
- Verified against Google Cloud Compute Engine Documentation (updated August 2025)
- Cross-referenced with Vertex AI machine type specifications
- GPU availability confirmed via Google Cloud Console and CLI
| Use Case | Recommended Series | GPU Options | Notes |
|---|---|---|---|
| Development/Testing | E2 | None | Most cost-effective |
| General ML Training | N4, N2 | Add via N1 | Balanced performance |
| Cost-Optimized Inference | G2 | NVIDIA L4 | Built-in GPUs |
| High-Performance Training | A3 | NVIDIA H100/H200 | Latest generation |
| Large-Scale Training | A2 | NVIDIA A100 | Established option |
| Memory-Intensive | N2D, N4 | Add via N1 | Extended memory support |
| Use Case | Recommended Series | GPU Options | Notes | GCP Equivalent |
|---|---|---|---|---|
| Development/Testing | ml.t3.medium | None | Default CPU instance, free tier | E2 |
| General ML Training | ml.m5, ml.c5 | Add ml.g4dn | Balanced compute/memory | N4, N2 |
| Cost-Optimized Inference | ml.g4dn, ml.g5 | NVIDIA T4, A10G | Built-in GPUs | G2 |
| High-Performance Training | ml.p4d, ml.p5 | NVIDIA A100, H100 | UltraCluster support | A3, A2 |
| Large-Scale Training | ml.p4d.24xlarge | 8x NVIDIA A100 | Multi-node capability | A2 |
| Memory-Intensive | ml.r5, ml.r6i | Add via ml.p* | High memory ratios | N2D, N4 |
| Custom Silicon Inference | ml.inf1, ml.inf2 | AWS Inferentia 1/2 | Cost-effective inference | No direct equivalent |
| Custom Silicon Training | ml.trn1, ml.trn2 | AWS Trainium | 50% cost savings | No direct equivalent |
Compute Families:
- GCP E2 ↔ AWS ml.t3: Cost-optimized, burstable
- GCP N4/N2 ↔ AWS ml.m5/ml.c5: General purpose, balanced resources
- GCP G2 ↔ AWS ml.g4dn/ml.g5: GPU-optimized for inference
- GCP A3 ↔ AWS ml.p5: Latest generation high-end training
- GCP A2 ↔ AWS ml.p4d: Established high-performance training
GPU Comparison:
- GCP NVIDIA L4 ↔ AWS NVIDIA T4: Cost-effective inference
- GCP NVIDIA A100 ↔ AWS NVIDIA A100: Same GPU, different platforms
- GCP NVIDIA H100/H200 ↔ AWS NVIDIA H100: Latest generation training
Unique to AWS:
- ml.inf1/inf2: AWS Inferentia chips for cost-effective inference
- ml.trn1/trn2: AWS Trainium chips for training cost savings
- UltraCluster: Multi-node scaling up to thousands of GPUs
Unique to GCP:
- Extended memory: Higher memory-to-CPU ratios on select series
- Custom machine types: Fine-grained vCPU/memory control
- Titanium integration: Hardware acceleration for networking
- GCP: Pay-per-second, custom configurations, sustained use discounts
- AWS: Pay-per-hour, predefined sizes, Savings Plans and Spot instances
- Vertex AI CustomJob: e2-standard-4 (4 vCPUs, 16GB)
- Minimum for GPU workloads: Avoid small instances (e.g., n1-highmem-2) with GPUs
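The GCP selection table above can be encoded as a small lookup for scripting instance recommendations. The series names come from the table; the function and dict structure are just an illustration, not part of any Google Cloud API:

```python
# Use-case -> recommended Compute Engine series, taken from the table above.
RECOMMENDED_SERIES = {
    "development/testing": ["E2"],
    "general ml training": ["N4", "N2"],
    "cost-optimized inference": ["G2"],
    "high-performance training": ["A3"],
    "large-scale training": ["A2"],
    "memory-intensive": ["N2D", "N4"],
}

def recommend(use_case: str) -> list[str]:
    """Return the recommended machine series for a use case, or [] if unknown."""
    return RECOMMENDED_SERIES.get(use_case.strip().lower(), [])

print(recommend("High-Performance Training"))  # ['A3']
```

A similar dict keyed on the AWS table would let you script the AWS-to-GCP equivalence mapping as well.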
Duration: 30 minutes
Objective: Explore Google Cloud Storage classes and lifecycle rule configurations to optimize cost and retention for ML datasets—without finalizing resource creation.
- Google Cloud Console access with Storage permissions
- Cloud Shell enabled
- Familiarity with object storage and data retention strategies
- Billing enabled (for simulation only—no resource creation)
- No lifecycle rule creation required
- Cloud Storage offers multiple classes based on access frequency and availability
- Storage classes include: Standard, Nearline, Coldline, Archive
- Each class has minimum storage durations and retrieval fees
- Lifecycle rules automate transitions and deletions based on object age or conditions
- Actions include: SetStorageClass, Delete, AbortIncompleteMultipartUpload
- Navigate to Cloud Storage > Buckets (alternatively, search for Storage)
- Review the available buckets (e.g., ml-datasets)
- Select the bucket named groundeddiabetes
- Click the bucket name
- Go to the Lifecycle tab
- Click + Add a rule
- In the Select an action section, select Continue
- Select Set storage class to Nearline
- Set Condition: Age > 30 days
- Select Create
- Add a second rule: Age > 90 days → Coldline; select Create
- Add a third rule: Age > 365 days → Delete; select Create
- Select Delete all and confirm
- Presenter Demonstration
- Visit Cloud Storage Pricing
- Compare cost for storing 1 TB over 12 months across classes
- Run: gsutil lifecycle get gs://[YOUR_BUCKET_NAME]
- Deliverables
- Summary of lifecycle rule configuration explored
- Pricing comparison table for storage classes
- CLI output of existing lifecycle rules (if applicable)
- Supplemental Materials
- Runbook: runbooks/gcp-storage-lifecycle-exploration.md
- Playbook: playbooks/gcp-storage-optimization-strategy.md
- Notes and Warnings
- Do not finalize lifecycle rule creation during this lab
- Minimum storage durations may incur early deletion fees
- Lifecycle rules should be scoped carefully using conditions
- Verification Source
- Verified against Google Cloud Storage Lifecycle Documentation
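The three rules explored in the console can also be expressed as the JSON document that `gsutil lifecycle set <file> gs://BUCKET` accepts. A sketch only, built but not applied to any bucket:

```python
import json

# Lifecycle config matching the lab: Nearline at 30 days, Coldline at 90 days,
# Delete at 365 days. This mirrors the GCS lifecycle JSON schema.
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}
config_json = json.dumps(lifecycle, indent=2)
print(config_json)
```

Writing `config_json` to a file and running `gsutil lifecycle set` would apply it; the lab instructs you not to, so treat this as documentation of the configuration you explored.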






