Cloud Computing Fundamentals

EE 547 - Unit 3

Dr. Brandon Franzke

Fall 2025

Local Infrastructure Scaling Limits

Single-Machine Memory and Compute Constraints

Development environments constrain ML system capabilities.

Typical Development Setup (2024)

  • 32GB RAM (64GB in higher-end configurations)
  • 8-12 CPU cores
  • Local SSD storage: 1-2TB
  • Single GPU: RTX 4090 (24GB VRAM)
  • Development cost: $3,000-$5,000

Where constraints bind:

  1. Memory: 100GB+ datasets exceed RAM capacity → OOM kills training process
  2. Storage: Multi-TB datasets fill local drives → Training stops mid-epoch
  3. Compute: Model training requires days/weeks → Development velocity drops 10x
  4. GPU memory: Large models exceed 24GB VRAM → Transformer training impossible
  5. Serving: Cannot handle 1000+ concurrent users → Request failures above ~10 requests/second

Real ML systems require infrastructure that scales beyond individual machines.

Production Workloads Require Distributed Infrastructure

Production workloads exceed development capabilities by orders of magnitude.

Large Language Model Training

  • 1,024+ GPUs running for weeks
  • 8TB+ aggregate GPU memory
  • 10TB+ training data with constant I/O
  • 400+ Gbps GPU interconnect bandwidth
  • $1-5 million per training run

Production Model Serving

  • 10K-1M+ requests per second
  • <100ms response time requirements
  • 99.9%+ uptime (under 45 minutes downtime/month)
  • Global deployment with data residency constraints
  • Variable cost based on unpredictable traffic peaks

Real-Time Processing Pipelines

  • TB/day data ingestion from multiple sources
  • Feature extraction and transformation processing
  • Millisecond inference decisions
  • Hot data, cold archives, backup storage
  • System health and model drift monitoring

Production workloads require entirely different infrastructure architectures, not scaled development setups.

Local Development Constraints Prevent Production Deployment

Concrete example of where local development assumptions break.

Development Phase (5 engineers, MacBook Pros)

  • Training YOLOv8 on 10K labeled images
  • Local training time: 4 hours per experiment
  • Storage: 50GB dataset fits on local SSDs
  • Cost: $25K in laptops

Production Requirements

  • 1M+ labeled images from customer data
  • Real-time inference: <50ms latency globally
  • Traffic: 10K requests/second peak
  • Deployment: US, Europe, Asia simultaneously

Failure Points

  1. Data pipeline: 500GB dataset cannot fit in memory → Swap thrashing kills performance
  2. Training time: 2 weeks per experiment vs 4 hours locally → ~84x slower iteration cycle
  3. GPU memory: Models require 48GB VRAM, have 24GB → CUDA out of memory errors
  4. Serving latency: Global users see 300ms+ latency from US servers → 6x SLA violation
  5. Infrastructure cost: $2M upfront hardware vs $50K/month cloud → 40x capital requirement

Local development assumptions break at cloud scale: datasets exceed memory, training times become prohibitive, single-point failures affect global users.

Resource Pooling Economics

Datacenters achieve cost efficiencies impossible for individual organizations.

Individual Company Infrastructure

  • Purchase servers for peak capacity → 3-5 year depreciation regardless of usage
  • Maintain datacenter facilities → $500K+ annual facility costs
  • Staff specialized operations teams → $200K+ per systems engineer
  • Handle hardware failures independently → 24-48 hour repair time
  • Plan capacity years in advance → 50% over-provisioning for growth

Utilization: 20-30% average with 100% fixed costs → 70% resource waste

Example Startup ML Training

  • Peak need: 100 GPUs for 1 week/month
  • Required purchase: 100 GPUs × $15K = $1.5M
  • Utilization: 25% (3 weeks idle)
  • Annual cost: $1.5M + datacenter + operations staff

Hyperscaler Infrastructure (AWS, Google, Microsoft)

  • 1M+ servers per provider
  • Resource pooling across thousands of customers
  • Automated management at scale
  • Geographic distribution reduces latency
  • Specialized operations expertise

Result: Rent exactly required resources when needed.

Same Startup with Cloud

  • Rent: 100 GPUs for 1 week = $15K/month
  • Annual cost: $180K vs $1.5M+ ownership
  • Zero idle capacity, no operations overhead

Economics: Hyperscalers achieve 10-20x cost efficiency through scale, specialization, and resource pooling.

Service Layers Abstract Hardware Management

Cloud providers abstract physical complexity into consumable services.

Physical Infrastructure Layer

  • Datacenters: 100K+ servers per facility
  • Networking: 100+ Tbps backbone connectivity
  • Power: Megawatts electrical capacity with redundancy
  • Cooling: Industrial-scale temperature control
  • Security: Physical access controls, biometrics

Virtualization Layer

  • Hypervisors: Multiple virtual machines per physical server
  • Resource isolation: CPU, memory, storage quotas per VM
  • Live migration: Move VMs between physical hosts
  • Resource scheduling: Optimize utilization across fleet

Service Layer

  • Compute: Virtual machines, containers, serverless functions
  • Storage: Object stores, databases, file systems
  • Network: Load balancers, CDNs, private networks
  • Management: Monitoring, logging, billing, security

Application Layer

  • Data pipelines: ETL, feature engineering
  • Model training: Distributed training frameworks
  • Model serving: APIs, batch inference
  • Monitoring: Model performance, data drift

Each layer abstracts thousands of operational details. Application development consumes services without managing underlying infrastructure.

Market Competition Drives Service Innovation

Competition between AWS, Google Cloud, and Microsoft Azure drives innovation and price reductions.

Market Share and Positioning (2024)

Provider         Market Share   Strengths                                ML Focus
AWS              32%            Service breadth, enterprise adoption     SageMaker, comprehensive ML tools
Microsoft Azure  23%            Enterprise integration, hybrid cloud     Azure ML, enterprise AI
Google Cloud     11%            ML/AI innovation, data analytics         Vertex AI, TensorFlow integration
Others           34%            Specialized services, regional players   Various

Competitive Pressures

  1. Regular price cuts to match competitors
  2. New services launched monthly
  3. Performance improvements: faster CPUs, newer GPUs
  4. Geographic expansion: global datacenter buildouts
  5. ML/AI specialization: dedicated hardware and services

Competition Results: 75% price reduction over 10 years, specialized ML hardware, new capabilities quarterly, multiple viable providers prevent vendor lock-in.

Pay-per-Use vs Fixed Infrastructure Costs

Cloud fundamentally changes IT spending from capital investment to operational expense.

Traditional Model: Capital Expenditure

Upfront Investment Requirements

  • Purchase servers, storage, networking equipment
  • Build or lease datacenter space
  • Hire operations and maintenance staff
  • Plan capacity for 3-5 year hardware lifecycle

Financial Characteristics

  • Large upfront costs ($100K-$10M+)
  • Hardware depreciation over 3-5 years
  • Fixed costs regardless of utilization
  • Difficult to scale resources up or down
  • Requires accurate long-term demand forecasting

Example: Startup Scaling Challenge

  • Year 0: Purchase $500K GPU servers for anticipated growth
  • Year 1: Using only 20% of capacity (wasted $400K)
  • Year 2: Need 5x capacity, but hardware already purchased
  • Year 3: Original hardware obsolete, must purchase again

Cloud Model: Operating Expenditure

Pay-as-you-go Model

  • Rent computing resources by hour/minute
  • Scale resources up/down based on actual demand
  • Zero upfront hardware investment
  • Provider handles all operations and maintenance

Financial Characteristics

  • Zero upfront costs (start at $0)
  • Monthly bills based on actual resource usage
  • Variable costs that scale with business growth
  • Easy to experiment and pivot directions
  • Budget aligns with revenue growth

Same Startup Example with Cloud

  • Year 0: Start with $100/month for prototypes
  • Year 1: Scale to $5K/month as usage grows
  • Year 2: Scale to $25K/month for higher usage
  • Year 3: Latest GPU hardware automatically available

OpEx model aligns IT costs with business growth, reducing financial risk and enabling rapid experimentation.

Provider Capacity Exceeds Individual Requirements

Cloud providers maintain resource pools orders of magnitude larger than individual user needs.

AWS Global Infrastructure (2024)

  • Compute: 1M+ physical servers across fleet
  • Storage: 100+ exabytes total capacity
  • Network: 400+ Tbps global backbone bandwidth
  • GPUs: 10K+ H100 equivalents for ML workloads
  • Geographic: 33 regions, 105 availability zones
  • CDN: 450+ edge locations worldwide

Practical Implications

  1. GPU availability: 100 GPUs for training available in minutes
  2. Storage capacity: Multi-TB datasets stored without constraint
  3. Global deployment: Applications deployed worldwide instantly
  4. Traffic handling: 10x traffic surges handled automatically
  5. Disaster recovery: Primary region failure triggers automatic backup

Large Model Training Example

  • Local constraint: Limited to 1-8 GPUs maximum
  • Cloud capability: 100+ GPU cluster available in <30 minutes
  • Cost model: Pay only for actual training time (hours vs years of ownership)

Cloud resources appear unlimited because total provider capacity exceeds individual user needs by orders of magnitude. This enables entirely new categories of ML experiments and applications.

Distributed Architecture Increases Operational Complexity

Cloud computing provides massive capabilities while introducing operational complexity.

Cloud Computing Capabilities

Massive Scalability

  • Virtually unlimited compute, storage, networking access
  • Global deployment in minutes vs months
  • Automatic scaling based on demand patterns

Cost Efficiency

  • Pay only for resources actually consumed
  • Zero upfront capital investment required
  • Economies of scale pricing advantages

Operational Simplicity

  • No hardware maintenance or datacenter operations
  • Automated backups, security patches, monitoring
  • Expert-managed infrastructure operations

Innovation Access

  • Latest hardware available immediately
  • New services and capabilities added continuously
  • Focus on application logic vs infrastructure management

Global Reach

  • Deploy applications worldwide instantly
  • Content delivery networks reduce user latency
  • Compliance with regional data regulations

Required Complexity Management

New Technical Skills

  • Distributed systems concepts and failure modes
  • Cloud service APIs and configuration interfaces
  • Network security and access control systems
  • Multi-service monitoring and debugging techniques

Architecture Changes

  • Design for service-oriented architectures
  • Handle network failures and retry logic
  • Consider costs in all architectural decisions
  • Plan for eventual consistency across services

Operational Overhead

  • Manage dependencies between multiple services
  • Understand billing models and cost optimization
  • Security across multiple service boundaries
  • Troubleshoot failures across distributed systems

Vendor Management

  • Service-specific knowledge (AWS vs Azure vs GCP)
  • Potential vendor lock-in with specialized services
  • Track rapidly evolving platform capabilities
  • Manage multiple service accounts and billing

Cloud computing provides extraordinary capabilities, but success requires learning new concepts and managing operational complexity. Benefits outweigh costs for most production ML applications.

Network-Based Services Replace Local File Access

Cloud programming assumes distributed services rather than single-machine execution.

Local Development Model

# Single-machine assumptions
import torch
import pandas as pd
from flask import Flask

app = Flask(__name__)

# Load data (assumes local files)
data = pd.read_csv('dataset.csv')

# Train model (uses local GPU/CPU); train_model is project-specific
model = train_model(data)

# Save result (local filesystem)
torch.save(model, 'model.pth')

# Serve predictions (single process)
app.run(host='localhost', port=5000)

Assumptions

  • Unlimited local storage access
  • Reliable single machine operation
  • Direct file system access
  • No network latency considerations
  • Single point of failure acceptable

Cloud-Native Development Model

# Distributed service assumptions
import boto3

s3 = boto3.client('s3')

# Load data (from cloud storage)
s3.download_file('bucket', 'dataset.csv', '/tmp/data.csv')

# Train model (on cloud compute)
# (ec2_instance and lambda_function below are illustrative wrappers, not boto3 APIs)
ec2_instance.run_training_job(
    data_location='s3://bucket/dataset.csv'
)

# Save result (to cloud storage)
s3.upload_file('model.pth', 'bucket', 'models/v1.pth')

# Serve predictions (managed service)
lambda_function.deploy(
    model_path='s3://bucket/models/v1.pth'
)

New Assumptions

  • Data stored remotely with network I/O
  • Multiple services fail independently
  • Network latency affects performance
  • Security and permissions required
  • Cost proportional to usage patterns

Cloud development requires designing for network latency, service failures, and distributed data flows.

Distributed Systems Failure Modes

Distributed systems introduce complexity not present in local development.

Network Reliability Constraints

  • Services temporarily unavailable (timeouts, retries required)
  • Data transfer bandwidth and latency limits
  • Must handle connection failures gracefully

Security Requirements Everywhere

  • Access permissions for every service interaction
  • Data encryption in transit and at rest
  • Network security groups and firewall rules

Usage-Based Cost Model

  • Every API call, data transfer, compute hour costs money
  • Poor architectural choices become expensive quickly
  • Continuous monitoring and optimization required

Distributed Debugging Complexity

  • Errors occur across multiple services simultaneously
  • Logs distributed across different systems
  • Troubleshooting requires understanding service interactions

Why This Complexity Exists

Complexity results from solving problems that do not exist in local development:

  • Multi-tenancy: Code runs alongside thousands of other users
  • Global distribution: Data and compute span continents
  • Fault tolerance: Systems handle component failures gracefully
  • Security: Protection against sophisticated attacks and compliance

Cloud development trades local simplicity for global scale and distributed system capabilities.

Cloud Infrastructure and Services

AWS Global Infrastructure: Regions and Availability Zones

Cloud services run on geographically distributed datacenters with specific failure and latency characteristics.

AWS Regions (33 worldwide as of 2024)

Definition: Isolated geographic areas containing multiple datacenters

  • North America: us-east-1 (Virginia), us-west-2 (Oregon), ca-central-1 (Canada)
  • Europe: eu-west-1 (Ireland), eu-central-1 (Frankfurt), eu-north-1 (Stockholm)
  • Asia Pacific: ap-southeast-1 (Singapore), ap-northeast-1 (Tokyo), ap-south-1 (Mumbai)

Region Characteristics

  • Isolation: Complete independence - no shared infrastructure
  • Latency: 150-300ms between distant regions (US-Asia)
  • Compliance: Data residency laws require specific regions
  • Services: Not all AWS services available in all regions
  • Pricing: Different costs per region (Tokyo 20% more expensive than Virginia)

Availability Zones per Region (2-6 AZs)

  • Definition: Separate datacenters within a region
  • Physical separation: 10+ miles apart, separate power/cooling
  • Network: <10ms latency between AZs in same region
  • Failure isolation: AZ failures don’t affect other AZs
  • Examples: us-east-1a, us-east-1b, us-east-1c (Virginia region)

Infrastructure Hierarchy

Global Infrastructure
├── AWS Regions (33)
│   ├── us-east-1 (Virginia)
│   │   ├── us-east-1a (AZ)
│   │   ├── us-east-1b (AZ)
│   │   ├── us-east-1c (AZ)
│   │   ├── us-east-1d (AZ)
│   │   ├── us-east-1e (AZ)
│   │   └── us-east-1f (AZ)
│   ├── us-west-2 (Oregon)
│   │   ├── us-west-2a (AZ)
│   │   ├── us-west-2b (AZ)
│   │   ├── us-west-2c (AZ)
│   │   └── us-west-2d (AZ)
│   ├── eu-west-1 (Ireland)
│   │   ├── eu-west-1a (AZ)
│   │   ├── eu-west-1b (AZ)
│   │   └── eu-west-1c (AZ)
│   ├── ap-southeast-1 (Singapore)
│   │   ├── ap-southeast-1a (AZ)
│   │   ├── ap-southeast-1b (AZ)
│   │   └── ap-southeast-1c (AZ)
│   └── ... (29 more regions)
└── Edge Locations (450+)
    ├── CloudFront CDN
    └── Global Content Delivery

ML System Design Implications

Data Residency Constraints

  • European GDPR: EU citizen data must stay in EU regions
  • Chinese data sovereignty: cn-north-1, cn-northwest-1 required
  • US government: AWS GovCloud (us-gov-east-1, us-gov-west-1)

Multi-AZ Architecture for Availability

  • Training: Data in S3 replicated across AZs automatically
  • Inference: Load balancer distributes across AZ-deployed instances
  • Database: RDS Multi-AZ failover in <60 seconds

Cost vs Latency Trade-offs

  • us-east-1, us-west-2: Cheapest regions, broadest AWS service availability
  • ap-northeast-1: 20-30% more expensive, required for Japan users
  • Cross-region data transfer: $0.09/GB (expensive for large datasets)

ML systems must account for region selection based on data residency, user latency, service availability, and cost constraints.
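
The region and AZ hierarchy above is queryable directly; a minimal boto3 sketch (assuming AWS credentials are already configured) that lists the regions visible to an account and the AZs in one region:

import boto3

# Region name selects the API endpoint; credentials come from the environment
ec2 = boto3.client('ec2', region_name='us-east-1')

regions = [r['RegionName'] for r in ec2.describe_regions()['Regions']]
print(len(regions), 'regions visible to this account')

# Availability zones within the selected region (e.g., us-east-1a ... us-east-1f)
for az in ec2.describe_availability_zones()['AvailabilityZones']:
    print(az['ZoneName'], az['State'])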

Resource Placement: Cost and Latency Trade-offs

Cross-AZ data transfer costs create trade-offs between cost and availability for large ML datasets.

ImageNet Training Cost Impact (1.3TB dataset)

Same AZ Placement

  • Training and database: us-east-1a
  • Data transfer cost: $0
  • Risk: Single AZ failure stops training

Cross-AZ Placement

  • Training: us-east-1a, Database: us-east-1b
  • Cross-AZ transfer: $0.01/GB each direction
  • ImageNet daily training: 1.3TB × $0.01 = $13/day
  • Monthly cost: $400 additional for cross-AZ data access
  • Benefit: Training continues during AZ failure

The Trade-off

  • Same AZ: $0 transfer cost, single point of failure
  • Cross-AZ: $400/month cost, survives AZ outages

Production Architecture Decisions

Training Workloads

  • Co-locate compute and data in same AZ
  • Accept single AZ risk to avoid $400/month transfer costs
  • Use S3 checkpointing for recovery

Inference Services

  • Multi-AZ load balancing for availability
  • Smaller data transfers make cross-AZ costs acceptable
  • Database: $0.20/day for 10GB daily queries

Cross-Region Costs

  • ImageNet replication: 1.3TB × $0.09/GB = $117 one-time
  • Used only for disaster recovery, not daily access

Cross-AZ data transfer at $0.01/GB makes dataset placement a key decision for large-scale ML training.
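
A quick helper for the arithmetic above, using the $0.01/GB cross-AZ rate from this example (actual rates vary; the function name is illustrative):

def cross_az_monthly_cost(dataset_gb, price_per_gb=0.01, passes_per_day=1, days=30):
    """Monthly cost of repeatedly reading a dataset across an AZ boundary."""
    return dataset_gb * price_per_gb * passes_per_day * days

# ImageNet-scale example: 1,300 GB read once per day ≈ $390/month (the ~$400 above)
print(cross_az_monthly_cost(1300))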

Network Latency Replaces Deterministic Local Access

Distributed systems replace instant local operations with network requests.

Local Development Assumptions

  • File read: 0.1ms from SSD
  • Memory access: 0.001ms RAM lookup
  • Function call: 0.0001ms CPU instruction
  • Database query: 1ms SQLite local file

Network Operation Reality

  • S3 object read: 20-50ms average latency
  • EC2 to RDS query: 1-5ms within AZ, 15-25ms cross-AZ
  • Service-to-service API call: 10-100ms depending on load
  • Cross-region data transfer: 150-300ms transcontinental

ML Training Pipeline Impact

  • Local batch loading: 50ms per 1000 images
  • S3 batch loading: 200-500ms per 1000 images
  • Result: 4-10× slower data pipeline, GPU starvation
  • Distributed training coordination: +200ms per epoch synchronization

Network operations introduce 100-2000× latency increase over local operations, requiring different software design patterns.
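
To see the gap on your own setup, a rough timing sketch (the bucket, key, and local file path are placeholders; a single sample is not a benchmark, but the order-of-magnitude difference shows up immediately):

import time
import boto3

s3 = boto3.client('s3')

t0 = time.perf_counter()
with open('/tmp/sample.bin', 'rb') as f:       # local SSD read
    f.read()
local_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
s3.get_object(Bucket='my-ml-bucket', Key='sample.bin')['Body'].read()   # network read
s3_ms = (time.perf_counter() - t0) * 1000

print(f"local: {local_ms:.2f} ms   S3: {s3_ms:.1f} ms   ratio: {s3_ms / local_ms:.0f}x")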

Partial Failures Require New Error Handling

Distributed systems fail differently than single machines.

Single Machine Failure Model

  • Process crash: Complete system failure
  • Out of memory: Entire application stops
  • Disk full: All operations fail immediately
  • Network down: No external connectivity

Recovery: Restart entire system, reload from disk

Distributed System Failure Model

  • Partial node failure: 2 out of 8 training nodes crash
  • Network partition: East coast can’t reach West coast servers
  • Service degradation: S3 returns 10% error rate, not 100%
  • Cascading failures: Database overload causes API timeouts

ML Training Example

8-GPU distributed training job:

  • GPU 3 fails at epoch 47 of 100
  • Options: Stop all GPUs (waste 6 hours) or continue with 7 GPUs
  • Gradient synchronization must handle missing node
  • Checkpoint frequency determines maximum lost work

Error Handling Complexity

# Local development - simple error handling
try:
    data = load_training_data('dataset.csv')
    model = train_model(data)
    save_model(model, 'model.pth')
except Exception as e:
    print("Training failed, restart from beginning")

# Distributed training - complex error handling
# (NodeFailure, NetworkPartition, ServiceDegradation and the helper functions are application-defined)
try:
    nodes = discover_healthy_training_nodes()
    if len(nodes) < MIN_NODES:
        wait_for_node_recovery()
    
    checkpoint = load_latest_checkpoint_if_exists()
    model = train_distributed(data, nodes, checkpoint)
    
except NodeFailure as e:
    # Continue with remaining nodes or wait for replacement
    handle_node_failure(e.failed_node)
except NetworkPartition as e:
    # Pause training until partition heals
    wait_for_network_recovery()
except ServiceDegradation as e:
    # Retry with exponential backoff
    retry_with_backoff(e.failing_service)

Failure Probability Math

  • Single machine: 99.9% monthly uptime
  • 8-machine system: (0.999)^8 = 99.2% all nodes healthy
  • Result: 8× higher chance of partial system failure

Distributed systems require application logic to handle partial failures that never occur in single-machine development.
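
The uptime arithmetic generalizes to any cluster size; a two-function sketch:

def all_nodes_healthy(uptime=0.999, nodes=8):
    """Probability that every node is up at once, assuming independent failures."""
    return uptime ** nodes

print(f"{all_nodes_healthy():.3%} all healthy")          # ~99.2% for 8 nodes
print(f"{1 - all_nodes_healthy():.3%} partial failure")  # ~0.8%, vs 0.1% for one machine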

S3 Hides Data Replication Implementation

Simple API masks complex distributed storage system.

What You See: Simple File Operations

import boto3
s3 = boto3.client('s3')

# Appears like local file system
s3.put_object(Bucket='my-bucket', Key='data.csv', Body=data)
s3.get_object(Bucket='my-bucket', Key='data.csv')
s3.delete_object(Bucket='my-bucket', Key='data.csv')

What AWS Implements Behind the Scenes

Data Replication

  • Automatically copies data to 3+ physical servers
  • Distributes copies across different data centers
  • Maintains 99.999999999% durability (11 9’s)

Consistency Management

  • Coordinates writes across multiple storage nodes
  • Handles read-after-write consistency
  • Provides strong consistency for overwrites and deletes (since December 2020)

Failure Recovery

  • Detects hardware failures within seconds
  • Automatically replaces failed storage nodes
  • Rebuilds lost data copies from remaining replicas

Complexity You Don’t Handle

# What you would need to implement manually:
# 1. Distributed consensus protocol
# 2. Failure detection and recovery
# 3. Data partitioning and replication  
# 4. Consistent hashing for load distribution
# 5. Network protocol for reliable transfer
# 6. Monitoring and alerting systems
# 7. Hardware provisioning and maintenance

Engineering Cost Avoided

  • Distributed systems team: 5-10 engineers × $200K = $1-2M/year
  • Data center operations: $500K+/year facilities cost
  • Hardware replacement: $100K+/year equipment
  • 24/7 on-call rotation: $300K+/year operations staff

vs S3 Cost: $23/TB/month for most workloads

Development Time Savings

  • Building reliable distributed storage: 18-24 months
  • S3 integration: 1-2 days
  • Focus shift: From infrastructure to ML algorithms

S3 provides distributed storage reliability without requiring distributed systems expertise.

Load Balancers Replace Manual Request Distribution

Automatic traffic distribution across multiple servers.

Manual Load Distribution Problems

Single Server Bottleneck

  • 1 EC2 instance: ~1,000 requests/second maximum
  • Model inference: 50-200ms per request
  • Capacity: 5-20 concurrent users before timeouts

Adding Servers Manually

# Deploy model to 3 servers
server1: ec2-1-2-3-4.compute-1.amazonaws.com
server2: ec2-1-2-3-5.compute-1.amazonaws.com  
server3: ec2-1-2-3-6.compute-1.amazonaws.com

# Client must choose which server to call
if server1_healthy:
    call server1
elif server2_healthy:
    call server2
else:
    call server3

Problems:

  • Client needs health check logic
  • Uneven load distribution
  • Manual server replacement on failures

Application Load Balancer Solution

import requests

# Single endpoint for clients
API_ENDPOINT = "https://my-api.elb.amazonaws.com/predict"

# Load balancer handles distribution automatically:
# 1. Health checks servers every 30 seconds
# 2. Routes requests to healthy instances only  
# 3. Distributes load evenly across instances
# 4. Automatically adds/removes instances

response = requests.post(API_ENDPOINT, json=data)

Complexity Abstracted

  • Health monitoring: Automatic detection of failed instances
  • Traffic routing: Weighted round-robin distribution
  • SSL termination: Handles HTTPS certificates automatically
  • Auto scaling integration: Adds servers during traffic spikes

Performance Results

  • 3 instances behind load balancer: 3,000 requests/second capacity
  • Automatic failover: <30 seconds to detect and route around failures
  • Availability: 99.99% with multi-AZ deployment

Client sees single reliable endpoint instead of managing multiple servers.

Load balancers provide high availability and scalability without client-side complexity.

Virtual Resources Replace Physical Infrastructure

Cloud providers abstract physical infrastructure into consumable services.

Traditional Infrastructure Model

  • Purchase physical servers
  • Install operating systems
  • Configure networking equipment
  • Manage storage arrays
  • Handle hardware failures
  • Plan capacity for peak loads

Constraints:

  • Fixed capacity regardless of usage
  • Upfront capital investment required
  • Manual scaling and maintenance
  • Single datacenter deployment

Cloud Service Model

  • Rent virtual resources on-demand
  • Pre-configured software stacks available
  • Managed networking and load balancing
  • Distributed storage with replication
  • Provider handles hardware failures
  • Automatic scaling based on demand

Advantages:

  • Pay only for resources consumed
  • Scale from zero to massive capacity
  • Global deployment in minutes
  • Provider expertise in operations

Core Cloud Service Categories:

  1. Compute: Processing power (CPUs, GPUs, memory)
  2. Storage: Data persistence (files, objects, databases)
  3. Network: Connectivity (load balancers, CDNs, security)

Each category solves specific scaling problems that local infrastructure cannot handle cost-effectively.

Compute Services: Processing Without Hardware Ownership

Compute services provide processing power without hardware ownership.

Virtual Machines (EC2)

  • Complete operating system control
  • Choose CPU, memory, storage, networking
  • Install any software stack
  • Direct SSH/RDP access for development
  • Suitable for existing applications with minimal changes

Containers (ECS/EKS)

  • Application packaging with dependencies
  • Faster startup than virtual machines
  • Resource sharing across containers
  • Orchestration handles scaling and failures
  • Ideal for microservices architectures

Serverless Functions (Lambda)

  • No server management required
  • Automatic scaling to zero and massive concurrency
  • Pay per request execution time
  • Event-driven execution model
  • Best for stateless, short-running tasks

Service selection depends on control requirements, scaling patterns, and operational complexity tolerance.

EC2 Instances Share Physical Servers

EC2 instances are virtual computers running on AWS physical hardware.

What is an EC2 Instance?

  • Virtual machine running on shared physical hardware
  • Complete isolation from other customers’ instances
  • Choose operating system, CPU, memory, storage, networking
  • Full administrative control (root/administrator access)
  • Can install any software, configure any services

Physical to Virtual Mapping

  • Multiple EC2 instances share single physical server
  • Hypervisor manages resource allocation and isolation
  • Each instance appears to have dedicated hardware
  • AWS manages physical hardware maintenance and failures
  • Instances can migrate between physical servers transparently

Instance Lifecycle

  • Stopped: Instance shut down, EBS storage persisted, no compute charges
  • Running: Instance active, incurring compute and storage charges
  • Terminated: Instance deleted, local storage lost, EBS storage can be preserved

EC2 provides the illusion of dedicated hardware while efficiently sharing physical resources among multiple users.
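
Driving that lifecycle from code is three boto3 calls; a sketch (the instance ID is a placeholder, and each call returns immediately while the state change completes asynchronously):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
instance_id = 'i-0123456789abcdef0'   # placeholder

ec2.stop_instances(InstanceIds=[instance_id])       # Stopped: EBS persists, no compute charges
ec2.start_instances(InstanceIds=[instance_id])      # Running: compute and storage charges resume
ec2.terminate_instances(InstanceIds=[instance_id])  # Terminated: instance deleted permanently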

Instance Configuration Determines Functionality and Cost

Four key decisions define every EC2 instance configuration.

1. Amazon Machine Image (AMI)

  • Pre-configured operating system and software stack
  • Ubuntu 22.04, Windows Server 2022, Amazon Linux 2
  • Deep Learning AMIs with ML frameworks pre-installed
  • Custom AMIs with your specific software configurations
  • Determines what software is available when instance starts

2. Instance Type

  • Hardware specification: CPU, memory, storage, networking
  • t3.micro: 2 vCPUs, 1 GB RAM - development/testing
  • m5.large: 2 vCPUs, 8 GB RAM - general purpose applications
  • c5.4xlarge: 16 vCPUs, 32 GB RAM - CPU-intensive workloads
  • p3.2xlarge: 8 vCPUs, 61 GB RAM, 1 GPU - ML training

3. Storage Configuration

  • Root volume: Operating system and applications
  • Additional EBS volumes: Data storage, databases
  • Instance store: Temporary high-speed storage
  • Snapshots: Backup and restore capabilities

4. Network and Security Settings

  • VPC: Virtual network environment
  • Security groups: Firewall rules for inbound/outbound traffic
  • Key pairs: SSH authentication for Linux instances
  • Public IP: Internet accessibility

Configuration Examples:

  • AMI: Ubuntu 22.04 LTS
  • Instance Type: t3.medium
  • Storage: 20 GB root + 100 GB data volume
  • Network: Public IP, SSH key authentication

Configuration Impact on Cost:

  • AMI: Usually free (OS licensing may apply for Windows)
  • Instance type: Primary cost driver ($0.0116-$32.77/hour range)
  • Storage: Additional cost based on size and performance
  • Data transfer: Charges for internet egress traffic

Each configuration choice affects functionality, performance, and monthly costs.
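
The four decisions map onto a single boto3 call; a sketch matching the example configuration above (the AMI ID, key name, and security group ID are placeholders for your own resources):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',               # 1. AMI (placeholder Ubuntu 22.04 image ID)
    InstanceType='t3.medium',                      # 2. Instance type
    BlockDeviceMappings=[{                         # 3. Storage: 20 GB gp3 root volume
        'DeviceName': '/dev/sda1',
        'Ebs': {'VolumeSize': 20, 'VolumeType': 'gp3'},
    }],
    KeyName='my-key',                              # 4. Network/security: SSH key pair
    SecurityGroupIds=['sg-0123456789abcdef0'],     #    and firewall rules (placeholder)
    MinCount=1,
    MaxCount=1,
)
print(response['Instances'][0]['InstanceId'])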

AMIs: Pre-configured Operating Environments

Amazon Machine Images provide the foundation software for EC2 instances.

What AMIs Contain

  • Operating System: Linux distributions, Windows versions
  • System Software: Device drivers, networking stack, AWS tools
  • Application Software: Web servers, databases, ML frameworks
  • Configuration: Users, permissions, startup scripts
  • Customizations: Your specific software installations and settings

AMI Categories

  • AWS-provided: Maintained by Amazon, regular security updates
  • Marketplace AMIs: Third-party vendors, specialized software stacks
  • Community AMIs: Shared by other AWS users, use with caution
  • Custom AMIs: Your own snapshots of configured instances

Deep Learning AMI Features

  • Pre-installed ML frameworks: TensorFlow, PyTorch, MXNet, Hugging Face
  • CUDA drivers and cuDNN for GPU acceleration
  • Conda environments for different framework versions
  • Jupyter notebook server pre-configured
  • Development tools: git, vim, tmux, htop

AMI Selection Impact

  • Launch time: Custom AMIs start faster than base images
  • Maintenance: AWS AMIs get security updates, custom AMIs require manual updates
  • Storage cost: Larger AMIs cost more to store and transfer
  • Compatibility: Must match instance architecture (x86, ARM, GPU support)

AMI choice significantly impacts development velocity, operational overhead, and ongoing maintenance requirements.

Instance Types Optimize Hardware for Workload Patterns

EC2 provides hundreds of instance configurations optimized for different workload patterns.

General Purpose Instances (t3, m5, m6i)

  • Balanced CPU, memory, networking
  • t3.medium: 2 vCPUs, 4GB RAM, $0.0416/hour
  • m5.large: 2 vCPUs, 8GB RAM, $0.096/hour
  • Suitable for web servers, development environments

Compute Optimized (c5, c6i)

  • High-performance processors
  • c5.large: 2 vCPUs, 4GB RAM, $0.085/hour
  • 3.4 GHz sustained all-core frequency
  • Ideal for CPU-intensive ML inference

Memory Optimized (r5, x1e)

  • High memory-to-CPU ratios
  • r5.large: 2 vCPUs, 16GB RAM, $0.126/hour
  • x1e.xlarge: 4 vCPUs, 122GB RAM, $0.834/hour
  • Required for large dataset processing

Storage Optimized (i3, i4i)

  • NVMe SSD storage with high IOPS
  • i3.large: 2 vCPUs, 15.25GB RAM, 475GB NVMe, $0.156/hour
  • Up to 3.3 million IOPS per instance
  • Database workloads and distributed file systems

Instance selection balances CPU performance, memory capacity, storage speed, and hourly cost based on workload requirements.

GPU Instances: Parallel Processing for ML Workloads

GPU instances provide parallel processing power for ML training and inference.

GPU Instance Families

p4d Instances: Latest ML Training

  • NVIDIA A100 GPUs (40GB memory each)
  • p4d.24xlarge: 8x A100, 96 vCPUs, 1152GB RAM
  • 400 Gbps networking for multi-node training
  • $32.77/hour for 8 GPU instance

p3 Instances: General ML Workloads

  • NVIDIA V100 GPUs (16GB memory each)
  • p3.2xlarge: 1x V100, 8 vCPUs, 61GB RAM
  • 25 Gbps networking
  • $3.06/hour for single GPU

g4 Instances: ML Inference

  • NVIDIA T4 GPUs (16GB memory each)
  • g4dn.xlarge: 1x T4, 4 vCPUs, 16GB RAM
  • Optimized for inference workloads
  • $0.526/hour for single GPU

Current Limitations:

  • Limited availability in some regions
  • Requires reservation for large-scale training
  • High cost for continuous operation

GPU selection depends on model size, training duration, and budget constraints. Latest hardware provides better performance-per-dollar for large-scale training.

AMI Selection Impacts Launch Time and Maintenance

AMIs provide pre-built operating system and software configurations.

Base Operating System Images

  • Ubuntu Server 22.04 LTS: Standard Linux distribution
  • Amazon Linux 2: AWS-optimized with pre-installed AWS tools
  • Windows Server 2022: Microsoft environment for .NET applications
  • Red Hat Enterprise Linux: Enterprise-grade Linux support

Deep Learning AMIs

  • AWS Deep Learning AMI (Ubuntu): Pre-installed ML frameworks
    • PyTorch, TensorFlow, MXNet, Hugging Face Transformers
    • CUDA drivers and cuDNN for GPU acceleration
    • Jupyter notebooks and development tools
  • AWS Deep Learning Containers: Docker images for specific frameworks
  • NVIDIA NGC Images: Optimized containers for ML workloads

Custom AMIs

  • Create snapshots of configured instances
  • Share AMIs across accounts or make public
  • Version control for deployment consistency
  • Faster instance launch with pre-installed software

AMI Selection Strategy:

  1. Start with Deep Learning AMI for ML workloads
  2. Use base Ubuntu for custom configurations
  3. Create custom AMI after environment setup
  4. Consider regional availability and update frequency

AMI selection significantly impacts instance launch time, configuration complexity, and ongoing maintenance requirements.
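
Capturing a custom AMI from a configured instance is one call; a sketch (the instance ID and names are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

image = ec2.create_image(
    InstanceId='i-0123456789abcdef0',          # a configured instance (placeholder ID)
    Name='ml-env-pytorch-v1',                  # versioned name for deployment consistency
    Description='Ubuntu + PyTorch + project dependencies',
)
print(image['ImageId'])   # new AMI ID, launchable like any other AMI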

Key Pairs Enable SSH Access

Key pairs provide secure authentication for connecting to EC2 instances without passwords.

AWS Key Pair Integration

  • Launch Requirement: Must specify key pair when creating instance
  • No Password Access: AWS disables password authentication by default
  • Region Specific: Key pairs only available in the region where created
  • Instance Metadata: Public key automatically installed in ~/.ssh/authorized_keys

Key Pair Management

  • AWS Generated: EC2 console creates key pair, you download .pem file
  • Import Existing: Upload your existing public key to AWS
  • One-time Download: Private key only available at creation time
  • No Recovery: Lost private key = permanent loss of access

Access Patterns

  • Single User: One key pair for personal development instances
  • Team Access: Multiple team members’ public keys imported separately
  • Service Access: Dedicated key pairs for automated tools and CI/CD
  • Environment Separation: Different keys for dev/staging/production

Key pairs cannot be added to running instances - losing your private key requires instance replacement or complex recovery procedures.
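
A sketch of creating a key pair and saving the one-time private key (the key name and output path are up to you):

import os
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Private key material is only returned at creation time - it cannot be retrieved later
key = ec2.create_key_pair(KeyName='ee547-dev-key')

pem_path = os.path.expanduser('~/.ssh/ee547-dev-key.pem')
with open(pem_path, 'w') as f:
    f.write(key['KeyMaterial'])
os.chmod(pem_path, 0o600)   # SSH refuses private keys with loose permissions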

Security Groups Control Instance Network Access

Security groups act as virtual firewalls controlling inbound and outbound traffic to EC2 instances.

Inbound Rules (Traffic TO Your Instance)

  • SSH (Port 22): Administrative access for configuration and debugging
  • HTTP (Port 80): Web traffic for API endpoints and web applications
  • HTTPS (Port 443): Encrypted web traffic for production services
  • Custom Ports: Application-specific services (Jupyter: 8888, TensorBoard: 6006)

Outbound Rules (Traffic FROM Your Instance)

  • HTTPS (Port 443): Download packages, access S3, API calls
  • HTTP (Port 80): Software updates and package repositories
  • DNS (Port 53): Domain name resolution
  • Database Ports: Connection to RDS or external databases

Source and Destination Options

  • Your IP Address: Restrict access to your current location only
  • Anywhere (0.0.0.0/0): Allow access from entire internet
  • Other Security Groups: Reference groups for multi-tier applications
  • VPC CIDR Block: Allow access from within your virtual network

Security Group Strategy:

  1. Start with restrictive rules (SSH from your IP only)
  2. Add specific ports as needed for your application
  3. Use security group references for multi-tier architectures
  4. Never use 0.0.0.0/0 for SSH or database access

Security groups require explicit configuration for each network service your ML application needs to access or provide.
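
A sketch of the restrictive-first strategy above, using boto3: create a group, allow SSH only from one admin IP, and open HTTPS to clients (the VPC ID and IP address are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

sg = ec2.create_security_group(
    GroupName='ml-api-sg',
    Description='ML serving instances',
    VpcId='vpc-0123456789abcdef0',             # placeholder VPC ID
)

ec2.authorize_security_group_ingress(
    GroupId=sg['GroupId'],
    IpPermissions=[
        # SSH restricted to a single admin IP
        {'IpProtocol': 'tcp', 'FromPort': 22, 'ToPort': 22,
         'IpRanges': [{'CidrIp': '203.0.113.10/32'}]},
        # HTTPS open to the internet for the API endpoint
        {'IpProtocol': 'tcp', 'FromPort': 443, 'ToPort': 443,
         'IpRanges': [{'CidrIp': '0.0.0.0/0'}]},
    ],
)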

Cloud Storage: Durability and Global Access

Cloud storage services provide durability, scalability, and global accessibility.

Object Storage (S3)

  • Store files as objects in buckets
  • Globally unique bucket names
  • REST API access from any location
  • 99.999999999% (11 9’s) durability
  • Automatic replication across facilities

Block Storage (EBS)

  • Virtual hard drives for EC2 instances
  • High IOPS performance for databases
  • Snapshot backup and restoration
  • Encryption at rest and in transit
  • Multiple volume types for different use cases

File Systems (EFS)

  • Network File System (NFS) compatible
  • Shared access across multiple instances
  • Automatic scaling to petabyte capacity
  • POSIX file system semantics
  • Suitable for distributed applications

Database Services (RDS, DynamoDB)

  • Managed relational databases (MySQL, PostgreSQL)
  • NoSQL for high-scale applications
  • Automated backups and patching
  • Multi-region replication
  • Performance monitoring and optimization

Storage service selection depends on access patterns, performance requirements, durability needs, and cost constraints.

Storage Services Abstract Physical Disks

Cloud storage abstracts physical disks into managed services with different access patterns.

Traditional Storage Model

  • Physical hard drives attached to servers
  • Direct file system access (NTFS, ext4)
  • Local RAID for redundancy
  • Manual backup and recovery
  • Fixed capacity planning

Cloud Storage Model

  • Storage services accessed over network APIs
  • Provider manages physical infrastructure
  • Automatic replication and durability
  • Pay-per-GB pricing with instant scaling
  • Different services optimized for specific use cases

Key Cloud Storage Concepts

Durability: How likely data survives hardware failures

  • Local disk: ~99% (1% annual failure rate)
  • Cloud storage: 99.999999999% (11 9’s) through replication

Consistency: When all copies reflect the same data

  • Strong consistency: All reads return latest write immediately
  • Eventual consistency: All copies eventually consistent, may be stale briefly

Access Patterns: How applications read and write data

  • Random access: Database queries, frequent small reads/writes
  • Sequential access: Log files, backups, large file streaming
  • Infrequent access: Archives, disaster recovery, compliance data

Storage Service Categories

Block Storage (EBS)

  • Virtual hard drives for EC2 instances
  • Raw block device, requires file system
  • High IOPS for databases and applications
  • Can attach/detach from instances
  • Snapshots for backup and cloning

Object Storage (S3)

  • Files stored as objects with metadata
  • REST API access from anywhere
  • Virtually unlimited capacity
  • Multiple storage classes for cost optimization
  • Global replication and CDN integration

File Storage (EFS)

  • Traditional file system semantics (POSIX)
  • Multiple instances access simultaneously
  • Automatic scaling to petabytes
  • Network File System (NFS) protocol
  • Shared access for distributed applications

Database Storage (RDS)

  • Managed database engines
  • Automatic backups and point-in-time recovery
  • Multi-AZ deployment for high availability
  • Read replicas for scale-out
  • Provider handles maintenance and patching

Storage Selection Criteria: Access frequency, performance requirements, sharing needs, backup/recovery, and cost sensitivity.

S3 Operational Complexity Exceeds Simple File Storage

S3 appears simple but involves significant operational complexity.

Why S3 Isn’t “Just File Storage”

Global Namespace and Regions

  • Bucket names must be globally unique across all AWS accounts
  • Data stored in specific geographic regions
  • Cross-region data transfer costs $0.02/GB
  • Latency varies significantly by region (20ms local, 200ms+ cross-continent)

Access Control Complexity

  • Bucket policies control who can access data
  • IAM roles define service permissions
  • Access Control Lists (ACLs) for fine-grained control
  • Pre-signed URLs for temporary access
  • Misconfigured permissions cause security breaches

Consistency and Performance Models

  • Strong read-after-write consistency for all objects, including overwrites and deletes (since December 2020)
  • Request rate limits: 3,500 PUT/COPY/POST/DELETE, 5,500 GET/HEAD per prefix per second
  • Hotspotting when many requests target same key prefix

Storage Classes and Cost Optimization

  • Standard: $0.023/GB/month, immediate access
  • Infrequent Access: $0.0125/GB/month, retrieval fees
  • Glacier: $0.004/GB/month, minutes to hours retrieval
  • Lifecycle policies automatically transition data

S3 operational complexity includes regional data placement, access control management, performance optimization, and cost management across multiple storage classes.
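
Lifecycle policies are a small configuration object; a sketch that implements the Standard → Infrequent Access → Glacier transitions listed above (bucket name and prefix are placeholders):

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-ml-bucket',                                  # placeholder bucket
    LifecycleConfiguration={'Rules': [{
        'ID': 'archive-old-training-runs',
        'Status': 'Enabled',
        'Filter': {'Prefix': 'training-runs/'},
        'Transitions': [
            {'Days': 30, 'StorageClass': 'STANDARD_IA'},    # infrequent access after 30 days
            {'Days': 90, 'StorageClass': 'GLACIER'},        # archive after 90 days
        ],
    }]},
)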

Network Services Enable Secure Component Communication

Cloud networking enables secure, scalable communication between services.

Virtual Private Cloud (VPC)

  • Isolated network environment in AWS
  • Define IP address ranges (CIDR blocks)
  • Public and private subnets
  • Control traffic with security groups and NACLs
  • Connect to on-premises networks via VPN

Load Balancers

  • Application Load Balancer (ALB): HTTP/HTTPS traffic, Layer 7 routing
  • Network Load Balancer (NLB): TCP/UDP traffic, ultra-low latency
  • Gateway Load Balancer (GWLB): Third-party security appliances
  • Health checks and automatic failover
  • SSL/TLS termination and certificate management

Content Delivery Network (CloudFront)

  • Global edge locations reduce latency
  • Cache static content closer to users
  • Dynamic content acceleration
  • DDoS protection and security features
  • Integration with AWS services

DNS and Service Discovery

  • Route 53 for domain name management
  • Health checks and failover routing
  • Service discovery for microservices
  • Geographic and latency-based routing

Networking services reduce latency, improve reliability, and provide security for distributed applications across global infrastructure.

Serverless Executes Code Without Server Management

Serverless computing executes code without server management or capacity planning.

Traditional Server-Based Model

  • Provision EC2 instances for expected peak load
  • Install runtime environments and dependencies
  • Deploy application code to servers
  • Monitor server health and scaling
  • Pay for server uptime regardless of usage

Serverless Execution Model

  • Upload code to serverless platform
  • Platform handles all infrastructure automatically
  • Code executes in response to events/requests
  • Automatic scaling from zero to thousands of concurrent executions
  • Pay only for actual execution time and requests

Key Serverless Concepts

Function as a Service (FaaS): Code runs as stateless functions

  • Each function execution is independent
  • No persistent local storage between invocations
  • Runtime environment created/destroyed for each execution

Event-Driven Architecture: Functions triggered by events

  • HTTP requests via API Gateway
  • File uploads to S3 storage
  • Database changes, queue messages, scheduled timers
  • Functions can trigger other functions

Cold Starts: Initialization delay for new function instances

  • Platform creates new runtime environment
  • Downloads code package and dependencies
  • Initializes programming language runtime
  • 100ms-1000ms latency penalty for first execution

Serverless Service Categories

Compute Functions (Lambda)

  • Execute code in response to events
  • Supported languages: Python, Node.js, Java, C#, Go, Ruby
  • 15-minute maximum execution time
  • 10GB maximum memory allocation

API Management (API Gateway)

  • REST and WebSocket API endpoints
  • Request/response transformation
  • Authentication and authorization
  • Rate limiting and usage monitoring

Database Services (DynamoDB)

  • NoSQL database with automatic scaling
  • Single-digit millisecond latency
  • Pay-per-request pricing model
  • Global tables for multi-region deployment

Storage and Messaging

  • S3: Object storage with event triggers
  • SQS: Message queues for asynchronous processing
  • SNS: Publish/subscribe messaging service
  • EventBridge: Event routing between services

Development and Deployment

  • SAM: Serverless Application Model for infrastructure as code
  • X-Ray: Distributed tracing for debugging
  • CloudWatch: Logging and monitoring
  • CodePipeline: CI/CD for serverless applications

Serverless Trade-offs: No server management vs execution time limits, automatic scaling vs cold starts, pay-per-use vs potentially higher costs at scale.

Lambda Constraints Limit ML Workload Suitability

Lambda provides specific implementation of serverless computing with constraints for ML workloads.

Lambda Execution Model

  • Event-driven function execution
  • Automatic scaling from zero to thousands of concurrent executions
  • Pay only for actual compute time (1ms billing increments)
  • No server provisioning or maintenance required
  • Supports Python, Node.js, Java, C#, Go, Ruby, custom runtimes

Lambda Limitations for ML Workloads

  • Execution time: 15-minute maximum duration
  • Memory: 10GB maximum allocation
  • Storage: 512MB /tmp by default (configurable up to 10GB ephemeral storage)
  • Package size: 50MB zipped, 250MB unzipped
  • Cold starts: 100ms+ initialization delay for new instances

Suitable ML Use Cases

  • Real-time inference for small models (<250MB)
  • Image preprocessing and data transformation
  • Model serving behind API Gateway
  • Event-driven data processing pipelines
  • Feature extraction from streaming data

Not Suitable for:

  • Large model training (memory and time constraints)
  • Models requiring GPU acceleration
  • Long-running data processing jobs
  • Applications requiring persistent connections

Lambda provides cost-effective serverless computing for event-driven ML tasks but has significant constraints for large-scale model operations.

Integration Patterns Connect Services Through APIs

Cloud services connect through APIs, events, and data flows.

Request-Response Pattern

  • Direct API calls between services
  • Synchronous communication
  • EC2 → S3 for data retrieval
  • Application Load Balancer → EC2 instances
  • Suitable for real-time interactions

Event-Driven Pattern

  • Asynchronous message passing
  • S3 triggers Lambda on object upload
  • CloudWatch Events schedule functions
  • SQS queues decouple services
  • Handles variable load and failures

Data Pipeline Pattern

  • Sequential processing stages
  • S3 → Lambda → DynamoDB
  • ECS tasks process batch jobs
  • Step Functions orchestrate workflows
  • Supports complex data transformations

Shared Storage Pattern

  • Multiple services access common data
  • EFS for shared file access
  • RDS for transactional data
  • S3 for object sharing
  • ElastiCache for session storage

Integration pattern selection depends on latency requirements, failure tolerance, and operational complexity constraints.
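
A minimal sketch of the event-driven pattern: a Lambda handler invoked by S3 ObjectCreated notifications, processing each uploaded object (the bucket, output prefix, and processing step are placeholders):

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    """Invoked by S3 upload events; one processing step per new object."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        s3.download_file(bucket, key, '/tmp/input')
        # ... feature extraction / preprocessing on /tmp/input (placeholder) ...
        s3.upload_file('/tmp/input', bucket, f'processed/{key}')
    return {'processed': len(event['Records'])}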

Memory Limits Impose Service Constraints

Lambda 10GB memory limit prevents large model deployment.

Lambda Memory Constraint

  • Maximum allocation: 10GB RAM
  • PyTorch model loading overhead: 2x model size
  • Practical model size limit: 4-5GB maximum

Large Language Models

  • GPT-3.5: 13GB model weights
  • Llama-2 7B: 14GB model weights
  • Llama-2 13B: 26GB model weights
  • BERT Large: 1.3GB model weights

Result: Lambda cannot load models >4GB

Cold Start Penalty

Models >250MB face initialization delays:

  • 1GB model: 2-3 second cold start
  • 4GB model: 8-12 second cold start
  • Timeout before first request completion
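
The usual mitigation is to pay the load cost once per container rather than once per request: pull the model at module scope so warm invocations reuse it. A sketch for a small CPU model that fits Lambda's memory and /tmp limits (bucket, key, and the saved-model format are placeholders):

import boto3
import torch

s3 = boto3.client('s3')

# Runs once per container (cold start), then is reused by every warm invocation
s3.download_file('my-ml-bucket', 'models/small_model.pth', '/tmp/model.pth')
model = torch.load('/tmp/model.pth', map_location='cpu')   # placeholder: full model saved with torch.save
model.eval()

def lambda_handler(event, context):
    with torch.no_grad():
        x = torch.tensor(event['input'])
        return {'prediction': model(x).tolist()}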

EC2 Memory Capacity

Instance Memory Range

  • t3.micro: 1GB RAM ($8.76/month)
  • r5.large: 16GB RAM ($90.72/month)
  • r5.24xlarge: 768GB RAM ($4,343.04/month)
  • u-6tb1.metal: 6TB RAM ($17,971.20/month)

Model Deployment Examples

  • BERT Large (1.3GB): Runs on t3.small (2GB)
  • Llama-2 7B (14GB): Requires r5.large minimum
  • Llama-2 70B (140GB): Requires r5.24xlarge minimum

Memory vs Cost Trade-off

  • 16GB instance: $91/month
  • 768GB instance: $4,343/month (48x cost for 48x memory)

EC2 supports any practical model size with appropriate instance selection.

Memory requirements determine compute service viability before performance or cost considerations.

Execution Time Limits Block Training

Lambda 15-minute timeout eliminates ML training.

Lambda Execution Limits

  • Maximum execution time: 15 minutes (900 seconds)
  • Cannot be extended or renewed
  • Process terminated with no checkpoint saving
  • Suitable for inference only, never training

Typical ML Training Duration

Small Models (ImageNet Classification)

  • ResNet-50: 2-4 hours on single GPU
  • EfficientNet-B0: 1-2 hours on single GPU
  • Training epochs: 100-300 typical

Large Models (Language Models)

  • GPT-2 Small: 24-48 hours on 8 GPUs
  • BERT Base: 4-16 hours on 16 GPUs
  • Llama-2 7B: 184 hours on 64 GPUs

Fine-tuning Duration

  • BERT fine-tuning: 30-120 minutes
  • GPT-3.5 fine-tuning: 60-240 minutes
  • Still exceeds Lambda limit

EC2 Training Capability

Unlimited Execution Time

  • No timeout constraints
  • Training runs for days or weeks
  • Automatic checkpointing to S3 for failure recovery

Training Cost Examples

ResNet-50 on p3.2xlarge ($3.06/hour)

  • Training time: 3 hours
  • Total cost: $9.18

GPT-2 Small on p3.8xlarge ($12.24/hour)

  • Training time: 48 hours
  • Total cost: $587.52

BERT Base on p3.16xlarge ($24.48/hour)

  • Training time: 8 hours
  • Total cost: $195.84

Spot Instance Savings

  • Same instances: 70% discount
  • BERT training: $195.84 → $58.75
  • Risk: Training interruption every 2-6 hours

15-minute execution limit makes Lambda unsuitable for any ML training workload.

Storage Request Limits Create Bottlenecks

S3 request rate limits constrain high-throughput workloads.

S3 Request Rate Limits

Per-Prefix Limits

  • PUT/COPY/POST/DELETE: 3,500 requests/second
  • GET/HEAD: 5,500 requests/second
  • Prefix = everything before last “/” in object key

Distributed Training Impact

100-GPU Training Job

  • Each GPU requests 10 data batches/second
  • Total requests: 1,000/second
  • Within S3 limits if properly prefixed

1000-GPU Training Job

  • Each GPU requests 10 data batches/second
  • Total requests: 10,000/second
  • Exceeds S3 GET limit by 82%
  • Result: Training stalls waiting for data

Request Hotspotting

  • Single prefix: /training-data/imagenet/
  • All 1000 GPUs hit same prefix
  • Requests throttled, training blocked
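
The standard workaround is to spread objects across many key prefixes so the per-prefix limits apply per shard; a sketch (the shard count and key layout are illustrative):

import hashlib

NUM_SHARDS = 64   # 64 prefixes x 5,500 GET/s ≈ 350K GET/s aggregate ceiling

def sharded_key(filename):
    """Prepend a stable, hash-based shard so requests spread across prefixes."""
    shard = int(hashlib.md5(filename.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"training-data/shard-{shard:02d}/{filename}"

print(sharded_key('imagenet/n01440764_10026.JPEG'))
# -> training-data/shard-NN/imagenet/n01440764_10026.JPEG  (NN in 00-63)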

EBS IOPS Limitations

Volume Type Performance

EBS Type   Max IOPS   Max Throughput   Cost/Month (100GB)
gp3        16,000     1,000 MB/s       $8.00
io2        64,000     1,000 MB/s       $65.00
gp2        10,000     250 MB/s         $10.00

Database Workload Impact

PostgreSQL with 1M records/second inserts

  • Required IOPS: 50,000-80,000
  • gp3 volume: Cannot support workload
  • io2 volume: $65/month + $3,250 IOPS charges = $3,315/month

Machine Learning Dataset Loading

  • ImageNet (1.2M images): 500 MB/s sequential read
  • gp3 volume: Supports workload at $8/month
  • Random access training: Requires higher IOPS

Multi-Instance Sharing

  • EBS limitation: Single attachment point
  • Cannot share between training instances
  • Requires data replication or network storage

Storage performance limits determine data access patterns and training architecture.

Cost Models Favor Different Usage Patterns

Lambda pay-per-request vs EC2 always-on pricing.

Usage Pattern Analysis

Scenario 1: Sporadic Inference (100 requests/day)

Lambda Costs

  • Requests: 100/day × 30 days = 3,000/month
  • Duration: 200ms average per request
  • Memory: 1GB allocated
  • Monthly cost: $0.60

EC2 Alternative (t3.micro always-on)

  • Instance cost: $8.76/month
  • Always running regardless of usage
  • 14.6x more expensive than Lambda

Break-even point: 1,460 requests/day

Scenario 2: High-Volume Inference (100,000 requests/day)

Lambda Costs

  • Requests: 3M/month
  • Monthly cost: $600

EC2 Alternative (c5.large)

  • Instance cost: $61.32/month
  • Can handle 100,000 requests/day
  • 10x cheaper than Lambda

Cost Crossover Points

Request Volume Thresholds

Instance Type   Monthly Cost   Lambda Break-even
t3.nano         $4.38          730 req/day
t3.micro        $8.76          1,460 req/day
t3.small        $17.52         2,920 req/day
c5.large        $61.32         10,220 req/day

Memory Impact on Lambda Costs

Memory   Cost per GB-second   1M req/month cost
128MB    Base rate            $200
1GB      8x base              $1,600
3GB      24x base             $4,800
10GB     80x base             $16,000

Duration Impact

  • 100ms execution: $200/million requests
  • 1 second execution: $2,000/million requests
  • 10 second execution: $20,000/million requests

Cost optimization requires matching service pricing model to actual usage patterns.
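
A sketch of the break-even arithmetic, using the approximate per-request cost implied by the scenarios above (about $0.0002 per 1GB, 200ms request; real Lambda pricing depends on memory, duration, and per-request charges):

def lambda_monthly_cost(requests_per_day, cost_per_request=0.0002, days=30):
    return requests_per_day * days * cost_per_request

def break_even_requests_per_day(ec2_monthly_cost, cost_per_request=0.0002, days=30):
    return ec2_monthly_cost / (cost_per_request * days)

print(lambda_monthly_cost(100))            # ~$0.60/month (scenario 1)
print(break_even_requests_per_day(8.76))   # ~1,460 requests/day vs an always-on t3.micro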

Service Constraints Determine Architecture

Hard limits eliminate service options before cost optimization.

Constraint Hierarchy

1. Hard Constraints (Service Elimination)

  • Memory > 10GB → Lambda impossible
  • Execution > 15 minutes → Lambda impossible
  • Shared storage access → S3 or EFS required
  • Above 64,000 IOPS → Multiple EBS volumes required

2. Performance Constraints (Service Selection)

  • <100ms latency → Pre-warmed instances required
  • Above 5,500 requests/second per prefix → S3 prefix distribution required
  • Above 16,000 IOPS → io2 volumes required

3. Cost Constraints (Configuration Optimization)

  • Variable load → Lambda or auto-scaling preferred
  • Consistent load → Reserved instances preferred
  • Development → Spot instances acceptable

Real Architecture Decisions

Large Model Serving (7GB model)

  1. Memory constraint eliminates Lambda
  2. Always-on requirement eliminates spot instances
  3. Load balancing required for availability
  4. Result: EC2 + ALB + Auto Scaling Group

Batch Processing (2-hour jobs)

  1. Execution time eliminates Lambda
  2. Intermittent usage favors spot instances
  3. Job queuing handles interruptions
  4. Result: EC2 Spot + SQS + Auto Scaling

Service constraints determine feasible architectures; cost considerations optimize within remaining options.
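
The hierarchy reads as an elimination procedure; a sketch using the Lambda limits quoted in this unit (the function, thresholds, and 2x loading-overhead rule are illustrative):

def feasible_compute(model_gb, max_runtime_min, needs_gpu):
    """Apply hard constraints first; return the services still on the table."""
    options = {'lambda', 'ec2'}
    if model_gb * 2 > 10:        # ~2x loading overhead vs the 10 GB Lambda memory cap
        options.discard('lambda')
    if max_runtime_min > 15:     # 15-minute Lambda execution limit
        options.discard('lambda')
    if needs_gpu:                # Lambda has no GPU support
        options.discard('lambda')
    return options

print(feasible_compute(model_gb=7, max_runtime_min=1, needs_gpu=False))     # {'ec2'}
print(feasible_compute(model_gb=0.3, max_runtime_min=1, needs_gpu=False))   # both remain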

Cloud ML System Design

From Local Scripts to Cloud Services

Transform single-machine PyTorch workflows into systems using EC2 and S3.

Local Development Workflow

# Everything on laptop
import torch
import pandas as pd
from flask import Flask

app = Flask(__name__)

# Load data (local file)
data = pd.read_csv('dataset.csv')

# Train model (local GPU); train_pytorch_model is project-specific
model = train_pytorch_model(data)

# Save model (local disk)
torch.save(model, 'model.pth')

# Serve predictions (local process)
app.run(host='localhost', port=5000)

Local Constraints:

  • Data limited by disk space (2TB max)
  • Training limited by GPU memory (24GB)
  • Serving limited to single user
  • No backup or redundancy
  • Cannot scale beyond one machine

Cloud Workflow Using EC2 + S3

# Distributed across services
import boto3
import torch
import pandas as pd

s3 = boto3.client('s3')

# Load data (from S3)
s3.download_file('ml-bucket', 'dataset.csv', '/tmp/dataset.csv')
data = pd.read_csv('/tmp/dataset.csv')

# Train model (EC2 with GPU); train_pytorch_model is project-specific
model = train_pytorch_model(data)

# Save model (to S3)
torch.save(model, '/tmp/model.pth')
s3.upload_file('/tmp/model.pth', 'ml-bucket', 'models/model.pth')

# Serve predictions (Lambda + S3)
def lambda_handler(event, context):
    s3.download_file('ml-bucket', 'models/model.pth', '/tmp/model.pth')
    model = torch.load('/tmp/model.pth')
    return model.predict(event['input'])

Cloud Capabilities:

  • Data storage scales to petabytes (S3)
  • Training scales to multiple GPUs (EC2)
  • Serving handles thousands of users (Lambda)
  • Automatic backup and replication (S3)
  • Pay only for resources used

EC2 instances and S3 buckets require API integration and IAM configuration for functional ML systems.

Basic ML System Architecture

Simple ML system using EC2 for training and Lambda for serving.

Component Design

Data Storage (S3)

  • Training data: s3://ml-bucket/data/
  • Model artifacts: s3://ml-bucket/models/
  • Predictions: s3://ml-bucket/results/

Training Infrastructure (EC2)

  • Instance type: p3.2xlarge (1 GPU, 8 vCPUs, 61GB RAM)
  • AMI: Deep Learning AMI (Ubuntu) with PyTorch pre-installed
  • Storage: 100GB EBS volume for temporary data
  • IAM role: S3 read/write permissions

Serving Infrastructure (Lambda)

  • Runtime: Python 3.9
  • Memory: 3GB (for model loading)
  • Timeout: 30 seconds
  • Trigger: API Gateway HTTP requests

System Data Flow

  1. Upload training data to S3 bucket
  2. Launch EC2 instance with training script
  3. EC2 downloads data from S3, trains model
  4. EC2 uploads trained model back to S3
  5. Lambda function loads model from S3 for predictions
  6. API Gateway routes prediction requests to Lambda

Total monthly cost: ~$330 for moderate ML workload with occasional training and regular serving.

Training System Design

EC2-based training system with S3 data management.

Training Job Configuration

EC2 Instance Setup

# Launch instance
aws ec2 run-instances \
  --image-id ami-0c02fb55956c7d316 \
  --instance-type p3.2xlarge \
  --key-name my-key \
  --security-group-ids sg-12345678

# SSH and setup
ssh -i my-key.pem ubuntu@instance-ip
sudo apt update && sudo apt install awscli

Training Script Structure

#!/usr/bin/env python3
import os
import boto3
import torch

# Download training data
s3 = boto3.client('s3')
os.makedirs('data', exist_ok=True)
s3.download_file('ml-bucket', 'train.csv', 'data/train.csv')

# Load and train (load_data, MyModel, train_model are project-specific placeholders)
data = load_data('data/train.csv')
model = MyModel()
train_model(model, data, epochs=100)

# Upload trained model
torch.save(model.state_dict(), 'model.pth')
s3.upload_file('model.pth', 'ml-bucket', 'models/model_v1.pth')

# Shut the instance down so compute billing stops
os.system('sudo shutdown -h now')

Cost Optimization

  • Use spot instances for 70% cost reduction
  • Terminate instance when training completes
  • Use appropriate instance size for model

Training Performance Analysis

Model Size             Local (RTX 4090)   EC2 (p3.2xlarge)   EC2 Cost
Small (10M params)     2 hours            1.5 hours          $4.59
Medium (100M params)   8 hours            6 hours            $18.36
Large (1B params)      Cannot fit         24 hours           $73.44

Training Workflow

  1. Prepare training data locally
  2. Upload data to S3 bucket
  3. Launch EC2 instance with training script
  4. Monitor training progress via CloudWatch logs
  5. Retrieve trained model from S3
  6. Terminate instance to stop billing

Failure Handling

  • Save checkpoints to S3 every epoch
  • Use spot instance interruption handling
  • Implement training resume from checkpoint
  • Set up CloudWatch alarms for long-running jobs

Training System Benefits: Scales beyond local GPU memory, handles larger datasets, provides cost flexibility through spot instances.
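A minimal sketch of the checkpoint-to-S3 pattern from the failure-handling list above, assuming an ml-bucket bucket and a checkpoints/ prefix (both illustrative):

import boto3
import torch
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'ml-bucket'                    # illustrative bucket name
CKPT_KEY = 'checkpoints/latest.pth'     # illustrative key

def save_checkpoint(model, optimizer, epoch, path='/tmp/ckpt.pth'):
    # Write model/optimizer state locally, then copy to S3 so a spot
    # interruption loses at most one epoch of work
    torch.save({'epoch': epoch,
                'model_state': model.state_dict(),
                'optimizer_state': optimizer.state_dict()}, path)
    s3.upload_file(path, BUCKET, CKPT_KEY)

def resume_from_checkpoint(model, optimizer, path='/tmp/ckpt.pth'):
    # Return the next epoch to run; start at 0 if no checkpoint exists yet
    try:
        s3.download_file(BUCKET, CKPT_KEY, path)
    except ClientError:
        return 0
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    return ckpt['epoch'] + 1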

Serving System Design

Lambda-based serving with S3 model storage.

Lambda Function Implementation

import json
import boto3
import torch
import tempfile

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Download model from S3 on the first invocation; cached on the function
    # object for warm invocations. Assumes the artifact is a full pickled model;
    # if only a state_dict was saved, instantiate the model class and load_state_dict instead.
    if not hasattr(lambda_handler, 'model'):
        with tempfile.NamedTemporaryFile() as tmp:
            s3.download_file('ml-bucket', 'models/model_v1.pth', tmp.name)
            lambda_handler.model = torch.load(tmp.name, map_location='cpu')
            lambda_handler.model.eval()

    # Parse input
    input_data = json.loads(event['body'])
    features = torch.tensor(input_data['features'], dtype=torch.float32)

    # Make prediction
    with torch.no_grad():
        prediction = lambda_handler.model(features)

    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction.tolist()})
    }

API Gateway Configuration

  • REST API endpoint: https://api.example.com/predict
  • POST method with JSON payload
  • CORS enabled for web applications
  • Rate limiting: 1000 requests/second

Alternative: EC2 Serving

# For higher throughput or larger models
from flask import Flask, request
import torch

app = Flask(__name__)
model = torch.load('model.pth')  # Loaded once at startup (full pickled model)
model.eval()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = torch.tensor(data['features'], dtype=torch.float32)
    with torch.no_grad():
        prediction = model(features)
    return {'prediction': prediction.tolist()}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)

Serving Performance Comparison

Approach        Cold Start    Warm Latency   Max Throughput    Cost/1M requests
Lambda          2-5 seconds   100-300ms      1000 concurrent   $200
EC2 t3.medium   0ms           50-100ms       100 req/sec       $300
EC2 c5.large    0ms           20-50ms        500 req/sec       $600

When to Use Each:

Lambda:

  • Sporadic traffic patterns
  • Cost optimization priority
  • Simple models (<250MB)
  • Can tolerate cold starts

EC2:

  • Consistent traffic
  • Large models (>250MB)
  • Low latency requirements (<50ms)
  • Need persistent connections

Serving Design Choice: Lambda for variable workloads, EC2 for consistent high-throughput requirements.

Data Management Patterns

S3-based data organization for ML workflows.

S3 Bucket Organization

ml-project-bucket/
├── data/
│   ├── raw/
│   │   ├── 2024/01/15/data.csv
│   │   └── 2024/01/16/data.csv
│   ├── processed/
│   │   ├── train.parquet
│   │   └── test.parquet
│   └── features/
│       └── feature_v1.csv
├── models/
│   ├── experiments/
│   │   ├── exp_001/model.pth
│   │   └── exp_002/model.pth
│   └── production/
│       ├── model_v1.pth
│       └── model_v2.pth
└── results/
    ├── predictions/
    └── metrics/

Data Processing Pipeline

# Data validation and preprocessing
import boto3
import pandas as pd
from sklearn.model_selection import train_test_split

s3 = boto3.client('s3')

def process_data():
    # Download raw data
    s3.download_file('bucket', 'data/raw/data.csv', 'raw.csv')

    # Clean and validate (validate_schema / clean_missing_values are project-specific helpers)
    df = pd.read_csv('raw.csv')
    df = validate_schema(df)
    df = clean_missing_values(df)

    # Split and save
    train, test = train_test_split(df)
    train.to_parquet('train.parquet')
    test.to_parquet('test.parquet')

    # Upload processed data
    s3.upload_file('train.parquet', 'bucket', 'data/processed/train.parquet')
    s3.upload_file('test.parquet', 'bucket', 'data/processed/test.parquet')

S3 Storage Class Strategy

Data Type                 Access Pattern   Storage Class   Cost/GB/month
Raw data                  Archive only     Glacier         $0.004
Processed training data   Weekly access    IA              $0.0125
Active models             Daily access     Standard        $0.023
Predictions               Real-time        Standard        $0.023

Data Lifecycle Management

# Lifecycle policy example (each rule needs an ID and a Filter/Prefix to be valid)
lifecycle_policy = {
    'Rules': [{
        'ID': 'archive-raw-data',
        'Status': 'Enabled',
        'Filter': {'Prefix': 'data/raw/'},   # example prefix from the layout above
        'Transitions': [
            {
                'Days': 30,
                'StorageClass': 'STANDARD_IA'
            },
            {
                'Days': 90,
                'StorageClass': 'GLACIER'
            }
        ]
    }]
}
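Applying the rules with boto3 might look like the following, using the lifecycle_policy dict above (bucket name reused from the layout above):

import boto3

# Attach the lifecycle rules defined above to the bucket
s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='ml-project-bucket',
    LifecycleConfiguration=lifecycle_policy
)

# Confirm the rules took effect
print(s3.get_bucket_lifecycle_configuration(Bucket='ml-project-bucket')['Rules'])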

Data Access Patterns

  • Training: High bandwidth, infrequent access
  • Serving: Low bandwidth, frequent access
  • Archival: No bandwidth, rare access
  • Monitoring: Medium bandwidth, regular access

Cost Optimization

  • Use appropriate storage class
  • Compress data files (parquet vs CSV)
  • Partition large datasets by date/category
  • Delete intermediate processing files

Data Strategy: Organize by lifecycle stage, optimize storage classes for access patterns, implement automated lifecycle policies.

System Integration and Orchestration

Connect EC2 training and Lambda serving through S3.

End-to-End Workflow

Automated Training Pipeline

# CloudWatch Event triggered training
def trigger_training(event, context):
    # Launch EC2 training instance
    ec2 = boto3.client('ec2')
    
    user_data_script = '''#!/bin/bash
    aws s3 cp s3://ml-bucket/scripts/train.py /home/ubuntu/
    cd /home/ubuntu
    python3 train.py
    sudo shutdown -h now
    '''
    
    response = ec2.run_instances(
        ImageId='ami-0c02fb55956c7d316',  # Deep Learning AMI
        InstanceType='p3.2xlarge',
        MinCount=1, MaxCount=1,
        UserData=user_data_script,
        IamInstanceProfile={'Name': 'ML-Training-Role'}
    )
    
    return {'instance_id': response['Instances'][0]['InstanceId']}

Model Update Workflow

# S3 trigger for model updates
def update_serving_model(event, context):
    # New model uploaded to S3
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    if key.startswith('models/production/'):
        # Update Lambda environment variable
        lambda_client = boto3.client('lambda')
        lambda_client.update_function_configuration(
            FunctionName='ml-serving-function',
            Environment={'Variables': {'MODEL_PATH': key}}
        )

Monitoring and Alerting

CloudWatch Metrics

  • Training job duration and cost
  • Model serving latency and error rates
  • S3 storage usage and costs
  • Lambda function invocations and failures

Automated Alerts

# CloudWatch alarm for training failures.
# Assumes the training workflow publishes a custom 'InstanceTerminated' metric
# (see the sketch below); AWS/* namespaces are reserved for AWS-published metrics.
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='ML-Training-Failed',
    MetricName='InstanceTerminated',
    Namespace='ML/Training',
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ml-alerts']
)
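The alarm above watches a custom metric that the training workflow is assumed to publish; a minimal sketch of emitting it (namespace and metric name must match the alarm):

import boto3

# Published by the training workflow when an instance terminates without uploading a model
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='ML/Training',                 # must match the alarm's namespace
    MetricData=[{
        'MetricName': 'InstanceTerminated',  # must match the alarm's metric name
        'Value': 1,
        'Unit': 'Count'
    }]
)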

System Health Dashboard

  • Active training jobs and progress
  • Model serving performance metrics
  • Daily/weekly cost breakdown
  • Data pipeline health status

Integration Principles: Use S3 as central data store, automate workflows with triggers, implement comprehensive monitoring.

Cost Management and Optimization

Practical cost control for EC2 and S3 based ML systems.

Cost Breakdown Analysis

Monthly Costs for Typical ML Project

  • S3 storage (500GB): $11.50
  • EC2 training (200 hours p3.2xlarge at $3.06/hour): $612
  • Lambda serving (1M requests): $200
  • Data transfer: $50
  • Total: $873.50/month

Cost Optimization Strategies

EC2 Training Optimization

  • Use spot instances: 70% cost reduction ($612 → $184)
  • Right-size instances: Match model requirements to instance type
  • Automated termination: Stop instances when training completes
  • Reserved instances: 60% discount for predictable workloads

S3 Storage Optimization

  • Lifecycle policies: Automatic transition to cheaper storage classes
  • Data compression: 50-80% size reduction with parquet/gzip
  • Intelligent tiering: Automatic cost optimization
  • Delete temporary files: Clean up intermediate processing data

Lambda Serving Optimization

  • Memory allocation: Match to actual model requirements
  • Provisioned concurrency: Reduce cold start costs for consistent traffic
  • Alternative architectures: Consider EC2 for high-volume serving

Monitoring and Budgets

  • Cost allocation tags: Track expenses by project/team
  • Billing alerts: Notification when costs exceed thresholds
  • Usage reports: Identify optimization opportunities

Cost Optimization Impact

Optimization     Before    After     Savings
Spot instances   $612      $184      $428 (70%)
S3 lifecycle     $11.50    $5.75     $5.75 (50%)
Right-sizing     $200      $120      $80 (40%)
Total            $873.50   $359.75   $513.75

Monthly savings: 59% through optimization

Budgeting Framework

# Set up cost alerts
import boto3

budgets = boto3.client('budgets')
budgets.create_budget(
    AccountId='123456789012',
    Budget={
        'BudgetName': 'ML-Project-Budget',
        'BudgetLimit': {
            'Amount': '500',
            'Unit': 'USD'
        },
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST'
    },
    NotificationsWithSubscribers=[{
        'Notification': {
            'NotificationType': 'ACTUAL',
            'ComparisonOperator': 'GREATER_THAN',
            'Threshold': 80
        },
        'Subscribers': [{
            'SubscriptionType': 'EMAIL',
            'Address': 'admin@company.com'
        }]
    }]
)

Cost Management Process: Set budgets, implement optimizations, monitor usage patterns, adjust resources based on actual requirements.

Production Deployment Considerations

Transform development system into production-ready ML service.

Production Readiness Checklist

Security

  • IAM roles with least-privilege permissions
  • S3 bucket policies restricting access
  • VPC for network isolation
  • Encryption at rest and in transit

Reliability

  • Multi-AZ deployment for high availability
  • Automated backup and recovery procedures
  • Health checks and automatic failover
  • Circuit breakers for external dependencies

Monitoring

  • Comprehensive logging and metrics
  • Alerting for system and model performance
  • Distributed tracing for debugging
  • Business impact measurement

Scalability

  • Auto-scaling groups for EC2 instances
  • Lambda concurrent execution limits
  • S3 request rate optimization
  • CDN for global content distribution

Compliance

  • Data retention and deletion policies
  • Audit logging for regulatory requirements
  • Model explainability and bias detection
  • Privacy protection and anonymization

Development vs Production

Aspect               Development   Production
Data volume          1GB sample    1TB+ full dataset
Training frequency   Manual        Automated daily/weekly
Serving SLA          Best effort   99.9% availability
Security             Basic         Enterprise-grade
Cost                 $50/month     $500-5000/month

Production Architecture Changes

  • Load balancer in front of serving instances
  • Database for model metadata and predictions
  • Monitoring dashboards and alerting
  • CI/CD pipeline for code deployment
  • Infrastructure as code (Terraform/CloudFormation)

Operational Procedures

  • Incident response and escalation
  • Model retraining and deployment pipeline
  • Performance regression testing
  • Capacity planning and resource forecasting
  • Regular security audits and updates

Success Metrics

  • System uptime and availability
  • Model prediction accuracy over time
  • Response latency and throughput
  • Cost per prediction or user
  • Time to detect and resolve issues

Production Transformation: Add redundancy, monitoring, security, and operational procedures around the basic EC2/S3/Lambda architecture.

AWS Identity and Access Management

Amazon Resource Names: Global Resource Identification

AWS uses ARNs to uniquely identify every resource across all accounts and regions globally.

ARN Structure Format

arn:partition:service:region:account-id:resource-type/resource-id

Component Breakdown

Partition: AWS deployment (usually “aws”)

  • aws - Standard AWS regions
  • aws-cn - China regions
  • aws-us-gov - GovCloud regions

Service: AWS service name

  • s3 - Simple Storage Service
  • ec2 - Elastic Compute Cloud
  • iam - Identity and Access Management
  • lambda - Lambda Functions

Region: Geographic region identifier

  • us-east-1 - US East (Virginia)
  • eu-west-1 - EU (Ireland)
  • Empty for global services (IAM, S3 bucket names)

Account ID: 12-digit account identifier

  • 123456789012 - Specific AWS account
  • Empty for public resources

Resource: Service-specific identifier

  • bucket-name - S3 bucket
  • instance/i-1234567890abcdef0 - EC2 instance
  • user/developer-name - IAM user

Real ARN Examples

S3 Bucket ARN

arn:aws:s3:::ml-training-bucket-12345
  • Global resource (no region/account)
  • Bucket names must be globally unique

S3 Object ARN

arn:aws:s3:::ml-training-bucket-12345/models/bert-base.pth
  • Specific object within bucket
  • Used in policies for granular access

EC2 Instance ARN

arn:aws:ec2:us-east-1:123456789012:instance/i-0abcd1234ef567890
  • Region-specific resource
  • Account-specific identifier

IAM Role ARN

arn:aws:iam::123456789012:role/EC2-ML-Training-Role
  • Global service (no region)
  • Account-specific role

Lambda Function ARN

arn:aws:lambda:us-east-1:123456789012:function:iris-classifier-api
  • Region and account specific
  • Function name as resource ID

Policy Usage Example

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": [
    "arn:aws:s3:::ml-training-bucket-12345/data/*",
    "arn:aws:s3:::ml-training-bucket-12345/models/*"
  ]
}

ARNs enable precise resource identification across AWS’s global infrastructure, supporting granular access control and cross-service integration.
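A small helper that splits an ARN into the components described above (purely illustrative; the sample ARNs are the ones from this page):

def parse_arn(arn):
    """Split an ARN into its colon-delimited components.
    The resource portion may itself contain ':' or '/', so limit the split to 5."""
    partition, service, region, account, resource = arn.split(':', 5)[1:]
    return {'partition': partition, 'service': service,
            'region': region or None, 'account': account or None,
            'resource': resource}

print(parse_arn('arn:aws:ec2:us-east-1:123456789012:instance/i-0abcd1234ef567890'))
# {'partition': 'aws', 'service': 'ec2', 'region': 'us-east-1',
#  'account': '123456789012', 'resource': 'instance/i-0abcd1234ef567890'}

print(parse_arn('arn:aws:s3:::ml-training-bucket-12345/models/bert-base.pth'))
# region and account come back as None for this global S3 resource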

Resource IDs and Naming Conventions

AWS generates unique identifiers for resources with predictable patterns for programmatic access.

AWS-Generated IDs

EC2 Instances

  • Pattern: i- + 17 hex characters
  • Example: i-0abcd1234efgh5678
  • Unique within region, persistent across stop/start

Security Groups

  • Pattern: sg- + 17 hex characters
  • Example: sg-0123456789abcdef0
  • Referenced in networking and firewall rules

VPCs (Virtual Private Clouds)

  • Pattern: vpc- + 17 hex characters
  • Example: vpc-12345678901234567
  • Container for all networking resources

Subnets

  • Pattern: subnet- + 17 hex characters
  • Example: subnet-0abcdef1234567890
  • Network segment within VPC and AZ

AMI (Amazon Machine Images)

  • Pattern: ami- + 17 hex characters
  • Example: ami-0c2b8ca1dad447f8a
  • Immutable OS image for launching instances

EBS Volumes

  • Pattern: vol- + 17 hex characters
  • Example: vol-0123456789abcdef0
  • Block storage attached to instances

User-Defined Naming

S3 Bucket Names (Global)

  • Must be globally unique across all AWS accounts
  • 3-63 characters, lowercase, no underscores
  • Examples: ml-training-data-company-2024, model-artifacts-prod

IAM Names (Account-scoped)

  • User names: developer-john-smith, ci-cd-deployment
  • Role names: EC2-ML-Training-Role, Lambda-S3-Access
  • Policy names: MLTrainingDataAccess, ModelDeploymentPermissions

Tags for Resource Organization

{
  "Environment": "production",
  "Project": "ml-classifier",
  "Owner": "data-science-team",
  "CostCenter": "research-development"
}
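Tags like these can be applied at launch or attached afterwards; a minimal boto3 sketch (the instance ID is hypothetical):

import boto3

# Attach the organizational tags above to an existing instance (ID is hypothetical)
ec2 = boto3.client('ec2')
ec2.create_tags(
    Resources=['i-0abcd1234ef567890'],
    Tags=[
        {'Key': 'Environment', 'Value': 'production'},
        {'Key': 'Project', 'Value': 'ml-classifier'},
        {'Key': 'Owner', 'Value': 'data-science-team'},
        {'Key': 'CostCenter', 'Value': 'research-development'},
    ]
)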

Naming Best Practices

Descriptive and Searchable

  • Good: ml-training-p3xlarge-gpu-instance
  • Bad: my-instance-1

Environment Separation

  • ml-model-artifacts-dev
  • ml-model-artifacts-staging
  • ml-model-artifacts-prod

Service Integration

# EC2 instance launches with role
aws ec2 run-instances \
    --image-id ami-0c2b8ca1dad447f8a \
    --instance-type p3.2xlarge \
    --iam-instance-profile Name=EC2-ML-Training-Profile \
    --security-group-ids sg-0123456789abcdef0 \
    --subnet-id subnet-0abcdef1234567890

Consistent resource naming and understanding ID patterns enables automation, cost tracking, and operational management at scale.

Identity Hierarchies: Users, Roles, and Service Accounts

Distributed systems require identity verification across network boundaries without shared local authentication.

Distributed Systems Security Problem

Local systems rely on operating system authentication:

  • Single login validates all local resource access
  • File permissions enforced by kernel
  • Process isolation prevents unauthorized access
  • Network access assumed trusted (localhost)

Cloud Distribution Challenge

  • Resources span multiple physical machines across datacenters
  • Network communication between untrusted systems
  • No shared operating system to enforce permissions
  • Service-to-service calls cross security boundaries
  • Identity must be verified for every distributed request

IAM as Distributed Security Solution

AWS IAM solves distributed identity through:

  • Centralized identity store: Single source of truth for all accounts
  • Network-based credentials: Authentication tokens sent over network
  • Service-specific permissions: Each API call individually authorized
  • Cross-boundary trust: Roles enable secure service communication

IAM Identity Types

Root Account

  • Complete administrative access to all AWS services and resources
  • Email address and password used for initial account creation
  • Cannot be restricted through IAM policies
  • Should never be used for day-to-day operations
  • Requires multi-factor authentication for production accounts

IAM Users

  • Individual identity for human access to AWS resources
  • Permanent credentials (access key ID and secret access key)
  • Optional password for console access
  • Direct attachment of policies and group membership
  • Maximum 5,000 users per AWS account

IAM Roles

  • Temporary credentials for applications, services, or cross-account access
  • No permanent credentials - credentials issued dynamically
  • Assumed by trusted entities (users, services, other accounts)
  • Preferred method for EC2 instances and Lambda functions
  • Cross-account access without sharing permanent credentials

Service-Linked Roles

  • Predefined roles for specific AWS services
  • Automatically created and managed by AWS services
  • Cannot be modified or deleted by users
  • Required for services like ECS, Lambda, and Auto Scaling

Identity Hierarchy Structure

AWS Account (Root)
├── IAM Users
│   ├── Individual Developer A
│   ├── Individual Developer B
│   └── CI/CD System User
├── IAM Groups
│   ├── Developers Group
│   ├── Administrators Group
│   └── Read-Only Group
├── IAM Roles
│   ├── EC2-ML-Training-Role
│   ├── Lambda-Execution-Role
│   └── Cross-Account-Access-Role
└── Service-Linked Roles
    ├── ECS Task Role
    ├── Auto Scaling Role
    └── CloudFormation Role

Identity Relationship Dependencies

User → Group Membership

  • Users inherit permissions from all assigned groups
  • Group policy changes affect all group members immediately
  • Maximum 10 groups per user, 300 groups per account

Role → Trust Relationships

  • Trust policy defines which entities can assume the role
  • Role policies define permissions when role is assumed
  • Temporary credentials expire (15 minutes to 12 hours)

Cross-Account Trust

  • Role in Account A trusts specific users/roles in Account B
  • External ID required for enhanced security in cross-account scenarios
  • Audit trail through CloudTrail for all role assumptions

Critical Design Principle: Least privilege access - grant minimum permissions required for specific tasks, expandable through group membership or role assumption.

Permission Models: Policies, Actions, and Resource Restrictions

Distributed systems require explicit authorization for every network request.

Local vs Distributed Authorization

Local System Authorization (Traditional)

  • Operating system controls file access through uid/gid
  • Process inherits user permissions automatically
  • File system enforces read/write/execute permissions
  • No network authorization required for local resources

Distributed System Authorization (Cloud)

  • Every API call evaluated independently across network
  • No inherited permissions between services
  • Each resource access requires explicit policy evaluation
  • Network requests carry identity and are verified remotely

Policy-Based Authorization Model

IAM implements declarative security through JSON policies:

Policy Document Structure

Basic Policy Components

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::ml-training-bucket/models/*",
                "arn:aws:s3:::ml-training-bucket/datasets/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::ml-training-bucket",
            "Condition": {
                "StringLike": {
                    "s3:prefix": ["models/*", "datasets/*"]
                }
            }
        }
    ]
}

Policy Types and Attachment Methods

Identity-Based Policies

  • Attached directly to users, groups, or roles
  • Define permissions for the identity across all resources
  • Inherited through group membership
  • Maximum 10 managed policies per identity

Resource-Based Policies

  • Attached directly to resources (S3 buckets, Lambda functions)
  • Define which identities can access the resource
  • Cross-account access without role assumption
  • Resource owner maintains control over access

Permission Boundaries

  • Maximum permissions an identity can have
  • Does not grant permissions, only limits them
  • Applied to users and roles, not groups
  • Advanced feature for delegation of administrative tasks

Policy Evaluation Logic

  1. Explicit deny in any applicable policy → request denied
  2. Otherwise, explicit allow (within any permission boundary) → request allowed
  3. No matching allow → implicit deny (default)

Common Permission Patterns

Service-Specific Actions

  • s3:ListBucket - List objects in S3 bucket
  • ec2:RunInstances - Launch EC2 instances
  • iam:CreateRole - Create IAM roles
  • logs:CreateLogGroup - Create CloudWatch log groups

Resource ARN Patterns

  • arn:aws:s3:::bucket-name/* - All objects in bucket
  • arn:aws:ec2:us-east-1:*:instance/* - All instances in region
  • arn:aws:iam::account-id:role/role-name - Specific IAM role

Policy Evaluation Rule: Explicit deny always wins, followed by explicit allow, with implicit deny as default for all unspecified actions.
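To check how these rules play out for a specific identity, the IAM policy simulator can evaluate actions against its attached policies; a minimal boto3 sketch (role ARN and object ARN are illustrative):

import boto3

# Ask IAM how an identity's attached policies resolve for a concrete action and resource
iam = boto3.client('iam')
result = iam.simulate_principal_policy(
    PolicySourceArn='arn:aws:iam::123456789012:role/EC2-ML-Training-Role',  # illustrative role
    ActionNames=['s3:GetObject'],
    ResourceArns=['arn:aws:s3:::ml-training-bucket/models/model_v1.pth']
)
for r in result['EvaluationResults']:
    # EvalDecision is 'allowed', 'explicitDeny', or 'implicitDeny'
    print(r['EvalActionName'], '->', r['EvalDecision'])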

AWS Access Methods: Console, CLI, and SDK Integration

Multiple programmatic and interactive interfaces provide access to AWS services with different authentication and use case optimization.

AWS Management Console

  • Web-based graphical interface for all AWS services
  • Requires username/password authentication
  • Multi-factor authentication support required for production
  • Session-based access with configurable timeout
  • Visual resource management and monitoring dashboards

Console Authentication Flow

User Login → MFA Verification → Session Token
├── Session Duration: 12 hours maximum
├── Automatic logout on inactivity
├── Role switching within console
└── CloudTrail logging of all actions

AWS Command Line Interface (CLI)

  • Text-based tool for scriptable AWS service interaction
  • Supports all AWS service APIs through consistent command structure
  • Local credential configuration and profile management
  • Batch operations and automation scripting
  • Output formats: JSON, table, text for different use cases

CLI Installation and Configuration

# Install AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install

# Configure default profile
aws configure
# AWS Access Key ID: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Default region name: us-east-1
# Default output format: json

AWS Software Development Kits (SDKs)

  • Language-specific libraries for AWS service integration
  • Available for Python (boto3), Java, .NET, Node.js, Go, Rust
  • Automatic retry logic and error handling
  • Built-in credential chain resolution
  • Asynchronous operations and pagination support

SDK Authentication Hierarchy

  1. Explicit credentials in code (not recommended)
  2. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  3. Credentials file (~/.aws/credentials)
  4. IAM roles for EC2 instances (Instance Metadata Service)
  5. IAM roles for ECS tasks (Task role assignment)
  6. IAM roles for Lambda functions (Execution role)

Python SDK (boto3) Example

import boto3

# Automatic credential resolution
s3_client = boto3.client('s3')

# List buckets
response = s3_client.list_buckets()
for bucket in response['Buckets']:
    print(f"Bucket: {bucket['Name']}")

# Upload file with automatic multipart
s3_client.upload_file(
    'local_file.txt', 
    'ml-training-bucket', 
    'datasets/file.txt'
)

Access Method Comparison

  • Console: Interactive exploration, visual debugging
  • CLI: Automation scripts, CI/CD integration
  • SDK: Application integration, programmatic access
  • API: Direct HTTP calls, custom tooling

Credential Security Principle: Use temporary credentials (roles) for applications, permanent credentials only for development environments with regular rotation.

Credential Management: Security Keys, Profiles, and Rotation

Secure credential management requires understanding authentication mechanisms, storage locations, and rotation procedures for maintaining system security.

Credential Types and Use Cases

Access Key Pairs (Permanent Credentials)

  • Access Key ID: Public identifier (20 characters, starts with AKIA)
  • Secret Access Key: Private key (40 characters, base64-encoded)
  • Used for programmatic access via CLI and SDKs
  • Maximum 2 active access keys per IAM user
  • Require regular rotation (recommended 90 days)

Temporary Security Credentials

  • Session token in addition to access key pair
  • Limited lifetime (15 minutes to 36 hours)
  • Issued through AWS Security Token Service (STS)
  • Cannot be extended - must be refreshed before expiration
  • Used automatically by EC2 instance roles and Lambda functions

Multi-Factor Authentication (MFA)

  • Virtual MFA devices (Google Authenticator, Authy)
  • Hardware MFA devices (YubiKey, Gemalto)
  • Required for sensitive operations (root account, role assumption)
  • Time-based one-time passwords (TOTP) or challenge-response

Credential Storage Mechanisms

Local Configuration Files

# ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[development]
aws_access_key_id = AKIAI44QH8DHBEXAMPLE
aws_secret_access_key = je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY

# ~/.aws/config  
[default]
region = us-east-1
output = json

[profile development]
region = us-west-2
output = table

Environment Variable Configuration

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1
export AWS_PROFILE=development

Instance Metadata Service (IMDS)

  • Automatic credential delivery to EC2 instances
  • No permanent credentials stored on instance
  • Credentials refreshed automatically before expiration
  • IMDSv2 requires token-based requests for enhanced security

# Get instance role credentials (IMDSv2)
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

CREDENTIALS=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/iam/security-credentials/role-name)

Credential Security Best Practices

Development Environment

  • Use named profiles for different projects/accounts
  • Never commit credentials to version control systems
  • Use environment variables for containerized applications
  • Implement credential scanning in CI/CD pipelines

Production Environment

  • IAM roles for all EC2 instances and Lambda functions
  • Cross-account roles instead of shared permanent credentials
  • Regular rotation of any permanent credentials (90-day maximum)
  • Monitoring and alerting for credential usage anomalies

Credential Rotation Procedure

  1. Create second access key while first remains active
  2. Update applications to use new credentials
  3. Test functionality with new credentials
  4. Delete old access key after verification
  5. Monitor CloudTrail for any authentication failures

Security Implementation Standard: Production systems must use IAM roles with temporary credentials; permanent access keys only for development environments with mandatory rotation procedures.
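A minimal sketch of steps 1 and 4 of the rotation procedure with boto3 (user name and old key ID are illustrative placeholders); steps 2-3 are application configuration changes:

import boto3

iam = boto3.client('iam')
USER = 'developer-john-smith'   # illustrative user name

# Step 1: create a second access key while the old one stays active
new_key = iam.create_access_key(UserName=USER)['AccessKey']
print('New key:', new_key['AccessKeyId'])   # distribute via a secrets store, never source control

# ... update and test applications with the new credentials (steps 2-3) ...

# Step 4: deactivate, then delete the old key after verification
old_key_id = 'AKIAIOSFODNN7EXAMPLE'         # illustrative old key ID
iam.update_access_key(UserName=USER, AccessKeyId=old_key_id, Status='Inactive')
iam.delete_access_key(UserName=USER, AccessKeyId=old_key_id)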

Role Assumption and Cross-Account Access Patterns

Distributed systems require transitive trust without credential sharing.

Distributed Trust Problem

Traditional network security uses shared secrets:

  • Database passwords shared across all application servers
  • API keys distributed to every service that needs access
  • Credentials stored in configuration files on multiple machines
  • Single credential compromise affects entire system

Transitive Trust Challenge

ML systems require service-to-service access:

  • Training service needs to read S3 data and write models
  • API service needs to load models and log predictions
  • Monitoring service needs to access logs from all other services
  • Each service runs on separate machines with separate credentials

Role Assumption as Trust Delegation

IAM roles implement temporary trust without credential sharing:

  • Identity verification: Service proves its identity to AWS
  • Trust policy evaluation: AWS checks if service can assume target role
  • Temporary credential issuance: AWS provides time-limited access tokens
  • Resource access: Service uses temporary credentials for specific actions
  • Automatic expiration: Credentials become invalid after specified time

Role Assumption Mechanics

Trust Policy Configuration

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:user/DeveloperA",
                    "arn:aws:iam::123456789012:role/EC2-Instance-Role"
                ]
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "unique-external-identifier"
                }
            }
        }
    ]
}

Role Assumption Process

  1. Authentication: Identity authenticates with AWS using permanent or temporary credentials
  2. Authorization: AWS verifies identity has sts:AssumeRole permission for target role
  3. Trust Evaluation: Target role’s trust policy evaluated against requesting identity
  4. Token Issuance: AWS STS issues temporary credentials (AccessKeyId, SecretAccessKey, SessionToken)
  5. Resource Access: Temporary credentials used for API calls within role’s permission scope

Temporary Credential Characteristics

  • Default session duration: 1 hour for role assumption
  • Maximum session duration: 12 hours (configurable per role)
  • Credentials include session token for authentication
  • Cannot be extended - must assume role again for continued access

Cross-Account Access Patterns

Development Account → Production Account

# Assume role in production account
aws sts assume-role \
    --role-arn arn:aws:iam::987654321098:role/ProductionDeploymentRole \
    --role-session-name deployment-session-2024 \
    --external-id unique-external-identifier

# Response contains temporary credentials
{
    "Credentials": {
        "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "SessionToken": "very-long-session-token-string",
        "Expiration": "2024-03-15T14:30:00Z"
    }
}

Service-to-Service Role Assumption

  • EC2 instances assume roles for S3 access
  • Lambda functions assume roles for DynamoDB operations
  • ECS tasks assume roles for Secrets Manager access
  • CodeBuild projects assume roles for deployment operations

Cross-Account Trust Relationships

Account A (Production) Trusts Account B (Development)

Account B (111111111111) - Development
├── Developer Users
├── CI/CD Systems
└── Can assume roles in Production Account

Account A (222222222222) - Production  
├── ProductionDeploymentRole (trusts Account B)
├── DataAccessRole (trusts specific users)
└── MonitoringRole (trusts service accounts)

Role Chaining Limitations

  • Role chaining (assuming a role from an already-assumed role session) caps the chained session at 1 hour
  • Chained sessions cannot be extended beyond that limit regardless of the role's configured maximum
  • Use role switching in console for multi-level access
  • Cross-account access requires explicit trust in both directions

Access Control Architecture: Cross-account role assumption provides secure resource sharing without permanent credential distribution, enabling centralized identity management across multiple AWS environments.

AWS SDK and CLI Configuration Management

Programmatic AWS access requires proper configuration of authentication credentials, regional settings, and service-specific parameters through standardized configuration methods.

Configuration Hierarchy and Precedence

Credential Resolution Order

  1. Command-line parameters (aws s3 ls --profile production)
  2. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  3. CLI credentials file (~/.aws/credentials)
  4. CLI configuration file (~/.aws/config)
  5. Container credentials (ECS task role)
  6. Instance metadata service (EC2 instance role)

Profile-Based Configuration Management

# ~/.aws/config
[default]
region = us-east-1
output = json

[profile development]
region = us-west-2
output = table
role_arn = arn:aws:iam::123456789012:role/DevelopmentRole
source_profile = default

[profile production]
region = us-east-1
output = json
role_arn = arn:aws:iam::987654321098:role/ProductionRole
source_profile = default
external_id = prod-external-id-2024

Advanced Configuration Options

Regional Configuration

  • Default region for service calls
  • Service-specific regional overrides
  • Regional failover configuration for high availability
  • Cross-region replication settings

Output Format Specification

  • json: Machine-readable structured output
  • table: Human-readable tabular format
  • text: Tab-delimited values for shell scripting
  • yaml: YAML-formatted output for configuration files

SDK Configuration Examples

Python (boto3) Configuration

import boto3
from botocore.config import Config

# Session with specific profile
session = boto3.Session(profile_name='development')
s3_client = session.client('s3')

# Client with custom configuration
config = Config(
    region_name='us-west-2',
    retries={'max_attempts': 10, 'mode': 'adaptive'},
    max_pool_connections=50
)
ec2_client = boto3.client('ec2', config=config)

# Role assumption for cross-account access
sts_client = boto3.client('sts')
assumed_role = sts_client.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/DataAccessRole',
    RoleSessionName='ml-training-session'
)

# Use temporary credentials
temp_credentials = assumed_role['Credentials']
s3_resource = boto3.resource(
    's3',
    aws_access_key_id=temp_credentials['AccessKeyId'],
    aws_secret_access_key=temp_credentials['SecretAccessKey'],
    aws_session_token=temp_credentials['SessionToken']
)

CLI Profile Operations

# List configured profiles
aws configure list-profiles

# Use specific profile
aws s3 ls --profile development

# Set default profile
export AWS_PROFILE=development

# Configure new profile interactively
aws configure --profile new-environment

Environment-Specific Configuration

# Development environment
export AWS_PROFILE=development
export AWS_DEFAULT_REGION=us-west-2

# Production environment  
export AWS_PROFILE=production
export AWS_DEFAULT_REGION=us-east-1
export AWS_DEFAULT_OUTPUT=json

Configuration Management Strategy: Use named profiles for environment separation, environment variables for containerized applications, and IAM roles for production services to maintain security boundaries and operational consistency.

Security Best Practices: Permission Boundaries and Access Monitoring

Comprehensive security requires implementing permission boundaries, continuous access monitoring, and automated compliance verification to maintain least-privilege principles.

Permission Boundary Implementation

Maximum Permission Limits

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "ec2:DescribeInstances",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Deny",
            "Action": [
                "iam:*",
                "ec2:TerminateInstances",
                "s3:DeleteBucket"
            ],
            "Resource": "*"
        }
    ]
}

Boundary Application Pattern

  • Attached to IAM users and roles (not groups)
  • Defines maximum permissions, never grants permissions
  • Combined with identity-based policies using logical AND
  • Enables safe delegation of administrative tasks
  • Prevents privilege escalation through policy modification
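A sketch of attaching a boundary, assuming the JSON above has been created as a managed policy (policy ARN and user name are illustrative):

import boto3

iam = boto3.client('iam')
BOUNDARY_ARN = 'arn:aws:iam::123456789012:policy/MLDeveloperBoundary'  # illustrative

# Attach the boundary to an existing user; identity policies still grant permissions,
# but nothing outside the boundary is ever effective.
iam.put_user_permissions_boundary(
    UserName='developer-john-smith',
    PermissionsBoundary=BOUNDARY_ARN
)

# New roles can be created with the boundary applied from the start, e.g.:
# iam.create_role(RoleName='Delegated-Dev-Role',
#                 AssumeRolePolicyDocument=trust_policy_json,
#                 PermissionsBoundary=BOUNDARY_ARN)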

Access Monitoring and Alerting

CloudTrail Event Monitoring

  • All API calls logged with identity, timestamp, source IP
  • Failed authentication attempts and permission denials
  • Root account usage (should trigger immediate alerts)
  • Cross-account role assumptions and policy changes
  • Unusual geographic access patterns or service usage

Critical Security Events

# Root account login
"eventName": "ConsoleLogin",
"userIdentity.type": "Root"

# Failed authentication attempts
"errorCode": "SigninFailure"
"errorMessage": "Invalid username or password"

# Policy modification
"eventName": "PutUserPolicy",
"eventName": "AttachRolePolicy"

# Cross-account access
"eventName": "AssumeRole",
"recipientAccountId": "different-account-id"

Automated Compliance Verification

AWS Config Rules for IAM Compliance

  • Root access key usage detection
  • Multi-factor authentication requirement validation
  • Unused IAM users and roles identification
  • Password policy compliance verification
  • Permission boundary attachment verification

Access Review Procedures

Quarterly Access Audit

  1. Identity Inventory: List all users, roles, and service accounts
  2. Permission Analysis: Review attached policies and group memberships
  3. Access Pattern Review: Analyze CloudTrail logs for actual usage
  4. Inactive Account Detection: Identify accounts without recent activity
  5. Privilege Escalation Check: Verify no unauthorized permission increases

Automated Security Monitoring

import boto3
from datetime import datetime

def audit_iam_users():
    iam = boto3.client('iam')
    
    # Get all IAM users
    users = iam.list_users()['Users']
    
    for user in users:
        username = user['UserName']
        
        # Check last activity
        try:
            last_used = iam.get_user(UserName=username)['User'].get('PasswordLastUsed')
            if last_used:
                days_inactive = (datetime.now(last_used.tzinfo) - last_used).days
                if days_inactive > 90:
                    print(f"Warning: User {username} inactive for {days_inactive} days")
        except Exception as e:
            print(f"Unable to check activity for {username}: {e}")
        
        # Check MFA status
        mfa_devices = iam.list_mfa_devices(UserName=username)['MFADevices']
        if not mfa_devices:
            print(f"Warning: User {username} has no MFA device")

Security Incident Response

  • Automatic credential rotation for compromised access keys
  • Role assumption monitoring for unusual patterns
  • Geographic access anomaly detection and blocking
  • Integration with SIEM systems for enterprise security

Security Architecture Principle: Implement defense-in-depth through permission boundaries, continuous monitoring, and automated compliance verification to maintain security posture at scale.

AWS ML Pipeline Implementation

EC2-S3 ML Pipeline Architecture

Local ML development breaks under production data volumes and serving requirements.

Development Environment Limitations

MacBook Pro M3 (32GB RAM)

  • Training dataset limit: 20GB fits in memory
  • Model size limit: ~8B parameters at float32 (8B × 4 bytes ≈ 32GB, the entire unified memory)
  • Training time: 4 hours for ResNet-50 on ImageNet subset
  • Serving capacity: Single process, ~10 requests/second
  • Storage: 1TB SSD, no redundancy or backup

Production Requirements

Training Workload

  • Dataset: ImageNet full (1.3TB, 14M images)
  • Model: EfficientNet-B7 (800M parameters, 12GB memory)
  • Training time constraint: <8 hours for experiment iteration
  • Concurrent experiments: 3-5 model variants simultaneously

Serving Workload

  • Traffic: 1000+ requests/second peak
  • Latency requirement: <100ms p99
  • Availability: 99.9% uptime (43 minutes downtime/month)
  • Global deployment: US, Europe, Asia regions

Failure Points

  • Memory: 1.3TB dataset exceeds 32GB RAM → Training impossible
  • Storage: 1TB drive fills with 1.3TB dataset → Process fails
  • Serving: Single process cannot handle 1000 req/s → Request timeouts
  • Availability: Single machine failure = 100% downtime → SLA violation

Distributed Architecture Solution

EC2 Compute Scaling

  • Instance type: r5.2xlarge (8 vCPUs, 64GB RAM)
  • GPU acceleration: p3.2xlarge (V100, 16GB VRAM)
  • Cost: $3.06/hour for training, shut down when idle
  • Concurrent training: Launch multiple instances simultaneously

S3 Storage Scaling

  • Capacity: Unlimited storage (1.3TB+ supported)
  • Durability: 99.999999999% (11 9’s) - no data loss risk
  • Access: Concurrent reads from multiple training instances
  • Cost: $0.023/GB/month ($30/month for 1.3TB)

Network Integration

EC2 r5.2xlarge (us-east-1a)
├── Training Process: PyTorch + 64GB RAM
├── Data Pipeline: boto3 → S3 streaming
├── Model Output: S3 model artifacts
└── API Server: Flask + gunicorn (100 req/s)

S3 Bucket (us-east-1)
├── /data/imagenet/ (1.3TB training data)
├── /models/experiments/ (trained model weights)
└── /logs/training/ (experiment tracking)

Cost Structure

  • Training: $3.06/hour × 8 hours = $24.48 per experiment
  • Storage: $30/month for dataset (vs $0 local storage)
  • Serving: $61/month always-on (vs free local serving)
  • Total: ~$115/month vs $15K workstation purchase

Operational Complexity

  • Network latency: 20ms S3 access vs <1ms local SSD
  • Security: IAM policies vs local file permissions
  • Failure modes: Service dependencies vs single machine reliability

This architecture trades local simplicity for production scalability at the cost of operational complexity and network dependencies.

AWS BILLING WARNING

AWS requires a credit card for account signup. Charges begin upon resource creation.

CRITICAL BILLING SAFETY - IMPLEMENT IMMEDIATELY:

1. Set Billing Alerts

# Set $10 billing alert via AWS CLI
aws budgets create-budget --account-id 123456789012 \
    --budget '{
        "BudgetName": "Monthly-Spend-Alert",
        "BudgetLimit": {"Amount": "10", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST"
    }'

2. Always Terminate Resources

  • Stop instances: Saves compute costs, keeps storage costs
  • Terminate instances: Deletes everything, stops all charges
  • Delete S3 buckets: Ongoing storage charges until deleted
  • Never leave resources running overnight

3. Use Free Tier Eligible Resources Only

  • t2.micro/t3.micro instances (750 hours/month free)
  • 30GB EBS storage free per month
  • 5GB S3 storage free per month
  • RDS t2.micro database (750 hours/month free)

EXPENSIVE MISTAKES TO AVOID:

GPU Instances: p3.2xlarge costs $3.06/hour ($2,200/month if left running)

Data Transfer: Cross-region transfer costs $0.09/GB (expensive for large datasets)

Load Balancers: Application Load Balancer costs $16.20/month base + $0.008 per LCU-hour

Auto Scaling: Can launch dozens of instances automatically during traffic spikes

Real Student Bill Examples:

  • Forgot running p3.8xlarge: $2,400 weekend charge
  • Left 20 instances in Auto Scaling Group: $1,200 monthly bill
  • Accidentally replicated 500GB across regions: $45 transfer charge

PROTECTION CHECKLIST:

  • Billing alerts configured for $10, $25, $50 thresholds
  • AWS CLI/Console set to us-east-1 (cheapest region)
  • Only use instance types explicitly mentioned in assignments
  • Terminate ALL resources after each lab session
  • Monitor billing dashboard weekly during course

When In Doubt: STOP and TERMINATE EVERYTHING

EC2 Instance Configuration

Create a Linux development environment optimized for ML workloads.

Instance Launch Configuration

AMI Selection

  • Navigate to EC2 console → Launch Instance
  • Search “Ubuntu Server 22.04 LTS”
  • Select the official Canonical AMI (example ID: ami-0c02fb55956c7d316; verify the current Ubuntu 22.04 AMI ID for your region)
  • Base Ubuntu installation - will install ML frameworks manually

Instance Type Selection

  • Choose t3.medium (2 vCPUs, 4GB RAM) for cost efficiency
  • Avoid GPU instances for initial setup (p3 costs $3+/hour)
  • Sufficient for small model training and development

Storage Configuration

  • Root volume: 30 GB gp3 SSD (general purpose)
  • No additional EBS volumes needed for demo
  • Enable “Delete on Termination” to avoid storage charges

Network and Security

  • Use default VPC and subnet
  • Create new security group: “ml-development”
  • Allow SSH (port 22) from your IP address only
  • Allow HTTP (port 80) for API endpoint access

Key Pair Authentication

  • Create new key pair: “ml-training-key”
  • Download .pem file (store securely)
  • Required for SSH access to instance

Launch Process Checklist

# Verify instance is running
aws ec2 describe-instances \
    --instance-ids i-1234567890abcdef0

# Connect via SSH
ssh -i ml-training-key.pem \
    ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Check system info
uname -a
python3 --version

Expected Costs

  • t3.medium: $0.0416/hour ($30/month if left running)
  • Storage: 30GB × $0.08/GB/month = $2.40/month
  • Data transfer: First 1GB free, then $0.09/GB

Common Launch Issues

  • Key pair permissions: chmod 400 ml-training-key.pem
  • Security group SSH access restricted to your IP
  • Instance state checks may take 2-3 minutes
  • Base Ubuntu AMI is ~8GB, standard boot time

Verification: Instance reaches “running” state, passes status checks, accepts SSH connections.

Development Environment Setup

Configure the instance for ML development with manual Docker installation.

System Updates and Dependencies

# Connect to instance
ssh -i ml-training-key.pem ubuntu@<instance-ip>

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential development tools
sudo apt install -y \
    git \
    htop \
    tree \
    curl \
    wget \
    unzip

# Verify Python environment
python3 --version
which python3

Docker Installation (Manual)

# Remove any old Docker versions
sudo apt-get remove docker docker-engine docker.io containerd runc

# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Add Docker repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

# Add user to docker group
sudo usermod -aG docker ubuntu
newgrp docker

# Verify Docker installation
docker --version
docker run hello-world

Python Environment Configuration

# Install Python package manager
sudo apt install -y python3-pip python3-venv

# Create virtual environment for ML
python3 -m venv ml-env
source ml-env/bin/activate

# Install ML frameworks and cloud integration packages
pip install \
    torch \
    boto3 \
    pandas \
    scikit-learn \
    matplotlib \
    flask \
    joblib \
    psutil

# Verify PyTorch installation
python -c "import torch; print(torch.__version__)"
python -c "import torch; print(torch.cuda.is_available())"

AWS CLI Configuration

# Install/update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure credentials (use IAM user with S3 permissions)
aws configure
# AWS Access Key ID: [your-access-key]
# AWS Secret Access Key: [your-secret-key]  
# Default region: us-east-1
# Default output format: json

# Test AWS connectivity
aws s3 ls

Environment Verification

  • Docker runs without sudo
  • Virtual environment activated with PyTorch
  • AWS CLI can list S3 buckets
  • All required Python packages installed

Troubleshooting Common Issues: Docker permission errors (restart session), virtual environment activation, AWS credential configuration.

S3 Data Storage Implementation

Create cloud storage for training data and model artifacts.

Create S3 Bucket via AWS Console

  1. Navigate to S3 service in AWS console
  2. Click “Create bucket”
  3. Bucket name: ml-training-{random-suffix} (must be globally unique)
  4. Region: us-east-1 (same as EC2 instance)
  5. Block public access: Keep default (enabled)
  6. Versioning: Disabled for demo
  7. Default encryption: Server-side encryption with S3 managed keys

Bucket Structure

ml-training-demo-12345/
├── data/
│   ├── raw/
│   │   └── iris.csv
│   └── processed/
├── models/
│   └── experiments/
└── logs/
    └── training/

Upload Sample Dataset

# Create sample dataset locally
python3 << EOF
from sklearn.datasets import load_iris
import pandas as pd

# Load iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.to_csv('iris.csv', index=False)
print(f"Created dataset with {len(df)} rows")
EOF

# Upload to S3
aws s3 cp iris.csv s3://ml-training-demo-12345/data/raw/iris.csv

# Verify upload
aws s3 ls s3://ml-training-demo-12345/data/raw/

Test S3 Access from Python

import boto3
import pandas as pd
from io import StringIO

# Initialize S3 client
s3_client = boto3.client('s3')
bucket_name = 'ml-training-demo-12345'

# List bucket contents
response = s3_client.list_objects_v2(Bucket=bucket_name)
for obj in response.get('Contents', []):
    print(f"Object: {obj['Key']}, Size: {obj['Size']} bytes")

# Download data for training
obj = s3_client.get_object(Bucket=bucket_name, Key='data/raw/iris.csv')
data = pd.read_csv(obj['Body'])
print(f"Loaded {len(data)} rows, {len(data.columns)} columns")
print(data.head())

S3 Access Patterns

  • Download: Copy S3 object to local filesystem
  • Stream: Read S3 object directly into memory
  • Upload: Copy local file or memory buffer to S3
  • List: Enumerate objects in bucket prefix

Cost Monitoring

# Check current month S3 costs
aws ce get-cost-and-usage \
    --time-period Start=2025-01-01,End=2025-02-01 \
    --granularity MONTHLY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE

Verification: S3 bucket created, data uploaded successfully, Python can read/write objects, permissions configured correctly.

PyTorch Model Definition

Define neural network architecture for cloud training.

import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import boto3
from io import StringIO, BytesIO
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib
import json
from datetime import datetime

class IrisClassifier(nn.Module):
    def __init__(self, input_size=4, hidden_size=64, num_classes=3):
        super(IrisClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

Architecture Details

  • Input layer: 4 features (sepal/petal length and width)
  • Hidden layers: 64 neurons each with ReLU activation
  • Dropout: 0.2 probability for regularization
  • Output layer: 3 classes (setosa, versicolor, virginica)
  • Parameters: 4→64→64→3 = 4,675 trainable parameters

Model Memory Requirements

  • Model weights: ~18KB (4,675 × 4 bytes per float32)
  • Forward pass: ~512 bytes per sample
  • Gradient storage: ~18KB additional during training
  • Total training memory: ~50KB per model instance
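A quick check of the numbers above from the model object itself (uses the IrisClassifier class defined in the previous block):

# Verify the parameter count quoted above: (4*64+64) + (64*64+64) + (64*3+3) = 4,675
model = IrisClassifier()
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {total_params}")            # 4675
print(f"Weight size: {total_params * 4 / 1024:.1f} KB")   # ~18 KB at float32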

S3 Integration Functions

Handle data loading and model persistence in cloud storage.

def load_data_from_s3(bucket_name, key):
    """Load training data from S3 with error handling"""
    try:
        s3_client = boto3.client('s3')
        print(f"Loading data from s3://{bucket_name}/{key}")
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
        data = pd.read_csv(obj['Body'])
        print(f"Successfully loaded {len(data)} rows, {len(data.columns)} columns")
        return data
    except Exception as e:
        print(f"Error loading data from S3: {str(e)}")
        print(f"Bucket: {bucket_name}, Key: {key}")
        raise

def save_model_to_s3(model, scaler, bucket_name, model_key, scaler_key):
    """Save trained model and scaler to S3"""
    s3_client = boto3.client('s3')
    
    # Save PyTorch model
    model_buffer = BytesIO()
    torch.save(model.state_dict(), model_buffer)
    model_buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=model_key,
        Body=model_buffer.getvalue()
    )
    
    # Save scaler
    scaler_buffer = BytesIO()
    joblib.dump(scaler, scaler_buffer)
    scaler_buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=scaler_key,
        Body=scaler_buffer.getvalue()
    )

S3 Operation Characteristics

  • Data loading: Streams CSV directly from S3 without local disk
  • Model saving: Serializes to memory buffer before S3 upload
  • Error handling: Explicit exception handling for network failures
  • Performance: ~20ms latency per S3 operation from EC2

Training Execution Pipeline

Complete training workflow with cloud data and model persistence.

def train_model():
    # Configuration
    bucket_name = 'ml-training-demo-12345'
    data_key = 'data/raw/iris.csv'
    
    # Load data from S3
    print("Loading data from S3...")
    data = load_data_from_s3(bucket_name, data_key)
    
    # Prepare features and labels
    X = data.drop('target', axis=1).values
    y = data['target'].values
    
    # Split and scale data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Convert to PyTorch tensors
    X_train_tensor = torch.FloatTensor(X_train_scaled)
    y_train_tensor = torch.LongTensor(y_train)
    X_test_tensor = torch.FloatTensor(X_test_scaled)
    y_test_tensor = torch.LongTensor(y_test)
    
    # Initialize model and training
    model = IrisClassifier()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    
    # Training loop with resource monitoring
    print("Starting training...")
    import psutil
    start_time = datetime.now()
    
    model.train()
    for epoch in range(100):
        optimizer.zero_grad()
        outputs = model(X_train_tensor)
        loss = criterion(outputs, y_train_tensor)
        loss.backward()
        optimizer.step()
        
        if (epoch + 1) % 20 == 0:
            memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
            print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}, Memory: {memory_mb:.1f}MB')
    
    # Evaluate model
    model.eval()
    with torch.no_grad():
        test_outputs = model(X_test_tensor)
        _, predicted = torch.max(test_outputs.data, 1)
        accuracy = (predicted == y_test_tensor).sum().item() / len(y_test_tensor)
        print(f'Test Accuracy: {accuracy:.4f}')
    
    # Save to S3
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    model_key = f'models/iris_classifier_{timestamp}.pth'
    scaler_key = f'models/scaler_{timestamp}.pkl'
    
    save_model_to_s3(model, scaler, bucket_name, model_key, scaler_key)
    
    # Training performance summary
    end_time = datetime.now()
    training_duration = (end_time - start_time).total_seconds()
    print(f"Training completed in {training_duration:.1f} seconds")
    print(f"Final accuracy: {accuracy:.4f}")
    print(f"Model saved to S3: {model_key}")
    
    return model, scaler, accuracy, training_duration

# Run training
if __name__ == "__main__":
    model, scaler, accuracy, duration = train_model()

Training Performance Characteristics

  • Data loading: ~20ms from S3 (150 samples, 5 columns)
  • Preprocessing: <1ms (StandardScaler transformation)
  • Training: ~2 seconds (100 epochs, 4,675 parameters)
  • Model saving: ~15ms (18KB model + 2KB scaler to S3)
  • Total pipeline: ~2.1 seconds end-to-end

Expected Output: Training progress logs, final accuracy metrics, confirmation of model artifacts saved to S3.

Flask API Server Implementation

HTTP API server loading models from S3 for inference.

from flask import Flask, request, jsonify
import torch
import boto3
import joblib
from io import BytesIO
import numpy as np

# IrisClassifier must be importable in this file; assuming the training
# script from earlier is saved next to this API as train_model.py:
from train_model import IrisClassifier

app = Flask(__name__)

# Global variables for model and scaler
model = None
scaler = None

def load_model_from_s3(bucket_name, model_key, scaler_key):
    """Load model and scaler from S3"""
    s3_client = boto3.client('s3')
    
    # Load PyTorch model
    model_obj = s3_client.get_object(Bucket=bucket_name, Key=model_key)
    model_buffer = BytesIO(model_obj['Body'].read())
    
    model = IrisClassifier()
    model.load_state_dict(torch.load(model_buffer, map_location='cpu'))
    model.eval()
    
    # Load scaler
    scaler_obj = s3_client.get_object(Bucket=bucket_name, Key=scaler_key)
    scaler_buffer = BytesIO(scaler_obj['Body'].read())
    scaler = joblib.load(scaler_buffer)
    
    return model, scaler

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({
        'status': 'healthy',
        'model_loaded': model is not None
    })

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Parse input data
        data = request.json
        features = np.array(data['features']).reshape(1, -1)
        
        # Scale features
        features_scaled = scaler.transform(features)
        
        # Make prediction
        with torch.no_grad():
            features_tensor = torch.FloatTensor(features_scaled)
            outputs = model(features_tensor)
            probabilities = torch.softmax(outputs, dim=1)
            predicted_class = torch.argmax(outputs, dim=1).item()
            confidence = probabilities[0][predicted_class].item()
        
        # Class names for Iris dataset
        class_names = ['setosa', 'versicolor', 'virginica']
        
        return jsonify({
            'predicted_class': class_names[predicted_class],
            'confidence': float(confidence),
            'probabilities': probabilities[0].tolist()
        })
        
    except Exception as e:
        return jsonify({'error': str(e)}), 400

# Initialize model on startup (replace the keys with those printed by your training run)
bucket_name = 'ml-training-demo-12345'
model_key = 'models/iris_classifier_20250916_143022.pth'
scaler_key = 'models/scaler_20250916_143022.pkl'

print("Loading model from S3...")
model, scaler = load_model_from_s3(bucket_name, model_key, scaler_key)
print("Model loaded successfully!")

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80, debug=True)

API Performance Characteristics

  • Model loading: ~35ms (18KB model + 2KB scaler from S3)
  • Inference latency: ~2ms per request (forward pass only)
  • Memory usage: ~25MB (Flask + PyTorch + loaded model)
  • Throughput: ~100 requests/second (single thread)

API Deployment and Testing

Deploy and validate ML inference API on EC2 instance.

Local API Testing

# Save API code as app.py
# Run Flask application
sudo python3 app.py

# Expected startup output:
Loading model from S3...
Model loaded successfully!
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:80
 * Running on http://10.0.1.100:80

# Test from another terminal
# Health check
curl http://localhost/health

# Expected response:
{
  "status": "healthy",
  "model_loaded": true
}

# Make prediction
curl -X POST http://localhost/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

# Expected response:
{
  "predicted_class": "setosa",
  "confidence": 0.9876,
  "probabilities": [0.9876, 0.0084, 0.0040]
}

Public Internet Access

# Update security group to allow HTTP traffic
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxxx \
    --protocol tcp \
    --port 80 \
    --cidr 0.0.0.0/0

# Test from external machine
curl http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com/health

# Load test with multiple requests
for i in {1..10}; do
  curl -X POST \
    http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com/predict \
    -H "Content-Type: application/json" \
    -d '{"features": [5.1, 3.5, 1.4, 0.2]}' &
done
wait

Error Handling Validation

# Test malformed request
curl -X POST http://localhost/predict \
     -H "Content-Type: application/json" \
     -d '{"invalid": "data"}'

# Expected error response (the message is the raw exception text):
{
  "error": "'features'"
}

# Test wrong feature count
curl -X POST http://localhost/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [5.1, 3.5]}'

# Expected error response (exact wording depends on the scikit-learn version):
{
  "error": "X has 2 features, but StandardScaler is expecting 4 features as input."
}

Performance Verification: API handles 100+ requests/second, <5ms response time, graceful error handling for malformed inputs.
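
A small concurrent load-test sketch for reproducing these numbers, using the third-party requests library; the endpoint URL, worker count, and request count are arbitrary choices:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = 'http://localhost/predict'                      # adjust to your instance
PAYLOAD = {'features': [5.1, 3.5, 1.4, 0.2]}

def one_request(_):
    start = time.perf_counter()
    r = requests.post(URL, json=PAYLOAD, timeout=5)
    return time.perf_counter() - start, r.status_code

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(one_request, range(200)))
elapsed = time.perf_counter() - start

latencies = sorted(t for t, _ in results)
print(f"Throughput: {len(results) / elapsed:.0f} req/s")
print(f"p50 latency: {latencies[len(latencies)//2] * 1000:.1f} ms")
print(f"p95 latency: {latencies[int(len(latencies) * 0.95)] * 1000:.1f} ms")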

System Monitoring Implementation

Monitor system performance and optimize costs for production use.

CloudWatch Integration

import boto3
from datetime import datetime

# Initialize CloudWatch client
cloudwatch = boto3.client('cloudwatch')

def publish_custom_metrics(accuracy, training_time):
    """Publish ML training metrics to CloudWatch"""
    
    # Model accuracy metric
    cloudwatch.put_metric_data(
        Namespace='ML/Training',
        MetricData=[
            {
                'MetricName': 'ModelAccuracy',
                'Value': accuracy,
                'Unit': 'Percent',
                'Dimensions': [
                    {
                        'Name': 'ModelType',
                        'Value': 'IrisClassifier'
                    }
                ]
            },
            {
                'MetricName': 'TrainingDuration',
                'Value': training_time,
                'Unit': 'Seconds',
                'Dimensions': [
                    {
                        'Name': 'InstanceType',
                        'Value': 't3.medium'
                    }
                ]
            }
        ]
    )

# Add to training script
start_time = datetime.now()
# ... training code ...
end_time = datetime.now()
training_duration = (end_time - start_time).total_seconds()

publish_custom_metrics(accuracy * 100, training_duration)

System Monitoring Commands

# Monitor instance performance
htop
iostat -x 1
df -h

# Check Docker resource usage
docker stats

# Monitor network connectivity
ping google.com
# Timing breakdown (requires a local curl-format.txt defining curl timing variables)
curl -w "@curl-format.txt" -o /dev/null -s http://httpbin.org/delay/2

Cost Analysis and Optimization

# Check current AWS costs
aws ce get-cost-and-usage \
    --time-period Start=2025-01-01,End=2025-01-31 \
    --granularity DAILY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE

# EC2 instance costs
aws ec2 describe-instances \
    --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name]' \
    --output table

# Total S3 object size in bytes (for estimating storage cost)
aws s3api list-objects-v2 \
    --bucket ml-training-demo-12345 \
    --query 'sum(Contents[].Size)' \
    --output text

How to Reduce Costs

Instance Management

  • Stop instances when not in use (saves compute costs)
  • Use Spot Instances for training workloads (70% discount)
  • Right-size instances based on actual usage
  • Schedule automatic start/stop with Lambda functions (see the sketch below)
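
A minimal sketch of the scheduled-stop idea in the last bullet: a Lambda handler, triggered on a schedule (for example via an EventBridge rule), that stops running instances tagged AutoStop=true. The tag name and trigger are assumptions for illustration:

import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Find running instances carrying the (assumed) AutoStop=true tag
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:AutoStop', 'Values': ['true']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = [
        instance['InstanceId']
        for reservation in response['Reservations']
        for instance in reservation['Instances']
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {'stopped': instance_ids}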

Storage Optimization

  • Delete intermediate training files after model training
  • Use S3 Lifecycle policies to archive old models (see the sketch after this list)
  • Compress large datasets before uploading
  • Monitor data transfer costs between services
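
One way to implement the lifecycle bullet above: a rule that moves model artifacts to Glacier after 30 days and deletes them after a year. The prefix and day counts are illustrative choices:

import boto3

s3 = boto3.client('s3')

# Transition models/ objects to Glacier after 30 days, expire after 365 days
s3.put_bucket_lifecycle_configuration(
    Bucket='ml-training-demo-12345',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-old-models',
                'Filter': {'Prefix': 'models/'},
                'Status': 'Enabled',
                'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
                'Expiration': {'Days': 365}
            }
        ]
    }
)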

Development Practices

  • Use smaller datasets for development and testing
  • Implement checkpointing to resume interrupted training
  • Clean up failed experiments and temporary files
  • Set up billing alerts for cost overruns

Expected Monthly Costs: t3.medium ($30), S3 storage ($5), data transfer ($10) = ~$45 for continuous operation.

Common Implementation Problems

Identify and resolve typical cloud development problems.

Connection and Access Issues

SSH Connection Failures

# Permission denied (publickey)
chmod 400 ml-training-key.pem
ssh -i ml-training-key.pem ubuntu@instance-ip

# Connection timeout
# Check security group allows SSH from your IP
aws ec2 describe-security-groups \
    --group-ids sg-xxxxxxxxx

# Add your current IP to security group
curl ifconfig.me  # Get your public IP
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxxx \
    --protocol tcp \
    --port 22 \
    --cidr your-ip/32

S3 Access Errors

# NoCredentialsError
aws configure list
aws sts get-caller-identity

# AccessDenied
aws iam get-user
aws s3 ls s3://bucket-name --debug

# Bucket region mismatch
aws s3api get-bucket-location --bucket bucket-name

Docker Issues

# Permission denied
sudo usermod -aG docker ubuntu
newgrp docker

# Docker daemon not running
sudo systemctl start docker
sudo systemctl enable docker

# Out of disk space
df -h
docker system prune -f

Performance and Resource Issues

Memory and CPU Constraints

# Monitor resource usage
free -h
cat /proc/cpuinfo | grep processor | wc -l
htop

# PyTorch out of memory
# Reduce batch size in training code
batch_size = 16  # Instead of 64

Network and Latency Issues

# Slow S3 transfers
# Use multipart upload for large files
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB

# Test network speed
wget -O /dev/null http://speedtest-sfo1.digitalocean.com/10mb.test

# DNS resolution issues
nslookup s3.amazonaws.com

Application Debugging

import logging
logging.basicConfig(level=logging.DEBUG)

# Add extensive error handling
try:
    data = load_data_from_s3(bucket_name, data_key)
except Exception as e:
    print(f"S3 Error: {str(e)}")
    print(f"Bucket: {bucket_name}, Key: {data_key}")
    raise

# Log training progress (GPU memory is only meaningful when CUDA is available)
gpu_bytes = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
print(f"Epoch {epoch}, Loss: {loss.item():.4f}, GPU memory: {gpu_bytes} bytes")

Cost Overrun Prevention

  • Set up billing alerts in AWS console
  • Use AWS Cost Explorer for usage analysis
  • Implement automatic instance shutdown after training
  • Monitor S3 storage growth and implement cleanup policies

Debugging Strategy: Check permissions first, verify network connectivity, monitor resource usage, implement comprehensive logging.

EC2+S3 System Reality

t3.medium Training: 47 Seconds vs 2 Seconds Local

A CPU-only EC2 instance trains roughly 22× slower than a local GPU workstation.

Demo System Configuration

  • EC2 t3.medium: 2 vCPU, 4GB RAM
  • Ubuntu 22.04 with PyTorch installation
  • S3 bucket for dataset and model storage
  • IAM role with S3 read/write permissions

CIFAR-10 ResNet-18 Performance

  • Local RTX 4090: 2.1 seconds/epoch
  • t3.medium CPU: 47 seconds/epoch
  • Slowdown factor: 22×

Training Duration Impact

  • 100-epoch training: 3.5 minutes vs 78 minutes
  • Hyperparameter grid search: Hours vs days
  • Interactive development impossible on EC2 CPU

Why t3.medium Fails for ML

  • No GPU acceleration
  • 4GB RAM limits batch size to 16-32 samples
  • Optimal batch size (256) requires 14GB RAM
  • CPU utilization: 100% but inefficient for tensor operations

GPU Instance Costs

Instance     vCPU   GPU       RAM     Cost/Hour
t3.medium    2      None      4GB     $0.042
p3.2xlarge   8      1×V100    61GB    $3.06
p3.8xlarge   32     4×V100    244GB   $12.24

Cost-Performance Analysis

  • t3.medium: 22× slower than the local RTX 4090, 73× cheaper per hour than p3.2xlarge
  • p3.2xlarge: ~1.2× faster than the local RTX 4090, at 73× the hourly cost of t3.medium

Break-even Usage

  • Cloud GPU cheaper below roughly 90 hours/month of GPU time
  • Above ~90 hours/month: local hardware cheaper (detailed economics later in this unit)
  • Above 100 hours/month sustained: consider spot pricing or reserved instances

Memory Requirements

  • ResNet-50: 8GB minimum
  • BERT-base: 12GB minimum
  • GPT-2 small: 16GB minimum
  • t3.medium cannot load production models

p3.2xlarge costs ~$73/day ($3.06/hour) in continuous operation vs $0 marginal cost for a local GPU after purchase.

S3 Data Loading: 12 Seconds vs 1.4 Seconds Local

Network storage introduces 8× slowdown for dataset loading.

CIFAR-10 Loading Performance (170MB dataset)

  • Local SSD: 1.4 seconds
  • S3 single-thread: 12 seconds
  • S3 multi-thread (8 workers): 4.3 seconds
  • EBS attached volume: 2.8 seconds

Network Latency Impact

  • Local file open: 0.02ms
  • S3 GetObject request: 20-50ms per file
  • Cross-AZ S3 access: +5-10ms latency
  • 1000 small files: 20-50 seconds vs 0.1 seconds local
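
A rough way to observe this gap on your own instance, assuming the demo bucket and the iris CSV from earlier; absolute timings vary with region, instance type, and object size:

import time
import boto3

s3 = boto3.client('s3')
bucket, key = 'ml-training-demo-12345', 'data/raw/iris.csv'

# S3 GetObject round trip (dominated by request latency for small files)
start = time.perf_counter()
body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
print(f"S3 read:    {(time.perf_counter() - start) * 1000:.1f} ms")

# Local file read after caching the object on disk
with open('/tmp/iris.csv', 'wb') as f:
    f.write(body)
start = time.perf_counter()
with open('/tmp/iris.csv', 'rb') as f:
    _ = f.read()
print(f"Local read: {(time.perf_counter() - start) * 1000:.3f} ms")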

Training Pipeline Bottlenecks

# Local development - continuous GPU utilization
for batch in DataLoader(dataset, batch_size=256):
    loss = model(batch)  # GPU busy 98% of time

# S3 streaming - GPU starvation  
for epoch in range(100):
    download_dataset_from_s3()  # 12 second delay
    for batch in cached_dataset:
        loss = model(batch)  # GPU idle during downloads

Checkpoint Saving Delays

  • Local model save: 50ms (instant)
  • S3 model upload: 800ms-2.1 seconds
  • Training interruption risk during uploads

Caching Strategies

EBS Volume Cache

  • Attach 100GB gp3 volume: $8/month
  • One-time dataset download: 12 seconds
  • Subsequent epochs: 2.8 seconds (local EBS speed)
  • Cache 10-20 datasets before cost equals S3

Instance Store (i3.large)

  • 475GB NVMe SSD included
  • Read speed: 1.9GB/s (faster than local)
  • Cost premium: $0.156/hour vs $0.042 t3.medium
  • Data lost on instance stop/termination

Parallel Object Downloads

import concurrent.futures
import os
import boto3

def parallel_download(bucket, prefix, workers=8):
    """Download every object under an S3 prefix with a thread pool."""
    s3 = boto3.client('s3')
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

    def download_one(obj):
        local_path = os.path.join('./data', obj['Key'])
        os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
        s3.download_file(bucket, obj['Key'], local_path)

    with concurrent.futures.ThreadPoolExecutor(workers) as executor:
        list(executor.map(download_one, objects.get('Contents', [])))

Cost of Data Movement

  • First 100GB/month: Free
  • Additional transfer: $0.09/GB
  • 1TB monthly egress: ~$81 after the free 100GB
  • Regional co-location essential

EBS caching reduces loading time to 2.8 seconds but requires manual cache management.

IAM Policy Errors Block S3 Access

Incorrect resource ARNs cause access denied errors.

Common IAM Mistakes

Wrong Resource ARN Format

{
    "Effect": "Allow",
    "Action": "s3:GetObject", 
    "Resource": "arn:aws:s3:::my-bucket"
}

Error: missing /* for object-level access.
Fix: "Resource": "arn:aws:s3:::my-bucket/*"

Missing List Permission

{
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
}

Error: cannot list bucket contents.
Fix: add the s3:ListBucket action scoped to the bucket ARN.

Overly Broad Permissions

{
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": "*"
}

Risk: grants access to every S3 bucket in the account.
Production: never use wildcard permissions.

Minimal Working IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::ml-training-bucket"
        },
        {
            "Effect": "Allow", 
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::ml-training-bucket/*"
        }
    ]
}

Debug IAM Issues

# Test S3 access
aws s3 ls s3://ml-training-bucket --profile demo

# Check effective permissions
aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::account:role/EC2-ML-Role \
    --action-names s3:GetObject \
    --resource-arns arn:aws:s3:::ml-training-bucket/data.csv

CloudTrail for Debugging

  • All S3 API calls logged with timestamps
  • Access denied events show exact error cause
  • Essential for production IAM debugging

IAM permissions require exact ARN matching - bucket vs object permissions commonly confused.

Single Instance: No Fault Tolerance

EC2 instance failure stops all training with no automatic recovery.

Failure Modes

  • Hardware failure: 2-5 minute detection + restart
  • Spot instance interruption: 2-minute warning
  • Software crash: Manual SSH required for diagnosis
  • AZ outage: Complete system unavailability
  • Network partition: Training stops, no automatic retry

Data Loss Scenarios

  • In-memory model state: Lost on any failure
  • /tmp directory: Cleared on restart
  • Training progress: Lost without S3 checkpointing
  • Logs and debugging info: Gone unless CloudWatch configured

Manual Recovery Process

  1. SSH to investigate failure cause (if instance accessible)
  2. Launch replacement instance manually
  3. Restore training environment from scratch
  4. Resume from last S3 checkpoint (if exists)
  5. Restart training job manually

Availability Calculation

  • Single EC2 instance: 99.5% uptime (AWS SLA)
  • Monthly downtime: 3.6 hours expected
  • Training interruption: 2-5 minutes recovery time
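
The downtime figure follows directly from the SLA percentage; a one-line check over a 30-day month:

# 99.5% uptime over a 30-day month (720 hours)
sla = 0.995
hours_per_month = 720
print(f"Expected downtime: {(1 - sla) * hours_per_month:.1f} hours/month")   # 3.6 hours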

High Availability Requirements

Auto Scaling Group

{
    "AutoScalingGroupName": "ml-training-asg",
    "MinSize": 1,
    "MaxSize": 3,
    "DesiredCapacity": 1,
    "HealthCheckType": "EC2",
    "HealthCheckGracePeriod": 300,
    "AvailabilityZones": ["us-east-1a", "us-east-1b"]
}

Application Load Balancer

  • Health check every 30 seconds
  • Automatic traffic routing to healthy instances
  • Multi-AZ deployment for zone failures

Training Job Resilience

import boto3
import torch

s3 = boto3.client('s3')

def checkpoint_training():
    # Save to S3 every epoch; model, optimizer, current_epoch, and
    # current_loss come from the surrounding training loop
    checkpoint = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'epoch': current_epoch,
        'loss': current_loss
    }
    torch.save(checkpoint, '/tmp/checkpoint.pth')
    s3.upload_file('/tmp/checkpoint.pth',
                   'bucket', f'checkpoints/epoch_{current_epoch}.pth')

def resume_training():
    # Resume from the latest S3 checkpoint
    # (find_latest_checkpoint_s3 stands in for a list-and-sort helper)
    latest_checkpoint = find_latest_checkpoint_s3()
    checkpoint = torch.load(latest_checkpoint)
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])
    return checkpoint['epoch']

Cost of High Availability

  • Single instance: $30/month
  • HA setup: $90/month (3× cost)
  • Load balancer: +$16/month
  • Total HA cost: $106/month vs $30 basic

Production systems require 3-5× cost increase for fault tolerance and automatic recovery.

Lambda Cold Starts: 2-8 Second Delays

Serverless model serving faces initialization delays absent in always-on systems.

Cold Start Performance

import torch
import boto3

# Module-level code runs once per container; this is the cold start:
# 1. Download model from S3 (1-4 seconds)
# 2. Load PyTorch model (0.5-2 seconds)
s3 = boto3.client('s3')
s3.download_file('bucket', 'model.pth', '/tmp/model.pth')
model = torch.load('/tmp/model.pth', map_location='cpu')
model.eval()

def lambda_handler(event, context):
    # Warm invocations reuse the cached model: inference only (10-50ms)
    with torch.no_grad():
        prediction = model(torch.tensor(event['input'], dtype=torch.float32))
    return {'prediction': prediction.tolist()}

Timing Breakdown

  • Container initialization: 200-500ms
  • Python runtime startup: 300-800ms
  • PyTorch import: 1-2 seconds
  • S3 model download: 1-4 seconds (depends on size)
  • Model loading: 0.5-2 seconds
  • Total cold start: 2-8 seconds

Warm Request Performance

  • Model cached in memory: 10-50ms response
  • No S3 download or model loading
  • Warm container reused for ~15 minutes

Always-On EC2 Alternative

from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

# Load model once at startup (not per request)
print("Loading model...")  # 2-4 seconds one-time
model = torch.load('model.pth', map_location='cpu')
model.eval()
print("Model ready")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    
    # No cold start - model already loaded
    with torch.no_grad():
        prediction = model(torch.tensor(data['features']))
    
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)

Performance Comparison

Approach       Cold Start    Warm Latency   Cost (1M req/month)
Lambda         2-8 seconds   15-50ms        $200
EC2 t3.micro   0ms           20-100ms       $350
EC2 c5.large   0ms           5-20ms         $720

When Lambda Makes Sense

  • Sporadic traffic (< 1000 requests/day)
  • Cost optimization priority
  • Can tolerate cold starts
  • Model size < 250MB

When EC2 Required

  • Consistent low latency needed
  • Large models (> 250MB)
  • High request volume (> 10,000/day)
  • Always-on user expectations

Serverless introduces 2-8 second initialization penalty vs 0ms for persistent servers.

Manual Instance Management vs Auto Scaling

Demo system requires manual start/stop vs production auto-scaling complexity.

Manual Operations

# Start training job
aws ec2 start-instances --instance-ids i-1234567890abcdef0

# SSH and run training
ssh -i key.pem ubuntu@instance-ip
python train_model.py

# Check progress manually
tail -f training.log

# Stop instance when done
aws ec2 stop-instances --instance-ids i-1234567890abcdef0

Manual Process Problems

  • Forget to stop instances → $73/day cost
  • Instance launch failures require restart
  • No automatic scaling for load changes
  • SSH access required for all operations
  • Training interruption if connection lost

Development Workflow

  • Start instance: 45 seconds boot time
  • Install dependencies: 2-5 minutes first time
  • Run training: Variable duration
  • Manual monitoring required
  • Manual termination after completion

Cost Control Issues (p3.2xlarge at $3.06/hour)

  • Forgotten for a day: ~$73 in unexpected cost
  • Forgotten over a weekend: ~$146
  • Instance type mistakes: launching p3.16xlarge ($24/hour) instead of the intended t3.medium

Auto Scaling Production Setup

# CloudFormation template
Resources:
  MLAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 0
      MaxSize: 10
      DesiredCapacity: 1
      LaunchTemplate:
        LaunchTemplateId: !Ref MLLaunchTemplate
        Version: !GetAtt MLLaunchTemplate.LatestVersionNumber
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2

  MLLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate  
    Properties:
      LaunchTemplateData:
        ImageId: ami-0c02fb55956c7d316
        InstanceType: p3.2xlarge
        IamInstanceProfile:
          Arn: !GetAtt MLInstanceProfile.Arn
        UserData:
          Fn::Base64: |
            #!/bin/bash
            aws s3 cp s3://ml-bucket/train.py /home/ubuntu/
            cd /home/ubuntu && python3 train.py
            shutdown -h now  # Auto-terminate when done

Auto Scaling Benefits

  • Automatic instance replacement on failure
  • Scale based on queue depth or metrics (see the policy sketch after this list)
  • No manual intervention required
  • Automatic cost optimization (scale to zero)
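
A minimal sketch of the metric-based scaling mentioned above: a boto3 call attaching a target-tracking policy to the demo Auto Scaling group. The policy name and 70% CPU target are illustrative choices, not values from the demo:

import boto3

autoscaling = boto3.client('autoscaling')

# Keep the group's average CPU utilization near 70%; the ASG adds or
# removes instances automatically to hold the target.
autoscaling.put_scaling_policy(
    AutoScalingGroupName='ml-training-asg',
    PolicyName='cpu-target-70',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization'
        },
        'TargetValue': 70.0
    }
)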

Production Complexity

  • Infrastructure as code required
  • Health checks and monitoring setup
  • Load balancer configuration
  • Service discovery for distributed training
  • Setup effort: 5-10× the time of the manual approach

Production auto-scaling requires infrastructure complexity but eliminates manual operations and cost overruns.

100× Operational Overhead: Development vs Production

Production deployment multiplies operational requirements by 100×.

Demo System Operations

Weekly Effort: 1-2 Hours

  • Launch instance when needed
  • SSH and start training
  • Check CloudWatch logs for errors
  • Download results from S3
  • Stop instance manually

Tools Required

  • AWS CLI for instance management
  • SSH client for remote access
  • Basic S3 commands for data transfer
  • CloudWatch console for log viewing

Failure Recovery

  • Restart failed instances manually
  • Re-run training from beginning
  • Debug via SSH and log inspection
  • No monitoring or alerting

Security Model

  • Single IAM role with broad permissions
  • Default VPC with basic security groups
  • No encryption or compliance considerations
  • Developer access keys with full privileges

Cost Management

  • Manual instance start/stop
  • Basic billing alerts at account level
  • No cost allocation or project tracking

Production System Requirements

Weekly Effort: 15-20 Hours

  • Infrastructure monitoring and maintenance
  • Security patch management
  • Cost optimization analysis
  • Performance tuning and debugging
  • Incident response and resolution

Enterprise Operations Stack

# Infrastructure as Code
terraform plan && terraform apply

# Monitoring and Alerting
kubectl apply -f prometheus-config.yaml
aws cloudwatch put-metric-alarm --alarm-name "High-GPU-Usage"   # plus metric, threshold, and action flags

# Security Compliance
aws configservice start-configuration-recorder --configuration-recorder-name default
aws guardduty create-detector --enable

# Cost Management
aws budgets create-budget \
    --account-id $(aws sts get-caller-identity --query Account --output text) \
    --budget file://ml-budget.json
aws ce get-cost-and-usage --time-period Start=2025-01-01,End=2025-02-01 \
    --granularity MONTHLY --metrics BlendedCost

Production Requirements

  • 24/7 monitoring and alerting
  • Automated backup and disaster recovery
  • Multi-region deployment for availability
  • Role-based access control (RBAC)
  • Encryption at rest and in transit
  • Compliance auditing and reporting
  • Load testing and capacity planning
  • A/B testing and gradual rollouts

Team Structure

  • DevOps engineer: Infrastructure management
  • SRE: Monitoring and incident response
  • Security engineer: Compliance and auditing
  • ML engineer: Model development and optimization

Production ML systems require dedicated operations team vs single developer for demo system.

p3.2xlarge Economics: $2,196 Monthly vs $400 Local GPU

GPU instance costs exceed local workstation after 3 weeks continuous operation.

Cost Comparison Analysis

AWS p3.2xlarge (1× NVIDIA V100)

  • Hourly rate: $3.06
  • Monthly (24/7): $2,196
  • Yearly (24/7): $26,356
  • Performance: 1.2× local RTX 4090

Local RTX 4090 Workstation

  • Hardware cost: $4,500 (GPU + system)
  • Electricity: $150/month (600W × 24/7)
  • Total monthly: $150 + depreciation
  • 3-year amortization: $125/month hardware
  • Total local cost: $275/month

Break-even Analysis

  • Monthly crossover: 90 hours GPU usage
  • p3.2xlarge profitable: <90 hours/month
  • Local profitable: >90 hours/month
  • Daily break-even: 3 hours/day
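
The 90-hour crossover is the local workstation's fixed monthly cost divided by the hourly cloud rate:

local_monthly = 275.0      # $/month: amortized hardware + electricity
on_demand = 3.06           # $/hour: p3.2xlarge on-demand
spot = 0.92                # $/hour: typical p3.2xlarge spot price

print(f"On-demand break-even: {local_monthly / on_demand:.0f} hours/month")   # ~90
print(f"Spot break-even:      {local_monthly / spot:.0f} hours/month")        # ~300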

Spot Instance Pricing

  • Spot discount: 70% typical
  • p3.2xlarge spot: $0.92/hour average
  • Monthly spot (24/7): $659
  • Break-even vs local: 300 hours/month

Usage Pattern Economics

Intermittent Research (20 hours/month)

  • p3.2xlarge on-demand: $61.20
  • Local alternative: $275 (hardware + electricity)
  • Cloud savings: $213.80/month

Heavy Development (200 hours/month)

  • p3.2xlarge on-demand: $612
  • p3.2xlarge spot: $184
  • Local alternative: $275
  • Local saves $337/month vs on-demand; spot pricing ($184) still undercuts local at this usage

Continuous Production (720 hours/month)

  • p3.2xlarge on-demand: $2,196
  • p3.2xlarge spot: $659
  • p3.2xlarge reserved (1-year, ~40% discount): $1,317
  • Local alternative: $275
  • Local savings: $384-1,921/month

GPU Performance Comparison

  • RTX 4090: 83 TFLOPS (FP16)
  • Tesla V100: 125 TFLOPS (FP16)
  • V100 memory: 16GB HBM2
  • RTX 4090 memory: 24GB GDDR6X
  • V100 advantage: 50% compute, 33% less memory

Reserved Instance Strategy

  • 1-year commitment: 40% discount
  • 3-year commitment: 60% discount
  • Requires accurate usage forecasting
  • No flexibility for changing requirements

GPU instances cost-effective below 90 hours/month; above this threshold local hardware provides 60-85% savings.

When EC2+S3 Architecture Fails

Specific technical constraints where demonstrated architecture becomes inadequate.

Memory Constraints

  • GPT-2 medium (774M parameters): 3.1GB model weights
  • GPT-3.5 equivalent: ~13GB model weights
  • Llama-2 70B: 140GB model weights
  • Single r5.24xlarge instance: 768GB RAM maximum
  • Solution: Multi-instance model parallelism

Training Scale Limits

  • Single p3.2xlarge: 1 GPU, 61GB RAM
  • ImageNet training: Acceptable (24 hours)
  • GPT-3 scale training: 1024+ GPUs required
  • EC2 constraint: Manual cluster management
  • Solution: EKS or SageMaker managed training

Request Rate Bottlenecks

  • Single EC2 instance: ~1000 requests/second maximum
  • Load balancer + Auto Scaling: ~10,000 requests/second
  • Global scale: 100,000+ requests/second required
  • Bottleneck: Database and backend services
  • Solution: Microservices + CDN architecture

Data Processing Limits

  • S3 throughput: 5,500 requests/second per prefix
  • Large training job: 1000 GPUs × 10 requests/second = 10,000 req/s
  • S3 constraint: Request rate exceeds limits
  • Solution: Data sharding across prefixes or local caching
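
A common workaround for the per-prefix request limit (the last item above): spread objects across many key prefixes, for example by hashing the file name, so aggregate request rate scales with the number of prefixes. The shard count and key layout are illustrative:

import hashlib

def sharded_key(filename, num_shards=64):
    """Prepend a hash-based shard prefix so reads spread across S3 prefixes."""
    shard = int(hashlib.md5(filename.encode()).hexdigest(), 16) % num_shards
    return f"shard-{shard:02d}/{filename}"

# 64 prefixes × 5,500 GET/s each ≈ 350,000 GET/s aggregate capacity
print(sharded_key('train/batch_000123.tfrecord'))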

Alternative Architecture Patterns

Kubernetes + GPU Operators

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-cluster
spec:
  replicas: 16
  selector:
    matchLabels:
      app: pytorch-training
  template:
    metadata:
      labels:
        app: pytorch-training
    spec:
      containers:
      - name: pytorch-training
        image: pytorch/pytorch:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 61Gi

Managed ML Services

  • SageMaker Training: Automatic cluster management
  • SageMaker Endpoints: Auto-scaling inference
  • Cost: 20-30% premium vs EC2, but operational savings
  • Suitable for teams >5 people

Serverless Data Processing

  • AWS Batch: Managed job queues
  • Step Functions: Workflow orchestration
  • Lambda: Event-driven preprocessing
  • Cost-effective for intermittent workloads

When to Migrate from EC2+S3

  • Training jobs require >4 GPUs simultaneously
  • Inference SLA requires <50ms latency globally
  • Team >10 people need shared infrastructure
  • Compliance requires advanced security controls
  • Cost optimization needs automated resource management

Migration Triggers

  • Manual scaling becomes operational bottleneck
  • Security requirements exceed basic IAM policies
  • Multi-region deployment needed for latency
  • Training coordination requires job scheduling
  • Model serving needs A/B testing capabilities

EC2+S3 architecture optimal for single-developer ML projects; enterprise scale requires orchestration platforms and managed services.