Cloud Computing Fundamentals

EE 547 - Unit 3

Dr. Brandon Franzke

Fall 2025

Local Infrastructure Scaling Limits

Single-Machine Memory and Compute Constraints

Development environments constrain ML system capabilities.

Typical Development Setup (2024)

  • 32GB RAM (64GB in higher-end configurations)
  • 8-12 CPU cores
  • Local SSD storage: 1-2TB
  • Single GPU: RTX 4090 (24GB VRAM)
  • Development cost: $3,000-$5,000

Where constraints bind:

  1. Memory: 100GB+ datasets exceed RAM capacity → OOM kills training process
  2. Storage: Multi-TB datasets fill local drives → Training stops mid-epoch
  3. Compute: Model training requires days/weeks → Development velocity drops 10x
  4. GPU memory: Large models exceed 24GB VRAM → Transformer training impossible
  5. Serving: Cannot handle 1000+ concurrent users → Request failures above ~10 requests/second

Real ML systems require infrastructure that scales beyond individual machines.

Production Workloads Require Distributed Infrastructure

Production workloads exceed development capabilities by orders of magnitude.

Large Language Model Training

  • 1,024+ GPUs running for weeks
  • 8TB+ aggregate GPU memory
  • 10TB+ training data with constant I/O
  • 400+ Gbps GPU interconnect bandwidth
  • $1-5 million per training run

Production Model Serving

  • 10K-1M+ requests per second
  • <100ms response time requirements
  • 99.9%+ uptime (under 45 minutes downtime/month)
  • Global deployment with data residency constraints
  • Variable cost based on unpredictable traffic peaks

Real-Time Processing Pipelines

  • TB/day data ingestion from multiple sources
  • Feature extraction and transformation processing
  • Millisecond inference decisions
  • Hot data, cold archives, backup storage
  • System health and model drift monitoring

Production workloads require entirely different infrastructure architectures, not scaled development setups.

Local Development Constraints Prevent Production Deployment

Concrete example of where local development assumptions break.

Development Phase (5 engineers, MacBook Pros)

  • Training YOLOv8 on 10K labeled images
  • Local training time: 4 hours per experiment
  • Storage: 50GB dataset fits on local SSDs
  • Cost: $25K in laptops

Production Requirements

  • 1M+ labeled images from customer data
  • Real-time inference: <50ms latency globally
  • Traffic: 10K requests/second peak
  • Deployment: US, Europe, Asia simultaneously

Failure Points

  1. Data pipeline: 500GB dataset cannot fit in memory → Swap thrashing kills performance
  2. Training time: 2 weeks per experiment vs 4 hours locally → ~84x slower iteration cycle
  3. GPU memory: Models require 48GB VRAM, have 24GB → CUDA out of memory errors
  4. Serving latency: Global users see 300ms+ latency from US servers → 6x SLA violation
  5. Infrastructure cost: $2M upfront hardware vs $50K/month cloud → 40x capital requirement

Local development assumptions break at cloud scale: datasets exceed memory, training times become prohibitive, single-point failures affect global users.

Resource Pooling Economics

Datacenters achieve cost efficiencies impossible for individual organizations.

Individual Company Infrastructure

  • Purchase servers for peak capacity → 3-5 year depreciation regardless of usage
  • Maintain datacenter facilities → $500K+ annual facility costs
  • Staff specialized operations teams → $200K+ per systems engineer
  • Handle hardware failures independently → 24-48 hour repair time
  • Plan capacity years in advance → 50% over-provisioning for growth

Utilization: 20-30% average with 100% fixed costs → 70% resource waste

Example Startup ML Training

  • Peak need: 100 GPUs for 1 week/month
  • Required purchase: 100 GPUs × $15K = $1.5M
  • Utilization: 25% (3 weeks idle)
  • Annual cost: $1.5M + datacenter + operations staff

Hyperscaler Infrastructure (AWS, Google, Microsoft)

  • 1M+ servers per provider
  • Resource pooling across thousands of customers
  • Automated management at scale
  • Geographic distribution reduces latency
  • Specialized operations expertise

Result: Rent exactly required resources when needed.

Same Startup with Cloud

  • Rent: 100 GPUs for 1 week = $15K/month
  • Annual cost: $180K vs $1.5M+ ownership
  • Zero idle capacity, no operations overhead

Economics: Hyperscalers achieve 10-20x cost efficiency through scale, specialization, and resource pooling.

Service Layers Abstract Hardware Management

Cloud providers abstract physical complexity into consumable services.

Physical Infrastructure Layer

  • Datacenters: 100K+ servers per facility
  • Networking: 100+ Tbps backbone connectivity
  • Power: Megawatts electrical capacity with redundancy
  • Cooling: Industrial-scale temperature control
  • Security: Physical access controls, biometrics

Virtualization Layer

  • Hypervisors: Multiple virtual machines per physical server
  • Resource isolation: CPU, memory, storage quotas per VM
  • Live migration: Move VMs between physical hosts
  • Resource scheduling: Optimize utilization across fleet

Service Layer

  • Compute: Virtual machines, containers, serverless functions
  • Storage: Object stores, databases, file systems
  • Network: Load balancers, CDNs, private networks
  • Management: Monitoring, logging, billing, security

Application Layer

  • Data pipelines: ETL, feature engineering
  • Model training: Distributed training frameworks
  • Model serving: APIs, batch inference
  • Monitoring: Model performance, data drift

Each layer abstracts thousands of operational details. Application development consumes services without managing underlying infrastructure.

Market Competition Drives Service Innovation

Competition between AWS, Google Cloud, and Microsoft Azure drives innovation and price reductions.

Market Share and Positioning (2024)

Provider         Market Share   Strengths                                ML Focus
AWS              32%            Service breadth, enterprise adoption     SageMaker, comprehensive ML tools
Microsoft Azure  23%            Enterprise integration, hybrid cloud     Azure ML, enterprise AI
Google Cloud     11%            ML/AI innovation, data analytics         Vertex AI, TensorFlow integration
Others           34%            Specialized services, regional players   Various

Competitive Pressures

  1. Regular price cuts to match competitors
  2. New services launched monthly
  3. Performance improvements: faster CPUs, newer GPUs
  4. Geographic expansion: global datacenter buildouts
  5. ML/AI specialization: dedicated hardware and services

Competition Results: 75% price reduction over 10 years, specialized ML hardware, new capabilities quarterly, multiple viable providers prevent vendor lock-in.

Pay-per-Use vs Fixed Infrastructure Costs

Cloud fundamentally changes IT spending from capital investment to operational expense.

Traditional Model: Capital Expenditure

Upfront Investment Requirements

  • Purchase servers, storage, networking equipment
  • Build or lease datacenter space
  • Hire operations and maintenance staff
  • Plan capacity for 3-5 year hardware lifecycle

Financial Characteristics

  • Large upfront costs ($100K-$10M+)
  • Hardware depreciation over 3-5 years
  • Fixed costs regardless of utilization
  • Difficult to scale resources up or down
  • Requires accurate long-term demand forecasting

Example: Startup Scaling Challenge

  • Year 0: Purchase $500K GPU servers for anticipated growth
  • Year 1: Using only 20% of capacity (wasted $400K)
  • Year 2: Need 5x capacity, but hardware already purchased
  • Year 3: Original hardware obsolete, must purchase again

Cloud Model: Operating Expenditure

Pay-as-you-go Model

  • Rent computing resources by hour/minute
  • Scale resources up/down based on actual demand
  • Zero upfront hardware investment
  • Provider handles all operations and maintenance

Financial Characteristics

  • Zero upfront costs (start at $0)
  • Monthly bills based on actual resource usage
  • Variable costs that scale with business growth
  • Easy to experiment and pivot directions
  • Budget aligns with revenue growth

Same Startup Example with Cloud

  • Year 0: Start with $100/month for prototypes
  • Year 1: Scale to $5K/month as usage grows
  • Year 2: Scale to $25K/month for higher usage
  • Year 3: Latest GPU hardware automatically available

OpEx model aligns IT costs with business growth, reducing financial risk and enabling rapid experimentation.

Provider Capacity Exceeds Individual Requirements

Cloud providers maintain resource pools orders of magnitude larger than individual user needs.

AWS Global Infrastructure (2024)

  • Compute: 1M+ physical servers across fleet
  • Storage: 100+ exabytes total capacity
  • Network: 400+ Tbps global backbone bandwidth
  • GPUs: 10K+ H100 equivalents for ML workloads
  • Geographic: 33 regions, 105 availability zones
  • CDN: 450+ edge locations worldwide

Practical Implications

  1. GPU availability: 100 GPUs for training available in minutes
  2. Storage capacity: Multi-TB datasets stored without constraint
  3. Global deployment: Applications deployed worldwide instantly
  4. Traffic handling: 10x traffic surges handled automatically
  5. Disaster recovery: Primary region failure triggers automatic backup

Large Model Training Example

  • Local constraint: Limited to 1-8 GPUs maximum
  • Cloud capability: 100+ GPU cluster available in <30 minutes
  • Cost model: Pay only for actual training time (hours vs years of ownership)

Cloud resources appear unlimited because total provider capacity exceeds individual user needs by orders of magnitude. This enables entirely new categories of ML experiments and applications.

Distributed Architecture Increases Operational Complexity

Cloud computing provides massive capabilities while introducing operational complexity.

Cloud Computing Capabilities

Massive Scalability

  • Virtually unlimited compute, storage, networking access
  • Global deployment in minutes vs months
  • Automatic scaling based on demand patterns

Cost Efficiency

  • Pay only for resources actually consumed
  • Zero upfront capital investment required
  • Economies of scale pricing advantages

Operational Simplicity

  • No hardware maintenance or datacenter operations
  • Automated backups, security patches, monitoring
  • Expert-managed infrastructure operations

Innovation Access

  • Latest hardware available immediately
  • New services and capabilities added continuously
  • Focus on application logic vs infrastructure management

Global Reach

  • Deploy applications worldwide instantly
  • Content delivery networks reduce user latency
  • Compliance with regional data regulations

Required Complexity Management

New Technical Skills

  • Distributed systems concepts and failure modes
  • Cloud service APIs and configuration interfaces
  • Network security and access control systems
  • Multi-service monitoring and debugging techniques

Architecture Changes

  • Design for service-oriented architectures
  • Handle network failures and retry logic
  • Consider costs in all architectural decisions
  • Plan for eventual consistency across services

Operational Overhead

  • Manage dependencies between multiple services
  • Understand billing models and cost optimization
  • Security across multiple service boundaries
  • Troubleshoot failures across distributed systems

Vendor Management

  • Service-specific knowledge (AWS vs Azure vs GCP)
  • Potential vendor lock-in with specialized services
  • Track rapidly evolving platform capabilities
  • Manage multiple service accounts and billing

Cloud computing provides extraordinary capabilities, but success requires learning new concepts and managing operational complexity. Benefits outweigh costs for most production ML applications.

Network-Based Services Replace Local File Access

Cloud programming assumes distributed services rather than single-machine execution.

Local Development Model

# Single-machine assumptions
import torch
import pandas as pd
from flask import Flask

app = Flask(__name__)

# Load data (assumes local files)
data = pd.read_csv('dataset.csv')

# Train model (uses local GPU/CPU); train_model is project-specific
model = train_model(data)

# Save result (local filesystem)
torch.save(model, 'model.pth')

# Serve predictions (single process)
app.run(host='localhost', port=5000)

Assumptions

  • Unlimited local storage access
  • Reliable single machine operation
  • Direct file system access
  • No network latency considerations
  • Single point of failure acceptable

Cloud-Native Development Model

# Distributed service assumptions
import boto3

s3 = boto3.client('s3')

# Load data (from cloud storage)
s3.download_file('bucket', 'dataset.csv', '/tmp/data.csv')

# Train model (on cloud compute)
# (ec2_instance and lambda_function below are illustrative wrappers, not boto3 APIs)
ec2_instance.run_training_job(
    data_location='s3://bucket/dataset.csv'
)

# Save result (to cloud storage)
s3.upload_file('model.pth', 'bucket', 'models/v1.pth')

# Serve predictions (managed service)
lambda_function.deploy(
    model_path='s3://bucket/models/v1.pth'
)

New Assumptions

  • Data stored remotely with network I/O
  • Multiple services fail independently
  • Network latency affects performance
  • Security and permissions required
  • Cost proportional to usage patterns

Cloud development requires designing for network latency, service failures, and distributed data flows.

Distributed Systems Failure Modes

Distributed systems introduce complexity not present in local development.

Network Reliability Constraints

  • Services temporarily unavailable (timeouts, retries required)
  • Data transfer bandwidth and latency limits
  • Must handle connection failures gracefully

Security Requirements Everywhere

  • Access permissions for every service interaction
  • Data encryption in transit and at rest
  • Network security groups and firewall rules

Usage-Based Cost Model

  • Every API call, data transfer, compute hour costs money
  • Poor architectural choices become expensive quickly
  • Continuous monitoring and optimization required

Distributed Debugging Complexity

  • Errors occur across multiple services simultaneously
  • Logs distributed across different systems
  • Troubleshooting requires understanding service interactions

Why This Complexity Exists

Complexity results from solving problems that do not exist in local development:

  • Multi-tenancy: Code runs alongside thousands of other users
  • Global distribution: Data and compute span continents
  • Fault tolerance: Systems handle component failures gracefully
  • Security: Protection against sophisticated attacks and compliance

Cloud development trades local simplicity for global scale and distributed system capabilities.

Cloud Infrastructure and Services

AWS Global Infrastructure: Regions and Availability Zones

Cloud services run on geographically distributed datacenters with specific failure and latency characteristics.

AWS Regions (33 worldwide as of 2024)

Definition: Isolated geographic areas containing multiple datacenters

  • North America: us-east-1 (Virginia), us-west-2 (Oregon), ca-central-1 (Canada)
  • Europe: eu-west-1 (Ireland), eu-central-1 (Frankfurt), eu-north-1 (Stockholm)
  • Asia Pacific: ap-southeast-1 (Singapore), ap-northeast-1 (Tokyo), ap-south-1 (Mumbai)

Region Characteristics

  • Isolation: Complete independence - no shared infrastructure
  • Latency: 150-300ms between distant regions (US-Asia)
  • Compliance: Data residency laws require specific regions
  • Services: Not all AWS services available in all regions
  • Pricing: Different costs per region (Tokyo 20% more expensive than Virginia)

Availability Zones per Region (2-6 AZs)

  • Definition: Separate datacenters within a region
  • Physical separation: 10+ miles apart, separate power/cooling
  • Network: <10ms latency between AZs in same region
  • Failure isolation: AZ failures don’t affect other AZs
  • Examples: us-east-1a, us-east-1b, us-east-1c (Virginia region)

Infrastructure Hierarchy

Global Infrastructure
├── AWS Regions (33)
│   ├── us-east-1 (Virginia)
│   │   ├── us-east-1a (AZ)
│   │   ├── us-east-1b (AZ)
│   │   ├── us-east-1c (AZ)
│   │   ├── us-east-1d (AZ)
│   │   ├── us-east-1e (AZ)
│   │   └── us-east-1f (AZ)
│   ├── us-west-2 (Oregon)
│   │   ├── us-west-2a (AZ)
│   │   ├── us-west-2b (AZ)
│   │   ├── us-west-2c (AZ)
│   │   └── us-west-2d (AZ)
│   ├── eu-west-1 (Ireland)
│   │   ├── eu-west-1a (AZ)
│   │   ├── eu-west-1b (AZ)
│   │   └── eu-west-1c (AZ)
│   ├── ap-southeast-1 (Singapore)
│   │   ├── ap-southeast-1a (AZ)
│   │   ├── ap-southeast-1b (AZ)
│   │   └── ap-southeast-1c (AZ)
│   └── ... (29 more regions)
└── Edge Locations (450+)
    ├── CloudFront CDN
    └── Global Content Delivery

ML System Design Implications

Data Residency Constraints

  • European GDPR: EU citizen data must stay in EU regions
  • Chinese data sovereignty: cn-north-1, cn-northwest-1 required
  • US government: AWS GovCloud (us-gov-east-1, us-gov-west-1)

Multi-AZ Architecture for Availability

  • Training: Data in S3 replicated across AZs automatically
  • Inference: Load balancer distributes across AZ-deployed instances
  • Database: RDS Multi-AZ failover in <60 seconds

Cost vs Latency Trade-offs

  • us-east-1, us-west-2: Cheapest regions, broadest AWS service availability
  • ap-northeast-1: 20-30% more expensive, required for Japan users
  • Cross-region data transfer: $0.09/GB (expensive for large datasets)

ML systems must account for region selection based on data residency, user latency, service availability, and cost constraints.
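
The region and AZ hierarchy above is queryable directly; a minimal boto3 sketch (assuming AWS credentials are already configured) that lists the regions visible to an account and the AZs in one region:

import boto3

# Region name selects the API endpoint; credentials come from the environment
ec2 = boto3.client('ec2', region_name='us-east-1')

regions = [r['RegionName'] for r in ec2.describe_regions()['Regions']]
print(len(regions), 'regions visible to this account')

# Availability zones within the selected region (e.g., us-east-1a ... us-east-1f)
for az in ec2.describe_availability_zones()['AvailabilityZones']:
    print(az['ZoneName'], az['State'])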

Resource Placement: Cost and Latency Trade-offs

Cross-AZ data transfer costs create trade-offs between cost and availability for large ML datasets.

ImageNet Training Cost Impact (1.3TB dataset)

Same AZ Placement

  • Training and database: us-east-1a
  • Data transfer cost: $0
  • Risk: Single AZ failure stops training

Cross-AZ Placement

  • Training: us-east-1a, Database: us-east-1b
  • Cross-AZ transfer: $0.01/GB each direction
  • ImageNet daily training: 1.3TB × $0.01 = $13/day
  • Monthly cost: $400 additional for cross-AZ data access
  • Benefit: Training continues during AZ failure

The Trade-off

  • Same AZ: $0 transfer cost, single point of failure
  • Cross-AZ: $400/month cost, survives AZ outages

Production Architecture Decisions

Training Workloads

  • Co-locate compute and data in same AZ
  • Accept single AZ risk to avoid $400/month transfer costs
  • Use S3 checkpointing for recovery

Inference Services

  • Multi-AZ load balancing for availability
  • Smaller data transfers make cross-AZ costs acceptable
  • Database: $0.20/day for 10GB daily queries

Cross-Region Costs

  • ImageNet replication: 1.3TB × $0.09/GB = $117 one-time
  • Used only for disaster recovery, not daily access

Cross-AZ data transfer at $0.01/GB makes dataset placement a key decision for large-scale ML training.
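
A quick helper for the arithmetic above, using the $0.01/GB cross-AZ rate from this example (actual rates vary; the function name is illustrative):

def cross_az_monthly_cost(dataset_gb, price_per_gb=0.01, passes_per_day=1, days=30):
    """Monthly cost of repeatedly reading a dataset across an AZ boundary."""
    return dataset_gb * price_per_gb * passes_per_day * days

# ImageNet-scale example: 1,300 GB read once per day ≈ $390/month (the ~$400 above)
print(cross_az_monthly_cost(1300))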

Network Latency Replaces Deterministic Local Access

Distributed systems replace instant local operations with network requests.

Local Development Assumptions

  • File read: 0.1ms from SSD
  • Memory access: 0.001ms RAM lookup
  • Function call: 0.0001ms CPU instruction
  • Database query: 1ms SQLite local file

Network Operation Reality

  • S3 object read: 20-50ms average latency
  • EC2 to RDS query: 1-5ms within AZ, 15-25ms cross-AZ
  • Service-to-service API call: 10-100ms depending on load
  • Cross-region data transfer: 150-300ms transcontinental

ML Training Pipeline Impact

  • Local batch loading: 50ms per 1000 images
  • S3 batch loading: 200-500ms per 1000 images
  • Result: 4-10× slower data pipeline, GPU starvation
  • Distributed training coordination: +200ms per epoch synchronization

Network operations introduce 100-2000× latency increase over local operations, requiring different software design patterns.
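
To see the gap on your own setup, a rough timing sketch (the bucket, key, and local file path are placeholders; a single sample is not a benchmark, but the order-of-magnitude difference shows up immediately):

import time
import boto3

s3 = boto3.client('s3')

t0 = time.perf_counter()
with open('/tmp/sample.bin', 'rb') as f:       # local SSD read
    f.read()
local_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
s3.get_object(Bucket='my-ml-bucket', Key='sample.bin')['Body'].read()   # network read
s3_ms = (time.perf_counter() - t0) * 1000

print(f"local: {local_ms:.2f} ms   S3: {s3_ms:.1f} ms   ratio: {s3_ms / local_ms:.0f}x")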

Partial Failures Require New Error Handling

Distributed systems fail differently than single machines.

Single Machine Failure Model

  • Process crash: Complete system failure
  • Out of memory: Entire application stops
  • Disk full: All operations fail immediately
  • Network down: No external connectivity

Recovery: Restart entire system, reload from disk

Distributed System Failure Model

  • Partial node failure: 2 out of 8 training nodes crash
  • Network partition: East coast can’t reach West coast servers
  • Service degradation: S3 returns 10% error rate, not 100%
  • Cascading failures: Database overload causes API timeouts

ML Training Example

8-GPU distributed training job:

  • GPU 3 fails at epoch 47 of 100
  • Options: Stop all GPUs (waste 6 hours) or continue with 7 GPUs
  • Gradient synchronization must handle missing node
  • Checkpoint frequency determines maximum lost work

Error Handling Complexity

# Local development - simple error handling
try:
    data = load_training_data('dataset.csv')
    model = train_model(data)
    save_model(model, 'model.pth')
except Exception as e:
    print("Training failed, restart from beginning")

# Distributed training - complex error handling
# (NodeFailure, NetworkPartition, ServiceDegradation and the helper functions are application-defined)
try:
    nodes = discover_healthy_training_nodes()
    if len(nodes) < MIN_NODES:
        wait_for_node_recovery()
    
    checkpoint = load_latest_checkpoint_if_exists()
    model = train_distributed(data, nodes, checkpoint)
    
except NodeFailure as e:
    # Continue with remaining nodes or wait for replacement
    handle_node_failure(e.failed_node)
except NetworkPartition as e:
    # Pause training until partition heals
    wait_for_network_recovery()
except ServiceDegradation as e:
    # Retry with exponential backoff
    retry_with_backoff(e.failing_service)

Failure Probability Math

  • Single machine: 99.9% monthly uptime
  • 8-machine system: (0.999)^8 = 99.2% all nodes healthy
  • Result: 8× higher chance of partial system failure

Distributed systems require application logic to handle partial failures that never occur in single-machine development.
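
The uptime arithmetic generalizes to any cluster size; a two-function sketch:

def all_nodes_healthy(uptime=0.999, nodes=8):
    """Probability that every node is up at once, assuming independent failures."""
    return uptime ** nodes

print(f"{all_nodes_healthy():.3%} all healthy")          # ~99.2% for 8 nodes
print(f"{1 - all_nodes_healthy():.3%} partial failure")  # ~0.8%, vs 0.1% for one machine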

S3 Hides Data Replication Implementation

Simple API masks complex distributed storage system.

What You See: Simple File Operations

import boto3
s3 = boto3.client('s3')

# Appears like local file system
s3.put_object(Bucket='my-bucket', Key='data.csv', Body=data)
s3.get_object(Bucket='my-bucket', Key='data.csv')
s3.delete_object(Bucket='my-bucket', Key='data.csv')

What AWS Implements Behind the Scenes

Data Replication

  • Automatically copies data to 3+ physical servers
  • Distributes copies across different data centers
  • Maintains 99.999999999% durability (11 9’s)

Consistency Management

  • Coordinates writes across multiple storage nodes
  • Handles read-after-write consistency
  • Provides strong consistency for overwrites and deletes (since December 2020)

Failure Recovery

  • Detects hardware failures within seconds
  • Automatically replaces failed storage nodes
  • Rebuilds lost data copies from remaining replicas

Complexity You Don’t Handle

# What you would need to implement manually:
# 1. Distributed consensus protocol
# 2. Failure detection and recovery
# 3. Data partitioning and replication  
# 4. Consistent hashing for load distribution
# 5. Network protocol for reliable transfer
# 6. Monitoring and alerting systems
# 7. Hardware provisioning and maintenance

Engineering Cost Avoided

  • Distributed systems team: 5-10 engineers × $200K = $1-2M/year
  • Data center operations: $500K+/year facilities cost
  • Hardware replacement: $100K+/year equipment
  • 24/7 on-call rotation: $300K+/year operations staff

vs S3 Cost: $23/TB/month for most workloads

Development Time Savings

  • Building reliable distributed storage: 18-24 months
  • S3 integration: 1-2 days
  • Focus shift: From infrastructure to ML algorithms

S3 provides distributed storage reliability without requiring distributed systems expertise.

Load Balancers Replace Manual Request Distribution

Automatic traffic distribution across multiple servers.

Manual Load Distribution Problems

Single Server Bottleneck

  • 1 EC2 instance: ~1,000 requests/second maximum
  • Model inference: 50-200ms per request
  • Capacity: 5-20 concurrent users before timeouts

Adding Servers Manually

# Deploy model to 3 servers
server1: ec2-1-2-3-4.compute-1.amazonaws.com
server2: ec2-1-2-3-5.compute-1.amazonaws.com  
server3: ec2-1-2-3-6.compute-1.amazonaws.com

# Client must choose which server to call
if server1_healthy:
    call server1
elif server2_healthy:
    call server2
else:
    call server3

Problems:

  • Client needs health check logic
  • Uneven load distribution
  • Manual server replacement on failures

Application Load Balancer Solution

import requests

# Single endpoint for clients
API_ENDPOINT = "https://my-api.elb.amazonaws.com/predict"

# Load balancer handles distribution automatically:
# 1. Health checks servers every 30 seconds
# 2. Routes requests to healthy instances only  
# 3. Distributes load evenly across instances
# 4. Automatically adds/removes instances

response = requests.post(API_ENDPOINT, json=data)

Complexity Abstracted

  • Health monitoring: Automatic detection of failed instances
  • Traffic routing: Weighted round-robin distribution
  • SSL termination: Handles HTTPS certificates automatically
  • Auto scaling integration: Adds servers during traffic spikes

Performance Results

  • 3 instances behind load balancer: 3,000 requests/second capacity
  • Automatic failover: <30 seconds to detect and route around failures
  • Availability: 99.99% with multi-AZ deployment

Client sees single reliable endpoint instead of managing multiple servers.

Load balancers provide high availability and scalability without client-side complexity.

Virtual Resources Replace Physical Infrastructure

Cloud providers abstract physical infrastructure into consumable services.

Traditional Infrastructure Model

  • Purchase physical servers
  • Install operating systems
  • Configure networking equipment
  • Manage storage arrays
  • Handle hardware failures
  • Plan capacity for peak loads

Constraints:

  • Fixed capacity regardless of usage
  • Upfront capital investment required
  • Manual scaling and maintenance
  • Single datacenter deployment

Cloud Service Model

  • Rent virtual resources on-demand
  • Pre-configured software stacks available
  • Managed networking and load balancing
  • Distributed storage with replication
  • Provider handles hardware failures
  • Automatic scaling based on demand

Advantages:

  • Pay only for resources consumed
  • Scale from zero to massive capacity
  • Global deployment in minutes
  • Provider expertise in operations

Core Cloud Service Categories:

  1. Compute: Processing power (CPUs, GPUs, memory)
  2. Storage: Data persistence (files, objects, databases)
  3. Network: Connectivity (load balancers, CDNs, security)

Each category solves specific scaling problems that local infrastructure cannot handle cost-effectively.

Compute Services: Processing Without Hardware Ownership

Compute services provide processing power without hardware ownership.

Virtual Machines (EC2)

  • Complete operating system control
  • Choose CPU, memory, storage, networking
  • Install any software stack
  • Direct SSH/RDP access for development
  • Suitable for existing applications with minimal changes

Containers (ECS/EKS)

  • Application packaging with dependencies
  • Faster startup than virtual machines
  • Resource sharing across containers
  • Orchestration handles scaling and failures
  • Ideal for microservices architectures

Serverless Functions (Lambda)

  • No server management required
  • Automatic scaling to zero and massive concurrency
  • Pay per request execution time
  • Event-driven execution model
  • Best for stateless, short-running tasks

Service selection depends on control requirements, scaling patterns, and operational complexity tolerance.

EC2 Instances Share Physical Servers

EC2 instances are virtual computers running on AWS physical hardware.

What is an EC2 Instance?

  • Virtual machine running on shared physical hardware
  • Complete isolation from other customers’ instances
  • Choose operating system, CPU, memory, storage, networking
  • Full administrative control (root/administrator access)
  • Can install any software, configure any services

Physical to Virtual Mapping

  • Multiple EC2 instances share single physical server
  • Hypervisor manages resource allocation and isolation
  • Each instance appears to have dedicated hardware
  • AWS manages physical hardware maintenance and failures
  • Instances can migrate between physical servers transparently

Instance Lifecycle

  • Stopped: Instance shut down, EBS storage persisted, no compute charges
  • Running: Instance active, incurring compute and storage charges
  • Terminated: Instance deleted, local storage lost, EBS storage can be preserved

EC2 provides the illusion of dedicated hardware while efficiently sharing physical resources among multiple users.
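
Driving that lifecycle from code is three boto3 calls; a sketch (the instance ID is a placeholder, and each call returns immediately while the state change completes asynchronously):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
instance_id = 'i-0123456789abcdef0'   # placeholder

ec2.stop_instances(InstanceIds=[instance_id])       # Stopped: EBS persists, no compute charges
ec2.start_instances(InstanceIds=[instance_id])      # Running: compute and storage charges resume
ec2.terminate_instances(InstanceIds=[instance_id])  # Terminated: instance deleted permanently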

Instance Configuration Determines Functionality and Cost

Four key decisions define every EC2 instance configuration.

1. Amazon Machine Image (AMI)

  • Pre-configured operating system and software stack
  • Ubuntu 22.04, Windows Server 2022, Amazon Linux 2
  • Deep Learning AMIs with ML frameworks pre-installed
  • Custom AMIs with your specific software configurations
  • Determines what software is available when instance starts

2. Instance Type

  • Hardware specification: CPU, memory, storage, networking
  • t3.micro: 2 vCPUs, 1 GB RAM - development/testing
  • m5.large: 2 vCPUs, 8 GB RAM - general purpose applications
  • c5.4xlarge: 16 vCPUs, 32 GB RAM - CPU-intensive workloads
  • p3.2xlarge: 8 vCPUs, 61 GB RAM, 1 GPU - ML training

3. Storage Configuration

  • Root volume: Operating system and applications
  • Additional EBS volumes: Data storage, databases
  • Instance store: Temporary high-speed storage
  • Snapshots: Backup and restore capabilities

4. Network and Security Settings

  • VPC: Virtual network environment
  • Security groups: Firewall rules for inbound/outbound traffic
  • Key pairs: SSH authentication for Linux instances
  • Public IP: Internet accessibility

Configuration Examples:

  • AMI: Ubuntu 22.04 LTS
  • Instance Type: t3.medium
  • Storage: 20 GB root + 100 GB data volume
  • Network: Public IP, SSH key authentication

Configuration Impact on Cost:

  • AMI: Usually free (OS licensing may apply for Windows)
  • Instance type: Primary cost driver ($0.0116-$32.77/hour range)
  • Storage: Additional cost based on size and performance
  • Data transfer: Charges for internet egress traffic

Each configuration choice affects functionality, performance, and monthly costs.
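
The four decisions map onto a single boto3 call; a sketch matching the example configuration above (the AMI ID, key name, and security group ID are placeholders for your own resources):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',               # 1. AMI (placeholder Ubuntu 22.04 image ID)
    InstanceType='t3.medium',                      # 2. Instance type
    BlockDeviceMappings=[{                         # 3. Storage: 20 GB gp3 root volume
        'DeviceName': '/dev/sda1',
        'Ebs': {'VolumeSize': 20, 'VolumeType': 'gp3'},
    }],
    KeyName='my-key',                              # 4. Network/security: SSH key pair
    SecurityGroupIds=['sg-0123456789abcdef0'],     #    and firewall rules (placeholder)
    MinCount=1,
    MaxCount=1,
)
print(response['Instances'][0]['InstanceId'])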

AMIs: Pre-configured Operating Environments

Amazon Machine Images provide the foundation software for EC2 instances.

What AMIs Contain

  • Operating System: Linux distributions, Windows versions
  • System Software: Device drivers, networking stack, AWS tools
  • Application Software: Web servers, databases, ML frameworks
  • Configuration: Users, permissions, startup scripts
  • Customizations: Your specific software installations and settings

AMI Categories

  • AWS-provided: Maintained by Amazon, regular security updates
  • Marketplace AMIs: Third-party vendors, specialized software stacks
  • Community AMIs: Shared by other AWS users, use with caution
  • Custom AMIs: Your own snapshots of configured instances

Deep Learning AMI Features

  • Pre-installed ML frameworks: TensorFlow, PyTorch, MXNet, Hugging Face
  • CUDA drivers and cuDNN for GPU acceleration
  • Conda environments for different framework versions
  • Jupyter notebook server pre-configured
  • Development tools: git, vim, tmux, htop

AMI Selection Impact

  • Launch time: Custom AMIs start faster than base images
  • Maintenance: AWS AMIs get security updates, custom AMIs require manual updates
  • Storage cost: Larger AMIs cost more to store and transfer
  • Compatibility: Must match instance architecture (x86, ARM, GPU support)

AMI choice significantly impacts development velocity, operational overhead, and ongoing maintenance requirements.

Instance Types Optimize Hardware for Workload Patterns

EC2 provides hundreds of instance configurations optimized for different workload patterns.

General Purpose Instances (t3, m5, m6i)

  • Balanced CPU, memory, networking
  • t3.medium: 2 vCPUs, 4GB RAM, $0.0416/hour
  • m5.large: 2 vCPUs, 8GB RAM, $0.096/hour
  • Suitable for web servers, development environments

Compute Optimized (c5, c6i)

  • High-performance processors
  • c5.large: 2 vCPUs, 4GB RAM, $0.085/hour
  • 3.4 GHz sustained all-core frequency
  • Ideal for CPU-intensive ML inference

Memory Optimized (r5, x1e)

  • High memory-to-CPU ratios
  • r5.large: 2 vCPUs, 16GB RAM, $0.126/hour
  • x1e.xlarge: 4 vCPUs, 122GB RAM, $0.834/hour
  • Required for large dataset processing

Storage Optimized (i3, i4i)

  • NVMe SSD storage with high IOPS
  • i3.large: 2 vCPUs, 15.25GB RAM, 475GB NVMe, $0.156/hour
  • Up to 3.3 million IOPS per instance
  • Database workloads and distributed file systems

Instance selection balances CPU performance, memory capacity, storage speed, and hourly cost based on workload requirements.

GPU Instances: Parallel Processing for ML Workloads

GPU instances provide parallel processing power for ML training and inference.

GPU Instance Families

p4d Instances: Latest ML Training

  • NVIDIA A100 GPUs (40GB memory each)
  • p4d.24xlarge: 8x A100, 96 vCPUs, 1152GB RAM
  • 400 Gbps networking for multi-node training
  • $32.77/hour for 8 GPU instance

p3 Instances: General ML Workloads

  • NVIDIA V100 GPUs (16GB memory each)
  • p3.2xlarge: 1x V100, 8 vCPUs, 61GB RAM
  • 25 Gbps networking
  • $3.06/hour for single GPU

g4 Instances: ML Inference

  • NVIDIA T4 GPUs (16GB memory each)
  • g4dn.xlarge: 1x T4, 4 vCPUs, 16GB RAM
  • Optimized for inference workloads
  • $0.526/hour for single GPU

Current Limitations:

  • Limited availability in some regions
  • Requires reservation for large-scale training
  • High cost for continuous operation

GPU selection depends on model size, training duration, and budget constraints. Latest hardware provides better performance-per-dollar for large-scale training.

AMI Selection Impacts Launch Time and Maintenance

AMIs provide pre-built operating system and software configurations.

Base Operating System Images

  • Ubuntu Server 22.04 LTS: Standard Linux distribution
  • Amazon Linux 2: AWS-optimized with pre-installed AWS tools
  • Windows Server 2022: Microsoft environment for .NET applications
  • Red Hat Enterprise Linux: Enterprise-grade Linux support

Deep Learning AMIs

  • AWS Deep Learning AMI (Ubuntu): Pre-installed ML frameworks
    • PyTorch, TensorFlow, MXNet, Hugging Face Transformers
    • CUDA drivers and cuDNN for GPU acceleration
    • Jupyter notebooks and development tools
  • AWS Deep Learning Containers: Docker images for specific frameworks
  • NVIDIA NGC Images: Optimized containers for ML workloads

Custom AMIs

  • Create snapshots of configured instances
  • Share AMIs across accounts or make public
  • Version control for deployment consistency
  • Faster instance launch with pre-installed software

AMI Selection Strategy:

  1. Start with Deep Learning AMI for ML workloads
  2. Use base Ubuntu for custom configurations
  3. Create custom AMI after environment setup
  4. Consider regional availability and update frequency

AMI selection significantly impacts instance launch time, configuration complexity, and ongoing maintenance requirements.
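
Capturing a custom AMI from a configured instance is one call; a sketch (the instance ID and names are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

image = ec2.create_image(
    InstanceId='i-0123456789abcdef0',          # a configured instance (placeholder ID)
    Name='ml-env-pytorch-v1',                  # versioned name for deployment consistency
    Description='Ubuntu + PyTorch + project dependencies',
)
print(image['ImageId'])   # new AMI ID, launchable like any other AMI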

Key Pairs Enable SSH Access

Key pairs provide secure authentication for connecting to EC2 instances without passwords.

AWS Key Pair Integration

  • Launch Requirement: Must specify key pair when creating instance
  • No Password Access: AWS disables password authentication by default
  • Region Specific: Key pairs only available in the region where created
  • Instance Metadata: Public key automatically installed in ~/.ssh/authorized_keys

Key Pair Management

  • AWS Generated: EC2 console creates key pair, you download .pem file
  • Import Existing: Upload your existing public key to AWS
  • One-time Download: Private key only available at creation time
  • No Recovery: Lost private key = permanent loss of access

Access Patterns

  • Single User: One key pair for personal development instances
  • Team Access: Multiple team members’ public keys imported separately
  • Service Access: Dedicated key pairs for automated tools and CI/CD
  • Environment Separation: Different keys for dev/staging/production

Key pairs cannot be added to running instances - losing your private key requires instance replacement or complex recovery procedures.
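
A sketch of creating a key pair and saving the one-time private key (the key name and output path are up to you):

import os
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Private key material is only returned at creation time - it cannot be retrieved later
key = ec2.create_key_pair(KeyName='ee547-dev-key')

pem_path = os.path.expanduser('~/.ssh/ee547-dev-key.pem')
with open(pem_path, 'w') as f:
    f.write(key['KeyMaterial'])
os.chmod(pem_path, 0o600)   # SSH refuses private keys with loose permissions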

Security Groups Control Instance Network Access

Security groups act as virtual firewalls controlling inbound and outbound traffic to EC2 instances.

Inbound Rules (Traffic TO Your Instance)

  • SSH (Port 22): Administrative access for configuration and debugging
  • HTTP (Port 80): Web traffic for API endpoints and web applications
  • HTTPS (Port 443): Encrypted web traffic for production services
  • Custom Ports: Application-specific services (Jupyter: 8888, TensorBoard: 6006)

Outbound Rules (Traffic FROM Your Instance)

  • HTTPS (Port 443): Download packages, access S3, API calls
  • HTTP (Port 80): Software updates and package repositories
  • DNS (Port 53): Domain name resolution
  • Database Ports: Connection to RDS or external databases

Source and Destination Options

  • Your IP Address: Restrict access to your current location only
  • Anywhere (0.0.0.0/0): Allow access from entire internet
  • Other Security Groups: Reference groups for multi-tier applications
  • VPC CIDR Block: Allow access from within your virtual network

Security Group Strategy:

  1. Start with restrictive rules (SSH from your IP only)
  2. Add specific ports as needed for your application
  3. Use security group references for multi-tier architectures
  4. Never use 0.0.0.0/0 for SSH or database access

Security groups require explicit configuration for each network service your ML application needs to access or provide.
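
A sketch of the restrictive-first strategy above, using boto3: create a group, allow SSH only from one admin IP, and open HTTPS to clients (the VPC ID and IP address are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

sg = ec2.create_security_group(
    GroupName='ml-api-sg',
    Description='ML serving instances',
    VpcId='vpc-0123456789abcdef0',             # placeholder VPC ID
)

ec2.authorize_security_group_ingress(
    GroupId=sg['GroupId'],
    IpPermissions=[
        # SSH restricted to a single admin IP
        {'IpProtocol': 'tcp', 'FromPort': 22, 'ToPort': 22,
         'IpRanges': [{'CidrIp': '203.0.113.10/32'}]},
        # HTTPS open to the internet for the API endpoint
        {'IpProtocol': 'tcp', 'FromPort': 443, 'ToPort': 443,
         'IpRanges': [{'CidrIp': '0.0.0.0/0'}]},
    ],
)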

Cloud Storage: Durability and Global Access

Cloud storage services provide durability, scalability, and global accessibility.

Object Storage (S3)

  • Store files as objects in buckets
  • Globally unique bucket names
  • REST API access from any location
  • 99.999999999% (11 9’s) durability
  • Automatic replication across facilities

Block Storage (EBS)

  • Virtual hard drives for EC2 instances
  • High IOPS performance for databases
  • Snapshot backup and restoration
  • Encryption at rest and in transit
  • Multiple volume types for different use cases

File Systems (EFS)

  • Network File System (NFS) compatible
  • Shared access across multiple instances
  • Automatic scaling to petabyte capacity
  • POSIX file system semantics
  • Suitable for distributed applications

Database Services (RDS, DynamoDB)

  • Managed relational databases (MySQL, PostgreSQL)
  • NoSQL for high-scale applications
  • Automated backups and patching
  • Multi-region replication
  • Performance monitoring and optimization

Storage service selection depends on access patterns, performance requirements, durability needs, and cost constraints.

Storage Services Abstract Physical Disks

Cloud storage abstracts physical disks into managed services with different access patterns.

Traditional Storage Model

  • Physical hard drives attached to servers
  • Direct file system access (NTFS, ext4)
  • Local RAID for redundancy
  • Manual backup and recovery
  • Fixed capacity planning

Cloud Storage Model

  • Storage services accessed over network APIs
  • Provider manages physical infrastructure
  • Automatic replication and durability
  • Pay-per-GB pricing with instant scaling
  • Different services optimized for specific use cases

Key Cloud Storage Concepts

Durability: How likely data survives hardware failures

  • Local disk: ~99% (1% annual failure rate)
  • Cloud storage: 99.999999999% (11 9’s) through replication

Consistency: When all copies reflect the same data

  • Strong consistency: All reads return latest write immediately
  • Eventual consistency: All copies eventually consistent, may be stale briefly

Access Patterns: How applications read and write data

  • Random access: Database queries, frequent small reads/writes
  • Sequential access: Log files, backups, large file streaming
  • Infrequent access: Archives, disaster recovery, compliance data

Storage Service Categories

Block Storage (EBS)

  • Virtual hard drives for EC2 instances
  • Raw block device, requires file system
  • High IOPS for databases and applications
  • Can attach/detach from instances
  • Snapshots for backup and cloning

Object Storage (S3)

  • Files stored as objects with metadata
  • REST API access from anywhere
  • Virtually unlimited capacity
  • Multiple storage classes for cost optimization
  • Global replication and CDN integration

File Storage (EFS)

  • Traditional file system semantics (POSIX)
  • Multiple instances access simultaneously
  • Automatic scaling to petabytes
  • Network File System (NFS) protocol
  • Shared access for distributed applications

Database Storage (RDS)

  • Managed database engines
  • Automatic backups and point-in-time recovery
  • Multi-AZ deployment for high availability
  • Read replicas for scale-out
  • Provider handles maintenance and patching

Storage Selection Criteria: Access frequency, performance requirements, sharing needs, backup/recovery, and cost sensitivity.

S3 Operational Complexity Exceeds Simple File Storage

S3 appears simple but involves significant operational complexity.

Why S3 Isn’t “Just File Storage”

Global Namespace and Regions

  • Bucket names must be globally unique across all AWS accounts
  • Data stored in specific geographic regions
  • Cross-region data transfer costs $0.02/GB
  • Latency varies significantly by region (20ms local, 200ms+ cross-continent)

Access Control Complexity

  • Bucket policies control who can access data
  • IAM roles define service permissions
  • Access Control Lists (ACLs) for fine-grained control
  • Pre-signed URLs for temporary access
  • Misconfigured permissions cause security breaches

Consistency and Performance Models

  • Strong read-after-write consistency for all objects, including overwrites and deletes (since December 2020)
  • Request rate limits: 3,500 PUT/COPY/POST/DELETE, 5,500 GET/HEAD per prefix per second
  • Hotspotting when many requests target same key prefix

Storage Classes and Cost Optimization

  • Standard: $0.023/GB/month, immediate access
  • Infrequent Access: $0.0125/GB/month, retrieval fees
  • Glacier: $0.004/GB/month, minutes to hours retrieval
  • Lifecycle policies automatically transition data

S3 operational complexity includes regional data placement, access control management, performance optimization, and cost management across multiple storage classes.
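
Lifecycle policies are a small configuration object; a sketch that implements the Standard → Infrequent Access → Glacier transitions listed above (bucket name and prefix are placeholders):

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-ml-bucket',                                  # placeholder bucket
    LifecycleConfiguration={'Rules': [{
        'ID': 'archive-old-training-runs',
        'Status': 'Enabled',
        'Filter': {'Prefix': 'training-runs/'},
        'Transitions': [
            {'Days': 30, 'StorageClass': 'STANDARD_IA'},    # infrequent access after 30 days
            {'Days': 90, 'StorageClass': 'GLACIER'},        # archive after 90 days
        ],
    }]},
)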

Network Services Enable Secure Component Communication

Cloud networking enables secure, scalable communication between services.

Virtual Private Cloud (VPC)

  • Isolated network environment in AWS
  • Define IP address ranges (CIDR blocks)
  • Public and private subnets
  • Control traffic with security groups and NACLs
  • Connect to on-premises networks via VPN

Load Balancers

  • Application Load Balancer (ALB): HTTP/HTTPS traffic, Layer 7 routing
  • Network Load Balancer (NLB): TCP/UDP traffic, ultra-low latency
  • Gateway Load Balancer (GWLB): Third-party security appliances
  • Health checks and automatic failover
  • SSL/TLS termination and certificate management

Content Delivery Network (CloudFront)

  • Global edge locations reduce latency
  • Cache static content closer to users
  • Dynamic content acceleration
  • DDoS protection and security features
  • Integration with AWS services

DNS and Service Discovery

  • Route 53 for domain name management
  • Health checks and failover routing
  • Service discovery for microservices
  • Geographic and latency-based routing

Networking services reduce latency, improve reliability, and provide security for distributed applications across global infrastructure.

Serverless Executes Code Without Server Management

Serverless computing executes code without server management or capacity planning.

Traditional Server-Based Model

  • Provision EC2 instances for expected peak load
  • Install runtime environments and dependencies
  • Deploy application code to servers
  • Monitor server health and scaling
  • Pay for server uptime regardless of usage

Serverless Execution Model

  • Upload code to serverless platform
  • Platform handles all infrastructure automatically
  • Code executes in response to events/requests
  • Automatic scaling from zero to thousands of concurrent executions
  • Pay only for actual execution time and requests

Key Serverless Concepts

Function as a Service (FaaS): Code runs as stateless functions

  • Each function execution is independent
  • No persistent local storage between invocations
  • Runtime environment created/destroyed for each execution

Event-Driven Architecture: Functions triggered by events

  • HTTP requests via API Gateway
  • File uploads to S3 storage
  • Database changes, queue messages, scheduled timers
  • Functions can trigger other functions

Cold Starts: Initialization delay for new function instances

  • Platform creates new runtime environment
  • Downloads code package and dependencies
  • Initializes programming language runtime
  • 100ms-1000ms latency penalty for first execution

Serverless Service Categories

Compute Functions (Lambda)

  • Execute code in response to events
  • Supported languages: Python, Node.js, Java, C#, Go, Ruby
  • 15-minute maximum execution time
  • 10GB maximum memory allocation

API Management (API Gateway)

  • REST and WebSocket API endpoints
  • Request/response transformation
  • Authentication and authorization
  • Rate limiting and usage monitoring

Database Services (DynamoDB)

  • NoSQL database with automatic scaling
  • Single-digit millisecond latency
  • Pay-per-request pricing model
  • Global tables for multi-region deployment

Storage and Messaging

  • S3: Object storage with event triggers
  • SQS: Message queues for asynchronous processing
  • SNS: Publish/subscribe messaging service
  • EventBridge: Event routing between services

Development and Deployment

  • SAM: Serverless Application Model for infrastructure as code
  • X-Ray: Distributed tracing for debugging
  • CloudWatch: Logging and monitoring
  • CodePipeline: CI/CD for serverless applications

Serverless Trade-offs: No server management vs execution time limits, automatic scaling vs cold starts, pay-per-use vs potentially higher costs at scale.

Lambda Constraints Limit ML Workload Suitability

Lambda provides specific implementation of serverless computing with constraints for ML workloads.

Lambda Execution Model

  • Event-driven function execution
  • Automatic scaling from zero to thousands of concurrent executions
  • Pay only for actual compute time (1ms billing increments)
  • No server provisioning or maintenance required
  • Supports Python, Node.js, Java, C#, Go, Ruby, custom runtimes

Lambda Limitations for ML Workloads

  • Execution time: 15-minute maximum duration
  • Memory: 10GB maximum allocation
  • Storage: 512MB /tmp by default (configurable up to 10GB ephemeral storage)
  • Package size: 50MB zipped, 250MB unzipped
  • Cold starts: 100ms+ initialization delay for new instances

Suitable ML Use Cases

  • Real-time inference for small models (<250MB)
  • Image preprocessing and data transformation
  • Model serving behind API Gateway
  • Event-driven data processing pipelines
  • Feature extraction from streaming data

Not Suitable for:

  • Large model training (memory and time constraints)
  • Models requiring GPU acceleration
  • Long-running data processing jobs
  • Applications requiring persistent connections

Lambda provides cost-effective serverless computing for event-driven ML tasks but has significant constraints for large-scale model operations.

Integration Patterns Connect Services Through APIs

Cloud services connect through APIs, events, and data flows.

Request-Response Pattern

  • Direct API calls between services
  • Synchronous communication
  • EC2 → S3 for data retrieval
  • Application Load Balancer → EC2 instances
  • Suitable for real-time interactions

Event-Driven Pattern

  • Asynchronous message passing
  • S3 triggers Lambda on object upload
  • CloudWatch Events schedule functions
  • SQS queues decouple services
  • Handles variable load and failures

Data Pipeline Pattern

  • Sequential processing stages
  • S3 → Lambda → DynamoDB
  • ECS tasks process batch jobs
  • Step Functions orchestrate workflows
  • Supports complex data transformations

Shared Storage Pattern

  • Multiple services access common data
  • EFS for shared file access
  • RDS for transactional data
  • S3 for object sharing
  • ElastiCache for session storage

Integration pattern selection depends on latency requirements, failure tolerance, and operational complexity constraints.
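
A minimal sketch of the event-driven pattern: a Lambda handler invoked by S3 ObjectCreated notifications, processing each uploaded object (the bucket, output prefix, and processing step are placeholders):

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    """Invoked by S3 upload events; one processing step per new object."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        s3.download_file(bucket, key, '/tmp/input')
        # ... feature extraction / preprocessing on /tmp/input (placeholder) ...
        s3.upload_file('/tmp/input', bucket, f'processed/{key}')
    return {'processed': len(event['Records'])}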

Memory Limits Impose Service Constraints

Lambda 10GB memory limit prevents large model deployment.

Lambda Memory Constraint

  • Maximum allocation: 10GB RAM
  • PyTorch model loading overhead: 2x model size
  • Practical model size limit: 4-5GB maximum

Large Language Models

  • GPT-3.5: 13GB model weights
  • Llama-2 7B: 14GB model weights
  • Llama-2 13B: 26GB model weights
  • BERT Large: 1.3GB model weights

Result: Lambda cannot load models >4GB

Cold Start Penalty

Models >250MB face initialization delays:

  • 1GB model: 2-3 second cold start
  • 4GB model: 8-12 second cold start
  • Timeout before first request completion
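
The usual mitigation is to pay the load cost once per container rather than once per request: pull the model at module scope so warm invocations reuse it. A sketch for a small CPU model that fits Lambda's memory and /tmp limits (bucket, key, and the saved-model format are placeholders):

import boto3
import torch

s3 = boto3.client('s3')

# Runs once per container (cold start), then is reused by every warm invocation
s3.download_file('my-ml-bucket', 'models/small_model.pth', '/tmp/model.pth')
model = torch.load('/tmp/model.pth', map_location='cpu')   # placeholder: full model saved with torch.save
model.eval()

def lambda_handler(event, context):
    with torch.no_grad():
        x = torch.tensor(event['input'])
        return {'prediction': model(x).tolist()}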

EC2 Memory Capacity

Instance Memory Range

  • t3.micro: 1GB RAM ($8.76/month)
  • r5.large: 16GB RAM ($90.72/month)
  • r5.24xlarge: 768GB RAM ($4,343.04/month)
  • u-6tb1.metal: 6TB RAM ($17,971.20/month)

Model Deployment Examples

  • BERT Large (1.3GB): Runs on t3.small (2GB)
  • Llama-2 7B (14GB): Requires r5.large minimum
  • Llama-2 70B (140GB): Requires r5.24xlarge minimum

Memory vs Cost Trade-off

  • 16GB instance: $91/month
  • 768GB instance: $4,343/month (48x cost for 48x memory)

EC2 supports any practical model size with appropriate instance selection.

Memory requirements determine compute service viability before performance or cost considerations.

Execution Time Limits Block Training

Lambda 15-minute timeout eliminates ML training.

Lambda Execution Limits

  • Maximum execution time: 15 minutes (900 seconds)
  • Cannot be extended or renewed
  • Process terminated with no checkpoint saving
  • Suitable for inference only, never training

Typical ML Training Duration

Small Models (ImageNet Classification)

  • ResNet-50: 2-4 hours on single GPU
  • EfficientNet-B0: 1-2 hours on single GPU
  • Training epochs: 100-300 typical

Large Models (Language Models)

  • GPT-2 Small: 24-48 hours on 8 GPUs
  • BERT Base: 4-16 hours on 16 GPUs
  • Llama-2 7B: 184 hours on 64 GPUs

Fine-tuning Duration

  • BERT fine-tuning: 30-120 minutes
  • GPT-3.5 fine-tuning: 60-240 minutes
  • Still exceeds Lambda limit

EC2 Training Capability

Unlimited Execution Time

  • No timeout constraints
  • Training runs for days or weeks
  • Automatic checkpointing to S3 for failure recovery

Training Cost Examples

ResNet-50 on p3.2xlarge ($3.06/hour)

  • Training time: 3 hours
  • Total cost: $9.18

GPT-2 Small on p3.8xlarge ($12.24/hour)

  • Training time: 48 hours
  • Total cost: $587.52

BERT Base on p3.16xlarge ($24.48/hour)

  • Training time: 8 hours
  • Total cost: $195.84

Spot Instance Savings

  • Same instances: 70% discount
  • BERT training: $195.84 → $58.75
  • Risk: Training interruption every 2-6 hours

15-minute execution limit makes Lambda unsuitable for any ML training workload.

Storage Request Limits Create Bottlenecks

S3 request rate limits constrain high-throughput workloads.

S3 Request Rate Limits

Per-Prefix Limits

  • PUT/COPY/POST/DELETE: 3,500 requests/second
  • GET/HEAD: 5,500 requests/second
  • Prefix = everything before last “/” in object key

Distributed Training Impact

100-GPU Training Job

  • Each GPU requests 10 data batches/second
  • Total requests: 1,000/second
  • Within S3 limits if properly prefixed

1000-GPU Training Job

  • Each GPU requests 10 data batches/second
  • Total requests: 10,000/second
  • Exceeds S3 GET limit by 82%
  • Result: Training stalls waiting for data

Request Hotspotting

  • Single prefix: /training-data/imagenet/
  • All 1000 GPUs hit same prefix
  • Requests throttled, training blocked
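
The standard workaround is to spread objects across many key prefixes so the per-prefix limits apply per shard; a sketch (the shard count and key layout are illustrative):

import hashlib

NUM_SHARDS = 64   # 64 prefixes x 5,500 GET/s ≈ 350K GET/s aggregate ceiling

def sharded_key(filename):
    """Prepend a stable, hash-based shard so requests spread across prefixes."""
    shard = int(hashlib.md5(filename.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"training-data/shard-{shard:02d}/{filename}"

print(sharded_key('imagenet/n01440764_10026.JPEG'))
# -> training-data/shard-NN/imagenet/n01440764_10026.JPEG  (NN in 00-63)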

EBS IOPS Limitations

Volume Type Performance

EBS Type   Max IOPS   Max Throughput   Cost/Month (100GB)
gp3        16,000     1,000 MB/s       $8.00
io2        64,000     1,000 MB/s       $65.00
gp2        10,000     250 MB/s         $10.00

Database Workload Impact

PostgreSQL with 1M records/second inserts

  • Required IOPS: 50,000-80,000
  • gp3 volume: Cannot support workload
  • io2 volume: $65/month + $3,250 IOPS charges = $3,315/month

Machine Learning Dataset Loading

  • ImageNet (1.2M images): 500 MB/s sequential read
  • gp3 volume: Supports workload at $8/month
  • Random access training: Requires higher IOPS

Multi-Instance Sharing

  • EBS limitation: Single attachment point
  • Cannot share between training instances
  • Requires data replication or network storage

Storage performance limits determine data access patterns and training architecture.

Cost Models Favor Different Usage Patterns

Lambda pay-per-request vs EC2 always-on pricing.

Usage Pattern Analysis

Scenario 1: Sporadic Inference (100 requests/day)

Lambda Costs

  • Requests: 100/day × 30 days = 3,000/month
  • Duration: 200ms average per request
  • Memory: 1GB allocated
  • Monthly cost: $0.60

EC2 Alternative (t3.micro always-on)

  • Instance cost: $8.76/month
  • Always running regardless of usage
  • 14.6x more expensive than Lambda

Break-even point: 1,460 requests/day

Scenario 2: High-Volume Inference (100,000 requests/day)

Lambda Costs

  • Requests: 3M/month
  • Monthly cost: $600

EC2 Alternative (c5.large)

  • Instance cost: $61.32/month
  • Can handle 100,000 requests/day
  • 10x cheaper than Lambda

Cost Crossover Points

Request Volume Thresholds

Instance Type   Monthly Cost   Lambda Break-even
t3.nano         $4.38          730 req/day
t3.micro        $8.76          1,460 req/day
t3.small        $17.52         2,920 req/day
c5.large        $61.32         10,220 req/day

Memory Impact on Lambda Costs

Memory   Cost per GB-second   1M req/month cost
128MB    Base rate            $200
1GB      8x base              $1,600
3GB      24x base             $4,800
10GB     80x base             $16,000

Duration Impact

  • 100ms execution: $200/million requests
  • 1 second execution: $2,000/million requests
  • 10 second execution: $20,000/million requests

Cost optimization requires matching service pricing model to actual usage patterns.
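
A sketch of the break-even arithmetic, using the approximate per-request cost implied by the scenarios above (about $0.0002 per 1GB, 200ms request; real Lambda pricing depends on memory, duration, and per-request charges):

def lambda_monthly_cost(requests_per_day, cost_per_request=0.0002, days=30):
    return requests_per_day * days * cost_per_request

def break_even_requests_per_day(ec2_monthly_cost, cost_per_request=0.0002, days=30):
    return ec2_monthly_cost / (cost_per_request * days)

print(lambda_monthly_cost(100))            # ~$0.60/month (scenario 1)
print(break_even_requests_per_day(8.76))   # ~1,460 requests/day vs an always-on t3.micro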

Service Constraints Determine Architecture

Hard limits eliminate service options before cost optimization.

Constraint Hierarchy

1. Hard Constraints (Service Elimination)

  • Memory > 10GB → Lambda impossible
  • Execution > 15 minutes → Lambda impossible
  • Shared storage access → S3 or EFS required
  • Above 64,000 IOPS → Multiple EBS volumes required

2. Performance Constraints (Service Selection)

  • <100ms latency → Pre-warmed instances required
  • Above 5,500 requests/second per prefix → S3 prefix distribution required
  • Above 16,000 IOPS → io2 volumes required

3. Cost Constraints (Configuration Optimization)

  • Variable load → Lambda or auto-scaling preferred
  • Consistent load → Reserved instances preferred
  • Development → Spot instances acceptable

Real Architecture Decisions

Large Model Serving (7GB model)

  1. Memory constraint eliminates Lambda
  2. Always-on requirement eliminates spot instances
  3. Load balancing required for availability
  4. Result: EC2 + ALB + Auto Scaling Group

Batch Processing (2-hour jobs)

  1. Execution time eliminates Lambda
  2. Intermittent usage favors spot instances
  3. Job queuing handles interruptions
  4. Result: EC2 Spot + SQS + Auto Scaling

Service constraints determine feasible architectures; cost considerations optimize within remaining options.
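
The hierarchy reads as an elimination procedure; a sketch using the Lambda limits quoted in this unit (the function, thresholds, and 2x loading-overhead rule are illustrative):

def feasible_compute(model_gb, max_runtime_min, needs_gpu):
    """Apply hard constraints first; return the services still on the table."""
    options = {'lambda', 'ec2'}
    if model_gb * 2 > 10:        # ~2x loading overhead vs the 10 GB Lambda memory cap
        options.discard('lambda')
    if max_runtime_min > 15:     # 15-minute Lambda execution limit
        options.discard('lambda')
    if needs_gpu:                # Lambda has no GPU support
        options.discard('lambda')
    return options

print(feasible_compute(model_gb=7, max_runtime_min=1, needs_gpu=False))     # {'ec2'}
print(feasible_compute(model_gb=0.3, max_runtime_min=1, needs_gpu=False))   # both remain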

Cloud ML System Design

From Local Scripts to Cloud Services

Transform single-machine PyTorch workflows into systems using EC2 and S3.

Local Development Workflow

# Everything on laptop
import torch
import pandas as pd
from flask import Flask

app = Flask(__name__)

# Load data (local file)
data = pd.read_csv('dataset.csv')

# Train model (local GPU); train_pytorch_model is project-specific
model = train_pytorch_model(data)

# Save model (local disk)
torch.save(model, 'model.pth')

# Serve predictions (local process)
app.run(host='localhost', port=5000)

Local Constraints:

  • Data limited by disk space (2TB max)
  • Training limited by GPU memory (24GB)
  • Serving limited to single user
  • No backup or redundancy
  • Cannot scale beyond one machine

Cloud Workflow Using EC2 + S3

# Distributed across services
import boto3
import torch
import pandas as pd

s3 = boto3.client('s3')

# Load data (from S3)
s3.download_file('ml-bucket', 'dataset.csv', '/tmp/dataset.csv')
data = pd.read_csv('/tmp/dataset.csv')

# Train model (EC2 with GPU); train_pytorch_model is project-specific
model = train_pytorch_model(data)

# Save model (to S3)
torch.save(model, '/tmp/model.pth')
s3.upload_file('/tmp/model.pth', 'ml-bucket', 'models/model.pth')

# Serve predictions (Lambda + S3)
def lambda_handler(event, context):
    s3.download_file('ml-bucket', 'models/model.pth', '/tmp/model.pth')
    model = torch.load('/tmp/model.pth')
    return model.predict(event['input'])

Cloud Capabilities:

  • Data storage scales to petabytes (S3)
  • Training scales to multiple GPUs (EC2)
  • Serving handles thousands of users (Lambda)
  • Automatic backup and replication (S3)
  • Pay only for resources used

EC2 instances and S3 buckets require API integration and IAM configuration for functional ML systems.

Basic ML System Architecture

Simple ML system using EC2 for training and Lambda for serving.

Component Design

Data Storage (S3)

  • Training data: s3://ml-bucket/data/
  • Model artifacts: s3://ml-bucket/models/
  • Predictions: s3://ml-bucket/results/

Training Infrastructure (EC2)

  • Instance type: p3.2xlarge (1 GPU, 8 vCPUs, 61GB RAM)
  • AMI: Deep Learning AMI (Ubuntu) with PyTorch pre-installed
  • Storage: 100GB EBS volume for temporary data
  • IAM role: S3 read/write permissions

Serving Infrastructure (Lambda)

  • Runtime: Python 3.9
  • Memory: 3GB (for model loading)
  • Timeout: 30 seconds
  • Trigger: API Gateway HTTP requests

System Data Flow

  1. Upload training data to S3 bucket
  2. Launch EC2 instance with training script
  3. EC2 downloads data from S3, trains model
  4. EC2 uploads trained model back to S3
  5. Lambda function loads model from S3 for predictions
  6. API Gateway routes prediction requests to Lambda

Total monthly cost: ~$330 for moderate ML workload with occasional training and regular serving.

Training System Design

EC2-based training system with S3 data management.

Training Job Configuration

EC2 Instance Setup

# Launch instance
aws ec2 run-instances \
  --image-id ami-0c02fb55956c7d316 \
  --instance-type p3.2xlarge \
  --key-name my-key \
  --security-group-ids sg-12345678

# SSH and setup
ssh -i my-key.pem ubuntu@instance-ip
sudo apt update && sudo apt install awscli

Training Script Structure

#!/usr/bin/env python3
import os
import boto3
import torch

# Download training data
s3 = boto3.client('s3')
os.makedirs('data', exist_ok=True)
s3.download_file('ml-bucket', 'train.csv', 'data/train.csv')

# Load and train (load_data, MyModel, train_model are project-specific placeholders)
data = load_data('data/train.csv')
model = MyModel()
train_model(model, data, epochs=100)

# Upload trained model
torch.save(model.state_dict(), 'model.pth')
s3.upload_file('model.pth', 'ml-bucket', 'models/model_v1.pth')

# Shut the instance down so compute billing stops
os.system('sudo shutdown -h now')

Cost Optimization

  • Use spot instances for 70% cost reduction
  • Terminate instance when training completes
  • Use appropriate instance size for model

Training Performance Analysis

Model Size             Local (RTX 4090)   EC2 (p3.2xlarge)   EC2 Cost
Small (10M params)     2 hours            1.5 hours          $4.59
Medium (100M params)   8 hours            6 hours            $18.36
Large (1B params)      Cannot fit         24 hours           $73.44

Training Workflow

  1. Prepare training data locally
  2. Upload data to S3 bucket
  3. Launch EC2 instance with training script
  4. Monitor training progress via CloudWatch logs
  5. Retrieve trained model from S3
  6. Terminate instance to stop billing

Failure Handling

  • Save checkpoints to S3 every epoch
  • Use spot instance interruption handling
  • Implement training resume from checkpoint
  • Set up CloudWatch alarms for long-running jobs

Training System Benefits: Scales beyond local GPU memory, handles larger datasets, provides cost flexibility through spot instances.
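A minimal sketch of the checkpoint-to-S3 pattern from the failure-handling list above, assuming an ml-bucket bucket and a checkpoints/ prefix (both illustrative):

import boto3
import torch
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'ml-bucket'                    # illustrative bucket name
CKPT_KEY = 'checkpoints/latest.pth'     # illustrative key

def save_checkpoint(model, optimizer, epoch, path='/tmp/ckpt.pth'):
    # Write model/optimizer state locally, then copy to S3 so a spot
    # interruption loses at most one epoch of work
    torch.save({'epoch': epoch,
                'model_state': model.state_dict(),
                'optimizer_state': optimizer.state_dict()}, path)
    s3.upload_file(path, BUCKET, CKPT_KEY)

def resume_from_checkpoint(model, optimizer, path='/tmp/ckpt.pth'):
    # Return the next epoch to run; start at 0 if no checkpoint exists yet
    try:
        s3.download_file(BUCKET, CKPT_KEY, path)
    except ClientError:
        return 0
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    return ckpt['epoch'] + 1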

Serving System Design

Lambda-based serving with S3 model storage.

Lambda Function Implementation

import json
import boto3
import torch
import tempfile

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Download model from S3 on the first invocation; cached on the function
    # object for warm invocations. Assumes the artifact is a full pickled model;
    # if only a state_dict was saved, instantiate the model class and load_state_dict instead.
    if not hasattr(lambda_handler, 'model'):
        with tempfile.NamedTemporaryFile() as tmp:
            s3.download_file('ml-bucket', 'models/model_v1.pth', tmp.name)
            lambda_handler.model = torch.load(tmp.name, map_location='cpu')
            lambda_handler.model.eval()

    # Parse input
    input_data = json.loads(event['body'])
    features = torch.tensor(input_data['features'], dtype=torch.float32)

    # Make prediction
    with torch.no_grad():
        prediction = lambda_handler.model(features)

    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction.tolist()})
    }

API Gateway Configuration

  • REST API endpoint: https://api.example.com/predict
  • POST method with JSON payload
  • CORS enabled for web applications
  • Rate limiting: 1000 requests/second

Alternative: EC2 Serving

# For higher throughput or larger models
from flask import Flask, request
import torch

app = Flask(__name__)
model = torch.load('model.pth')  # Loaded once at startup (full pickled model)
model.eval()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = torch.tensor(data['features'], dtype=torch.float32)
    with torch.no_grad():
        prediction = model(features)
    return {'prediction': prediction.tolist()}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)

Serving Performance Comparison

Approach        Cold Start    Warm Latency   Max Throughput    Cost/1M requests
Lambda          2-5 seconds   100-300ms      1000 concurrent   $200
EC2 t3.medium   0ms           50-100ms       100 req/sec       $300
EC2 c5.large    0ms           20-50ms        500 req/sec       $600

When to Use Each:

Lambda:

  • Sporadic traffic patterns
  • Cost optimization priority
  • Simple models (<250MB)
  • Can tolerate cold starts

EC2:

  • Consistent traffic
  • Large models (>250MB)
  • Low latency requirements (<50ms)
  • Need persistent connections

Serving Design Choice: Lambda for variable workloads, EC2 for consistent high-throughput requirements.

Data Management Patterns

S3-based data organization for ML workflows.

S3 Bucket Organization

ml-project-bucket/
├── data/
│   ├── raw/
│   │   ├── 2024/01/15/data.csv
│   │   └── 2024/01/16/data.csv
│   ├── processed/
│   │   ├── train.parquet
│   │   └── test.parquet
│   └── features/
│       └── feature_v1.csv
├── models/
│   ├── experiments/
│   │   ├── exp_001/model.pth
│   │   └── exp_002/model.pth
│   └── production/
│       ├── model_v1.pth
│       └── model_v2.pth
└── results/
    ├── predictions/
    └── metrics/

Data Processing Pipeline

# Data validation and preprocessing
import boto3
import pandas as pd
from sklearn.model_selection import train_test_split

s3 = boto3.client('s3')

def process_data():
    # Download raw data
    s3.download_file('bucket', 'data/raw/data.csv', 'raw.csv')

    # Clean and validate (validate_schema / clean_missing_values are project-specific helpers)
    df = pd.read_csv('raw.csv')
    df = validate_schema(df)
    df = clean_missing_values(df)

    # Split and save
    train, test = train_test_split(df)
    train.to_parquet('train.parquet')
    test.to_parquet('test.parquet')

    # Upload processed data
    s3.upload_file('train.parquet', 'bucket', 'data/processed/train.parquet')
    s3.upload_file('test.parquet', 'bucket', 'data/processed/test.parquet')

S3 Storage Class Strategy

Data Type                 Access Pattern   Storage Class   Cost/GB/month
Raw data                  Archive only     Glacier         $0.004
Processed training data   Weekly access    IA              $0.0125
Active models             Daily access     Standard        $0.023
Predictions               Real-time        Standard        $0.023

Data Lifecycle Management

# Lifecycle policy example (each rule needs an ID and a Filter/Prefix to be valid)
lifecycle_policy = {
    'Rules': [{
        'ID': 'archive-raw-data',
        'Status': 'Enabled',
        'Filter': {'Prefix': 'data/raw/'},   # example prefix from the layout above
        'Transitions': [
            {
                'Days': 30,
                'StorageClass': 'STANDARD_IA'
            },
            {
                'Days': 90,
                'StorageClass': 'GLACIER'
            }
        ]
    }]
}
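Applying the rules with boto3 might look like the following, using the lifecycle_policy dict above (bucket name reused from the layout above):

import boto3

# Attach the lifecycle rules defined above to the bucket
s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='ml-project-bucket',
    LifecycleConfiguration=lifecycle_policy
)

# Confirm the rules took effect
print(s3.get_bucket_lifecycle_configuration(Bucket='ml-project-bucket')['Rules'])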

Data Access Patterns

  • Training: High bandwidth, infrequent access
  • Serving: Low bandwidth, frequent access
  • Archival: No bandwidth, rare access
  • Monitoring: Medium bandwidth, regular access

Cost Optimization

  • Use appropriate storage class
  • Compress data files (parquet vs CSV)
  • Partition large datasets by date/category
  • Delete intermediate processing files

Data Strategy: Organize by lifecycle stage, optimize storage classes for access patterns, implement automated lifecycle policies.

System Integration and Orchestration

Connect EC2 training and Lambda serving through S3.

End-to-End Workflow

Automated Training Pipeline

# CloudWatch Event triggered training
def trigger_training(event, context):
    # Launch EC2 training instance
    ec2 = boto3.client('ec2')
    
    user_data_script = '''#!/bin/bash
    aws s3 cp s3://ml-bucket/scripts/train.py /home/ubuntu/
    cd /home/ubuntu
    python3 train.py
    sudo shutdown -h now
    '''
    
    response = ec2.run_instances(
        ImageId='ami-0c02fb55956c7d316',  # Deep Learning AMI
        InstanceType='p3.2xlarge',
        MinCount=1, MaxCount=1,
        UserData=user_data_script,
        IamInstanceProfile={'Name': 'ML-Training-Role'}
    )
    
    return {'instance_id': response['Instances'][0]['InstanceId']}

Model Update Workflow

# S3 trigger for model updates
def update_serving_model(event, context):
    # New model uploaded to S3
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    
    if key.startswith('models/production/'):
        # Update Lambda environment variable
        lambda_client = boto3.client('lambda')
        lambda_client.update_function_configuration(
            FunctionName='ml-serving-function',
            Environment={'Variables': {'MODEL_PATH': key}}
        )

Monitoring and Alerting

CloudWatch Metrics

  • Training job duration and cost
  • Model serving latency and error rates
  • S3 storage usage and costs
  • Lambda function invocations and failures

Automated Alerts

# CloudWatch alarm for training failures.
# Assumes the training workflow publishes a custom 'InstanceTerminated' metric
# (see the sketch below); AWS/* namespaces are reserved for AWS-published metrics.
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='ML-Training-Failed',
    MetricName='InstanceTerminated',
    Namespace='ML/Training',
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ml-alerts']
)
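The alarm above watches a custom metric that the training workflow is assumed to publish; a minimal sketch of emitting it (namespace and metric name must match the alarm):

import boto3

# Published by the training workflow when an instance terminates without uploading a model
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='ML/Training',                 # must match the alarm's namespace
    MetricData=[{
        'MetricName': 'InstanceTerminated',  # must match the alarm's metric name
        'Value': 1,
        'Unit': 'Count'
    }]
)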

System Health Dashboard

  • Active training jobs and progress
  • Model serving performance metrics
  • Daily/weekly cost breakdown
  • Data pipeline health status

Integration Principles: Use S3 as central data store, automate workflows with triggers, implement comprehensive monitoring.

Cost Management and Optimization

Practical cost control for EC2 and S3 based ML systems.

Cost Breakdown Analysis

Monthly Costs for Typical ML Project

  • S3 storage (500GB): $11.50
  • EC2 training (200 hours p3.2xlarge at $3.06/hour): $612
  • Lambda serving (1M requests): $200
  • Data transfer: $50
  • Total: $873.50/month

Cost Optimization Strategies

EC2 Training Optimization

  • Use spot instances: 70% cost reduction ($612 → $184)
  • Right-size instances: Match model requirements to instance type
  • Automated termination: Stop instances when training completes
  • Reserved instances: 60% discount for predictable workloads

S3 Storage Optimization

  • Lifecycle policies: Automatic transition to cheaper storage classes
  • Data compression: 50-80% size reduction with parquet/gzip
  • Intelligent tiering: Automatic cost optimization
  • Delete temporary files: Clean up intermediate processing data

Lambda Serving Optimization

  • Memory allocation: Match to actual model requirements
  • Provisioned concurrency: Reduce cold start costs for consistent traffic
  • Alternative architectures: Consider EC2 for high-volume serving

Monitoring and Budgets

  • Cost allocation tags: Track expenses by project/team
  • Billing alerts: Notification when costs exceed thresholds
  • Usage reports: Identify optimization opportunities

Cost Optimization Impact

Optimization     Before    After     Savings
Spot instances   $612      $184      $428 (70%)
S3 lifecycle     $11.50    $5.75     $5.75 (50%)
Right-sizing     $200      $120      $80 (40%)
Total            $873.50   $359.75   $513.75

Monthly savings: 59% through optimization

Budgeting Framework

# Set up cost alerts
import boto3

budgets = boto3.client('budgets')
budgets.create_budget(
    AccountId='123456789012',
    Budget={
        'BudgetName': 'ML-Project-Budget',
        'BudgetLimit': {
            'Amount': '500',
            'Unit': 'USD'
        },
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST'
    },
    NotificationsWithSubscribers=[{
        'Notification': {
            'NotificationType': 'ACTUAL',
            'ComparisonOperator': 'GREATER_THAN',
            'Threshold': 80
        },
        'Subscribers': [{
            'SubscriptionType': 'EMAIL',
            'Address': 'admin@company.com'
        }]
    }]
)

Cost Management Process: Set budgets, implement optimizations, monitor usage patterns, adjust resources based on actual requirements.

Production Deployment Considerations

Transform development system into production-ready ML service.

Production Readiness Checklist

Security

  • IAM roles with least-privilege permissions
  • S3 bucket policies restricting access
  • VPC for network isolation
  • Encryption at rest and in transit

Reliability

  • Multi-AZ deployment for high availability
  • Automated backup and recovery procedures
  • Health checks and automatic failover
  • Circuit breakers for external dependencies

Monitoring

  • Comprehensive logging and metrics
  • Alerting for system and model performance
  • Distributed tracing for debugging
  • Business impact measurement

Scalability

  • Auto-scaling groups for EC2 instances
  • Lambda concurrent execution limits
  • S3 request rate optimization
  • CDN for global content distribution

Compliance

  • Data retention and deletion policies
  • Audit logging for regulatory requirements
  • Model explainability and bias detection
  • Privacy protection and anonymization

Development vs Production

Aspect               Development   Production
Data volume          1GB sample    1TB+ full dataset
Training frequency   Manual        Automated daily/weekly
Serving SLA          Best effort   99.9% availability
Security             Basic         Enterprise-grade
Cost                 $50/month     $500-5000/month

Production Architecture Changes

  • Load balancer in front of serving instances
  • Database for model metadata and predictions
  • Monitoring dashboards and alerting
  • CI/CD pipeline for code deployment
  • Infrastructure as code (Terraform/CloudFormation)

Operational Procedures

  • Incident response and escalation
  • Model retraining and deployment pipeline
  • Performance regression testing
  • Capacity planning and resource forecasting
  • Regular security audits and updates

Success Metrics

  • System uptime and availability
  • Model prediction accuracy over time
  • Response latency and throughput
  • Cost per prediction or user
  • Time to detect and resolve issues

Production Transformation: Add redundancy, monitoring, security, and operational procedures around the basic EC2/S3/Lambda architecture.

AWS Identity and Access Management

Amazon Resource Names: Global Resource Identification

AWS uses ARNs to uniquely identify every resource across all accounts and regions globally.

ARN Structure Format

arn:partition:service:region:account-id:resource-type/resource-id

Component Breakdown

Partition: AWS deployment (usually “aws”)

  • aws - Standard AWS regions
  • aws-cn - China regions
  • aws-us-gov - GovCloud regions

Service: AWS service name

  • s3 - Simple Storage Service
  • ec2 - Elastic Compute Cloud
  • iam - Identity and Access Management
  • lambda - Lambda Functions

Region: Geographic region identifier

  • us-east-1 - US East (Virginia)
  • eu-west-1 - EU (Ireland)
  • Empty for global services (IAM, S3 bucket names)

Account ID: 12-digit account identifier

  • 123456789012 - Specific AWS account
  • Empty for public resources

Resource: Service-specific identifier

  • bucket-name - S3 bucket
  • instance/i-1234567890abcdef0 - EC2 instance
  • user/developer-name - IAM user

Real ARN Examples

S3 Bucket ARN

arn:aws:s3:::ml-training-bucket-12345
  • Global resource (no region/account)
  • Bucket names must be globally unique

S3 Object ARN

arn:aws:s3:::ml-training-bucket-12345/models/bert-base.pth
  • Specific object within bucket
  • Used in policies for granular access

EC2 Instance ARN

arn:aws:ec2:us-east-1:123456789012:instance/i-0abcd1234ef567890
  • Region-specific resource
  • Account-specific identifier

IAM Role ARN

arn:aws:iam::123456789012:role/EC2-ML-Training-Role
  • Global service (no region)
  • Account-specific role

Lambda Function ARN

arn:aws:lambda:us-east-1:123456789012:function:iris-classifier-api
  • Region and account specific
  • Function name as resource ID

Policy Usage Example

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": [
    "arn:aws:s3:::ml-training-bucket-12345/data/*",
    "arn:aws:s3:::ml-training-bucket-12345/models/*"
  ]
}

ARNs enable precise resource identification across AWS’s global infrastructure, supporting granular access control and cross-service integration.
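A small helper that splits an ARN into the components described above (purely illustrative; the sample ARNs are the ones from this page):

def parse_arn(arn):
    """Split an ARN into its colon-delimited components.
    The resource portion may itself contain ':' or '/', so limit the split to 5."""
    partition, service, region, account, resource = arn.split(':', 5)[1:]
    return {'partition': partition, 'service': service,
            'region': region or None, 'account': account or None,
            'resource': resource}

print(parse_arn('arn:aws:ec2:us-east-1:123456789012:instance/i-0abcd1234ef567890'))
# {'partition': 'aws', 'service': 'ec2', 'region': 'us-east-1',
#  'account': '123456789012', 'resource': 'instance/i-0abcd1234ef567890'}

print(parse_arn('arn:aws:s3:::ml-training-bucket-12345/models/bert-base.pth'))
# region and account come back as None for this global S3 resource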

Resource IDs and Naming Conventions

AWS generates unique identifiers for resources with predictable patterns for programmatic access.

AWS-Generated IDs

EC2 Instances

  • Pattern: i- + 17 hex characters
  • Example: i-0abcd1234efgh5678
  • Unique within region, persistent across stop/start

Security Groups

  • Pattern: sg- + 17 hex characters
  • Example: sg-0123456789abcdef0
  • Referenced in networking and firewall rules

VPCs (Virtual Private Clouds)

  • Pattern: vpc- + 17 hex characters
  • Example: vpc-12345678901234567
  • Container for all networking resources

Subnets

  • Pattern: subnet- + 17 hex characters
  • Example: subnet-0abcdef1234567890
  • Network segment within VPC and AZ

AMI (Amazon Machine Images)

  • Pattern: ami- + 17 hex characters
  • Example: ami-0c2b8ca1dad447f8a
  • Immutable OS image for launching instances

EBS Volumes

  • Pattern: vol- + 17 hex characters
  • Example: vol-0123456789abcdef0
  • Block storage attached to instances

User-Defined Naming

S3 Bucket Names (Global)

  • Must be globally unique across all AWS accounts
  • 3-63 characters, lowercase, no underscores
  • Examples: ml-training-data-company-2024, model-artifacts-prod

IAM Names (Account-scoped)

  • User names: developer-john-smith, ci-cd-deployment
  • Role names: EC2-ML-Training-Role, Lambda-S3-Access
  • Policy names: MLTrainingDataAccess, ModelDeploymentPermissions

Tags for Resource Organization

{
  "Environment": "production",
  "Project": "ml-classifier",
  "Owner": "data-science-team",
  "CostCenter": "research-development"
}
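Tags like these can be applied at launch or attached afterwards; a minimal boto3 sketch (the instance ID is hypothetical):

import boto3

# Attach the organizational tags above to an existing instance (ID is hypothetical)
ec2 = boto3.client('ec2')
ec2.create_tags(
    Resources=['i-0abcd1234ef567890'],
    Tags=[
        {'Key': 'Environment', 'Value': 'production'},
        {'Key': 'Project', 'Value': 'ml-classifier'},
        {'Key': 'Owner', 'Value': 'data-science-team'},
        {'Key': 'CostCenter', 'Value': 'research-development'},
    ]
)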

Naming Best Practices

Descriptive and Searchable

  • Good: ml-training-p3xlarge-gpu-instance
  • Bad: my-instance-1

Environment Separation

  • ml-model-artifacts-dev
  • ml-model-artifacts-staging
  • ml-model-artifacts-prod

Service Integration

# EC2 instance launches with role
aws ec2 run-instances \
    --image-id ami-0c2b8ca1dad447f8a \
    --instance-type p3.2xlarge \
    --iam-instance-profile Name=EC2-ML-Training-Profile \
    --security-group-ids sg-0123456789abcdef0 \
    --subnet-id subnet-0abcdef1234567890

Consistent resource naming and understanding ID patterns enables automation, cost tracking, and operational management at scale.

Identity Hierarchies: Users, Roles, and Service Accounts

Distributed systems require identity verification across network boundaries without shared local authentication.

Distributed Systems Security Problem

Local systems rely on operating system authentication:

  • Single login validates all local resource access
  • File permissions enforced by kernel
  • Process isolation prevents unauthorized access
  • Network access assumed trusted (localhost)

Cloud Distribution Challenge

  • Resources span multiple physical machines across datacenters
  • Network communication between untrusted systems
  • No shared operating system to enforce permissions
  • Service-to-service calls cross security boundaries
  • Identity must be verified for every distributed request

IAM as Distributed Security Solution

AWS IAM solves distributed identity through:

  • Centralized identity store: Single source of truth for all accounts
  • Network-based credentials: Authentication tokens sent over network
  • Service-specific permissions: Each API call individually authorized
  • Cross-boundary trust: Roles enable secure service communication

IAM Identity Types

Root Account

  • Complete administrative access to all AWS services and resources
  • Email address and password used for initial account creation
  • Cannot be restricted through IAM policies
  • Should never be used for day-to-day operations
  • Requires multi-factor authentication for production accounts

IAM Users

  • Individual identity for human access to AWS resources
  • Permanent credentials (access key ID and secret access key)
  • Optional password for console access
  • Direct attachment of policies and group membership
  • Maximum 5,000 users per AWS account

IAM Roles

  • Temporary credentials for applications, services, or cross-account access
  • No permanent credentials - credentials issued dynamically
  • Assumed by trusted entities (users, services, other accounts)
  • Preferred method for EC2 instances and Lambda functions
  • Cross-account access without sharing permanent credentials

Service-Linked Roles

  • Predefined roles for specific AWS services
  • Automatically created and managed by AWS services
  • Cannot be modified or deleted by users
  • Required for services like ECS, Lambda, and Auto Scaling

Identity Hierarchy Structure

AWS Account (Root)
├── IAM Users
│   ├── Individual Developer A
│   ├── Individual Developer B
│   └── CI/CD System User
├── IAM Groups
│   ├── Developers Group
│   ├── Administrators Group
│   └── Read-Only Group
├── IAM Roles
│   ├── EC2-ML-Training-Role
│   ├── Lambda-Execution-Role
│   └── Cross-Account-Access-Role
└── Service-Linked Roles
    ├── ECS Task Role
    ├── Auto Scaling Role
    └── CloudFormation Role

Identity Relationship Dependencies

User → Group Membership

  • Users inherit permissions from all assigned groups
  • Group policy changes affect all group members immediately
  • Maximum 10 groups per user, 300 groups per account

Role → Trust Relationships

  • Trust policy defines which entities can assume the role
  • Role policies define permissions when role is assumed
  • Temporary credentials expire (15 minutes to 12 hours)

Cross-Account Trust

  • Role in Account A trusts specific users/roles in Account B
  • External ID required for enhanced security in cross-account scenarios
  • Audit trail through CloudTrail for all role assumptions

Critical Design Principle: Least privilege access - grant minimum permissions required for specific tasks, expandable through group membership or role assumption.

Permission Models: Policies, Actions, and Resource Restrictions

Distributed systems require explicit authorization for every network request.

Local vs Distributed Authorization

Local System Authorization (Traditional)

  • Operating system controls file access through uid/gid
  • Process inherits user permissions automatically
  • File system enforces read/write/execute permissions
  • No network authorization required for local resources

Distributed System Authorization (Cloud)

  • Every API call evaluated independently across network
  • No inherited permissions between services
  • Each resource access requires explicit policy evaluation
  • Network requests carry identity and are verified remotely

Policy-Based Authorization Model

IAM implements declarative security through JSON policies:

Policy Document Structure

Basic Policy Components

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::ml-training-bucket/models/*",
                "arn:aws:s3:::ml-training-bucket/datasets/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::ml-training-bucket",
            "Condition": {
                "StringLike": {
                    "s3:prefix": ["models/*", "datasets/*"]
                }
            }
        }
    ]
}

Policy Types and Attachment Methods

Identity-Based Policies

  • Attached directly to users, groups, or roles
  • Define permissions for the identity across all resources
  • Inherited through group membership
  • Maximum 10 managed policies per identity

Resource-Based Policies

  • Attached directly to resources (S3 buckets, Lambda functions)
  • Define which identities can access the resource
  • Cross-account access without role assumption
  • Resource owner maintains control over access

Permission Boundaries

  • Maximum permissions an identity can have
  • Does not grant permissions, only limits them
  • Applied to users and roles, not groups
  • Advanced feature for delegation of administrative tasks

Policy Evaluation Logic

  1. Explicit deny in any applicable policy → request denied
  2. Otherwise, explicit allow (within any permission boundary) → request allowed
  3. No matching allow → implicit deny (default)

Common Permission Patterns

Service-Specific Actions

  • s3:ListBucket - List objects in S3 bucket
  • ec2:RunInstances - Launch EC2 instances
  • iam:CreateRole - Create IAM roles
  • logs:CreateLogGroup - Create CloudWatch log groups

Resource ARN Patterns

  • arn:aws:s3:::bucket-name/* - All objects in bucket
  • arn:aws:ec2:us-east-1:*:instance/* - All instances in region
  • arn:aws:iam::account-id:role/role-name - Specific IAM role

Policy Evaluation Rule: Explicit deny always wins, followed by explicit allow, with implicit deny as default for all unspecified actions.
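To check how these rules play out for a specific identity, the IAM policy simulator can evaluate actions against its attached policies; a minimal boto3 sketch (role ARN and object ARN are illustrative):

import boto3

# Ask IAM how an identity's attached policies resolve for a concrete action and resource
iam = boto3.client('iam')
result = iam.simulate_principal_policy(
    PolicySourceArn='arn:aws:iam::123456789012:role/EC2-ML-Training-Role',  # illustrative role
    ActionNames=['s3:GetObject'],
    ResourceArns=['arn:aws:s3:::ml-training-bucket/models/model_v1.pth']
)
for r in result['EvaluationResults']:
    # EvalDecision is 'allowed', 'explicitDeny', or 'implicitDeny'
    print(r['EvalActionName'], '->', r['EvalDecision'])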

AWS Access Methods: Console, CLI, and SDK Integration

Multiple programmatic and interactive interfaces provide access to AWS services with different authentication and use case optimization.

AWS Management Console

  • Web-based graphical interface for all AWS services
  • Requires username/password authentication
  • Multi-factor authentication support required for production
  • Session-based access with configurable timeout
  • Visual resource management and monitoring dashboards

Console Authentication Flow

User Login → MFA Verification → Session Token
├── Session Duration: 12 hours maximum
├── Automatic logout on inactivity
├── Role switching within console
└── CloudTrail logging of all actions

AWS Command Line Interface (CLI)

  • Text-based tool for scriptable AWS service interaction
  • Supports all AWS service APIs through consistent command structure
  • Local credential configuration and profile management
  • Batch operations and automation scripting
  • Output formats: JSON, table, text for different use cases

CLI Installation and Configuration

# Install AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install

# Configure default profile
aws configure
# AWS Access Key ID: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Default region name: us-east-1
# Default output format: json

AWS Software Development Kits (SDKs)

  • Language-specific libraries for AWS service integration
  • Available for Python (boto3), Java, .NET, Node.js, Go, Rust
  • Automatic retry logic and error handling
  • Built-in credential chain resolution
  • Asynchronous operations and pagination support

SDK Authentication Hierarchy

  1. Explicit credentials in code (not recommended)
  2. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  3. Credentials file (~/.aws/credentials)
  4. IAM roles for EC2 instances (Instance Metadata Service)
  5. IAM roles for ECS tasks (Task role assignment)
  6. IAM roles for Lambda functions (Execution role)

Python SDK (boto3) Example

import boto3

# Automatic credential resolution
s3_client = boto3.client('s3')

# List buckets
response = s3_client.list_buckets()
for bucket in response['Buckets']:
    print(f"Bucket: {bucket['Name']}")

# Upload file with automatic multipart
s3_client.upload_file(
    'local_file.txt', 
    'ml-training-bucket', 
    'datasets/file.txt'
)

Access Method Comparison

  • Console: Interactive exploration, visual debugging
  • CLI: Automation scripts, CI/CD integration
  • SDK: Application integration, programmatic access
  • API: Direct HTTP calls, custom tooling

Credential Security Principle: Use temporary credentials (roles) for applications, permanent credentials only for development environments with regular rotation.

Credential Management: Security Keys, Profiles, and Rotation

Secure credential management requires understanding authentication mechanisms, storage locations, and rotation procedures for maintaining system security.

Credential Types and Use Cases

Access Key Pairs (Permanent Credentials)

  • Access Key ID: Public identifier (20 characters, starts with AKIA)
  • Secret Access Key: Private key (40 characters, base64-encoded)
  • Used for programmatic access via CLI and SDKs
  • Maximum 2 active access keys per IAM user
  • Require regular rotation (recommended 90 days)

Temporary Security Credentials

  • Session token in addition to access key pair
  • Limited lifetime (15 minutes to 36 hours)
  • Issued through AWS Security Token Service (STS)
  • Cannot be extended - must be refreshed before expiration
  • Used automatically by EC2 instance roles and Lambda functions

Multi-Factor Authentication (MFA)

  • Virtual MFA devices (Google Authenticator, Authy)
  • Hardware MFA devices (YubiKey, Gemalto)
  • Required for sensitive operations (root account, role assumption)
  • Time-based one-time passwords (TOTP) or challenge-response

Credential Storage Mechanisms

Local Configuration Files

# ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[development]
aws_access_key_id = AKIAI44QH8DHBEXAMPLE
aws_secret_access_key = je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY

# ~/.aws/config  
[default]
region = us-east-1
output = json

[profile development]
region = us-west-2
output = table

Environment Variable Configuration

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1
export AWS_PROFILE=development

Instance Metadata Service (IMDS)

  • Automatic credential delivery to EC2 instances
  • No permanent credentials stored on instance
  • Credentials refreshed automatically before expiration
  • IMDSv2 requires token-based requests for enhanced security

# Get instance role credentials (IMDSv2)
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

CREDENTIALS=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/iam/security-credentials/role-name)

Credential Security Best Practices

Development Environment

  • Use named profiles for different projects/accounts
  • Never commit credentials to version control systems
  • Use environment variables for containerized applications
  • Implement credential scanning in CI/CD pipelines

Production Environment

  • IAM roles for all EC2 instances and Lambda functions
  • Cross-account roles instead of shared permanent credentials
  • Regular rotation of any permanent credentials (90-day maximum)
  • Monitoring and alerting for credential usage anomalies

Credential Rotation Procedure

  1. Create second access key while first remains active
  2. Update applications to use new credentials
  3. Test functionality with new credentials
  4. Delete old access key after verification
  5. Monitor CloudTrail for any authentication failures

Security Implementation Standard: Production systems must use IAM roles with temporary credentials; permanent access keys only for development environments with mandatory rotation procedures.
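A minimal sketch of steps 1 and 4 of the rotation procedure with boto3 (user name and old key ID are illustrative placeholders); steps 2-3 are application configuration changes:

import boto3

iam = boto3.client('iam')
USER = 'developer-john-smith'   # illustrative user name

# Step 1: create a second access key while the old one stays active
new_key = iam.create_access_key(UserName=USER)['AccessKey']
print('New key:', new_key['AccessKeyId'])   # distribute via a secrets store, never source control

# ... update and test applications with the new credentials (steps 2-3) ...

# Step 4: deactivate, then delete the old key after verification
old_key_id = 'AKIAIOSFODNN7EXAMPLE'         # illustrative old key ID
iam.update_access_key(UserName=USER, AccessKeyId=old_key_id, Status='Inactive')
iam.delete_access_key(UserName=USER, AccessKeyId=old_key_id)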

Role Assumption and Cross-Account Access Patterns

Distributed systems require transitive trust without credential sharing.

Distributed Trust Problem

Traditional network security uses shared secrets:

  • Database passwords shared across all application servers
  • API keys distributed to every service that needs access
  • Credentials stored in configuration files on multiple machines
  • Single credential compromise affects entire system

Transitive Trust Challenge

ML systems require service-to-service access:

  • Training service needs to read S3 data and write models
  • API service needs to load models and log predictions
  • Monitoring service needs to access logs from all other services
  • Each service runs on separate machines with separate credentials

Role Assumption as Trust Delegation

IAM roles implement temporary trust without credential sharing:

  • Identity verification: Service proves its identity to AWS
  • Trust policy evaluation: AWS checks if service can assume target role
  • Temporary credential issuance: AWS provides time-limited access tokens
  • Resource access: Service uses temporary credentials for specific actions
  • Automatic expiration: Credentials become invalid after specified time

Role Assumption Mechanics

Trust Policy Configuration

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:user/DeveloperA",
                    "arn:aws:iam::123456789012:role/EC2-Instance-Role"
                ]
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "unique-external-identifier"
                }
            }
        }
    ]
}

Role Assumption Process

  1. Authentication: Identity authenticates with AWS using permanent or temporary credentials
  2. Authorization: AWS verifies identity has sts:AssumeRole permission for target role
  3. Trust Evaluation: Target role’s trust policy evaluated against requesting identity
  4. Token Issuance: AWS STS issues temporary credentials (AccessKeyId, SecretAccessKey, SessionToken)
  5. Resource Access: Temporary credentials used for API calls within role’s permission scope

Temporary Credential Characteristics

  • Default session duration: 1 hour for role assumption
  • Maximum session duration: 12 hours (configurable per role)
  • Credentials include session token for authentication
  • Cannot be extended - must assume role again for continued access

Cross-Account Access Patterns

Development Account → Production Account

# Assume role in production account
aws sts assume-role \
    --role-arn arn:aws:iam::987654321098:role/ProductionDeploymentRole \
    --role-session-name deployment-session-2024 \
    --external-id unique-external-identifier

# Response contains temporary credentials
{
    "Credentials": {
        "AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
        "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "SessionToken": "very-long-session-token-string",
        "Expiration": "2024-03-15T14:30:00Z"
    }
}

Service-to-Service Role Assumption

  • EC2 instances assume roles for S3 access
  • Lambda functions assume roles for DynamoDB operations
  • ECS tasks assume roles for Secrets Manager access
  • CodeBuild projects assume roles for deployment operations

Cross-Account Trust Relationships

Account A (Production) Trusts Account B (Development)

Account B (111111111111) - Development
├── Developer Users
├── CI/CD Systems
└── Can assume roles in Production Account

Account A (222222222222) - Production  
├── ProductionDeploymentRole (trusts Account B)
├── DataAccessRole (trusts specific users)
└── MonitoringRole (trusts service accounts)

Role Chaining Limitations

  • Role chaining (assuming a role from an already-assumed role session) caps the chained session at 1 hour
  • Chained sessions cannot be extended beyond that limit regardless of the role's configured maximum
  • Use role switching in console for multi-level access
  • Cross-account access requires explicit trust in both directions

Access Control Architecture: Cross-account role assumption provides secure resource sharing without permanent credential distribution, enabling centralized identity management across multiple AWS environments.

AWS SDK and CLI Configuration Management

Programmatic AWS access requires proper configuration of authentication credentials, regional settings, and service-specific parameters through standardized configuration methods.

Configuration Hierarchy and Precedence

Credential Resolution Order

  1. Command-line parameters (aws s3 ls --profile production)
  2. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  3. CLI credentials file (~/.aws/credentials)
  4. CLI configuration file (~/.aws/config)
  5. Container credentials (ECS task role)
  6. Instance metadata service (EC2 instance role)

Profile-Based Configuration Management

# ~/.aws/config
[default]
region = us-east-1
output = json

[profile development]
region = us-west-2
output = table
role_arn = arn:aws:iam::123456789012:role/DevelopmentRole
source_profile = default

[profile production]
region = us-east-1
output = json
role_arn = arn:aws:iam::987654321098:role/ProductionRole
source_profile = default
external_id = prod-external-id-2024

Advanced Configuration Options

Regional Configuration

  • Default region for service calls
  • Service-specific regional overrides
  • Regional failover configuration for high availability
  • Cross-region replication settings

Output Format Specification

  • json: Machine-readable structured output
  • table: Human-readable tabular format
  • text: Tab-delimited values for shell scripting
  • yaml: YAML-formatted output for configuration files

SDK Configuration Examples

Python (boto3) Configuration

import boto3
from botocore.config import Config

# Session with specific profile
session = boto3.Session(profile_name='development')
s3_client = session.client('s3')

# Client with custom configuration
config = Config(
    region_name='us-west-2',
    retries={'max_attempts': 10, 'mode': 'adaptive'},
    max_pool_connections=50
)
ec2_client = boto3.client('ec2', config=config)

# Role assumption for cross-account access
sts_client = boto3.client('sts')
assumed_role = sts_client.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/DataAccessRole',
    RoleSessionName='ml-training-session'
)

# Use temporary credentials
temp_credentials = assumed_role['Credentials']
s3_resource = boto3.resource(
    's3',
    aws_access_key_id=temp_credentials['AccessKeyId'],
    aws_secret_access_key=temp_credentials['SecretAccessKey'],
    aws_session_token=temp_credentials['SessionToken']
)

CLI Profile Operations

# List configured profiles
aws configure list-profiles

# Use specific profile
aws s3 ls --profile development

# Set default profile
export AWS_PROFILE=development

# Configure new profile interactively
aws configure --profile new-environment

Environment-Specific Configuration

# Development environment
export AWS_PROFILE=development
export AWS_DEFAULT_REGION=us-west-2

# Production environment  
export AWS_PROFILE=production
export AWS_DEFAULT_REGION=us-east-1
export AWS_DEFAULT_OUTPUT=json

Configuration Management Strategy: Use named profiles for environment separation, environment variables for containerized applications, and IAM roles for production services to maintain security boundaries and operational consistency.

Security Best Practices: Permission Boundaries and Access Monitoring

Comprehensive security requires implementing permission boundaries, continuous access monitoring, and automated compliance verification to maintain least-privilege principles.

Permission Boundary Implementation

Maximum Permission Limits

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "ec2:DescribeInstances",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Deny",
            "Action": [
                "iam:*",
                "ec2:TerminateInstances",
                "s3:DeleteBucket"
            ],
            "Resource": "*"
        }
    ]
}

Boundary Application Pattern

  • Attached to IAM users and roles (not groups)
  • Defines maximum permissions, never grants permissions
  • Combined with identity-based policies using logical AND
  • Enables safe delegation of administrative tasks
  • Prevents privilege escalation through policy modification
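A sketch of attaching a boundary, assuming the JSON above has been created as a managed policy (policy ARN and user name are illustrative):

import boto3

iam = boto3.client('iam')
BOUNDARY_ARN = 'arn:aws:iam::123456789012:policy/MLDeveloperBoundary'  # illustrative

# Attach the boundary to an existing user; identity policies still grant permissions,
# but nothing outside the boundary is ever effective.
iam.put_user_permissions_boundary(
    UserName='developer-john-smith',
    PermissionsBoundary=BOUNDARY_ARN
)

# New roles can be created with the boundary applied from the start, e.g.:
# iam.create_role(RoleName='Delegated-Dev-Role',
#                 AssumeRolePolicyDocument=trust_policy_json,
#                 PermissionsBoundary=BOUNDARY_ARN)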

Access Monitoring and Alerting

CloudTrail Event Monitoring

  • All API calls logged with identity, timestamp, source IP
  • Failed authentication attempts and permission denials
  • Root account usage (should trigger immediate alerts)
  • Cross-account role assumptions and policy changes
  • Unusual geographic access patterns or service usage

Critical Security Events

# Root account login
"eventName": "ConsoleLogin",
"userIdentity.type": "Root"

# Failed authentication attempts
"errorCode": "SigninFailure"
"errorMessage": "Invalid username or password"

# Policy modification
"eventName": "PutUserPolicy",
"eventName": "AttachRolePolicy"

# Cross-account access
"eventName": "AssumeRole",
"recipientAccountId": "different-account-id"

Automated Compliance Verification

AWS Config Rules for IAM Compliance

  • Root access key usage detection
  • Multi-factor authentication requirement validation
  • Unused IAM users and roles identification
  • Password policy compliance verification
  • Permission boundary attachment verification

Access Review Procedures

Quarterly Access Audit

  1. Identity Inventory: List all users, roles, and service accounts
  2. Permission Analysis: Review attached policies and group memberships
  3. Access Pattern Review: Analyze CloudTrail logs for actual usage
  4. Inactive Account Detection: Identify accounts without recent activity
  5. Privilege Escalation Check: Verify no unauthorized permission increases

Automated Security Monitoring

import boto3
from datetime import datetime

def audit_iam_users():
    iam = boto3.client('iam')
    
    # Get all IAM users
    users = iam.list_users()['Users']
    
    for user in users:
        username = user['UserName']
        
        # Check last activity
        try:
            last_used = iam.get_user(UserName=username)['User'].get('PasswordLastUsed')
            if last_used:
                days_inactive = (datetime.now(last_used.tzinfo) - last_used).days
                if days_inactive > 90:
                    print(f"Warning: User {username} inactive for {days_inactive} days")
        except Exception as e:
            print(f"Unable to check activity for {username}: {e}")
        
        # Check MFA status
        mfa_devices = iam.list_mfa_devices(UserName=username)['MFADevices']
        if not mfa_devices:
            print(f"Warning: User {username} has no MFA device")

Security Incident Response

  • Automatic credential rotation for compromised access keys
  • Role assumption monitoring for unusual patterns
  • Geographic access anomaly detection and blocking
  • Integration with SIEM systems for enterprise security

Security Architecture Principle: Implement defense-in-depth through permission boundaries, continuous monitoring, and automated compliance verification to maintain security posture at scale.

AWS ML Pipeline Implementation

EC2-S3 ML Pipeline Architecture

Local ML development breaks under production data volumes and serving requirements.

Development Environment Limitations

MacBook Pro M3 (32GB RAM)

  • Training dataset limit: 20GB fits in memory
  • Model size limit: ~8B parameters at float32 (8B × 4 bytes ≈ 32GB, the entire unified memory)
  • Training time: 4 hours for ResNet-50 on ImageNet subset
  • Serving capacity: Single process, ~10 requests/second
  • Storage: 1TB SSD, no redundancy or backup

Production Requirements

Training Workload

  • Dataset: ImageNet full (1.3TB, 14M images)
  • Model: EfficientNet-B7 (800M parameters, 12GB memory)
  • Training time constraint: <8 hours for experiment iteration
  • Concurrent experiments: 3-5 model variants simultaneously

Serving Workload

  • Traffic: 1000+ requests/second peak
  • Latency requirement: <100ms p99
  • Availability: 99.9% uptime (43 minutes downtime/month)
  • Global deployment: US, Europe, Asia regions

Failure Points

  • Memory: 1.3TB dataset exceeds 32GB RAM → Training impossible
  • Storage: 1TB drive fills with 1.3TB dataset → Process fails
  • Serving: Single process cannot handle 1000 req/s → Request timeouts
  • Availability: Single machine failure = 100% downtime → SLA violation

Distributed Architecture Solution

EC2 Compute Scaling

  • Instance type: r5.2xlarge (8 vCPUs, 64GB RAM)
  • GPU acceleration: p3.2xlarge (V100, 16GB VRAM)
  • Cost: $3.06/hour for training, shut down when idle
  • Concurrent training: Launch multiple instances simultaneously

S3 Storage Scaling

  • Capacity: Unlimited storage (1.3TB+ supported)
  • Durability: 99.999999999% (11 9’s) - no data loss risk
  • Access: Concurrent reads from multiple training instances
  • Cost: $0.023/GB/month ($30/month for 1.3TB)

Network Integration

EC2 r5.2xlarge (us-east-1a)
├── Training Process: PyTorch + 64GB RAM
├── Data Pipeline: boto3 → S3 streaming
├── Model Output: S3 model artifacts
└── API Server: Flask + gunicorn (100 req/s)

S3 Bucket (us-east-1)
├── /data/imagenet/ (1.3TB training data)
├── /models/experiments/ (trained model weights)
└── /logs/training/ (experiment tracking)

Cost Structure

  • Training: $3.06/hour × 8 hours = $24.48 per experiment
  • Storage: $30/month for dataset (vs $0 local storage)
  • Serving: $61/month always-on (vs free local serving)
  • Total: ~$115/month vs $15K workstation purchase

Operational Complexity

  • Network latency: 20ms S3 access vs <1ms local SSD
  • Security: IAM policies vs local file permissions
  • Failure modes: Service dependencies vs single machine reliability

This architecture trades local simplicity for production scalability at the cost of operational complexity and network dependencies.

AWS BILLING WARNING

AWS requires a credit card for account signup. Charges begin upon resource creation.

CRITICAL BILLING SAFETY - IMPLEMENT IMMEDIATELY:

1. Set Billing Alerts

# Set $10 billing alert via AWS CLI
aws budgets create-budget --account-id 123456789012 \
    --budget '{
        "BudgetName": "Monthly-Spend-Alert",
        "BudgetLimit": {"Amount": "10", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST"
    }'

2. Always Terminate Resources

  • Stop instances: Saves compute costs, keeps storage costs
  • Terminate instances: Deletes everything, stops all charges
  • Delete S3 buckets: Ongoing storage charges until deleted
  • Never leave resources running overnight

3. Use Free Tier Eligible Resources Only

  • t2.micro/t3.micro instances (750 hours/month free)
  • 30GB EBS storage free per month
  • 5GB S3 storage free per month
  • RDS t2.micro database (750 hours/month free)

EXPENSIVE MISTAKES TO AVOID:

GPU Instances: p3.2xlarge costs $3.06/hour ($2,200/month if left running)

Data Transfer: Cross-region transfer costs $0.09/GB (expensive for large datasets)

Load Balancers: Application Load Balancer costs $16.20/month base + $0.008 per LCU-hour

Auto Scaling: Can launch dozens of instances automatically during traffic spikes

Real Student Bill Examples:

  • Forgot running p3.8xlarge: $2,400 weekend charge
  • Left 20 instances in Auto Scaling Group: $1,200 monthly bill
  • Accidentally replicated 500GB across regions: $45 transfer charge

PROTECTION CHECKLIST:

  • Billing alerts configured for $10, $25, $50 thresholds
  • AWS CLI/Console set to us-east-1 (cheapest region)
  • Only use instance types explicitly mentioned in assignments
  • Terminate ALL resources after each lab session
  • Monitor billing dashboard weekly during course

When In Doubt: STOP and TERMINATE EVERYTHING

EC2 Instance Configuration

Create a Linux development environment optimized for ML workloads.

Instance Launch Configuration

AMI Selection

  • Navigate to EC2 console → Launch Instance
  • Search “Ubuntu Server 22.04 LTS”
  • Select the official Canonical AMI (example ID: ami-0c02fb55956c7d316; verify the current Ubuntu 22.04 AMI ID for your region)
  • Base Ubuntu installation - will install ML frameworks manually

Instance Type Selection

  • Choose t3.medium (2 vCPUs, 4GB RAM) for cost efficiency
  • Avoid GPU instances for initial setup (p3 costs $3+/hour)
  • Sufficient for small model training and development

Storage Configuration

  • Root volume: 30 GB gp3 SSD (general purpose)
  • No additional EBS volumes needed for demo
  • Enable “Delete on Termination” to avoid storage charges

Network and Security

  • Use default VPC and subnet
  • Create new security group: “ml-development”
  • Allow SSH (port 22) from your IP address only
  • Allow HTTP (port 80) for API endpoint access

Key Pair Authentication

  • Create new key pair: “ml-training-key”
  • Download .pem file (store securely)
  • Required for SSH access to instance

Launch Process Checklist

# Verify instance is running
aws ec2 describe-instances \
    --instance-ids i-1234567890abcdef0

# Connect via SSH
ssh -i ml-training-key.pem \
    ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Check system info
uname -a
python3 --version

Expected Costs

  • t3.medium: $0.0416/hour ($30/month if left running)
  • Storage: 30GB × $0.08/GB/month = $2.40/month
  • Data transfer: First 1GB free, then $0.09/GB

Common Launch Issues

  • Key pair permissions: chmod 400 ml-training-key.pem
  • Security group SSH access restricted to your IP
  • Instance state checks may take 2-3 minutes
  • Base Ubuntu AMI is ~8GB, standard boot time

Verification: Instance reaches “running” state, passes status checks, accepts SSH connections.

Development Environment Setup

Configure the instance for ML development with manual Docker installation.

System Updates and Dependencies

# Connect to instance
ssh -i ml-training-key.pem ubuntu@<instance-ip>

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential development tools
sudo apt install -y \
    git \
    htop \
    tree \
    curl \
    wget \
    unzip

# Verify Python environment
python3 --version
which python3

Docker Installation (Manual)

# Remove any old Docker versions
sudo apt-get remove docker docker-engine docker.io containerd runc

# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Add Docker repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

# Add user to docker group
sudo usermod -aG docker ubuntu
newgrp docker

# Verify Docker installation
docker --version
docker run hello-world

Python Environment Configuration

# Install Python package manager
sudo apt install -y python3-pip python3-venv

# Create virtual environment for ML
python3 -m venv ml-env
source ml-env/bin/activate

# Install ML frameworks and cloud integration packages
pip install \
    torch \
    boto3 \
    pandas \
    scikit-learn \
    matplotlib \
    flask \
    joblib \
    psutil

# Verify PyTorch installation
python -c "import torch; print(torch.__version__)"
python -c "import torch; print(torch.cuda.is_available())"

AWS CLI Configuration

# Install/update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure credentials (use IAM user with S3 permissions)
aws configure
# AWS Access Key ID: [your-access-key]
# AWS Secret Access Key: [your-secret-key]  
# Default region: us-east-1
# Default output format: json

# Test AWS connectivity
aws s3 ls

Environment Verification

  • Docker runs without sudo
  • Virtual environment activated with PyTorch
  • AWS CLI can list S3 buckets
  • All required Python packages installed

Troubleshooting Common Issues: Docker permission errors (restart session), virtual environment activation, AWS credential configuration.

S3 Data Storage Implementation

Create cloud storage for training data and model artifacts.

Create S3 Bucket via AWS Console

  1. Navigate to S3 service in AWS console
  2. Click “Create bucket”
  3. Bucket name: ml-training-{random-suffix} (must be globally unique)
  4. Region: us-east-1 (same as EC2 instance)
  5. Block public access: Keep default (enabled)
  6. Versioning: Disabled for demo
  7. Default encryption: Server-side encryption with S3 managed keys

Bucket Structure

ml-training-demo-12345/
├── data/
│   ├── raw/
│   │   └── iris.csv
│   └── processed/
├── models/
│   └── experiments/
└── logs/
    └── training/

Upload Sample Dataset

# Create sample dataset locally
python3 << EOF
from sklearn.datasets import load_iris
import pandas as pd

# Load iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.to_csv('iris.csv', index=False)
print(f"Created dataset with {len(df)} rows")
EOF

# Upload to S3
aws s3 cp iris.csv s3://ml-training-demo-12345/data/raw/iris.csv

# Verify upload
aws s3 ls s3://ml-training-demo-12345/data/raw/

Test S3 Access from Python

import boto3
import pandas as pd
from io import StringIO

# Initialize S3 client
s3_client = boto3.client('s3')
bucket_name = 'ml-training-demo-12345'

# List bucket contents
response = s3_client.list_objects_v2(Bucket=bucket_name)
for obj in response.get('Contents', []):
    print(f"Object: {obj['Key']}, Size: {obj['Size']} bytes")

# Download data for training
obj = s3_client.get_object(Bucket=bucket_name, Key='data/raw/iris.csv')
data = pd.read_csv(obj['Body'])
print(f"Loaded {len(data)} rows, {len(data.columns)} columns")
print(data.head())

S3 Access Patterns

  • Download: Copy S3 object to local filesystem
  • Stream: Read S3 object directly into memory
  • Upload: Copy local file or memory buffer to S3
  • List: Enumerate objects in bucket prefix

Cost Monitoring

# Check current month S3 costs
aws ce get-cost-and-usage \
    --time-period Start=2025-01-01,End=2025-02-01 \
    --granularity MONTHLY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE

Verification: S3 bucket created, data uploaded successfully, Python can read/write objects, permissions configured correctly.

PyTorch Model Definition

Define neural network architecture for cloud training.

import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import boto3
from io import StringIO, BytesIO
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib
import json
from datetime import datetime

class IrisClassifier(nn.Module):
    def __init__(self, input_size=4, hidden_size=64, num_classes=3):
        super(IrisClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

Architecture Details

  • Input layer: 4 features (sepal/petal length and width)
  • Hidden layers: 64 neurons each with ReLU activation
  • Dropout: 0.2 probability for regularization
  • Output layer: 3 classes (setosa, versicolor, virginica)
  • Parameters: 4→64→64→3 = 4,675 trainable parameters

Model Memory Requirements

  • Model weights: ~18KB (4,675 × 4 bytes per float32)
  • Forward pass: ~512 bytes per sample
  • Gradient storage: ~18KB additional during training
  • Total training memory: ~50KB per model instance
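A quick check of the numbers above from the model object itself (uses the IrisClassifier class defined in the previous block):

# Verify the parameter count quoted above: (4*64+64) + (64*64+64) + (64*3+3) = 4,675
model = IrisClassifier()
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {total_params}")            # 4675
print(f"Weight size: {total_params * 4 / 1024:.1f} KB")   # ~18 KB at float32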

S3 Integration Functions

Handle data loading and model persistence in cloud storage.

def load_data_from_s3(bucket_name, key):
    """Load training data from S3 with error handling"""
    try:
        s3_client = boto3.client('s3')
        print(f"Loading data from s3://{bucket_name}/{key}")
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
        data = pd.read_csv(obj['Body'])
        print(f"Successfully loaded {len(data)} rows, {len(data.columns)} columns")
        return data
    except Exception as e:
        print(f"Error loading data from S3: {str(e)}")
        print(f"Bucket: {bucket_name}, Key: {key}")
        raise

def save_model_to_s3(model, scaler, bucket_name, model_key, scaler_key):
    """Save trained model and scaler to S3"""
    s3_client = boto3.client('s3')
    
    # Save PyTorch model
    model_buffer = BytesIO()
    torch.save(model.state_dict(), model_buffer)
    model_buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=model_key,
        Body=model_buffer.getvalue()
    )
    
    # Save scaler
    scaler_buffer = BytesIO()
    joblib.dump(scaler, scaler_buffer)
    scaler_buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=scaler_key,
        Body=scaler_buffer.getvalue()
    )

S3 Operation Characteristics

  • Data loading: Streams CSV directly from S3 without local disk
  • Model saving: Serializes to memory buffer before S3 upload
  • Error handling: Explicit exception handling for network failures
  • Performance: ~20ms latency per S3 operation from EC2

Training Execution Pipeline

Complete training workflow with cloud data and model persistence.

def train_model():
    # Configuration
    bucket_name = 'ml-training-demo-12345'
    data_key = 'data/raw/iris.csv'
    
    # Load data from S3
    print("Loading data from S3...")
    data = load_data_from_s3(bucket_name, data_key)
    
    # Prepare features and labels
    X = data.drop('target', axis=1).values
    y = data['target'].values
    
    # Split and scale data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Convert to PyTorch tensors
    X_train_tensor = torch.FloatTensor(X_train_scaled)
    y_train_tensor = torch.LongTensor(y_train)
    X_test_tensor = torch.FloatTensor(X_test_scaled)
    y_test_tensor = torch.LongTensor(y_test)
    
    # Initialize model and training
    model = IrisClassifier()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    
    # Training loop with resource monitoring
    print("Starting training...")
    import psutil
    start_time = datetime.now()
    
    model.train()
    for epoch in range(100):
        optimizer.zero_grad()
        outputs = model(X_train_tensor)
        loss = criterion(outputs, y_train_tensor)
        loss.backward()
        optimizer.step()
        
        if (epoch + 1) % 20 == 0:
            memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
            print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}, Memory: {memory_mb:.1f}MB')
    
    # Evaluate model
    model.eval()
    with torch.no_grad():
        test_outputs = model(X_test_tensor)
        _, predicted = torch.max(test_outputs.data, 1)
        accuracy = (predicted == y_test_tensor).sum().item() / len(y_test_tensor)
        print(f'Test Accuracy: {accuracy:.4f}')
    
    # Save to S3
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    model_key = f'models/iris_classifier_{timestamp}.pth'
    scaler_key = f'models/scaler_{timestamp}.pkl'
    
    save_model_to_s3(model, scaler, bucket_name, model_key, scaler_key)
    
    # Training performance summary
    end_time = datetime.now()
    training_duration = (end_time - start_time).total_seconds()
    print(f"Training completed in {training_duration:.1f} seconds")
    print(f"Final accuracy: {accuracy:.4f}")
    print(f"Model saved to S3: {model_key}")
    
    return model, scaler, accuracy, training_duration

# Run training
if __name__ == "__main__":
    model, scaler, accuracy, duration = train_model()

Training Performance Characteristics

  • Data loading: ~20ms from S3 (150 samples, 5 columns)
  • Preprocessing: <1ms (StandardScaler transformation)
  • Training: ~2 seconds (100 epochs, 4,675 parameters)
  • Model saving: ~15ms (18KB model + 2KB scaler to S3)
  • Total pipeline: ~2.1 seconds end-to-end

Expected Output: Training progress logs, final accuracy metrics, confirmation of model artifacts saved to S3.

Flask API Server Implementation

HTTP API server loading models from S3 for inference.

from flask import Flask, request, jsonify
import torch
import boto3
import joblib
from io import BytesIO
import numpy as np

# IrisClassifier must be importable in this file; assuming the training
# script from earlier is saved next to this API as train_model.py:
from train_model import IrisClassifier

app = Flask(__name__)

# Global variables for model and scaler
model = None
scaler = None

def load_model_from_s3(bucket_name, model_key, scaler_key):
    """Load model and scaler from S3"""
    s3_client = boto3.client('s3')
    
    # Load PyTorch model
    model_obj = s3_client.get_object(Bucket=bucket_name, Key=model_key)
    model_buffer = BytesIO(model_obj['Body'].read())
    
    model = IrisClassifier()
    model.load_state_dict(torch.load(model_buffer, map_location='cpu'))
    model.eval()
    
    # Load scaler
    scaler_obj = s3_client.get_object(Bucket=bucket_name, Key=scaler_key)
    scaler_buffer = BytesIO(scaler_obj['Body'].read())
    scaler = joblib.load(scaler_buffer)
    
    return model, scaler

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({
        'status': 'healthy',
        'model_loaded': model is not None
    })

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Parse input data
        data = request.json
        features = np.array(data['features']).reshape(1, -1)
        
        # Scale features
        features_scaled = scaler.transform(features)
        
        # Make prediction
        with torch.no_grad():
            features_tensor = torch.FloatTensor(features_scaled)
            outputs = model(features_tensor)
            probabilities = torch.softmax(outputs, dim=1)
            predicted_class = torch.argmax(outputs, dim=1).item()
            confidence = probabilities[0][predicted_class].item()
        
        # Class names for Iris dataset
        class_names = ['setosa', 'versicolor', 'virginica']
        
        return jsonify({
            'predicted_class': class_names[predicted_class],
            'confidence': float(confidence),
            'probabilities': probabilities[0].tolist()
        })
        
    except Exception as e:
        return jsonify({'error': str(e)}), 400

# Initialize model on startup (replace the keys with those printed by your training run)
bucket_name = 'ml-training-demo-12345'
model_key = 'models/iris_classifier_20250916_143022.pth'
scaler_key = 'models/scaler_20250916_143022.pkl'

print("Loading model from S3...")
model, scaler = load_model_from_s3(bucket_name, model_key, scaler_key)
print("Model loaded successfully!")

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80, debug=True)

API Performance Characteristics

  • Model loading: ~35ms (18KB model + 2KB scaler from S3)
  • Inference latency: ~2ms per request (forward pass only)
  • Memory usage: ~25MB (Flask + PyTorch + loaded model)
  • Throughput: ~100 requests/second (single thread)

API Deployment and Testing

Deploy and validate ML inference API on EC2 instance.

Local API Testing

# Save API code as app.py
# Run Flask application
sudo python3 app.py

# Expected startup output:
Loading model from S3...
Model loaded successfully!
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:80
 * Running on http://10.0.1.100:80

# Test from another terminal
# Health check
curl http://localhost/health

# Expected response:
{
  "status": "healthy",
  "model_loaded": true
}

# Make prediction
curl -X POST http://localhost/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

# Expected response:
{
  "predicted_class": "setosa",
  "confidence": 0.9876,
  "probabilities": [0.9876, 0.0084, 0.0040]
}

Public Internet Access

# Update security group to allow HTTP traffic
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxxx \
    --protocol tcp \
    --port 80 \
    --cidr 0.0.0.0/0

# Test from external machine
curl http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com/health

# Load test with multiple requests
for i in {1..10}; do
  curl -X POST \
    http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com/predict \
    -H "Content-Type: application/json" \
    -d '{"features": [5.1, 3.5, 1.4, 0.2]}' &
done
wait

Error Handling Validation

# Test malformed request
curl -X POST http://localhost/predict \
     -H "Content-Type: application/json" \
     -d '{"invalid": "data"}'

# Expected error response (the message is the raw exception text):
{
  "error": "'features'"
}

# Test wrong feature count
curl -X POST http://localhost/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [5.1, 3.5]}'

# Expected error response (exact wording depends on the scikit-learn version):
{
  "error": "X has 2 features, but StandardScaler is expecting 4 features as input."
}

Performance Verification: API handles 100+ requests/second, <5ms response time, graceful error handling for malformed inputs.
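
A small concurrent load-test sketch for reproducing these numbers, using the third-party requests library; the endpoint URL, worker count, and request count are arbitrary choices:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = 'http://localhost/predict'                      # adjust to your instance
PAYLOAD = {'features': [5.1, 3.5, 1.4, 0.2]}

def one_request(_):
    start = time.perf_counter()
    r = requests.post(URL, json=PAYLOAD, timeout=5)
    return time.perf_counter() - start, r.status_code

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(one_request, range(200)))
elapsed = time.perf_counter() - start

latencies = sorted(t for t, _ in results)
print(f"Throughput: {len(results) / elapsed:.0f} req/s")
print(f"p50 latency: {latencies[len(latencies)//2] * 1000:.1f} ms")
print(f"p95 latency: {latencies[int(len(latencies) * 0.95)] * 1000:.1f} ms")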

System Monitoring Implementation

Monitor system performance and optimize costs for production use.

CloudWatch Integration

import boto3
from datetime import datetime

# Initialize CloudWatch client
cloudwatch = boto3.client('cloudwatch')

def publish_custom_metrics(accuracy, training_time):
    """Publish ML training metrics to CloudWatch"""
    
    # Model accuracy metric
    cloudwatch.put_metric_data(
        Namespace='ML/Training',
        MetricData=[
            {
                'MetricName': 'ModelAccuracy',
                'Value': accuracy,
                'Unit': 'Percent',
                'Dimensions': [
                    {
                        'Name': 'ModelType',
                        'Value': 'IrisClassifier'
                    }
                ]
            },
            {
                'MetricName': 'TrainingDuration',
                'Value': training_time,
                'Unit': 'Seconds',
                'Dimensions': [
                    {
                        'Name': 'InstanceType',
                        'Value': 't3.medium'
                    }
                ]
            }
        ]
    )

# Add to training script
start_time = datetime.now()
# ... training code ...
end_time = datetime.now()
training_duration = (end_time - start_time).total_seconds()

publish_custom_metrics(accuracy * 100, training_duration)

System Monitoring Commands

# Monitor instance performance
htop
iostat -x 1
df -h

# Check Docker resource usage
docker stats

# Monitor network connectivity
ping google.com
# Timing breakdown (requires a local curl-format.txt defining curl timing variables)
curl -w "@curl-format.txt" -o /dev/null -s http://httpbin.org/delay/2

Cost Analysis and Optimization

# Check current AWS costs
aws ce get-cost-and-usage \
    --time-period Start=2025-01-01,End=2025-01-31 \
    --granularity DAILY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE

# EC2 instance costs
aws ec2 describe-instances \
    --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name]' \
    --output table

# Total S3 object size in bytes (for estimating storage cost)
aws s3api list-objects-v2 \
    --bucket ml-training-demo-12345 \
    --query 'sum(Contents[].Size)' \
    --output text

How to Reduce Costs

Instance Management

  • Stop instances when not in use (saves compute costs)
  • Use Spot Instances for training workloads (70% discount)
  • Right-size instances based on actual usage
  • Schedule automatic start/stop with Lambda functions (see the sketch below)
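
A minimal sketch of the scheduled-stop idea in the last bullet: a Lambda handler, triggered on a schedule (for example via an EventBridge rule), that stops running instances tagged AutoStop=true. The tag name and trigger are assumptions for illustration:

import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Find running instances carrying the (assumed) AutoStop=true tag
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:AutoStop', 'Values': ['true']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = [
        instance['InstanceId']
        for reservation in response['Reservations']
        for instance in reservation['Instances']
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {'stopped': instance_ids}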

Storage Optimization

  • Delete intermediate training files after model training
  • Use S3 Lifecycle policies to archive old models (see the sketch after this list)
  • Compress large datasets before uploading
  • Monitor data transfer costs between services
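
One way to implement the lifecycle bullet above: a rule that moves model artifacts to Glacier after 30 days and deletes them after a year. The prefix and day counts are illustrative choices:

import boto3

s3 = boto3.client('s3')

# Transition models/ objects to Glacier after 30 days, expire after 365 days
s3.put_bucket_lifecycle_configuration(
    Bucket='ml-training-demo-12345',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-old-models',
                'Filter': {'Prefix': 'models/'},
                'Status': 'Enabled',
                'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
                'Expiration': {'Days': 365}
            }
        ]
    }
)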

Development Practices

  • Use smaller datasets for development and testing
  • Implement checkpointing to resume interrupted training
  • Clean up failed experiments and temporary files
  • Set up billing alerts for cost overruns

Expected Monthly Costs: t3.medium ($30), S3 storage ($5), data transfer ($10) = ~$45 for continuous operation.

Common Implementation Problems

Identify and resolve typical cloud development problems.

Connection and Access Issues

SSH Connection Failures

# Permission denied (publickey)
chmod 400 ml-training-key.pem
ssh -i ml-training-key.pem ubuntu@instance-ip

# Connection timeout
# Check security group allows SSH from your IP
aws ec2 describe-security-groups \
    --group-ids sg-xxxxxxxxx

# Add your current IP to security group
curl ifconfig.me  # Get your public IP
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxxx \
    --protocol tcp \
    --port 22 \
    --cidr your-ip/32

S3 Access Errors

# NoCredentialsError
aws configure list
aws sts get-caller-identity

# AccessDenied
aws iam get-user
aws s3 ls s3://bucket-name --debug

# Bucket region mismatch
aws s3api get-bucket-location --bucket bucket-name

Docker Issues

# Permission denied
sudo usermod -aG docker ubuntu
newgrp docker

# Docker daemon not running
sudo systemctl start docker
sudo systemctl enable docker

# Out of disk space
df -h
docker system prune -f

Performance and Resource Issues

Memory and CPU Constraints

# Monitor resource usage
free -h
cat /proc/cpuinfo | grep processor | wc -l
htop

# PyTorch out of memory
# Reduce batch size in training code
batch_size = 16  # Instead of 64

Network and Latency Issues

# Slow S3 transfers
# Use multipart upload for large files
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB

# Test network speed
wget -O /dev/null http://speedtest-sfo1.digitalocean.com/10mb.test

# DNS resolution issues
nslookup s3.amazonaws.com

Application Debugging

import logging
logging.basicConfig(level=logging.DEBUG)

# Add extensive error handling
try:
    data = load_data_from_s3(bucket_name, data_key)
except Exception as e:
    print(f"S3 Error: {str(e)}")
    print(f"Bucket: {bucket_name}, Key: {data_key}")
    raise

# Log training progress (GPU memory is only meaningful when CUDA is available)
gpu_bytes = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
print(f"Epoch {epoch}, Loss: {loss.item():.4f}, GPU memory: {gpu_bytes} bytes")

Cost Overrun Prevention

  • Set up billing alerts in AWS console
  • Use AWS Cost Explorer for usage analysis
  • Implement automatic instance shutdown after training
  • Monitor S3 storage growth and implement cleanup policies

Debugging Strategy: Check permissions first, verify network connectivity, monitor resource usage, implement comprehensive logging.

EC2+S3 System Reality

t3.medium Training: 47 Seconds vs 2 Seconds Local

A CPU-only EC2 instance trains roughly 22× slower than a local GPU workstation.

Demo System Configuration

  • EC2 t3.medium: 2 vCPU, 4GB RAM
  • Ubuntu 22.04 with PyTorch installation
  • S3 bucket for dataset and model storage
  • IAM role with S3 read/write permissions

CIFAR-10 ResNet-18 Performance

  • Local RTX 4090: 2.1 seconds/epoch
  • t3.medium CPU: 47 seconds/epoch
  • Slowdown factor: 22×

Training Duration Impact

  • 100-epoch training: 3.5 minutes vs 78 minutes
  • Hyperparameter grid search: Hours vs days
  • Interactive development impossible on EC2 CPU

Why t3.medium Fails for ML

  • No GPU acceleration
  • 4GB RAM limits batch size to 16-32 samples
  • Optimal batch size (256) requires 14GB RAM
  • CPU utilization: 100% but inefficient for tensor operations

GPU Instance Costs

Instance     vCPU   GPU       RAM     Cost/Hour
t3.medium    2      None      4GB     $0.042
p3.2xlarge   8      1×V100    61GB    $3.06
p3.8xlarge   32     4×V100    244GB   $12.24

Cost-Performance Analysis

  • t3.medium: 22× slower than the local RTX 4090, 73× cheaper per hour than p3.2xlarge
  • p3.2xlarge: ~1.2× faster than the local RTX 4090, at 73× the hourly cost of t3.medium

Break-even Usage

  • Cloud GPU cheaper below roughly 90 hours/month of GPU time
  • Above ~90 hours/month: local hardware cheaper (detailed economics later in this unit)
  • Above 100 hours/month sustained: consider spot pricing or reserved instances

Memory Requirements

  • ResNet-50: 8GB minimum
  • BERT-base: 12GB minimum
  • GPT-2 small: 16GB minimum
  • t3.medium cannot load production models

p3.2xlarge costs ~$73/day ($3.06/hour) in continuous operation vs $0 marginal cost for a local GPU after purchase.

S3 Data Loading: 12 Seconds vs 1.4 Seconds Local

Network storage introduces 8× slowdown for dataset loading.

CIFAR-10 Loading Performance (170MB dataset)

  • Local SSD: 1.4 seconds
  • S3 single-thread: 12 seconds
  • S3 multi-thread (8 workers): 4.3 seconds
  • EBS attached volume: 2.8 seconds

Network Latency Impact

  • Local file open: 0.02ms
  • S3 GetObject request: 20-50ms per file
  • Cross-AZ S3 access: +5-10ms latency
  • 1000 small files: 20-50 seconds vs 0.1 seconds local
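
A rough way to observe this gap on your own instance, assuming the demo bucket and the iris CSV from earlier; absolute timings vary with region, instance type, and object size:

import time
import boto3

s3 = boto3.client('s3')
bucket, key = 'ml-training-demo-12345', 'data/raw/iris.csv'

# S3 GetObject round trip (dominated by request latency for small files)
start = time.perf_counter()
body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
print(f"S3 read:    {(time.perf_counter() - start) * 1000:.1f} ms")

# Local file read after caching the object on disk
with open('/tmp/iris.csv', 'wb') as f:
    f.write(body)
start = time.perf_counter()
with open('/tmp/iris.csv', 'rb') as f:
    _ = f.read()
print(f"Local read: {(time.perf_counter() - start) * 1000:.3f} ms")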

Training Pipeline Bottlenecks

# Local development - continuous GPU utilization
for batch in DataLoader(dataset, batch_size=256):
    loss = model(batch)  # GPU busy 98% of time

# S3 streaming - GPU starvation  
for epoch in range(100):
    download_dataset_from_s3()  # 12 second delay
    for batch in cached_dataset:
        loss = model(batch)  # GPU idle during downloads

Checkpoint Saving Delays

  • Local model save: 50ms (instant)
  • S3 model upload: 800ms-2.1 seconds
  • Training interruption risk during uploads

Caching Strategies

EBS Volume Cache

  • Attach 100GB gp3 volume: $8/month
  • One-time dataset download: 12 seconds
  • Subsequent epochs: 2.8 seconds (local EBS speed)
  • Cache 10-20 datasets before cost equals S3

Instance Store (i3.large)

  • 475GB NVMe SSD included
  • Read speed: 1.9GB/s (faster than local)
  • Cost premium: $0.156/hour vs $0.042 t3.medium
  • Data lost on instance stop/termination

Parallel Object Downloads

import concurrent.futures
import os
import boto3

def parallel_download(bucket, prefix, workers=8):
    """Download every object under an S3 prefix with a thread pool."""
    s3 = boto3.client('s3')
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

    def download_one(obj):
        local_path = os.path.join('./data', obj['Key'])
        os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
        s3.download_file(bucket, obj['Key'], local_path)

    with concurrent.futures.ThreadPoolExecutor(workers) as executor:
        list(executor.map(download_one, objects.get('Contents', [])))

Cost of Data Movement

  • First 100GB/month: Free
  • Additional transfer: $0.09/GB
  • 1TB monthly egress: ~$81 after the free 100GB
  • Regional co-location essential

EBS caching reduces loading time to 2.8 seconds but requires manual cache management.

IAM Policy Errors Block S3 Access

Incorrect resource ARNs cause access denied errors.

Common IAM Mistakes

Wrong Resource ARN Format

{
    "Effect": "Allow",
    "Action": "s3:GetObject", 
    "Resource": "arn:aws:s3:::my-bucket"
}

Error: missing /* for object-level access.
Fix: "Resource": "arn:aws:s3:::my-bucket/*"

Missing List Permission

{
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
}

Error: cannot list bucket contents.
Fix: add the s3:ListBucket action scoped to the bucket ARN.

Overly Broad Permissions

{
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": "*"
}

Risk: grants access to every S3 bucket in the account.
Production: never use wildcard permissions.

Minimal Working IAM Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::ml-training-bucket"
        },
        {
            "Effect": "Allow", 
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::ml-training-bucket/*"
        }
    ]
}

Debug IAM Issues

# Test S3 access
aws s3 ls s3://ml-training-bucket --profile demo

# Check effective permissions
aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::account:role/EC2-ML-Role \
    --action-names s3:GetObject \
    --resource-arns arn:aws:s3:::ml-training-bucket/data.csv

CloudTrail for Debugging

  • All S3 API calls logged with timestamps
  • Access denied events show exact error cause
  • Essential for production IAM debugging

IAM permissions require exact ARN matching - bucket vs object permissions commonly confused.

Single Instance: No Fault Tolerance

EC2 instance failure stops all training with no automatic recovery.

Failure Modes

  • Hardware failure: 2-5 minute detection + restart
  • Spot instance interruption: 2-minute warning
  • Software crash: Manual SSH required for diagnosis
  • AZ outage: Complete system unavailability
  • Network partition: Training stops, no automatic retry

Data Loss Scenarios

  • In-memory model state: Lost on any failure
  • /tmp directory: Cleared on restart
  • Training progress: Lost without S3 checkpointing
  • Logs and debugging info: Gone unless CloudWatch configured

Manual Recovery Process

  1. SSH to investigate failure cause (if instance accessible)
  2. Launch replacement instance manually
  3. Restore training environment from scratch
  4. Resume from last S3 checkpoint (if exists)
  5. Restart training job manually

Availability Calculation

  • Single EC2 instance: 99.5% uptime (AWS SLA)
  • Monthly downtime: 3.6 hours expected
  • Training interruption: 2-5 minutes recovery time
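
The downtime figure follows directly from the SLA percentage; a one-line check over a 30-day month:

# 99.5% uptime over a 30-day month (720 hours)
sla = 0.995
hours_per_month = 720
print(f"Expected downtime: {(1 - sla) * hours_per_month:.1f} hours/month")   # 3.6 hours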

High Availability Requirements

Auto Scaling Group

{
    "AutoScalingGroupName": "ml-training-asg",
    "MinSize": 1,
    "MaxSize": 3,
    "DesiredCapacity": 1,
    "HealthCheckType": "EC2",
    "HealthCheckGracePeriod": 300,
    "AvailabilityZones": ["us-east-1a", "us-east-1b"]
}

Application Load Balancer

  • Health check every 30 seconds
  • Automatic traffic routing to healthy instances
  • Multi-AZ deployment for zone failures

Training Job Resilience

import boto3
import torch

s3 = boto3.client('s3')

def checkpoint_training():
    # Save to S3 every epoch; model, optimizer, current_epoch, and
    # current_loss come from the surrounding training loop
    checkpoint = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'epoch': current_epoch,
        'loss': current_loss
    }
    torch.save(checkpoint, '/tmp/checkpoint.pth')
    s3.upload_file('/tmp/checkpoint.pth',
                   'bucket', f'checkpoints/epoch_{current_epoch}.pth')

def resume_training():
    # Resume from the latest S3 checkpoint
    # (find_latest_checkpoint_s3 stands in for a list-and-sort helper)
    latest_checkpoint = find_latest_checkpoint_s3()
    checkpoint = torch.load(latest_checkpoint)
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])
    return checkpoint['epoch']

Cost of High Availability

  • Single instance: $30/month
  • HA setup: $90/month (3× cost)
  • Load balancer: +$16/month
  • Total HA cost: $106/month vs $30 basic

Production systems require 3-5× cost increase for fault tolerance and automatic recovery.

Lambda Cold Starts: 2-8 Second Delays

Serverless model serving faces initialization delays absent in always-on systems.

Cold Start Performance

import torch
import boto3

# Module-level code runs once per container; this is the cold start:
# 1. Download model from S3 (1-4 seconds)
# 2. Load PyTorch model (0.5-2 seconds)
s3 = boto3.client('s3')
s3.download_file('bucket', 'model.pth', '/tmp/model.pth')
model = torch.load('/tmp/model.pth', map_location='cpu')
model.eval()

def lambda_handler(event, context):
    # Warm invocations reuse the cached model: inference only (10-50ms)
    with torch.no_grad():
        prediction = model(torch.tensor(event['input'], dtype=torch.float32))
    return {'prediction': prediction.tolist()}

Timing Breakdown

  • Container initialization: 200-500ms
  • Python runtime startup: 300-800ms
  • PyTorch import: 1-2 seconds
  • S3 model download: 1-4 seconds (depends on size)
  • Model loading: 0.5-2 seconds
  • Total cold start: 2-8 seconds

Warm Request Performance

  • Model cached in memory: 10-50ms response
  • No S3 download or model loading
  • Warm container reused for ~15 minutes

Always-On EC2 Alternative

from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

# Load model once at startup (not per request)
print("Loading model...")  # 2-4 seconds one-time
model = torch.load('model.pth', map_location='cpu')
model.eval()
print("Model ready")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    
    # No cold start - model already loaded
    with torch.no_grad():
        prediction = model(torch.tensor(data['features']))
    
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)

Performance Comparison

Approach       Cold Start    Warm Latency   Cost (1M req/month)
Lambda         2-8 seconds   15-50ms        $200
EC2 t3.micro   0ms           20-100ms       $350
EC2 c5.large   0ms           5-20ms         $720

When Lambda Makes Sense

  • Sporadic traffic (< 1000 requests/day)
  • Cost optimization priority
  • Can tolerate cold starts
  • Model size < 250MB

When EC2 Required

  • Consistent low latency needed
  • Large models (> 250MB)
  • High request volume (> 10,000/day)
  • Always-on user expectations

Serverless introduces 2-8 second initialization penalty vs 0ms for persistent servers.

Manual Instance Management vs Auto Scaling

Demo system requires manual start/stop vs production auto-scaling complexity.

Manual Operations

# Start training job
aws ec2 start-instances --instance-ids i-1234567890abcdef0

# SSH and run training
ssh -i key.pem ubuntu@instance-ip
python train_model.py

# Check progress manually
tail -f training.log

# Stop instance when done
aws ec2 stop-instances --instance-ids i-1234567890abcdef0

Manual Process Problems

  • Forget to stop instances → $73/day cost
  • Instance launch failures require restart
  • No automatic scaling for load changes
  • SSH access required for all operations
  • Training interruption if connection lost

Development Workflow

  • Start instance: 45 seconds boot time
  • Install dependencies: 2-5 minutes first time
  • Run training: Variable duration
  • Manual monitoring required
  • Manual termination after completion

Cost Control Issues (p3.2xlarge at $3.06/hour)

  • Forgotten for a day: ~$73 in unexpected cost
  • Forgotten over a weekend: ~$146
  • Instance type mistakes: launching p3.16xlarge ($24/hour) instead of the intended t3.medium

Auto Scaling Production Setup

# CloudFormation template
Resources:
  MLAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 0
      MaxSize: 10
      DesiredCapacity: 1
      LaunchTemplate:
        LaunchTemplateId: !Ref MLLaunchTemplate
        Version: !GetAtt MLLaunchTemplate.LatestVersionNumber
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2

  MLLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate  
    Properties:
      LaunchTemplateData:
        ImageId: ami-0c02fb55956c7d316
        InstanceType: p3.2xlarge
        IamInstanceProfile:
          Arn: !GetAtt MLInstanceProfile.Arn
        UserData:
          Fn::Base64: |
            #!/bin/bash
            aws s3 cp s3://ml-bucket/train.py /home/ubuntu/
            cd /home/ubuntu && python3 train.py
            shutdown -h now  # Auto-terminate when done

Auto Scaling Benefits

  • Automatic instance replacement on failure
  • Scale based on queue depth or metrics (see the policy sketch after this list)
  • No manual intervention required
  • Automatic cost optimization (scale to zero)
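
A minimal sketch of the metric-based scaling mentioned above: a boto3 call attaching a target-tracking policy to the demo Auto Scaling group. The policy name and 70% CPU target are illustrative choices, not values from the demo:

import boto3

autoscaling = boto3.client('autoscaling')

# Keep the group's average CPU utilization near 70%; the ASG adds or
# removes instances automatically to hold the target.
autoscaling.put_scaling_policy(
    AutoScalingGroupName='ml-training-asg',
    PolicyName='cpu-target-70',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ASGAverageCPUUtilization'
        },
        'TargetValue': 70.0
    }
)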

Production Complexity

  • Infrastructure as code required
  • Health checks and monitoring setup
  • Load balancer configuration
  • Service discovery for distributed training
  • Setup effort: 5-10× the time of the manual approach

Production auto-scaling requires infrastructure complexity but eliminates manual operations and cost overruns.

100× Operational Overhead: Development vs Production

Production deployment multiplies operational requirements by 100×.

Demo System Operations

Weekly Effort: 1-2 Hours

  • Launch instance when needed
  • SSH and start training
  • Check CloudWatch logs for errors
  • Download results from S3
  • Stop instance manually

Tools Required

  • AWS CLI for instance management
  • SSH client for remote access
  • Basic S3 commands for data transfer
  • CloudWatch console for log viewing

Failure Recovery

  • Restart failed instances manually
  • Re-run training from beginning
  • Debug via SSH and log inspection
  • No monitoring or alerting

Security Model

  • Single IAM role with broad permissions
  • Default VPC with basic security groups
  • No encryption or compliance considerations
  • Developer access keys with full privileges

Cost Management

  • Manual instance start/stop
  • Basic billing alerts at account level
  • No cost allocation or project tracking

Production System Requirements

Weekly Effort: 15-20 Hours

  • Infrastructure monitoring and maintenance
  • Security patch management
  • Cost optimization analysis
  • Performance tuning and debugging
  • Incident response and resolution

Enterprise Operations Stack

# Infrastructure as Code
terraform plan && terraform apply

# Monitoring and Alerting
kubectl apply -f prometheus-config.yaml
aws cloudwatch put-metric-alarm --alarm-name "High-GPU-Usage"   # plus metric, threshold, and action flags

# Security Compliance
aws configservice start-configuration-recorder --configuration-recorder-name default
aws guardduty create-detector --enable

# Cost Management
aws budgets create-budget \
    --account-id $(aws sts get-caller-identity --query Account --output text) \
    --budget file://ml-budget.json
aws ce get-cost-and-usage --time-period Start=2025-01-01,End=2025-02-01 \
    --granularity MONTHLY --metrics BlendedCost

Production Requirements

  • 24/7 monitoring and alerting
  • Automated backup and disaster recovery
  • Multi-region deployment for availability
  • Role-based access control (RBAC)
  • Encryption at rest and in transit
  • Compliance auditing and reporting
  • Load testing and capacity planning
  • A/B testing and gradual rollouts

Team Structure

  • DevOps engineer: Infrastructure management
  • SRE: Monitoring and incident response
  • Security engineer: Compliance and auditing
  • ML engineer: Model development and optimization

Production ML systems require dedicated operations team vs single developer for demo system.

p3.2xlarge Economics: $2,196 Monthly vs $400 Local GPU

GPU instance costs exceed local workstation after 3 weeks continuous operation.

Cost Comparison Analysis

AWS p3.2xlarge (1× NVIDIA V100)

  • Hourly rate: $3.06
  • Monthly (24/7): $2,196
  • Yearly (24/7): $26,356
  • Performance: 1.2× local RTX 4090

Local RTX 4090 Workstation

  • Hardware cost: $4,500 (GPU + system)
  • Electricity: $150/month (600W × 24/7)
  • Total monthly: $150 + depreciation
  • 3-year amortization: $125/month hardware
  • Total local cost: $275/month

Break-even Analysis

  • Monthly crossover: 90 hours GPU usage
  • p3.2xlarge profitable: <90 hours/month
  • Local profitable: >90 hours/month
  • Daily break-even: 3 hours/day
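
The 90-hour crossover is the local workstation's fixed monthly cost divided by the hourly cloud rate:

local_monthly = 275.0      # $/month: amortized hardware + electricity
on_demand = 3.06           # $/hour: p3.2xlarge on-demand
spot = 0.92                # $/hour: typical p3.2xlarge spot price

print(f"On-demand break-even: {local_monthly / on_demand:.0f} hours/month")   # ~90
print(f"Spot break-even:      {local_monthly / spot:.0f} hours/month")        # ~300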

Spot Instance Pricing

  • Spot discount: 70% typical
  • p3.2xlarge spot: $0.92/hour average
  • Monthly spot (24/7): $659
  • Break-even vs local: 300 hours/month

Usage Pattern Economics

Intermittent Research (20 hours/month)

  • p3.2xlarge on-demand: $61.20
  • Local alternative: $275 (hardware + electricity)
  • Cloud savings: $213.80/month

Heavy Development (200 hours/month)

  • p3.2xlarge on-demand: $612
  • p3.2xlarge spot: $184
  • Local alternative: $275
  • Local saves $337/month vs on-demand; spot pricing ($184) still undercuts local at this usage

Continuous Production (720 hours/month)

  • p3.2xlarge on-demand: $2,196
  • p3.2xlarge spot: $659
  • p3.2xlarge reserved (1-year, ~40% discount): $1,317
  • Local alternative: $275
  • Local savings: $384-1,921/month

GPU Performance Comparison

  • RTX 4090: 83 TFLOPS (FP16)
  • Tesla V100: 125 TFLOPS (FP16)
  • V100 memory: 16GB HBM2
  • RTX 4090 memory: 24GB GDDR6X
  • V100 advantage: 50% compute, 33% less memory

Reserved Instance Strategy

  • 1-year commitment: 40% discount
  • 3-year commitment: 60% discount
  • Requires accurate usage forecasting
  • No flexibility for changing requirements

GPU instances cost-effective below 90 hours/month; above this threshold local hardware provides 60-85% savings.

When EC2+S3 Architecture Fails

Specific technical constraints where demonstrated architecture becomes inadequate.

Memory Constraints

  • GPT-2 medium (774M parameters): 3.1GB model weights
  • GPT-3.5 equivalent: ~13GB model weights
  • Llama-2 70B: 140GB model weights
  • Single r5.24xlarge instance: 768GB RAM maximum
  • Solution: Multi-instance model parallelism

Training Scale Limits

  • Single p3.2xlarge: 1 GPU, 61GB RAM
  • ImageNet training: Acceptable (24 hours)
  • GPT-3 scale training: 1024+ GPUs required
  • EC2 constraint: Manual cluster management
  • Solution: EKS or SageMaker managed training

Request Rate Bottlenecks

  • Single EC2 instance: ~1000 requests/second maximum
  • Load balancer + Auto Scaling: ~10,000 requests/second
  • Global scale: 100,000+ requests/second required
  • Bottleneck: Database and backend services
  • Solution: Microservices + CDN architecture

Data Processing Limits

  • S3 throughput: 5,500 requests/second per prefix
  • Large training job: 1000 GPUs × 10 requests/second = 10,000 req/s
  • S3 constraint: Request rate exceeds limits
  • Solution: Data sharding across prefixes or local caching
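
A common workaround for the per-prefix request limit (the last item above): spread objects across many key prefixes, for example by hashing the file name, so aggregate request rate scales with the number of prefixes. The shard count and key layout are illustrative:

import hashlib

def sharded_key(filename, num_shards=64):
    """Prepend a hash-based shard prefix so reads spread across S3 prefixes."""
    shard = int(hashlib.md5(filename.encode()).hexdigest(), 16) % num_shards
    return f"shard-{shard:02d}/{filename}"

# 64 prefixes × 5,500 GET/s each ≈ 350,000 GET/s aggregate capacity
print(sharded_key('train/batch_000123.tfrecord'))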

Alternative Architecture Patterns

Kubernetes + GPU Operators

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-cluster
spec:
  replicas: 16
  selector:
    matchLabels:
      app: pytorch-training
  template:
    metadata:
      labels:
        app: pytorch-training
    spec:
      containers:
      - name: pytorch-training
        image: pytorch/pytorch:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 61Gi

Managed ML Services

  • SageMaker Training: Automatic cluster management
  • SageMaker Endpoints: Auto-scaling inference
  • Cost: 20-30% premium vs EC2, but operational savings
  • Suitable for teams >5 people

Serverless Data Processing

  • AWS Batch: Managed job queues
  • Step Functions: Workflow orchestration
  • Lambda: Event-driven preprocessing
  • Cost-effective for intermittent workloads

When to Migrate from EC2+S3

  • Training jobs require >4 GPUs simultaneously
  • Inference SLA requires <50ms latency globally
  • Team >10 people need shared infrastructure
  • Compliance requires advanced security controls
  • Cost optimization needs automated resource management

Migration Triggers

  • Manual scaling becomes operational bottleneck
  • Security requirements exceed basic IAM policies
  • Multi-region deployment needed for latency
  • Training coordination requires job scheduling
  • Model serving needs A/B testing capabilities

EC2+S3 architecture optimal for single-developer ML projects; enterprise scale requires orchestration platforms and managed services.