EE 547 - Unit 3
Fall 2025
Development environments constrain ML system capabilities.
Typical Development Setup (2024)
Where constraints bind:
Real ML systems require infrastructure that scales beyond individual machines.
Production workloads exceed development capabilities by orders of magnitude.
Large Language Model Training
Production Model Serving
Real-Time Processing Pipelines
Production workloads require entirely different infrastructure architectures, not scaled development setups.
Concrete example of where local development assumptions break.
Development Phase (5 engineers, MacBook Pros)
Production Requirements
Failure Points
Local development assumptions break at cloud scale: datasets exceed memory, training times become prohibitive, single-point failures affect global users.
Datacenters achieve cost efficiencies impossible for individual organizations.
Individual Company Infrastructure
Utilization: 20-30% average utilization with 100% fixed costs → 70-80% resource waste
Example Startup ML Training
Hyperscaler Infrastructure (AWS, Google, Microsoft)
Result: Rent exactly required resources when needed.
Same Startup with Cloud
Economics: Hyperscalers achieve 10-20x cost efficiency through scale, specialization, and resource pooling.
Cloud providers abstract physical complexity into consumable services.
Physical Infrastructure Layer
Virtualization Layer
Service Layer
Application Layer
Each layer abstracts thousands of operational details. Application development consumes services without managing underlying infrastructure.
Competition between AWS, Google Cloud, and Microsoft Azure drives innovation and price reductions.
Market Share and Positioning (2024)
Provider | Market Share | Strengths | ML Focus |
---|---|---|---|
AWS | 32% | Service breadth, enterprise adoption | SageMaker, comprehensive ML tools |
Microsoft Azure | 23% | Enterprise integration, hybrid cloud | Azure ML, enterprise AI |
Google Cloud | 11% | ML/AI innovation, data analytics | Vertex AI, TensorFlow integration |
Others | 34% | Specialized services, regional players | Various |
Competitive Pressures
Competition Results: 75% price reduction over 10 years, specialized ML hardware, new capabilities quarterly, and multiple viable providers that reduce the risk of vendor lock-in.
Cloud fundamentally changes IT spending from capital investment to operational expense.
Traditional Model: Capital Expenditure
Upfront Investment Requirements
Financial Characteristics
Example: Startup Scaling Challenge
Cloud Model: Operating Expenditure
Pay-as-you-go Model
Financial Characteristics
Same Startup Example with Cloud
OpEx model aligns IT costs with business growth, reducing financial risk and enabling rapid experimentation.
Cloud providers maintain resource pools orders of magnitude larger than individual user needs.
AWS Global Infrastructure (2024)
Practical Implications
Large Model Training Example
Cloud resources appear unlimited because total provider capacity exceeds individual user needs by orders of magnitude. This enables entirely new categories of ML experiments and applications.
Cloud computing provides massive capabilities while introducing operational complexity.
Cloud Computing Capabilities
Massive Scalability
Cost Efficiency
Operational Simplicity
Innovation Access
Global Reach
Required Complexity Management
New Technical Skills
Architecture Changes
Operational Overhead
Vendor Management
Cloud computing provides extraordinary capabilities, but success requires learning new concepts and managing operational complexity. Benefits outweigh costs for most production ML applications.
Cloud programming assumes distributed services rather than single-machine execution.
Local Development Model
# Single-machine assumptions
import torch
import pandas as pd
# Load data (assumes local files)
data = pd.read_csv('dataset.csv')
# Train model (uses local GPU/CPU)
model = train_model(data)
# Save result (local filesystem)
torch.save(model, 'model.pth')
# Serve predictions (single process)
app.run(host='localhost', port=5000)
Assumptions
Cloud-Native Development Model
# Distributed service assumptions
import boto3
s3 = boto3.client('s3')
# Load data (from cloud storage)
s3.download_file('bucket', 'dataset.csv', '/tmp/data.csv')
# Train model (on cloud compute)
# ec2_instance and lambda_function are illustrative placeholders, not boto3 objects
ec2_instance.run_training_job(
    data_location='s3://bucket/dataset.csv'
)
# Save result (to cloud storage)
s3.upload_file('model.pth', 'bucket', 'models/v1.pth')
# Serve predictions (managed service)
lambda_function.deploy(
    model_path='s3://bucket/models/v1.pth'
)
New Assumptions
Cloud development requires designing for network latency, service failures, and distributed data flows.
Distributed systems introduce complexity not present in local development.
Network Reliability Constraints
Security Requirements Everywhere
Usage-Based Cost Model
Distributed Debugging Complexity
Why This Complexity Exists
Complexity results from solving problems that do not exist in local development:
Cloud development trades local simplicity for global scale and distributed system capabilities.
Cloud services run on geographically distributed datacenters with specific failure and latency characteristics.
AWS Regions (33 worldwide as of 2024)
Definition: Isolated geographic areas containing multiple datacenters
Region Characteristics
Availability Zones per Region (2-6 AZs)
Infrastructure Hierarchy
Global Infrastructure
├── AWS Regions (33)
│ ├── us-east-1 (Virginia)
│ │ ├── us-east-1a (AZ)
│ │ ├── us-east-1b (AZ)
│ │ ├── us-east-1c (AZ)
│ │ ├── us-east-1d (AZ)
│ │ ├── us-east-1e (AZ)
│ │ └── us-east-1f (AZ)
│ ├── us-west-2 (Oregon)
│ │ ├── us-west-2a (AZ)
│ │ ├── us-west-2b (AZ)
│ │ ├── us-west-2c (AZ)
│ │ └── us-west-2d (AZ)
│ ├── eu-west-1 (Ireland)
│ │ ├── eu-west-1a (AZ)
│ │ ├── eu-west-1b (AZ)
│ │ └── eu-west-1c (AZ)
│ ├── ap-southeast-1 (Singapore)
│ │ ├── ap-southeast-1a (AZ)
│ │ ├── ap-southeast-1b (AZ)
│ │ └── ap-southeast-1c (AZ)
│ └── ... (29 more regions)
└── Edge Locations (450+)
├── CloudFront CDN
└── Global Content Delivery
ML System Design Implications
Data Residency Constraints
Multi-AZ Architecture for Availability
Cost vs Latency Trade-offs
ML systems must account for region selection based on data residency, user latency, service availability, and cost constraints.
Cross-AZ data transfer costs create trade-offs between cost and availability for large ML datasets.
ImageNet Training Cost Impact (1.3TB dataset)
Same AZ Placement
Cross-AZ Placement
The Trade-off
Production Architecture Decisions
Training Workloads
Inference Services
Cross-Region Costs
Cross-AZ data transfer at $0.01/GB makes dataset placement a key decision for large-scale ML training.
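The arithmetic behind this trade-off can be sketched in a few lines. This is an illustration only: the function name and parameters are hypothetical, the price is the $0.01/GB figure quoted above, 1.3TB is treated as 1,300GB, and the 90-pass case assumes every pass re-reads the data across the AZ boundary with no local caching.

```python
def cross_az_transfer_cost(dataset_gb, passes, price_per_gb=0.01):
    """USD cost of reading a dataset across an AZ boundary `passes` times."""
    return dataset_gb * passes * price_per_gb

# One full read of the 1.3TB dataset versus a hypothetical 90 uncached passes
single_pass = cross_az_transfer_cost(1300, passes=1)
many_passes = cross_az_transfer_cost(1300, passes=90)
print(f"one pass: ${single_pass:,.2f}  90 passes: ${many_passes:,.2f}")
```

Copying the dataset once to storage in the training AZ turns the repeated charge into a one-time charge, which is why placement matters more as the number of passes grows.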
Distributed systems replace instant local operations with network requests.
Local Development Assumptions
Network Operation Reality
ML Training Pipeline Impact
Network operations introduce 100-2000× latency increase over local operations, requiring different software design patterns.
Distributed systems fail differently than single machines.
Single Machine Failure Model
Recovery: Restart entire system, reload from disk
Distributed System Failure Model
ML Training Example
8-GPU distributed training job:
Error Handling Complexity
# Local development - simple error handling
try:
    data = load_training_data('dataset.csv')
    model = train_model(data)
    save_model(model, 'model.pth')
except Exception as e:
    print("Training failed, restart from beginning")
# Distributed training - complex error handling
# (helper functions and exception types below are illustrative)
try:
    nodes = discover_healthy_training_nodes()
    if len(nodes) < MIN_NODES:
        wait_for_node_recovery()
    checkpoint = load_latest_checkpoint_if_exists()
    model = train_distributed(data, nodes, checkpoint)
except NodeFailure as e:
    # Continue with remaining nodes or wait for replacement
    handle_node_failure(e.failed_node)
except NetworkPartition as e:
    # Pause training until partition heals
    wait_for_network_recovery()
except ServiceDegradation as e:
    # Retry with exponential backoff
    retry_with_backoff(e.failing_service)
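The "retry with exponential backoff" step can be made concrete with a short sketch. The helper name `call_with_backoff`, its signature, and its default delays are assumptions for illustration; the lecture does not define them. The jitter variant shown here (sleeping a random fraction of the current delay) is one common choice among several.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random fraction of the delay so that many clients
            # recovering from the same outage do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable only so the behavior can be exercised without real waiting; production code would normally also restrict the `except` clause to errors known to be transient.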
Failure Probability Math
Distributed systems require application logic to handle partial failures that never occur in single-machine development.
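One way to see why partial failures dominate at scale is the standard independence calculation: if each node fails with probability p during a job, the chance that at least one of n nodes fails is 1 - (1 - p)^n. The sketch below assumes independent failures and uses a made-up per-node probability of 0.1% purely for illustration.

```python
def p_any_failure(p_node, n_nodes):
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** n_nodes

# Hypothetical 0.1% per-node failure probability over one training job
for n in (1, 8, 64, 512):
    print(f"{n:4d} nodes -> P(at least one failure) = {p_any_failure(0.001, n):.4f}")
```

Real failures are often correlated (shared racks, shared network paths), so this should be read as a rough intuition rather than a reliability model.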
A simple API masks a complex distributed storage system.
What You See: Simple File Operations
import boto3
s3 = boto3.client('s3')
# Appears like local file system
s3.put_object(Bucket='my-bucket', Key='data.csv', Body=data)
s3.get_object(Bucket='my-bucket', Key='data.csv')
s3.delete_object(Bucket='my-bucket', Key='data.csv')
What AWS Implements Behind the Scenes
Data Replication
Consistency Management
Failure Recovery
Complexity You Don’t Handle
# What you would need to implement manually:
# 1. Distributed consensus protocol
# 2. Failure detection and recovery
# 3. Data partitioning and replication
# 4. Consistent hashing for load distribution
# 5. Network protocol for reliable transfer
# 6. Monitoring and alerting systems
# 7. Hardware provisioning and maintenance
Engineering Cost Avoided
vs S3 Cost: $23/TB/month for most workloads
Development Time Savings
S3 provides distributed storage reliability without requiring distributed systems expertise.
Automatic traffic distribution across multiple servers.
Manual Load Distribution Problems
Single Server Bottleneck
Adding Servers Manually
# Deploy model to 3 servers
server1: ec2-1-2-3-4.compute-1.amazonaws.com
server2: ec2-1-2-3-5.compute-1.amazonaws.com
server3: ec2-1-2-3-6.compute-1.amazonaws.com
# Client must choose which server to call
if server1_healthy:
    call server1
elif server2_healthy:
    call server2
else:
    call server3
Problems:
Application Load Balancer Solution
import requests
# Single endpoint for clients
API_ENDPOINT = "https://my-api.elb.amazonaws.com/predict"
# Load balancer handles distribution automatically:
# 1. Health checks servers every 30 seconds
# 2. Routes requests to healthy instances only
# 3. Distributes load evenly across instances
# 4. Automatically adds/removes instances
response = requests.post(API_ENDPOINT, json=data)
Complexity Abstracted
Performance Results
Client sees single reliable endpoint instead of managing multiple servers.
Load balancers provide high availability and scalability without client-side complexity.
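To make the routing behavior concrete, here is a minimal in-process sketch of round-robin selection that skips unhealthy servers. It is a toy model, not how an Application Load Balancer is implemented; the class and method names are hypothetical, and health status is set by hand rather than by periodic health checks.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Rotate through servers, returning only those currently marked healthy."""

    def __init__(self, servers):
        self.healthy = {server: True for server in servers}
        self._order = cycle(servers)
        self._count = len(servers)

    def mark(self, server, healthy):
        # In a real balancer this would be driven by periodic health checks.
        self.healthy[server] = healthy

    def next_server(self):
        # Look at each server at most once per call before giving up.
        for _ in range(self._count):
            server = next(self._order)
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy servers available")

balancer = RoundRobinBalancer(["server1", "server2", "server3"])
balancer.mark("server2", False)
print([balancer.next_server() for _ in range(4)])
```

Moving this logic out of every client and into one managed component is the design choice the slide describes: clients keep a single endpoint while the server set changes behind it.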
Cloud providers abstract physical infrastructure into consumable services.
Traditional Infrastructure Model
Constraints:
Cloud Service Model
Advantages:
Core Cloud Service Categories:
Each category solves specific scaling problems that local infrastructure cannot handle cost-effectively.
Compute services provide processing power without hardware ownership.
Virtual Machines (EC2)
Containers (ECS/EKS)
Serverless Functions (Lambda)
Service selection depends on control requirements, scaling patterns, and operational complexity tolerance.
EC2 instances are virtual computers running on AWS physical hardware.
What is an EC2 Instance?
Physical to Virtual Mapping
Instance Lifecycle
EC2 provides the illusion of dedicated hardware while efficiently sharing physical resources among multiple users.
Four key decisions define every EC2 instance configuration.
1. Amazon Machine Image (AMI)
2. Instance Type
3. Storage Configuration
4. Network and Security Settings
Configuration Examples:
Configuration Impact on Cost:
Each configuration choice affects functionality, performance, and monthly costs.
Amazon Machine Images provide the foundation software for EC2 instances.
What AMIs Contain
AMI Categories
Deep Learning AMI Features
AMI Selection Impact
AMI choice significantly impacts development velocity, operational overhead, and ongoing maintenance requirements.
EC2 provides hundreds of instance configurations optimized for different workload patterns.
General Purpose Instances (t3, m5, m6i)
Compute Optimized (c5, c6i)
Memory Optimized (r5, x1e)
Storage Optimized (i3, i4i)
Instance selection balances CPU performance, memory capacity, storage speed, and hourly cost based on workload requirements.
GPU instances provide parallel processing power for ML training and inference.
GPU Instance Families
p4d Instances: Latest ML Training
p3 Instances: General ML Workloads
g4 Instances: ML Inference
Current Limitations:
GPU selection depends on model size, training duration, and budget constraints. Latest hardware provides better performance-per-dollar for large-scale training.
AMIs provide pre-built operating system and software configurations.
Base Operating System Images
Deep Learning AMIs
Custom AMIs
AMI Selection Strategy:
AMI selection significantly impacts instance launch time, configuration complexity, and ongoing maintenance requirements.
Key pairs provide secure authentication for connecting to EC2 instances without passwords.
AWS Key Pair Integration
Key Pair Management
Access Patterns
Key pairs cannot be changed through the EC2 API once an instance has launched - losing your private key requires instance replacement or complex recovery procedures.
Security groups act as virtual firewalls controlling inbound and outbound traffic to EC2 instances.
Inbound Rules (Traffic TO Your Instance)
Outbound Rules (Traffic FROM Your Instance)
Source and Destination Options
Security Group Strategy:
Security groups require explicit configuration for each network service your ML application needs to access or provide.
Cloud storage services provide durability, scalability, and global accessibility.
Object Storage (S3)
Block Storage (EBS)
File Systems (EFS)
Database Services (RDS, DynamoDB)
Storage service selection depends on access patterns, performance requirements, durability needs, and cost constraints.
Cloud storage abstracts physical disks into managed services with different access patterns.
Traditional Storage Model
Cloud Storage Model
Key Cloud Storage Concepts
Durability: How likely data survives hardware failures
Consistency: When all copies reflect the same data
Access Patterns: How applications read and write data
Storage Service Categories
Block Storage (EBS)
Object Storage (S3)
File Storage (EFS)
Database Storage (RDS)
Storage Selection Criteria: Access frequency, performance requirements, sharing needs, backup/recovery, and cost sensitivity.
S3 appears simple but involves significant operational complexity.
Why S3 Isn’t “Just File Storage”
Global Namespace and Regions
Access Control Complexity
Consistency and Performance Models
Storage Classes and Cost Optimization
S3 operational complexity includes regional data placement, access control management, performance optimization, and cost management across multiple storage classes.
Cloud networking enables secure, scalable communication between services.
Virtual Private Cloud (VPC)
Load Balancers
Content Delivery Network (CloudFront)
DNS and Service Discovery
Networking services reduce latency, improve reliability, and provide security for distributed applications across global infrastructure.
Serverless computing executes code without server management or capacity planning.
Traditional Server-Based Model
Serverless Execution Model
Key Serverless Concepts
Function as a Service (FaaS): Code runs as stateless functions
Event-Driven Architecture: Functions triggered by events
Cold Starts: Initialization delay for new function instances
Serverless Service Categories
Compute Functions (Lambda)
API Management (API Gateway)
Database Services (DynamoDB)
Storage and Messaging
Development and Deployment
Serverless Trade-offs: No server management vs execution time limits, automatic scaling vs cold starts, pay-per-use vs potentially higher costs at scale.
Lambda is a specific implementation of serverless computing, with constraints that matter for ML workloads.
Lambda Execution Model
Lambda Limitations for ML Workloads
Suitable ML Use Cases
Not Suitable for:
Lambda provides cost-effective serverless computing for event-driven ML tasks but has significant constraints for large-scale model operations.
Cloud services connect through APIs, events, and data flows.
Request-Response Pattern
Event-Driven Pattern
Data Pipeline Pattern
Shared Storage Pattern
Integration pattern selection depends on latency requirements, failure tolerance, and operational complexity constraints.
Lambda 10GB memory limit prevents large model deployment.
Lambda Memory Constraint
Large Language Models
Result: Lambda cannot load models that approach or exceed its 10GB memory limit
Cold Start Penalty
Models >250MB face initialization delays:
EC2 Memory Capacity
Instance Memory Range
Model Deployment Examples
Memory vs Cost Trade-off
EC2 supports any practical model size with appropriate instance selection.
Memory requirements determine compute service viability before performance or cost considerations.
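A rough feasibility check follows from the weight count alone: memory for weights is parameters times bytes per parameter. The sketch below uses the 10GB Lambda limit stated in this section; the function names and the 20% runtime overhead factor are assumptions for illustration, not measured values.

```python
LAMBDA_MAX_MEMORY_GB = 10  # Lambda memory limit cited in this section

def model_memory_gb(n_params, bytes_per_param=4):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

def fits_in_lambda(n_params, bytes_per_param=4, overhead=1.2):
    """True if weights plus an assumed 20% runtime overhead fit in Lambda."""
    return model_memory_gb(n_params, bytes_per_param) * overhead <= LAMBDA_MAX_MEMORY_GB

# 110M parameters in fp32 versus 7B parameters in fp16
print(model_memory_gb(110e6), fits_in_lambda(110e6))
print(model_memory_gb(7e9, bytes_per_param=2), fits_in_lambda(7e9, bytes_per_param=2))
```

Activations, framework buffers, and the Python runtime all add to the weight footprint, so the real threshold sits below the nominal limit.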
Lambda 15-minute timeout eliminates ML training.
Lambda Execution Limits
Typical ML Training Duration
Small Models (ImageNet Classification)
Large Models (Language Models)
Fine-tuning Duration
EC2 Training Capability
Unlimited Execution Time
Training Cost Examples
ResNet-50 on p3.2xlarge ($3.06/hour)
GPT-2 Small on p3.8xlarge ($12.24/hour)
BERT Base on p3.16xlarge ($24.48/hour)
Spot Instance Savings
15-minute execution limit makes Lambda unsuitable for any ML training workload.
S3 request rate limits constrain high-throughput workloads.
S3 Request Rate Limits
Per-Prefix Limits
Distributed Training Impact
100-GPU Training Job
1000-GPU Training Job
Request Hotspotting
EBS IOPS Limitations
Volume Type Performance
EBS Type | Max IOPS | Max Throughput | Cost/Month (100GB) |
---|---|---|---|
gp3 | 16,000 | 1,000 MB/s | $8.00 |
io2 | 64,000 | 1,000 MB/s | $65.00 |
gp2 | 16,000 | 250 MB/s | $10.00 |
Database Workload Impact
PostgreSQL with 1M records/second inserts
Machine Learning Dataset Loading
Multi-Instance Sharing
Storage performance limits determine data access patterns and training architecture.
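The per-prefix limit suggests a simple sizing calculation and a key layout that spreads reads across prefixes. This sketch uses the 5,500 requests/second per-prefix figure that appears later in this section; the per-worker request rate, the `shard-NNN/` naming scheme, and both function names are assumptions chosen for illustration.

```python
import hashlib
import math

S3_GET_LIMIT_PER_PREFIX = 5500  # requests/second per prefix, as cited in this section

def prefixes_needed(workers, requests_per_worker_per_s):
    """Minimum number of prefixes so that aggregate reads stay under the limit."""
    total = workers * requests_per_worker_per_s
    return max(1, math.ceil(total / S3_GET_LIMIT_PER_PREFIX))

def sharded_key(key, n_prefixes):
    """Prepend a stable hash-derived shard prefix to an object key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % n_prefixes
    return f"shard-{shard:03d}/{key}"

# Hypothetical load: each GPU worker issues 100 GET requests per second
n = prefixes_needed(workers=1000, requests_per_worker_per_s=100)
print(n, sharded_key("train/img_000001.jpg", n))
```

Hashing the key keeps the mapping deterministic, so readers can recompute the prefix without a lookup table; the cost is that listing objects in their original order requires scanning every shard.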
Lambda pay-per-request vs EC2 always-on pricing.
Usage Pattern Analysis
Scenario 1: Sporadic Inference (100 requests/day)
Lambda Costs
EC2 Alternative (t3.micro always-on)
Break-even point: 1,460 requests/day
Scenario 2: High-Volume Inference (100,000 requests/day)
Lambda Costs
EC2 Alternative (c5.large)
Cost Crossover Points
Request Volume Thresholds
Instance Type | Monthly Cost | Lambda Break-even |
---|---|---|
t3.nano | $4.38 | 730 req/day |
t3.micro | $8.76 | 1,460 req/day |
t3.small | $17.52 | 2,920 req/day |
c5.large | $61.32 | 10,220 req/day |
Memory Impact on Lambda Costs
Memory | Relative compute price | 1M req/month cost |
---|---|---|
128MB | Base rate | $200 |
1GB | 8x base | $1,600 |
3GB | 24x base | $4,800 |
10GB | 80x base | $16,000 |
Duration Impact
Cost optimization requires matching service pricing model to actual usage patterns.
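The break-even column in the table above can be reproduced with a small calculation. The sketch assumes the $200 per million requests Lambda figure used in this section's tables and a 30-day month; the function name is hypothetical.

```python
LAMBDA_COST_PER_MILLION = 200.0  # USD per 1M requests, figure used in this section

def break_even_requests_per_day(ec2_monthly_cost, days_per_month=30):
    """Daily request volume at which Lambda cost equals an always-on instance."""
    requests_per_month = ec2_monthly_cost / LAMBDA_COST_PER_MILLION * 1_000_000
    return requests_per_month / days_per_month

for name, cost in [("t3.nano", 4.38), ("t3.micro", 8.76),
                   ("t3.small", 17.52), ("c5.large", 61.32)]:
    print(f"{name}: {break_even_requests_per_day(cost):,.0f} req/day")
```

Below the break-even volume the pay-per-request model is cheaper; above it the always-on instance wins, provided a single instance can actually serve that load.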
Hard limits eliminate service options before cost optimization.
Constraint Hierarchy
1. Hard Constraints (Service Elimination)
>64,000 IOPS → Multiple EBS volumes required
2. Performance Constraints (Service Selection)
>5,500 requests/second → S3 prefix distribution required
>16,000 IOPS → io2 volumes required
3. Cost Constraints (Configuration Optimization)
Real Architecture Decisions
Large Model Serving (7GB model)
Batch Processing (2-hour jobs)
Service constraints determine feasible architectures; cost considerations optimize within remaining options.
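The "eliminate first, optimize second" ordering can be sketched as a two-step selection. The Lambda limits are the ones stated in this section; treating EC2 as unbounded, the function name, and the monthly cost figures in the usage lines are simplifying assumptions.

```python
LIMITS = {
    "Lambda": {"max_memory_gb": 10, "max_runtime_min": 15},
    # EC2 is treated as unbounded here; real limits depend on instance type.
    "EC2": {"max_memory_gb": float("inf"), "max_runtime_min": float("inf")},
}

def choose_service(memory_gb, runtime_min, monthly_cost):
    """Filter services by hard limits first, then pick the cheapest survivor."""
    feasible = [
        name for name, limit in LIMITS.items()
        if memory_gb <= limit["max_memory_gb"]
        and runtime_min <= limit["max_runtime_min"]
    ]
    return min(feasible, key=lambda name: monthly_cost[name])

# Hypothetical monthly costs: Lambda is cheaper, but a 2-hour job rules it out
costs = {"Lambda": 5.0, "EC2": 60.0}
print(choose_service(memory_gb=1, runtime_min=120, monthly_cost=costs))
print(choose_service(memory_gb=1, runtime_min=1, monthly_cost=costs))
```

Cost only enters the decision after the hard limits have removed infeasible options, which mirrors the hierarchy on this slide.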
Transform single-machine PyTorch workflows into systems using EC2 and S3.
Local Development Workflow
# Everything on laptop
import torch
import pandas as pd
# Load data (local file)
data = pd.read_csv('dataset.csv')
# Train model (local GPU)
model = train_pytorch_model(data)
# Save model (local disk)
torch.save(model, 'model.pth')
# Serve predictions (local process)
app.run(host='localhost', port=5000)
Local Constraints:
Cloud Workflow Using EC2 + S3
# Distributed across services
import boto3
import pandas as pd
import torch
s3 = boto3.client('s3')
# Load data (from S3)
s3.download_file('ml-bucket', 'dataset.csv', '/tmp/dataset.csv')
data = pd.read_csv('/tmp/dataset.csv')
# Train model (EC2 with GPU)
model = train_pytorch_model(data)
# Save model (to S3)
torch.save(model, '/tmp/model.pth')
s3.upload_file('/tmp/model.pth', 'ml-bucket', 'models/model.pth')
# Serve predictions (Lambda + S3)
# Assumes the object returned by train_pytorch_model exposes predict()
def lambda_handler(event, context):
    s3.download_file('ml-bucket', 'models/model.pth', '/tmp/model.pth')
    model = torch.load('/tmp/model.pth')
    return model.predict(event['input'])
Cloud Capabilities:
EC2 instances and S3 buckets require API integration and IAM configuration for functional ML systems.
Simple ML system using EC2 for training and Lambda for serving.
Component Design
Data Storage (S3)
s3://ml-bucket/data/
s3://ml-bucket/models/
s3://ml-bucket/results/
Training Infrastructure (EC2)
Serving Infrastructure (Lambda)
System Data Flow
Total monthly cost: ~$330 for moderate ML workload with occasional training and regular serving.
EC2-based training system with S3 data management.
Training Job Configuration
EC2 Instance Setup
# Launch instance
# AMI IDs are region-specific; use an Ubuntu-based AMI ID valid in your region
aws ec2 run-instances \
    --image-id ami-0c02fb55956c7d316 \
    --instance-type p3.2xlarge \
    --key-name my-key \
    --security-group-ids sg-12345678
# SSH and setup
ssh -i my-key.pem ubuntu@instance-ip
sudo apt update && sudo apt install -y awscli
Training Script Structure
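#!/usr/bin/env python3
import os
import boto3
import torch
# Download training data
s3 = boto3.client('s3')
os.makedirs('data', exist_ok=True)  # download_file does not create directories
s3.download_file('ml-bucket', 'train.csv', 'data/train.csv')
# Load and train (load_data, MyModel, train_model are defined elsewhere in the project)
data = load_data('data/train.csv')
model = MyModel()
train_model(model, data, epochs=100)
# Upload trained model
torch.save(model.state_dict(), 'model.pth')
s3.upload_file('model.pth', 'ml-bucket', 'models/model_v1.pth')
# Shut down the instance (this stops it; it terminates only if launched with
# instance-initiated shutdown behavior set to 'terminate')
os.system('sudo shutdown -h now')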
Cost Optimization
Training Performance Analysis
Model Size | Local (RTX 4090) | EC2 (p3.2xlarge) | Cost |
---|---|---|---|
Small (10M params) | 2 hours | 1.5 hours | $4.59 |
Medium (100M params) | 8 hours | 6 hours | $18.36 |
Large (1B params) | Cannot fit | 24 hours | $73.44 |
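The cost column above is simply training hours multiplied by the p3.2xlarge hourly rate. The sketch below reproduces it; the function name is hypothetical, and the optional `spot_discount` parameter is an assumption that models a flat percentage reduction rather than actual spot market pricing.

```python
P3_2XLARGE_HOURLY = 3.06  # USD/hour, on-demand rate quoted in this section

def training_cost(hours, hourly_rate=P3_2XLARGE_HOURLY, spot_discount=0.0):
    """USD cost of a training run, optionally with a flat spot discount."""
    return round(hours * hourly_rate * (1.0 - spot_discount), 2)

for label, hours in [("Small", 1.5), ("Medium", 6), ("Large", 24)]:
    print(f"{label}: ${training_cost(hours):.2f}")
```

Passing a nonzero `spot_discount` gives a quick estimate of spot savings, though interrupted spot runs that restart from a checkpoint will consume extra hours.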
Training Workflow
Failure Handling
Training System Benefits: Scales beyond local GPU memory, handles larger datasets, provides cost flexibility through spot instances.
Lambda-based serving with S3 model storage.
Lambda Function Implementation
import json
import boto3
import torch
import tempfile
s3 = boto3.client('s3')
def lambda_handler(event, context):
    # Download model from S3 (cached after first call)
    if not hasattr(lambda_handler, 'model'):
        with tempfile.NamedTemporaryFile() as tmp:
            s3.download_file('ml-bucket', 'models/model_v1.pth', tmp.name)
            # Assumes the file holds a full pickled model, not just a state_dict
            lambda_handler.model = torch.load(tmp.name, map_location='cpu')
        lambda_handler.model.eval()
    # Parse input and convert the JSON list to a tensor
    input_data = json.loads(event['body'])
    features = torch.tensor(input_data['features'], dtype=torch.float32)
    # Make prediction
    with torch.no_grad():
        prediction = lambda_handler.model(features)
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction.tolist()})
    }
API Gateway Configuration
https://api.example.com/predict
Alternative: EC2 Serving
# For higher throughput or larger models
from flask import Flask, request
import torch
app = Flask(__name__)
model = torch.load('model.pth')  # Loaded once at startup
model.eval()
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Convert the JSON list to a tensor before calling the model
    features = torch.tensor(data['features'], dtype=torch.float32)
    with torch.no_grad():
        prediction = model(features)
    return {'prediction': prediction.tolist()}
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)
Serving Performance Comparison
Approach | Cold Start | Warm Latency | Max Throughput | Cost/1M requests |
---|---|---|---|---|
Lambda | 2-5 seconds | 100-300ms | 1000 concurrent | $200 |
EC2 t3.medium | 0ms | 50-100ms | 100 req/sec | $300 |
EC2 c5.large | 0ms | 20-50ms | 500 req/sec | $600 |
When to Use Each:
Lambda:
EC2:
Serving Design Choice: Lambda for variable workloads, EC2 for consistent high-throughput requirements.
S3-based data organization for ML workflows.
S3 Bucket Organization
ml-project-bucket/
├── data/
│ ├── raw/
│ │ ├── 2024/01/15/data.csv
│ │ └── 2024/01/16/data.csv
│ ├── processed/
│ │ ├── train.parquet
│ │ └── test.parquet
│ └── features/
│ └── feature_v1.csv
├── models/
│ ├── experiments/
│ │ ├── exp_001/model.pth
│ │ └── exp_002/model.pth
│ └── production/
│ ├── model_v1.pth
│ └── model_v2.pth
└── results/
├── predictions/
└── metrics/
Data Processing Pipeline
# Data validation and preprocessing
import boto3
import pandas as pd
from sklearn.model_selection import train_test_split
s3 = boto3.client('s3')
def process_data():
    # validate_schema and clean_missing_values are project-specific helpers
    # Download raw data
    s3.download_file('bucket', 'data/raw/data.csv', 'raw.csv')
    # Clean and validate
    df = pd.read_csv('raw.csv')
    df = validate_schema(df)
    df = clean_missing_values(df)
    # Split and save
    train, test = train_test_split(df)
    train.to_parquet('train.parquet')
    test.to_parquet('test.parquet')
    # Upload processed data
    s3.upload_file('train.parquet', 'bucket', 'data/processed/train.parquet')
    s3.upload_file('test.parquet', 'bucket', 'data/processed/test.parquet')
S3 Storage Class Strategy
Data Type | Access Pattern | Storage Class | Cost/GB/month |
---|---|---|---|
Raw data | Archive only | Glacier | $0.004 |
Processed training data | Weekly access | IA | $0.0125 |
Active models | Daily access | Standard | $0.023 |
Predictions | Real-time | Standard | $0.023 |
Data Lifecycle Management
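# Lifecycle policy example
lifecycle_policy = {
    'Rules': [{
        'ID': 'archive-aging-data',  # illustrative rule name
        'Filter': {'Prefix': ''},  # a rule needs a filter; empty prefix matches every object
        'Status': 'Enabled',
        'Transitions': [
            {
                'Days': 30,
                'StorageClass': 'STANDARD_IA'
            },
            {
                'Days': 90,
                'StorageClass': 'GLACIER'
            }
        ]
    }]
}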
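The effect of these transitions on storage cost can be traced with a small model. The day thresholds mirror the lifecycle policy and the per-GB prices come from the storage class table above; the function names are hypothetical, and retrieval fees and minimum storage durations are deliberately ignored.

```python
# Thresholds are checked from the oldest downward so the latest transition wins
TRANSITIONS = [(90, "GLACIER"), (30, "STANDARD_IA")]
PRICE_PER_GB_MONTH = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER": 0.004}

def storage_class_at(age_days):
    """Storage class an object is in after `age_days` under the policy."""
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            return storage_class
    return "STANDARD"

def monthly_cost(size_gb, age_days):
    """Approximate monthly storage cost in USD for an object of a given age."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class_at(age_days)]

for age in (0, 30, 90):
    print(age, storage_class_at(age), f"${monthly_cost(500, age):.2f}/month")
```

For data that is still read often, retrieval charges in the colder classes can outweigh the storage savings, so the thresholds should follow the observed access pattern.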
Data Access Patterns
Cost Optimization
Data Strategy: Organize by lifecycle stage, optimize storage classes for access patterns, implement automated lifecycle policies.
Connect EC2 training and Lambda serving through S3.
End-to-End Workflow
Automated Training Pipeline
# CloudWatch Event triggered training
import boto3
def trigger_training(event, context):
    # Launch EC2 training instance
    ec2 = boto3.client('ec2')
    user_data_script = '''#!/bin/bash
aws s3 cp s3://ml-bucket/scripts/train.py /home/ubuntu/
cd /home/ubuntu
python3 train.py
sudo shutdown -h now
'''
    response = ec2.run_instances(
        ImageId='ami-0c02fb55956c7d316',  # Deep Learning AMI
        InstanceType='p3.2xlarge',
        MinCount=1, MaxCount=1,
        UserData=user_data_script,
        # Terminate (rather than stop) when the script shuts the instance down
        InstanceInitiatedShutdownBehavior='terminate',
        IamInstanceProfile={'Name': 'ML-Training-Role'}
    )
    return {'instance_id': response['Instances'][0]['InstanceId']}
Model Update Workflow
# S3 trigger for model updates
import boto3
def update_serving_model(event, context):
    # New model uploaded to S3
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    if key.startswith('models/production/'):
        # Update Lambda environment variable
        # (Environment replaces the function's entire variable set)
        lambda_client = boto3.client('lambda')
        lambda_client.update_function_configuration(
            FunctionName='ml-serving-function',
            Environment={'Variables': {'MODEL_PATH': key}}
        )
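One detail worth handling in a trigger like this is that object keys arrive URL-encoded in S3 event notifications, so a key containing a space shows up with a `+`. The helper below is a sketch under that assumption; its name and return shape are hypothetical, and it processes every record rather than only the first.

```python
from urllib.parse import unquote_plus

def production_model_keys(event, prefix="models/production/"):
    """Return (bucket, decoded_key) pairs for records under the given prefix."""
    matches = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # undo URL encoding
        if key.startswith(prefix):
            matches.append((bucket, key))
    return matches

# Minimal hand-built event with one matching and one non-matching record
sample_event = {"Records": [
    {"s3": {"bucket": {"name": "ml-bucket"},
            "object": {"key": "models/production/model+v2.pth"}}},
    {"s3": {"bucket": {"name": "ml-bucket"},
            "object": {"key": "data/raw/data.csv"}}},
]}
print(production_model_keys(sample_event))
```

Separating event parsing from the AWS call also makes the trigger logic testable without any cloud resources.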
Monitoring and Alerting
CloudWatch Metrics
Automated Alerts
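# CloudWatch alarm for unhealthy training instances
# (AWS/EC2 publishes no 'InstanceTerminated' metric; StatusCheckFailed is used instead)
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='ML-Training-Failed',
    MetricName='StatusCheckFailed',
    Namespace='AWS/EC2',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],  # placeholder instance ID
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ml-alerts']
)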
System Health Dashboard
Integration Principles: Use S3 as central data store, automate workflows with triggers, implement comprehensive monitoring.
Practical cost control for EC2 and S3 based ML systems.
Cost Breakdown Analysis
Monthly Costs for Typical ML Project
Cost Optimization Strategies
EC2 Training Optimization
S3 Storage Optimization
Lambda Serving Optimization
Monitoring and Budgets
Cost Optimization Impact
Optimization | Before | After | Savings |
---|---|---|---|
Spot instances | $612 | $184 | $428 (70%) |
S3 lifecycle | $11.50 | $5.75 | $5.75 (50%) |
Right-sizing | $200 | $120 | $80 (40%) |
Total (includes $50 of other costs not itemized above) | $873.50 | $359.75 | $513.75 |
Monthly savings: 59% through optimization
Budgeting Framework
# Set up cost alerts
import boto3
budgets = boto3.client('budgets')
budgets.create_budget(
    AccountId='123456789012',
    Budget={
        'BudgetName': 'ML-Project-Budget',
        'BudgetLimit': {
            'Amount': '500',
            'Unit': 'USD'
        },
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST'
    },
    NotificationsWithSubscribers=[{
        'Notification': {
            'NotificationType': 'ACTUAL',
            'ComparisonOperator': 'GREATER_THAN',
            'Threshold': 80
        },
        'Subscribers': [{
            'SubscriptionType': 'EMAIL',
            'Address': 'admin@company.com'
        }]
    }]
)
Cost Management Process: Set budgets, implement optimizations, monitor usage patterns, adjust resources based on actual requirements.
Transform development system into production-ready ML service.
Production Readiness Checklist
Security
Reliability
Monitoring
Scalability
Compliance
Development vs Production
Aspect | Development | Production |
---|---|---|
Data volume | 1GB sample | 1TB+ full dataset |
Training frequency | Manual | Automated daily/weekly |
Serving SLA | Best effort | 99.9% availability |
Security | Basic | Enterprise-grade |
Cost | $50/month | $500-5000/month |
Production Architecture Changes
Operational Procedures
Success Metrics
Production Transformation: Add redundancy, monitoring, security, and operational procedures around the basic EC2/S3/Lambda architecture.
AWS uses ARNs to uniquely identify every resource across all accounts and regions globally.
ARN Structure Format
arn:partition:service:region:account-id:resource-type/resource-id
Component Breakdown
Partition: AWS deployment (usually “aws”)
aws - Standard AWS regions
aws-cn - China regions
aws-us-gov - GovCloud regions
Service: AWS service name
s3 - Simple Storage Service
ec2 - Elastic Compute Cloud
iam - Identity and Access Management
lambda - Lambda Functions
Region: Geographic region identifier
us-east-1 - US East (Virginia)
eu-west-1 - EU (Ireland)
Account ID: 12-digit account identifier
123456789012 - Specific AWS account
Resource: Service-specific identifier
bucket-name - S3 bucket
instance/i-1234567890abcdef0 - EC2 instance
user/developer-name - IAM user
Real ARN Examples
S3 Bucket ARN
arn:aws:s3:::ml-training-bucket-12345
S3 Object ARN
arn:aws:s3:::ml-training-bucket-12345/models/bert-base.pth
EC2 Instance ARN
arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0
IAM Role ARN
arn:aws:iam::123456789012:role/EC2-ML-Training-Role
Lambda Function ARN
arn:aws:lambda:us-east-1:123456789012:function:iris-classifier-api
Policy Usage Example
ARNs enable precise resource identification across AWS’s global infrastructure, supporting granular access control and cross-service integration.
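Because the ARN format is positional, it can be split into its components with a bounded string split. The sketch below follows the `arn:partition:service:region:account-id:resource` layout shown above; the function name and the dictionary keys are choices made for this example.

```python
def parse_arn(arn):
    """Split an ARN into its six colon-separated components."""
    # Limit the split to 5 so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts
    return {
        "partition": partition,
        "service": service,
        "region": region,          # empty for global services such as S3 and IAM
        "account_id": account_id,  # empty for S3 bucket ARNs
        "resource": resource,
    }

for example in [
    "arn:aws:s3:::ml-training-bucket-12345",
    "arn:aws:iam::123456789012:role/EC2-ML-Training-Role",
    "arn:aws:lambda:us-east-1:123456789012:function:iris-classifier-api",
]:
    print(parse_arn(example))
```

The empty region and account fields in the S3 and IAM examples are expected: those positions are left blank when the service does not scope the resource that way.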
AWS generates unique identifiers for resources with predictable patterns for programmatic access.
AWS-Generated IDs
EC2 Instances
i- + 17 hex characters (example: i-1234567890abcdef0)
Security Groups
sg- + 17 hex characters (example: sg-0123456789abcdef0)
VPCs (Virtual Private Clouds)
vpc- + 17 hex characters (example: vpc-12345678901234567)
Subnets
subnet- + 17 hex characters (example: subnet-0abcdef1234567890)
AMI (Amazon Machine Images)
ami- + 17 hex characters (example: ami-0c2b8ca1dad447f8a)
EBS Volumes
vol- + 17 hex characters (example: vol-0123456789abcdef0)
User-Defined Naming
S3 Bucket Names (Global)
ml-training-data-company-2024, model-artifacts-prod
IAM Names (Account-scoped)
Users: developer-john-smith, ci-cd-deployment
Roles: EC2-ML-Training-Role, Lambda-S3-Access
Policies: MLTrainingDataAccess, ModelDeploymentPermissions
Tags for Resource Organization
{
"Environment": "production",
"Project": "ml-classifier",
"Owner": "data-science-team",
"CostCenter": "research-development"
}
Naming Best Practices
Descriptive and Searchable
Good: ml-training-p3xlarge-gpu-instance
Avoid: my-instance-1
Environment Separation
ml-model-artifacts-dev
ml-model-artifacts-staging
ml-model-artifacts-prod
Service Integration
Consistent resource naming and an understanding of ID patterns enable automation, cost tracking, and operational management at scale.
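Predictable ID patterns make it possible to classify an identifier before calling any API. The sketch below checks the prefix-plus-17-hex-characters form described in this section; the function name and the label mapping are hypothetical, and older resources with shorter 8-character IDs are intentionally not matched.

```python
import re

ID_PATTERN = re.compile(r"^(i|sg|vpc|subnet|ami|vol)-[0-9a-f]{17}$")

KIND_BY_PREFIX = {
    "i": "EC2 instance", "sg": "security group", "vpc": "VPC",
    "subnet": "subnet", "ami": "machine image", "vol": "EBS volume",
}

def resource_kind(resource_id):
    """Return a label for a well-formed AWS-generated ID, or None otherwise."""
    match = ID_PATTERN.match(resource_id)
    return KIND_BY_PREFIX[match.group(1)] if match else None

for candidate in ["i-1234567890abcdef0", "subnet-0abcdef1234567890", "my-instance-1"]:
    print(candidate, "->", resource_kind(candidate))
```

A check like this catches copy-paste mistakes, such as a name passed where an ID is expected, before a request reaches AWS.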
Distributed systems require identity verification across network boundaries without shared local authentication.
Distributed Systems Security Problem
Local systems rely on operating system authentication:
Cloud Distribution Challenge
IAM as Distributed Security Solution
AWS IAM solves distributed identity through:
IAM Identity Types
Root Account
IAM Users
IAM Roles
Service-Linked Roles
Identity Hierarchy Structure
AWS Account (Root)
├── IAM Users
│ ├── Individual Developer A
│ ├── Individual Developer B
│ └── CI/CD System User
├── IAM Groups
│ ├── Developers Group
│ ├── Administrators Group
│ └── Read-Only Group
├── IAM Roles
│ ├── EC2-ML-Training-Role
│ ├── Lambda-Execution-Role
│ └── Cross-Account-Access-Role
└── Service-Linked Roles
├── ECS Task Role
├── Auto Scaling Role
└── CloudFormation Role
Identity Relationship Dependencies
User → Group Membership
Role → Trust Relationships
Cross-Account Trust
Critical Design Principle: Least privilege access - grant minimum permissions required for specific tasks, expandable through group membership or role assumption.
Distributed systems require explicit authorization for every network request.
Local vs Distributed Authorization
Local System Authorization (Traditional)
Distributed System Authorization (Cloud)
Policy-Based Authorization Model
IAM implements declarative security through JSON policies:
Policy Document Structure
Basic Policy Components
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::ml-training-bucket",
"Condition": {
"StringEquals": {
"s3:prefix": ["models/", "datasets/"]
}
}
}
]
}
Policy Types and Attachment Methods
Identity-Based Policies
Resource-Based Policies
Permission Boundaries
Policy Evaluation Logic
Common Permission Patterns
Service-Specific Actions
s3:ListBucket - List objects in S3 bucket
ec2:RunInstances - Launch EC2 instances
iam:CreateRole - Create IAM roles
logs:CreateLogGroup - Create CloudWatch log groups
Resource ARN Patterns
arn:aws:s3:::bucket-name/* - All objects in bucket
arn:aws:ec2:us-east-1:*:instance/* - All instances in region
arn:aws:iam::account-id:role/role-name - Specific IAM role
Policy Evaluation Rule: Explicit deny always wins, followed by explicit allow, with implicit deny as default for all unspecified actions.
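The evaluation rule can be illustrated with a toy evaluator. This is a sketch of the precedence logic only, not AWS's implementation: it assumes statements carry just Effect, Action, and Resource, ignores Condition and Principal, and uses shell-style wildcard matching from the standard library as a stand-in for IAM's matching. The function name `evaluate` is hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def evaluate(statements, action, resource):
    """Return 'Allow' or 'Deny' for one request against a list of statements."""
    allowed = False
    for stmt in statements:
        matches = (
            any(fnmatchcase(action, a) for a in _as_list(stmt["Action"]))
            and any(fnmatchcase(resource, r) for r in _as_list(stmt["Resource"]))
        )
        if not matches:
            continue
        if stmt["Effect"] == "Deny":
            return "Deny"      # explicit deny always wins
        allowed = True         # explicit allow, unless a deny also matches
    return "Allow" if allowed else "Deny"  # implicit deny by default


POLICY = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
     "Resource": "arn:aws:s3:::ml-training-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*",
     "Resource": "arn:aws:s3:::ml-training-bucket/secrets/*"},
]

if __name__ == "__main__":
    print(evaluate(POLICY, "s3:GetObject", "arn:aws:s3:::ml-training-bucket/data.csv"))
    print(evaluate(POLICY, "s3:GetObject", "arn:aws:s3:::ml-training-bucket/secrets/key.pem"))
    print(evaluate(POLICY, "s3:DeleteObject", "arn:aws:s3:::ml-training-bucket/data.csv"))
```

The three requests exercise the three outcomes in order: explicit allow, explicit deny overriding an allow, and implicit deny for an action no statement mentions.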
Multiple programmatic and interactive interfaces provide access to AWS services with different authentication and use case optimization.
AWS Management Console
Console Authentication Flow
User Login → MFA Verification → Session Token
├── Session Duration: 12 hours maximum
├── Automatic logout on inactivity
├── Role switching within console
└── CloudTrail logging of all actions
AWS Command Line Interface (CLI)
CLI Installation and Configuration
# Install AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install
# Configure default profile
aws configure
# AWS Access Key ID: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Default region name: us-east-1
# Default output format: json
AWS Software Development Kits (SDKs)
SDK Authentication Hierarchy
Python SDK (boto3) Example
import boto3
# Automatic credential resolution
s3_client = boto3.client('s3')
# List buckets
response = s3_client.list_buckets()
for bucket in response['Buckets']:
    print(f"Bucket: {bucket['Name']}")
# Upload file with automatic multipart
s3_client.upload_file(
    'local_file.txt',
    'ml-training-bucket',
    'datasets/file.txt'
)
Access Method Performance Comparison
Credential Security Principle: Use temporary credentials (roles) for applications, permanent credentials only for development environments with regular rotation.
Secure credential management requires understanding authentication mechanisms, storage locations, and rotation procedures for maintaining system security.
Credential Types and Use Cases
Access Key Pairs (Permanent Credentials)
Temporary Security Credentials
Multi-Factor Authentication (MFA)
Credential Storage Mechanisms
Local Configuration Files
# ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
[development]
aws_access_key_id = AKIAI44QH8DHBEXAMPLE
aws_secret_access_key = je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
# ~/.aws/config
[default]
region = us-east-1
output = json
[profile development]
region = us-west-2
output = table
Environment Variable Configuration
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1
export AWS_PROFILE=development
Instance Metadata Service (IMDS)
# Get instance role credentials (IMDSv2)
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
CREDENTIALS=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/iam/security-credentials/role-name)
Credential Security Best Practices
Development Environment
Production Environment
Credential Rotation Procedure
Security Implementation Standard: Production systems must use IAM roles with temporary credentials; permanent access keys only for development environments with mandatory rotation procedures.
Distributed systems require transitive trust without credential sharing.
Distributed Trust Problem
Traditional network security uses shared secrets:
Transitive Trust Challenge
ML systems require service-to-service access:
Role Assumption as Trust Delegation
IAM roles implement temporary trust without credential sharing:
Role Assumption Mechanics
Trust Policy Configuration
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::123456789012:user/DeveloperA",
"arn:aws:iam::123456789012:role/EC2-Instance-Role"
]
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "unique-external-identifier"
}
}
}
]
}
Role Assumption Process
sts:AssumeRole permission for target role
Temporary Credential Characteristics
Cross-Account Access Patterns
Development Account → Production Account
# Assume role in production account
aws sts assume-role \
--role-arn arn:aws:iam::987654321098:role/ProductionDeploymentRole \
--role-session-name deployment-session-2024 \
--external-id unique-external-identifier
# Response contains temporary credentials
{
"Credentials": {
"AccessKeyId": "ASIAIOSFODNN7EXAMPLE",
"SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"SessionToken": "very-long-session-token-string",
"Expiration": "2024-03-15T14:30:00Z"
}
}
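Because the returned credentials expire, callers typically check the Expiration field before reusing them. A minimal sketch, assuming the timestamp format shown in the response above and a hypothetical five-minute safety margin; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def parse_expiration(value: str) -> datetime:
    """Parse a UTC timestamp such as '2024-03-15T14:30:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def needs_refresh(credentials: dict, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the credentials expire within `margin` of `now`."""
    return parse_expiration(credentials["Expiration"]) - now <= margin


if __name__ == "__main__":
    creds = {"Expiration": "2024-03-15T14:30:00Z"}
    half_hour_before = datetime(2024, 3, 15, 14, 0, tzinfo=timezone.utc)
    print(needs_refresh(creds, half_hour_before))  # 30 minutes left
```

Refreshing slightly before expiry avoids requests that start with valid credentials and fail mid-flight.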
Service-to-Service Role Assumption
Cross-Account Trust Relationships
Account A (Production) Trusts Account B (Development)
Account B (111111111111) - Development
├── Developer Users
├── CI/CD Systems
└── Can assume roles in Production Account
Account A (222222222222) - Production
├── ProductionDeploymentRole (trusts Account B)
├── DataAccessRole (trusts specific users)
└── MonitoringRole (trusts service accounts)
Role Chaining Limitations
Access Control Architecture: Cross-account role assumption provides secure resource sharing without permanent credential distribution, enabling centralized identity management across multiple AWS environments.
Programmatic AWS access requires proper configuration of authentication credentials, regional settings, and service-specific parameters through standardized configuration methods.
Configuration Hierarchy and Precedence
Credential Resolution Order
Command-line options (aws s3 ls --profile production)
Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
Credentials file (~/.aws/credentials)
Config file (~/.aws/config)
Profile-Based Configuration Management
# ~/.aws/config
[default]
region = us-east-1
output = json
[profile development]
region = us-west-2
output = table
role_arn = arn:aws:iam::123456789012:role/DevelopmentRole
source_profile = default
[profile production]
region = us-east-1
output = json
role_arn = arn:aws:iam::987654321098:role/ProductionRole
source_profile = default
external_id = prod-external-id-2024
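The profile layout above is plain INI, so it can be inspected with the standard library. A minimal sketch, assuming config text shaped like the example; `profile_settings` is a hypothetical helper and does not replicate the CLI's full resolution logic (for instance, it does not follow source_profile).

```python
import configparser

SAMPLE_CONFIG = """
[default]
region = us-east-1
output = json

[profile development]
region = us-west-2
output = table
role_arn = arn:aws:iam::123456789012:role/DevelopmentRole
source_profile = default
"""


def profile_settings(config_text: str, profile: str) -> dict:
    """Return the key/value settings of one profile from config-file text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # In ~/.aws/config every profile except "default" is prefixed with "profile ".
    section = "default" if profile == "default" else f"profile {profile}"
    return dict(parser[section])


if __name__ == "__main__":
    print(profile_settings(SAMPLE_CONFIG, "development")["role_arn"])
```

Note the naming asymmetry: the config file uses `[profile development]`, while the credentials file uses the bare `[development]`.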
Advanced Configuration Options
Regional Configuration
Output Format Specification
json: Machine-readable structured output
table: Human-readable tabular format
text: Tab-delimited values for shell scripting
yaml: YAML-formatted output for configuration files
SDK Configuration Examples
Python (boto3) Configuration
import boto3
from botocore.config import Config
# Session with specific profile
session = boto3.Session(profile_name='development')
s3_client = session.client('s3')
# Client with custom configuration
config = Config(
region_name='us-west-2',
retries={'max_attempts': 10, 'mode': 'adaptive'},
max_pool_connections=50
)
ec2_client = boto3.client('ec2', config=config)
# Role assumption for cross-account access
sts_client = boto3.client('sts')
assumed_role = sts_client.assume_role(
RoleArn='arn:aws:iam::123456789012:role/DataAccessRole',
RoleSessionName='ml-training-session'
)
# Use temporary credentials
temp_credentials = assumed_role['Credentials']
s3_resource = boto3.resource(
's3',
aws_access_key_id=temp_credentials['AccessKeyId'],
aws_secret_access_key=temp_credentials['SecretAccessKey'],
aws_session_token=temp_credentials['SessionToken']
)
CLI Profile Operations
# List configured profiles
aws configure list-profiles
# Use specific profile
aws s3 ls --profile development
# Set default profile
export AWS_PROFILE=development
# Configure new profile interactively
aws configure --profile new-environment
Environment-Specific Configuration
Configuration Management Strategy: Use named profiles for environment separation, environment variables for containerized applications, and IAM roles for production services to maintain security boundaries and operational consistency.
Comprehensive security requires implementing permission boundaries, continuous access monitoring, and automated compliance verification to maintain least-privilege principles.
Permission Boundary Implementation
Maximum Permission Limits
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"ec2:DescribeInstances",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
},
{
"Effect": "Deny",
"Action": [
"iam:*",
"ec2:TerminateInstances",
"s3:DeleteBucket"
],
"Resource": "*"
}
]
}
Boundary Application Pattern
Access Monitoring and Alerting
CloudTrail Event Monitoring
Critical Security Events
# Root account login
"eventName": "ConsoleLogin",
"userIdentity.type": "Root"
# Failed authentication attempts
"errorCode": "SigninFailure"
"errorMessage": "Invalid username or password"
# Policy modification
"eventName": "PutUserPolicy",
"eventName": "AttachRolePolicy"
# Cross-account access
"eventName": "AssumeRole",
"recipientAccountId": "different-account-id"
Automated Compliance Verification
AWS Config Rules for IAM Compliance
Access Review Procedures
Quarterly Access Audit
Automated Security Monitoring
import boto3
from botocore.exceptions import ClientError
from datetime import datetime

def audit_iam_users(max_inactive_days=90):
    iam = boto3.client('iam')
    # Get all IAM users (paginate so large accounts are not truncated)
    for page in iam.get_paginator('list_users').paginate():
        for user in page['Users']:
            username = user['UserName']
            # Check last console activity (absent if the password was never used)
            last_used = user.get('PasswordLastUsed')
            if last_used:
                days_inactive = (datetime.now(last_used.tzinfo) - last_used).days
                if days_inactive > max_inactive_days:
                    print(f"Warning: User {username} inactive for {days_inactive} days")
            # Check MFA status
            try:
                mfa_devices = iam.list_mfa_devices(UserName=username)['MFADevices']
            except ClientError as e:
                print(f"Unable to check MFA for {username}: {e}")
                continue
            if not mfa_devices:
                print(f"Warning: User {username} has no MFA device")
Security Incident Response
Security Architecture Principle: Implement defense-in-depth through permission boundaries, continuous monitoring, and automated compliance verification to maintain security posture at scale.
Local ML development breaks under production data volumes and serving requirements.
Development Environment Limitations
MacBook Pro M3 (32GB RAM)
Production Requirements
Training Workload
Serving Workload
Failure Points
Distributed Architecture Solution
EC2 Compute Scaling
S3 Storage Scaling
Network Integration
EC2 r5.2xlarge (us-east-1a)
├── Training Process: PyTorch + 64GB RAM
├── Data Pipeline: boto3 → S3 streaming
├── Model Output: S3 model artifacts
└── API Server: Flask + gunicorn (100 req/s)
S3 Bucket (us-east-1)
├── /data/imagenet/ (1.3TB training data)
├── /models/experiments/ (trained model weights)
└── /logs/training/ (experiment tracking)
Cost Structure
Operational Complexity
This architecture trades local simplicity for production scalability at the cost of operational complexity and network dependencies.
AWS requires a credit card for account signup. Charges begin upon resource creation.
CRITICAL BILLING SAFETY - IMPLEMENT IMMEDIATELY:
1. Set Billing Alerts
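# Create a $10 monthly budget with an email alert at 80% of the limit
# (replace you@example.com with your own address)
aws budgets create-budget --account-id 123456789012 \
--budget '{
"BudgetName": "Monthly-Spend-Alert",
"BudgetLimit": {"Amount": "10", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[{
"Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN", "Threshold": 80, "ThresholdType": "PERCENTAGE"},
"Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@example.com"}]
}]'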
2. Always Terminate Resources
3. Use Free Tier Eligible Resources Only
EXPENSIVE MISTAKES TO AVOID:
GPU Instances: p3.2xlarge costs $3.06/hour ($2,200/month if left running)
Data Transfer: Transfer out to the internet costs about $0.09/GB; cross-region transfer costs roughly $0.02/GB (expensive for large datasets)
Load Balancers: Application Load Balancer costs about $16.20/month + $0.008 per LCU-hour
Auto Scaling: Can launch dozens of instances automatically during traffic spikes
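The monthly figure for a forgotten instance follows directly from the hourly rate. A small sketch of that arithmetic, assuming an average month of 730 hours; `monthly_cost` is an illustrative helper, not an AWS API.

```python
HOURS_PER_MONTH = 730  # assumed average: 365 days * 24 hours / 12 months


def monthly_cost(hourly_rate: float, hours_per_day: float = 24.0) -> float:
    """Monthly cost in USD of one instance running `hours_per_day` every day."""
    return hourly_rate * HOURS_PER_MONTH * (hours_per_day / 24.0)


if __name__ == "__main__":
    # p3.2xlarge at $3.06/hour, left running vs. used two hours a day
    print(f"always on:   ${monthly_cost(3.06):,.2f}")
    print(f"2 hours/day: ${monthly_cost(3.06, hours_per_day=2):,.2f}")
```

The gap between the two numbers is the cost of not terminating: the same instance used two hours a day costs one twelfth as much.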
Real Student Bill Examples:
PROTECTION CHECKLIST:
When In Doubt: STOP and TERMINATE EVERYTHING
Create a Linux development environment optimized for ML workloads.
Instance Launch Configuration
AMI Selection
Instance Type Selection
Storage Configuration
Network and Security
Key Pair Authentication
Launch Process Checklist
# Verify instance is running
aws ec2 describe-instances \
--instance-ids i-1234567890abcdef0
# Connect via SSH
ssh -i ml-training-key.pem \
ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# Check system info
uname -a
python3 --version
Expected Costs
Common Launch Issues
chmod 400 ml-training-key.pem
Verification: Instance reaches “running” state, passes status checks, accepts SSH connections.
Configure the instance for ML development with manual Docker installation.
System Updates and Dependencies
# Connect to instance
ssh -i ml-training-key.pem ubuntu@<instance-ip>
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install essential development tools
sudo apt install -y \
git \
htop \
tree \
curl \
wget \
unzip
# Verify Python environment
python3 --version
which python3
Docker Installation (Manual)
# Remove any old Docker versions
sudo apt-get remove docker docker-engine docker.io containerd runc
# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
# Add Docker repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
# Add user to docker group
sudo usermod -aG docker ubuntu
newgrp docker
# Verify Docker installation
docker --version
docker run hello-world
Python Environment Configuration
# Install Python package manager
sudo apt install -y python3-pip python3-venv
# Create virtual environment for ML
python3 -m venv ml-env
source ml-env/bin/activate
# Install ML frameworks and cloud integration packages
pip install \
torch \
boto3 \
pandas \
scikit-learn \
matplotlib \
flask \
joblib \
psutil
# Verify PyTorch installation
python -c "import torch; print(torch.__version__)"
python -c "import torch; print(torch.cuda.is_available())"
AWS CLI Configuration
# Install/update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Configure credentials (use IAM user with S3 permissions)
aws configure
# AWS Access Key ID: [your-access-key]
# AWS Secret Access Key: [your-secret-key]
# Default region: us-east-1
# Default output format: json
# Test AWS connectivity
aws s3 ls
Environment Verification
Troubleshooting Common Issues: Docker permission errors (restart session), virtual environment activation (source ml-env/bin/activate), AWS credential configuration.
Create cloud storage for training data and model artifacts.
Create S3 Bucket via AWS Console
ml-training-{random-suffix} (must be globally unique)
Bucket Structure
ml-training-demo-12345/
├── data/
│ ├── raw/
│ │ └── iris.csv
│ └── processed/
├── models/
│ └── experiments/
└── logs/
└── training/
Upload Sample Dataset
# Create sample dataset locally
python3 << EOF
from sklearn.datasets import load_iris
import pandas as pd
# Load iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.to_csv('iris.csv', index=False)
print(f"Created dataset with {len(df)} rows")
EOF
# Upload to S3
aws s3 cp iris.csv s3://ml-training-demo-12345/data/raw/iris.csv
# Verify upload
aws s3 ls s3://ml-training-demo-12345/data/raw/
Test S3 Access from Python
import boto3
import pandas as pd
from io import StringIO
# Initialize S3 client
s3_client = boto3.client('s3')
bucket_name = 'ml-training-demo-12345'
# List bucket contents
response = s3_client.list_objects_v2(Bucket=bucket_name)
for obj in response.get('Contents', []):
    print(f"Object: {obj['Key']}, Size: {obj['Size']} bytes")
# Download data for training
obj = s3_client.get_object(Bucket=bucket_name, Key='data/raw/iris.csv')
data = pd.read_csv(obj['Body'])
print(f"Loaded {len(data)} rows, {len(data.columns)} columns")
print(data.head())
S3 Access Patterns
Cost Monitoring
Verification: S3 bucket created, data uploaded successfully, Python can read/write objects, permissions configured correctly.
Define neural network architecture for cloud training.
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import boto3
from io import StringIO, BytesIO
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib
import json
from datetime import datetime
class IrisClassifier(nn.Module):
    def __init__(self, input_size=4, hidden_size=64, num_classes=3):
        super(IrisClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x
Architecture Details
Model Memory Requirements
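The model's memory footprint can be estimated from the layer sizes in the class above. A minimal sketch in plain Python (no PyTorch needed), assuming 32-bit float parameters; the helper names are illustrative.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def iris_classifier_params(input_size: int = 4, hidden_size: int = 64,
                           num_classes: int = 3) -> int:
    """Total trainable parameters of the three-layer IrisClassifier."""
    return (linear_params(input_size, hidden_size)
            + linear_params(hidden_size, hidden_size)
            + linear_params(hidden_size, num_classes))


BYTES_PER_FLOAT32 = 4

if __name__ == "__main__":
    total = iris_classifier_params()
    print(total)                      # 4675
    print(total * BYTES_PER_FLOAT32)  # 18700 bytes, about 18 KB
```

At under 20 KB of weights, this model is dominated by framework overhead rather than parameters, which is why a small CPU instance suffices for the demo.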
Handle data loading and model persistence in cloud storage.
def load_data_from_s3(bucket_name, key):
    """Load training data from S3 with error handling"""
    try:
        s3_client = boto3.client('s3')
        print(f"Loading data from s3://{bucket_name}/{key}")
        obj = s3_client.get_object(Bucket=bucket_name, Key=key)
        data = pd.read_csv(obj['Body'])
        print(f"Successfully loaded {len(data)} rows, {len(data.columns)} columns")
        return data
    except Exception as e:
        print(f"Error loading data from S3: {str(e)}")
        print(f"Bucket: {bucket_name}, Key: {key}")
        raise

def save_model_to_s3(model, scaler, bucket_name, model_key, scaler_key):
    """Save trained model and scaler to S3"""
    s3_client = boto3.client('s3')
    # Save PyTorch model
    model_buffer = BytesIO()
    torch.save(model.state_dict(), model_buffer)
    model_buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=model_key,
        Body=model_buffer.getvalue()
    )
    # Save scaler
    scaler_buffer = BytesIO()
    joblib.dump(scaler, scaler_buffer)
    scaler_buffer.seek(0)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=scaler_key,
        Body=scaler_buffer.getvalue()
    )
S3 Operation Characteristics
Complete training workflow with cloud data and model persistence.
def train_model():
    # Configuration
    bucket_name = 'ml-training-demo-12345'
    data_key = 'data/raw/iris.csv'
    # Load data from S3
    print("Loading data from S3...")
    data = load_data_from_s3(bucket_name, data_key)
    # Prepare features and labels
    X = data.drop('target', axis=1).values
    y = data['target'].values
    # Split and scale data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # Convert to PyTorch tensors
    X_train_tensor = torch.FloatTensor(X_train_scaled)
    y_train_tensor = torch.LongTensor(y_train)
    X_test_tensor = torch.FloatTensor(X_test_scaled)
    y_test_tensor = torch.LongTensor(y_test)
    # Initialize model and training
    model = IrisClassifier()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    # Training loop with resource monitoring
    print("Starting training...")
    import psutil
    start_time = datetime.now()
    model.train()
    for epoch in range(100):
        optimizer.zero_grad()
        outputs = model(X_train_tensor)
        loss = criterion(outputs, y_train_tensor)
        loss.backward()
        optimizer.step()
        if (epoch + 1) % 20 == 0:
            memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
            print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}, Memory: {memory_mb:.1f}MB')
    # Evaluate model
    model.eval()
    with torch.no_grad():
        test_outputs = model(X_test_tensor)
        _, predicted = torch.max(test_outputs.data, 1)
        accuracy = (predicted == y_test_tensor).sum().item() / len(y_test_tensor)
        print(f'Test Accuracy: {accuracy:.4f}')
    # Save to S3
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    model_key = f'models/iris_classifier_{timestamp}.pth'
    scaler_key = f'models/scaler_{timestamp}.pkl'
    save_model_to_s3(model, scaler, bucket_name, model_key, scaler_key)
    # Training performance summary
    end_time = datetime.now()
    training_duration = (end_time - start_time).total_seconds()
    print(f"Training completed in {training_duration:.1f} seconds")
    print(f"Final accuracy: {accuracy:.4f}")
    print(f"Model saved to S3: {model_key}")
    return model, scaler, accuracy, training_duration

# Run training
if __name__ == "__main__":
    model, scaler, accuracy, duration = train_model()
Training Performance Characteristics
Expected Output: Training progress logs, final accuracy metrics, confirmation of model artifacts saved to S3.
HTTP API server loading models from S3 for inference.
from flask import Flask, request, jsonify
import torch
import boto3
import joblib
from io import BytesIO
import numpy as np
# IrisClassifier must be importable in app.py; this assumes the training
# script was saved as train.py (adjust the module name to match your file)
from train import IrisClassifier

app = Flask(__name__)
# Global variables for model and scaler
model = None
scaler = None

def load_model_from_s3(bucket_name, model_key, scaler_key):
    """Load model and scaler from S3"""
    s3_client = boto3.client('s3')
    # Load PyTorch model
    model_obj = s3_client.get_object(Bucket=bucket_name, Key=model_key)
    model_buffer = BytesIO(model_obj['Body'].read())
    model = IrisClassifier()
    model.load_state_dict(torch.load(model_buffer, map_location='cpu'))
    model.eval()
    # Load scaler
    scaler_obj = s3_client.get_object(Bucket=bucket_name, Key=scaler_key)
    scaler_buffer = BytesIO(scaler_obj['Body'].read())
    scaler = joblib.load(scaler_buffer)
    return model, scaler

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({
        'status': 'healthy',
        'model_loaded': model is not None
    })

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Parse input data
        data = request.json
        features = np.array(data['features']).reshape(1, -1)
        # Scale features
        features_scaled = scaler.transform(features)
        # Make prediction
        with torch.no_grad():
            features_tensor = torch.FloatTensor(features_scaled)
            outputs = model(features_tensor)
            probabilities = torch.softmax(outputs, dim=1)
            predicted_class = torch.argmax(outputs, dim=1).item()
            confidence = probabilities[0][predicted_class].item()
        # Class names for Iris dataset
        class_names = ['setosa', 'versicolor', 'virginica']
        return jsonify({
            'predicted_class': class_names[predicted_class],
            'confidence': float(confidence),
            'probabilities': probabilities[0].tolist()
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 400

# Initialize model on startup
bucket_name = 'ml-training-demo-12345'
model_key = 'models/iris_classifier_20250916_143022.pth'
scaler_key = 'models/scaler_20250916_143022.pkl'
print("Loading model from S3...")
model, scaler = load_model_from_s3(bucket_name, model_key, scaler_key)
print("Model loaded successfully!")

if __name__ == '__main__':
    # debug=False: the Flask debugger permits remote code execution
    # and must never be exposed on a public interface
    app.run(host='0.0.0.0', port=80, debug=False)
API Performance Characteristics
Deploy and validate ML inference API on EC2 instance.
Local API Testing
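# Save API code as app.py
# Run Flask application. Port 80 requires root; call the virtual environment's
# interpreter directly so the installed packages are found (assumes ml-env was
# created in the home directory). Root also needs AWS credentials, for example
# through an instance role.
sudo ~/ml-env/bin/python app.py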
# Expected startup output:
Loading model from S3...
Model loaded successfully!
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:80
* Running on http://10.0.1.100:80
# Test from another terminal
# Health check
curl http://localhost/health
# Expected response:
{
"status": "healthy",
"model_loaded": true
}
# Make prediction
curl -X POST http://localhost/predict \
-H "Content-Type: application/json" \
-d '{"features": [5.1, 3.5, 1.4, 0.2]}'
# Expected response:
{
"predicted_class": "setosa",
"confidence": 0.9876,
"probabilities": [0.9876, 0.0084, 0.0040]
}
Public Internet Access
# Update security group to allow HTTP traffic
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxxxxx \
--protocol tcp \
--port 80 \
--cidr 0.0.0.0/0
# Test from external machine
curl http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com/health
# Load test with multiple requests
for i in {1..10}; do
curl -X POST \
http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com/predict \
-H "Content-Type: application/json" \
-d '{"features": [5.1, 3.5, 1.4, 0.2]}' &
done
wait
Error Handling Validation
# Test malformed request
curl -X POST http://localhost/predict \
-H "Content-Type: application/json" \
-d '{"invalid": "data"}'
# Expected error response (str(e) of a KeyError is just the quoted key):
{
"error": "'features'"
}
# Test wrong feature count
curl -X POST http://localhost/predict \
-H "Content-Type: application/json" \
-d '{"features": [5.1, 3.5]}'
# Expected error response (exact wording depends on the scikit-learn version):
{
"error": "X has 2 features, but StandardScaler is expecting 4 features as input."
}
Performance Verification: API handles 100+ requests/second, <5ms response time, graceful error handling for malformed inputs.
Monitor system performance and optimize costs for production use.
CloudWatch Integration
import boto3
from datetime import datetime
# Initialize CloudWatch client
cloudwatch = boto3.client('cloudwatch')
def publish_custom_metrics(accuracy, training_time):
    """Publish ML training metrics to CloudWatch"""
    # Model accuracy metric
    cloudwatch.put_metric_data(
        Namespace='ML/Training',
        MetricData=[
            {
                'MetricName': 'ModelAccuracy',
                'Value': accuracy,
                'Unit': 'Percent',
                'Dimensions': [
                    {
                        'Name': 'ModelType',
                        'Value': 'IrisClassifier'
                    }
                ]
            },
            {
                'MetricName': 'TrainingDuration',
                'Value': training_time,
                'Unit': 'Seconds',
                'Dimensions': [
                    {
                        'Name': 'InstanceType',
                        'Value': 't3.medium'
                    }
                ]
            }
        ]
    )
# Add to training script
start_time = datetime.now()
# ... training code ...
end_time = datetime.now()
training_duration = (end_time - start_time).total_seconds()
publish_custom_metrics(accuracy * 100, training_duration)
System Monitoring Commands
Cost Analysis and Optimization
# Check current AWS costs (the End date is exclusive, so use the first day of the next month)
aws ce get-cost-and-usage \
--time-period Start=2025-01-01,End=2025-02-01 \
--granularity DAILY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
# EC2 instance costs
aws ec2 describe-instances \
--query 'Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name]' \
--output table
# S3 storage used (total bytes in the bucket; multiply by the per-GB rate to estimate cost)
aws s3api list-objects-v2 \
--bucket ml-training-demo-12345 \
--query 'sum(Contents[].Size)' \
--output text
How to Reduce Costs
Instance Management
Storage Optimization
Development Practices
Expected Monthly Costs: t3.medium ($30), S3 storage ($5), data transfer ($10) = ~$45 for continuous operation.
Identify and resolve typical cloud development problems.
Connection and Access Issues
SSH Connection Failures
# Permission denied (publickey)
chmod 400 ml-training-key.pem
ssh -i ml-training-key.pem ubuntu@instance-ip
# Connection timeout
# Check security group allows SSH from your IP
aws ec2 describe-security-groups \
--group-ids sg-xxxxxxxxx
# Add your current IP to security group
curl ifconfig.me # Get your public IP
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxxxxx \
--protocol tcp \
--port 22 \
--cidr your-ip/32
S3 Access Errors
# NoCredentialsError
aws configure list
aws sts get-caller-identity
# AccessDenied
aws iam get-user
aws s3 ls s3://bucket-name --debug
# Bucket region mismatch
aws s3api get-bucket-location --bucket bucket-name
Docker Issues
Performance and Resource Issues
Memory and CPU Constraints
# Monitor resource usage
free -h
cat /proc/cpuinfo | grep processor | wc -l
htop
# PyTorch out of memory
# Reduce batch size in training code
batch_size = 16 # Instead of 64
Network and Latency Issues
# Slow S3 transfers
# Use multipart upload for large files
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB
# Test network speed
wget -O /dev/null http://speedtest-sfo1.digitalocean.com/10mb.test
# DNS resolution issues
nslookup s3.amazonaws.com
Application Debugging
import logging
logging.basicConfig(level=logging.DEBUG)
# Add extensive error handling
try:
    data = load_data_from_s3(bucket_name, data_key)
except Exception as e:
    print(f"S3 Error: {str(e)}")
    print(f"Bucket: {bucket_name}, Key: {data_key}")
    raise
# Log training progress
print(f"Epoch {epoch}, Loss: {loss.item():.4f}, Memory: {torch.cuda.memory_allocated()}")
Cost Overrun Prevention
Debugging Strategy: Check permissions first, verify network connectivity, monitor resource usage, implement comprehensive logging.
CPU-only EC2 instance delivers 22× slower training than local GPU.
Demo System Configuration
CIFAR-10 ResNet-18 Performance
Training Duration Impact
Why t3.medium Fails for ML
GPU Instance Costs
Instance | vCPU | GPU | RAM | Cost/Hour |
---|---|---|---|---|
t3.medium | 2 | None | 4GB | $0.042 |
p3.2xlarge | 8 | 1×V100 | 61GB | $3.06 |
p3.8xlarge | 32 | 4×V100 | 244GB | $12.24 |
Cost-Performance Analysis
Break-even Usage
Memory Requirements
p3.2xlarge costs about $73/day ($3.06 × 24 hours) in continuous operation vs $0 for a local GPU after purchase.
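The break-even point between renting and buying is the purchase price divided by the hourly rate. A minimal sketch: the $3.06/hour rate comes from the table above, while the $1,600 local GPU price is a hypothetical figure chosen only for illustration, and power, depreciation, and resale value are ignored.

```python
def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """Hours of cloud usage whose cost equals buying the hardware outright."""
    return purchase_price / hourly_rate


if __name__ == "__main__":
    P3_2XLARGE_RATE = 3.06      # USD/hour, from the instance table
    LOCAL_GPU_PRICE = 1600.0    # USD, hypothetical purchase price
    hours = break_even_hours(LOCAL_GPU_PRICE, P3_2XLARGE_RATE)
    print(f"{hours:.0f} hours, or {hours / 24:.1f} days of continuous use")
```

Below the break-even usage, renting is cheaper; above it, sustained workloads favor owned hardware, subject to the costs this sketch leaves out.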
Network storage introduces 8× slowdown for dataset loading.
CIFAR-10 Loading Performance (170MB dataset)
Network Latency Impact
Training Pipeline Bottlenecks
# Local development - continuous GPU utilization
for batch in DataLoader(dataset, batch_size=256):
    loss = model(batch) # GPU busy 98% of time
# S3 streaming - GPU starvation
for epoch in range(100):
    download_dataset_from_s3() # 12 second delay
    for batch in cached_dataset:
        loss = model(batch) # GPU idle during downloads
Checkpoint Saving Delays
Caching Strategies
EBS Volume Cache
Instance Store (i3.large)
Multi-part Downloads
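import concurrent.futures
import os
import boto3

def parallel_download(bucket, prefix, workers=8):
    s3 = boto3.client('s3')
    # Paginate: a single list_objects_v2 call returns at most 1,000 keys
    pages = s3.get_paginator('list_objects_v2').paginate(Bucket=bucket, Prefix=prefix)
    keys = [obj['Key'] for page in pages for obj in page.get('Contents', [])
            if not obj['Key'].endswith('/')]  # skip folder placeholder objects

    def download_one(key):
        path = os.path.join("./data", key)
        os.makedirs(os.path.dirname(path), exist_ok=True)  # download_file needs the directory to exist
        s3.download_file(bucket, key, path)

    with concurrent.futures.ThreadPoolExecutor(workers) as executor:
        # list() consumes the results so that download errors are raised here
        list(executor.map(download_one, keys))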
Cost of Data Movement
EBS caching reduces loading time to 2.8 seconds but requires manual cache management.
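The manual cache management mentioned above can be as simple as a download-on-miss wrapper. A minimal sketch, assuming a caller-supplied `fetch(key, path)` function (for example, one that wraps an S3 download); `cached_fetch` and its parameters are hypothetical names, and cache eviction and invalidation are left out.

```python
import os
from typing import Callable


def cached_fetch(key: str, cache_dir: str,
                 fetch: Callable[[str, str], None]) -> str:
    """Return a local path for `key`, calling `fetch(key, path)` only on a miss."""
    path = os.path.join(cache_dir, key)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        tmp_path = path + ".part"
        fetch(key, tmp_path)         # download into a temporary file first
        os.replace(tmp_path, path)   # rename, so no half-written entry is visible
    return path
```

Pointing `cache_dir` at an EBS or instance-store mount means only the first epoch pays the network cost; later epochs read from local disk.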
Incorrect resource ARNs cause access denied errors.
Common IAM Mistakes
Wrong Resource ARN Format
Error: Missing /* for object access
Fix: "arn:aws:s3:::my-bucket/*"
Missing List Permission
Error: Cannot list bucket contents
Fix: Add s3:ListBucket action
Overly Broad Permissions
Risk: Access to all S3 buckets in account
Production: Never use wildcard permissions
Minimal Working IAM Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::ml-training-bucket"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::ml-training-bucket/*"
}
]
}
Debug IAM Issues
# Test S3 access
aws s3 ls s3://ml-training-bucket --profile demo
# Check effective permissions
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::account:role/EC2-ML-Role \
--action-names s3:GetObject \
--resource-arns arn:aws:s3:::ml-training-bucket/data.csv
CloudTrail for Debugging
IAM permissions require exact ARN matching - bucket vs object permissions commonly confused.
EC2 instance failure stops all training with no automatic recovery.
Failure Modes
Data Loss Scenarios
Manual Recovery Process
Availability Calculation
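Availability figures translate into expected downtime, and redundancy improves them multiplicatively. A minimal sketch, assuming instance failures are independent and that any one healthy replica can serve traffic; the 99% single-instance figure is a hypothetical input, not a measured value.

```python
def redundant_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` independent instances suffices."""
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one instance
    for n in (1, 2, 3):
        a = redundant_availability(single, n)
        print(f"{n} instance(s): {a:.6f} -> {downtime_hours_per_year(a):.2f} h/year")
```

The independence assumption is what multi-AZ deployment tries to approximate: replicas in one Availability Zone tend to fail together.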
High Availability Requirements
Auto Scaling Group
{
"AutoScalingGroupName": "ml-training-asg",
"MinSize": 1,
"MaxSize": 3,
"DesiredCapacity": 1,
"HealthCheckType": "EC2",
"HealthCheckGracePeriod": 300,
"AvailabilityZones": ["us-east-1a", "us-east-1b"]
}
Application Load Balancer
Training Job Resilience
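import boto3
import torch

s3 = boto3.client('s3')

# model, optimizer, current_epoch and current_loss are assumed to be
# defined by the surrounding training script.
def checkpoint_training():
    # Save to S3 every epoch
    checkpoint = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'epoch': current_epoch,
        'loss': current_loss
    }
    torch.save(checkpoint, '/tmp/checkpoint.pth')
    s3.upload_file('/tmp/checkpoint.pth',
                   'bucket', f'checkpoints/epoch_{current_epoch}.pth')

def resume_training():
    # Resume from latest S3 checkpoint
    # find_latest_checkpoint_s3() is assumed to return the S3 key of the newest checkpoint
    latest_checkpoint = find_latest_checkpoint_s3()
    # torch.load reads local files, so download the checkpoint first
    s3.download_file('bucket', latest_checkpoint, '/tmp/checkpoint.pth')
    checkpoint = torch.load('/tmp/checkpoint.pth', map_location='cpu')
    model.load_state_dict(checkpoint['model_state'])
    # Restore optimizer state as well (momentum, learning-rate state)
    optimizer.load_state_dict(checkpoint['optimizer_state'])
    return checkpoint['epoch']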
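`find_latest_checkpoint_s3()` is not defined in these slides. One piece it would need is picking the newest key from a listing; a plain string sort ranks `epoch_9` above `epoch_10`, so the epoch number has to be compared numerically. A minimal sketch, assuming the `checkpoints/epoch_{n}.pth` naming used above (the helper name `latest_checkpoint_key` is hypothetical):

```python
import re

# Matches the key format written by checkpoint_training()
_EPOCH_RE = re.compile(r"checkpoints/epoch_(\d+)\.pth$")

def latest_checkpoint_key(keys):
    """Return the key with the highest epoch number, or None if none match."""
    best_key, best_epoch = None, -1
    for key in keys:
        match = _EPOCH_RE.search(key)
        if match and int(match.group(1)) > best_epoch:
            best_key, best_epoch = key, int(match.group(1))
    return best_key
```

The `keys` argument would come from an S3 listing of the `checkpoints/` prefix; keys that do not match the pattern are ignored.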
Cost of High Availability
Production systems require 3-5× cost increase for fault tolerance and automatic recovery.
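What the extra cost buys can be estimated with the standard redundancy formula. The sketch below assumes instance failures are independent, which real deployments only approximate; the 99% figure in the usage note is illustrative and not taken from these slides.

```python
HOURS_PER_YEAR = 24 * 365

def combined_availability(single, replicas):
    """Availability of `replicas` redundant instances, assuming independent failures."""
    return 1 - (1 - single) ** replicas

def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR
```

With an illustrative 99% per-instance availability, two replicas give 99.99%, reducing expected downtime from about 87.6 hours to about 0.88 hours per year.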
Serverless model serving faces initialization delays absent in always-on systems.
Cold Start Performance
import boto3
import torch

s3 = boto3.client('s3')
model = None  # Cached across warm invocations of the same container

def lambda_handler(event, context):
    global model
    if model is None:
        # Cold start steps:
        # 1. Download model from S3 (2-4 seconds)
        # 2. Load PyTorch model (1-3 seconds)
        # 3. Initialize model for inference (0.5-1 seconds)
        s3.download_file('bucket', 'model.pth', '/tmp/model.pth')
        # Note: on PyTorch >= 2.6, loading a full pickled model needs weights_only=False
        model = torch.load('/tmp/model.pth', map_location='cpu')
        model.eval()
    # Actual inference (10-50ms)
    with torch.no_grad():
        prediction = model(torch.tensor(event['input']))
    return {'prediction': prediction.tolist()}
Timing Breakdown
Warm Request Performance
Always-On EC2 Alternative
from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

# Load model once at startup (not per request)
print("Loading model...")  # 2-4 seconds one-time
model = torch.load('model.pth', map_location='cpu')
model.eval()
print("Model ready")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # No cold start - model already loaded
    with torch.no_grad():
        prediction = model(torch.tensor(data['features']))
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)
Performance Comparison
Approach | Cold Start | Warm Latency | Cost (1M req/month) |
---|---|---|---|
Lambda | 2-8 seconds | 15-50ms | $200 |
EC2 t3.micro | 0ms | 20-100ms | $350 |
EC2 c5.large | 0ms | 5-20ms | $720 |
When Lambda Makes Sense
When EC2 Required
Serverless introduces a 2-8 second initialization penalty on cold starts, versus 0 ms for persistent servers.
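How much the cold-start penalty matters depends on how often it is paid. A rough sketch of the average latency, where the cold-start fraction is an assumed input (it depends on traffic pattern and is not given in these slides):

```python
def mean_latency_ms(cold_fraction, cold_start_ms, warm_ms):
    """Average request latency when a fraction of requests also pay a cold start."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    # Every request pays warm_ms; cold requests pay cold_start_ms on top
    return warm_ms + cold_fraction * cold_start_ms
```

For example, `mean_latency_ms(0.01, 5000, 30)` models 1% cold requests with a 5-second cold start and 30 ms warm latency, both chosen from within the table's ranges. The mean hides the tail: the cold requests themselves still take seconds.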
Demo system requires manual start/stop vs production auto-scaling complexity.
Manual Operations
# Start training job
aws ec2 start-instances --instance-ids i-1234567890abcdef0
# SSH and run training
ssh -i key.pem ubuntu@instance-ip
python train_model.py
# Check progress manually
tail -f training.log
# Stop instance when done
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
Manual Process Problems
Development Workflow
Cost Control Issues
Auto Scaling Production Setup
# CloudFormation template
Resources:
  MLAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 0
      MaxSize: 10
      DesiredCapacity: 1
      LaunchTemplate:
        LaunchTemplateId: !Ref MLLaunchTemplate
        Version: !GetAtt MLLaunchTemplate.LatestVersionNumber
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2
  MLLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: ami-0c02fb55956c7d316
        InstanceType: p3.2xlarge
        IamInstanceProfile:
          Arn: !GetAtt MLInstanceProfile.Arn
        UserData:
          Fn::Base64: |
            #!/bin/bash
            aws s3 cp s3://ml-bucket/train.py /home/ubuntu/
            cd /home/ubuntu && python3 train.py
            shutdown -h now  # Auto-terminate when done
Auto Scaling Benefits
Production Complexity
Production auto-scaling requires infrastructure complexity but eliminates manual operations and cost overruns.
Production deployment multiplies operational requirements by 100×.
Demo System Operations
Weekly Effort: 1-2 Hours
Tools Required
Failure Recovery
Security Model
Cost Management
Production System Requirements
Weekly Effort: 15-20 Hours
Enterprise Operations Stack
# Infrastructure as Code
terraform plan && terraform apply
# Monitoring and Alerting
kubectl apply -f prometheus-config.yaml
aws cloudwatch put-metric-alarm --alarm-name "High-GPU-Usage"
# Security Compliance
aws configservice start-configuration-recorder --configuration-recorder-name default
aws guardduty create-detector --enable
# Cost Management
aws budgets create-budget --budget file://ml-budget.json
aws ce get-cost-and-usage --time-period Start=2024-01-01,End=2024-02-01
Production Requirements
Team Structure
Production ML systems require a dedicated operations team, whereas a single developer can run the demo system.
GPU instance costs exceed local workstation after 3 weeks continuous operation.
Cost Comparison Analysis
AWS p3.2xlarge (1× NVIDIA V100)
Local RTX 4090 Workstation
Break-even Analysis
Spot Instance Pricing
Usage Pattern Economics
Intermittent Research (20 hours/month)
Heavy Development (200 hours/month)
Continuous Production (720 hours/month)
GPU Performance Comparison
Reserved Instance Strategy
GPU instances are cost-effective below 90 hours/month; above this threshold, local hardware provides 60-85% savings.
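A break-even threshold of this kind comes from equating monthly cloud spend with the monthly cost of owning hardware. The sketch below shows the arithmetic only; the parameter names are hypothetical, local overhead (power, maintenance) is treated as a fixed monthly amount, and the inputs in the usage note are placeholders rather than the figures behind the 90-hour threshold.

```python
def break_even_hours_per_month(hardware_cost, amortization_months,
                               local_monthly_overhead, cloud_hourly_rate):
    """Monthly GPU hours at which cloud rental equals the cost of owning hardware."""
    # Spread the purchase price over its useful life, then add fixed overhead
    local_monthly_cost = hardware_cost / amortization_months + local_monthly_overhead
    return local_monthly_cost / cloud_hourly_rate
```

With placeholder inputs of a $3,600 workstation amortized over 36 months, $50/month overhead, and a $3.00/hour instance, the break-even is 50 hours/month. Below that usage the cloud is cheaper; above it the local machine is.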
Specific technical constraints where demonstrated architecture becomes inadequate.
Memory Constraints
Training Scale Limits
Request Rate Bottlenecks
Data Processing Limits
Alternative Architecture Patterns
Kubernetes + GPU Operators
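apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-cluster
spec:
  replicas: 16
  # apps/v1 requires a selector matching the pod labels; the label below is an assumed name
  selector:
    matchLabels:
      app: pytorch-training
  template:
    metadata:
      labels:
        app: pytorch-training
    spec:
      containers:
      - name: pytorch-training
        image: pytorch/pytorch:latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 61Gi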
Managed ML Services
Serverless Data Processing
When to Migrate from EC2+S3
Migration Triggers
The EC2+S3 architecture is optimal for single-developer ML projects; enterprise scale requires orchestration platforms and managed services.