
EE 547 - Unit 3
Spring 2026
Identity and Access Management
Your development machine is a single point of failure with fixed capacity.
What one machine provides
| Resource | Typical Range |
|---|---|
| RAM | 16–64 GB |
| CPU cores | 8–16 |
| Storage | 1–2 TB SSD |
| GPU VRAM | 0–24 GB |
| Network | Your ISP |
| Availability | When it’s on |
This is enough for development. It’s not enough for production.
Where single machines fail
Scale: Dataset exceeds memory. Model exceeds GPU. Traffic exceeds capacity. You can’t add more hardware to a laptop.
Reliability: Hardware fails. Power goes out. Your machine restarts for updates. One machine means one failure domain.
Geography: Users in Tokyo experience 150ms latency to your server in Los Angeles. Physics doesn’t negotiate.
Elasticity: Traffic spikes 10× during launch. You either over-provision (waste money) or under-provision (drop requests).
These problems share a solution: access to infrastructure you don’t own.
Someone else operates the hardware. You rent capacity.
Operating infrastructure requires:
These costs are largely fixed. A datacenter serving 100 users costs nearly as much as one serving 10,000.
Cloud model:
Providers operate infrastructure at massive scale, amortize fixed costs across many customers.
You pay for what you use. Capacity appears on demand.
Major providers:
Same underlying model, different APIs. This course uses AWS.
The enterprise infrastructure problem
Expertise gap
Hyperscaler economics
Statistical multiplexing
Scale advantages
CapEx: locked in upfront
OpEx: scales with usage
| Scenario | CapEx | OpEx |
|---|---|---|
| Bought 50, need 20 | Pay for 50 | Pay for 20 |
| Bought 50, need 100 | Can’t serve | Scale to 100 |
| Project canceled | Stranded asset | Stop paying |
Market position
Maturity and stability (2006-03-01 — still works)
Ecosystem
For this course
AWS provides access to infrastructure through services.
Each service encapsulates a specific capability:
| Service | Capability |
|---|---|
| EC2 | Virtual machines |
| S3 | Object storage |
| EBS | Block storage (virtual disks) |
| VPC | Virtual networks |
| IAM | Identity and access control |
| RDS | Managed relational databases |
| DynamoDB | Key-value database |
| Lambda | Function execution |
| SQS | Message queues |
More than 200 services exist. They share common patterns: API access, regional deployment, metered billing.
Every AWS operation is an API call.
Creating an EC2 instance:
POST / HTTP/1.1
Host: ec2.us-east-1.amazonaws.com
Authorization: AWS4-HMAC-SHA256 ...
Action=RunInstances
&ImageId=ami-0abcdef1234567890
&InstanceType=t3.micro
&MinCount=1&MaxCount=1
The response contains an instance ID—a handle to a running VM somewhere in AWS infrastructure.
Three interfaces to the same API:
Console — Web UI that constructs HTTP requests
CLI — Command-line tool that constructs HTTP requests
SDK — Library (Python, Go, etc.) that constructs HTTP requests
All three provide different ergonomics for the same underlying operations. The Console is convenient for exploration; the CLI and SDK enable automation.
Each service exposes an endpoint per region.
| Service | Region | Endpoint |
|---|---|---|
| EC2 | N. Virginia | ec2.us-east-1.amazonaws.com |
| EC2 | Ireland | ec2.eu-west-1.amazonaws.com |
| S3 | N. Virginia | s3.us-east-1.amazonaws.com |
| IAM | (global) | iam.amazonaws.com |
The region in the endpoint determines where the request is processed and where resources are created.
AWS operates datacenters. You interact with abstractions over them.
What you specify:
What AWS decides:
Physical constraints still apply:
Latency — Data travels at finite speed. Virginia to Tokyo ≈ 90ms minimum round-trip.
Jurisdiction — Data stored in eu-west-1 physically resides in Ireland, subject to EU law.
Failure — Hardware fails. AWS handles many failure modes transparently; some propagate to your applications.
The abstraction hides operational complexity, not physical reality.
Local operations are fast. Network operations are not.
Orders of magnitude
| Operation | Latency |
|---|---|
| Memory access | 0.0001 ms |
| SSD read | 0.1 ms |
| Same-AZ network | 1–5 ms |
| Cross-AZ network | 5–20 ms |
| Cross-region | 50–200 ms |
What took nanoseconds now takes milliseconds. That’s 10,000× to 1,000,000× slower.
Practical impact
A web request that makes 10 database queries: at roughly 0.1 ms per query against a local database, that's about 1 ms total; at 5–20 ms per cross-AZ query, the same request spends 50–200 ms just waiting on the network.
Suddenly your “fast” code is slow. The algorithm didn’t change—the environment did.
Design responses:
On your laptop, things either work or they don’t. In distributed systems, things partially work.
Local failure model
Program crashes → everything stops. Out of memory → process dies. Disk full → all writes fail.
Failure is total and obvious. You fix it and restart.
Distributed failure model
Failure is partial and subtle. Your system keeps running, but wrong.
Example: uploading a file
# Local: this either works or throws
with open('output.csv', 'w') as f:
f.write(data)
# S3: what if the network blips mid-upload?
s3.put_object(Bucket='b', Key='output.csv', Body=data)
# Did it succeed? Partially succeed? Need to retry?
Patterns you’ll need:
These aren’t edge cases—they’re normal operation at scale.
Regions → Availability Zones → Data Centers
AZ isolation
Failure containment
Data residency
Deploy across multiple AZs → survive AZ failure
AWS deploys infrastructure in multiple geographic locations called regions.
| Code | Location |
|---|---|
| us-east-1 | N. Virginia |
| us-east-2 | Ohio |
| us-west-2 | Oregon |
| eu-west-1 | Ireland |
| eu-central-1 | Frankfurt |
| ap-northeast-1 | Tokyo |
| ap-southeast-1 | Singapore |
| sa-east-1 | São Paulo |
Scale (2025):
Regions are independent deployments.
Each region has its own:
Resources in us-east-1 don’t exist in eu-west-1. An outage in one region doesn’t directly affect others.
Region selection determines:
Each region contains multiple Availability Zones (AZs)—physically separate datacenter facilities.
us-east-1 contains six AZs:
us-east-1a through us-east-1f
Physical characteristics:
Interconnected:
Purpose: failure isolation
Datacenter failures happen—power outages, cooling failures, network cuts, fires.
AZs are designed so that a failure affecting one facility doesn’t affect the others (mostly). If us-east-1a loses power, us-east-1b through us-east-1f continue operating.
AZ names are per-account:
Your us-east-1a may map to a different physical facility than another account’s us-east-1a. AWS randomizes the mapping to distribute load across facilities.
For cross-account coordination, use AZ IDs: use1-az1, use1-az2, etc.
Different AWS resources exist at different scopes within this hierarchy.

Global
Exist once, accessible everywhere:
These resources have no region. IAM policies apply across all regions.
Regional
Exist in one region, span AZs within that region:
Data is automatically replicated across AZs for durability.
Per-AZ
Exist in a specific AZ:
An EC2 instance runs on physical hardware in one AZ. An EBS volume stores data on drives in one AZ.
Placement constraint: Per-AZ resources can only attach to other resources in the same AZ. An EBS volume in us-east-1a cannot attach to an EC2 instance in us-east-1b.

Your application calls AWS services through their APIs. Services also call each other—an EC2 instance reading from S3, a Lambda function writing to DynamoDB. All API calls are authenticated and authorized through IAM.
What AWS provides:
What you provide:
The API contract:
You describe desired state through API calls. AWS materializes that state on physical infrastructure.
RunInstances → VM running on some server
CreateBucket → Storage allocated on some drives
CreateDBInstance → Database on some hardware
The mapping from logical resource to physical infrastructure is AWS’s responsibility. Your responsibility is understanding what the logical resources provide and how they compose.
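As a minimal sketch of this contract—the same RunInstances call from the earlier HTTP example, issued through boto3 (the AMI ID is illustrative):

```python
import boto3

# Describe the desired state; AWS decides which physical hardware provides it.
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.run_instances(
    ImageId='ami-0abcdef1234567890',   # illustrative AMI ID
    InstanceType='t3.micro',
    MinCount=1,
    MaxCount=1,
)

# The response is a logical handle, not a physical location.
print(response['Instances'][0]['InstanceId'])   # e.g., i-0abcdef1234567890
```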
EC2 (Elastic Compute Cloud) provides virtual machines.
A physical server in an AWS datacenter runs a hypervisor. The hypervisor partitions hardware resources—CPU cores, memory, network bandwidth—and presents them to multiple virtual machines as if each had dedicated hardware.
What you receive:
From inside the VM, this looks like a physical machine. The OS sees CPUs, RAM, disks, network interfaces.
What remains with AWS:
You specify what you want. AWS decides which physical server provides it.

Multiple EC2 instances share a physical server. The hypervisor enforces isolation—one instance cannot access another’s memory or see its network traffic. From each instance’s perspective, it has dedicated hardware.
EC2 offers many instance types—different allocations of CPU, memory, storage, and network.
Naming convention: {family}{generation}.{size}
| Type | vCPUs | Memory | Network | Use Case |
|---|---|---|---|---|
| t3.micro | 2 | 1 GB | Low | Development, light workloads |
| t3.large | 2 | 8 GB | Low-Mod | Small applications |
| m5.large | 2 | 8 GB | Moderate | Balanced workloads |
| m5.4xlarge | 16 | 64 GB | High | Larger applications |
| c5.4xlarge | 16 | 32 GB | High | Compute-intensive |
| r5.4xlarge | 16 | 128 GB | High | Memory-intensive |
General purpose (t3, m5):
Balanced CPU-to-memory ratio. Suitable for most workloads that don’t have extreme requirements in either dimension.
Compute optimized (c5):
High CPU-to-memory ratio. For workloads that are CPU-bound: batch processing, scientific modeling, video encoding.
c5.4xlarge: 16 vCPUs, 32 GB memory (2:1 ratio)
Memory optimized (r5, x1):
High memory-to-CPU ratio. For workloads that keep large datasets in memory: in-memory databases, caching, analytics.
r5.4xlarge: 16 vCPUs, 128 GB memory (1:8 ratio)
GPU instances (p3, g4):
Include NVIDIA GPUs. For ML training, inference, graphics rendering.
p3.2xlarge: 8 vCPUs, 61 GB, 1× V100 GPU
Storage optimized (i3, d2):
High sequential I/O. For data warehousing, distributed filesystems.
The t3 family uses a CPU credit model.
How it works:
| Type | Baseline | Credits/hour |
|---|---|---|
| t3.micro | 10% | 6 |
| t3.small | 20% | 12 |
| t3.medium | 20% | 24 |
| t3.large | 30% | 36 |
t3.micro can burst to 100% CPU, but sustained usage above 10% depletes credits.
Implications:
Good for:
Not good for:
For sustained workloads, m5 or c5 provide consistent performance without the credit system.
Course work: t3.micro is sufficient and free-tier eligible.
Launching an EC2 instance combines several configuration elements:

Each component contributes a different aspect: what software runs, what resources it has, how it’s accessed, what network it’s on, what it can do.
| Component | What It Determines | When Specified |
|---|---|---|
| AMI | Operating system, pre-installed software | At launch (immutable) |
| Instance type | CPU, memory, network capacity | At launch (can change when stopped) |
| Key pair | SSH authentication | At launch (cannot change) |
| Security group | Allowed inbound/outbound traffic | At launch (can modify later) |
| Subnet | VPC, Availability Zone, IP range | At launch (immutable) |
| IAM role | Permissions for AWS API calls | At launch (can change) |
| EBS volumes | Persistent storage | At launch or attach later |
AMI (Amazon Machine Image):
Template containing OS and software. AWS provides Amazon Linux, Ubuntu, Windows. AMI IDs are region-specific—the same Ubuntu version has different IDs in different regions.
A security group is a stateful firewall applied to instances.
Inbound rules specify what traffic can reach the instance:
Type Port Source
─────────────────────────────
SSH 22 0.0.0.0/0
HTTP 80 0.0.0.0/0
HTTPS 443 0.0.0.0/0
PostgreSQL 5432 10.0.0.0/16
Custom        8080   sg-0abc1234
Each rule: protocol, port range, source (CIDR or security group).
Default inbound: deny all
Outbound rules specify what traffic the instance can send:
Default outbound: allow all
Stateful behavior:
If an inbound request is allowed (e.g., HTTP on port 80), the response is automatically allowed outbound—no explicit outbound rule needed.
Similarly, if an outbound request is allowed, the response is allowed inbound.

Rules are evaluated per-packet. If no rule allows traffic, it’s denied (default deny). Multiple security groups can attach to one instance—rules combine as union.
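Rules can be added or removed after launch. A hedged CLI sketch (the group ID is illustrative):

```bash
# Allow inbound HTTP from anywhere on an existing security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0abc1234 \
  --protocol tcp \
  --port 80 \
  --cidr 0.0.0.0/0
```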
An EC2 instance moves through states:

| State | Compute Billing | Storage |
|---|---|---|
| pending | No | — |
| running | Yes | Attached |
| stopped | No | Persists |
| terminated | No | Deleted* |
*Root volume deleted by default; can configure to retain.
Stop preserves the instance. EBS volumes remain, private IP preserved. Restart later—may land on different physical host.
Terminate destroys the instance permanently. Cannot recover.
EBS (Elastic Block Store) provides persistent storage for EC2 instances.
Characteristics:
Volume types:
| Type | Performance | Use Case |
|---|---|---|
| gp3 | Balanced SSD | General purpose |
| io2 | High IOPS SSD | Databases |
| st1 | Throughput HDD | Big data |
| sc1 | Cold HDD | Infrequent access |
The AZ constraint:
EBS volumes exist in a specific Availability Zone. They can only attach to instances in the same AZ.
Volume in us-east-1a → Instance must be in us-east-1a
This is a consequence of the physical architecture: the volume’s data is stored on drives in that AZ’s datacenter.
Snapshots:
Point-in-time backup stored in S3 (regionally). Can create new volumes from snapshots in any AZ within the region.

By default, root volumes are deleted on termination. Additional volumes persist unless explicitly deleted. This allows data to survive instance replacement.
To reach an EC2 instance:
Network path must exist:
Authentication must succeed:
The username depends on the AMI: ec2-user (Amazon Linux), ubuntu (Ubuntu), Administrator (Windows).
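Putting the pieces together, a typical connection looks like this (key file and address are placeholders):

```bash
ssh -i my-key.pem ec2-user@<public-ip>
```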
Every EC2 instance can query information about itself at http://169.254.169.254/latest/meta-data/.
This link-local address is routed to the instance metadata service, accessible only from within the instance.
Available information:
| Path | Returns |
|---|---|
| /instance-id | i-0123456789abcdef0 |
| /instance-type | t3.micro |
| /ami-id | ami-0abcdef1234567890 |
| /local-ipv4 | 172.31.16.42 |
| /public-ipv4 | 54.xxx.xxx.xxx |
| /placement/availability-zone | us-east-1a |
| /iam/security-credentials/{role} | Temporary credentials JSON |
The SDKs use this endpoint to automatically obtain IAM role credentials when running on EC2—no access keys needed in code.
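From a shell on the instance, the same endpoint can be queried directly—this is the same curl check used later when debugging role attachment:

```bash
curl http://169.254.169.254/latest/meta-data/instance-id
curl http://169.254.169.254/latest/meta-data/placement/availability-zone
```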
An EC2 instance runs your code. That code needs to access other AWS services:
Each of these is an API call. S3 doesn’t know your EC2 instance—it receives an HTTPS request and must decide: should I allow this?
The request arrives at S3:
S3 must determine:
Without proof of identity and permission, S3 rejects the request.
This is what IAM provides.
Authentication: Who is making this request?
Authorization: Is this principal allowed to perform this action?
Every AWS API call—from any source—goes through this evaluation. No exceptions.

Every AWS resource and principal has an Amazon Resource Name (ARN):
| Component | Example | Notes |
|---|---|---|
| Partition | aws | Usually aws; aws-cn for China |
| Service | iam, s3, ec2 | The AWS service |
| Region | us-east-1 | Empty for global services (IAM) |
| Account | 123456789012 | 12-digit AWS account ID |
| Resource | user/alice, role/MyRole | Service-specific format |
Examples:
arn:aws:iam::123456789012:user/alice
arn:aws:iam::123456789012:role/EC2-S3-Reader
arn:aws:s3:::my-bucket
arn:aws:s3:::my-bucket/data/*
arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123
ARNs uniquely identify resources across all of AWS. Policies reference ARNs to specify who can do what to which resources.
AWS generates unique identifiers for resources. The prefix indicates the resource type.
AWS-Generated IDs
| Prefix | Resource Type |
|---|---|
| i- | EC2 instance |
| vol- | EBS volume |
| sg- | Security group |
| vpc- | VPC |
| subnet- | Subnet |
| ami- | Machine image |
| snap- | EBS snapshot |
Example: i-0abcd1234efgh5678
These IDs are immutable—an instance keeps its ID through stop/start cycles. They’re region-scoped; an AMI, for example, exists in one region, though it can be copied to another (the copy gets a new ID).
User-Defined Names
Some resources have user-chosen names:
S3 bucket names (e.g., my-company-data-2025)
IAM role names (e.g., EC2-S3-Reader)
Tags don’t affect behavior—they’re for organization, billing attribution, and automation (e.g., “terminate all instances tagged Environment=dev”).
API requests must be signed with credentials. AWS verifies the signature to authenticate the caller.
Long-term credentials:
~/.aws/credentials or environment variables
# ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFE...
Risk: If leaked, attacker has indefinite access until you notice and revoke.
Short-term credentials:
{
"AccessKeyId": "ASIAXXX...",
"SecretAccessKey": "xxx...",
"SessionToken": "FwoGZX...",
"Expiration": "2025-01-28T14:30:00Z"
}
Benefit: Limited blast radius. Leaked credentials expire.

The SDK handles signing. You provide credentials (or it finds them automatically); it constructs the Authorization header. AWS verifies the signature by recomputing it with the same secret key.
Permissions are defined in policy documents—JSON that specifies what’s allowed or denied.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-bucket/*"
}
]
}
| Field | Purpose | Values |
|---|---|---|
| Version | Policy language version | Always "2012-10-17" |
| Statement | Array of permission rules | One or more statements |
| Effect | Allow or deny | "Allow" or "Deny" |
| Action | API operations | "s3:GetObject", "ec2:*", etc. |
| Resource | What the action applies to | ARN or ARN pattern |
Actions are service:operation:
s3:GetObject # Read object
s3:PutObject # Write object
s3:DeleteObject # Delete object
s3:ListBucket # List bucket contents
s3:* # All S3 actions
ec2:RunInstances # Launch instance
ec2:TerminateInstances
ec2:Describe* # All Describe actions
Wildcards match patterns:
s3:* — all S3 actions
ec2:Describe* — all Describe actions
* — all actions (dangerous)
Resources are ARNs or patterns:
# Specific object
arn:aws:s3:::my-bucket/data/file.json
# All objects in bucket
arn:aws:s3:::my-bucket/*
# All objects in prefix
arn:aws:s3:::my-bucket/data/*
# All buckets (rarely appropriate)
arn:aws:s3:::*
# All resources (dangerous)
*
The bucket itself vs objects in it:
s3:ListBucket applies to the bucket. s3:GetObject applies to objects.
A policy can contain multiple statements. Common pattern: different permissions for different resources.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ListBucket",
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::my-bucket"
},
{
"Sid": "ReadWriteObjects",
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-bucket/*"
}
]
}
Sid (statement ID) is optional—useful for documentation and debugging.
Both statements must be present: listing requires permission on the bucket, reading/writing requires permission on objects.
When a request arrives, AWS evaluates all applicable policies:

| Rule | Meaning |
|---|---|
| Default deny | If no policy mentions the action, it’s denied |
| Explicit deny wins | A "Deny" statement overrides any "Allow" |
| Explicit allow grants | An "Allow" statement permits the action (unless denied) |
Practical implications:
{
"Statement": [
{"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
{"Effect": "Deny", "Action": "s3:DeleteBucket", "Resource": "*"}
]
}
This allows all S3 actions except DeleteBucket. The deny wins.

Identity-based: “This user/role can do X to Y”
Resource-based: “This resource allows X from Y” (includes Principal field)
Both are evaluated. For same-account access, either can grant. We focus on identity-based policies—they’re more common.
Back to EC2 accessing S3. One approach: create an IAM user, generate access keys, put them in your code.
import boto3
s3 = boto3.client(
's3',
aws_access_key_id='AKIAIOSFODNN7EXAMPLE',
aws_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
)
s3.get_object(Bucket='my-bucket', Key='data.json')
Problems:
| Issue | Consequence |
|---|---|
| Keys in code | Checked into git, visible in repository |
| Keys on disk | Anyone with instance access can read them |
| Keys don’t expire | Leaked key = indefinite access |
| Keys per application | Managing many keys is error-prone |
| Key rotation | Manual process, often neglected |
This is how credentials get leaked. Public GitHub repositories are scanned constantly for AWS keys.
A role is an IAM identity that:
When an entity assumes a role, AWS STS (Security Token Service) issues temporary credentials:

Credentials expire (default 1 hour, configurable). When they expire, assume the role again to get new ones.
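The same mechanism is visible directly through STS. A sketch (role ARN and session name are illustrative):

```python
import boto3

sts = boto3.client('sts')
resp = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/EC2-S3-Reader',
    RoleSessionName='example-session',
    DurationSeconds=3600,
)
creds = resp['Credentials']
# AccessKeyId, SecretAccessKey, SessionToken, Expiration — same shape as above
```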
EC2 instances assume roles through instance profiles.
Instance profile = container that holds an IAM role
When you launch an instance with an instance profile:
http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRole
Response:

SDKs search for credentials in a defined order:

On EC2 with an instance profile, the SDK automatically uses instance metadata. No configuration needed.
1. Create IAM Role with trust policy:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "ec2.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}
2. Attach permissions policy to role:
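One way to do this from the CLI (the policy name and file path are illustrative; the document would be an S3 permissions policy like the JSON examples shown earlier):

```bash
aws iam put-role-policy \
  --role-name EC2-S3-Reader-Role \
  --policy-name S3Access \
  --policy-document file://s3-access-policy.json
```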
3. Create instance profile and attach role:
aws iam create-instance-profile --instance-profile-name EC2-S3-Reader
aws iam add-role-to-instance-profile \
--instance-profile-name EC2-S3-Reader \
--role-name EC2-S3-Reader-Role
4. Launch EC2 with instance profile:
aws ec2 run-instances \
--image-id ami-0c55b159cbfafe1f0 \
--instance-type t3.micro \
--iam-instance-profile Name=EC2-S3-Reader \
--key-name my-key
5. Code on the instance:
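A minimal sketch of what that code looks like—no credentials appear anywhere; the SDK pulls them from instance metadata (bucket and key are illustrative):

```python
import boto3

s3 = boto3.client('s3')   # no keys—credentials come from the instance profile
obj = s3.get_object(Bucket='my-bucket', Key='data.json')
data = obj['Body'].read()
```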
| With access keys | With IAM roles |
|---|---|
| Keys in code or config files | No keys to leak |
| Keys valid indefinitely | Credentials expire automatically |
| Manual rotation required | Automatic rotation |
| Keys can be copied anywhere | Credentials tied to instance |
| Compromised key = long-term access | Compromised instance = temporary access |
Instance compromise is still serious—attacker gets whatever permissions the role has. But they don’t get permanent credentials they can use after losing access to the instance.
Least privilege: give roles only the permissions they need. s3:GetObject on one bucket, not s3:* on *.
EBS provides block storage—raw disk that an OS formats and manages as a filesystem. S3 provides object storage—a different abstraction entirely.
Block storage (EBS):
The filesystem abstraction you already know.
Object storage (S3):
A different model optimized for different access patterns.
S3 is not a filesystem you mount. It’s a service you call.
S3 organizes data into buckets containing objects.
Bucket: a container with a globally unique name
training-data-2025 — yours, no one else can use this name
Object: a key-value pair
Key — a string (e.g., models/v1/weights.pt); value — the bytes; plus metadata
That’s it. Buckets hold objects. Objects are key + bytes + metadata.

The AWS Console and CLI show a folder-like view. This is a UI convenience, not reality.
# These are three separate objects with no relationship:
data/train/batch-001.csv
data/train/batch-002.csv
data/test/batch-001.csv
# There is no "data" directory
# There is no "data/train" directory
# You cannot "cd" into anything
# You cannot "ls" a directory (you filter by prefix)
What “listing a directory” actually does:
This calls ListObjectsV2 with Prefix="data/train/" and Delimiter="/". S3 returns objects whose keys start with that prefix. The slash delimiter groups results to simulate folders.
No directory was traversed. A string filter was applied.
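A sketch of that call in boto3 (bucket name illustrative); CommonPrefixes is how the simulated “folders” come back:

```python
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(
    Bucket='my-bucket',
    Prefix='data/train/',
    Delimiter='/',
)
for obj in resp.get('Contents', []):        # keys under the prefix
    print(obj['Key'])
for cp in resp.get('CommonPrefixes', []):   # grouped prefixes ("folders")
    print(cp['Prefix'])
```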
The flat namespace has consequences:
No “rename” operation
Renaming old-name.csv to new-name.csv requires:
For a 5 GB file, this means uploading 5 GB again (within S3, but still a copy).
Filesystems rename by changing a pointer. S3 doesn’t have pointers.
No “move” operation
Same as rename—copy then delete.
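Both operations reduce to the same two calls. A sketch (bucket and keys illustrative):

```python
import boto3

s3 = boto3.client('s3')
# "Rename"/"move" = copy to the new key, then delete the old one
s3.copy_object(
    Bucket='my-bucket',
    Key='new-name.csv',
    CopySource={'Bucket': 'my-bucket', 'Key': 'old-name.csv'},
)
s3.delete_object(Bucket='my-bucket', Key='old-name.csv')
```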
No “append” operation
Adding 100 bytes to a 5 GB file requires:
Or: store as separate objects and concatenate at read time.
Filesystems append by extending allocation. S3 objects are immutable blobs.
No partial update
Changing byte 1000 requires replacing the entire object.
Once written, an object cannot be modified—only replaced entirely.
# This doesn't append—it overwrites
s3.put_object(
Bucket='my-bucket',
Key='log.txt',
Body='new content' # Replaces everything
)
Design implications:
This isn’t a limitation to work around—it’s a model to design for. Many distributed systems work well with immutable data.
S3 is an HTTP API. Every operation is an HTTP request.
| Operation | HTTP Method | What It Does |
|---|---|---|
| PutObject | PUT | Create/replace object |
| GetObject | GET | Retrieve object (or byte range) |
| DeleteObject | DELETE | Remove object |
| HeadObject | HEAD | Get metadata without body |
| ListObjectsV2 | GET on bucket | List keys matching prefix |
PUT /my-bucket/data/file.csv HTTP/1.1
Host: s3.us-east-1.amazonaws.com
Content-Length: 1048576
Authorization: AWS4-HMAC-SHA256 ...
<file bytes>
The CLI and SDK construct these requests. Understanding that it’s HTTP explains the operation set—HTTP doesn’t have “append” or “rename” either.
# Create bucket (name must be globally unique)
aws s3 mb s3://my-bucket-unique-name-12345
# Upload file
aws s3 cp ./local-file.csv s3://my-bucket/data/file.csv
# Download file
aws s3 cp s3://my-bucket/data/file.csv ./local-file.csv
# List objects (with prefix filter, not directory listing)
aws s3 ls s3://my-bucket/data/
# Sync local directory to S3 (uploads new/changed files)
aws s3 sync ./local-dir/ s3://my-bucket/data/
# Delete object
aws s3 rm s3://my-bucket/data/file.csv
# Delete all objects with prefix
aws s3 rm s3://my-bucket/data/ --recursive
aws s3 commands are high-level conveniences. aws s3api exposes the raw API operations.
import boto3
import json
s3 = boto3.client('s3')
# Upload object
s3.put_object(
Bucket='my-bucket',
Key='results/experiment-001.json',
Body=json.dumps({'accuracy': 0.94, 'loss': 0.23}),
ContentType='application/json'
)
# Download object
response = s3.get_object(Bucket='my-bucket', Key='results/experiment-001.json')
data = json.loads(response['Body'].read())
# List objects with prefix
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='results/')
for obj in response.get('Contents', []):
print(f"{obj['Key']}: {obj['Size']} bytes")EC2 instance needs to read from S3. This is the IAM role pattern from Part 3.
Role trust policy (who can assume):
{
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "ec2.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}
Role permissions policy (what they can do):
{
"Statement": [{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::training-data-bucket/*"
}]
}
Instance launched with this role’s instance profile. Code on instance:
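A minimal sketch, using the bucket named in the policy above (the key is illustrative):

```python
import boto3

s3 = boto3.client('s3')   # credentials arrive via the instance profile
obj = s3.get_object(Bucket='training-data-bucket', Key='data/train.csv')
```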
A common permissions mistake: policy grants object access but not bucket access.
This allows downloading objects. But listing what’s in the bucket:
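A sketch of what that failure looks like (bucket name illustrative):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
try:
    s3.list_objects_v2(Bucket='my-bucket')
except ClientError as e:
    # AccessDenied — the policy covers objects, not the bucket itself
    print(e.response['Error']['Code'])
```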
ListBucket is a bucket operation, not an object operation. It needs the bucket ARN:
{
"Statement": [
{
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::my-bucket"
},
{
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-bucket/*"
}
]
}
Note: my-bucket (the bucket) vs my-bucket/* (objects in the bucket).
S3 ARNs don’t include region or account for buckets (bucket names are globally unique):
arn:aws:s3:::bucket-name # The bucket itself
arn:aws:s3:::bucket-name/* # All objects in bucket
arn:aws:s3:::bucket-name/prefix/* # Objects under prefix
arn:aws:s3:::bucket-name/exact-key   # Specific object
| ARN | Applies To |
|---|---|
| arn:aws:s3:::my-bucket | ListBucket, GetBucketLocation, bucket operations |
| arn:aws:s3:::my-bucket/* | GetObject, PutObject, DeleteObject, object operations |
| arn:aws:s3:::my-bucket/data/* | Object operations on keys starting with data/ |
Getting this wrong is the most common S3 permissions error.
Two different guarantees:
Durability: probability data survives
S3 Standard: 99.999999999% (11 nines)
S3 stores copies across multiple facilities in the region. Designed to sustain simultaneous loss of two facilities.
10 million objects → expect to lose 1 every 10,000 years.
If you PUT successfully, the data is safe.
Availability: probability you can access it
S3 Standard: 99.99%
About 53 minutes/year of potential unavailability.
Availability failures are transient—retry and it works. Data isn’t lost, just temporarily unreachable.
GET might fail occasionally; data is still there.
High durability doesn’t guarantee high availability. They’re independent properties.
S3 serves as durable storage accessible from any compute resource:

Training runs, writes model to S3, terminates. Serving instances start, read model from S3. Lambda processes uploads. All access the same data. S3 persists regardless of which compute resources exist.
When you launch an EC2 instance, it needs a network. IP address, routing, connectivity to other instances and the internet.
AWS doesn’t put your instance on a shared public network. It goes into a VPC—a Virtual Private Cloud that belongs to your account.
VPC properties:
Other AWS accounts can’t see into your VPC. Traffic between VPCs is isolated by default.
Default VPC:
Every region has a default VPC created automatically. When you launch an instance without specifying networking, it goes here.
For learning and simple deployments, the default VPC works fine. Production environments typically use custom VPCs with deliberate network design.
We’ll use the default VPC.
A VPC has a CIDR block—the range of private IP addresses available within it.
10.0.0.0/16
This notation specifies a range:
10.0.0.0 — starting address/16 — first 16 bits are fixed, remaining 16 bits vary10.0.0.0/16 includes 10.0.0.0 through 10.0.255.255 — 65,536 addresses.
| CIDR | Range | Addresses |
|---|---|---|
| 10.0.0.0/16 | 10.0.0.0 – 10.0.255.255 | 65,536 |
| 10.0.0.0/24 | 10.0.0.0 – 10.0.0.255 | 256 |
| 10.0.1.0/24 | 10.0.1.0 – 10.0.1.255 | 256 |
These are private IP addresses—not routable on the public internet. Within your VPC, instances use these addresses to communicate.
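Python’s standard ipaddress module can sanity-check these ranges, which is handy when planning subnets:

```python
import ipaddress

vpc = ipaddress.ip_network('10.0.0.0/16')
print(vpc.num_addresses)                                     # 65536
print(ipaddress.ip_address('10.0.37.4') in vpc)              # True
print(ipaddress.ip_network('10.0.1.0/24').subnet_of(vpc))    # True
```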
To reach the internet, instances need either:
A VPC spans an entire region. Subnets divide it into segments, each in a specific AZ.

Each subnet:
| Property | Implication |
|---|---|
| Exists in one AZ | Instances in this subnet run in this AZ |
| Has a CIDR block | Subset of VPC’s range (e.g., 10.0.1.0/24 within 10.0.0.0/16) |
| Has a route table | Determines where traffic goes |
| Is public or private | Based on routing, not a flag |
The AZ constraint revisited:
When you launch an EC2 instance, you specify a subnet. The subnet determines the AZ. This is why EBS volumes must match—the volume and instance must be in the same AZ, and subnet selection determines instance AZ.
The terms “public” and “private” describe routing behavior, not a setting you toggle.
Public subnet:
Route table includes:
Destination Target
10.0.0.0/16 local
0.0.0.0/0 igw-xxx ← Internet Gateway
Traffic to addresses outside the VPC routes to the Internet Gateway. Instances with public IPs can receive inbound traffic from the internet.
Private subnet:
Route table includes:
Destination Target
10.0.0.0/16 local
No route to internet. Traffic to external addresses has nowhere to go.
Instances here cannot be reached from the internet—no inbound path exists.
The subnet becomes “public” by having a route to an Internet Gateway. Remove that route, it becomes “private.”
An Internet Gateway (IGW) connects your VPC to the public internet.

IGW is horizontally scaled and highly available—AWS manages it. You attach one to your VPC; it handles the translation between public and private IPs.
An instance in a public subnet still needs a public IP to be reachable from the internet.
Auto-assigned public IP:
Elastic IP:
Useful when you need a stable IP (DNS records, firewall whitelists).
The mapping:
Instance has private IP 10.0.1.5 and public IP 54.x.x.x. The IGW translates—outbound traffic appears from 54.x.x.x, inbound traffic to 54.x.x.x routes to 10.0.1.5.
Instances in private subnets often need outbound internet access—downloading packages, calling external APIs—without being reachable from the internet.
NAT Gateway enables this:

Private subnet route table: 0.0.0.0/0 → nat-xxx. Outbound traffic goes to NAT Gateway (in public subnet), which forwards to IGW. Inbound from internet still has no path to private instances.
NAT Gateway is a managed service with meaningful cost:
| Component | Price |
|---|---|
| Hourly charge | ~$0.045/hour (~$32/month) |
| Data processing | $0.045/GB |
A NAT Gateway running continuously with moderate traffic can cost $50-100/month. For development, consider:
Production typically uses NAT Gateway for reliability. Development often skips it or uses alternatives.
A request from the internet reaching an instance traverses multiple layers:

For traffic to reach an instance:
Security groups control traffic at the instance level. Recap from EC2 section:
Inbound rules:
Type Port Source
────────────────────────────
SSH 22 0.0.0.0/0
HTTP 80 0.0.0.0/0
Custom    5432   10.0.0.0/16
Each rule: allow traffic matching protocol, port, source.
Default: deny all inbound
Stateful behavior:
If inbound request is allowed, response is automatically allowed outbound.
An HTTP request allowed inbound on port 80 can respond without an explicit outbound rule for that connection.
Implication: You typically only configure inbound rules. Outbound defaults to allow-all, and stateful tracking handles responses.
Security groups apply regardless of subnet type. A public subnet instance still needs security group rules to accept traffic.
Security groups can reference other security groups, not just IP ranges:
“Allow traffic from instances in security group sg-loadbalancer” — even if their IPs change.

App servers accept HTTP only from load balancer, regardless of IP changes. Database accepts connections only from app servers.
S3, DynamoDB, and other AWS services exist outside your VPC. By default, traffic goes over the public internet (through IGW or NAT).
VPC Endpoints provide private connectivity:
Gateway Endpoints (S3, DynamoDB):
Destination Target
pl-xxx (S3 prefix) vpce-xxx
Interface Endpoints (most other services):
For private subnets that need S3 access, a Gateway Endpoint avoids NAT Gateway costs and keeps traffic private.
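Creating one is a single call. A hedged CLI sketch (the IDs are placeholders; region assumed us-east-1, and Gateway is the default endpoint type for S3):

```bash
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc1234
```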
Distributing traffic across multiple instances:
Application Load Balancer (ALB):

Users hit one DNS name. ALB distributes requests. If an instance fails health checks, ALB stops sending it traffic. We’ll configure this in later assignments.

Public subnet — internet-facing components
Private subnet — internal components
VPC provides isolation—your network, your rules. Within it:
| Component | Role |
|---|---|
| Subnets | Determine AZ placement and routing behavior |
| Route tables | Define public vs private (IGW route or not) |
| Security groups | Control traffic at instance level |
| NAT/Endpoints | Enable private subnet outbound access |
Networking is foundational. EC2 instances, RDS databases, Lambda functions (in VPC mode), and load balancers all exist within this structure. The choices made here—which subnets, which security groups, which routes—determine what can communicate with what.
EC2 provides virtual machines. S3 provides object storage. These are foundational, but AWS offers other models for computation and data.
This section briefly introduces services you’ll encounter in assignments and later lectures:
| Service | What It Provides |
|---|---|
| Lambda | Run code without managing servers |
| SQS | Message queues |
| SNS | Notification delivery |
| RDS | Managed relational databases |
| DynamoDB | Managed NoSQL database |
We cover what each does and when you might use it. Deeper treatment comes in dedicated lectures on databases and application patterns.
EC2 requires you to provision instances, keep them running, and pay by the hour. Lambda offers a different model: upload code, AWS runs it when triggered.
How it works:
No instances to manage. No servers to patch. Code runs, then nothing exists until next trigger.
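The unit of deployment is a handler function. A minimal sketch (the event shape depends on the trigger):

```python
# handler.py
def lambda_handler(event, context):
    # event: trigger payload (API Gateway request, S3 notification, ...)
    # context: runtime information (request ID, remaining time)
    name = event.get('name', 'world')
    return {'statusCode': 200, 'body': f'Hello, {name}'}
```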
Lambda functions respond to events from various sources:
| Trigger | Example Use |
|---|---|
| API Gateway | HTTP endpoint calls Lambda |
| S3 | Object uploaded → process it |
| Schedule | Run every hour (cron-like) |
| SQS | Message arrives → process it |
| DynamoDB | Record changes → react |
Common patterns:
Lambda suits workloads that are event-driven, short-lived, and don’t need persistent state between invocations.
Lambda trades flexibility for simplicity. Constraints to know:
| Constraint | Limit |
|---|---|
| Execution timeout | 15 minutes maximum |
| Memory | 128 MB to 10 GB |
| Deployment package | 250 MB (unzipped) |
| Concurrency | 1000 default (can request increase) |
| Stateless | No persistent local storage between invocations |
Cold starts: First invocation (or after idle period) takes longer—Lambda must initialize your code. Subsequent invocations reuse the warm environment. Latency-sensitive applications may notice this delay.
Lambda isn’t a replacement for EC2. It’s a different tool for different workload shapes. Long-running processes, persistent connections, or large memory requirements typically need EC2.
Pay for what you execute:
| Component | Price |
|---|---|
| Requests | $0.20 per million |
| Duration | $0.0000166667 per GB-second |
A function using 512 MB running for 200 ms consumes 0.5 GB × 0.2 s = 0.1 GB-seconds per invocation.
1 million invocations ≈ 100,000 GB-seconds × $0.0000166667 ≈ $1.67 in duration charges, plus $0.20 in request charges ≈ $2.
Contrast with EC2: a t3.micro running continuously costs ~$7.50/month regardless of whether it’s doing work. Lambda costs nothing when idle.
SQS (Simple Queue Service) provides message queues—a way for one component to send work to another without direct connection.
The concept:
A queue holds messages. One component (producer) puts messages in. Another component (consumer) takes messages out and processes them.
Producer and consumer don’t communicate directly. The queue sits between them.
sqs = boto3.client('sqs')  # assumes boto3 is imported

# Producer: send message
sqs.send_message(
QueueUrl='https://sqs.../my-queue',
MessageBody='{"task": "process", "id": 123}'
)
# Consumer: receive and process
messages = sqs.receive_message(QueueUrl='...')
for msg in messages.get('Messages', []):
process(msg['Body'])
sqs.delete_message(...) # Acknowledge
Queues enable patterns we’ll study in depth later. For now, the key ideas:
Producer doesn’t wait for consumer
Send a message and continue. If the consumer is slow or temporarily unavailable, messages accumulate in the queue. No work is lost.
Consumer processes at its own pace
Consumer pulls messages when ready. If overwhelmed, messages wait. Can add more consumers to process faster.
Components are independent
Producer and consumer can be deployed, scaled, and updated separately. They agree on message format, not on being available simultaneously.
We’ll explore these patterns in the lecture on asynchronous communication. SQS is the AWS service that implements them.
SNS (Simple Notification Service) delivers messages to subscribers—one message, potentially many recipients.
The concept:
A topic is a channel. Publishers send messages to topics. Subscribers receive messages from topics they’re subscribed to.
One publish → many deliveries.
Subscriber types:

Example: Order placed → publish to “orders” topic → triggers: inventory Lambda, send email confirmation, queue for shipping system. One event, multiple reactions.
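Publishing is one call; SNS handles the fan-out to every subscriber. A sketch (topic ARN illustrative):

```python
import boto3

sns = boto3.client('sns')
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:orders',
    Subject='Order placed',
    Message='{"order_id": "o-123", "status": "placed"}',
)
```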
| | SQS | SNS |
|---|---|---|
| Model | Queue (one consumer per message) | Pub/sub (many subscribers per message) |
| Delivery | Consumer pulls | SNS pushes to subscribers |
| Persistence | Messages wait in queue | Delivery attempted immediately |
| Use case | Work distribution | Event notification |
Often used together: SNS publishes to multiple SQS queues, each processed by different consumer applications.
Applications often need to store structured data—users, orders, inventory—with relationships between records. Relational databases (PostgreSQL, MySQL) provide this.
Running a database yourself requires:
RDS (Relational Database Service) handles this operational work. You get a database endpoint; AWS manages the infrastructure.
What you choose:
What AWS manages:
What you get:
An endpoint:
mydb.abc123.us-east-1.rds.amazonaws.com:5432
Connect with standard database tools and libraries:
import psycopg2
conn = psycopg2.connect(
host='mydb.abc123.us-east-1.rds.amazonaws.com',
database='myapp',
user='admin',
password='...'
)
From your application’s perspective, it’s a PostgreSQL database. The managed part is invisible.
DynamoDB is a different kind of database—NoSQL, specifically key-value and document storage.
Different model from relational:
Serverless pricing model:
Pay per request or provision capacity. No instance to size—scales automatically.
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Users')
# Write item
table.put_item(Item={
'user_id': 'u123',
'name': 'Alice',
'email': 'alice@example.com'
})
# Read item by key
response = table.get_item(Key={'user_id': 'u123'})
user = response['Item']
Access patterns are different from SQL. We’ll cover when each model fits in the database lectures.
| | RDS | DynamoDB |
|---|---|---|
| Model | Relational (SQL) | Key-value / document (NoSQL) |
| Schema | Fixed, defined upfront | Flexible, per-item |
| Queries | SQL, joins, complex queries | Key lookup, limited queries |
| Scaling | Vertical (bigger instance) | Horizontal (automatic) |
| Pricing | Instance hours | Request-based or provisioned |
| Managed | Partially (you choose instance) | Fully (no servers to size) |
Neither is better—they fit different problems. Applications often use both: RDS for relational data with complex queries, DynamoDB for high-scale key-value access.
We’ll study these tradeoffs in the database lectures.
AWS bills based on resource usage. Different resources meter in fundamentally different ways:
| Metering Model | How It Works | Examples |
|---|---|---|
| Time-based | Charge per unit time resource exists/runs | EC2 instances, RDS instances, NAT Gateway |
| Capacity-based | Charge per unit capacity provisioned | EBS volumes, provisioned IOPS |
| Usage-based | Charge per unit actually consumed | S3 storage, S3 requests, Lambda invocations |
| Movement-based | Charge per unit data transferred | Data transfer out, cross-region transfer |
A single deployment involves multiple metering models simultaneously. An EC2 instance incurs time-based charges (compute), capacity-based charges (EBS), and potentially movement-based charges (data transfer).
Understanding the metering model lets you reason about costs before incurring them.
EC2 instances charge per-second while in the running state (60-second minimum).
| Instance Type | Hourly | Monthly (continuous) |
|---|---|---|
| t3.micro | $0.0104 | $7.59 |
| t3.small | $0.0208 | $15.18 |
| t3.medium | $0.0416 | $30.37 |
| m5.large | $0.096 | $70.08 |
| m5.xlarge | $0.192 | $140.16 |
On-Demand pricing, us-east-1, Linux. Other regions ±10-20%.
Instance state determines billing:
| State | Compute Charge | EBS Charge |
|---|---|---|
| running | Yes | Yes |
| stopped | No | Yes |
| terminated | No | No (volume deleted) |
Stopping an instance stops compute charges. The EBS volume still exists and still bills. Terminating ends all charges (root volume deleted by default).
EBS and S3 use different metering models:
EBS: Capacity-based
Charges for provisioned size, not used space.
| Volume Type | Per GB-month |
|---|---|
| gp3 | $0.08 |
| gp2 | $0.10 |
| io2 | $0.125 + IOPS |
A 100 GB gp3 volume: $8/month
Whether you store 1 GB or 100 GB on it, the charge is the same. You’re paying for the capacity you reserved.
S3: Usage-based
Charges for actual storage plus operations.
| Component | Price |
|---|---|
| Storage (Standard) | $0.023/GB-month |
| PUT/POST/LIST | $0.005/1,000 |
| GET/SELECT | $0.0004/1,000 |
100 GB stored: $2.30/month
S3 charges grow with actual data stored. Empty bucket costs nothing.
Data transfer charges based on movement, independent of the resources involved:
| Transfer | Price |
|---|---|
| Inbound from internet | Free |
| Outbound to internet | $0.09/GB |
| Between regions | $0.02/GB |
| Between AZs (same region) | $0.01/GB each direction |
| Within same AZ | Free |
| To S3/DynamoDB (same region) | Free |

Architectures that move large amounts of data across regions or to the internet accumulate transfer costs.
New AWS accounts receive a free tier for 12 months from account creation:
| Service | Monthly Allowance |
|---|---|
| EC2 | 750 hours of t3.micro |
| EBS | 30 GB of gp2/gp3 |
| S3 | 5 GB storage, 20,000 GET, 2,000 PUT |
| RDS | 750 hours of db.t3.micro |
| Data transfer | 100 GB outbound |
The 750-hour boundary:
One month ≈ 730 hours. The free tier covers approximately one t3.micro instance running continuously.
1 × t3.micro × 24h × 31d = 744 hours → within free tier
2 × t3.micro × 24h × 31d = 1,488 hours → 738 hours charged
1 × t3.small × 24h × 31d = 744 hours → all hours charged (t3.small not covered)
Free tier applies to specific instance types. A t3.small incurs full charges regardless of free tier status.
AWS provides mechanisms for tracking costs:
Billing Dashboard
Current month charges by service. Updated multiple times daily.
Cost Explorer
Historical cost data, filtering by service/region/tag, forecasting based on current usage patterns.
Budgets
Configurable thresholds with alert notifications. Set a monthly budget, receive email when approaching or exceeding it.
These mechanisms surface cost information. The billing model is pay-as-you-go—charges accumulate automatically. Monitoring makes accumulation visible.
Your laptop runs ML training as a single process. Data, compute, and storage are all local.
Local training
data = pd.read_csv('dataset.csv') # Local disk
model = train(data) # Local CPU/GPU
torch.save(model, 'model.pth')             # Local disk
Everything shares memory, disk, and failure fate. If the process crashes, everything stops together. If the disk fails, you lose data and model.
One machine, one failure domain.
Distributed training
data = load_from_s3('bucket', 'data.csv') # Network call
model = train(data) # EC2 instance
save_to_s3(model, 'bucket', 'model.pth')   # Network call
Components are networked. S3 can fail while EC2 runs. EC2 can terminate while S3 persists. Network can drop between them.
Data outlives compute. Compute is ephemeral. Network connects (and separates) them.
Every arrow in your architecture diagram is network latency.
Local operations
| Operation | Time |
|---|---|
| Read 1MB from SSD | 1 ms |
| Load pandas DataFrame | 10 ms |
| PyTorch forward pass | 5 ms |
| Write checkpoint to disk | 2 ms |
Total per batch: ~20 ms. Predictable. Consistent.
With S3 in the loop
| Operation | Time |
|---|---|
| Fetch 1MB from S3 | 20–50 ms |
| Load pandas DataFrame | 10 ms |
| PyTorch forward pass | 5 ms |
| Write checkpoint to S3 | 30–100 ms |
Total per batch: 65–165 ms. Variable. Depends on network.
If you checkpoint every batch, you’ve added 3–5× overhead. Checkpoint every 100 batches instead.
Design for latency: batch S3 operations, prefetch data, checkpoint strategically.
Distributed systems fail partially. Each component fails independently.
Independent failure domains
S3 doesn’t know your EC2 instance crashed. EC2 doesn’t know S3 returned an error. You must handle the boundaries.
Failure scenarios
| Event | Data | Model | Training |
|---|---|---|---|
| EC2 terminates | Safe (S3) | Lost (if not saved) | Lost |
| S3 request fails | Retry works | Retry works | Continues |
| OOM on EC2 | Safe (S3) | Lost (if not saved) | Lost |
| Network partition | Safe (S3) | Stuck | Stuck |
The pattern: S3 is durable, EC2 is ephemeral. Save state to S3 frequently enough that losing EC2 is recoverable.
S3 is the durable integration point. EC2 is stateless compute.

Data flow pattern
Why this works
Recovery
No state is lost because state lives in S3, not EC2.
EC2 instances assume IAM roles. No access keys in code.
How it works
What the role needs
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::training-bucket/*"
]
},
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::training-bucket"
}
]
}
Bucket ARN for ListBucket. Object ARN (with /*) for GetObject/PutObject. This distinction matters.
Checkpoints convert EC2 failures from catastrophic to recoverable.
Without checkpoints
With checkpoints every epoch
Checkpoint frequency trade-off
| Frequency | S3 Writes | Recovery Loss | Overhead |
|---|---|---|---|
| Every batch | 10,000/epoch | Seconds | High (latency) |
| Every epoch | 100 total | Minutes | Low |
| Every 10 epochs | 10 total | ~1 hour | Minimal |
Choose based on:
For spot instances: checkpoint at least every 2 minutes or on interruption signal.
Operations should be safe to retry. Network failures mean you often don’t know if something succeeded.
The problem
# Upload model to S3
s3.put_object(Bucket='b', Key='model.pt', Body=data)
# Network timeout. Did it succeed?
# If you retry and it already succeeded, is that okay?
For S3 put_object: yes, safe to retry. Same key overwrites with same content. No harm.
Design for idempotency
Safe to retry:
Not safe to retry (without care):
Pattern: Use deterministic keys
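A sketch of the idea—retrying a write to a deterministic key just overwrites the same object:

```python
import uuid

epoch = 10  # illustrative

# Deterministic: a retry rewrites the same key — idempotent
key = f"checkpoints/epoch_{epoch:04d}.pt"

# Random: every retry creates a new object — duplicates accumulate
bad_key = f"checkpoints/{uuid.uuid4()}.pt"
```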
Three things cost money: compute time, storage, and data transfer.
Compute (EC2)
Design response: Terminate when done. Use spot instances for fault-tolerant work. Don’t leave instances running overnight.
Storage (S3)
Design response: Delete intermediate files. Use lifecycle policies for old checkpoints.
Data transfer
| Path | Cost |
|---|---|
| Into AWS | Free |
| Within AZ | Free |
| Cross-AZ (same region) | $0.01/GB |
| Cross-region | $0.02–0.09/GB |
| Out to internet | $0.09/GB |
Design response: Keep data and compute in the same region. Avoid repeatedly downloading large datasets from S3 to local machine.
Example costs for a training job
| Resource | Usage | Cost |
|---|---|---|
| EC2 p3.2xlarge | 8 hours | $24.48 |
| S3 storage | 50 GB/month | $1.15 |
| Data transfer | Within region | $0 |
Compute dominates. Optimize instance usage first.
AWS charges your credit card. Resources cost money from the moment they’re created.
What catches students
| Mistake | Cost |
|---|---|
| p3.2xlarge left running over weekend | $220 |
| Forgot to delete 500GB S3 bucket | $12/month forever |
| Auto-scaling launched 20 instances | $50/hour |
| Cross-region replication enabled | $45 transfer |
These are real examples from students.
Protection measures
Free tier limits
| Service | Free Amount |
|---|---|
| EC2 t2.micro | 750 hours/month |
| S3 | 5 GB storage |
| Data transfer out | 100 GB/month |
Beyond free tier, you pay.
Safe practices
t3.micro for development/testing
Structure enables automation. Predictable paths enable programmatic access.
ml-project-{username}/
├── data/
│ ├── raw/ # Original, immutable
│ │ └── dataset_v1.csv
│ └── processed/ # Transformed, ready for training
│ ├── train.parquet
│ └── test.parquet
├── checkpoints/
│ └── experiment_001/
│ ├── epoch_0010.pt
│ ├── epoch_0020.pt
│ └── epoch_0030.pt
├── models/
│ └── experiment_001/
│ └── final.pt
└── logs/
└── experiment_001/
    └── training.log
Conventions that help:
Zero-padded numbers (epoch_0010 not epoch_10) — sorts correctly
A training job as cloud operations:
Startup
Training loop
Completion
What can fail and what happens
| Failure | Impact | Recovery |
|---|---|---|
| S3 read fails | Training can’t start | Retry with backoff |
| S3 write fails | Checkpoint lost | Retry; if persistent, alert |
| EC2 terminates | Training stops | New instance + last checkpoint |
| OOM | Process crashes | Reduce batch size, restart |
| Code bug | Process crashes | Fix bug, restart from checkpoint |
Every failure mode has a recovery path because:
Two approaches: download to file, or stream into memory.
Download to local file
import boto3
s3 = boto3.client('s3')
# Download to local filesystem
s3.download_file(
Bucket='my-bucket',
Key='data/training.csv',
Filename='/tmp/training.csv'
)
# Then read locally
import pandas as pd
df = pd.read_csv('/tmp/training.csv')
Use when:
Stream directly into memory
import boto3
import pandas as pd
from io import BytesIO
s3 = boto3.client('s3')
# Get object returns a streaming body
response = s3.get_object(
Bucket='my-bucket',
Key='data/training.csv'
)
# Read directly into pandas
df = pd.read_csv(response['Body'])
Use when:
Both approaches use the same IAM permissions: s3:GetObject on the object ARN.
Upload from file or from memory buffer.
Upload from file
import boto3
s3 = boto3.client('s3')
# Upload a local file
s3.upload_file(
Filename='model.pt',
Bucket='my-bucket',
Key='models/experiment_001/final.pt'
)
For large files (>100MB), upload_file automatically uses multipart upload.
Upload with metadata
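A sketch using ExtraArgs (the metadata keys and values are illustrative):

```python
import boto3

s3 = boto3.client('s3')
s3.upload_file(
    Filename='model.pt',
    Bucket='my-bucket',
    Key='models/experiment_001/final.pt',
    ExtraArgs={
        'Metadata': {'experiment': '001', 'epochs': '100'},
        'ContentType': 'application/octet-stream',
    },
)
```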
Upload from memory
import boto3
from io import BytesIO
import torch
s3 = boto3.client('s3')
# Save model to memory buffer
buffer = BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0) # Rewind to beginning
# Upload buffer contents
s3.put_object(
Bucket='my-bucket',
Key='models/final.pt',
Body=buffer.getvalue()
)
Use when:
Requires s3:PutObject on the object ARN.
List operations return metadata, not contents.
List objects with a prefix
import boto3
s3 = boto3.client('s3')
response = s3.list_objects_v2(
Bucket='my-bucket',
Prefix='checkpoints/experiment_001/'
)
for obj in response.get('Contents', []):
print(f"{obj['Key']}: {obj['Size']} bytes")Output:
Requires s3:ListBucket on the bucket ARN (not object ARN).
S3 operations can fail. Handle transient errors with retries.
Common errors
from botocore.exceptions import ClientError
try:
s3.download_file('bucket', 'key', 'local')
except ClientError as e:
error_code = e.response['Error']['Code']
if error_code == 'NoSuchKey':
# Object doesn't exist
print("File not found in S3")
elif error_code == 'AccessDenied':
# Permission issue
print("Check IAM policy")
elif error_code == '403':
# Often bucket vs object ARN issue
print("Check resource ARN in policy")
else:
raise
Retry with backoff
import time
from botocore.exceptions import ClientError
def download_with_retry(bucket, key, local, max_retries=3):
for attempt in range(max_retries):
try:
s3.download_file(bucket, key, local)
return # Success
except ClientError as e:
error_code = e.response['Error']['Code']
# Don't retry permanent errors
if error_code in ['NoSuchKey', 'AccessDenied']:
raise
# Retry transient errors
if attempt < max_retries - 1:
wait = 2 ** attempt # 1, 2, 4 seconds
time.sleep(wait)
else:
raise
boto3 has built-in retry logic for some errors, but explicit handling gives you control.
Code running on EC2 can query information about itself.
Metadata endpoint
import requests
# Instance identity
response = requests.get(
'http://169.254.169.254/latest/meta-data/instance-id',
timeout=1
)
instance_id = response.text
# e.g., "i-0abc123def456"
# Current region
response = requests.get(
'http://169.254.169.254/latest/meta-data/placement/region',
timeout=1
)
region = response.text
# e.g., "us-east-1"
# Instance type
response = requests.get(
'http://169.254.169.254/latest/meta-data/instance-type',
timeout=1
)
instance_type = response.text
# e.g., "t3.medium"
Detect if running on EC2
def is_running_on_ec2():
try:
requests.get(
'http://169.254.169.254/latest/meta-data/',
timeout=0.5
)
return True
except requests.exceptions.RequestException:
return False
# Use different config based on environment
if is_running_on_ec2():
# Use instance role credentials (automatic)
s3 = boto3.client('s3')
else:
# Use local credentials file
s3 = boto3.client('s3')  # Same code, different source
The SDK handles credential discovery automatically, but knowing the environment can help with configuration.
Metadata endpoint only works from within EC2. Times out elsewhere.
Spot instances can be reclaimed with 2 minutes warning.
Check for interruption notice
import requests
def check_spot_interruption():
"""Returns termination time if spot will be interrupted."""
try:
response = requests.get(
'http://169.254.169.254/latest/meta-data/'
'spot/instance-action',
timeout=1
)
if response.status_code == 200:
data = response.json()
return data.get('time') # Termination time
except requests.exceptions.RequestException:
pass
return None
Returns None normally. Returns timestamp when termination is imminent.
Graceful training loop
def train_with_interruption_handling(model, data):
for epoch in range(num_epochs):
# Check before each epoch
if check_spot_interruption():
print("Spot interruption! Saving checkpoint...")
save_checkpoint(model, epoch)
return "interrupted"
# Train one epoch
train_epoch(model, data)
# Regular checkpoint
if epoch % checkpoint_frequency == 0:
save_checkpoint(model, epoch)
return "completed"With 2-minute warning, you have time to save state. Don’t ignore it.
boto3 searches for credentials in order. First match wins.
Search order
Explicit in code (don’t do this)
Environment variables
Credentials file (~/.aws/credentials)
Config file (~/.aws/config)
Instance metadata (EC2 role)
Container credentials (ECS/EKS)
Best practice by environment
| Environment | Credential Source |
|---|---|
| Local dev | ~/.aws/credentials file |
| EC2 instance | Instance profile (IAM role) |
| Lambda | Execution role (automatic) |
| CI/CD | Environment variables |
Never in code. Keys in code get committed to git, leaked, compromised.
Verify what you’re using
Permission errors have patterns. Learn to read them.
“Access Denied” on S3
botocore.exceptions.ClientError:
An error occurred (AccessDenied) when calling
the GetObject operation: Access Denied
Checklist:
Does the attached policy include the action (s3:GetObject)?
The bucket vs object ARN trap
Wrong. GetObject needs object ARN:
“Access Denied” on ListBucket
ListBucket needs bucket ARN:
Not object ARN:
Complete policy for read/write
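Mirroring the pattern shown earlier (bucket name illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}
```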
Different environment, different credentials, different permissions.
Check who you are
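The standard check from the CLI:

```bash
aws sts get-caller-identity
```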
Locally: shows your IAM user. On EC2: shows the instance role.
Different identities have different permissions.
Check what region
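Two quick checks:

```bash
# Region the CLI/SDK will default to
aws configure get region

# Region the instance is actually running in (from inside EC2)
curl http://169.254.169.254/latest/meta-data/placement/region
```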
S3 buckets are regional. EC2 and bucket must agree (or you pay transfer costs and add latency).
Common causes
| Symptom | Likely Cause |
|---|---|
| Access Denied | Role missing permission |
| No Credentials | Instance has no role attached |
| Bucket not found | Wrong region configured |
| Timeout | Security group blocks outbound |
Verify instance role
# From the EC2 instance
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Should return role name, e.g.:
# EC2-S3-Access-Role
If empty, no role attached. Attach one in EC2 console.
Network problems manifest as hangs or timeouts.
S3 operations hang
Possible causes:
nslookup s3.amazonaws.com
Quick network test
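A couple of quick probes from the instance (endpoint assumed us-east-1):

```bash
# DNS resolution
nslookup s3.us-east-1.amazonaws.com

# TCP/HTTPS reachability with a short timeout
curl -I --max-time 5 https://s3.us-east-1.amazonaws.com
```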
Set timeouts explicitly
from botocore.config import Config
config = Config(
connect_timeout=5,
read_timeout=30,
retries={'max_attempts': 3}
)
s3 = boto3.client('s3', config=config)
Without explicit timeouts, operations can hang indefinitely.
VPC endpoint for S3
If in a private subnet without NAT:
VPC Endpoint (Gateway type) for S3
├── No NAT gateway needed
├── No internet gateway needed
├── Traffic stays in AWS network
└── Often faster and cheaper
Check your VPC configuration if S3 access isn’t working from private subnets.
EC2 instances have finite memory. Training can exhaust it.
Monitor memory usage
import psutil
def log_memory():
mem = psutil.virtual_memory()
print(f"Memory: {mem.used / 1e9:.1f}GB / "
f"{mem.total / 1e9:.1f}GB "
f"({mem.percent}%)")
# Call periodically during training
for epoch in range(num_epochs):
log_memory()
train_epoch(model, data)
From command line
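Typical checks on a Linux instance:

```bash
# Overall memory usage
free -h

# Processes sorted by memory consumption
top -o %MEM
```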
Common causes
| Cause | Solution |
|---|---|
| Batch size too large | Reduce batch size |
| Loading full dataset | Use data loader with batching |
| Accumulating history | Clear gradients, don’t store all losses |
| Memory leak | Check for growing lists/dicts |
Reduce memory usage
# Store scalar values, not loss tensors (tensors keep the computation graph alive)
losses = []
for batch in data:
loss = train_step(batch)
losses.append(loss.item()) # .item() not loss
# Clear GPU memory
torch.cuda.empty_cache()
# Use gradient checkpointing for large models
model.gradient_checkpointing_enable()
If still OOM: use a larger instance type, or redesign to process in smaller chunks.
Putting the patterns together.
import io
import time
import boto3
import pandas as pd
import torch
from botocore.exceptions import ClientError
def load_checkpoint(s3, bucket, prefix):
"""Load latest checkpoint if exists."""
try:
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
if not response.get('Contents'):
return None, 0
latest_key = sorted(response['Contents'], key=lambda x: x['Key'])[-1]['Key']
obj = s3.get_object(Bucket=bucket, Key=latest_key)
checkpoint = torch.load(io.BytesIO(obj['Body'].read()))  # buffer the stream—torch.load needs a seekable file
epoch = checkpoint['epoch']
print(f"Resumed from {latest_key} (epoch {epoch})")
return checkpoint, epoch
except ClientError:
return None, 0
def save_checkpoint(s3, bucket, model, epoch):
"""Save checkpoint to S3."""
checkpoint = {'epoch': epoch, 'model_state': model.state_dict()}
buffer = io.BytesIO()
torch.save(checkpoint, buffer)
buffer.seek(0)
key = f"checkpoints/epoch_{epoch:04d}.pt"
s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
print(f"Saved checkpoint: {key}")
def train():
s3 = boto3.client('s3')
bucket = 'my-training-bucket'
# Load data
obj = s3.get_object(Bucket=bucket, Key='data/train.csv')
data = pd.read_csv(obj['Body'])
# Initialize or resume
model = MyModel()
checkpoint, start_epoch = load_checkpoint(s3, bucket, 'checkpoints/')
if checkpoint:
model.load_state_dict(checkpoint['model_state'])
# Training loop
for epoch in range(start_epoch, 100):
train_epoch(model, data)
if epoch % 10 == 0:
save_checkpoint(s3, bucket, model, epoch)
# Save final model
save_checkpoint(s3, bucket, model, 100)
if __name__ == '__main__':
train()
Loading model from S3, serving predictions.
from flask import Flask, request, jsonify
import boto3
import torch
from io import BytesIO
app = Flask(__name__)
model = None
def load_model():
"""Load model from S3 once at startup."""
s3 = boto3.client('s3')
obj = s3.get_object(
Bucket='my-bucket',
Key='models/final.pt'
)
buffer = BytesIO(obj['Body'].read())
model = MyModel()
model.load_state_dict(torch.load(buffer, map_location='cpu'))
model.eval()
return model
@app.route('/health')
def health():
return jsonify({'status': 'healthy', 'model_loaded': model is not None})
@app.route('/predict', methods=['POST'])
def predict():
if model is None:
return jsonify({'error': 'Model not loaded'}), 503
try:
data = request.json
features = torch.tensor(data['features'])
with torch.no_grad():
output = model(features)
return jsonify({'prediction': output.tolist()})
except KeyError as e:
return jsonify({'error': f'Missing field: {e}'}), 400
except Exception as e:
return jsonify({'error': str(e)}), 500
# Load model at startup
model = load_model()
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)