
EE 547 - Unit 1
Spring 2026
A web application in production:
Each component runs as a separate process, often on separate machines. No component can directly access another’s memory.
This is the norm for modern applications. The single-machine program is the exception, not the rule.

When the application server needs user data, it cannot read the database’s memory. It sends a request and waits for a response.
App Server Database
| |
|-------- request --------->|
| |
| (processing) |
| |
|<------- response ---------|
| |
Every interaction between components follows this pattern. Query results, cache lookups, API calls—all require explicit message exchange.
The application server’s code pauses, waiting for bytes to arrive over the network.

The network is not a function call. When the app server sends a query:
The request might not arrive. Router failure, network partition, packet corruption.
The request might arrive late. Congestion, retransmission, routing changes.
The response might not arrive. Database processed the query, but the response was lost.
Messages might arrive out of order. Request A sent before B, but B arrives first.
None of these failure modes exist in a single process. A function call either executes or the process crashes. There is no “maybe it executed.”

A function call has two outcomes: return a value, or throw an exception. A network request has three.
Success
Request sent. Server processed. Response received.
Client knows the operation completed.
Failure
Request sent. Server returned error. Error received.
Client knows it failed.
Unknown
Request sent. No response received.
Client cannot tell whether the server processed the operation.
The unknown case forces difficult decisions. Retry and risk duplicate execution. Don’t retry and risk losing the operation.
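A sketch of what the three outcomes look like from the calling side (assuming the requests library; the URL and the 2-second timeout are illustrative):

```python
import requests

try:
    resp = requests.get("https://api.example.internal/users/42", timeout=2)
    resp.raise_for_status()
    print("success:", resp.json())      # outcome 1: known success
except requests.HTTPError as err:
    print("failure:", err)              # outcome 2: known failure
except requests.Timeout:
    # outcome 3: unknown. The server may or may not have processed the request.
    print("no response within 2 seconds; outcome unknown")
```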

A single-machine program either runs or doesn’t. A distributed system occupies states in between.
Some requests succeed. Others fail. Others succeed slowly. The outcome depends on which path a request takes through the system.
Every component must be written assuming the components it depends on might be unavailable—right now, for this request, even if they worked a moment ago.
Machine A’s clock: 10:00:00.000
Machine B’s clock: 10:00:00.003
A sends a message at its 10:00:00.000. B receives it at its 10:00:00.002.
Meanwhile, B had a local event at its 10:00:00.001.
Did A’s send happen before or after B’s event? The timestamps suggest A sent first (10:00:00.000 < 10:00:00.001), but A’s clock might be behind B’s.
Clocks drift. Synchronization protocols get them close—milliseconds—but not identical. For operations that happen within milliseconds of each other, timestamps cannot determine order.
In a single process, “before” means “earlier in the instruction sequence.” Across machines, there is no shared instruction sequence to consult.

No shared memory. Network uncertainty. Partial failure. No global clock.
These arise because components run on separate machines, communicating over a network.
But consider: are these really about the network? Or about something more fundamental?
A useful test: what if we remove the network?
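To make the single-process case concrete, a minimal loop (the values are illustrative):

```python
values = [10, 20, 30, 40]
total = 0
for v in values:
    total += v      # each statement runs only after the previous one completes
print(total)        # prints 100: exactly what was last written
```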
Each statement executes after the previous completes.
The variable total exists in one place. Reading it returns what was last written.
If the loop crashes, the entire process stops. No ambiguity about what state persists.
These properties require no effort from the programmer. The language runtime provides them automatically.
One thread of execution. Shared memory. Single failure domain. Coordination is implicit.

A process can contain multiple threads. Each thread:
The operating system schedules threads, switching between them. From each thread’s perspective, it runs continuously. In reality, threads interleave.
Threads enable parallelism within a process. Multiple CPU cores can execute different threads simultaneously. A web server might handle each request in a separate thread.

# Shared state
total = 100

# Thread A              # Thread B
read total → 100        read total → 100
add 50 → 150            add 30 → 130
write total ← 150       write total ← 130

Both threads read total = 100. Thread A computes 150, Thread B computes 130. Whichever writes last wins.
Expected result: 180.
Actual result: 130 or 150.
Both threads run on the same CPU. Memory access takes nanoseconds. The problem is not speed. Two agents modified shared state without coordination.
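The same lost update, as a runnable Python sketch (iteration counts are arbitrary; the printed result varies from run to run):

```python
import threading

total = 0

def add(amount, repeats):
    global total
    for _ in range(repeats):
        current = total            # read
        total = current + amount   # write: not atomic with the read above

a = threading.Thread(target=add, args=(1, 1_000_000))
b = threading.Thread(target=add, args=(1, 1_000_000))
a.start(); b.start()
a.join(); b.join()

print(total)   # expected 2,000,000; typically less because interleaved writes are lost
```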

Deadlock: Two locks, opposite order
Thread A holds lock_1, waits for lock_2. Thread B holds lock_2, waits for lock_1. Neither proceeds.
Lost update: Check-then-act
Visibility: Cached state
Thread A writes flag = True. Thread B reads flag = False. Each thread’s view of memory differs due to CPU caching.
These are coordination failures. No network involved. Shared memory. Nanosecond access. Still broken.

Threading demonstrates the point: coordination challenges exist even when:
The difficulty is reasoning about concurrent access to shared state. Multiple agents, acting independently, can interfere with each other.
Adding network distribution makes coordination harder—but it doesn’t create the fundamental problem. The fundamental problem is concurrency itself.
Distribution amplifies coordination challenges. It doesn’t invent them.
Threads have:
Shared memory (can read each other’s state)
Same clock (CPU timestamp counter)
Atomic failure (process crashes entirely)
Synchronous primitives (locks, semaphores)
Distributed components have:
No shared memory (must send messages)
No common clock (clocks drift independently)
Partial failure (some components fail, others continue)
Only asynchronous messages (no global locks)

When Thread A calls a function, the function executes or the process crashes. There is no intermediate state.
When Component A sends a message to Component B:
The message might never arrive. A could wait forever.
B might process the message and crash before responding. A doesn’t know if the operation succeeded.
B might be slow. A cannot distinguish “slow” from “dead.”
A times out and retries. But B processed the first request. Now the operation happens twice.
This uncertainty—not knowing whether a remote operation succeeded—has no equivalent in local programming. It requires fundamentally different patterns.
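One common mitigation is to make retries safe by attaching an idempotency key, so the receiver can detect and deduplicate a repeated request. A sketch (the endpoint, header name, and retry count are assumptions):

```python
import uuid
import requests

def create_order(order, max_attempts=3):
    key = str(uuid.uuid4())   # reused across retries so the server can deduplicate
    for attempt in range(max_attempts):
        try:
            resp = requests.post(
                "https://api.example.internal/orders",
                json=order,
                headers={"Idempotency-Key": key},
                timeout=2,
            )
            resp.raise_for_status()
            return resp.json()
        except (requests.Timeout, requests.ConnectionError):
            continue   # unknown outcome; safe to retry because the key is unchanged
    raise RuntimeError("no confirmed response after retries")
```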

In a single process, failure is total. If any thread causes a segmentation fault, the entire process terminates. All threads die together.
In a distributed system, failure is partial. The database crashes. The web server continues running. It sends queries into a void and waits for responses that will never come.
Partial failure means:
Every component must handle the ongoing uncertainty of whether its dependencies are available, right now, for this operation.

Two containers on a laptop. Two services in a datacenter. Two thousand instances across regions.
Web app calling a database
Application sends INSERT. Connection drops.
This happens whether the database is on the same machine, in the same rack, or across the world. The uncertainty is identical.
Background worker processing jobs
Worker pulls job from queue. Worker crashes.
Same coordination problem at any scale. The number of machines changes the probability of failure, not the nature of it.
The patterns for handling uncertainty at small scale are the same patterns used at large scale.
If coordination challenges were only about scale, you’d need a datacenter to learn distributed systems.
But the challenges are about structure, not size. A laptop running containers exhibits:
The same architectural patterns apply. The same failure modes appear. The same solutions work.

Isolation and packaging
Containers provide boundaries where components can fail independently. Orchestration manages component lifecycle.
Communication
APIs and protocols define how components exchange messages. HTTP, REST, and other patterns structure coordination.
State management
Databases provide coordination mechanisms for shared state. Storage systems handle persistence across failures.
Handling uncertainty
Timeouts, retries, and error handling manage the unknown outcomes of remote operations.
Composition
Services combine into systems where coordination requirements are explicit in the architecture.
Observability
Monitoring and logging make distributed behavior visible. Cannot attach a debugger across twelve machines.
Distribution creates coordination challenges. The course teaches patterns for managing them.
Coordination across components requires:
A single process avoids all of this. The runtime handles coordination implicitly.
Given the added complexity, distribution requires justification. What do separate components provide that a single process cannot?
A database engine optimizes for:
An HTTP server optimizes for:
Building both into one program means neither reaches its potential. Separate components allow focused optimization.

PostgreSQL: first released 1996. Continuous development since then.
Replicating this in application code: impractical. The specialization represents decades of focused work on a specific domain.
The same applies to web servers (nginx, Apache), caches (Redis, Memcached), message queues (RabbitMQ, Kafka), search engines (Elasticsearch), and every other infrastructure component.
Using specialized components means building on accumulated expertise rather than starting from scratch.
Database
High storage (data persists)
Moderate memory (buffer pool, caches)
Variable CPU (query execution)
Web/Application server
Low storage (stateless)
Moderate memory (request handling)
Variable CPU (business logic)
Cache
No persistent storage
High memory (that’s the point)
Low CPU (simple operations)
Background worker
Low storage
Low memory (between jobs)
High CPU (during processing)

Bundling components together means provisioning for the maximum of each resource type across all workloads.
Combined deployment on one machine:
Separate deployments:
Total resources can be lower because each component gets what it needs, not what the most demanding component requires.
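A hypothetical illustration: if the database needs 500 GB of disk and 16 GB of RAM, the cache needs 64 GB of RAM and no disk, and the worker needs 16 cores only while jobs run, a single combined machine must carry 500 GB of disk, roughly 80 GB of RAM, and 16 cores at all times. Separate deployments provision each figure only on the component that actually uses it.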
When one component needs more, scale that component. Others remain unchanged.


Web traffic peaks during business hours. Batch jobs run overnight when web traffic is low.
Combined deployment: must handle peak web traffic + batch processing simultaneously (even though they naturally offset).
Separate deployment: web servers scale up during day, batch workers scale up at night. Total capacity can be lower because peaks don’t overlap.

Process boundaries create failure domains. A crash in one process cannot corrupt memory in another.
| Component | Has Access To | Does Not Have |
|---|---|---|
| Web Server | TLS certificates | Database credentials |
| App Server | Database credentials | Payment API keys |
| Payment Service | Payment API keys | User session data |
| Background Worker | Job queue credentials | Direct user access |
Compromise of one component limits exposure. Attacker who breaches the web server cannot read the database directly—must breach the app server separately.
Each boundary requires a separate exploit. Defense multiplies.

External interface defines the contract. Internal changes—database migrations, caching strategies, validation logic—are invisible to clients.
| Scenario | Interface | Old Implementation | New Implementation | Client Change |
|---|---|---|---|---|
| Database migration | SQL protocol | MySQL | PostgreSQL | Connection string |
| Managed cache | Redis protocol | Self-hosted Redis | ElastiCache | Endpoint config |
| Search upgrade | Query API | Custom code | Elasticsearch | API adapter |
| Email provider | SMTP / HTTP API | Self-hosted | SendGrid | Credentials |
Standard interfaces mean implementations are interchangeable. The component behind the interface can change—upgraded, replaced, outsourced—without rewriting clients.

Writing 10,000 lines to compose millions of lines of specialized infrastructure.

Scaling requires decomposing along boundaries that allow independent scaling. Distribution makes decomposition possible. Architecture determines whether scaling works.
| Benefit | Cost |
|---|---|
| Specialized components | Explicit coordination |
| Independent resource allocation | Network uncertainty |
| Failure containment | Partial failure handling |
| Security boundaries | Operational complexity |
| Deployment independence | Interface management |
| Composition from existing systems | Distributed debugging |
The benefits must outweigh the costs for the specific system being built. A system that doesn’t need failure containment, security boundaries, or independent scaling pays the coordination cost without receiving the benefit.
Traditional model:
Cloud model:
You don’t own resources. You consume them through APIs.

When you request a database, you’re calling a service. That service coordinates with storage services, networking services, monitoring services. Distributed systems all the way down.
Storage is not a local disk. It’s a service at a URL.
https://s3.us-east-1.amazonaws.com/my-bucket/my-file.csv
Database is not a local process. It’s a service at a hostname.
my-database.abc123.us-east-1.rds.amazonaws.com:5432
Your application component makes network requests to:
Same communication model. Same coordination challenges. Some endpoints you wrote, some you’re renting.
Creating a storage bucket:
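For example, with boto3 (a sketch; the bucket name is illustrative and region configuration is omitted):

```python
import boto3

s3 = boto3.client("s3")
s3.create_bucket(Bucket="my-application-data")   # one API call: a message to a service
```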
What happens:
You sent a message. A distributed system processed it. Resources appeared.
import json
import boto3

s3 = boto3.client('s3')                    # client for the S3 service endpoint
user_data = {'id': 12345, 'name': 'Ada'}   # illustrative payload

# Write
s3.put_object(
    Bucket='my-application-data',
    Key='users/12345.json',
    Body=json.dumps(user_data)
)

# Read
response = s3.get_object(
    Bucket='my-application-data',
    Key='users/12345.json'
)
user_data = json.loads(response['Body'].read())

Every read and write is a network request. The storage service handles:
The coordination problems exist. The provider solves them. You pay for the service.
Hardware disk fails:
Physical server overheats:
Network switch fails in datacenter:
Storage service returns error:
Database connection times out:
Service in another region unreachable:
The provider guarantees their components work. They don’t guarantee your application handles failures correctly.

When something breaks, knowing which side of the boundary helps:
Describing what you want:
# "I want a PostgreSQL database"
database:
engine: postgres
version: "14"
storage: 100GB
backup: dailyProvider figures out:
You declare intent. Provider handles implementation.
# docker-compose.yml
services:
  web:
    image: nginx
    ports:
      - "80:80"
  api:
    build: ./api
    environment:
      DATABASE_URL: postgres://db:5432/app
  db:
    image: postgres:14
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:

Three services. Their relationships. Their configuration.
Run docker-compose up. Infrastructure exists.
Change the file. Run again. Infrastructure updates.
Run docker-compose down. Infrastructure gone.
Fixed hardware:
Resources as services:
Scaling is an API call, not a procurement process.
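For instance, with boto3 (a sketch; the Auto Scaling group name is hypothetical):

```python
import boto3

autoscaling = boto3.client("autoscaling")
# Ask for ten instances in an existing group; the provider converges to that count.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="web-tier",
    DesiredCapacity=10,
)
```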
Provider operates datacenters in multiple regions:
| Region | Location | Use Case |
|---|---|---|
| us-east-1 | N. Virginia | Primary US |
| eu-west-1 | Ireland | European users, GDPR |
| ap-northeast-1 | Tokyo | Asian users |
Deploy to a region by specifying it in configuration:
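A sketch of what that looks like from code (assuming boto3; the bucket name is hypothetical):

```python
import boto3

# The same client code, pointed at a different region purely through configuration.
s3_eu = boto3.client("s3", region_name="eu-west-1")
s3_eu.put_object(Bucket="my-application-data-eu", Key="ping.txt", Body=b"hello")
```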
Physical presence in Frankfurt, Tokyo, São Paulo—without owning datacenters there.
| Traditional | Cloud |
|---|---|
| Own hardware | Rent services |
| Build infrastructure | Configure infrastructure |
| Fixed capacity | Elastic capacity |
| Capital expense | Operational expense |
| Provision in weeks | Provision in seconds |
| Limited geography | Global presence |
What doesn’t change:
Cloud shifts where the infrastructure comes from. It doesn’t eliminate the distributed systems problems—it gives you building blocks to construct solutions.




Workload A cannot detect that B and C exist. It sees a complete machine with its allocated resources. The isolation is invisible from inside.
Virtualization creates an abstraction that behaves like the real thing.

Software interacting with a virtual resource doesn’t know it’s virtual. The abstraction is complete.
A hypervisor sits between physical hardware and guest operating systems.

Each VM gets virtualized CPU, RAM, disk, network. Guest OS runs unmodified—it believes it has real hardware.

The guest OS issues normal hardware instructions. The hypervisor intercepts and redirects them to the VM’s allocated resources. Isolation enforced transparently.
Instead of virtualizing hardware, containers virtualize the OS environment.

All containers share the same kernel. Each container sees an isolated view of the filesystem, process list, network.

Kernel tracks all processes. Each container’s view is filtered—it sees only its own processes, starting from PID 1.

VM isolation at hardware boundary (hypervisor). Container isolation at OS boundary (kernel features).
Containers are the practical deployment unit:
Each component of a distributed application runs in its own container. Containers communicate over the network—same model as Section 1.
Later lectures cover container mechanics in depth. For now: containers provide isolation boundaries for application components.



Coordination requires explicit communication. Every piece of shared information must be copied across the network.
In a distributed system, components coordinate exclusively through messages:
Every coordination mechanism reduces to: send a message, receive a response.
This is fundamentally different from single-process coordination where threads can share memory, signals, and synchronization primitives.


Timeouts are guesses. A timeout expiring means “no response yet” - not “server is dead.”
A protocol establishes rules both parties agree to follow:
What protocols specify:
Why protocols matter:
Protocols are contracts. HTTP, TCP, DNS, SMTP - each defines a specific contract for a specific purpose.

We work primarily at the application layer. TCP handles reliable delivery underneath.
HTTP (Hypertext Transfer Protocol) is widely used for service communication:
Originally designed for web browsers. Now the default for APIs, cloud services, and service-to-service communication.
Details of HTTP syntax and semantics come in later lectures when we build APIs.

Simple model: send, wait, receive. But waiting time is wasted if client could do other work.

Synchronous
Asynchronous
Neither is universally better. Choice depends on:

Asynchronous communication isolates failures temporally. But errors still need handling eventually.
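A minimal sketch of the difference, using asyncio.sleep as a stand-in for waiting on a remote response (timings are approximate):

```python
import asyncio
import time

async def call_service(name, seconds):
    await asyncio.sleep(seconds)    # stand-in for network I/O
    return f"{name} done"

async def main():
    start = time.perf_counter()
    # Synchronous style: wait for each call before starting the next (~2s total).
    await call_service("inventory", 1)
    await call_service("pricing", 1)
    print("sequential:", round(time.perf_counter() - start, 1), "s")

    start = time.perf_counter()
    # Asynchronous style: issue both, then wait for both (~1s total).
    await asyncio.gather(call_service("inventory", 1), call_service("pricing", 1))
    print("concurrent:", round(time.perf_counter() - start, 1), "s")

asyncio.run(main())
```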
How components communicate affects everything:
Every inter-component boundary is a communication decision. These decisions accumulate into system-wide characteristics.
Later lectures address these trade-offs in depth: API design, async patterns, failure handling strategies.
A request like get user 42 expects user 42 to already exist:
This is state - persistent information that requests read and modify.
In a distributed system, the component making a request may be on a different machine than where the data lives.


Deployments, crashes, scaling events, orchestrator decisions - all clear memory.
For ephemeral data (request context, short caches), this is fine. For user data, orders, anything important - memory-only is unacceptable.
When multiple components need the same data, it must live somewhere all can reach:

Durability
Must survive component restarts
→ Cannot live only in memory
Accessibility
Must be reachable by multiple components
→ Must be network-accessible

Components read/write to external store. Components can restart without losing data.
A database centralizes state management for distributed components:
| Role | What It Provides |
|---|---|
| Location | Single endpoint all services reach |
| Durability | Data survives component failures |
| Concurrency | Handles simultaneous readers/writers |
| Query | Find by ID, search, filter, aggregate |
Without this, each service would implement its own storage, replication, concurrency control, and query logic.

Both read 100, compute independently, write back. B’s write overwrites A’s change.
Threads (shared memory)
Distributed state (shared database)
Same fundamental problem: independent agents accessing shared mutable state.
Databases provide coordination mechanisms (transactions, locking, isolation). Using them correctly requires understanding why the problem exists.
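One illustration of letting the database coordinate: push the read-modify-write into a single atomic statement instead of doing it in application code. A sketch using sqlite3 only so it runs standalone; the same idea applies to transactions in PostgreSQL or MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES (1, 100)")

# Racy pattern: read in the application, compute, write back.
# Two clients doing this concurrently can overwrite each other's update.
(value,) = conn.execute("SELECT value FROM counters WHERE id = 1").fetchone()
conn.execute("UPDATE counters SET value = ? WHERE id = 1", (value + 50,))

# Safer pattern: the database applies the increment atomically.
conn.execute("UPDATE counters SET value = value + 30 WHERE id = 1")
conn.commit()

print(conn.execute("SELECT value FROM counters WHERE id = 1").fetchone())   # (180,)
```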
One application may have fundamentally different kinds of data:

A storage system optimized for one pattern may perform poorly for another.
Relational (PostgreSQL, MySQL)
Key-Value (Redis, DynamoDB)
Document (MongoDB)
Specialized
Relational for everything
Embedding similarity search: works, but 100x slower than a vector DB
Time-series at scale: complex partitioning, custom indexing
Sessions: paying for ACID guarantees you don’t need
Specialized for everything
Key-value for complex queries: joins in application code
Multiple systems: operational overhead, data sync challenges
Wrong tool for simple cases: complexity without benefit
The question isn’t “which database is best” but “what does this specific state require.”
Single database → potential bottleneck, single point of failure.
Replication addresses this:

Strong consistency
All replicas agree before acknowledging writes
Eventual consistency
Writes acknowledge immediately; replicas sync later
Banking transaction: strong consistency required.
Social media like count: eventual consistency acceptable.
Requirements determine the appropriate choice.
| Choice | Consequences |
|---|---|
| Where state lives | Latency, failure domains, operational responsibility |
| Storage system | Query capabilities, throughput limits, consistency model |
| Replication strategy | Durability guarantees, read scaling, consistency trade-offs |
| Consistency level | Application correctness constraints, availability during failures |
No configuration optimizes all dimensions. State architecture is choosing which trade-offs to accept.
A service is an independent process that:
Not a library you link against - a running process you communicate with over the network.

Each service can be:
| Dimension | Independence Means |
|---|---|
| Developed | Different team, different pace, different language |
| Deployed | Update one service without redeploying others |
| Scaled | Add instances of high-load services only |
| Failed | One service crashing doesn’t necessarily crash others |
This independence comes from the network boundary - services interact through defined interfaces, not shared memory or direct function calls.

An API (Application Programming Interface) specifies:
What’s exposed
This is the contract clients depend on.
What’s hidden
Implementation can change without breaking clients.

If interface is stable:
Services evolve independently.
If interface changes:
Breaking changes propagate.
API design and versioning are significant engineering challenges. A poorly designed API creates coupling that undermines service independence.
A user request often traverses multiple services:

Each service boundary adds:
| Cost | Impact |
|---|---|
| Latency | Network round-trip per hop (typically 1-50ms each) |
| Failure probability | Each service is an additional failure point |
| Complexity | Debugging spans multiple processes, logs, traces |
| Operational overhead | More services = more to deploy, monitor, maintain |
A request traversing 5 services with 99% availability each: 0.99^5 ≈ 95% success rate.
Service boundaries are not free. They exist where independence benefits outweigh coordination costs.

Service fails fast
Unpleasant but bounded.
Service responds slowly
Can exhaust caller’s resources.
A service responding in 30 seconds instead of 30ms can cause more damage than one that fails immediately.
Every service call needs a timeout - but choosing timeout values is hard:
Timeout too short
Timeout too long
Timeout = how long are you willing to wait before assuming failure? There’s no universally correct answer.
When a request fails, should the caller retry?
Retries help when:
Retries hurt when:
Retry strategies (exponential backoff, jitter, limits) are necessary but add complexity.
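A sketch of one such strategy, exponential backoff with jitter and a retry limit (call_remote stands in for any network call; delays and limits are arbitrary):

```python
import random
import time

def call_with_retries(call_remote, max_attempts=4, base_delay=0.2):
    for attempt in range(max_attempts):
        try:
            return call_remote()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff (0.2s, 0.4s, 0.8s, ...) plus random jitter
            # so many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```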
When a downstream service is failing, continuing to call it:

Circuit breaker: fail fast when downstream is unhealthy, periodically test for recovery.
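A minimal circuit-breaker sketch (thresholds and the call interface are assumptions; real implementations also handle concurrency and half-open probing more carefully):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0                  # success closes the circuit
        return result
```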
In a distributed system, something is always failing somewhere. Services must handle:
Graceful degradation: return partial results, use cached data, disable features - rather than complete failure.
This is a design concern from the start, not something to add later.
The concepts from earlier sections manifest through services:
| Concept | How It Appears |
|---|---|
| Isolation (Section 4) | Each service in its own container |
| Communication (Section 5) | Services interact via APIs over network |
| State (Section 6) | Services manage or share state through databases |
| Coordination (Section 1) | Service interactions require coordination patterns |
Building distributed applications = designing services, their APIs, their interactions, and their failure modes.
Training a large model on a substantial dataset creates specific infrastructure requirements:
Data challenges
Compute challenges
A single Python script on a laptop doesn’t scale to this. What infrastructure makes it tractable?
Object storage holds the dataset:
Data pipeline streams to training:
Training code sees an iterator of batches. Infrastructure handles the rest.

A training run takes 8 hours. At hour 6, the node crashes. Without preparation: start over, lose 6 hours of GPU time.
Checkpointing solves this:
Checkpoints also enable:

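A minimal checkpointing sketch (assuming PyTorch; model, optimizer, and the path are placeholders):

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]                    # resume the training loop from here
```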
Running many experiments creates a management problem:
Experiment tracking captures:
| What | Why |
|---|---|
| Hyperparameters | Reproduce the run |
| Metrics over time | Compare learning dynamics |
| Code version / git hash | Know exactly what ran |
| Final model location | Find the artifact |
This is a service that training jobs write to - separate from the training code itself.
GPUs are expensive and limited. Multiple users/experiments compete for them.
Without orchestration:
With orchestration:


Each component addresses a specific requirement. Data flows left-to-right; state persists in storage systems; orchestrator coordinates the whole.
Different problem, similar structural patterns. A transit agency publishes vehicle positions. Build an application that:
Must handle:
Complications:
The transit API provides raw data, but:
Ingestion service handles this boundary:
Internal services never touch external API directly. Ingestion isolates that dependency.

Positions arrive in bursts (all vehicles update near-simultaneously). Processing (delay detection) takes variable time.
Message queue between them:
This is async communication from Section 5:
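The shape of that decoupling, sketched with the standard-library queue as a stand-in for a real message broker (vehicle counts and delays are arbitrary):

```python
import queue
import threading
import time

positions = queue.Queue()                  # stand-in for a message broker

def ingestion():
    # Bursty producer: all vehicles report at roughly the same moment.
    for vehicle_id in range(5):
        positions.put({"vehicle": vehicle_id, "ts": time.time()})

def processor():
    # Consumer drains at its own pace; the queue absorbs the burst.
    while True:
        event = positions.get()
        if event is None:                  # sentinel: no more events
            break
        time.sleep(0.1)                    # stand-in for variable-time delay detection

t = threading.Thread(target=processor)
t.start()
ingestion()
positions.put(None)
t.join()
```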

Real-time: events flow through queue, processor detects delays, sends notifications.
Query: user requests current positions, API queries database, returns response.
Different patterns for different access types. Database is the meeting point.
When processor detects a delay, it doesn’t send email directly. Instead:
Why separate?


External data enters through controlled boundary. Real-time and query paths serve different needs. Async processing prevents blocking. Components are independently deployable.
| Pattern | ML Training | Transit App |
|---|---|---|
| External boundary | Object storage holds dataset | Ingestion service wraps external API |
| Async processing | Data pipeline feeds GPU workers | Queue between ingestion and processor |
| Durable state | Checkpoints, experiment logs | Database for positions and history |
| Separation of concerns | Training vs tracking vs scheduling | Ingestion vs processing vs notification |
| Stateless compute | Workers can restart, reschedule | Services can scale, restart |
Different domains, same structural principles. The concepts from earlier sections appear as concrete components.