NoSQL and Distributed Databases

EE 547 - Unit 5

Dr. Brandon Franzke

Fall 2025

Outline

Relational Model Limitations

Structural and Scale Constraints

  • Object-relational impedance mismatch
  • JOIN complexity grows super-linearly
  • Schema rigidity vs sparse data

Distributed Systems

  • Horizontal partitioning for capacity
  • Replication trade-offs and consistency models
  • CAP theorem constraints
  • Consistency level tuning

NoSQL Data Modeling

  • Access pattern-driven design
  • Denormalization trade-offs

NoSQL Data Models

Key-Value and Document Stores

  • GET/PUT operations with O(1) access
  • Hierarchical document structure
  • Atomic operations and race conditions

Wide-Column and Graph Storage

  • Sparse columnar storage for time-series
  • Native graph traversal vs JOIN-based queries

Architecture Patterns

  • Polyglot persistence trade-offs
  • Managed database services

When Relational Models Break Down

Relational Models Excel at Structured, Uniform Data

Flight booking schema works because:

  • Uniform structure: Every flight has same attributes
  • Clear relationships: Foreign keys model real entities
  • Predictable queries: Find bookings, flights on routes
  • This fit between data shape and model is not universal
-- Natural queries for uniform data
SELECT * FROM bookings
WHERE passenger_id = 1234;

SELECT f.* FROM flights f
WHERE departure_airport = 'LAX'
  AND arrival_airport = 'JFK';

When data and queries match structure, relational wins

Schema Rigidity When Data Varies

Add special services to passengers:

  • Wheelchair assistance
  • Meal preferences (vegetarian, gluten-free, kosher)
  • Pet carrier
  • Unaccompanied minor
  • Medical oxygen
  • Extra legroom request

Relational approach 1: Add columns

ALTER TABLE passengers ADD COLUMN wheelchair BOOLEAN;
ALTER TABLE passengers ADD COLUMN meal_pref VARCHAR(50);
ALTER TABLE passengers ADD COLUMN pet_carrier BOOLEAN;
-- ...12 more columns

Problem: Most passengers have mostly NULLs (90% of cells)

Relational approach 2: Services table

CREATE TABLE passenger_services (
  passenger_id INT,
  service_key VARCHAR(50),
  service_value TEXT
);

Problem: Loses type safety, querying awkward

-- Find passengers needing wheelchair
SELECT DISTINCT passenger_id
FROM passenger_services
WHERE service_key = 'wheelchair'
  AND service_value = 'true';

-- Complex: wheelchair AND vegetarian
SELECT ps1.passenger_id
FROM passenger_services ps1
JOIN passenger_services ps2
  ON ps1.passenger_id = ps2.passenger_id
WHERE ps1.service_key = 'wheelchair'
  AND ps2.service_key = 'meal_pref'
  AND ps2.service_value = 'vegetarian';

Neither approach feels natural — data has variable structure per entity

Approach 1 storage example:

passenger_id | name  | wheelchair | meal  | pet  | oxygen
-------------|-------|------------|-------|------|-------
1            | Alice | NULL       | NULL  | NULL | NULL
2            | Bob   | NULL       | vegan | NULL | NULL
3            | Carol | true       | NULL  | NULL | NULL

With 10,000 passengers: 10,000 rows × 15 service columns = 150,000 cells, ~135,000 NULL values

Variable Schema Problem Generalizes

E-commerce products:

Books:

  • ISBN: VARCHAR(13)
  • author: VARCHAR(100)
  • pages: INT
  • publisher: VARCHAR(100)

Electronics:

  • voltage: INT
  • warranty_months: INT
  • power_watts: INT
  • battery_type: VARCHAR(50)

Clothing:

  • size: VARCHAR(10)
  • material: VARCHAR(100)
  • care_instructions: TEXT
  • fit_type: VARCHAR(50)

Relational solution?

  • Single products table with 50+ columns, mostly NULL
  • product_attributes table (key-value pairs)
  • Separate tables per category (breaks product queries)

IoT sensor data:

Temperature sensors:

  • celsius: FLOAT
  • calibration_date: DATE

Motion sensors:

  • x_accel: FLOAT
  • y_accel: FLOAT
  • z_accel: FLOAT
  • sensitivity: INT

Camera sensors:

  • image_url: VARCHAR(500)
  • resolution_width: INT
  • resolution_height: INT
  • fps: INT
  • codec: VARCHAR(20)

Pattern: Entity type is uniform (product, sensor), but attributes vary within type

Relational model assumes: All entities of same type have same attributes

Reality: Many domains have heterogeneous attributes
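For contrast, a document per entity stores only the attributes that exist. A sketch of two product documents (field values are illustrative, previewing the document model covered later):

{
  "product_id": 101,
  "category": "books",
  "name": "Example Book",
  "attributes": {
    "isbn": "9780000000000",
    "author": "Jane Doe",
    "pages": 320
  }
}

{
  "product_id": 102,
  "category": "clothing",
  "name": "Example Jacket",
  "attributes": {
    "size": "M",
    "material": "wool",
    "fit_type": "slim"
  }
}

No NULL padding: each document carries exactly its own attribute set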

Nested Data Requires JOIN Operations for Reassembly

Application code thinks in objects:

# Natural object access
booking.passenger.frequent_flyer_number
booking.flight.departure_airport
booking.flight.aircraft.model

SQL requires JOIN to reassemble:

SELECT b.*, p.*, f.*, a.*
FROM bookings b
  JOIN passengers p ON b.passenger_id = p.passenger_id
  JOIN flights f ON b.flight_id = f.flight_id
  JOIN aircraft a ON f.aircraft_id = a.aircraft_id
WHERE b.booking_id = 12345;

Computational cost:

  • Each JOIN: O(n log n) at best with indexes
  • Three JOINs for three-level nesting
  • Deeply nested structures → multiple JOINs

Worse: Comments on posts on forums

-- Get comment with post and forum context
SELECT c.*, p.*, f.*
FROM comments c
  JOIN posts p ON c.post_id = p.post_id
  JOIN forums f ON p.forum_id = f.forum_id;

Impedance mismatch: Object hierarchy vs tabular relations

Every nested access = JOIN operation

Impedance Mismatch - Serialization Round-Trips

Impedance mismatch: Application code operates on object graphs, databases store flat tables. Moving data between these representations requires expensive translation.

Loading a passenger with 3 bookings and flight details:

# ORM loading pattern (N+1 query problem)
passenger = session.query(Passenger).get(123)  # 1 query
for booking in passenger.bookings:              # 1 query
    flight = booking.flight                     # 3 queries
    aircraft = flight.aircraft                  # 3 queries
# Total: 8 queries

Eager loading with JOINs:

SELECT p.*, b.*, f.*, a.*
FROM passengers p
  LEFT JOIN bookings b ON p.passenger_id = b.passenger_id
  LEFT JOIN flights f ON b.flight_id = f.flight_id
  LEFT JOIN aircraft a ON f.aircraft_id = a.aircraft_id
WHERE p.passenger_id = 123;

Result: Duplicated passenger data × 3 rows (one per booking)

Alternative: Serialize entire object as JSON

-- Store object as JSON blob
ALTER TABLE passengers ADD COLUMN data JSONB;

Problem: Lose query capability, indexing, constraints

Load time for 1 passenger + 3 bookings:

  • N+1 Queries: 45ms (8 roundtrips)
  • Eager JOIN: 12ms (data duplication across rows)
  • JSON Column: 3ms (no additional queries, but limited query power)

Each approach has serious drawbacks

Query Patterns That Fight Relational Structure

Relational optimized for:

  • Get entity by ID: O(1) with primary key
  • Filter on indexed columns: O(log n)
  • JOIN related tables: O(n log n)

Problematic patterns:

1. Path queries:

-- Flights connecting through exactly 2 hubs
SELECT f1.flight_number, f2.flight_number, f3.flight_number
FROM flights f1
  JOIN flights f2 ON f1.arrival_airport = f2.departure_airport
  JOIN flights f3 ON f2.arrival_airport = f3.departure_airport
WHERE f1.departure_airport = 'LAX'
  AND f3.arrival_airport = 'LHR'
  AND f1.arrival_airport NOT IN ('LAX', 'LHR')
  AND f2.arrival_airport NOT IN ('LAX', 'LHR');

Self-JOINs explode with path length

2. Sparse attribute filters:

-- Passengers with (gluten-free OR wheelchair) AND pet
SELECT DISTINCT ps1.passenger_id
FROM passenger_services ps1
WHERE ps1.passenger_id IN (
    SELECT passenger_id FROM passenger_services
    WHERE (service_key = 'meal' AND service_value = 'gluten-free')
       OR (service_key = 'wheelchair' AND service_value = 'true')
  )
  AND ps1.passenger_id IN (
    SELECT passenger_id FROM passenger_services
    WHERE service_key = 'pet_carrier' AND service_value = 'true'
  );

3. Full-text search:

-- Passengers whose names contain substring
SELECT * FROM passengers
WHERE first_name LIKE '%john%'
   OR last_name LIKE '%john%';

Cannot use B-tree index, requires full table scan

These work but feel forced, require complex queries, or full table scans

Social Connection Queries in Relational Models

Schema:

CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name VARCHAR(100),
  city VARCHAR(50)
);

CREATE TABLE connections (
  user_id INT REFERENCES users(id),
  friend_id INT REFERENCES users(id),
  PRIMARY KEY (user_id, friend_id)
);

Query: “Friends of friends who live in Boston”

SELECT DISTINCT u.*
FROM connections c1
  JOIN connections c2 ON c1.friend_id = c2.user_id
  JOIN users u ON c2.friend_id = u.id
WHERE c1.user_id = 123
  AND u.city = 'Boston'
  AND c2.friend_id != 123;

Cost: 2 self-JOINs, O(n²) join complexity

3-hop traversal (friends-of-friends-of-friends)?

  • 3 self-JOINs, O(n³) complexity
  • Each additional hop: another JOIN, exponential growth

Arbitrary depth?

  • Requires recursive CTEs (complex, poorly optimized)

Recursive CTEs:

WITH RECURSIVE paths AS (
  SELECT user_id, friend_id, 1 AS depth
  FROM connections
  WHERE user_id = 123
  UNION ALL
  SELECT p.user_id, c.friend_id, p.depth + 1
  FROM paths p
  JOIN connections c ON p.friend_id = c.user_id
  WHERE p.depth < 5
)
SELECT DISTINCT friend_id FROM paths;

Complex, performance unpredictable

JSON Columns Provide Flexibility But Limits Query Power

PostgreSQL JSONB, MySQL JSON type:

ALTER TABLE passengers ADD COLUMN services JSONB;

-- Store variable data as JSON
INSERT INTO passengers (id, name, email, services)
VALUES (1, 'Alice', 'alice@example.com',
  '{"wheelchair": true, "meal": "vegetarian"}');

INSERT INTO passengers (id, name, email, services)
VALUES (2, 'Bob', 'bob@example.com',
  '{"meal": "gluten-free", "pet_carrier": true, "extra_legroom": true}');

Querying JSON data:

-- PostgreSQL JSONB operators
SELECT * FROM passengers
WHERE services->>'wheelchair' = 'true';

SELECT * FROM passengers
WHERE services ? 'meal'
  AND services->>'meal' = 'vegetarian';

Problems:

  • Query optimization: B-tree indexes need one expression index per queried JSON path
  • Partial indexes help but don't solve the fundamental issue:

CREATE INDEX idx_wheelchair
ON passengers ((services->>'wheelchair'))
WHERE services->>'wheelchair' IS NOT NULL;

  • Type constraints lost: Everything is text in JSON
  • No foreign key relationships inside JSON

Result: Relational overhead without relational benefits for nested data

Better solutions exist when data is fundamentally non-tabular

Sparse Data and NULL Proliferation

Add event tracking to flights:

ALTER TABLE flights ADD COLUMN delay_reason VARCHAR(100);
ALTER TABLE flights ADD COLUMN delay_minutes INT;
ALTER TABLE flights ADD COLUMN diverted_airport CHAR(3);
ALTER TABLE flights ADD COLUMN emergency_landing BOOLEAN;
ALTER TABLE flights ADD COLUMN diversion_reason TEXT;
ALTER TABLE flights ADD COLUMN emergency_type VARCHAR(50);
ALTER TABLE flights ADD COLUMN emergency_services BOOLEAN;
ALTER TABLE flights ADD COLUMN passenger_medical BOOLEAN;
-- ...15 more event-specific columns

Most flights: NULL for all special event fields

flight_id | delay_reason | diverted | emergency
----------|--------------|----------|----------
1001      | NULL         | NULL     | NULL
1002      | NULL         | NULL     | NULL
1003      | weather      | NULL     | NULL
1004      | NULL         | NULL     | NULL

With 1M flights, 20 event columns:

  • 20M total cells
  • 19M NULLs (95%)
  • Storage waste: Implementation-dependent, but significant

Semantic ambiguity:

  • NULL = unknown?
  • NULL = does not apply?
  • NULL = not recorded?

Querying sparse data is awkward:

-- Find flights with delays NOT due to weather
SELECT * FROM flights
WHERE delay_reason IS NOT NULL
  AND delay_reason != 'weather';

-- Flights with ANY special event
SELECT * FROM flights
WHERE delay_reason IS NOT NULL
   OR diverted_airport IS NOT NULL
   OR emergency_landing IS NOT NULL
   OR passenger_medical IS NOT NULL
   -- ...15 more conditions

Sparse attributes common in:

  • Event tracking
  • Optional features
  • User preferences
  • Sensor readings
  • Medical records

Schema Evolution Under Load

Need to add column to flights table with 10M rows:

ALTER TABLE flights ADD COLUMN delay_minutes INTEGER;

PostgreSQL behavior:

  • Acquires ACCESS EXCLUSIVE lock, blocking ALL reads and writes
  • A plain nullable ADD COLUMN is metadata-only and fast, but adding a column with a volatile default, backfilling values, or changing a column type rewrites the whole table under that lock
  • On large tables: minutes of downtime during a rewrite
  • Application connections time out and fail

Timing for table-rewriting ALTERs at different sizes:

Rows | ALTER TABLE time | Downtime
-----|------------------|-------------
10K  | ~1 second        | Acceptable
100K | ~10 seconds      | Problematic
1M   | ~30 seconds      | Unacceptable
10M  | ~5 minutes       | Critical

Zero-downtime migration requires shadow table (complexity: high, time: hours)

Application-level workaround (complex):

  1. Create shadow table with new schema
  2. Implement dual-write to both tables
  3. Backfill old data to new table
  4. Switch application to new table
  5. Drop old table

Error-prone:

  • Maintaining data consistency during migration
  • Coordinating application deployments
  • Rollback strategy if issues arise
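A minimal sketch of steps 2-3 (dual-write plus batched backfill), assuming a Python DB-API-style connection; table and column names are illustrative:

# Dual-write keeps the old and shadow tables in sync during migration
def insert_flight(db, row):
    with db:  # one transaction where the driver supports it
        db.execute(
            "INSERT INTO flights (flight_id, flight_number) VALUES (?, ?)",
            (row["flight_id"], row["flight_number"]))
        db.execute(
            "INSERT INTO flights_new (flight_id, flight_number, delay_minutes) "
            "VALUES (?, ?, ?)",
            (row["flight_id"], row["flight_number"], row.get("delay_minutes")))

# Backfill copies pre-existing rows in small batches to limit lock time
def backfill(db, batch=1000):
    while True:
        rows = db.execute(
            "SELECT flight_id, flight_number FROM flights "
            "WHERE flight_id NOT IN (SELECT flight_id FROM flights_new) "
            "LIMIT ?", (batch,)).fetchall()
        if not rows:
            break
        with db:
            db.executemany(
                "INSERT INTO flights_new (flight_id, flight_number) "
                "VALUES (?, ?)", rows)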

Relational Remains Correct for Structured Data

Flight operations data FITS relational well:

Uniform schema:

  • Every flight has same structure
  • Every booking follows same pattern
  • Every passenger has consistent attributes

Clear relationships:

  • Foreign keys model real entities
  • Referential integrity enforced by database
  • JOIN operations match conceptual relationships

Transaction requirements:

  • Booking seat → decrease inventory atomically
  • Payment + confirmation must succeed together
  • ACID properties critical for correctness

Known queries:

  • Predictable access patterns
  • Standard reporting needs
  • Queries match table structure
-- These queries are natural and efficient
SELECT COUNT(*) FROM bookings WHERE flight_id = 1234;

SELECT f.*, COUNT(b.booking_id) as passenger_count
FROM flights f
LEFT JOIN bookings b ON f.flight_id = b.flight_id
GROUP BY f.flight_id;

BEGIN TRANSACTION;
  INSERT INTO bookings (passenger_id, flight_id, seat)
  VALUES (123, 456, '12A');
  UPDATE flights SET available_seats = available_seats - 1
  WHERE flight_id = 456;
COMMIT;

If your data looks like this, relational is the correct choice

When relational is correct:

  • Uniform schema per entity type
  • Relationships map to foreign keys
  • ACID transactions required
  • Queries match table structure
  • Data integrity constraints needed

When to consider alternatives:

  • Variable attributes per entity
  • Deep object hierarchies
  • Sparse/optional attributes
  • Graph traversal queries
  • Frequent schema changes

Match data structure to model — not every problem needs NoSQL

Query Patterns Define Data Model

Relational Normalizes, NoSQL Denormalizes

Relational modeling:

  • Normalize to 3NF, eliminate redundancy
  • Single source of truth for each fact
  • JOIN to assemble complete views

NoSQL modeling:

  • Start with queries, store data for read patterns
  • Redundancy acceptable for query performance
  • Trade storage and update complexity for read simplicity

Example: Flight booking confirmation email

Relational approach: JOIN 5 tables

SELECT b.booking_reference, p.name, p.email,
       f.flight_number, f.scheduled_departure,
       d.name AS departure_city,
       a.name AS arrival_city,
       ac.model AS aircraft_type
FROM bookings b
  JOIN passengers p ON b.passenger_id = p.passenger_id
  JOIN flights f ON b.flight_id = f.flight_id
  JOIN airports d ON f.departure_airport = d.airport_code
  JOIN airports a ON f.arrival_airport = a.airport_code
  JOIN aircraft ac ON f.aircraft_id = ac.aircraft_id
WHERE b.booking_id = 12345;

NoSQL approach: Store complete confirmation in booking document

{
  "booking_id": 12345,
  "booking_reference": "ABC123",
  "passenger": {
    "name": "Alice Chen",
    "email": "alice@example.com"
  },
  "flight": {
    "number": "UA123",
    "departure": "2025-02-15T08:30:00Z",
    "departure_city": "Los Angeles",
    "arrival_city": "New York",
    "aircraft": "Boeing 737"
  }
}

Update cost: User name changes → update in 1000 places vs 1 place

When acceptable: Reads >> Writes (confirmation viewed 100× more than user changes name)

Access Pattern - Flight Status Display

Requirement: Airport display showing flight status

  • Refreshes every 30 seconds
  • Shows next 2 hours of departures
  • 100 displays at LAX

Query: “All departures from LAX in next 2 hours”

Data needed:

  • Flight number, gate, status
  • Destination city name (not just code)
  • Aircraft type
  • Departure time

Relational approach:

SELECT f.flight_number, f.gate, f.status,
       f.scheduled_departure,
       a.model AS aircraft_type,
       ap.name AS destination_city
FROM flights f
  JOIN aircraft a ON f.aircraft_id = a.aircraft_id
  JOIN airports ap ON f.arrival_airport = ap.airport_code
WHERE f.departure_airport = 'LAX'
  AND f.scheduled_departure BETWEEN NOW()
      AND NOW() + INTERVAL '2 hours';

Frequency analysis:

  • 100 displays × 120 queries/hour = 12,000 queries/hour
  • Schedule changes: ~5 updates/hour
  • Read:Write ratio = 2400:1

NoSQL approach: Denormalize for display query

{
  "flight_id": 1234,
  "flight_number": "UA123",
  "departure_airport": "LAX",
  "arrival_airport": "JFK",
  "destination_city": "New York",  // Duplicated
  "aircraft_model": "Boeing 737",   // Duplicated
  "gate": "B12",
  "status": "On Time",
  "scheduled_departure": "2025-02-15T14:30:00Z"
}

Single query, no joins:

db.flights.find({
  departure_airport: "LAX",
  scheduled_departure: {
    $gte: new Date(),
    $lte: new Date(Date.now() + 2*60*60*1000)
  }
})

Read:Write ratio 2400:1 justifies denormalization

Aircraft model duplicated in every flight using that aircraft

City name duplicated in every flight to that destination

Updates rare, reads constant

Embedding - Passenger Special Services

Earlier Problem: Variable passenger services

  • Wheelchair assistance
  • Meal preferences
  • Pet carrier
  • Varying per passenger

Document approach: Embed services as nested object

{
  "passenger_id": 123,
  "name": "Alice Chen",
  "email": "alice@example.com",
  "phone": "+1-555-0123",
  "services": {
    "wheelchair": true,
    "meal": "vegetarian",
    "pet_carrier": true,
    "seat_preference": "aisle",
    "medical_oxygen": false
  },
  "frequent_flyer": {
    "number": "FF123456",
    "tier": "Gold",
    "miles": 85000
  }
}

Why embed:

  • Accessed together in primary query pattern
  • Small size (<1KB total)
  • No independent existence
  • Services have no meaning without passenger context

Single document fetch gets everything

No NULL columns for passengers without services

Store only what exists, no wasted NULL storage

Referencing - Flight Bookings

Flight with 200 bookings - cannot embed all:

Size problem:

  • 200 bookings × 1KB each = 200KB
  • MongoDB limit: 16MB
  • DynamoDB limit: 400KB
  • Approaching limits with just bookings

Access problem:

  • Fetching flight details does not need all bookings
  • Display shows “seats available”, not passenger list
  • Loading 200KB when 1KB needed

Update problem:

  • Every new booking rewrites entire flight document
  • Concurrent booking updates conflict
  • Document-level locking affects all bookings

Solution: Reference pattern

// Flight document (small, frequently accessed)
{
  "flight_id": 456,
  "flight_number": "UA123",
  "departure_airport": "LAX",
  "arrival_airport": "JFK",
  "scheduled_departure": "2025-02-15T14:30:00Z",
  "seats_available": 12,
  "total_seats": 180
}

// Booking document (separate collection)
{
  "booking_id": 789,
  "flight_id": 456,  // Reference to flight
  "passenger_id": 123,
  "seat": "12A",
  "booking_time": "2025-01-15T10:30:00Z"
}

Query patterns:

Get flight details:

db.flights.findOne({flight_id: 456})
// Returns 1KB document

Get bookings for flight:

db.bookings.find({flight_id: 456})
// Returns cursor to 200 documents

Add new booking:

// Two operations (wrap in a transaction if supported)
db.bookings.insertOne({...})
db.flights.updateOne(
  {flight_id: 456},
  {$inc: {seats_available: -1}}
)

Trade-off:

  • Embedded: 200 KB document (grows with bookings)
  • Referenced: 1 KB flight + 1 KB per booking (separate queries)
  • Document size limit: 16 MB (MongoDB)

Like foreign keys, but application-enforced (not database engine!)

Sensor Data at Scale - Distribution Patterns

Scenario: IoT temperature sensor network

  • 100,000 sensors
  • 1 reading per minute per sensor
  • 5 years retention required

Data volume:

  • 100K sensors × 60 readings/hour = 6M readings/hour
  • 144M readings/day
  • 52.6B readings/year
  • 5 years = 263B readings

Cannot fit on single server:

  • 263B readings × 50 bytes/reading = 13TB
  • Query latency unacceptable on 13TB
  • Write throughput: 1,667 writes/second continuous

Read patterns:

  1. Recent readings for one sensor (95% of queries)
    • “Sensor 42000 last 24 hours”
  2. All readings for sensor in time range (4%)
    • “Sensor 42000 January 2025”
  3. Average across all sensors (1%)
    • “Average temperature now”

Distribution required: 13TB cannot fit on single server

Partition strategy determines query efficiency:

  • Primary pattern: Single sensor queries (99%)
  • Rare pattern: Cross-sensor aggregation (1%)

Sensor Data - Partition by Sensor ID

Partition strategy: All data for sensor_id on same server

Physical layout across 5 servers:

  • Server 1: sensor_id 0-19,999 (20K sensors)
  • Server 2: sensor_id 20,000-39,999
  • Server 3: sensor_id 40,000-59,999
  • Server 4: sensor_id 60,000-79,999
  • Server 5: sensor_id 80,000-99,999

Each server stores:

sensor_id | timestamp           | temperature
----------|--------------------|-----------
40000     | 2025-01-15 00:00  | 21.0
40000     | 2025-01-15 00:01  | 21.1
40000     | 2025-01-15 00:02  | 21.2
...
40001     | 2025-01-15 00:00  | 19.5

Query 1: “Sensor 42000, last 24 hours”

  • Range lookup: 42000 ∈ 40,000-59,999 → Server 3
  • Route query to single server
  • Single server processes query
  • Fast: O(1) routing, O(log n) within server

Query 2: “Average temperature all sensors”

  • Must contact all 5 servers
  • Each computes local average
  • Coordinator aggregates results
  • Slow: 5× network calls

Performance:

  • Single sensor query: ~5ms
  • All sensors query: ~50ms (parallel) + aggregation

Partition key determines query efficiency
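A routing sketch for both query shapes (Python; the server list and `client.query` API are assumptions, not a real driver):

SERVERS = ["server1", "server2", "server3", "server4", "server5"]  # 20K sensors each

def server_for(sensor_id):
    # Range-based placement: 42000 // 20000 = 2 → "server3"
    return SERVERS[sensor_id // 20_000]

def sensor_last_24h(client, sensor_id):
    # Single-sensor query routes to exactly one server
    return client.query(server_for(sensor_id),
                        "SELECT ts, temp FROM readings "
                        "WHERE sensor_id = ? AND ts > now() - INTERVAL '24 hours'",
                        (sensor_id,))

def global_average(client):
    # Scatter-gather: every server computes a partial sum, coordinator combines
    partials = [client.query(s, "SELECT SUM(temp), COUNT(*) FROM readings")
                for s in SERVERS]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count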

Sensor Data - Row Structure Optimized for Queries

Query requirement: Readings with metadata

  • Location where sensor installed
  • Sensor type and calibration info
  • Building/floor information

Relational approach (bad for distributed):

-- Metadata on central server
CREATE TABLE sensors (
  sensor_id INT PRIMARY KEY,
  location VARCHAR(100),
  building VARCHAR(50),
  sensor_type VARCHAR(20)
);

-- Readings partitioned across 5 servers
CREATE TABLE readings (
  sensor_id INT,
  timestamp TIMESTAMP,
  temperature FLOAT,
  FOREIGN KEY (sensor_id) REFERENCES sensors
);

Query needs metadata:

  1. Fetch metadata from central server
  2. Fetch readings from data server
  3. Join in application = 2 network round-trips

Denormalized approach:

{
  "sensor_id": 42000,
  "timestamp": "2025-01-15T10:30:00Z",
  "temperature": 22.5,
  "location": "Building A, Floor 3, Room 301",  // Duplicated
  "sensor_type": "indoor_temp",                  // Duplicated
  "building": "Building A"                       // Duplicated
}

Storage trade-off:

  • +40 bytes per reading (metadata)
  • 263B readings × 40 bytes = 10.5TB extra
  • Total: 23.5TB vs 13TB

Query performance:
  • Normalized: 25ms (2 server roundtrips)
  • Denormalized: 5ms (single server)

Update frequency:

  • Sensor location changes: ~1 per month
  • Temperature readings: 100K per minute

Denormalization justified by read:write ratio

Sensor Data - Compound Sort Key for Time Queries

Data on Server 3 (sensor_id 40,000-59,999):

  • Primary: Partition by sensor_id
  • Secondary: Sort by timestamp within partition

Physical layout on disk:

Partition: sensor_id=40000
--------------------------
timestamp            | temp
2025-01-15 00:00:00 | 21.0
2025-01-15 00:01:00 | 21.1
2025-01-15 00:02:00 | 21.2
...
2025-01-15 23:59:00 | 20.8

Partition: sensor_id=40001
--------------------------
timestamp            | temp
2025-01-15 00:00:00 | 19.5
2025-01-15 00:01:00 | 19.6
...

Query: “sensor_id=40000 between 10:00 and 12:00”

  1. Locate partition for sensor_id=40000
  2. Binary search to find 10:00:00
  3. Sequential scan until 12:00:00
  4. Return ~120 readings

Efficient: O(log n) + O(k) where k = result size

Cannot efficiently query: “All sensors at 10:00”

  • Would scan all 20,000 partitions on this server
  • Then contact other 4 servers
  • Query pattern not supported by layout

Sort order within partition provides O(log n) range queries
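The same seek-then-scan behavior in miniature (Python sketch; `partition` stands in for one sensor's sorted rows):

import bisect
from datetime import datetime

# Sorted (timestamp, temperature) rows for sensor_id=40000
partition = [
    (datetime(2025, 1, 15, 0, 0), 21.0),
    (datetime(2025, 1, 15, 0, 1), 21.1),
    # ... one row per minute ...
    (datetime(2025, 1, 15, 23, 59), 20.8),
]

def range_query(partition, start, end):
    timestamps = [ts for ts, _ in partition]
    lo = bisect.bisect_left(timestamps, start)   # O(log n) seek to 10:00
    hi = bisect.bisect_right(timestamps, end)    # O(log n) seek to 12:00
    return partition[lo:hi]                      # O(k) sequential read

rows = range_query(partition,
                   datetime(2025, 1, 15, 10, 0),
                   datetime(2025, 1, 15, 12, 0))  # ~120 readings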

Multiple Access Patterns Require Multiple Tables

Two different access patterns:

Pattern 1: Recent readings by sensor

  • “Get sensor 42000 for last 24 hours”
  • Need: Partition by sensor_id, sort by timestamp

Pattern 2: All sensors at specific time

  • “Get all sensors at 10:00-11:00”
  • Need: Partition by timestamp, sort by sensor_id

Same data, stored twice:

-- Table 1: Optimized for sensor queries
CREATE TABLE readings_by_sensor (
  sensor_id INT,        -- Partition key
  timestamp TIMESTAMP,  -- Sort key
  temperature FLOAT,
  location TEXT,
  PRIMARY KEY (sensor_id, timestamp)
);

-- Table 2: Optimized for time queries
CREATE TABLE readings_by_time (
  time_bucket INT,      -- Partition key (hour)
  sensor_id INT,        -- Sort key
  temperature FLOAT,
  location TEXT,
  PRIMARY KEY (time_bucket, sensor_id)
);

Write path: Application writes to both tables

Read path: Query router chooses table based on pattern

Cannot optimize single table for both patterns

Trade storage and write complexity for query performance
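A sketch of the dual-table write path (Python, in the style of a Cassandra driver session; the `session.execute` call and placeholders are assumptions):

import datetime

def record_reading(session, sensor_id, temperature, location):
    now = datetime.datetime.now(datetime.timezone.utc)
    time_bucket = int(now.timestamp()) // 3600  # hour bucket

    # Table 1: partitioned by sensor_id, sorted by timestamp
    session.execute(
        "INSERT INTO readings_by_sensor (sensor_id, timestamp, temperature, location) "
        "VALUES (%s, %s, %s, %s)",
        (sensor_id, now, temperature, location))

    # Table 2: partitioned by hour bucket, sorted by sensor_id
    session.execute(
        "INSERT INTO readings_by_time (time_bucket, sensor_id, temperature, location) "
        "VALUES (%s, %s, %s, %s)",
        (time_bucket, sensor_id, temperature, location))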

NoSQL Data Modeling Anti-Pattern - Relational Thinking

Anti-pattern: Normalize sensor data

// Sensors collection (metadata)
{
  "sensor_id": 42000,
  "location": "Building A, Floor 3",
  "sensor_type": "indoor_temp",
  "calibration_date": "2024-01-15"
}

// Readings collection (data)
{
  "reading_id": 999999,
  "sensor_id": 42000,  // Reference
  "timestamp": "2025-01-15T10:30:00Z",
  "temperature": 22.5
}

Query to display dashboard (100 recent readings):

  1. Fetch 100 readings from readings collection
  2. Extract unique sensor_ids
  3. Fetch sensor metadata for each unique sensor
  4. Join in application

Problem:

  • 100 readings from 20 sensors = 21 queries
  • Network latency: 20ms × 21 = 420ms
  • Dashboard refresh every second? Impossible

Correct approach: Denormalize

  • Store location with every reading
  • Single query returns everything
  • Latency: 20ms total (21× faster)

Relational normalization creates N+1 query problem
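The two access paths side by side (pymongo-style sketch; the `db` handle and collection names are assumptions):

# Anti-pattern: normalized collections force an application-side join
readings = list(db.readings.find().sort("timestamp", -1).limit(100))  # 1 query
meta = {}
for sid in {r["sensor_id"] for r in readings}:           # ~20 unique sensors
    meta[sid] = db.sensors.find_one({"sensor_id": sid})  # +20 round-trips
for r in readings:
    r["location"] = meta[r["sensor_id"]]["location"]

# Denormalized: each document already carries location, one query total
readings = list(db.readings_denorm.find().sort("timestamp", -1).limit(100))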

When Denormalization Breaks - Update Anomalies

Scenario: Sensor relocated (Building A → Building B)

With denormalization, location stored in:

  • 1,440 readings per day (every minute)
  • 43,200 readings per month
  • 518,400 readings per year

Update options:

Option 1: Update all historical records

db.readings.updateMany(
  {sensor_id: 42000},
  {$set: {location: "Building B, Floor 2"}}
)
  • 518,400 documents updated
  • Takes minutes on distributed system
  • Blocks other operations
  • Historical data now incorrect

Option 2: Update going forward only

  • Historical: “Building A”
  • Current: “Building B”
  • Query results inconsistent
  • Acceptable for time-series

Option 3: Versioned metadata

{
  "sensor_id": 42000,
  "timestamp": "2025-01-15T10:30:00Z",
  "temperature": 22.5,
  "metadata_version": 3  // Reference version
}

More complex queries, additional lookup

When denormalization acceptable:

  • Read:Write > 100:1
  • Updates rare (monthly)
  • Eventual consistency OK
  • Historical accuracy not critical

When to avoid:

  • Frequent updates (prices change hourly)
  • Strong consistency required (inventory)
  • Write-heavy workloads
  • Regulatory accuracy requirements

Trade-off: Read performance vs update complexity

Scaling and Distribution Trade-offs

Single Node Capacity Limits

PostgreSQL practical limits on single node:

  • Connections: 10,000 concurrent (connection overhead)
  • Queries: 100,000 queries/sec (CPU bound)
  • Storage: ~2TB practical limit (vacuum, index maintenance)
  • Single point of failure

Real production at scale:

Netflix streaming service:

  • 230M subscribers
  • 1% concurrent = 2.3M active users
  • 10 requests/sec per user = 23M requests/sec
  • Single node handles: 100K requests/sec
  • Required: 230+ database nodes minimum

Reddit (2023):

  • 14TB PostgreSQL database
  • Schema migrations: 6 hours of downtime
  • Read replicas: 50+ to handle read load

Single-machine limits: Cannot fit internet-scale workloads on one node

Single node limits:

Resource         | Limit   | Bottleneck
-----------------|---------|----------------
Connections      | 10K     | Memory overhead
Queries/sec      | 100K    | CPU cores
Storage          | 2TB     | Vacuum time
Write throughput | 50K/sec | Disk I/O

Capacity mismatch:

Example workload (social media):

  • 100M users
  • 5% active concurrently = 5M
  • 5 queries/user/sec = 25M queries/sec

Single PostgreSQL node: 100K queries/sec

Gap: 250× capacity needed

Cannot be solved with bigger hardware — must distribute across multiple nodes

Vertical vs Horizontal Scaling Economics

Vertical scaling = bigger machine:

AWS RDS pricing (2024):

  • 128GB RAM: $1,380/month → $10.78/GB
  • 1TB RAM: $11,040/month → $10.78/GB
  • 4TB RAM: $38,880/month → $9.72/GB
  • 8TB RAM: $77,760/month → $9.72/GB

Capacity scaling:

  • 10× data requires 10× RAM
  • 10× queries requires 10× CPU
  • But: 10× cost gets 10× capacity (best case)

Problems with vertical:

  • Single point of failure
  • Limited by hardware (8TB RAM max)
  • Downtime for upgrades
  • Cannot add capacity gradually

At Reddit’s 14TB:

  • No single machine available
  • Must distribute anyway

Horizontal scaling = more machines:

10 nodes × 128GB RAM:

  • Capacity: 1.28TB total
  • Cost: 10 × $1,380 = $13,800/month
  • Redundancy: Lose 1 node, still operational
  • Incremental: Add nodes as needed

Comparison for 1TB capacity:

Approach   | Configuration | Cost/month | Redundancy
-----------|---------------|------------|------------------------
Vertical   | 1 × 1TB       | $11,040    | None
Horizontal | 10 × 128GB    | $13,800    | 9 nodes survive failure

Horizontal advantages:

  • Linear cost scaling
  • Built-in redundancy
  • No single point of failure
  • Add capacity without downtime

Trade-off: Horizontal requires distributing data and coordinating across nodes

Partitioning Splits Data Across Nodes

Partition = subset of data assigned to one node

Example: 10M users, 10 nodes

  • Each node stores 1M users
  • Query routes to node containing requested data

Two partitioning strategies:

Hash-based partitioning:

node = hash(user_id) % num_nodes

hash("alice") = 0x7a3f → Node 7
hash("bob") = 0x2c91 → Node 2
hash("carol") = 0x9e5d → Node 9

Characteristics:

  • Even distribution (no hotspots)
  • Random assignment
  • Range queries scan all nodes

Range-based partitioning:

A-B → Node 1
C-E → Node 2
F-J → Node 3
...
Z → Node 10

"alice" → Node 1
"bob" → Node 1
"carol" → Node 2

Characteristics:

  • Range scans stay on few nodes
  • Risk: Hotspots if data skewed

Trade-off: Even load vs efficient range queries
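Both placement functions in a few lines (Python; the split points are illustrative):

import bisect
import hashlib

NUM_NODES = 10

def hash_node(user_id):
    # Even spread, but neighboring keys land on unrelated nodes
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % NUM_NODES

SPLITS = ["c", "f", "j", "n", "q", "t", "w"]  # first-letter ranges (8 nodes)

def range_node(user_id):
    # Neighboring keys stay together, so range scans touch few nodes
    return bisect.bisect_right(SPLITS, user_id[0].lower())

print(hash_node("alice"), hash_node("bob"), hash_node("carol"))    # scattered
print(range_node("alice"), range_node("bob"), range_node("carol"))  # 0, 0, 1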

Partition Key Determines Data Location and Query Routing

Partition key = attribute used to determine node assignment

Example: E-commerce orders table

Option 1: Partition by user_id

-- Efficient: Single node
SELECT * FROM orders WHERE user_id = 123;
hash(123) = Node 4Query Node 4 only

-- Expensive: All nodes
SELECT * FROM orders WHERE order_date = '2024-10-07';
→ Must query all 10 nodes, merge results

Option 2: Partition by order_date

-- Efficient: Single node
SELECT * FROM orders WHERE order_date = '2024-10-07';
'2024-10-07' in range for Node 3Query Node 3 only

-- Expensive: All nodes
SELECT * FROM orders WHERE user_id = 123;
Date unknown, must query all 10 nodes

Partition key choice optimizes one access pattern at expense of others

Cost of wrong choice:

  • Single-node query: 10ms latency
  • Cross-partition query: 10 nodes × 10ms = 100ms latency
  • Plus: Network coordination overhead

Partition key choice determines which queries are efficient

Replication Provides Redundancy

Replication = store multiple copies of data on different nodes

Why replicate:

1. Fault tolerance:

  • Disk failure: ~0.5% annual failure rate per drive
  • With 1,000 drives: Expect a failure roughly every 2 months
  • Without replication: Data loss
  • With 3 replicas: 2 copies survive

2. Read scaling:

  • Primary handles writes
  • Replicas handle reads
  • 3 replicas = 3× read capacity

3. Geographic distribution:

  • US datacenter + EU datacenter + Asia datacenter
  • Place replica near users: 10ms vs 200ms latency

Standard configuration: N=3 replicas

Example: Partition A (users 0-999,999)

  • Primary: Node 1 (handles writes)
  • Replica 1: Node 2 (handles reads)
  • Replica 2: Node 3 (handles reads)

Storage cost: 3× disk space

3 replicas = survive 2 simultaneous failures

Replication Creates Consistency Problem

Write must propagate from primary to replicas

Write flow:

  1. Client sends write to Primary (Node 1)
  2. Primary updates local copy
  3. Primary sends update to Replicas (Nodes 2, 3)
  4. Replicas acknowledge receipt

Network latencies:

  • Same datacenter: 1-5ms between nodes
  • Cross-datacenter: 50-200ms
  • Replica may be temporarily offline

Critical question: When does write “succeed”?

Option A: Wait for all replicas (synchronous)

Client → Primary (5ms) → Wait for replicas (50ms)
Total latency: 55ms
If replica offline: Write fails
Result: All nodes have same data (consistent)

Option B: Return immediately (asynchronous)

Client → Primary (5ms) → Return success
Replicas updated in background (async)
Total latency: 5ms
If replica offline: Write still succeeds
Result: Replicas temporarily stale (inconsistent)

Trade-off: 5ms vs 55ms response time

Consistency vs latency vs availability trade-off

Network Partitions Split Clusters

Network partition = nodes cannot communicate due to network failure

Common causes:

  • Switch failure in datacenter
  • Fiber cable cut (construction accident)
  • Network misconfiguration
  • AWS documented frequency: ~0.1% of requests see partition effects

Example: 3-node cluster [A, B, C]

Normal operation:

  • All nodes communicate
  • Writes propagate A → B → C

Network partition occurs:

  • Network splits: [Node A] isolated from [Nodes B, C]
  • Both sides still operational
  • Both sides think they’re correct

Critical problem: Neither side knows if other side crashed or network failed

Duration:

  • Seconds: Brief network glitch
  • Minutes: Switch reboot
  • Hours: Waiting for human intervention (fiber repair)

During partition:

  • Both sides may accept writes
  • Creates conflicting data versions
  • Must resolve after partition heals

Partition forces choice: reject writes or accept conflicts

CAP Theorem Forces a Choice During Partitions

Three properties distributed systems want:

Consistency (C):

  • All nodes return same value for same query
  • Read always returns most recent write
  • No stale data

Availability (A):

  • Every request gets non-error response
  • No timeouts or failures
  • System always operational

Partition tolerance (P):

  • System works despite network splits
  • Required for any distributed system
  • Network failures inevitable

CAP Theorem: Cannot have all three during network partition

During partition, must choose:

CP (Consistency + Partition tolerance):

  • Reject requests to minority partition
  • Guarantees consistency
  • Sacrifices availability

AP (Availability + Partition tolerance):

  • Accept requests on all partitions
  • Guarantees availability
  • Sacrifices immediate consistency

Real systems:

  • Banking: CP (consistency critical)
  • Social media: AP (availability critical)
  • Inventory: CP (avoid overselling)
  • Analytics: AP (stale data acceptable)

CP Systems Sacrifice Availability for Consistency

CP behavior during partition: [Node A] | [Nodes B, C]

Minority partition (Node A):

  • Rejects all write requests: “Error: Cannot reach majority”
  • May reject reads (depending on configuration)
  • Becomes unavailable to clients

Majority partition (Nodes B, C):

  • Continues accepting writes
  • Requires majority (2 of 3 nodes) for writes
  • Guarantees no conflicting data

After partition heals:

  • Node A syncs from B or C
  • No conflicts to resolve
  • Data guaranteed consistent

Write quorum: W > N/2

  • 3 nodes: Must write to 2 nodes
  • Latency: Wait for slowest of 2 nodes
  • Cost: Higher latency, possible unavailability

PostgreSQL example:

  • Primary fails → replica promoted
  • Failover time: 30-120 seconds
  • During failover: All writes fail
  • After: Guaranteed consistency

Use CP when:

  • Financial transactions (no double-spending)
  • Inventory (no overselling)
  • User authentication (no conflicting state)

Trade-off: Latency and availability for guaranteed consistency

AP Systems Sacrifice Consistency for Availability

AP behavior during partition: [Node A] | [Nodes B, C]

Both partitions accept writes:

  • Node A: Accepts writes from its clients
  • Nodes B, C: Accept writes from their clients
  • Result: Divergent data versions

Example conflict:

Initial: user.balance = 100
Node A:  user.balance = 50  (withdraw $50)
Node B:  user.balance = 120 (deposit $20)

After partition heals: Which is correct?

Conflict resolution strategies:

1. Last-write-wins (timestamp):

  • Keep write with latest timestamp
  • Simple, fast
  • Risk: Clock skew causes wrong choice, data loss

2. Application merge:

  • Application logic combines conflicts
  • Example: Both transactions valid → balance = 70
  • Correct but complex

3. Vector clocks:

  • Track causality between writes
  • Detect concurrent writes
  • DynamoDB approach
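Strategy 1 in miniature (Python sketch; timestamps are made up, values follow the example above):

def lww_merge(a, b):
    # Each version: {"value": ..., "ts": unix_seconds}; highest timestamp wins
    return a if a["ts"] >= b["ts"] else b

node_a = {"value": 50, "ts": 1736935800.0}   # withdraw $50 applied on A
node_b = {"value": 120, "ts": 1736935802.5}  # deposit $20 applied on B

print(lww_merge(node_a, node_b)["value"])  # 120: the withdrawal silently vanishes
# An application merge that replays both operations would yield 70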

Cassandra example:

  • Write with W=1 (any node): 5ms latency
  • Always available during partitions
  • Read may return stale data
  • Eventual consistency: Replicas converge in milliseconds to seconds

Trade-off: Consistency for availability and low latency

BASE Properties Define AP System Guarantees

ACID (relational databases):

  • Atomic: All or nothing
  • Consistent: Constraints enforced
  • Isolated: Concurrent transactions don’t interfere
  • Durable: Committed data survives failures

BASE (distributed systems):

  • Basically Available: System responds even during failures
  • Soft state: State may change without input (due to eventual consistency)
  • Eventual consistency: Replicas converge given enough time

Basically Available:

  • System operational during network partitions
  • May return stale data or degraded service
  • Prioritizes availability over correctness

Soft State:

  • No guaranteed consistency at any given moment
  • State changes as updates propagate
  • Application must handle inconsistencies

Eventual Consistency:

  • All replicas converge to same value
  • Time window: milliseconds to seconds
  • No guarantee when convergence occurs

Practical implications:

  • Read from replica: May get old value
  • Write to one node: Other nodes lag behind
  • Network partition: Nodes diverge temporarily
  • Partition heals: Conflict resolution required

BASE trades immediate consistency for availability and partition tolerance (AP in CAP)

Tunable Consistency Balances Trade-offs

Modern systems allow per-query consistency level

Cassandra consistency levels:

ONE: Return after 1 replica responds

  • Latency: 5ms
  • Consistency: Eventual (may read stale)
  • Availability: High (works with 1 node)

QUORUM: Return after majority responds (R+W > N)

  • Latency: 20ms
  • Consistency: Strong (if W=QUORUM also)
  • Availability: Medium (needs majority alive)

ALL: Return after all replicas respond

  • Latency: 50ms
  • Consistency: Strong
  • Availability: Low (all nodes must be up)

DynamoDB read options:

  • Eventually consistent: 5ms, may be stale
  • Strongly consistent: 10ms, sees all writes

Application chooses per-query:

# Fast read for displaying comments (stale OK)
comments = db.query(
    consistency='ONE'  # 5ms, eventual
)

# Strong read for payment verification
balance = db.query(
    consistency='QUORUM'  # 20ms, consistent
)

Consistency vs latency trade-off:

Level  | Replicas | Latency | Consistency | Availability
-------|----------|---------|-------------|-------------
ONE    | 1        | 5ms     | Eventual    | High
QUORUM | 2 of 3   | 20ms    | Strong*     | Medium
ALL    | 3 of 3   | 50ms    | Strong      | Low

*If write also uses QUORUM: R+W > N guarantees overlap

Quorum mathematics:

  • N = 3 replicas
  • W = 2 (write quorum)
  • R = 2 (read quorum)
  • R + W = 4 > N = 3
  • Guarantees: Read overlaps with write nodes
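The overlap guarantee can be checked exhaustively for small N (Python sketch):

from itertools import combinations

def overlap_guaranteed(n, r, w):
    # Every possible read set must intersect every possible write set
    nodes = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(nodes, r)
               for ws in combinations(nodes, w))

print(overlap_guaranteed(3, 2, 2))  # True:  R+W = 4 > N = 3
print(overlap_guaranteed(3, 1, 2))  # False: R+W = 3, a read can miss the write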

Latency distribution (Cassandra, same datacenter):

  • ONE: p50=5ms, p99=10ms
  • QUORUM: p50=20ms, p99=50ms
  • ALL: p50=50ms, p99=200ms

Application decides trade-off:

  • Social feed: ONE (fast, stale OK)
  • Shopping cart: QUORUM (consistent)
  • Account balance: ALL (critical accuracy)

Different consistency levels optimize different use cases within same system

Key-Value Stores

Key-Value Model - Dictionary at Database Scale

Fundamental abstraction:

PUT(key, value)  # Store
value = GET(key) # Retrieve
DELETE(key)      # Remove

Value is opaque blob:

  • Database does not interpret contents
  • No schema enforcement
  • No query language for value structure
  • Application handles serialization

Examples: Redis, Memcached, Riak, DynamoDB (key-value mode)

Why minimal structure:

  • O(1) lookup with hash function
  • Trivial horizontal partitioning
  • Minimal feature set reduces overhead
  • No query parser or optimizer

Contrast with relational:

  • Relational: Schema, types, constraints, indexes, query optimizer
  • Key-value: Hash table, bytes in/bytes out
  • Trade query capability for raw performance

Simple interface provides O(1) access with minimal overhead

Operations Beyond GET/PUT

Core operations:

# Basic
redis.set("key", "value")
redis.get("key")
redis.delete("key")

# Conditional update
redis.setnx("key", "value")  # SET if Not eXists
redis.set("key", "new", xx=True)  # SET if exists

# Atomic counter
redis.incr("counter")  # Atomic increment
redis.decr("counter")  # Atomic decrement

# Time to live
redis.setex("session:abc", 1800, data)  # 30 min TTL
redis.expire("key", 3600)  # Set TTL on existing key

Advanced (Redis):

# Sorted sets for ranking
redis.zadd("leaderboard", {"player1": 1000})
rank = redis.zrank("leaderboard", "player1")

# Key-pattern scans (not value queries)
keys = redis.keys("session:*")  # O(N), blocks server during scan
cursor, keys = redis.scan(0, match="session:*")  # Non-blocking iteration

Cannot do:

  • Query on value contents without retrieving
  • Secondary indexes
  • Joins across keys
  • Aggregations without application logic

Example session showing operations:

# Store session with 30-minute expiration
import redis
import json

r = redis.Redis()

session_data = {
    "user_id": 123,
    "login_time": "2025-01-15T10:30:00Z",
    "cart": ["product_456", "product_789"]
}

# Store
r.setex(
    "session:a3f9b2c1",
    1800,  # 30 minutes
    json.dumps(session_data)
)

# Retrieve
data = r.get("session:a3f9b2c1")
if data:
    session = json.loads(data)
    print(f"User {session['user_id']} logged in")
else:
    print("Session expired or not found")

# Extend TTL
r.expire("session:a3f9b2c1", 1800)

# After 30 minutes: automatic deletion
# No polling, no cleanup jobs required

Operations execute in microseconds

Must know exact key for retrieval

Session Management with Automatic Expiration

Web application session requirements:

  • Store per-user state (user_id, preferences, cart)
  • Expire after inactivity (30 minutes typical)
  • High volume (every HTTP request checks session)
  • Fast access (< 5ms latency requirement)

Relational approach problems:

CREATE TABLE sessions (
  session_id VARCHAR(64) PRIMARY KEY,
  user_id INT,
  data JSONB,
  created_at TIMESTAMP,
  last_activity TIMESTAMP
);

-- Every request
SELECT * FROM sessions
WHERE session_id = ?
  AND last_activity > NOW() - INTERVAL '30 minutes';

-- Cleanup job (runs every minute)
DELETE FROM sessions
WHERE last_activity < NOW() - INTERVAL '30 minutes';

Issues:

  • Cleanup job scans millions of rows
  • Index on last_activity, but still expensive
  • Delete locks table
  • Write amplification: Record activity on every request

Key-value with TTL:

# Store with automatic expiration
redis.setex(f"session:{token}", 1800, session_data)

# Retrieve (no expiration check needed)
data = redis.get(f"session:{token}")

# Extend session on activity
if data:
    redis.expire(f"session:{token}", 1800)

Performance:

  • Session check: < 1ms (in-memory)
  • Relational: 10-50ms (disk, index lookup, expiration check)
  • Automatic cleanup vs periodic scan

Value Serialization - Application Choice

Database stores bytes, application chooses format:

JSON - Human readable:

session = {
    "user_id": 123,
    "cart": ["product_456", "product_789"],
    "preferences": {"theme": "dark"},
    "login_time": "2025-01-15T10:30:00Z"
}
value = json.dumps(session)
# Size: 147 bytes
# Parse time: ~50 microseconds

MessagePack - Binary efficient:

import msgpack
value = msgpack.packb(session)
# Size: 89 bytes (39% smaller)
# Parse time: ~20 microseconds

Protocol Buffers - Typed schema:

message Session {
  int32 user_id = 1;
  repeated string cart = 2;
  map<string, string> preferences = 3;
  string login_time = 4;
}
value = session_pb.SerializeToString()
# Size: 68 bytes (54% smaller)
# Parse time: ~15 microseconds
# Schema evolution support

Trade-offs:

JSON:

  • Debug with redis-cli
  • Language interop trivial
  • Larger size
  • Slower parse

MessagePack:

  • Opaque in redis-cli
  • Smaller, faster
  • Language support good
  • No schema definition

Protocol Buffers:

  • Smallest, fastest
  • Schema migration paths
  • Requires .proto files
  • Build step required

Most applications: JSON for simplicity

High-throughput systems: Binary formats

Caching Database Queries - Read-Through Pattern

Product catalog query:

SELECT p.*, c.name AS category_name
FROM products p
  JOIN categories c ON p.category_id = c.category_id
WHERE p.product_id = 456;

Execution time: 45ms (JOIN, disk I/O)

Cache layer intercepts:

def get_product(product_id):
    cache_key = f"product:{product_id}"

    # Check cache
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)  # 1ms

    # Cache miss - query database
    product = db.execute("""
        SELECT p.*, c.name AS category_name
        FROM products p
        JOIN categories c ON p.category_id = c.category_id
        WHERE p.product_id = ?
    """, product_id)  # 45ms

    # Cache result for 5 minutes
    redis.setex(cache_key, 300, json.dumps(product))

    return product

Performance calculation:

  • Cache hit rate: 95%
  • Average latency: 0.95 × 1ms + 0.05 × 45ms = 3.2ms
  • vs without cache: 45ms
  • 14× faster average response

Database load reduction:

  • 1000 requests/second
  • Without cache: 1000 database queries/second
  • With 95% hit rate: 50 database queries/second
  • 20× reduction in database load

Cache absorbs read load

Database handles writes and cache misses

Distributed Caching with Consistent Hashing

Single cache server limit:

  • 64GB RAM = ~50M cached items
  • Network bandwidth: 10Gbps
  • Need horizontal scaling

Naive sharding breaks:

server_id = hash(key) % N  # N = number of servers

Adding server changes N → all keys remap

Consistent hashing solution:

  • Hash ring: 0 to 2³²-1
  • Servers and keys both hash to ring positions
  • Key stored on next clockwise server
  • Adding server: Only adjacent keys relocate (~1/N)

Example with 4 servers:

# Server positions on ring (hash of server ID)
servers = {
    "server1": hash("server1") % (2**32),  # 428312345
    "server2": hash("server2") % (2**32),  # 1827364829
    "server3": hash("server3") % (2**32),  # 2913746582
    "server4": hash("server4") % (2**32),  # 3728191028
}

# Key placement
key_hash = hash("product:456") % (2**32)  # 945628371
# Finds next server clockwise: server2

Adding server5:

  • New position: 2100000000
  • Only keys between server2 and server5 relocate
  • On average ~20% of keys move (1/N with N=5 servers)
  • The other ~80% remain cached

Hash slots distribute cache load without global invalidation
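A minimal ring in Python (sketch; production systems add virtual nodes so each server owns many small arcs):

import bisect
import hashlib

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class Ring:
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def server_for(self, key):
        positions = [p for p, _ in self.points]
        i = bisect.bisect_right(positions, h(key)) % len(self.points)
        return self.points[i][1]  # next server clockwise

ring = Ring(["server1", "server2", "server3", "server4"])
owner = ring.server_for("product:456")

# Adding server5 re-sorts the ring; only keys on the arc it takes over move
ring.points = sorted(ring.points + [(h("server5"), "server5")])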

Rate Limiting with Atomic Counters

API rate limit: 100 requests per minute per user

Wrong approach - Race condition:

# Read current count
count = int(redis.get(f"rate:{user_id}") or 0)

# Check limit
if count >= 100:
    return "Rate limit exceeded"

# Increment count
redis.set(f"rate:{user_id}", count + 1)
redis.expire(f"rate:{user_id}", 60)

Problem with concurrent requests:

Time    Request A           Request B
t0      Read count=99       Read count=99
t1      Check: 99 < 100     Check: 99 < 100
t2      Set count=100       Set count=100
t3      Allow request       Allow request

Both requests see count=99, both allowed → 101 requests

Correct approach - Atomic operation:

key = f"rate:{user_id}:{current_minute}"

# Atomic increment returns new value
count = redis.incr(key)

# Set TTL on first request
if count == 1:
    redis.expire(key, 60)

# Check after increment
if count > 100:
    return "Rate limit exceeded"

return "Request allowed"

Atomic increment guarantees correct count under concurrency

Single round-trip, no race conditions

Leaderboards with Sorted Sets

Gaming leaderboard requirements:

  • Rank 1 million players by score

  • Update scores frequently (every game completion)

  • Query operations:

    • Player rank: “What rank is player_id=12345?”
    • Top N: “Who are top 10 players?”
    • Score range: “Players with 1000-2000 points”

Redis sorted set:

import redis
r = redis.Redis()

# Update player score (O(log N))
r.zadd("leaderboard", {"player_12345": 1850})
r.zadd("leaderboard", {"player_67890": 2100})

# Get player rank - 0-indexed from highest score (O(log N))
rank = r.zrevrank("leaderboard", "player_12345")
# Returns: 43127 (meaning 43,128th place)

# Get top 10 players with scores (O(log N + 10))
top_10 = r.zrevrange("leaderboard", 0, 9, withscores=True)
# Returns: [("player_67890", 2100.0), ...]

# Get players in score range (O(log N + M))
players = r.zrangebyscore("leaderboard", 1000, 2000)

# Get player score (O(1))
score = r.zscore("leaderboard", "player_12345")

All operations logarithmic or better with 1M players

Internal structure: Skip list

Specialized data structure for ranking operations

Key-Value Limitations

Cannot query on value contents:

All sessions for user:

# Need to track session keys separately
redis.sadd(f"user_sessions:{user_id}", session_id)

# Or scan all keys (expensive)
cursor = 0
sessions = []
while True:
    cursor, keys = redis.scan(cursor, match="session:*")
    for key in keys:
        data = json.loads(redis.get(key))
        if data["user_id"] == target_user:
            sessions.append(data)
    if cursor == 0:
        break
# Scans entire keyspace, O(N) operation

Products under $20:

# Impossible without secondary index
# Must maintain separate sorted set
redis.zadd("products_by_price", {product_id: price})
cheap = redis.zrangebyscore("products_by_price", 0, 20)

Application must manually maintain indexes

No relationships:

# Get user and their orders - two queries
user = json.loads(redis.get(f"user:{user_id}"))
order_ids = json.loads(redis.get(f"user_orders:{user_id}"))
orders = [json.loads(redis.get(f"order:{oid}")) for oid in order_ids]

No JOIN operation

No multi-key transactions (usually):

Redis transactions limited:

# WATCH/MULTI/EXEC for optimistic locking (redis-py pipeline API)
pipe = redis.pipeline()
pipe.watch("account:A", "account:B")
balance_a = int(pipe.get("account:A"))
balance_b = int(pipe.get("account:B"))

pipe.multi()
pipe.set("account:A", balance_a - 100)
pipe.set("account:B", balance_b + 100)
result = pipe.execute()  # Fails if watched keys were modified

DynamoDB: 25 item transaction limit

When key-value is wrong:

  • Complex queries on attributes
  • Relationships between entities
  • Consistency across multiple items
  • Need secondary indexes on value fields

When key-value is right:

  • Known exact key for access
  • Simple GET/PUT operations dominate
  • High throughput, low latency required
  • Data can be denormalized per key
  • Temporary or cached data

Key-value complements relational, does not replace

Document Databases

Document Model - Structured Objects as Storage Unit

Document database stores structured objects as atomic units

Recall embed vs reference patterns from earlier. Document databases implement these patterns while understanding document structure.

Key-value limitation:

# Key-value: Opaque blob
redis.set("booking:123", json.dumps(booking_data))
value = redis.get("booking:123")
booking = json.loads(value)  # Application interprets

# Cannot query: "Find bookings for alice@example.com"
# Must scan all keys and deserialize

Document database capability:

// Database understands structure
db.bookings.find({
  "passenger.email": "alice@example.com"
})
// Direct query on nested field

Database parses JSON structure to support:

  • Queries on nested fields without deserialization
  • Indexes on any field, including nested
  • Aggregations across document contents
  • Structure validation

Primary implementations: MongoDB (BSON), CouchDB (JSON), DocumentDB

Document databases bridge gap between key-value flexibility and relational query power

MongoDB Collections and Documents

Collection: Group of related documents (analogous to relational table)

Document: Single JSON object with _id primary key

No enforced schema: Documents in same collection can have different fields

Booking collection example:

{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "booking_reference": "ABC123",
  "passenger_id": 123,
  "flight_id": 456,
  "seat": "12A",
  "booking_time": ISODate("2025-01-15T10:30:00Z"),
  "price": 450.00
}

MongoDB-specific types:

  • ObjectId: 12-byte unique identifier (timestamp + random)
  • ISODate: Native datetime type (UTC)
  • NumberLong, NumberDecimal: Typed numeric values
  • Embedded documents and arrays

Storage format: BSON (Binary JSON)

  • Binary encoding: More efficient than text JSON
  • Additional types: Date, Binary, Decimal128, ObjectId
  • Preserves field order (unlike JSON spec)
  • ~1.5× space overhead vs raw JSON for typical documents

Collection contains documents with varying fields

Querying Nested Fields - Dot Notation

Relational limitation: Accessing nested data required JOINs

Document structure with nesting:

{
  "booking_id": 12345,
  "passenger": {
    "name": "Alice Chen",
    "email": "alice@example.com",
    "frequent_flyer": {
      "number": "FF123456",
      "tier": "Gold",
      "miles": 85000
    }
  },
  "flight": {
    "number": "UA123",
    "departure_airport": "LAX",
    "scheduled_departure": ISODate("2025-02-15T14:30:00Z")
  }
}

Dot notation queries:

// Find bookings for specific email
db.bookings.find({
  "passenger.email": "alice@example.com"
})

// Query nested 3 levels deep
db.bookings.find({
  "passenger.frequent_flyer.tier": "Gold"
})

// Combine nested and top-level filters
db.bookings.find({
  "flight.departure_airport": "LAX",
  "flight.scheduled_departure": {
    $gte: ISODate("2025-02-01"),
    $lt: ISODate("2025-03-01")
  }
})

Index on nested field:

db.bookings.createIndex({"passenger.email": 1})

Query execution with index:

  • Nested field indexed like top-level field
  • B-tree lookup: O(log n)
  • 1M documents: ~2ms query time

Comparison to relational approach:

-- Requires 2 JOINs for same data
SELECT b.*
FROM bookings b
  JOIN passengers p ON b.passenger_id = p.passenger_id
  JOIN frequent_flyer ff ON p.passenger_id = ff.passenger_id
WHERE p.email = 'alice@example.com'
  AND ff.tier = 'Gold';

Performance measurement (1M bookings):

  • Document nested query: 2ms
  • Relational 2-JOIN query: 15ms
  • Difference: JOIN operation cost (hash join on 1M rows)

Why faster:

  • Document: Single index lookup + fetch
  • Relational: Index lookup + JOIN build + JOIN probe

Nested data stored contiguously eliminates JOIN cost

Querying Arrays - Flight Stops Example

Scenario: Multi-stop flights (LAX → DEN → JFK)

{
  "flight_id": 789,
  "flight_number": "UA456",
  "stops": [
    {
      "airport": "LAX",
      "arrival": null,
      "departure": ISODate("2025-02-15T08:00:00Z"),
      "gate": "B12"
    },
    {
      "airport": "DEN",
      "arrival": ISODate("2025-02-15T10:30:00Z"),
      "departure": ISODate("2025-02-15T11:00:00Z"),
      "gate": "A5"
    },
    {
      "airport": "JFK",
      "arrival": ISODate("2025-02-15T17:00:00Z"),
      "departure": null,
      "gate": "C8"
    }
  ]
}

Query any array element:

// Find all flights stopping at DEN
db.flights.find({
  "stops.airport": "DEN"
})
// Matches if ANY element has airport: "DEN"

Query specific array position:

// First stop must be LAX
db.flights.find({
  "stops.0.airport": "LAX"
})

Match array element with multiple conditions:

// Find flights with DEN stop, departure after 11:00
db.flights.find({
  "stops": {
    $elemMatch: {
      "airport": "DEN",
      "departure": {$gte: ISODate("2025-02-15T11:00:00Z")}
    }
  }
})

Array operators:

  • $elemMatch: Element meets multiple conditions
  • $all: Array contains all specified elements
  • $size: Array has specific length
  • $: Positional operator for updates

Array queries operate across all elements

Indexes on Nested Fields and Arrays

Creating indexes on document structure:

// Index on nested field (2 levels deep)
db.bookings.createIndex({"passenger.email": 1})

// Compound index across nesting levels
db.bookings.createIndex({
  "flight.departure_airport": 1,
  "flight.scheduled_departure": 1
})

Multikey indexes on arrays:

// Index on array field
db.flights.createIndex({"stops.airport": 1})

// Flight with 3 stops creates 3 index entries:
// "LAX" → flight_id: 789
// "DEN" → flight_id: 789
// "JFK" → flight_id: 789

Query execution:

// With index (fast)
db.bookings.find({"passenger.email": "alice@example.com"})
// Index seek: O(log n), ~2ms for 1M documents

// Without index (slow)
db.bookings.find({"passenger.phone": "+1-555-0123"})
// Collection scan: O(n), ~500ms for 1M documents

Index size impact:

  • Nested field indexes: Same cost as top-level
  • Multikey (array) indexes: Size = sum of array lengths across collection
  • 10K flights × 3 stops average = 30K index entries

MongoDB limit: 64 indexes per collection

Indexes work on any document field

Performance (1M documents): Indexed query: 2ms, Collection scan: 500ms

Aggregation Pipeline - Multi-Stage Processing

Query: “Average booking price by departure airport for last 30 days”

Relational approach requires JOIN:

SELECT f.departure_airport, AVG(b.price)
FROM bookings b
JOIN flights f ON b.flight_id = f.flight_id
WHERE b.booking_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY f.departure_airport
ORDER BY AVG(b.price) DESC
LIMIT 10;

MongoDB aggregation pipeline:

db.bookings.aggregate([
  // Stage 1: Filter to last 30 days
  {
    $match: {
      booking_time: {
        $gte: ISODate("2025-01-15"),
        $lt: ISODate("2025-02-15")
      }
    }
  },

  // Stage 2: Group by departure airport
  {
    $group: {
      _id: "$flight.departure_airport",
      avg_price: {$avg: "$price"},
      total_bookings: {$sum: 1},
      min_price: {$min: "$price"},
      max_price: {$max: "$price"}
    }
  },

  // Stage 3: Sort by average price descending
  {
    $sort: {avg_price: -1}
  },

  // Stage 4: Limit to top 10
  {
    $limit: 10
  }
])

Result:

[
  {
    "_id": "LAX",
    "avg_price": 487.50,
    "total_bookings": 2500,
    "min_price": 250.00,
    "max_price": 950.00
  },
  {
    "_id": "JFK",
    "avg_price": 465.25,
    "total_bookings": 2100,
    "min_price": 275.00,
    "max_price": 875.00
  }
]

Pipeline processes documents through stages on database server

Aggregation - Unwind for Array Processing

Problem: Analyze which airports have most stops (connection hubs)

Document with stops array:

{
  "flight_number": "UA456",
  "departure": "LAX",
  "arrival": "JFK",
  "stops": ["LAX", "DEN", "JFK"]
}

Pipeline with $unwind:

db.flights.aggregate([
  // Stage 1: Expand stops array
  {
    $unwind: "$stops"
  },
  // After $unwind, creates 3 documents:
  // {flight_number: "UA456", stops: "LAX"}
  // {flight_number: "UA456", stops: "DEN"}
  // {flight_number: "UA456", stops: "JFK"}

  // Stage 2: Group by airport
  {
    $group: {
      _id: "$stops",
      flight_count: {$sum: 1}
    }
  },

  // Stage 3: Sort by frequency
  {
    $sort: {flight_count: -1}
  },

  // Stage 4: Top 10 airports
  {
    $limit: 10
  }
])

Result:

[
  {"_id": "DEN", "flight_count": 2500},
  {"_id": "ORD", "flight_count": 2100},
  {"_id": "ATL", "flight_count": 1900},
  {"_id": "DFW", "flight_count": 1750}
]

$unwind document explosion:

Array expansion multiplies document count:

  • Input: 10K flight documents
  • Average 3 stops per flight
  • After $unwind: 30K documents in pipeline
  • Memory: 30K × 1KB = 30MB buffered

Optimization: filter before unwind

db.flights.aggregate([
  {$match: {departure_date: {$gte: ISODate("2025-02-01")}}},  // Filter first
  {$unwind: "$stops"},
  {$group: {_id: "$stops", flight_count: {$sum: 1}}}
])

Pipeline size:

  • Filter reduces 10K → 2K flights
  • Unwind creates 2K × 3 = 6K documents
  • 5× memory reduction vs unwind-first

Measured performance:

  • Unwind-first: 850ms, 30MB peak memory
  • Filter-first: 180ms, 6MB peak memory

$unwind before $match creates unnecessary work

Schema Validation - Optional Enforcement

Problem: No schema enforcement - errors undetected

  • Typo: emial instead of email
  • Wrong type: "123" instead of 123
  • Missing required field

MongoDB JSON Schema validation:

db.createCollection("bookings", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["booking_reference", "passenger_id",
                 "flight_id", "price"],
      properties: {
        booking_reference: {
          bsonType: "string",
          pattern: "^[A-Z]{3}[0-9]{3}$",
          description: "Format: ABC123"
        },
        passenger_id: {
          bsonType: "int",
          minimum: 1
        },
        price: {
          bsonType: "decimal",
          minimum: 0,
          maximum: 10000
        },
        passenger: {
          bsonType: "object",
          properties: {
            email: {
              bsonType: "string",
              pattern: "^.+@.+\\..+$"
            }
          }
        }
      }
    }
  },
  validationAction: "error"  // or "warn"
})

Validation behavior:

Valid insert succeeds:

db.bookings.insertOne({
  booking_reference: "ABC123",
  passenger_id: 123,
  flight_id: 456,
  price: NumberDecimal("450.00")
})
// Success

Invalid insert rejected:

db.bookings.insertOne({
  booking_reference: "INVALID",  // Wrong pattern
  passenger_id: "123",           // Wrong type
  price: -50.00                  // Below minimum
})                               // Also missing required flight_id
// Error: Document failed validation

Validation options:

  • validationAction: "error" - Reject invalid documents
  • validationAction: "warn" - Log warning, allow insert
  • validationLevel: "moderate" - Validate only new/updated documents

Trade-off:

  • Validation = Type safety, data quality
  • No validation = Maximum flexibility, faster writes
  • Can enable per-collection based on requirements

Validation provides safety without rigid schema
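
From application code the rejection surfaces as a write error; a hedged pymongo sketch (connection details and collection name illustrative):

from pymongo import MongoClient
from pymongo.errors import WriteError

bookings = MongoClient("mongodb://localhost:27017")["airline"]["bookings"]

try:
    bookings.insert_one({
        "booking_reference": "INVALID",  # fails ^[A-Z]{3}[0-9]{3}$
        "passenger_id": "123",           # wrong type (string, not int)
        "price": -50.0                   # below minimum
    })
except WriteError as exc:
    # Raised when validationAction is "error" and the document fails validation
    print("Rejected:", exc.details.get("errmsg"))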

Update Operations - Document Granularity

Updating nested fields:

// Update single nested field
db.bookings.updateOne(
  {booking_id: 12345},
  {$set: {"passenger.email": "alice.new@example.com"}}
)
// Only passenger.email changes, rest untouched

Updating array elements:

// Update element by position
db.flights.updateOne(
  {flight_id: 789},
  {$set: {"stops.1.departure": ISODate("2025-02-15T11:30:00Z")}}
)
// stops[1].departure updated

// Update element by condition ($ positional operator)
db.flights.updateOne(
  {flight_id: 789, "stops.airport": "DEN"},
  {$set: {"stops.$.departure": ISODate("2025-02-15T11:30:00Z")}}
)
// Updates first matching element

Update operators:

// Set field value
{$set: {field: value}}

// Remove field
{$unset: {field: ""}}

// Increment number
{$inc: {available_seats: -1}}

// Add to array
{$push: {tags: "premium"}}

// Remove from array
{$pull: {tags: "economy"}}

Atomicity guarantees:

Single document updates are atomic:

// Atomic: Decrement seats and add booking
db.flights.updateOne(
  {flight_id: 456, available_seats: {$gt: 0}},
  {
    $inc: {available_seats: -1},
    $push: {recent_bookings: booking_id}
  }
)
// Both operations succeed or both fail

Cross-document operations not atomic (without transactions):

// NOT atomic without transaction
db.flights.updateOne({flight_id: 456}, {$inc: {available_seats: -1}})
db.bookings.insertOne({flight_id: 456, ...})
// Second operation can fail, leaving inconsistent state

MongoDB 4.0+: Multi-document transactions

session = db.getMongo().startSession()
session.startTransaction()
try {
  session.getDatabase("airline").flights.updateOne(...)
  session.getDatabase("airline").bookings.insertOne(...)
  session.commitTransaction()
} catch (error) {
  session.abortTransaction()
}

Performance: Transactions add overhead (~2-3× slower)

Single-document atomicity sufficient for most embedded designs
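
The same pattern from Python, as a hedged sketch using pymongo sessions (requires a replica set; connection string and values illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client["airline"]

with client.start_session() as session:
    with session.start_transaction():
        db.flights.update_one(
            {"flight_id": 456, "available_seats": {"$gt": 0}},
            {"$inc": {"available_seats": -1}},
            session=session,
        )
        db.bookings.insert_one(
            {"flight_id": 456, "passenger_id": 123, "seat": "12A"},
            session=session,
        )
        # Commits on clean exit from the with-block; aborts if an exception escapes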

Document Size Limits and Denormalization Trade-offs

MongoDB document constraints:

  • Maximum document size: 16MB
  • Maximum nesting depth: 100 levels
  • No hard limit on array length (constrained by 16MB total)

When embedding breaks:

Flight with all bookings embedded:

{
  "flight_id": 456,
  "flight_number": "UA123",
  "departure_airport": "LAX",
  "bookings": [
    {...},  // 1KB per booking
    {...},  // 200 bookings = 200KB
    {...}   // Still under 16MB
  ]
}

Problems at scale:

  • Works for 200 bookings (200KB < 16MB)
  • Each booking addition rewrites entire 200KB document
  • Concurrent booking updates conflict (document-level locking)
  • 180-seat plane sold out: 180 document rewrites
  • Reading flight details loads unnecessary 200KB

When to reference instead of embed:

  • Unbounded growth (user’s order history over years)
  • Independent access patterns (query bookings without flight)
  • High update frequency (seat assignments change constantly)
  • Size exceeds reasonable threshold (>100KB per document)

Reference pattern (normalized approach):

// Flight document (small, stable)
{
  "flight_id": 456,
  "flight_number": "UA123",
  "departure_airport": "LAX",
  "arrival_airport": "JFK",
  "scheduled_departure": ISODate("2025-02-15T14:30:00Z"),
  "seats_available": 12,
  "total_seats": 180
}
// Size: 1KB

// Booking documents (separate collection)
{
  "booking_id": 789,
  "flight_id": 456,  // Reference
  "passenger_id": 123,
  "seat": "12A"
}
// 180 separate documents

Query patterns:

// Get flight: 1KB fetch
db.flights.findOne({flight_id: 456})

// Get bookings: Separate query
db.bookings.find({flight_id: 456})

// Add booking: Small update
db.bookings.insertOne({...})
db.flights.updateOne(
  {flight_id: 456},
  {$inc: {seats_available: -1}}
)

Reference when size or update patterns break embedding

MongoDB in Production - Replica Sets

Single server limitations:

  • Hardware failure → data loss
  • Maintenance → downtime
  • No geographic distribution
  • Read throughput limited to single machine

Replica set architecture:

  • Primary node: Handles all writes
  • Secondary nodes (2+): Replicate from primary
  • Automatic failover: Secondary elected if primary fails
  • Production minimum: 3 nodes (1 primary + 2 secondaries)

Write path:

// Application writes to primary
db.bookings.insertOne({...})

// Primary writes to oplog (operation log)
// Secondaries tail oplog and apply operations
// Replication asynchronous by default

Write concern controls durability:

// Wait for majority acknowledgment (2 of 3 nodes)
db.bookings.insertOne(
  {...},
  {writeConcern: {w: "majority", wtimeout: 5000}}
)
// Returns when 2 nodes confirm write
// If timeout: write may succeed but client gets error
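
Equivalent durability request from Python; a hedged pymongo sketch (hostnames and replica set name illustrative):

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
bookings = client["airline"].get_collection(
    "bookings",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
)

# Returns once a majority (2 of 3) of nodes acknowledge the write
bookings.insert_one({"flight_id": 456, "passenger_id": 123, "seat": "12A"})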

Replication lag:

  • Typical: < 100ms (local network)
  • Under heavy write load: Can reach seconds
  • Network partition: Unbounded until resolved

Document Database Limitations - Cross-Collection Queries

Problem: Analytics across referenced data

Using normalized reference pattern:

// Flights collection
{flight_id: 456, departure_airport: "LAX", ...}

// Bookings collection (separate)
{booking_id: 789, flight_id: 456, passenger_id: 123, price: 450}

// Passengers collection (separate)
{passenger_id: 123, email: "alice@example.com", ...}

Query: “Total revenue by departure airport”

Relational approach:

SELECT f.departure_airport, SUM(b.price)
FROM flights f
JOIN bookings b ON f.flight_id = b.flight_id
GROUP BY f.departure_airport;
-- Single query, executes in ~50ms

MongoDB $lookup (LEFT JOIN):

db.flights.aggregate([
  {$lookup: {
    from: "bookings",
    localField: "flight_id",
    foreignField: "flight_id",
    as: "bookings"
  }},
  {$unwind: "$bookings"},
  {$group: {
    _id: "$departure_airport",
    total_revenue: {$sum: "$bookings.price"}
  }}
])
// Executes in ~450ms vs ~50ms embedded

Performance difference:

$lookup implementation:

  • Nested loop join (no hash join optimization)
  • For each flight, scan bookings collection
  • 10K flights × 100K bookings = 1B comparisons
  • No query optimizer reordering

Workaround: Denormalize

// Store departure_airport in booking
{
  booking_id: 789,
  flight_id: 456,
  departure_airport: "LAX",  // Duplicated
  price: 450
}

// Now simple aggregation
db.bookings.aggregate([
  {$group: {
    _id: "$departure_airport",
    total_revenue: {$sum: "$price"}
  }}
])
// Executes in ~25ms

Cost of denormalization:

  • Airport code duplicated in 500K bookings
  • Flight route change: Update 500K documents
  • Acceptable if routes rarely change

Reference patterns sacrifice cross-collection query performance
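
The write-side cost of that duplication, sketched with pymongo (assumes the denormalized bookings collection above; values illustrative):

from pymongo import MongoClient

bookings = MongoClient("mongodb://localhost:27017")["airline"]["bookings"]

# Flight 456 re-routed: every booking that duplicated the airport code
# must be rewritten (one statement, but it touches every matching document)
result = bookings.update_many(
    {"flight_id": 456},
    {"$set": {"departure_airport": "SFO"}},
)
print(result.modified_count, "bookings rewritten")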

Document Database Limitations - Consistency Guarantees

Problem: Seat inventory management

Relational ACID transaction:

BEGIN TRANSACTION;
  -- Check seat available
  SELECT available_seats FROM flights
  WHERE flight_id = 456 FOR UPDATE;
  -- available_seats = 1

  -- Book seat
  INSERT INTO bookings (flight_id, passenger_id, seat)
  VALUES (456, 123, '12A');

  -- Decrement inventory
  UPDATE flights SET available_seats = available_seats - 1
  WHERE flight_id = 456;
COMMIT;

Two concurrent transactions:

  • Transaction A checks: 1 seat available
  • Transaction B checks: 1 seat available
  • Both proceed? No - FOR UPDATE lock prevents race

MongoDB without transactions:

// Check availability (race condition)
flight = db.flights.findOne({flight_id: 456})
if (flight.available_seats > 0) {
  // Another booking can execute here
  db.bookings.insertOne({flight_id: 456, ...})
  db.flights.updateOne(
    {flight_id: 456},
    {$inc: {available_seats: -1}}
  )
}

Race condition: Both bookings succeed, seats go negative

Solution: Atomic update with condition

result = db.flights.updateOne(
  {flight_id: 456, available_seats: {$gt: 0}},
  {
    $inc: {available_seats: -1},
    $push: {recent_bookings: booking_id}
  }
)

if (result.modifiedCount === 1) {
  // Success - seat was available and decremented
  db.bookings.insertOne({...})
} else {
  // Failed - no seats available
}

Works for single-document atomicity, but:

  • Booking still inserted after seat check
  • If second insert fails, seat already decremented
  • Inconsistent state possible

Multi-document transactions (MongoDB 4.0+):

session.startTransaction()
db.flights.updateOne({...}, {$inc: ...}, {session})
db.bookings.insertOne({...}, {session})
session.commitTransaction()

Performance cost:

  • Transactions 2-3× slower than single operations
  • Requires replica set coordination
  • Contention on hot documents

Document model assumes independent documents - cross-document consistency expensive

Wide-Column Stores

Wide-Column Model - Tables Without Fixed Schema

Sparse data problem:

  • Flight events: delay, diversion, emergency
  • Relational: Fixed columns, mostly NULL
  • 1M flights × 20 event columns = 19M NULL values (95%)

Wide-column approach:

  • Rows can have different columns
  • Store only non-NULL values
  • Columns grouped into column families
  • Physical storage: Column-oriented, not row-oriented

Examples: Apache Cassandra, HBase, Google Bigtable

Flight 12345 (delay event):

row_key: "flight_12345"
  standard: flight_number="UA123", departure="LAX", arrival="JFK"
  events: delay_reason="weather", delay_minutes=45

Flight 12346 (emergency event):

row_key: "flight_12346"
  standard: flight_number="AA456", departure="JFK", arrival="SFO"
  events: emergency_type="medical", diverted_airport="DEN"

Different columns per row, no NULL storage

Storage: 75% reduction by eliminating NULL columns

Data Model - Row Key, Column Families, Columns

Three-level hierarchy:

1. Row key:

  • Unique identifier
  • Determines partition (which nodes store data)
  • Sorted lexicographically
  • Example: sensor:42000:2025-01-15

2. Column family:

  • Logical grouping of related columns
  • Defined at table creation
  • Physical storage unit: columns in family stored together
  • Examples: metadata, readings, events

3. Column:

  • Key-value pair within family
  • Column name is part of the data (not schema)
  • Each row can have different columns
  • Example: readings:temperature=22.5

Physical storage example:

row_key="sensor:42000:2025-01-15T10:00"
  metadata:location = "Building A, Floor 3"
  metadata:sensor_type = "indoor_temp"
  readings:temperature = 22.5
  readings:humidity = 45.0
  readings:timestamp = "2025-01-15T10:00:00Z"

Different sensor might have:

row_key="sensor:42001:2025-01-15T10:00"
  metadata:location = "Building B, Floor 1"
  metadata:sensor_type = "motion"
  readings:x_accel = 0.05
  readings:y_accel = 0.12
  readings:z_accel = 9.81

Column families provide structure, columns provide flexibility

Time-Series Data - Natural Fit for Wide-Column

Scenario: Server monitoring metrics

  • 10,000 servers
  • Metrics every 10 seconds: CPU, memory, disk I/O, network
  • Sparse: I/O metrics only when disk active (~30% of time)
  • Retention: 90 days

Data volume:

  • 10,000 servers × 6 samples/minute = 60,000 writes/minute
  • 86.4M samples/day
  • 90 days = 7.78B samples

Row key design: server_id:timestamp

server_042:2025-01-15T10:30:00
server_042:2025-01-15T10:30:10
server_042:2025-01-15T10:30:20

Lexicographic sort → time-adjacent data stored adjacently

Query: “Server 42, last hour”

Start: server_042:2025-01-15T09:30:00
End:   server_042:2025-01-15T10:30:00

Sequential read of 360 consecutive row keys

Column families:

system: cpu_percent=45.2, memory_mb=8192, uptime_sec=345600
disk:   read_iops=120, write_iops=80, queue_depth=3
        (only when disk active)
network: rx_bytes=1048576, tx_bytes=524288, connections=42

Time-ordered row keys enable efficient range queries

Storage efficiency (1M samples): Relational with NULLs: 1000 MB, Wide-Column sparse: 400 MB (60% reduction)

Cassandra Architecture - Distributed by Design

Peer-to-peer: All nodes equal

  • No primary-replica distinction
  • Any node can handle any request
  • Consistent hashing distributes row keys

Partitioning:

  • Row key hashed to token (range -2^63 to 2^63-1 with the default Murmur3 partitioner)
  • Token ring: Nodes own token ranges
  • Replication factor N=3: Each row stored on 3 nodes

Write path:

  1. Client sends write to any node (coordinator)

  2. Coordinator determines replicas from hash(row_key)

  3. Write sent to N=3 replica nodes

  4. Coordinator waits based on consistency level:

    • ONE: Wait for 1 acknowledgment (5ms)
    • QUORUM: Wait for 2 of 3 (10ms, parallel writes)
    • ALL: Wait for all 3 (30ms)

Write durability on each node:

Write arrives
  ↓
Commit log (sequential append, durable)
  ↓
Memtable (in-memory, fast)
  ↓
[Background: Flush memtable → SSTable on disk]

Commit log ensures no data loss if node crashes

Read path:

  • Query replicas (number based on consistency)
  • If replicas disagree: Return latest (by timestamp)
  • Background read repair reconciles replicas

Hash determines primary node, N consecutive nodes store replicas

Compound Primary Key - Partition and Clustering

Primary key = (partition key, clustering columns)

Partition key: Determines node placement (hash distribution)

Clustering columns: Sort order within partition (range queries)

Table definition:

CREATE TABLE sensor_readings (
    sensor_id INT,
    timestamp TIMESTAMP,
    temperature FLOAT,
    humidity FLOAT,
    PRIMARY KEY (sensor_id, timestamp)
);

  • sensor_id = partition key (which node)
  • timestamp = clustering column (sort order)

Physical layout on Node 3:

Partition: sensor_id=42000
  timestamp=2025-01-15T10:00:00, temp=22.5, humidity=45
  timestamp=2025-01-15T10:01:00, temp=22.6, humidity=46
  timestamp=2025-01-15T10:02:00, temp=22.4, humidity=45
  ... (sorted by timestamp)

Partition: sensor_id=42001
  timestamp=2025-01-15T10:00:00, temp=19.5, humidity=50
  ...

Efficient query (uses partition key + clustering):

SELECT * FROM sensor_readings
WHERE sensor_id = 42000
  AND timestamp >= '2025-01-15T10:00:00'
  AND timestamp < '2025-01-15T11:00:00';

Single node, sequential read of 60 rows

Inefficient query (missing partition key):

SELECT * FROM sensor_readings
WHERE timestamp > '2025-01-15T10:00:00';

Must query ALL nodes, scatter-gather

Query cost: O(1) partition lookup + O(log M + K) within the partition (M rows in partition, K rows returned) vs scatter-gather across all N nodes
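
A hedged sketch with the DataStax Python driver (cassandra-driver package; contact points and the "monitoring" keyspace are illustrative) showing a QUORUM write plus the partition-and-range read above:

from datetime import datetime
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["node1", "node2", "node3"]).connect("monitoring")

insert = SimpleStatement(
    "INSERT INTO sensor_readings (sensor_id, timestamp, temperature, humidity) "
    "VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,  # wait for 2 of 3 replicas
)
session.execute(insert, (42000, datetime(2025, 1, 15, 10, 0), 22.5, 45.0))

# Partition key + clustering range: single node, sequential read
rows = session.execute(
    "SELECT * FROM sensor_readings WHERE sensor_id = %s "
    "AND timestamp >= %s AND timestamp < %s",
    (42000, datetime(2025, 1, 15, 10, 0), datetime(2025, 1, 15, 11, 0)),
)
for row in rows:
    print(row.timestamp, row.temperature)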

Denormalization Required - No JOINs

Cassandra has no JOIN operation

Data for query must exist in single partition

Relational approach (JOINs):

-- users table
CREATE TABLE users (
  user_id INT PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100)
);

-- activities table
CREATE TABLE activities (
  activity_id INT PRIMARY KEY,
  user_id INT,
  activity_type VARCHAR(50),
  timestamp TIMESTAMP,
  details TEXT
);

-- Query with JOIN
SELECT a.*, u.name
FROM activities a
JOIN users u ON a.user_id = u.user_id
WHERE a.timestamp > NOW() - INTERVAL '24 hours';

Cassandra approach (denormalized):

CREATE TABLE user_activities (
    user_id INT,
    timestamp TIMESTAMP,
    activity_type TEXT,
    user_name TEXT,     -- Denormalized
    user_email TEXT,    -- Denormalized
    details TEXT,
    PRIMARY KEY (user_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

Query:

SELECT * FROM user_activities
WHERE user_id = 12345
  AND timestamp > '2025-01-15T00:00:00';

Single partition, no JOIN

Update cost:

  • User name changes → update all activity records
  • User with 10,000 activities: 10,000 updates
  • Trade-off: Query simplicity for update complexity

Acceptable when reads >> writes (activity log read 1000×, name changes rarely)

Multiple Query Patterns - Multiple Tables

Cannot optimize single table for multiple partition keys

Solution: Duplicate data in tables with different primary keys

Query pattern 1: Posts by user (timeline)

CREATE TABLE posts_by_user (
    user_id INT,
    post_timestamp TIMESTAMP,
    post_id UUID,
    content TEXT,
    likes INT,
    PRIMARY KEY (user_id, post_timestamp)
) WITH CLUSTERING ORDER BY (post_timestamp DESC);

-- Efficient: Single partition
SELECT * FROM posts_by_user
WHERE user_id = 12345
LIMIT 20;

Query pattern 2: Posts by tag (search)

CREATE TABLE posts_by_tag (
    tag TEXT,
    post_timestamp TIMESTAMP,
    post_id UUID,
    user_id INT,      -- Duplicated
    content TEXT,     -- Duplicated
    likes INT,        -- Duplicated
    PRIMARY KEY (tag, post_timestamp)
) WITH CLUSTERING ORDER BY (post_timestamp DESC);

-- Efficient: Single partition
SELECT * FROM posts_by_tag
WHERE tag = 'nosql'
LIMIT 20;

Write path:

  • Application writes to both tables
  • Post with 5 tags: 1 write to posts_by_user + 5 writes to posts_by_tag
  • Total: 6 writes for 1 post

Storage: 6× duplication for 5-tag post

Trade: Storage and write complexity for query performance
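
A hedged application-side sketch of the fan-out write (Python cassandra-driver; the "social" keyspace and values are illustrative):

import uuid
from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(["node1"]).connect("social")

post_id = uuid.uuid4()
ts = datetime.utcnow()
tags = ["nosql", "cassandra"]

batch = BatchStatement()
batch.add(
    "INSERT INTO posts_by_user (user_id, post_timestamp, post_id, content, likes) "
    "VALUES (%s, %s, %s, %s, %s)",
    (12345, ts, post_id, "Denormalize for reads", 0),
)
for tag in tags:
    batch.add(
        "INSERT INTO posts_by_tag (tag, post_timestamp, post_id, user_id, content, likes) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (tag, ts, post_id, 12345, "Denormalize for reads", 0),
    )

# 1 + len(tags) inserts applied together as a logged batch
session.execute(batch)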

TTL - Automatic Data Expiration

Time-series data has natural lifecycle

Old data less valuable, must be removed

Manual deletion: Expensive

DELETE FROM sensor_readings
WHERE timestamp < '2025-01-15' - INTERVAL '90 days';

Scans entire table, expensive on billions of rows

TTL (time to live): Automatic expiration

INSERT INTO sensor_readings
  (sensor_id, timestamp, temperature, humidity)
VALUES (42000, '2025-01-15T10:00:00', 22.5, 45)
USING TTL 7776000;  -- 90 days in seconds

-- After 90 days: Row automatically deleted

Implementation:

  • TTL stored as metadata with each column
  • Compaction removes expired data during background merge
  • Tombstone created at expiration (soft delete)
  • No read-time overhead: Expired data not returned

Use cases:

  • Monitoring data: 90-day retention
  • Session data: 30-minute expiration
  • Cache data: 5-minute expiration
  • Rate limiting counters: Hourly expiration

Storage calculation:

  • 10,000 sensors × 8,640 readings/day = 86.4M rows/day
  • Without TTL: Unbounded growth
  • With 90-day TTL: Steady state at 7.78B rows

Graph Databases

Graph Model - Nodes, Edges, Properties

Social network queries require expensive self-JOINs:

-- Friends-of-friends
SELECT DISTINCT u.name
FROM connections c1
  JOIN connections c2 ON c1.friend_id = c2.user_id
  JOIN users u ON c2.friend_id = u.id
WHERE c1.user_id = 123;

Graph model stores relationships directly:

  • Nodes: Entities with properties
  • Edges: Relationships between nodes, can have properties
  • Query: Traverse edges, no JOIN operations

Example:

  • Node: Person {id: 123, name: “Alice”, city: “Boston”}
  • Edge: (Alice)-[:KNOWS {since: 2020}]->(Bob)
  • Query “friends of friends” = follow 2 edges

Native graph databases store adjacency (pointers between nodes), not foreign keys

Path Traversal - Friends of Friends in Neo4j

Query: Find friends-of-friends who live in Boston

Cypher (Neo4j query language):

MATCH (me:Person {id: 123})-[:KNOWS]->(friend)
      -[:KNOWS]->(fof:Person)
WHERE fof.city = 'Boston'
  AND fof.id <> 123
RETURN fof.name, fof.city

Pattern matching: (node)-[edge]->(node) describes path structure

Relational equivalent:

SELECT DISTINCT u.name, u.city
FROM connections c1
  JOIN connections c2 ON c1.friend_id = c2.user_id
  JOIN users u ON c2.friend_id = u.id
WHERE c1.user_id = 123
  AND u.city = 'Boston'
  AND c2.friend_id <> 123;

Performance difference:

  • Relational: O(n²) - must JOIN entire connections table twice
  • Graph: O(degree²) - follows edges from specific node

For node with 100 friends, each friend has 100 friends:

  • Graph: Explores ~10,000 nodes
  • SQL: Must process millions of rows (even with indexes)

Graph complexity depends on local structure (node degree), not global table size
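
A hedged sketch of the same traversal with the official neo4j Python driver (URI and credentials illustrative):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (me:Person {id: $id})-[:KNOWS]->()-[:KNOWS]->(fof:Person)
WHERE fof.city = 'Boston' AND fof.id <> $id
RETURN DISTINCT fof.name AS name
"""

with driver.session() as session:
    for record in session.run(query, id=123):
        print(record["name"])

driver.close()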

Variable-Length Paths - Connection Distance

Query: How many hops separate two people? Find shortest path.

Cypher:

MATCH path = shortestPath(
  (a:Person {id: 123})-[:KNOWS*]-(b:Person {id: 789})
)
RETURN length(path)

  • [:KNOWS*] matches 1 or more KNOWS edges
  • Finds shortest connection: 2 hops, 3 hops, etc.
  • Bidirectional breadth-first search from both ends

Return path details:

RETURN [node IN nodes(path) | node.name] AS connection_path
-- Result: ["Alice", "Bob", "Carol", "Dave"]

SQL equivalent:

  • Requires recursive CTE (PostgreSQL, Oracle)
  • Limited depth, complex syntax
  • Poor performance on large graphs

WITH RECURSIVE paths AS (
  SELECT user_id, friend_id, 1 AS depth,
         ARRAY[user_id, friend_id] AS path
  FROM connections
  WHERE user_id = 123
  UNION ALL
  SELECT p.user_id, c.friend_id, p.depth + 1,
         p.path || c.friend_id
  FROM paths p
  JOIN connections c ON p.friend_id = c.user_id
  WHERE p.depth < 6
    AND NOT c.friend_id = ANY(p.path)
)
SELECT MIN(depth) FROM paths WHERE friend_id = 789;

Use case: LinkedIn shows connection path and degree of separation

Recommendation - Collaborative Filtering

Problem: Recommend products based on friend purchases

Query: Products bought by friends-of-friends but not by me or my direct friends

MATCH (me:Person {id: 123})-[:KNOWS*1..2]->(friend)
MATCH (friend)-[:PURCHASED]->(product:Product)
WHERE NOT (me)-[:PURCHASED]->(product)
  AND NOT (me)-[:KNOWS]->()-[:PURCHASED]->(product)
RETURN product.name, COUNT(friend) as recommendation_score
ORDER BY recommendation_score DESC
LIMIT 10

Query breakdown:

  • [:KNOWS*1..2] = 1 or 2 hops (friends + friends-of-friends)
  • Find products purchased by anyone in that set
  • Filter out: products I’ve bought
  • Filter out: products my direct friends bought
  • Count recommendations from multiple sources

Performance:

  • Traverses ~10,000 relationships (100 friends × 100 fof)
  • Filters and aggregates during traversal
  • Returns top 10 products
  • Completes in ~50ms

Relational equivalent:

  • 3 self-JOINs on connections table
  • JOIN with purchases table
  • Multiple subqueries for filtering
  • Scans millions of rows
  • Completes in ~5 seconds

Results:

  • Product A: 3 friends-of-friends purchased → score 3
  • Product B: 2 friends-of-friends purchased → score 2

Relational requires 3 self-JOINs + product JOIN + filtering subqueries

Knowledge Graphs - Multi-Type Relationships

Movie database with heterogeneous relationships:

Schema (property graph model):

Nodes:
  Person (name, birth_year)
  Movie (title, year, rating)
  Genre (name)

Edges:
  (Person)-[:ACTED_IN {role: "Batman"}]->(Movie)
  (Person)-[:DIRECTED]->(Movie)
  (Movie)-[:BELONGS_TO_GENRE]->(Genre)

Query: Actors who worked with directors who directed Sci-Fi movies

MATCH (actor:Person)-[:ACTED_IN]->(m1:Movie)
  <-[:DIRECTED]-(director:Person)
MATCH (director)-[:DIRECTED]->(m2:Movie)
  -[:BELONGS_TO_GENRE]->(g:Genre {name: 'Sci-Fi'})
WHERE m1 <> m2
RETURN DISTINCT actor.name, director.name, m2.title

Query breakdown:

  • Find actor-director collaborations
  • Find other movies directed by same director
  • Filter for Sci-Fi genre
  • Different relationship types (ACTED_IN, DIRECTED, BELONGS_TO_GENRE)

Relational equivalent:

  • 6 tables: people, movies, genres, movie_actors, movie_directors, movie_genres
  • 4 JOINs with filtering on relationship types
  • Complex foreign key navigation

Heterogeneous graphs: Multiple node types, multiple edge types with different semantics

Property Graph Storage - Index-Free Adjacency

Native graph databases store direct pointers between nodes:

Conceptual storage structure:

Node 123 (Alice):
  properties: {name: "Alice", city: "Boston"}
  outgoing_edges: [ptr→456, ptr→789]
  incoming_edges: [ptr→234]

Node 456 (Bob):
  properties: {name: "Bob", city: "NYC"}
  outgoing_edges: [ptr→123, ptr→999]
  incoming_edges: [ptr→123, ptr→555]

Traversal cost:

  • Follow edge: O(1) pointer dereference
  • No index lookup required
  • No JOIN operation
  • Cache-efficient: Related nodes stored nearby on disk

Relational comparison:

  • JOIN requires: Index lookup O(log n) or table scan O(n)
  • Even with B-tree indexes: Multiple random disk seeks
  • Each hop: Seek to index, seek to data
  • Graph: Sequential read following pointers

Physical layout:

  • Nodes clustered by locality
  • Edges stored with source node
  • Adjacency list per node
  • Disk reads proportional to result size, not table size

Graph databases optimize for traversal, relational optimizes for joins

Graph Query Complexity - Local vs Global

Fast queries (local traversal):

Performance independent of graph size, depends on local structure:

  • “Friends of person X”: O(degree)

    • Node with 100 friends: examine 100 nodes
    • Graph size irrelevant
  • “Friends-of-friends of X”: O(degree²)

    • 100 friends, each with 100 friends: examine 10,000 nodes
    • Still independent of total graph size
  • “Shortest path between A and B”: grows exponentially with hops, ~O(degree^(hops/2)) per side with bidirectional search

    • 6 degrees of separation, degree 100: explore ~100³ = 1,000,000 nodes from each end
    • But only if path exists

Slow queries (global aggregation):

Must scan entire graph:

  • “Count all nodes”: O(n)
  • “Find all nodes with city = ‘Boston’”: O(n) without index
  • “Average node degree”: O(n + m) where m = edges

Performance example (1M node graph):

  • “Friends of Alice”: Examines 100 nodes (0.01% of graph)
  • “All people named Alice”: Examines 1M nodes (100% of graph)

When to use graph databases:

  • Relationship queries dominate workload
  • Traversal depth varies (1-6 hops)
  • Local neighborhood matters more than global statistics

Graph databases excel at local traversal, not global aggregation

Fraud Detection - Connected Account Rings

Scenario: Credit card fraud detection - identify accounts sharing resources

Data model:

Nodes:
  Account (account_id, status)
  Device (device_fingerprint)
  IPAddress (ip)
  PhysicalAddress (street, city)

Edges:
  (Account)-[:USED_DEVICE]->(Device)
  (Account)-[:ORIGINATED_FROM]->(IPAddress)
  (Account)-[:REGISTERED_AT]->(PhysicalAddress)

Fraud detection query:

Find accounts connected to flagged account through shared resources (2 hops):

MATCH (suspicious:Account {flagged: true})
MATCH (suspicious)-[:USED_DEVICE|ORIGINATED_FROM
                    |REGISTERED_AT*1..2]-(related:Account)
WHERE related.account_id <> suspicious.account_id
RETURN related.account_id,
       COUNT(*) as connection_strength
ORDER BY connection_strength DESC

Pattern detected:

  • Suspicious account shares device with 3 accounts
  • Those 3 accounts all share same IP address
  • Flag entire ring for review

Timing:

  • Graph: 50ms to explore 2-hop neighborhood
  • Relational: 5s for equivalent multi-JOIN query

Relational approach:

  • 7 tables: accounts, devices, account_devices, ips, account_ips, addresses, account_addresses
  • Complex multi-JOIN to find shared resources
  • Adding new connection type requires schema change

Graph Database Limitations

Inefficient without indexes:

Global aggregations require full graph scan:

// Count all users - O(n)
MATCH (u:User) RETURN COUNT(u)

// Average friend count - O(n + m)
MATCH (u:User)-[:KNOWS]->(f)
WITH u, COUNT(f) AS friend_count
RETURN AVG(friend_count)

// Find by property without index - O(n)
MATCH (u:User {email: 'alice@example.com'})
RETURN u

Without property index: scans all User nodes

Large traversal results:

Small world problem - 6 degrees reaches most of graph:

// All people within 6 hops - returns millions
MATCH (me:User {id: 123})-[:KNOWS*1..6]-(connected)
RETURN COUNT(DISTINCT connected)

Kevin Bacon number: 6 hops connects to entire Hollywood

Limited transactions:

  • Neo4j: Full ACID transactions supported
  • Distributed graphs (JanusGraph, Neptune): Eventually consistent
  • Cross-partition transactions limited

Write hotspots:

Celebrity node with 10M followers:

  • Every follower edge stored on celebrity node
  • Hot spot for writes and reads
  • Sharding difficult - high-degree nodes don’t partition well

Graph databases optimize for relationship traversal, not aggregation

Polyglot Persistence

Single Database Assumption Breaks Under Heterogeneous Workload

Traditional architecture: Single PostgreSQL database

Handles everything:

  • User accounts, transactions, sessions
  • Social connections, recommendations
  • Logs, analytics queries

Works at small scale (thousands of users)

Observed failure at 100K users:

  • Session lookups (10K req/sec) compete with order writes (100 req/sec)
  • Friends-of-friends graph queries lock tables during analytics scans
  • Log inserts (1M/day) bloat primary database
  • Index maintenance slows all operations

Not a scale problem - a workload mismatch problem:

  • Shopping cart read: Needs <5ms latency, key-value lookup
  • Order write: Needs ACID transaction, relational integrity
  • Product recommendation: Needs graph traversal, relationship queries
  • Analytics query: Scans entire order history, full table aggregation

PostgreSQL optimized for transactions, not for all four patterns simultaneously

Query patterns interfere when forced through single storage model

E-commerce Platform - Data Flow Across Stores

Production system: 500K users, 50K orders/day

User completes checkout:

PostgreSQL (ACID transaction):

BEGIN;
  INSERT INTO orders (order_id, user_id, total, timestamp)
  VALUES (12345, 789, 99.99, NOW());

  INSERT INTO order_items (order_id, product_id, quantity, price)
  VALUES (12345, 101, 2, 49.99);

  UPDATE products SET inventory = inventory - 2
  WHERE product_id = 101 AND inventory >= 2;

  -- If inventory check fails, entire transaction rolls back
COMMIT;

Latency: 50ms | Consistency: Immediate (ACID)

Redis (ephemeral data):

# Clear shopping cart (no longer needed)
redis.delete(f"cart:{user_id}")

Latency: 1ms | Consistency: Immediate (single key)

Neo4j (graph relationship - async):

// Seconds later, from background worker
MATCH (u:User {id: 789}), (p:Product {id: 101})
CREATE (u)-[:PURCHASED {timestamp: 1234567890}]->(p)

Latency: 20ms | Consistency: Eventual (5 second delay acceptable)

Application routes operations to appropriate database

Caching Layer Reduces Primary Database Load

Problem: Product pages receive 5K views/minute

Each view queries PostgreSQL:

SELECT p.name, p.price, p.inventory,
       c.name as category,
       array_agg(i.url) as images
FROM products p
  JOIN categories c ON p.category_id = c.id
  JOIN product_images i ON p.id = i.product_id
WHERE p.id = 12345
GROUP BY p.id, c.name;

Load: 5K queries/minute = 83 queries/second to PostgreSQL

Product data changes rarely (once per day)

Read-through cache pattern:

def get_product_details(product_id):
    cache_key = f"product:{product_id}"

    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)  # 1ms cache hit

    # Cache miss - query database
    product = postgres.query(query_string, product_id)  # 50ms

    # Cache for 1 hour
    redis.setex(cache_key, 3600, json.dumps(product))
    return product

Measured performance:

  • Cache hit rate: 95%
  • Average latency: 0.95 × 1ms + 0.05 × 50ms = 3.45ms
  • PostgreSQL load: 5K × 0.05 = 250 queries/minute
  • Database load reduced 20×

Analytics queries run on read replica:

  • Revenue reports scan millions of rows
  • Run on replica: Primary unaffected
  • Replica lags 5-10 seconds (acceptable for reports)

Data Synchronization Without Distributed Transactions

Problem: User changes email address

Must update:

  • PostgreSQL users table (primary copy)
  • Neo4j User node (has email property for queries)

No distributed transaction across databases

Approach 1: Dual write

def update_email(user_id, new_email):
    # PostgreSQL first (source of truth)
    postgres.execute(
        "UPDATE users SET email=? WHERE id=?",
        new_email, user_id
    )

    # Neo4j second
    try:
        neo4j.run(
            "MATCH (u:User {id: $uid}) SET u.email = $email",
            uid=user_id, email=new_email
        )
    except Exception as e:
        log.error(f"Neo4j update failed: {e}")
        retry_queue.push({
            "user_id": user_id,
            "email": new_email
        })

Failure scenario:

  • PostgreSQL succeeds at t=0
  • Neo4j fails (network partition)
  • Retry succeeds at t=5
  • Neo4j shows stale email for 5 seconds
  • Acceptable for social graph queries

Approach 2: Event stream

# Application writes to PostgreSQL only
def update_email(user_id, new_email):
    postgres.execute("UPDATE users SET email=? WHERE id=?")
    kafka.publish("user.updated", {
        "user_id": user_id, "email": new_email
    })

# Separate consumer updates Neo4j
for event in kafka.consume("user.updated"):
    neo4j.run("MATCH (u:User {id: $uid}) SET u.email = $email")

Operational complexity:

  • Monitor: PostgreSQL metrics ≠ Redis metrics ≠ Neo4j metrics
  • Backups: Each database independent
  • Upgrades: PostgreSQL 14→15, Redis 6→7 separately

Polyglot Persistence in Production Architecture

Social e-commerce platform: 500K users, 50K orders/day

PostgreSQL (ACID transactions):

  • Tables: users, orders, order_items, products, inventory
  • Write: INSERT orders, UPDATE inventory atomically
  • Read: Point lookups by ID (indexed)
  • Query latency: 50ms (JOIN queries)
  • Load: 500 transactions/minute
  • Consistency: Immediate (ACID guarantees)

Redis (ephemeral cache):

  • Data: sessions, shopping carts, product cache
  • Write: SET with TTL, DELETE on invalidation
  • Read: GET by exact key
  • Query latency: 1ms
  • Load: 10K requests/second
  • Consistency: Single-key immediate

Neo4j (graph traversal):

  • Data: (User)-[:KNOWS]->(User), (User)-[:PURCHASED]->(Product)
  • Write: CREATE edge (async from PostgreSQL)
  • Read: MATCH paths, 2-3 hop traversal
  • Query latency: 20ms (friends-of-friends)
  • Load: 100 traversals/minute
  • Consistency: Eventual (5 second lag acceptable)

PostgreSQL read replica (analytics):

  • Same data as primary
  • Replication lag: 5-10 seconds
  • Query: Full table scans, aggregations
  • Query latency: 30 seconds (acceptable for reports)
  • Load: 10 reports/hour

Trade-off measured:

  • Performance: 3.45ms average page load
  • vs 200ms with single PostgreSQL
  • Complexity: 4 systems to operate
  • Team: Managed services or in-house expertise

Migration pattern observed:

  • Start: Single PostgreSQL
  • Pain: Session queries block transactions
  • Add: Redis for sessions (quick win)
  • Later: Neo4j for recommendations (higher value justifies complexity)

Not about database choice - about workload isolation

DynamoDB - Managed Wide-Column at Scale

DynamoDB - AWS Implementation of Wide-Column Model

DynamoDB: AWS managed database service that follows wide-column data model

Amazon’s fully managed database service:

  • No cluster management (contrast with Cassandra 5-node setup)
  • Automatic scaling based on traffic
  • Built-in multi-region replication
  • Pay-per-request pricing: $1.25 per million reads, $6.25 per million writes

Production usage:

  • Amazon.com: Shopping cart (50M requests/sec peak)
  • Lyft: Trip and location data
  • Snapchat: User stories storage

Trade-off vs Cassandra:

  • Managed service abstracts distribution complexity
  • Limited query flexibility (no custom partition strategies)
  • Trade: operational simplicity in exchange for query constraints

Cassandra: Operator manages cluster. DynamoDB: AWS manages distribution.

Table Structure - Partition Key and Sort Key

Primary key structure determines data distribution and query patterns

Partition key (required):

  • Hash determines which partition stores item
  • Same concept as Cassandra partition key
  • Distribution: hash(partition_key) mod num_partitions

Sort key (optional):

  • Orders items within partition
  • Supports range queries on sort key
  • Physical storage: Items sorted by sort key on disk

Composite primary key = (partition key, sort key)

Sensor monitoring table:

{
    "TableName": "sensor_readings",
    "KeySchema": [
        {"AttributeName": "sensor_id", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"}
    ],
    "AttributeDefinitions": [
        {"AttributeName": "sensor_id", "AttributeType": "N"},
        {"AttributeName": "timestamp", "AttributeType": "N"}
    ]
}

Items for sensor_id=42000:

{"sensor_id": 42000, "timestamp": 1704034800, "temperature": 22.5}
{"sensor_id": 42000, "timestamp": 1704034810, "temperature": 22.6}
{"sensor_id": 42000, "timestamp": 1704034820, "temperature": 22.4}

All stored in same partition, sorted by timestamp

Partition key determines storage location, sort key supports range queries
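
A hedged boto3 sketch creating this table (region and billing mode are illustrative choices):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-west-2")

table = dynamodb.create_table(
    TableName="sensor_readings",
    KeySchema=[
        {"AttributeName": "sensor_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "timestamp", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "sensor_id", "AttributeType": "N"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()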

Item Structure - Sparse Storage

Items in same table can have different attributes

Wide-column stores only non-NULL values. DynamoDB implements this - no wasted storage on missing attributes.

Flight events with varying attributes:

Delay event:

{
    "flight_id": 12345,
    "event_time": 1704034800,
    "event_type": "delay",
    "delay_minutes": 45,
    "delay_reason": "weather"
}

Emergency event:

{
    "flight_id": 12346,
    "event_time": 1704038400,
    "event_type": "emergency",
    "diverted_airport": "DEN",
    "emergency_type": "medical"
}

Normal departure (no special attributes):

{
    "flight_id": 12347,
    "event_time": 1704042000,
    "event_type": "departure"
}

Storage efficiency measurement (1M flight events):

  • Relational approach: 20 event columns × 1M rows = 20M cells

    • 95% events have no special attributes → 19M NULL values
    • Storage: 850 MB (including NULL overhead)
  • DynamoDB: Only stores present attributes

    • 50K delays (5 attributes each) + 5K emergencies (5 attributes each) + 945K normal (3 attributes each)
    • Storage: 280 MB (67% reduction)

Sparse storage eliminates NULL overhead

Query Patterns - Partition Key Requirement

Query performance depends on partition key specification

Like Cassandra, DynamoDB requires partition key for efficient queries. Without partition key, expensive full-table scan required.

Efficient query (partition key specified):

from boto3.dynamodb.conditions import Key

response = table.query(
    KeyConditionExpression=
        Key('sensor_id').eq(42000) &
        Key('timestamp').between(1704034800, 1704038400)
)

Execution:

  • Accesses single partition (sensor_id=42000)
  • Range scan on timestamp within partition
  • Returns 360 items (1 hour at 10-second intervals)
  • Latency: 8ms
  • Billed for ~360 items read

Inefficient scan (no partition key):

from boto3.dynamodb.conditions import Attr

response = table.scan(
    FilterExpression=Attr('temperature').gt(25.0)
)

Execution:

  • Reads every item in every partition
  • Applies temperature filter after reading from storage
  • 1M total items, 100K match filter
  • Latency: 2400ms (300× slower)
  • Billed for 1M items read (not 100K returned)

Partition key specification: 300× latency difference, 2800× items read

Global Secondary Index - Alternative Access Patterns

Problem: Table partitioned by sensor_id cannot efficiently query by other attributes

sensor_readings table:

  • Partition key: sensor_id
  • Sort key: timestamp
  • Efficient: “Readings for sensor 42000”
  • Inefficient: “All sensors with low battery” → requires scan

Solution: Global Secondary Index (GSI)

GSI creates alternate partition key on different attribute:

table.update(
    AttributeDefinitions=[
        {'AttributeName': 'battery_status', 'AttributeType': 'S'},
        {'AttributeName': 'sensor_id', 'AttributeType': 'N'}
    ],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'battery-status-index',
            'KeySchema': [
                {'AttributeName': 'battery_status', 'KeyType': 'HASH'},
                {'AttributeName': 'sensor_id', 'KeyType': 'RANGE'}
            ],
            'Projection': {'ProjectionType': 'ALL'}
        }
    }]
)
# Key attributes used by the GSI must appear in AttributeDefinitions;
# provisioned-mode tables also need ProvisionedThroughput in the Create spec

GSI structure:

  • New partition key: battery_status (values: LOW, NORMAL, FULL)
  • New sort key: sensor_id
  • DynamoDB maintains GSI automatically on every write

Query using GSI:

response = table.query(
    IndexName='battery-status-index',
    KeyConditionExpression=Key('battery_status').eq('LOW')
)

Returns all low-battery sensors in single partition access

GSI maintained automatically, doubles write cost

Boto3 Query Operations - Query vs Scan

Query requires partition key, scan does not

Setup:

import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb', region_name='us-west-2')
table = dynamodb.Table('sensor_readings')

Query with partition key and sort key range:

import time

current_time = int(time.time())
one_hour_ago = current_time - 3600

response = table.query(
    KeyConditionExpression=
        Key('sensor_id').eq(42000) &
        Key('timestamp').gt(one_hour_ago)
)

items = response['Items']
# Returns: List of items matching condition
# Example: 360 items (1 per 10 seconds for 1 hour)

Performance: Single partition access, 8ms latency

Query with additional filter (applied after retrieval):

response = table.query(
    KeyConditionExpression=Key('sensor_id').eq(42000),
    FilterExpression=Attr('temperature').gt(25.0)
)

Execution:

  • Reads all items for sensor_id=42000 (e.g., 10,000 items)
  • Filters temperature > 25.0 in memory
  • Returns 2,000 items (20% match filter)
  • Billed for 10,000 items read (not 2,000 returned)

FilterExpression reduces network transfer, not storage reads or cost

Scan entire table (avoid in production):

response = table.scan(
    FilterExpression=Attr('battery_level').lt(10)
)

items = response['Items']

Execution on 1M item table:

  • Reads every item in table (1M items)
  • Filter applied after reading from storage
  • Returns 50K items (5% match filter)
  • Latency: 3200ms
  • Billed for 1M items scanned

Pagination for large result sets:

items = []
last_key = None

while True:
    if last_key:
        response = table.query(
            KeyConditionExpression=Key('sensor_id').eq(42000),
            ExclusiveStartKey=last_key
        )
    else:
        response = table.query(
            KeyConditionExpression=Key('sensor_id').eq(42000)
        )

    items.extend(response['Items'])

    last_key = response.get('LastEvaluatedKey')
    if not last_key:
        break

DynamoDB returns maximum 1 MB per request. Use LastEvaluatedKey to continue.

Performance comparison (1M item table):

  • Query with partition key: 8ms, ~100 items read
  • Query with filter: 8ms, 10K items read (entire partition)
  • Scan: 3200ms, 1M items read (entire table)

Boto3 Item Operations

Put item (insert or replace):

table.put_item(
    Item={
        'sensor_id': 42000,
        'timestamp': 1704034800,
        'temperature': 22.5,
        'humidity': 45.0,
        'battery_level': 87
    }
)

If item with same primary key exists, replaces entirely (not merge).

Get item by primary key:

response = table.get_item(
    Key={
        'sensor_id': 42000,
        'timestamp': 1704034800
    }
)

if 'Item' in response:
    item = response['Item']
    print(f"Temperature: {item['temperature']}")
else:
    print("Item not found")

get_item returns a single item, not a list. Reads are eventually consistent by default; pass ConsistentRead=True for a strongly consistent read (2× the read cost).

Update specific attributes:

table.update_item(
    Key={
        'sensor_id': 42000,
        'timestamp': 1704034800
    },
    UpdateExpression='SET temperature = :temp, humidity = :hum',
    ExpressionAttributeValues={
        ':temp': 23.1,
        ':hum': 46.5
    }
)

Updates only specified attributes, leaves others unchanged.

Conditional update (optimistic locking):

table.update_item(
    Key={
        'sensor_id': 42000,
        'timestamp': 1704034800
    },
    UpdateExpression='SET battery_level = :new_level',
    ConditionExpression='battery_level > :threshold',
    ExpressionAttributeValues={
        ':new_level': 85,
        ':threshold': 80
    }
)

Update succeeds only if battery_level > 80. Raises ConditionalCheckFailedException if condition false.

Delete item:

table.delete_item(
    Key={
        'sensor_id': 42000,
        'timestamp': 1704034800
    }
)

Batch operations (up to 25 items):

with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(
            Item={
                'sensor_id': 42000,
                'timestamp': 1704034800 + i * 10,
                'temperature': 22.0 + i * 0.1
            }
        )

batch_writer automatically handles batching, retries, and backoff. Reduces network round trips for bulk writes.

Cost: Same per-item cost as individual operations. Network latency reduction proportional to batch size.

AWS CLI - Table Inspection and Queries

List tables in region:

aws dynamodb list-tables --region us-west-2

Output:

{
    "TableNames": [
        "sensor_readings",
        "user_sessions",
        "flight_events"
    ]
}

Describe table structure:

aws dynamodb describe-table \
    --table-name sensor_readings \
    --region us-west-2

Returns:

  • KeySchema: Partition key and sort key definitions
  • AttributeDefinitions: Data types (N=number, S=string, B=binary)
  • ProvisionedThroughput or BillingMode
  • GlobalSecondaryIndexes: GSI definitions if present
  • ItemCount: Approximate item count (updated every 6 hours)
  • TableSizeBytes: Storage size
  • TableStatus: ACTIVE, CREATING, DELETING

Get specific item:

aws dynamodb get-item \
    --table-name sensor_readings \
    --key '{
        "sensor_id": {"N": "42000"},
        "timestamp": {"N": "1704034800"}
    }' \
    --region us-west-2

Note: CLI requires explicit type tags (N, S, B)

Query with CLI:

aws dynamodb query \
    --table-name sensor_readings \
    --key-condition-expression "sensor_id = :sid AND #ts > :time" \
    --expression-attribute-names '{"#ts": "timestamp"}' \
    --expression-attribute-values '{
        ":sid": {"N": "42000"},
        ":time": {"N": "1704034800"}
    }' \
    --region us-west-2

expression-attribute-names required when attribute name is reserved word (timestamp).

Output (truncated):

{
    "Items": [
        {
            "sensor_id": {"N": "42000"},
            "timestamp": {"N": "1704034810"},
            "temperature": {"N": "22.5"},
            "humidity": {"N": "45.0"}
        },
        ...
    ],
    "Count": 360,
    "ScannedCount": 360
}

Scan with filter:

aws dynamodb scan \
    --table-name sensor_readings \
    --filter-expression "battery_level < :threshold" \
    --expression-attribute-values '{":threshold": {"N": "10"}}' \
    --region us-west-2

Count: Items returned. ScannedCount: Items examined. If filter used, ScannedCount > Count (charged for ScannedCount).

Capacity Modes - On-Demand vs Provisioned

On-demand mode: Pay per request

Pricing (us-west-2):

  • $1.25 per million read requests
  • $6.25 per million write requests
  • Storage: $0.25 per GB-month

Capacity unit definition:

  • 1 read request = up to 4 KB
  • 1 write request = up to 1 KB
  • Larger items consume multiple requests

Example cost calculation:

  • 10,000 reads/day (each 2 KB) = 300K reads/month
  • 2,000 writes/day (each 0.5 KB) = 60K writes/month
  • Storage: 5 GB
  • Monthly cost: (0.3 × $1.25) + (0.06 × $6.25) + (5 × $0.25) = $0.38 + $0.38 + $1.25 = $2.01

When to use on-demand:

  • Development and testing (unpredictable traffic)
  • Spiky workloads (traffic varies 10× between peak and baseline)
  • New applications (unknown traffic patterns)

Provisioned mode: Reserve capacity

Specify capacity units:

  • Read Capacity Units (RCU): Strongly consistent reads/sec

    • 1 RCU = 1 read/sec of up to 4 KB
    • Eventually consistent reads: 2× capacity (1 RCU = 2 reads/sec)
  • Write Capacity Units (WCU): Writes/sec of up to 1 KB

Pricing:

  • $0.00065/hour per RCU = $0.47/month per RCU
  • $0.00325/hour per WCU = $2.35/month per WCU

Provisioned capacity example:

Reserve 10 RCU + 5 WCU:

  • Monthly cost: (10 × $0.47) + (5 × $2.35) = $4.70 + $11.75 = $16.45
  • Capacity: 10 reads/sec + 5 writes/sec
  • Handles: 25.9M reads/month + 13M writes/month

If traffic exceeds provisioned capacity: ProvisionedThroughputExceededException (throttling). Application must retry with exponential backoff.

Auto-scaling:

table.update(
    BillingMode='PROVISIONED',
    ProvisionedThroughput={
        'ReadCapacityUnits': 10,
        'WriteCapacityUnits': 5
    }
)

# Enable auto-scaling (via AWS Application Auto Scaling)
# Target: 70% utilization
# Min: 5 RCU/WCU
# Max: 100 RCU/WCU

Auto-scaling adjusts capacity based on utilization, responds within 2-5 minutes.
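
A hedged sketch of wiring up that target-tracking policy through the Application Auto Scaling API (table name, policy name, and limits illustrative):

import boto3

autoscaling = boto3.client("application-autoscaling", region_name="us-west-2")

autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/sensor_readings",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=5,
    MaxCapacity=100,
)
autoscaling.put_scaling_policy(
    PolicyName="sensor-readings-read-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/sensor_readings",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # scale to hold ~70% utilization
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)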

Break-even analysis:

Provisioned cheaper for sustained traffic, on-demand cheaper for spiky/low-volume

DynamoDB Limitations

No server-side aggregations

Cannot compute:

# Invalid - no aggregation support
SELECT AVG(temperature) FROM sensor_readings WHERE sensor_id = 42000

Must retrieve all items and compute in application:

response = table.query(
    KeyConditionExpression=Key('sensor_id').eq(42000)
)

temperatures = [item['temperature'] for item in response['Items']]
avg_temp = sum(temperatures) / len(temperatures)

Cost: Charged for reading all items, even though only computing single aggregate value.

No joins

Cannot combine tables:

# Invalid - no JOIN support
SELECT s.*, m.location
FROM sensor_readings s
JOIN sensor_metadata m ON s.sensor_id = m.sensor_id

Must query both tables separately:

# Query 1: Get readings
readings = table_readings.query(
    KeyConditionExpression=Key('sensor_id').eq(42000)
)

# Query 2: Get metadata
metadata = table_metadata.get_item(Key={'sensor_id': 42000})

# Application joins results
for reading in readings['Items']:
    reading['location'] = metadata['Item']['location']

Alternative: Denormalize location into readings table (duplicate data, no update anomalies if location rarely changes).

Query flexibility limited

Must specify partition key:

# Invalid - no partition key
table.query(
    KeyConditionExpression=Key('timestamp').gt(1704034800)
)
# Error: Query key condition not supported

Without partition key, must use scan (reads entire table).

Item size limit: 400 KB

Cannot store:

# Invalid - item exceeds 400 KB
table.put_item(Item={
    'sensor_id': 42000,
    'timestamp': 1704034800,
    'readings': [list of 10,000 measurements]  # Too large
})

Must split large items or store in S3 with reference in DynamoDB.
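
One common workaround, sketched with boto3 (bucket and key names illustrative, not a prescribed pattern):

import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("sensor_readings")

readings = [{"t": 1704034800 + i * 10, "temperature": 22.0} for i in range(10_000)]
s3_key = "sensor_42000/1704034800.json"

# Oversized payload goes to S3; DynamoDB item keeps a small pointer
s3.put_object(Bucket="sensor-archive", Key=s3_key, Body=json.dumps(readings))
table.put_item(Item={
    "sensor_id": 42000,
    "timestamp": 1704034800,
    "readings_s3_key": s3_key,
})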

Transaction limitations

TransactWriteItems supports up to 100 items per transaction (the limit was 25 until late 2022):

table.meta.client.transact_write_items(
    TransactItems=[
        {'Put': {'TableName': 'sensor_readings', 'Item': {...}}},
        {'Update': {'TableName': 'sensor_metadata', 'Key': {...}, ...}},
        # ... up to 100 operations total
    ]
)

  • 4 MB total transaction size
  • 2× cost vs non-transactional writes
  • All items must be in same region

When to use relational database instead:

  • Complex analytical queries (aggregations, JOINs)
  • Ad-hoc queries (access patterns unknown at design time)
  • Strong multi-item consistency required across many items
  • Schema changes frequent

Data Modeling - Sensor Monitoring Design

Requirements:

  • 1,000 sensors, readings every 10 seconds (100 writes/sec)
  • Query pattern 1: “Last hour for sensor X” (100 queries/sec)
  • Query pattern 2: “All sensors with temperature > 30°C” (1 query/min)
  • Query pattern 3: “Daily average per sensor” (batch job, 1/day)

Table design:

{
    "TableName": "sensor_readings",
    "KeySchema": [
        {"AttributeName": "sensor_id", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"}
    ],
    "AttributeDefinitions": [
        {"AttributeName": "sensor_id", "AttributeType": "N"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
        {"AttributeName": "temp_bucket", "AttributeType": "S"}
    ],
    "GlobalSecondaryIndexes": [{
        "IndexName": "temperature-index",
        "KeySchema": [
            {"AttributeName": "temp_bucket", "KeyType": "HASH"},
            {"AttributeName": "timestamp", "KeyType": "RANGE"}
        ]
    }]
}

Rationale:

Query pattern 1 (100/sec): Partition key = sensor_id

  • Single partition access per query
  • Range on timestamp within partition (last hour = 360 items)
  • Latency: 8ms
  • Cost: 100 queries/sec × 2 RCU each = 200 RCU sustained

Query pattern 2 (1/min): GSI on temperature bucket

  • Partition GSI by temperature ranges: 0-10, 10-20, 20-30, 30-40
  • Store temp_bucket attribute: “30-40” when temperature in that range
  • Query GSI for temp_bucket = “30-40”
  • Returns all sensors in range across all partitions
  • Latency: 25ms, Cost: 50 RCU per query
  • Acceptable: 1/min × 50 RCU = 0.83 RCU sustained
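
A sketch of the bucketing logic on the write path and the corresponding GSI query, assuming table is a boto3 Table handle for sensor_readings (boto3 requires Decimal for numeric attributes):

from decimal import Decimal
from boto3.dynamodb.conditions import Key

def temp_bucket(temp):
    # Map a temperature to its 10-degree bucket label, e.g. 34.2 -> "30-40"
    lo = int(temp // 10) * 10
    return f"{lo}-{lo + 10}"

# Write path: store the bucket attribute so the GSI can index it
table.put_item(Item={
    'sensor_id': 42000,
    'timestamp': 1704034800,
    'temperature': Decimal('34.2'),
    'temp_bucket': temp_bucket(34.2)
})

# Read path (query pattern 2): hit the GSI instead of scanning the table
response = table.query(
    IndexName='temperature-index',
    KeyConditionExpression=Key('temp_bucket').eq('30-40')
)
hot_sensors = {item['sensor_id'] for item in response['Items']}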

Query pattern 3 (1/day): Export to S3, aggregate with Athena

  • DynamoDB Export to S3 (no read capacity consumed)
  • S3 contains the exported readings (native formats: DynamoDB JSON or Amazon Ion; convert to Parquet, e.g. with Glue, for efficient Athena scans)
  • Athena SQL query:
SELECT sensor_id,
       DATE(from_unixtime(timestamp)) as date,
       AVG(temperature) as avg_temp
FROM sensor_readings_export
GROUP BY sensor_id, DATE(from_unixtime(timestamp))
  • Cost: Athena query ~$5 per TB scanned
  • No DynamoDB RCU consumed
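
The export itself is one API call, sketched below; it requires point-in-time recovery enabled on the table, and the ARN and bucket name are placeholders:

import boto3

client = boto3.client('dynamodb')
client.export_table_to_point_in_time(
    TableArn='arn:aws:dynamodb:us-east-1:123456789012:table/sensor_readings',
    S3Bucket='sensor-archive',       # assumed bucket name
    ExportFormat='DYNAMODB_JSON'     # or 'ION'
)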

Alternative (wrong) design:

Partition key = date, Sort key = sensor_id + timestamp

{
    "partition_key": "2025-01-15",
    "sort_key": "42000#1704034800"
}

Problem: Query pattern 1 requires scanning the entire day partition

  • 1,000 sensors × 8,640 readings/day = 8.64M items per partition
  • Query for a single sensor's hour: must scan 8.64M items to find 360
  • Latency: 4500ms (vs 8ms with correct design)
  • Cost: 8640 RCU per query (vs 2 RCU)
  • Worse still: every write for a given day shares one partition key, concentrating the full 100 writes/sec on a single hot partition

Design principle: the partition key must match the highest-frequency query pattern. GSIs handle secondary patterns at the cost of additional writes.

Cost calculation:

  • Writes: 100/sec × 1 KB each = 100 WCU
  • Query pattern 1 reads: 200 RCU
  • Query pattern 2 reads: 0.83 RCU
  • GSI writes: 100 WCU (doubles write cost)
  • Total: 200 RCU + 200 WCU

Provisioned mode: (200 × $0.47) + (200 × $2.35) = $94 + $470 = $564/month

On-demand mode: 259M reads × $1.25/M + 259M writes × $6.25/M = $324 + $1619 = $1943/month

Provisioned mode 3.4× cheaper for sustained traffic.

Single-Table Design - Multiple Entities in One Table

Advanced pattern: Store multiple entity types in single table

The wide-column model supports heterogeneous rows. DynamoDB extends this: different entity types can share one table through a composite key structure.

Multi-table approach (relational thinking):

users_table:        PK: user_id
orders_table:       PK: order_id
products_table:     PK: product_id
order_items_table:  PK: order_id, SK: product_id

Query “user profile + recent 10 orders + line items for each order”:

  • Query 1: users_table → 1 item
  • Query 2: orders_table (GSI on user_id) → 10 items
  • Query 3: order_items_table (10 separate queries) → 50 items
  • Total: 12 requests, 45ms latency, 15 RCU

Single-table approach:

application_table:
  PK: USER#alice,      SK: PROFILE           → user data
  PK: USER#alice,      SK: ORDER#2025-01-15  → order summary
  PK: ORDER#12345,     SK: METADATA          → order details
  PK: ORDER#12345,     SK: ITEM#widget       → line item

Query “user profile + recent orders”:

table.query(
    KeyConditionExpression=Key('PK').eq('USER#alice')
)

Returns user profile + all orders in single request (11 items), 8ms latency, 3 RCU

Trade-off:

  • Benefit: Fewer queries (12→1), lower latency (45ms→8ms), lower cost (15→3 RCU)
  • Cost: Must know access patterns at design time, cannot query “all orders” efficiently (no partition key)

Single query retrieves related entities, no GSI overhead

Single-Table Design - Access Pattern Implementation

Concrete example: E-commerce order system

Table structure with composite keys encoding entity type and relationships:

User entity:

{
    "PK": "USER#alice",
    "SK": "PROFILE",
    "email": "alice@example.com",
    "name": "Alice Chen",
    "created": "2024-01-15"
}

User’s orders (sorted by date):

{
    "PK": "USER#alice",
    "SK": "ORDER#2025-01-15#12345",
    "order_id": "12345",
    "total": 125.50,
    "status": "shipped"
}

Order metadata (alternate access pattern):

{
    "PK": "ORDER#12345",
    "SK": "METADATA",
    "user_id": "alice",
    "order_date": "2025-01-15",
    "shipping_address": "123 Main St"
}

Order line items:

{
    "PK": "ORDER#12345",
    "SK": "ITEM#widget",
    "product_name": "Widget Pro",
    "quantity": 2,
    "price": 50.00
}

Access pattern 1: User’s recent orders

response = table.query(
    KeyConditionExpression=
        Key('PK').eq('USER#alice') &
        Key('SK').begins_with('ORDER#')
)

Returns:

  • All orders for the user, sorted by date via the SK (the begins_with condition excludes the PROFILE item; query the PK alone, as shown earlier, to fetch both)
  • Single partition access
  • Latency: 8ms, Cost: 3 RCU (reads ~10 items)

Access pattern 2: Order details + line items

response = table.query(
    KeyConditionExpression=Key('PK').eq('ORDER#12345')
)

Returns:

  • METADATA item + all ITEM# items
  • Single partition access
  • Latency: 5ms, Cost: 1 RCU (reads 4 items)

Performance comparison:

Multi-table design:

  • Pattern 1: Query users + query orders (with GSI) = 15ms
  • Pattern 2: Query orders + query order_items = 12ms
  • GSI on orders.user_id: Doubles write cost
  • Total latency: 27ms

Single-table design:

  • Pattern 1: Single query = 8ms
  • Pattern 2: Single query = 5ms
  • No GSI required
  • Total latency: 13ms (52% reduction)

Sort key ordering (ORDER#2025-01-15#12345) provides chronological sorting without additional index. Latest orders retrieved first by reversing sort direction.
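
In boto3 this is the ScanIndexForward parameter, e.g. to fetch the ten most recent orders first:

response = table.query(
    KeyConditionExpression=
        Key('PK').eq('USER#alice') &
        Key('SK').begins_with('ORDER#'),
    ScanIndexForward=False,  # descending sort-key order: newest first
    Limit=10                 # stop after the 10 most recent orders
)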

Constraint: Cannot efficiently query “all orders across all users” (no partition key). Single-table design optimizes for known access patterns, not ad-hoc queries.

DynamoDB in System Architecture

DynamoDB complements relational databases

PostgreSQL RDS (relational):

  • User accounts: id, email, password_hash, created_at
  • Subscription plans: plan_id, name, price_cents, features
  • Payment transactions: txn_id, user_id, amount, status, timestamp
  • Referential integrity: Foreign keys, CASCADE deletes
  • ACID transactions for payment processing

DynamoDB (wide-column):

  • User activity logs: user_id + timestamp → action, resource, metadata
  • API rate limiting: api_key + window_start → request_count, TTL (see the sketch after this list)
  • Session storage: session_token → user_id, login_time, expires_at
  • Real-time sensor data: device_id + timestamp → measurements
  • High write throughput, key-based access, TTL for automatic expiration
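
A minimal sketch of the rate-limiting pattern mentioned above, assuming a table keyed on api_key (HASH) and window_start (RANGE) with TTL enabled on the expires_at attribute; all names and limits are illustrative:

import time
from botocore.exceptions import ClientError

WINDOW = 60    # seconds per rate-limit window
LIMIT = 100    # max requests per key per window

def allow_request(table, api_key):
    window_start = int(time.time()) // WINDOW * WINDOW
    try:
        # Atomic counter: the increment fails once the limit is reached
        table.update_item(
            Key={'api_key': api_key, 'window_start': window_start},
            UpdateExpression='ADD request_count :one SET expires_at = :ttl',
            ConditionExpression=
                'attribute_not_exists(request_count) OR request_count < :limit',
            ExpressionAttributeValues={
                ':one': 1,
                ':limit': LIMIT,
                ':ttl': window_start + 2 * WINDOW  # TTL auto-deletes old windows
            })
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False   # over the limit for this window
        raise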

S3 (object storage):

  • Raw log files (archived from DynamoDB)
  • User-uploaded files
  • ML model artifacts
  • Static website assets

Data flow example (IoT sensor platform):

  1. Sensor HTTP POST → API Gateway → Lambda function
  2. Lambda writes to DynamoDB (sensor_id, timestamp, readings)
    • Write latency: 5ms
    • Cost: 1 WCU per reading
  3. DynamoDB Stream triggers Lambda for anomaly detection (sketch after this list)
    • Stream provides ordered change log
    • Lambda processes each reading within 100ms
  4. Anomaly detected → Lambda writes alert to PostgreSQL
    • alerts table: alert_id, sensor_id, severity, timestamp, description
    • Triggers notification to operations team
  5. Nightly batch job: DynamoDB → S3 Export
    • Converted to Parquet (e.g. via Glue) for analytics
    • Athena queries for daily/weekly aggregates
  6. Monthly aggregates → PostgreSQL for dashboard queries
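
A sketch of the step-3 Lambda handler consuming the stream; the event shape is the standard DynamoDB Streams trigger payload, while the threshold and the write_alert_to_postgres helper are illustrative placeholders:

def handler(event, context):
    # Each invocation receives a batch of ordered change records
    for record in event['Records']:
        if record['eventName'] != 'INSERT':
            continue
        # Stream images use DynamoDB's typed attribute-value format
        new_image = record['dynamodb']['NewImage']
        sensor_id = int(new_image['sensor_id']['N'])
        temperature = float(new_image['temperature']['N'])
        if temperature > 40.0:                 # assumed anomaly threshold
            write_alert_to_postgres(sensor_id, temperature)  # step 4, not shown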

Each storage system handles the workload it is optimized for