Homework #3: Relational Databases and NoSQL Data Modeling
EE 547: Fall 2025
Assigned: 21 October
Due: Monday, 03 November at 23:59
Submission: Gradescope via GitHub repository
Prerequisites:
- Docker Desktop must be installed and running on your machine
- AWS CLI configured with valid credentials for Problem 2
- PostgreSQL Docker image for Problem 1
- Use data files from HW#1 and HW#2 outputs
Overview
This assignment covers relational databases (PostgreSQL) and NoSQL systems (DynamoDB). You will design schemas, implement queries, and deploy database-backed services.
Getting Started
Download the dataset: hw3-data.zip
Extract the files:
unzip hw3-data.zip
cd hw3-data
The data/ directory contains CSV files for Problem 1.
Problem 1: Metro Transit Database (Relational SQL)
Use only the following packages:
- psycopg2-binary (PostgreSQL adapter for Python)
- Python standard library modules (json, sys, os, csv, datetime, argparse)
Do not use ORMs (SQLAlchemy, Django ORM), query builders, or pandas. Write raw SQL.
Design and query a relational database for a city metro transit system tracking lines, stops, trips, and ridership.
Part A: Schema Design
You will design a database schema based on these entity descriptions. The CSV data files are provided.
Entities:
- Lines - Transit routes (e.g., Route 2, Route 20)
- Has: name, vehicle type
- Stops - Stop locations
- Has: name, latitude, longitude
- Line Stops - Which stops each line serves, in order
- Links line to stops with sequence number and time offset from start
- A stop can be on multiple lines (transfer stations)
- Trips - Scheduled vehicle runs
- Has: line, departure time, vehicle ID
- Stop Events - Actual arrivals during trips
- Has: trip, stop, scheduled time, actual time, passengers on/off
Relationships:
- “A line has many stops (ordered)” → line_stops table
- “A trip runs on one line” → foreign key
- “A trip has many stop events” → foreign key
Your Task:
Create schema.sql defining all five tables. You must decide:
- Primary keys (natural vs surrogate keys)
- Foreign key constraints
- CHECK constraints (e.g., passengers_on >= 0)
- UNIQUE constraints where appropriate
Example starter (you complete this):
CREATE TABLE lines (
    line_id SERIAL PRIMARY KEY,  -- or use line_name as natural key?
    line_name VARCHAR(50) NOT NULL UNIQUE,
    vehicle_type VARCHAR(10) CHECK (vehicle_type IN ('rail', 'bus'))
);
CREATE TABLE stops (
    -- Your design here
);
CREATE TABLE line_stops (
    -- Your design here
    -- Must handle: line_id, stop_id, sequence_number, time_offset_minutes
);
-- Continue for trips and stop_events
Part B: Data Loading
Provided CSV files:
- lines.csv - 5 routes
- stops.csv - ~105 stops
- line_stops.csv - ~105 combinations
- trips.csv - ~525 trips
- stop_events.csv - ~11,000 events
CSV Formats:
lines.csv:
line_name,vehicle_type
Route 2,bus
Route 4,bus
Route 20,bus
stops.csv:
stop_name,latitude,longitude
Wilshire / Veteran,34.057616,-118.447888
Le Conte / Broxton,34.063594,-118.446732
line_stops.csv:
line_name,stop_name,sequence,time_offset
Route 20,Wilshire / Veteran,1,0
Route 20,Le Conte / Broxton,2,5
trips.csv:
trip_id,line_name,scheduled_departure,vehicle_id
T0001,Route 2,2025-10-01 06:00:00,V101
stop_events.csv:
trip_id,stop_name,scheduled,actual,passengers_on,passengers_off
T0001,Le Conte / Broxton,2025-10-01 06:00:00,2025-10-01 06:00:00,29,0
T0001,Le Conte / Westwood,2025-10-01 06:02:00,2025-10-01 06:02:00,25,29
Create load_data.py that:
- Connects to PostgreSQL
- Runs schema.sql to create tables
- Loads CSVs in correct order (handle foreign keys)
- Reports statistics
python load_data.py --host localhost --dbname transit --user postgres --password pass --datadir data/
Example output:
Connected to transit@localhost
Creating schema...
Tables created: lines, stops, line_stops, trips, stop_events
Loading data/lines.csv... 5 rows
Loading data/stops.csv... 105 rows
Loading data/line_stops.csv... 105 rows
Loading data/trips.csv... 525 rows
Loading data/stop_events.csv... 11,025 rows
Total: 11,765 rows loaded
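For reference, the sketch below shows one possible shape for load_data.py. It is a minimal sketch only: it assumes the surrogate-key starter schema from Part A, shows the insert for lines.csv, and leaves the remaining files (which follow the same pattern, resolving names to surrogate ids for foreign keys) as comments.
import argparse
import csv
import os
import psycopg2

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--host', default='localhost')
    p.add_argument('--dbname', required=True)
    p.add_argument('--user', required=True)
    p.add_argument('--password', required=True)
    p.add_argument('--datadir', default='data/')
    args = p.parse_args()

    conn = psycopg2.connect(host=args.host, dbname=args.dbname,
                            user=args.user, password=args.password)
    cur = conn.cursor()
    print(f"Connected to {args.dbname}@{args.host}")

    # Create all tables from schema.sql
    with open('schema.sql') as f:
        cur.execute(f.read())

    # Load lines.csv first; later tables reference it by foreign key
    with open(os.path.join(args.datadir, 'lines.csv')) as f:
        rows = list(csv.DictReader(f))
    cur.executemany(
        "INSERT INTO lines (line_name, vehicle_type) VALUES (%s, %s)",
        [(r['line_name'], r['vehicle_type']) for r in rows])
    print(f"Loading lines.csv... {len(rows)} rows")

    # ... repeat for stops, line_stops, trips, stop_events,
    #     looking up line_id/stop_id from the names in each CSV
    conn.commit()
    cur.close()
    conn.close()

if __name__ == '__main__':
    main()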
Part C: Query Implementation
Create queries.py that implements 10 SQL queries.
Your script must support:
python queries.py --query Q1 --dbname transit --format json
python queries.py --all --dbname transit
Required Queries:
Q1: List all stops on Route 20 in order
-- Output: stop_name, sequence, time_offset
SELECT ...
FROM line_stops ls
JOIN lines l ON ...
JOIN stops s ON ...
WHERE l.line_name = 'Route 20'
ORDER BY ls.sequence;
Q2: Trips during morning rush (7-9 AM)
-- Output: trip_id, line_name, scheduled_departure
Q3: Transfer stops (stops on 2+ routes)
-- Output: stop_name, line_count
-- Uses: GROUP BY, HAVING
Q4: Complete route for trip T0001
-- Output: All stops for specific trip in order
-- Multi-table JOIN
Q5: Routes serving both Wilshire / Veteran and Le Conte / Broxton
-- Output: line_name
Q6: Average ridership by line
-- Output: line_name, avg_passengers
-- Aggregation across stop_events
Q7: Top 10 busiest stops
-- Output: stop_name, total_activity
-- total_activity = SUM(passengers_on + passengers_off)
Q8: Count delays by line (>2 min late)
-- Output: line_name, delay_count
-- WHERE actual > scheduled + interval '2 minutes'
Q9: Trips with 3+ delayed stops
-- Output: trip_id, delayed_stop_count
-- Uses: Subquery or HAVING
Q10: Stops with above-average ridership
-- Output: stop_name, total_boardings
-- Subquery for AVG comparison
Output format (JSON):
{
  "query": "Q1",
  "description": "List all stops on Route 20 in order",
  "results": [
    {"stop_name": "Wilshire / Veteran", "sequence": 1, "time_offset": 0},
    {"stop_name": "Le Conte / Broxton", "sequence": 2, "time_offset": 5}
  ],
  "count": 2
}
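Putting the pieces together, queries.py might look like the sketch below. This is a sketch only: table and column names assume the surrogate-key starter schema from Part A, connection defaults assume the Docker setup in Part D, and only Q1 is filled in.
import argparse
import json
import os
import psycopg2

# SQL text for each query; Q1 shown, Q2 through Q10 follow the same pattern
QUERIES = {
    'Q1': ('List all stops on Route 20 in order',
           """SELECT s.stop_name, ls.sequence_number AS sequence,
                     ls.time_offset_minutes AS time_offset
                FROM line_stops ls
                JOIN lines l ON l.line_id = ls.line_id
                JOIN stops s ON s.stop_id = ls.stop_id
               WHERE l.line_name = 'Route 20'
               ORDER BY ls.sequence_number"""),
    # ...
}

def run_query(conn, qid):
    description, sql = QUERIES[qid]
    with conn.cursor() as cur:
        cur.execute(sql)
        cols = [d[0] for d in cur.description]
        rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    return {'query': qid, 'description': description,
            'results': rows, 'count': len(rows)}

if __name__ == '__main__':
    p = argparse.ArgumentParser()
    p.add_argument('--query')
    p.add_argument('--all', action='store_true')
    p.add_argument('--dbname', required=True)
    p.add_argument('--format', default='json')  # this sketch only emits JSON
    args = p.parse_args()
    # Connection defaults match the docker-compose setup in Part D; override via env vars
    conn = psycopg2.connect(dbname=args.dbname,
                            host=os.environ.get('PGHOST', 'db'),
                            user=os.environ.get('PGUSER', 'transit'),
                            password=os.environ.get('PGPASSWORD', 'transit123'))
    ids = sorted(QUERIES, key=lambda q: int(q[1:])) if args.all else [args.query]
    for qid in ids:
        print(json.dumps(run_query(conn, qid), indent=2, default=str))
    conn.close()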
Part D: Docker Setup
Create docker-compose.yaml:
version: '3.8'
services:
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: transit
      POSTGRES_USER: transit
      POSTGRES_PASSWORD: transit123
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    depends_on:
      - db
  app:
    build: .
    depends_on:
      - db
    volumes:
      - ./data:/app/data
volumes:
  pgdata:
Adminer provides a web UI at http://localhost:8080 to browse tables and run queries (System: PostgreSQL, Server: db, Username: transit, Password: transit123, Database: transit). The app service builds from the Dockerfile below; run.sh and test.sh use it to run the loader and queries against the db service, with data/ mounted at /app/data.
Create Dockerfile:
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y postgresql-client && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY schema.sql load_data.py queries.py ./
CMD ["python", "load_data.py"]Create requirements.txt:
psycopg2-binary>=2.9.0
Part E: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker-compose build
Create run.sh:
#!/bin/bash
echo "Starting PostgreSQL..."
docker-compose up -d db
echo "Waiting for database..."
sleep 3
echo "Loading data..."
docker-compose run --rm app python load_data.py \
--host db --dbname transit --user transit --password transit123 --datadir /app/data
echo ""
echo "Running sample queries..."
docker-compose run --rm app python queries.py --query Q1 --dbname transit
docker-compose run --rm app python queries.py --query Q3 --dbname transit
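The sleep 3 above is a best-effort wait; on slower machines PostgreSQL may not be ready yet. One option (an assumption, not a requirement of the spec) is for load_data.py to retry its initial connection, for example:
import time
import psycopg2

def connect_with_retry(retries=10, delay=1.0, **kwargs):
    # Retry while the postgres container finishes starting up
    for attempt in range(retries):
        try:
            return psycopg2.connect(**kwargs)
        except psycopg2.OperationalError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)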
Create test.sh:
#!/bin/bash
./build.sh
./run.sh
echo ""
echo "Testing all queries..."
for i in {1..10}; do
docker-compose run --rm app python queries.py --query Q$i --dbname transit --format json
done
docker-compose down
Part F: README Analysis
Create README.md answering:
- Schema Decisions: Natural vs surrogate keys? Why?
- Constraints: What CHECK/UNIQUE constraints did you add?
- Complex Query: Which query was hardest? Why?
- Foreign Keys: Give example of invalid data they prevent
- When Relational: Why is SQL good for this domain?
Deliverables
Your problem1/ directory must contain:
problem1/
├── schema.sql
├── load_data.py
├── queries.py
├── Dockerfile
├── docker-compose.yaml
├── requirements.txt
├── build.sh
├── run.sh
├── test.sh
└── README.md
All shell scripts must be executable (chmod +x *.sh).
Validation
We will test by:
- Running ./build.sh - must complete without errors
- Running ./test.sh - must load data and execute all 10 queries
- Verifying schema has proper constraints
- Testing queries return correct results
- Checking README answers all questions
Queries must execute within 500ms on the provided dataset.
Problem 2: ArXiv Paper Discovery with DynamoDB
Use only the following packages:
- boto3 (AWS SDK for Python)
- Python standard library modules (json, sys, os, datetime, re, collections)
Do not use other AWS libraries, NoSQL ORMs, or database abstraction layers beyond boto3.
Build a paper discovery system using AWS DynamoDB that efficiently supports multiple access patterns through schema design and denormalization.
Part A: Schema Design for Access Patterns
Design a DynamoDB table schema that efficiently supports these required query patterns:
- Browse recent papers by category (e.g., “Show me latest ML papers”)
- Find all papers by a specific author
- Get full paper details by arxiv_id
- List papers published in a date range within a category
- Search papers by keyword (extracted from abstract)
Design Requirements:
- Define partition key and sort key for main table
- Design Global Secondary Indexes (GSIs) to support all access patterns
- Implement denormalization strategy for efficient queries
- Document trade-offs in your schema design
Example Schema Structure:
# Main Table Item
{
    "PK": "CATEGORY#cs.LG",
    "SK": "2023-01-15#2301.12345",
    "arxiv_id": "2301.12345",
    "title": "Paper Title",
    "authors": ["Author1", "Author2"],
    "abstract": "Full abstract text...",
    "categories": ["cs.LG", "cs.AI"],
    "keywords": ["keyword1", "keyword2"],
    "published": "2023-01-15T10:30:00Z"
}
# GSI1: Author access
{
    "GSI1PK": "AUTHOR#Author1",
    "GSI1SK": "2023-01-15",
    # ... rest of paper data
}
# Additional GSIs as needed for other access patterns
Part B: Data Loading Script
Create load_data.py that loads ArXiv papers from your HW#1 Problem 2 output (papers.json) into DynamoDB.
Your script must accept these command line arguments:
python load_data.py <papers_json_path> <table_name> [--region REGION]
Required Operations:
- Create DynamoDB table with appropriate partition/sort keys
- Create GSIs for alternate access patterns
- Transform paper data from HW#1 format to DynamoDB items
- Extract keywords from abstracts (top 10 most frequent words, excluding stopwords)
- Implement denormalization:
- Papers in multiple categories → multiple items
- Multiple authors → items for each author (GSI)
- Multiple keywords → items for each keyword (GSI)
- Batch write items to DynamoDB (use batch_write_item for efficiency)
- Report statistics:
- Number of papers loaded
- Total DynamoDB items created
- Denormalization factor (items/paper ratio)
Example Output:
Creating DynamoDB table: arxiv-papers
Creating GSIs: AuthorIndex, PaperIdIndex, KeywordIndex
Loading papers from papers.json...
Extracting keywords from abstracts...
Loaded 157 papers
Created 2,826 DynamoDB items (denormalized)
Denormalization factor: 18.0x
Storage breakdown:
- Category items: 314 (2.0 per paper avg)
- Author items: 785 (5.0 per paper avg)
- Keyword items: 1,570 (10.0 per paper avg)
- Paper ID items: 157 (1.0 per paper)
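The table-creation and batch-write steps above might look like the sketch below. This is illustrative only: on-demand billing, the AuthorIndex GSI, and the attribute names are assumptions, and the remaining GSIs are added the same way. boto3's batch_writer wraps batch_write_item, batching up to 25 items per request and retrying unprocessed items.
import boto3

def create_table(table_name, region='us-east-1'):
    # Main table keyed on PK/SK, with one GSI shown (AuthorIndex)
    client = boto3.client('dynamodb', region_name=region)
    client.create_table(
        TableName=table_name,
        BillingMode='PAY_PER_REQUEST',
        AttributeDefinitions=[
            {'AttributeName': 'PK', 'AttributeType': 'S'},
            {'AttributeName': 'SK', 'AttributeType': 'S'},
            {'AttributeName': 'GSI1PK', 'AttributeType': 'S'},
            {'AttributeName': 'GSI1SK', 'AttributeType': 'S'},
        ],
        KeySchema=[
            {'AttributeName': 'PK', 'KeyType': 'HASH'},
            {'AttributeName': 'SK', 'KeyType': 'RANGE'},
        ],
        GlobalSecondaryIndexes=[{
            'IndexName': 'AuthorIndex',
            'KeySchema': [
                {'AttributeName': 'GSI1PK', 'KeyType': 'HASH'},
                {'AttributeName': 'GSI1SK', 'KeyType': 'RANGE'},
            ],
            'Projection': {'ProjectionType': 'ALL'},
        }],
    )
    # Wait until the table is ACTIVE before writing items
    client.get_waiter('table_exists').wait(TableName=table_name)

def write_items(table_name, items, region='us-east-1'):
    # batch_writer buffers puts into batch_write_item calls of up to 25 items
    table = boto3.resource('dynamodb', region_name=region).Table(table_name)
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)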
Keyword Extraction: Use the following stopwords list:
STOPWORDS = {
    'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
    'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
    'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
    'can', 'this', 'that', 'these', 'those', 'we', 'our', 'use', 'using',
    'based', 'approach', 'method', 'paper', 'propose', 'proposed', 'show'
}
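For reference, a minimal keyword extractor over this stopword list might look like the sketch below; the tokenization rule (lowercase alphabetic tokens of 3+ letters) is an assumption, so document whichever rule you use.
import re
from collections import Counter

def extract_keywords(abstract, k=10):
    # Lowercase alphabetic tokens of 3+ letters, minus STOPWORDS, top-k by frequency
    words = re.findall(r'[a-z]{3,}', abstract.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]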
Part C: Query Implementation
Create query_papers.py that implements queries for all five access patterns.
Your script must support these commands:
# Query 1: Recent papers in category
python query_papers.py recent <category> [--limit 20] [--table TABLE]
# Query 2: Papers by author
python query_papers.py author <author_name> [--table TABLE]
# Query 3: Get paper by ID
python query_papers.py get <arxiv_id> [--table TABLE]
# Query 4: Papers in date range
python query_papers.py daterange <category> <start_date> <end_date> [--table TABLE]
# Query 5: Papers by keyword
python query_papers.py keyword <keyword> [--limit 20] [--table TABLE]
Query Implementations:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')

def query_recent_in_category(table_name, category, limit=20):
    """
    Query 1: Browse recent papers in category.
    Uses: Main table partition key query with sort key descending.
    """
    response = dynamodb.Table(table_name).query(
        KeyConditionExpression=Key('PK').eq(f'CATEGORY#{category}'),
        ScanIndexForward=False,
        Limit=limit
    )
    return response['Items']

def query_papers_by_author(table_name, author_name):
    """
    Query 2: Find all papers by author.
    Uses: GSI1 (AuthorIndex) partition key query.
    """
    response = dynamodb.Table(table_name).query(
        IndexName='AuthorIndex',
        KeyConditionExpression=Key('GSI1PK').eq(f'AUTHOR#{author_name}')
    )
    return response['Items']

def get_paper_by_id(table_name, arxiv_id):
    """
    Query 3: Get specific paper by ID.
    Uses: GSI2 (PaperIdIndex) for direct lookup.
    """
    response = dynamodb.Table(table_name).query(
        IndexName='PaperIdIndex',
        KeyConditionExpression=Key('GSI2PK').eq(f'PAPER#{arxiv_id}')
    )
    return response['Items'][0] if response['Items'] else None

def query_papers_in_date_range(table_name, category, start_date, end_date):
    """
    Query 4: Papers in category within date range.
    Uses: Main table with composite sort key range query.
    """
    response = dynamodb.Table(table_name).query(
        KeyConditionExpression=(
            Key('PK').eq(f'CATEGORY#{category}') &
            Key('SK').between(f'{start_date}#', f'{end_date}#zzzzzzz')
        )
    )
    return response['Items']

def query_papers_by_keyword(table_name, keyword, limit=20):
    """
    Query 5: Papers containing keyword.
    Uses: GSI3 (KeywordIndex) partition key query.
    """
    response = dynamodb.Table(table_name).query(
        IndexName='KeywordIndex',
        KeyConditionExpression=Key('GSI3PK').eq(f'KEYWORD#{keyword.lower()}'),
        ScanIndexForward=False,
        Limit=limit
    )
    return response['Items']
Output Format:
All queries must output JSON to stdout:
{
  "query_type": "recent_in_category",
  "parameters": {
    "category": "cs.LG",
    "limit": 20
  },
  "results": [
    {
      "arxiv_id": "2301.12345",
      "title": "Paper Title",
      "authors": ["Author1", "Author2"],
      "published": "2023-01-15T10:30:00Z",
      "categories": ["cs.LG"]
    }
  ],
  "count": 20,
  "execution_time_ms": 12
}
Part D: API Server with DynamoDB Backend
Create api_server.py that exposes query functionality via HTTP endpoints.
Required Endpoints:
GET /papers/recent?category={category}&limit={limit} - Returns recent papers in category
- Default limit: 20
GET /papers/author/{author_name} - Returns all papers by author
GET /papers/{arxiv_id} - Returns full paper details by ID
GET /papers/search?category={category}&start={date}&end={date} - Returns papers in date range
GET /papers/keyword/{keyword}?limit={limit} - Returns papers matching keyword
- Default limit: 20
Implementation Requirements:
- Use only the Python standard library http.server module (no Flask/FastAPI)
- Accept the port number as a command line argument (default 8080)
- Return JSON responses with proper HTTP status codes
- Handle errors gracefully (404 for not found, 500 for server errors)
- Log requests to stdout
Example Request/Response:
curl "http://localhost:8080/papers/recent?category=cs.LG&limit=5"{
"category": "cs.LG",
"papers": [
{
"arxiv_id": "2310.12345",
"title": "Recent ML Paper",
"authors": ["Author One", "Author Two"],
"published": "2023-10-15T10:30:00Z"
}
],
"count": 5
}Part E: EC2 Deployment
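A minimal http.server routing sketch is shown below. It is an illustration, not a required structure: it assumes the Part C functions are importable from query_papers, and it omits the search and keyword routes, which follow the same pattern and must be matched before the /papers/{arxiv_id} fallback.
import json
import sys
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs, unquote

import query_papers  # Part C query functions (assumed importable)

TABLE = 'arxiv-papers'

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        qs = parse_qs(url.query)
        parts = [unquote(p) for p in url.path.split('/') if p]
        try:
            if parts[:2] == ['papers', 'recent']:
                limit = int(qs.get('limit', ['20'])[0])
                body = query_papers.query_recent_in_category(TABLE, qs['category'][0], limit)
            elif parts[:2] == ['papers', 'author'] and len(parts) == 3:
                body = query_papers.query_papers_by_author(TABLE, parts[2])
            elif len(parts) == 2 and parts[0] == 'papers':
                # search and keyword routes would be checked before this fallback
                body = query_papers.get_paper_by_id(TABLE, parts[1])
                if body is None:
                    return self._send(404, {'error': 'paper not found'})
            else:
                return self._send(404, {'error': 'unknown route'})
            self._send(200, body)
        except Exception as exc:
            self._send(500, {'error': str(exc)})

    def log_message(self, fmt, *args):
        # Request log to stdout, as required
        print("%s - %s" % (self.address_string(), fmt % args))

    def _send(self, status, obj):
        data = json.dumps(obj, default=str).encode()
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(data)

if __name__ == '__main__':
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8080
    HTTPServer(('', port), Handler).serve_forever()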
Part E: EC2 Deployment
Deploy your API server to AWS EC2 and configure it to use DynamoDB.
Deployment Steps:
Launch EC2 instance:
- Instance type: t3.micro or t3.small
- OS: Amazon Linux 2023 or Ubuntu 22.04
- Security group: Allow inbound HTTP (port 80 or custom port)
Configure IAM role with DynamoDB permissions:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "dynamodb:Query", "dynamodb:Scan", "dynamodb:GetItem", "dynamodb:BatchGetItem" ], "Resource": "arn:aws:dynamodb:*:*:table/arxiv-papers*" } ] }Install dependencies on EC2:
sudo yum install python3 python3-pip -y  # Amazon Linux
pip3 install boto3
Deploy and run server:
# Upload api_server.py to EC2
scp -i key.pem api_server.py ec2-user@<public-ip>:~
# SSH to instance
ssh -i key.pem ec2-user@<public-ip>
# Run server (use screen or systemd for persistence)
python3 api_server.py 8080
Test from local machine:
curl "http://<ec2-public-ip>:8080/papers/recent?category=cs.LG&limit=5"
Create deploy.sh:
#!/bin/bash
if [ $# -ne 2 ]; then
echo "Usage: $0 <key_file> <ec2_public_ip>"
exit 1
fi
KEY_FILE="$1"
EC2_IP="$2"
echo "Deploying to EC2 instance: $EC2_IP"
# Copy files
scp -i "$KEY_FILE" api_server.py ec2-user@"$EC2_IP":~
scp -i "$KEY_FILE" requirements.txt ec2-user@"$EC2_IP":~
# Install dependencies and start server
ssh -i "$KEY_FILE" ec2-user@"$EC2_IP" << 'EOF'
pip3 install -r requirements.txt
# Kill existing server if running
pkill -f api_server.py
# Start server in background
nohup python3 api_server.py 8080 > server.log 2>&1 &
echo "Server started. Check with: curl http://localhost:8080/papers/recent?category=cs.LG"
EOF
echo "Deployment complete"
echo "Test with: curl http://$EC2_IP:8080/papers/recent?category=cs.LG"Part F: Analysis and Documentation
Create README.md in your problem2/ directory that answers:
- Schema Design Decisions:
- Why did you choose your partition key structure?
- How many GSIs did you create and why?
- What denormalization trade-offs did you make?
- Denormalization Analysis:
- Average number of DynamoDB items per paper
- Storage multiplication factor
- Which access patterns caused the most duplication?
- Query Limitations:
- What queries are NOT efficiently supported by your schema?
- Examples: “Count total papers by author”, “Most cited papers globally”
- Why are these difficult in DynamoDB?
- When to Use DynamoDB:
- Based on this exercise, when would you choose DynamoDB over PostgreSQL?
- What are the key trade-offs?
- EC2 Deployment:
- Your EC2 instance public IP
- IAM role ARN used
- Any challenges encountered during deployment
Deliverables
Your problem2/ directory must contain:
problem2/
├── load_data.py
├── query_papers.py
├── api_server.py
├── deploy.sh
├── requirements.txt
└── README.md
requirements.txt:
boto3>=1.28.0
All scripts must be executable and handle errors gracefully.
Validation
We will test your solution by:
- Running load_data.py with sample ArXiv papers
- Testing all five query patterns with query_papers.py
- Starting your API server and testing all endpoints
- Verifying your EC2 deployment is accessible
- Checking your README answers all analysis questions
- Validating denormalization is implemented correctly
- Testing query performance for various access patterns
Your API server must respond to all endpoints within 200ms for queries on tables with up to 500 papers.
Submission Requirements
Your GitHub repository must follow this exact structure:
ee547-hw3-[username]/
├── problem1/
│ └── [files for problem 1]
├── problem2/
│ ├── load_data.py
│ ├── query_papers.py
│   ├── api_server.py
│   ├── deploy.sh
│   ├── requirements.txt
│   └── README.md
└── README.md
The README.md in your repository root must contain:
- Your full name and USC email address
- Instructions to run each problem if they differ from specifications
- AWS region used for Problem 2
- Any design decisions or trade-offs made
Before submitting, ensure:
- All Python scripts run without errors
- AWS credentials are configured correctly (Problem 2)
- All required output files are generated
- JSON output is valid and matches specified formats
- README documents your design decisions