Homework #4: Relational Databases and NoSQL Data Modeling

EE 547: Spring 2026

Important: Assignment Details

Assigned: 25 February
Due: Tuesday, 10 March at 23:59

Gradescope: Homework 4 | How to Submit

Warning: Requirements
  • Docker Desktop must be installed and running on your machine
  • Python 3.11+ required
  • AWS CLI configured with valid credentials
  • Use only Python standard library modules unless explicitly permitted

Overview

This assignment covers relational databases (PostgreSQL) and NoSQL systems (DynamoDB). You will design schemas, load data, and implement queries against both.

Getting Started

Download the starter code: hw4-starter.zip

unzip hw4-starter.zip
cd hw4-starter

The q1/data/ directory contains CSV files for Problem 1.

Problem 1: Metro Transit Database (Relational SQL)

Design and query a relational database for a city metro transit system tracking lines, stops, trips, and ridership.

Use only psycopg2-binary (PostgreSQL adapter for Python) and Python standard library modules (json, sys, os, csv, datetime, argparse). Do not use ORMs (SQLAlchemy, Django ORM), query builders, or pandas. Write raw SQL.

Part A: Schema Design

You will design a database schema based on these entity descriptions. The CSV data files are provided in data/.

Entities:

  1. Lines - Transit routes (e.g., Route 2, Route 20)
    • Has: name, vehicle type
  2. Stops - Stop locations
    • Has: name, latitude, longitude
  3. Line Stops - Which stops each line serves, in order
    • Links line to stops with sequence number and time offset from start
    • A stop can be on multiple lines (transfer stations)
  4. Trips - Scheduled vehicle runs
    • Has: line, departure time, vehicle ID
  5. Stop Events - Actual arrivals during trips
    • Has: trip, stop, scheduled time, actual time, passengers on/off

Relationships:

  • “A line has many stops (ordered)” → line_stops table
  • “A trip runs on one line” → foreign key
  • “A trip has many stop events” → foreign key

Your Task:

Create schema.sql defining all five tables. You must decide:

  • Primary keys (natural vs surrogate keys)
  • Foreign key constraints
  • CHECK constraints (e.g., passengers_on >= 0)
  • UNIQUE constraints where appropriate

Example starter (you complete this):

CREATE TABLE lines (
    line_id SERIAL PRIMARY KEY,  -- or use line_name as natural key?
    line_name VARCHAR(50) NOT NULL UNIQUE,
    vehicle_type VARCHAR(10) CHECK (vehicle_type IN ('rail', 'bus'))
);

CREATE TABLE stops (
    -- Your design here
);

CREATE TABLE line_stops (
    -- Your design here
    -- Must handle: line_id, stop_id, sequence_number, time_offset_minutes
);

-- Continue for trips and stop_events

Part B: Data Loading

Provided CSV files (in data/):

  • lines.csv - 5 routes
  • stops.csv - ~105 stops
  • line_stops.csv - ~105 combinations
  • trips.csv - ~525 trips
  • stop_events.csv - ~11,000 events

CSV Formats:

lines.csv:

line_name,vehicle_type
Route 2,bus
Route 20,bus

stops.csv:

stop_name,latitude,longitude
Wilshire / Veteran,34.057616,-118.447888
Le Conte / Broxton,34.063594,-118.446732

line_stops.csv:

line_name,stop_name,sequence,time_offset
Route 20,Wilshire / Veteran,1,0
Route 20,Le Conte / Broxton,2,5

trips.csv:

trip_id,line_name,scheduled_departure,vehicle_id
T0001,Route 2,2025-10-01 06:00:00,V101

stop_events.csv:

trip_id,stop_name,scheduled,actual,passengers_on,passengers_off
T0001,Le Conte / Broxton,2025-10-01 06:00:00,2025-10-01 06:00:00,29,0
T0001,Le Conte / Westwood,2025-10-01 06:02:00,2025-10-01 06:02:00,25,29

Create load_data.py that:

  1. Connects to PostgreSQL
  2. Runs schema.sql to create tables
  3. Loads CSVs in correct order (handle foreign keys)
  4. Reports statistics
python load_data.py --datadir data/

Connection parameters default to the docker-compose environment (same as queries.py).

Example output:

Connected to transit@localhost
Creating schema...
Tables created: lines, stops, line_stops, trips, stop_events

Loading data/lines.csv... 5 rows
Loading data/stops.csv... 105 rows
Loading data/line_stops.csv... 105 rows
Loading data/trips.csv... 525 rows
Loading data/stop_events.csv... 11,025 rows

Total: 11,765 rows loaded
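Load order matters because of the foreign keys: parent tables must be populated before the tables that reference them. The following is a minimal sketch of the pure-Python pieces (the psycopg2 connection and `executemany` calls are omitted; `build_insert` and `read_rows` are illustrative helper names, not required ones):

```python
import csv
import io

# Parents before children so foreign key references resolve
# (an assumed order matching the five entities in Part A).
LOAD_ORDER = ["lines", "stops", "line_stops", "trips", "stop_events"]

def build_insert(table, columns):
    """Return a parameterized INSERT for psycopg2's %s placeholders."""
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

def read_rows(csv_text):
    """Parse CSV text into (header, rows) using only the standard library."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return header, list(reader)
```

Note that line_stops.csv, trips.csv, and stop_events.csv reference lines and stops by name; if you choose surrogate keys, you must look up the generated IDs (or use an `INSERT ... SELECT` join against the parent tables) before inserting child rows.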

Part C: Query Implementation

Create queries.py that implements 10 SQL queries.

Your script must support:

python queries.py Q1 --format json
python queries.py all

Connection parameters default to the docker-compose environment (--host db, --dbname transit, --user transit, --password transit123). Override with flags if running outside compose:

python queries.py Q1 --host localhost --dbname transit --user transit --password transit123

Required Queries:

Q1: List all stops on Route 20 in order

-- Output: stop_name, sequence, time_offset
SELECT ...
FROM line_stops ls
JOIN lines l ON ...
JOIN stops s ON ...
WHERE l.line_name = 'Route 20'
ORDER BY ls.sequence;

Q2: Trips during morning rush (7-9 AM)

-- Output: trip_id, line_name, scheduled_departure

Q3: Transfer stops (stops on 2+ routes)

-- Output: stop_name, line_count
-- Uses: GROUP BY, HAVING

Q4: Complete route for trip T0001

-- Output: All stops for specific trip in order
-- Multi-table JOIN

Q5: Routes serving both Wilshire / Veteran and Le Conte / Broxton

-- Output: line_name

Q6: Average ridership by line

-- Output: line_name, avg_passengers
-- Aggregation across stop_events

Q7: Top 10 busiest stops

-- Output: stop_name, total_activity
-- total_activity = SUM(passengers_on + passengers_off)

Q8: Count delays by line (>2 min late)

-- Output: line_name, delay_count
-- WHERE actual > scheduled + interval '2 minutes'

Q9: Trips with 3+ delayed stops

-- Output: trip_id, delayed_stop_count
-- Uses: Subquery or HAVING

Q10: Stops with above-average ridership

-- Output: stop_name, total_boardings
-- Subquery for AVG comparison

Output format (JSON):

{
  "query": "Q1",
  "description": "Route 20 stops in order",
  "results": [
    {"stop_name": "Wilshire / Veteran", "sequence": 1, "time_offset": 0},
    {"stop_name": "Le Conte / Broxton", "sequence": 2, "time_offset": 5}
  ],
  "count": 2
}
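One way to produce this envelope is a small wrapper around the cursor results (`make_result` and `to_json` are illustrative names, not part of the required interface):

```python
import json

def make_result(query, description, rows):
    """Wrap query rows in the required JSON output envelope."""
    return {"query": query, "description": description,
            "results": rows, "count": len(rows)}

def to_json(envelope):
    # psycopg2 returns datetime and Decimal objects that json.dumps cannot
    # serialize natively; default=str is a simple fallback for both.
    return json.dumps(envelope, indent=2, default=str)
```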

Part D: Docker Setup

A Dockerfile, docker-compose.yaml, and requirements.txt are provided in the starter code.

Code: Dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y postgresql-client && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY schema.sql load_data.py queries.py ./
COPY data/ /app/data/

CMD ["python", "load_data.py"]
Code: docker-compose.yaml
version: '3.8'

services:
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: transit
      POSTGRES_USER: transit
      POSTGRES_PASSWORD: transit123
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  app:
    build: .
    depends_on:
      - db
    volumes:
      - ./data:/app/data:ro

  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    depends_on:
      - db

volumes:
  pgdata:
Code: requirements.txt
psycopg2-binary>=2.9.0

The compose file defines three services:

  • db - PostgreSQL 15 database (user: transit, password: transit123, database: transit)
  • app - Your Python scripts, built from the Dockerfile
  • adminer - Web UI at http://localhost:8080 for browsing tables and running queries

Part E: Testing

Build and start the database, then test your scripts:

docker-compose build
docker-compose up -d db
sleep 5

# Load data
docker-compose run --rm app python load_data.py --datadir /app/data

# Run a single query
docker-compose run --rm app python queries.py Q1 --format json

# Run all queries
docker-compose run --rm app python queries.py all

Browse tables interactively at http://localhost:8080 (Adminer).

Part F: README Analysis

Create README.md answering:

  1. Schema Decisions: Natural vs surrogate keys? Why?
  2. Constraints: What CHECK/UNIQUE constraints did you add?
  3. Complex Query: Which query was hardest? Why?
  4. Foreign Keys: Give an example of invalid data they prevent
  5. When Relational: Why is SQL good for this domain?

Deliverables

See Submission.

Important: Grading Commands

We will validate your submission by running the following commands from your q1/ directory:

docker-compose build
docker-compose up -d db
sleep 5
docker-compose run --rm app python load_data.py --datadir /app/data
docker-compose run --rm app python queries.py all --format json
docker-compose down

These commands must complete without errors. We will then verify:

  • Schema has proper constraints (foreign keys, CHECK, UNIQUE)
  • All 10 queries return correct results
  • Data loading handles CSV import order correctly
  • Queries execute within 500ms on the provided dataset
  • README answers all five analysis questions

Problem 2: ArXiv Paper Discovery with DynamoDB

Build a paper discovery system using AWS DynamoDB that efficiently supports multiple access patterns through schema design and denormalization.

This problem requires boto3 (AWS SDK for Python) and Python standard library modules (json, sys, os, datetime, re, collections). Do not use other AWS libraries, NoSQL ORMs, or database abstraction layers beyond boto3.

Part A: Schema Design for Access Patterns

Design a DynamoDB table schema that efficiently supports these required query patterns:

  1. Browse recent papers by category (e.g., “Show me latest ML papers”)
  2. Find all papers by a specific author
  3. Get full paper details by arxiv_id
  4. List papers published in a date range within a category
  5. Search papers by keyword (extracted from abstract)

Design Requirements:

  • Define partition key and sort key for main table
  • Design Global Secondary Indexes (GSIs) to support all access patterns
  • Implement denormalization strategy for efficient queries
  • Document trade-offs in your schema design

Example Schema Structure:

# Main Table Item
{
  "PK": "CATEGORY#cs.LG",
  "SK": "2023-01-15#2301.12345",
  "arxiv_id": "2301.12345",
  "title": "Paper Title",
  "authors": ["Author1", "Author2"],
  "abstract": "Full abstract text...",
  "categories": ["cs.LG", "cs.AI"],
  "keywords": ["keyword1", "keyword2"],
  "published": "2023-01-15T10:30:00Z"
}

# GSI1: Author access
{
  "GSI1PK": "AUTHOR#Author1",
  "GSI1SK": "2023-01-15",
  # ... rest of paper data
}

# Additional GSIs as needed for other access patterns
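The denormalization step can be sketched as a function that fans one paper out into several items under the key layout above (one possible design, not the only correct one). Because a single item carries at most one `GSI1PK` value, each author needs its own item:

```python
def build_items(paper):
    """Fan one paper dict out into multiple DynamoDB items (sketch)."""
    date = paper["published"][:10]  # YYYY-MM-DD prefix of the ISO timestamp
    items = []
    # One item per category, queryable by the main table's PK/SK.
    for category in paper["categories"]:
        items.append({
            **paper,
            "PK": f"CATEGORY#{category}",
            "SK": f"{date}#{paper['arxiv_id']}",
        })
    # One item per author, carrying the GSI1 attributes for AuthorIndex.
    for author in paper["authors"]:
        items.append({
            **paper,
            "PK": f"PAPER#{paper['arxiv_id']}",
            "SK": f"AUTHOR#{author}",
            "GSI1PK": f"AUTHOR#{author}",
            "GSI1SK": date,
        })
    return items
```

Keyword items would follow the same pattern with their own GSI attributes.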

Part B: Data Loading Script

Create load_data.py that loads ArXiv papers from your HW#2 Problem 2 output (papers.json) into DynamoDB.

Your script must accept these command line arguments:

python load_data.py <papers_json_path> <table_name> [--region REGION]

Required Operations:

  1. Create DynamoDB table with appropriate partition/sort keys
  2. Create GSIs for alternate access patterns
  3. Transform paper data from HW#2 format to DynamoDB items
  4. Extract keywords from abstracts (top 10 most frequent words, excluding stopwords)
  5. Implement denormalization:
    • Papers in multiple categories → multiple items
    • Multiple authors → items for each author (GSI)
    • Multiple keywords → items for each keyword (GSI)
  6. Batch write items to DynamoDB (use batch_write_item for efficiency)
  7. Report statistics:
    • Number of papers loaded
    • Total DynamoDB items created
    • Denormalization factor (items/paper ratio)
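`batch_write_item` accepts at most 25 put requests per call, so the denormalized items must be chunked. A minimal sketch, with the boto3 call shape shown in comments (`dynamodb` resource and `table_name` are assumed to exist):

```python
def chunks(items, size=25):
    """Yield successive batches; batch_write_item caps at 25 puts per call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Write loop sketch:
# for batch in chunks(all_items):
#     dynamodb.batch_write_item(RequestItems={
#         table_name: [{"PutRequest": {"Item": item}} for item in batch]
#     })
```

Alternatively, the `Table.batch_writer()` context manager in boto3 handles the 25-item limit and retrying of unprocessed items automatically, which may be simpler.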

Example Output:

Creating DynamoDB table: arxiv-papers
Creating GSIs: AuthorIndex, PaperIdIndex, KeywordIndex
Loading papers from papers.json...
Extracting keywords from abstracts...
Loaded 157 papers
Created 2,826 DynamoDB items (denormalized)
Denormalization factor: 18.0x

Storage breakdown:
  - Category items: 314 (2.0 per paper avg)
  - Author items: 785 (5.0 per paper avg)
  - Keyword items: 1,570 (10.0 per paper avg)
  - Paper ID items: 157 (1.0 per paper)

Keyword Extraction:

A stopwords.py is provided in the starter code (same base list from HW#2, plus domain-specific terms for academic papers). Use STOPWORDS from this module when filtering keywords.

Code: stopwords.py
"""
Stopwords for keyword extraction.

Base stopwords from HW#2, plus domain-specific terms for academic papers.
"""

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
             'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
             'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
             'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
             'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
             'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
             'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}

DOMAIN_STOPWORDS = {'use', 'using', 'based', 'approach', 'method',
                    'paper', 'propose', 'proposed', 'show', 'our'}

STOPWORDS = STOPWORDS | DOMAIN_STOPWORDS
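The extraction step might look like the following sketch. It inlines an abbreviated stopword set purely for illustration; your script should import `STOPWORDS` from the provided stopwords.py instead:

```python
import re
from collections import Counter

# Abbreviated set for illustration only -- use stopwords.STOPWORDS in load_data.py.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "we", "our"}

def extract_keywords(abstract, n=10):
    """Return the n most frequent non-stopword tokens from an abstract."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]
```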

Part C: Query Implementation

Create query_papers.py that implements queries for all five access patterns.

Your script must support these commands:

# Query 1: Recent papers in category
python query_papers.py recent <category> [--limit 20] [--table TABLE]

# Query 2: Papers by author
python query_papers.py author <author_name> [--table TABLE]

# Query 3: Get paper by ID
python query_papers.py get <arxiv_id> [--table TABLE]

# Query 4: Papers in date range
python query_papers.py daterange <category> <start_date> <end_date> [--table TABLE]

# Query 5: Papers by keyword
python query_papers.py keyword <keyword> [--limit 20] [--table TABLE]

Each query function should use the appropriate DynamoDB operation:

  • Query 1: Main table partition key query with sort key descending
  • Query 2: GSI (AuthorIndex) partition key query
  • Query 3: GSI (PaperIdIndex) for direct lookup
  • Query 4: Main table with composite sort key range query (between)
  • Query 5: GSI (KeywordIndex) partition key query
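For Query 4, the composite sort key `date#arxiv_id` turns the date range into a lexicographic BETWEEN on strings. A sketch (`daterange_bounds` is an illustrative helper, and the boto3 call in the comments assumes the example key names from Part A):

```python
def daterange_bounds(start_date, end_date):
    """Sort-key bounds for BETWEEN on 'YYYY-MM-DD#arxiv_id' composite keys.

    The high code point appended to the end date keeps papers published
    on the end date itself inside the (lexicographic) range.
    """
    return f"{start_date}#", f"{end_date}#\uffff"

# Query sketch with boto3 (table and key names assumed):
# from boto3.dynamodb.conditions import Key
# lo, hi = daterange_bounds(start_date, end_date)
# resp = table.query(
#     KeyConditionExpression=Key("PK").eq(f"CATEGORY#{category}")
#                            & Key("SK").between(lo, hi))
```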

Output Format:

All queries must output JSON to stdout:

{
  "query_type": "recent_in_category",
  "parameters": {
    "category": "cs.LG",
    "limit": 20
  },
  "results": [
    {
      "arxiv_id": "2301.12345",
      "title": "Paper Title",
      "authors": ["Author1", "Author2"],
      "published": "2023-01-15T10:30:00Z",
      "categories": ["cs.LG"]
    }
  ],
  "count": 20,
  "execution_time_ms": 12
}

Part D: Analysis and Documentation

Create README.md answering:

  1. Schema Design Decisions:
    • Why did you choose your partition key structure?
    • How many GSIs did you create and why?
    • What denormalization trade-offs did you make?
  2. Denormalization Analysis:
    • Average number of DynamoDB items per paper
    • Storage multiplication factor
    • Which access patterns caused the most duplication?
  3. Query Limitations:
    • What queries are NOT efficiently supported by your schema?
    • Examples: “Count total papers by author”, “Most cited papers globally”
    • Why are these difficult in DynamoDB?
  4. When to Use DynamoDB:
    • Based on this exercise, when would you choose DynamoDB over PostgreSQL?
    • What are the key trade-offs?

Deliverables

See Submission.

Important: Grading Commands

We will validate your submission by running the following commands from your q2/ directory:

pip install -r requirements.txt
python load_data.py papers.json arxiv-papers --region us-west-2
python query_papers.py recent cs.LG --limit 5 --table arxiv-papers
python query_papers.py author "Author Name" --table arxiv-papers
python query_papers.py get 2301.12345 --table arxiv-papers
python query_papers.py daterange cs.LG 2023-01-01 2023-12-31 --table arxiv-papers
python query_papers.py keyword transformer --limit 10 --table arxiv-papers

We will also verify:

  • DynamoDB table and GSIs are created correctly
  • All five query patterns return correct results
  • Denormalization is implemented (multiple items per paper)
  • README answers all analysis questions

Tip: Submission
README.md
q1/
├── schema.sql
├── load_data.py
├── queries.py
├── Dockerfile
├── docker-compose.yaml
└── requirements.txt
q2/
├── load_data.py
├── query_papers.py
├── stopwords.py
├── requirements.txt
└── README.md