Problem 1: Metro Transit Database (Relational SQL)

Requirements

Use only the following packages:

  • psycopg2-binary (PostgreSQL adapter for Python)
  • Python standard library modules (json, sys, os, csv, datetime, argparse)

Do not use ORMs (SQLAlchemy, Django ORM), query builders, or pandas. Write raw SQL.

Design and query a relational database for a city metro transit system tracking lines, stops, trips, and ridership.

Part A: Schema Design

You will design a database schema based on these entity descriptions. The CSV data files are provided.

Entities:

  1. Lines - Transit routes (e.g., Route 2, Route 20)
    • Has: name, vehicle type
  2. Stops - Stop locations
    • Has: name, latitude, longitude
  3. Line Stops - Which stops each line serves, in order
    • Links a line to a stop with a sequence number and a time offset (minutes from route start)
    • A stop can be on multiple lines (transfer stations)
  4. Trips - Scheduled vehicle runs
    • Has: line, departure time, vehicle ID
  5. Stop Events - Actual arrivals during trips
    • Has: trip, stop, scheduled time, actual time, passengers on/off

Relationships:

  • “A line has many stops (ordered)” → line_stops table
  • “A trip runs on one line” → foreign key
  • “A trip has many stop events” → foreign key

Your Task:

Create schema.sql defining all five tables. You must decide:

  • Primary keys (natural vs surrogate keys)
  • Foreign key constraints
  • CHECK constraints (e.g., passengers_on >= 0)
  • UNIQUE constraints where appropriate

Example starter (you complete this):

CREATE TABLE lines (
    line_id SERIAL PRIMARY KEY,  -- or use line_name as natural key?
    line_name VARCHAR(50) NOT NULL UNIQUE,
    vehicle_type VARCHAR(10) CHECK (vehicle_type IN ('rail', 'bus'))
);

CREATE TABLE stops (
    -- Your design here
);

CREATE TABLE line_stops (
    -- Your design here
    -- Must handle: line_id, stop_id, sequence_number, time_offset_minutes
);

-- Continue for trips and stop_events
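
For orientation, here is a minimal sketch of the junction-table pattern that line_stops calls for, executed through psycopg2 the way load_data.py will run schema.sql. It is one possible shape, not the required design: the stop_id column and both UNIQUE constraints are assumptions (the second UNIQUE also assumes no line visits the same stop twice).

import psycopg2

# Hypothetical DDL for the line_stops junction table. The composite UNIQUE
# constraints enforce one row per (line, stop) and one stop per position;
# the CHECKs reject negative sequence numbers and time offsets.
LINE_STOPS_DDL = """
CREATE TABLE line_stops (
    line_stop_id        SERIAL PRIMARY KEY,
    line_id             INTEGER NOT NULL REFERENCES lines(line_id),
    stop_id             INTEGER NOT NULL REFERENCES stops(stop_id),
    sequence_number     INTEGER NOT NULL CHECK (sequence_number >= 1),
    time_offset_minutes INTEGER NOT NULL CHECK (time_offset_minutes >= 0),
    UNIQUE (line_id, stop_id),
    UNIQUE (line_id, sequence_number)
);
"""

# Credentials match the docker-compose.yaml in Part D.
with psycopg2.connect(host="localhost", dbname="transit",
                      user="transit", password="transit123") as conn:
    with conn.cursor() as cur:
        cur.execute(LINE_STOPS_DDL)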

Part B: Data Loading

Provided CSV files:

  • lines.csv - 5 routes
  • stops.csv - ~105 stops
  • line_stops.csv - ~105 line-stop pairs
  • trips.csv - ~525 trips
  • stop_events.csv - ~11,000 events

CSV Formats:

lines.csv:

line_name,vehicle_type
Route 2,bus
Route 4,bus
Route 20,bus

stops.csv:

stop_name,latitude,longitude
Wilshire / Veteran,34.057616,-118.447888
Le Conte / Broxton,34.063594,-118.446732

line_stops.csv:

line_name,stop_name,sequence,time_offset
Route 20,Wilshire / Veteran,1,0
Route 20,Le Conte / Broxton,2,5

trips.csv:

trip_id,line_name,scheduled_departure,vehicle_id
T0001,Route 2,2025-10-01 06:00:00,V101

stop_events.csv:

trip_id,stop_name,scheduled,actual,passengers_on,passengers_off
T0001,Le Conte / Broxton,2025-10-01 06:00:00,2025-10-01 06:00:00,29,0
T0001,Le Conte / Westwood,2025-10-01 06:02:00,2025-10-01 06:02:00,25,29

Create load_data.py that:

  1. Connects to PostgreSQL
  2. Runs schema.sql to create tables (drop and recreate first, so reruns start from a clean slate)
  3. Loads CSVs in correct order (handle foreign keys)
  4. Reports statistics

Usage:

python load_data.py --host localhost --dbname transit --user postgres --password pass --datadir data/

Example output:

Connected to transit@localhost
Creating schema...
Tables created: lines, stops, line_stops, trips, stop_events

Loading data/lines.csv... 5 rows
Loading data/stops.csv... 105 rows
Loading data/line_stops.csv... 105 rows
Loading data/trips.csv... 525 rows
Loading data/stop_events.csv... 11,025 rows

Total: 11,765 rows loaded
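
The tricky part of step 3 is resolving the names in the CSVs to the ids the schema uses. A minimal sketch of that pattern for line_stops.csv, assuming the starter's line_id plus the stop_id and line_stops columns from the Part A sketch (all assumptions):

import csv
import psycopg2

def load_line_stops(conn, path):
    """Load line_stops.csv, translating names to surrogate ids.
    Column names follow the Part A sketch and are assumptions."""
    with conn.cursor() as cur, open(path, newline="") as f:
        # The CSV carries human-readable names, but the foreign keys are
        # integer ids, so build name -> id lookups first.
        cur.execute("SELECT line_name, line_id FROM lines")
        line_ids = dict(cur.fetchall())
        cur.execute("SELECT stop_name, stop_id FROM stops")
        stop_ids = dict(cur.fetchall())
        rows = 0
        for rec in csv.DictReader(f):
            cur.execute(
                "INSERT INTO line_stops"
                " (line_id, stop_id, sequence_number, time_offset_minutes)"
                " VALUES (%s, %s, %s, %s)",
                (line_ids[rec["line_name"]], stop_ids[rec["stop_name"]],
                 int(rec["sequence"]), int(rec["time_offset"])),
            )
            rows += 1
    conn.commit()
    return rows

Loading lines and stops first, then line_stops and trips, then stop_events satisfies every foreign key.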

Part C: Query Implementation

Create queries.py that implements 10 SQL queries.

Your script must support:

python queries.py --query Q1 --dbname transit --format json
python queries.py --all --dbname transit

queries.py must also accept the same connection flags as load_data.py (--host, --user, --password), which run.sh passes when invoking it inside Docker.

Required Queries:

Q1: List all stops on Route 20 in order

-- Output: stop_name, sequence, time_offset
SELECT ...
FROM line_stops ls
JOIN lines l ON ...
JOIN stops s ON ...
WHERE l.line_name = 'Route 20'
ORDER BY ls.sequence;

Q2: Trips during morning rush (7-9 AM)

-- Output: trip_id, line_name, scheduled_departure

Q3: Transfer stops (stops on 2+ routes)

-- Output: stop_name, line_count
-- Uses: GROUP BY, HAVING

Q4: Complete route for trip T0001

-- Output: All stops for specific trip in order
-- Multi-table JOIN

Q5: Routes serving both Wilshire / Veteran and Le Conte / Broxton

-- Output: line_name

Q6: Average ridership by line

-- Output: line_name, avg_passengers
-- Aggregation across stop_events

Q7: Top 10 busiest stops

-- Output: stop_name, total_activity
-- total_activity = SUM(passengers_on + passengers_off)

Q8: Count delays by line (>2 min late)

-- Output: line_name, delay_count
-- WHERE actual > scheduled + interval '2 minutes'

Q9: Trips with 3+ delayed stops

-- Output: trip_id, delayed_stop_count
-- Delayed = arrived more than 2 minutes late, as in Q8
-- Uses: Subquery or HAVING

Q10: Stops with above-average ridership

-- Output: stop_name, total_boardings
-- Subquery for AVG comparison
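
Several of these (Q3, Q9, Q10) lean on the same GROUP BY / HAVING-with-subquery shape. Here is a demo of that shape on a throwaway in-memory table, using the stdlib sqlite3 module purely to keep the demo self-contained; your queries.py must run raw SQL against PostgreSQL, and every name below is a placeholder, not the answer:

import sqlite3

# Throwaway table: per-event boardings at three stops.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE boardings (stop TEXT, n INTEGER)")
conn.executemany("INSERT INTO boardings VALUES (?, ?)",
                 [("A", 10), ("A", 30), ("B", 2), ("C", 50)])

# GROUP BY collapses events per stop; HAVING compares each stop's total
# to the average of all per-stop totals, computed by the nested subquery.
# The same shape works verbatim in PostgreSQL.
rows = conn.execute("""
    SELECT stop, SUM(n) AS total
    FROM boardings
    GROUP BY stop
    HAVING SUM(n) > (SELECT AVG(t)
                     FROM (SELECT SUM(n) AS t
                           FROM boardings
                           GROUP BY stop) AS per_stop)
""").fetchall()
print(rows)  # A and C: their totals exceed the per-stop average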

Output format (JSON):

{
  "query": "Q1",
  "description": "List all stops on Route 20 in order",
  "results": [
    {"stop_name": "Wilshire / Veteran", "sequence": 1, "time_offset": 0},
    {"stop_name": "Le Conte / Broxton", "sequence": 2, "time_offset": 5}
  ],
  "count": 2
}
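
A minimal sketch of the dispatch and JSON plumbing queries.py needs. The registry content and the placeholder SQL are illustrative, and RealDictCursor is one convenient way to get dict-shaped rows out of psycopg2:

import argparse
import json
import psycopg2
import psycopg2.extras

# Illustrative registry mapping query id to (description, SQL). The SQL
# here is a stand-in; your ten real queries go in its place.
QUERIES = {
    "Q1": ("List all stops on Route 20 in order",
           "SELECT 1 AS placeholder"),
    # ... Q2 through Q10
}

def run_query(conn, qid):
    description, sql = QUERIES[qid]
    # RealDictCursor returns each row as a dict keyed by column name,
    # which serializes straight into the required JSON shape.
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(sql)
        results = [dict(row) for row in cur.fetchall()]
    return {"query": qid, "description": description,
            "results": results, "count": len(results)}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--query")
    parser.add_argument("--all", action="store_true")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--dbname", required=True)
    parser.add_argument("--user", default="transit")
    parser.add_argument("--password", default="transit123")
    parser.add_argument("--format", choices=["json"], default="json")
    args = parser.parse_args()
    conn = psycopg2.connect(host=args.host, dbname=args.dbname,
                            user=args.user, password=args.password)
    for qid in (QUERIES if args.all else [args.query]):
        # default=str lets json.dumps cope with datetime values.
        print(json.dumps(run_query(conn, qid), indent=2, default=str))
    conn.close()

if __name__ == "__main__":
    main()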

Part D: Docker Setup

Create docker-compose.yaml:

version: '3.8'

services:
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: transit
      POSTGRES_USER: transit
      POSTGRES_PASSWORD: transit123
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    depends_on:
      - db

  app:
    build: .
    depends_on:
      - db
    volumes:
      - ./data:/app/data

volumes:
  pgdata:

Adminer provides a web UI at http://localhost:8080 to browse tables and run queries (System: PostgreSQL, Server: db, Username: transit, Password: transit123, Database: transit). The app service builds this project's image from the Dockerfile below; run.sh and test.sh use it to run load_data.py and queries.py against db.

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y postgresql-client && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY schema.sql load_data.py queries.py ./

CMD ["python", "load_data.py"]

Create requirements.txt:

psycopg2-binary>=2.9.0

Part E: Build and Run Scripts

Create build.sh:

#!/bin/bash
docker-compose build

Create run.sh:

#!/bin/bash

echo "Starting PostgreSQL..."
docker-compose up -d db

echo "Waiting for database..."
sleep 3

echo "Loading data..."
docker-compose run --rm app python load_data.py \
    --host db --dbname transit --user transit --password transit123 --datadir /app/data

echo ""
echo "Running sample queries..."
docker-compose run --rm app python queries.py --query Q1 \
    --host db --dbname transit --user transit --password transit123
docker-compose run --rm app python queries.py --query Q3 \
    --host db --dbname transit --user transit --password transit123

Create test.sh:

#!/bin/bash

./build.sh
./run.sh

echo ""
echo "Testing all queries..."
for i in {1..10}; do
    docker-compose run --rm app python queries.py --query Q$i \
        --host db --dbname transit --user transit --password transit123 --format json
done

docker-compose down

Part F: README Analysis

Create README.md answering:

  1. Schema Decisions: Natural vs surrogate keys? Why?
  2. Constraints: What CHECK/UNIQUE constraints did you add?
  3. Complex Query: Which query was hardest? Why?
  4. Foreign Keys: Give an example of invalid data they prevent (the sketch after this list shows the flavor)
  5. When Relational: Why is SQL good for this domain?
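
To show the flavor for question 4, a minimal sketch, assuming stop_events carries foreign keys to trips and stops as in the Part A sketch (the column names are assumptions). The insert references a trip that does not exist, so PostgreSQL rejects it instead of storing an orphan event:

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="transit",
                        user="transit", password="transit123")
with conn.cursor() as cur:
    try:
        # 'T9999' is not in trips, so the foreign key on stop_events
        # (assumed here) refuses the row.
        cur.execute(
            "INSERT INTO stop_events"
            " (trip_id, stop_id, scheduled, actual, passengers_on, passengers_off)"
            " VALUES ('T9999', 1, now(), now(), 5, 0)"
        )
    except psycopg2.errors.ForeignKeyViolation as exc:
        conn.rollback()
        print("Rejected:", exc)
conn.close()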

Deliverables

Your problem1/ directory must contain:

problem1/
├── schema.sql
├── load_data.py
├── queries.py
├── Dockerfile
├── docker-compose.yaml
├── requirements.txt
├── build.sh
├── run.sh
├── test.sh
└── README.md

All shell scripts must be executable (chmod +x *.sh).

Validation

We will test by:

  1. Running ./build.sh - must complete without errors
  2. Running ./test.sh - must load data and execute all 10 queries
  3. Verifying schema has proper constraints
  4. Testing queries return correct results
  5. Checking README answers all questions

Queries must execute within 500ms on the provided dataset.
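
If a query misses the 500ms budget, EXPLAIN ANALYZE is the standard way to see where the time goes; a minimal sketch (the query text is a stand-in for whichever of Q1-Q10 is slow):

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="transit",
                        user="transit", password="transit123")
with conn.cursor() as cur:
    # EXPLAIN ANALYZE executes the query and prints the plan with
    # per-node timings; a missing index usually shows up here.
    cur.execute("EXPLAIN ANALYZE SELECT COUNT(*) FROM stop_events")
    for (line,) in cur.fetchall():
        print(line)
conn.close()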