Problem 1: Metro Transit Database (Relational SQL)
Use only the following packages:
- psycopg2-binary (PostgreSQL adapter for Python)
- Python standard library modules (json, sys, os, csv, datetime, argparse)
Do not use ORMs (SQLAlchemy, Django ORM), query builders, or pandas. Write raw SQL.
Design and query a relational database for a city metro transit system tracking lines, stops, trips, and ridership.
Part A: Schema Design
You will design a database schema based on these entity descriptions. The CSV data files are provided.
Entities:
- Lines - Transit routes (e.g., Route 2, Route 20)
  - Has: name, vehicle type
- Stops - Stop locations
  - Has: name, latitude, longitude
- Line Stops - Which stops each line serves, in order
  - Links a line to its stops with a sequence number and a time offset from the start
  - A stop can be on multiple lines (transfer stations)
- Trips - Scheduled vehicle runs
  - Has: line, departure time, vehicle ID
- Stop Events - Actual arrivals during trips
  - Has: trip, stop, scheduled time, actual time, passengers on/off
Relationships:
- “A line has many stops (ordered)” → line_stops table
- “A trip runs on one line” → foreign key
- “A trip has many stop events” → foreign key
Your Task:
Create schema.sql defining all five tables. You must decide:
- Primary keys (natural vs surrogate keys)
- Foreign key constraints
- CHECK constraints (e.g., passengers_on >= 0)
- UNIQUE constraints where appropriate
Example starter (you complete this):
CREATE TABLE lines (
    line_id SERIAL PRIMARY KEY,  -- or use line_name as natural key?
    line_name VARCHAR(50) NOT NULL UNIQUE,
    vehicle_type VARCHAR(10) CHECK (vehicle_type IN ('rail', 'bus'))
);
CREATE TABLE stops (
    -- Your design here
);
CREATE TABLE line_stops (
    -- Your design here
    -- Must handle: line_id, stop_id, sequence_number, time_offset_minutes
);
-- Continue for trips and stop_events
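For orientation, here is one possible shape for line_stops (a hedged sketch, not the required answer), following the surrogate-key pattern of the lines example:
CREATE TABLE line_stops (
    line_stop_id SERIAL PRIMARY KEY,
    line_id INT NOT NULL REFERENCES lines(line_id),
    stop_id INT NOT NULL REFERENCES stops(stop_id),
    sequence_number INT NOT NULL CHECK (sequence_number >= 1),
    time_offset_minutes INT NOT NULL CHECK (time_offset_minutes >= 0),
    UNIQUE (line_id, sequence_number)  -- one stop per position on a line
);
The composite UNIQUE constraint is one way to guarantee a line cannot list two stops at the same position; your design may reasonably differ.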
Part B: Data Loading
Provided CSV files:
- lines.csv - 5 routes
- stops.csv - ~105 stops
- line_stops.csv - ~105 combinations
- trips.csv - ~525 trips
- stop_events.csv - ~11,000 events
CSV Formats:
lines.csv:
line_name,vehicle_type
Route 2,bus
Route 4,bus
Route 20,bus
stops.csv:
stop_name,latitude,longitude
Wilshire / Veteran,34.057616,-118.447888
Le Conte / Broxton,34.063594,-118.446732
line_stops.csv:
line_name,stop_name,sequence,time_offset
Route 20,Wilshire / Veteran,1,0
Route 20,Le Conte / Broxton,2,5
trips.csv:
trip_id,line_name,scheduled_departure,vehicle_id
T0001,Route 2,2025-10-01 06:00:00,V101
stop_events.csv:
trip_id,stop_name,scheduled,actual,passengers_on,passengers_off
T0001,Le Conte / Broxton,2025-10-01 06:00:00,2025-10-01 06:00:00,29,0
T0001,Le Conte / Westwood,2025-10-01 06:02:00,2025-10-01 06:02:00,25,29
Create load_data.py that:
- Connects to PostgreSQL
- Runs schema.sql to create tables
- Loads CSVs in dependency order (lines and stops before line_stops and trips; stop_events last) to satisfy foreign keys - see the sketch below
- Reports statistics
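A sketch of the trickiest loading step, assuming the surrogate-key schema sketched in Part A: line_stops.csv refers to lines and stops by name, so names must be resolved to IDs at insert time.
import csv

def load_line_stops(conn, path):
    # conn: an open psycopg2 connection; lines and stops are already loaded.
    with conn.cursor() as cur, open(path, newline="") as f:
        rows = list(csv.DictReader(f))
        for r in rows:
            # Scalar subqueries resolve names to IDs; if a name does not
            # match, the NOT NULL constraints reject the row loudly.
            cur.execute(
                """
                INSERT INTO line_stops
                    (line_id, stop_id, sequence_number, time_offset_minutes)
                VALUES (
                    (SELECT line_id FROM lines WHERE line_name = %s),
                    (SELECT stop_id FROM stops WHERE stop_name = %s),
                    %s, %s
                )
                """,
                (r["line_name"], r["stop_name"],
                 int(r["sequence"]), int(r["time_offset"])),
            )
    conn.commit()
    return len(rows)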
python load_data.py --host localhost --dbname transit --user postgres --password pass --datadir data/
Example output:
Connected to transit@localhost
Creating schema...
Tables created: lines, stops, line_stops, trips, stop_events
Loading data/lines.csv... 5 rows
Loading data/stops.csv... 105 rows
Loading data/line_stops.csv... 105 rows
Loading data/trips.csv... 525 rows
Loading data/stop_events.csv... 11,025 rows
Total: 11,765 rows loaded
Part C: Query Implementation
Create queries.py that implements 10 SQL queries.
Your script must support:
python queries.py --query Q1 --dbname transit --format json
python queries.py --all --dbname transit
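One possible CLI skeleton (a sketch; the flag names follow the invocations above, the defaults are assumptions):
import argparse

def parse_args():
    # Flags mirror the example invocations; defaults here are assumptions.
    p = argparse.ArgumentParser(description="Run transit SQL queries")
    p.add_argument("--query", metavar="QID", help="run one query, e.g. Q1")
    p.add_argument("--all", action="store_true", help="run Q1 through Q10")
    p.add_argument("--host", default="localhost")
    p.add_argument("--dbname", required=True)
    p.add_argument("--user", default="postgres")
    p.add_argument("--password", default="")
    p.add_argument("--format", choices=["json"], default="json")
    return p.parse_args()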
Required Queries:
Q1: List all stops on Route 20 in order
-- Output: stop_name, sequence, time_offset
SELECT ...
FROM line_stops ls
JOIN lines l ON ...
JOIN stops s ON ...
WHERE l.line_name = 'Route 20'
ORDER BY ls.sequence;
Q2: Trips during morning rush (7-9 AM)
-- Output: trip_id, line_name, scheduled_departure
Q3: Transfer stops (stops on 2+ routes)
-- Output: stop_name, line_count
-- Uses: GROUP BY, HAVING
Q4: Complete route for trip T0001
-- Output: All stops for the specific trip in order
-- Multi-table JOIN
Q5: Routes serving both Wilshire / Veteran and Le Conte / Broxton
-- Output: line_name
Q6: Average ridership by line
-- Output: line_name, avg_passengers
-- Aggregation across stop_events
Q7: Top 10 busiest stops
-- Output: stop_name, total_activity
-- total_activity = SUM(passengers_on + passengers_off)
Q8: Count delays by line (>2 min late)
-- Output: line_name, delay_count
-- WHERE actual > scheduled + interval '2 minutes'
Q9: Trips with 3+ delayed stops
-- Output: trip_id, delayed_stop_count
-- Uses: Subquery or HAVING
Q10: Stops with above-average ridership
-- Output: stop_name, total_boardings
-- Subquery for AVG comparison
Output format (JSON):
{
  "query": "Q1",
  "description": "Route 20 stops in order",
  "results": [
    {"stop_name": "Wilshire / Veteran", "sequence": 1, "time_offset": 0},
    {"stop_name": "Le Conte / Broxton", "sequence": 2, "time_offset": 5}
  ],
  "count": 2
}
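Note that the datetime values psycopg2 returns are not JSON-serializable, so a plain json.dumps on the rows will raise TypeError. One way to build the envelope (a sketch, assuming a hypothetical QUERIES registry of (description, sql) pairs keyed by ID):
import json

# Hypothetical registry; each entry pairs a description with its SQL.
QUERIES = {
    "Q1": ("Route 20 stops in order", "SELECT ..."),  # SQL elided; see Part C
}

def run_query(conn, qid):
    # conn: an open psycopg2 connection.
    description, sql = QUERIES[qid]
    with conn.cursor() as cur:
        cur.execute(sql)
        cols = [c.name for c in cur.description]  # column names from the cursor
        rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    envelope = {"query": qid, "description": description,
                "results": rows, "count": len(rows)}
    # default=str covers datetime and Decimal values from PostgreSQL.
    return json.dumps(envelope, indent=2, default=str)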
Part D: Docker Setup
Create docker-compose.yaml (note the app service: build.sh and run.sh below build and run it):
version: '3.8'
services:
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: transit
      POSTGRES_USER: transit
      POSTGRES_PASSWORD: transit123
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
  app:
    build: .
    volumes:
      - ./data:/app/data
    depends_on:
      - db
  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    depends_on:
      - db
volumes:
  pgdata:
Adminer provides a web UI at http://localhost:8080 to browse tables and run queries (System: PostgreSQL, Server: db, Username: transit, Password: transit123, Database: transit).
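depends_on only orders container startup; it does not wait for PostgreSQL to accept connections. run.sh below handles the wait, but an optional healthcheck lets compose gate on readiness itself (a sketch; merge under the db service):
# Optional, under services.db; allows depends_on with condition: service_healthy
healthcheck:
  test: ["CMD-SHELL", "pg_isready -U transit -d transit"]
  interval: 2s
  timeout: 3s
  retries: 15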
Create Dockerfile:
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y postgresql-client && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY schema.sql load_data.py queries.py ./
CMD ["python", "load_data.py"]Create requirements.txt:
psycopg2-binary>=2.9.0
Part E: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker-compose build
Create run.sh:
#!/bin/bash
echo "Starting PostgreSQL..."
docker-compose up -d db
echo "Waiting for database..."
until docker-compose exec -T db pg_isready -U transit -d transit >/dev/null 2>&1; do
  sleep 1
done
echo "Loading data..."
docker-compose run --rm app python load_data.py \
--host db --dbname transit --user transit --password transit123 --datadir /app/data
echo ""
echo "Running sample queries..."
docker-compose run --rm app python queries.py --query Q1 --dbname transit
docker-compose run --rm app python queries.py --query Q3 --dbname transit
Create test.sh:
#!/bin/bash
./build.sh
./run.sh
echo ""
echo "Testing all queries..."
for i in {1..10}; do
docker-compose run --rm app python queries.py --query Q$i --dbname transit --format json
done
docker-compose down
Part F: README Analysis
Create README.md answering:
- Schema Decisions: Natural vs surrogate keys? Why?
- Constraints: What CHECK/UNIQUE constraints did you add?
- Complex Query: Which query was hardest? Why?
- Foreign Keys: Give an example of invalid data they prevent
- When Relational: Why is SQL good for this domain?
Deliverables
Your problem1/ directory must contain:
problem1/
├── schema.sql
├── load_data.py
├── queries.py
├── Dockerfile
├── docker-compose.yaml
├── requirements.txt
├── build.sh
├── run.sh
├── test.sh
└── README.md
All shell scripts must be executable (chmod +x *.sh).
Validation
We will test by:
- Running ./build.sh - must complete without errors
- Running ./test.sh - must load data and execute all 10 queries
- Verifying schema has proper constraints
- Testing queries return correct results
- Checking README answers all questions
Queries must execute within 500ms on the provided dataset.
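The dataset is small enough that straightforward SQL should meet this comfortably; if a query nears the limit, indexes on the join and filter columns are the usual fix. A hedged sketch for schema.sql, assuming the surrogate-key column names used in the Part A example:
-- Candidate indexes; adjust names to match your schema.
CREATE INDEX idx_trips_line ON trips (line_id);
CREATE INDEX idx_stop_events_trip ON stop_events (trip_id);
CREATE INDEX idx_stop_events_stop ON stop_events (stop_id);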