Homework #2: Docker, ArXiv API, Multi-Container Pipelines
EE 547: Spring 2026
Assigned: 28 January
Due: Tuesday, 10 February at 23:59
Gradescope: Homework 2 | How to Submit
- Docker Desktop must be installed and running on your machine
- Use only Python standard library modules unless explicitly permitted
Overview
This assignment introduces containerization using Docker. You will build and run containers, manage data persistence through volumes, and create multi-container applications using Docker Compose.
Getting Started
Download the starter code: hw2-starter.zip
unzip hw2-starter.zip
cd hw2-starter
Problem 1: Docker Basics – HTTP Data Fetcher
Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.
Part A: Python HTTP Fetcher
Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.
Your script must accept exactly two command line arguments:
- Path to an input file containing URLs (one per line)
- Path to output directory
For each URL in the input file, your script must:
- Perform an HTTP GET request to the URL
- Measure the response time in milliseconds
- Capture the HTTP status code
- Calculate the size of the response body in bytes
- Count the number of words in the response (for text responses only)
Your script must write three files to the output directory:
File 1: responses.json - Array of response data:
[
{
"url": "[URL string]",
"status_code": [integer],
"response_time_ms": [float],
"content_length": [integer],
"word_count": [integer or null],
"timestamp": "[ISO-8601 UTC]",
"error": [null or error message string]
},
...
]
File 2: summary.json - Aggregate statistics:
{
"total_urls": [integer],
"successful_requests": [integer],
"failed_requests": [integer],
"average_response_time_ms": [float],
"total_bytes_downloaded": [integer],
"status_code_distribution": {
"200": [count],
"404": [count],
...
},
"processing_start": "[ISO-8601 UTC]",
"processing_end": "[ISO-8601 UTC]"
}
File 3: errors.log - One line per error:
[ISO-8601 UTC] [URL]: [error message]
Requirements:
- Use only urllib.request for HTTP requests (no requests library)
- Use only standard library modules: sys, json, time, datetime, os, re
- For word counting, consider a word as any sequence of alphanumeric characters
- If a request fails (connection error, timeout, etc.), record the error and continue
- Set a timeout of 10 seconds for each request
- If response Content-Type header contains “text”, perform word count; otherwise set to null
- All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
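A minimal sketch of the per-URL logic, assuming it is acceptable to treat 4xx/5xx responses as responses with a status code rather than as errors (the function and key names are illustrative, not required):
import re
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

def iso_utc():
    # ISO-8601 UTC with 'Z' suffix, e.g. 2026-02-03T17:05:09Z
    return datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

def fetch_one(url):
    """Fetch a single URL and return one entry for responses.json."""
    record = {"url": url, "status_code": None, "response_time_ms": None,
              "content_length": None, "word_count": None,
              "timestamp": iso_utc(), "error": None}
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body, status, ctype = resp.read(), resp.status, resp.headers.get("Content-Type", "")
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses still carry a status code and a body
        body, status, ctype = e.read(), e.code, e.headers.get("Content-Type", "")
    except Exception as e:
        # Connection errors, timeouts, DNS failures: record the error and continue
        record["error"] = str(e)
        record["response_time_ms"] = (time.time() - start) * 1000.0
        return record
    record["response_time_ms"] = (time.time() - start) * 1000.0
    record["status_code"] = status
    record["content_length"] = len(body)
    if "text" in ctype:
        # A word is any run of alphanumeric characters
        record["word_count"] = len(re.findall(r"[A-Za-z0-9]+",
                                              body.decode("utf-8", errors="replace")))
    return record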
Part B: Dockerfile
Create a Dockerfile that packages your Python application.
FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]The Dockerfile must:
- Use python:3.11-slim as the base image (no other base image permitted)
- Set working directory to /app
- Copy your script to the container
- Create input and output directories at /data/input and /data/output
- Use ENTRYPOINT for the Python interpreter and script
- Use CMD for default arguments (can be overridden at runtime)
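Because ENTRYPOINT fixes the interpreter and script while CMD only supplies default arguments, the arguments can be overridden at run time without rebuilding the image, for example (the alternate filename is purely illustrative):
docker run --rm http-fetcher:latest /data/input/other_urls.txt /data/output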
Part C: Building and Running
Build your container image:
docker build -t http-fetcher:latest .
Run your container with input/output volume mounts:
docker run --rm \
-v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro \
-v "$(pwd)/output":/data/output \
http-fetcher:latest
The volume mounts connect your host filesystem to the container:
-v "$(pwd)/test_urls.txt":/data/input/urls.txt:romounts your input file read-only-v "$(pwd)/output":/data/outputmounts the output directory for results--rmremoves the container after execution
On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).
Part D: Testing
Create test_urls.txt with the following URLs:
http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com
Your application must handle all these cases correctly:
- Successful responses (200)
- Delayed responses (testing timeout behavior)
- Client errors (404)
- Server errors (500)
- JSON responses (Content-Type: application/json)
- HTML responses (Content-Type: text/html)
- Invalid URLs / DNS failures
Validation
We will validate your submission by running the following commands from your q1/ directory:
docker build -t http-fetcher:latest .
docker run --rm \
-v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro \
-v "$(pwd)/output":/data/output \
http-fetcher:latest
These commands must complete without errors. We will then verify:
- output/responses.json, output/summary.json, and output/errors.log exist
- JSON structure and content match the specification
- Correct behavior with different URL lists
Your container must not require network configuration beyond Docker defaults.
Deliverables
See Submission.
Problem 2: ArXiv Paper Metadata Processor
Build a containerized application that fetches paper metadata from the ArXiv API, processes it, and generates structured output.
Part A: ArXiv API Client
Create a file arxiv_processor.py that queries the ArXiv API and extracts paper metadata.
Your script must accept exactly three command line arguments:
- Search query string (e.g., “cat:cs.LG” for machine learning papers)
- Maximum number of results to fetch (integer between 1 and 100)
- Path to output directory
Your script must perform the following operations:
- Query the ArXiv API using the search query
- Fetch up to the specified maximum number of results
- Extract and process metadata for each paper
- Generate text analysis statistics
- Write structured output files
ArXiv API endpoint: http://export.arxiv.org/api/query
Query parameters:
- search_query: Your search string
- start: Starting index (0-based)
- max_results: Maximum results to return
Example API call:
http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=10
The API returns XML (parsing guide). You must parse this XML to extract:
- Paper ID (from the <id> tag, extract just the ID portion after the last '/')
- Title (from <title>)
- Authors (from all <author><name> tags)
- Abstract (from <summary>)
- Categories (from all <category> tags' term attribute)
- Published date (from <published>)
- Updated date (from <updated>)
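The arXiv API returns an Atom feed, so every element above lives in the http://www.w3.org/2005/Atom namespace. A minimal parsing sketch for these fields (the function name and dictionary keys are illustrative):
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"   # namespace prefix for every Atom element

def fetch_entries(query, max_results):
    """Query the ArXiv API and return one dict per <entry>."""
    params = urllib.parse.urlencode({"search_query": query,
                                     "start": 0,
                                     "max_results": max_results})
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        raw_id = entry.findtext(f"{ATOM}id", default="")
        papers.append({
            "arxiv_id": raw_id.rsplit("/", 1)[-1],   # ID portion after the last '/'
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "authors": [a.findtext(f"{ATOM}name", default="")
                        for a in entry.findall(f"{ATOM}author")],
            "abstract": entry.findtext(f"{ATOM}summary", default="").strip(),
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
            "published": entry.findtext(f"{ATOM}published", default=""),
            "updated": entry.findtext(f"{ATOM}updated", default=""),
        })
    return papers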
Part B: Text Processing
For each paper’s abstract, compute the following:
Word frequency analysis:
- Total word count
- Unique word count
- Top 20 most frequent words (excluding stopwords)
- Average word length
Sentence analysis:
- Total sentence count (split on ‘.’, ‘!’, ‘?’)
- Average words per sentence
- Longest sentence (by word count)
- Shortest sentence (by word count)
Technical term extraction:
- Extract all words containing uppercase letters (e.g., “LSTM”, “GPU”)
- Extract all words containing numbers (e.g., “3D”, “ResNet50”)
- Extract all hyphenated terms (e.g., “state-of-the-art”, “pre-trained”)
Use the stopwords list provided in the starter code:
Code: stopwords.py
"""
Stopwords for text analysis.
"""
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}
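A sketch of the per-abstract analysis described in Part B, using the provided STOPWORDS set (key names, rounding, and the tokenization regex are illustrative; your output fields must match the specification in Part C):
import re
from collections import Counter
from stopwords import STOPWORDS   # provided in the starter code

def analyze_abstract(text):
    """Compute word, sentence, and technical-term statistics for one abstract."""
    words = re.findall(r"[A-Za-z0-9]+", text)                 # alphanumeric runs
    lowered = [w.lower() for w in words]
    freq = Counter(w for w in lowered if w not in STOPWORDS)  # case-insensitive counts
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words_per_sentence = [len(re.findall(r"[A-Za-z0-9]+", s)) for s in sentences]
    return {
        "total_words": len(words),
        "unique_words": len(set(lowered)),
        "top_20_words": freq.most_common(20),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "total_sentences": len(sentences),
        "avg_words_per_sentence": (sum(words_per_sentence) / len(sentences)
                                   if sentences else 0.0),
        # Technical terms keep their original case
        "uppercase_terms": sorted({w for w in words if any(c.isupper() for c in w)}),
        "numeric_terms": sorted({w for w in words if any(c.isdigit() for c in w)}),
        "hyphenated_terms": sorted(set(re.findall(r"\b\w+(?:-\w+)+\b", text))),
    }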
Part C: Output Files
Your script must write three files to the output directory:
File 1: papers.json - Array of paper metadata:
[
{
"arxiv_id": "[paper ID]",
"title": "[paper title]",
"authors": ["author1", "author2", ...],
"abstract": "[full abstract text]",
"categories": ["cat1", "cat2", ...],
"published": "[ISO-8601 UTC]",
"updated": "[ISO-8601 UTC]",
"abstract_stats": {
"total_words": [integer],
"unique_words": [integer],
"total_sentences": [integer],
"avg_words_per_sentence": [float],
"avg_word_length": [float]
}
},
...
]
File 2: corpus_analysis.json - Aggregate analysis across all papers:
{
"query": "[search query used]",
"papers_processed": [integer],
"processing_timestamp": "[ISO-8601 UTC]",
"corpus_stats": {
"total_abstracts": [integer],
"total_words": [integer],
"unique_words_global": [integer],
"avg_abstract_length": [float],
"longest_abstract_words": [integer],
"shortest_abstract_words": [integer]
},
"top_50_words": [
{"word": "[word1]", "frequency": [count], "documents": [count]},
...
],
"technical_terms": {
"uppercase_terms": ["TERM1", "TERM2", ...],
"numeric_terms": ["term1", "term2", ...],
"hyphenated_terms": ["term-1", "term-2", ...]
},
"category_distribution": {
"cs.LG": [count],
"cs.AI": [count],
...
}
}
File 3: processing.log - Processing log with one line per event:
[ISO-8601 UTC] Starting ArXiv query: [query]
[ISO-8601 UTC] Fetched [N] results from ArXiv API
[ISO-8601 UTC] Processing paper: [arxiv_id]
[ISO-8601 UTC] Completed processing: [N] papers in [X.XX] seconds
Part D: Error Handling
Your script must handle the following error conditions:
- Network errors: If the ArXiv API is unreachable, write error to log and exit with code 1
- Invalid XML: If the API returns malformed XML, log the error and continue with other papers
- Missing fields: If a paper lacks required fields, skip it and log a warning
- Rate limiting: If you receive HTTP 429, wait 3 seconds and retry (maximum 3 attempts)
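A sketch of the retry behavior required for HTTP 429 (the helper name and argument defaults are illustrative; other HTTP errors are left for the caller to log):
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=3, wait_seconds=3):
    """GET a URL, retrying on HTTP 429 up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < attempts:
                time.sleep(wait_seconds)   # wait 3 seconds, then retry
                continue
            raise                           # other errors (or the final attempt) propagate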
Requirements:
- Use only standard library modules: sys, json, urllib.request, xml.etree.ElementTree, datetime, time, re, os
- All word processing must be case-insensitive for frequency counting
- Preserve original case in the output
- Handle Unicode properly (ArXiv abstracts often contain mathematical symbols)
Part E: Dockerfile
Create a Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_processor.py stopwords.py /app/
RUN mkdir -p /data/output
ENTRYPOINT ["python", "/app/arxiv_processor.py"]Part F: Building and Running
Build your container image:
docker build -t arxiv-processor:latest .
Run your container:
docker run --rm \
-v "$(pwd)/output":/data/output \
arxiv-processor:latest \
"cat:cs.LG" 10 /data/output
The arguments are passed directly to your Python script:
"cat:cs.LG"- the search query10- maximum results to fetch/data/output- output directory inside the container
On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).
Part G: Testing
Test your container with various queries:
# Machine Learning papers
docker run --rm -v "$(pwd)/output_ml":/data/output \
arxiv-processor:latest "cat:cs.LG" 5 /data/output
# Search by author
docker run --rm -v "$(pwd)/output_author":/data/output \
arxiv-processor:latest "au:LeCun" 3 /data/output
# Search by title keyword
docker run --rm -v "$(pwd)/output_title":/data/output \
arxiv-processor:latest "ti:transformer" 10 /data/output
Validation
We will validate your submission by running the following commands from your q2/ directory:
docker build -t arxiv-processor:latest .
docker run --rm \
-v "$(pwd)/output":/data/output \
arxiv-processor:latest \
"cat:cs.LG" 10 /data/output
These commands must complete without errors. We will then verify:
- output/papers.json, output/corpus_analysis.json, and output/processing.log exist
- JSON structure and content match the specification
- Word frequencies are accurate
- Container handles network errors gracefully
Your container must respect ArXiv’s rate limits and terms of service. Do not make more than 1 request per 3 seconds to avoid being blocked.
Deliverables
See Submission.
Problem 3: Multi-Container Text Processing Pipeline with Docker Compose
Build a multi-container application that processes web content through sequential stages. Containers coordinate through a shared filesystem, demonstrating batch processing patterns used in data pipelines.
Architecture
Three containers process data in sequence:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ fetcher │────▶│ processor │────▶│ analyzer │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
/shared/ /shared/ /shared/
└── raw/ └── processed/ └── analysis/
└── status/ └── status/ └── status/
Containers communicate through filesystem markers:
- Each container monitors /shared/status/ for its input signal
- Processing stages write completion markers when finished
- Data flows through /shared/ subdirectories
Part A: Container 1 - Data Fetcher
The fetcher is provided in the starter code. Study this code to understand the coordination pattern.
Code: fetcher/fetch.py
#!/usr/bin/env python3
"""
Data Fetcher - downloads URLs and writes to shared volume.
"""
import json
import os
import sys
import time
import urllib.request
from datetime import datetime, timezone
def main():
print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher starting", flush=True)
    # Wait for input file (create /shared/input so the orchestration script can docker cp into it)
    os.makedirs("/shared/input", exist_ok=True)
    input_file = "/shared/input/urls.txt"
while not os.path.exists(input_file):
print(f"Waiting for {input_file}...", flush=True)
time.sleep(2)
# Read URLs
with open(input_file, 'r') as f:
urls = [line.strip() for line in f if line.strip()]
# Create output directory
os.makedirs("/shared/raw", exist_ok=True)
os.makedirs("/shared/status", exist_ok=True)
# Fetch each URL
results = []
for i, url in enumerate(urls, 1):
output_file = f"/shared/raw/page_{i}.html"
try:
print(f"Fetching {url}...", flush=True)
with urllib.request.urlopen(url, timeout=10) as response:
content = response.read()
with open(output_file, 'wb') as f:
f.write(content)
results.append({
"url": url,
"file": f"page_{i}.html",
"size": len(content),
"status": "success"
})
except Exception as e:
results.append({
"url": url,
"file": None,
"error": str(e),
"status": "failed"
})
time.sleep(1) # Rate limiting
# Write completion status
status = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"urls_processed": len(urls),
"successful": sum(1 for r in results if r["status"] == "success"),
"failed": sum(1 for r in results if r["status"] == "failed"),
"results": results
}
with open("/shared/status/fetch_complete.json", 'w') as f:
json.dump(status, f, indent=2)
print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher complete", flush=True)
if __name__ == "__main__":
main()
Create fetcher/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY fetch.py /app/
CMD ["python", "-u", "/app/fetch.py"]The -u flag disables output buffering to ensure real-time logging.
Part B: Container 2 - HTML Processor
Create processor/process.py that extracts and analyzes text from HTML files.
Required processing operations:
- Wait for /shared/status/fetch_complete.json
- Read all HTML files from /shared/raw/
- Extract text content using regex (not BeautifulSoup)
- Extract all links (href attributes)
- Extract all images (src attributes)
- Count words, sentences, paragraphs
- Save processed data to /shared/processed/
- Create /shared/status/process_complete.json
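The wait-then-process skeleton below mirrors the coordination pattern in fetch.py. It assumes the strip_html helper shown under "Text extraction requirements" below; the payload written to process_complete.json is illustrative, since only the marker file itself is specified:
import glob
import json
import os
import time
from datetime import datetime, timezone

def main():
    # Block until the fetcher signals completion
    while not os.path.exists("/shared/status/fetch_complete.json"):
        time.sleep(2)

    os.makedirs("/shared/processed", exist_ok=True)
    processed = []
    for path in sorted(glob.glob("/shared/raw/*.html")):
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            text, links, images = strip_html(f.read())
        out_name = os.path.basename(path).replace(".html", ".json")
        # ... compute word/sentence/paragraph statistics and write
        #     /shared/processed/<out_name> in the format specified below ...
        processed.append(out_name)

    # Signal the analyzer that this stage is done
    with open("/shared/status/process_complete.json", "w") as f:
        json.dump({"timestamp": datetime.now(timezone.utc).isoformat(),
                   "files_processed": len(processed)}, f, indent=2)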
Text extraction requirements:
def strip_html(html_content):
    """Remove HTML tags and extract text."""
    # Remove script and style elements
    html_content = re.sub(r'<script[^>]*>.*?</script>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    html_content = re.sub(r'<style[^>]*>.*?</style>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    # Extract links before removing tags
    links = re.findall(r'href=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)
    # Extract images
    images = re.findall(r'src=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', html_content)
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text, links, images
Output format for each processed file (/shared/processed/page_N.json):
{
"source_file": "page_N.html",
"text": "[extracted text]",
"statistics": {
"word_count": [integer],
"sentence_count": [integer],
"paragraph_count": [integer],
"avg_word_length": [float]
},
"links": ["url1", "url2", ...],
"images": ["src1", "src2", ...],
"processed_at": "[ISO-8601 UTC]"
}
Create processor/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY process.py /app/
CMD ["python", "-u", "/app/process.py"]Part C: Container 3 - Text Analyzer
Create analyzer/analyze.py that performs corpus-wide analysis.
Required analysis operations:
- Wait for /shared/status/process_complete.json
- Read all processed files from /shared/processed/
- Compute global statistics:
  - Word frequency distribution (top 100 words)
  - Document similarity matrix (Jaccard similarity)
  - N-gram extraction (bigrams and trigrams)
  - Readability metrics
- Save to /shared/analysis/final_report.json
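Bigram and trigram counts can come straight from an ordered token list; a small sketch (the tokenization shown is illustrative and should match whatever you use elsewhere):
import re
from collections import Counter

def extract_ngrams(tokens, n):
    """Count n-grams (n=2 for bigrams, n=3 for trigrams) from an ordered token list."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Example with illustrative text:
tokens = re.findall(r"[a-z0-9]+", "machine learning models for machine learning".lower())
print(extract_ngrams(tokens, 2).most_common(2))
# [('machine learning', 2), ('learning models', 1)]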
Similarity calculation:
def jaccard_similarity(doc1_words, doc2_words):
    """Calculate Jaccard similarity between two documents."""
    set1 = set(doc1_words)
    set2 = set(doc2_words)
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union) if union else 0.0
Final report structure (/shared/analysis/final_report.json):
{
"processing_timestamp": "[ISO-8601 UTC]",
"documents_processed": [integer],
"total_words": [integer],
"unique_words": [integer],
"top_100_words": [
{"word": "the", "count": 523, "frequency": 0.042},
...
],
"document_similarity": [
{"doc1": "page_1.json", "doc2": "page_2.json", "similarity": 0.234},
...
],
"top_bigrams": [
{"bigram": "machine learning", "count": 45},
...
],
"readability": {
"avg_sentence_length": [float],
"avg_word_length": [float],
"complexity_score": [float]
}
}
Create analyzer/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY analyze.py /app/
CMD ["python", "-u", "/app/analyze.py"]Part D: Docker Compose Configuration
The docker-compose.yaml is provided in the starter code:
Code: docker-compose.yaml
version: '3.8'
services:
fetcher:
build: ./fetcher
container_name: pipeline-fetcher
volumes:
- pipeline-data:/shared
environment:
- PYTHONUNBUFFERED=1
processor:
build: ./processor
container_name: pipeline-processor
volumes:
- pipeline-data:/shared
environment:
- PYTHONUNBUFFERED=1
depends_on:
- fetcher
analyzer:
build: ./analyzer
container_name: pipeline-analyzer
volumes:
- pipeline-data:/shared
environment:
- PYTHONUNBUFFERED=1
depends_on:
- processor
volumes:
pipeline-data:
name: pipeline-shared-data
Note: depends_on ensures start order but does not wait for container completion. Your Python scripts must implement proper waiting logic.
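Because depends_on only controls start order, the processor and analyzer each need their own wait loop. A small helper you could share across process.py and analyze.py (the poll interval and timeout values are illustrative):
import os
import sys
import time

def wait_for_marker(path, poll_seconds=2, timeout_seconds=600):
    """Poll until a status marker written by the previous stage appears."""
    waited = 0
    while not os.path.exists(path):
        if waited >= timeout_seconds:
            print(f"Timed out waiting for {path}", flush=True)
            sys.exit(1)
        time.sleep(poll_seconds)
        waited += poll_seconds

# e.g. wait_for_marker("/shared/status/fetch_complete.json")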
Part E: Orchestration Script
The run_pipeline.sh orchestration script is provided in the starter code. It handles building containers, starting the pipeline, injecting URLs, monitoring for completion, and extracting results.
Code: run_pipeline.sh
#!/bin/bash
#
# Pipeline orchestration script.
#
if [ $# -lt 1 ]; then
echo "Usage: $0 <url1> [url2] [url3] ..."
echo "Example: $0 https://example.com https://wikipedia.org"
exit 1
fi
echo "Starting Multi-Container Pipeline"
echo "================================="
# Clean previous runs
docker-compose down -v 2>/dev/null
# Create temporary directory
TEMP_DIR=$(mktemp -d)
trap "rm -rf $TEMP_DIR" EXIT
# Create URL list
for url in "$@"; do
echo "$url" >> "$TEMP_DIR/urls.txt"
done
echo "URLs to process:"
cat "$TEMP_DIR/urls.txt"
echo ""
# Build containers
echo "Building containers..."
docker-compose build --quiet
# Start pipeline
echo "Starting pipeline..."
docker-compose up -d
# Wait for containers to initialize
sleep 3
# Inject URLs
echo "Injecting URLs..."
docker cp "$TEMP_DIR/urls.txt" pipeline-fetcher:/shared/input/urls.txt
# Monitor completion
echo "Processing..."
MAX_WAIT=300 # 5 minutes timeout
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
if docker exec pipeline-analyzer test -f /shared/analysis/final_report.json 2>/dev/null; then
echo "Pipeline complete"
break
fi
sleep 5
ELAPSED=$((ELAPSED + 5))
done
if [ $ELAPSED -ge $MAX_WAIT ]; then
echo "Pipeline timeout after ${MAX_WAIT} seconds"
docker-compose logs
docker-compose down
exit 1
fi
# Extract results
mkdir -p output
docker cp pipeline-analyzer:/shared/analysis/final_report.json output/
docker cp pipeline-analyzer:/shared/status output/
# Cleanup
docker-compose down
# Display summary
if [ -f "output/final_report.json" ]; then
echo ""
echo "Results saved to output/final_report.json"
python3 -m json.tool output/final_report.json | head -20
else
echo "Pipeline failed - no output generated"
exit 1
fi
On Windows, you will need to run this script using WSL2, Git Bash, or translate the commands to PowerShell.
Part F: Testing
Create test_urls.txt:
https://www.example.com
https://www.wikipedia.org
https://httpbin.org/html
Test your pipeline:
# Single URL
./run_pipeline.sh https://www.example.com
# Multiple URLs
./run_pipeline.sh https://www.example.com https://www.wikipedia.org https://httpbin.org/html
Debugging
To diagnose pipeline issues:
View container logs:
docker-compose logs fetcher
docker-compose logs processor
docker-compose logs analyzer
Inspect shared volume:
docker run --rm -v pipeline-shared-data:/shared alpine ls -la /shared/
Check container status:
docker-compose ps
Enter running container:
docker exec -it pipeline-fetcher /bin/bash
Validation
We will validate your submission by running the following commands from your q3/ directory:
docker-compose build
./run_pipeline.sh https://www.example.com https://httpbin.org/html
These commands must complete without errors. We will then verify:
- Status files appear in correct sequence (fetch_complete.json, process_complete.json)
- output/final_report.json exists and matches the specification
- Containers properly wait for dependencies before processing
- Pipeline handles URL fetch failures gracefully
Deliverables
See Submission.
Submission
Your submission must follow this directory structure:
README.md
q1/
├── fetch_and_process.py
├── Dockerfile
└── test_urls.txt
q2/
├── arxiv_processor.py
├── stopwords.py
└── Dockerfile
q3/
├── docker-compose.yaml
├── test_urls.txt
├── fetcher/
│ ├── Dockerfile
│ └── fetch.py
├── processor/
│ ├── Dockerfile
│ └── process.py
└── analyzer/
├── Dockerfile
└── analyze.py