Homework #1: Docker, ArXiv API, Multi-Container Pipelines
EE 547: Fall 2025
Assigned: 02 September
Due: Monday, 15 September at 23:59
Submission: Gradescope via GitHub repository
- Docker Desktop must be installed and running on your machine
- Use only Python standard library modules unless explicitly permitted
- All shell scripts must be executable (chmod +x)
Overview
This assignment introduces containerization using Docker. You will build and run containers, manage data persistence through volumes, and create multi-container applications using Docker Compose.
Problem 1: Docker Basics – HTTP Data Fetcher
Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.
Part A: Python HTTP Fetcher
Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.
Your script must accept exactly two command line arguments:
- Path to an input file containing URLs (one per line)
- Path to output directory
For each URL in the input file, your script must:
- Perform an HTTP GET request to the URL
- Measure the response time in milliseconds
- Capture the HTTP status code
- Calculate the size of the response body in bytes
- Count the number of words in the response (for text responses only)
Your script must write three files to the output directory:
File 1: responses.json - Array of response data:
[
  {
    "url": "[URL string]",
    "status_code": [integer],
    "response_time_ms": [float],
    "content_length": [integer],
    "word_count": [integer or null],
    "timestamp": "[ISO-8601 UTC]",
    "error": [null or error message string]
  },
  ...
]

File 2: summary.json - Aggregate statistics:
{
  "total_urls": [integer],
  "successful_requests": [integer],
  "failed_requests": [integer],
  "average_response_time_ms": [float],
  "total_bytes_downloaded": [integer],
  "status_code_distribution": {
    "200": [count],
    "404": [count],
    ...
  },
  "processing_start": "[ISO-8601 UTC]",
  "processing_end": "[ISO-8601 UTC]"
}

File 3: errors.log - One line per error:
[ISO-8601 UTC] [URL]: [error message]
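A sketch of one way to assemble the summary.json fields from per-URL records shaped like the responses.json entries above. Here a request counts as successful if it returned any HTTP status code; if your interpretation differs, adjust accordingly. The function name and record layout are illustrative, not required:

def summarize(records, start_iso, end_iso):
    # records: list of dicts shaped like the responses.json entries
    ok = [r for r in records if r["status_code"] is not None]
    dist = {}
    for r in ok:
        key = str(r["status_code"])
        dist[key] = dist.get(key, 0) + 1
    times = [r["response_time_ms"] for r in ok]
    return {
        "total_urls": len(records),
        "successful_requests": len(ok),
        "failed_requests": len(records) - len(ok),
        "average_response_time_ms": sum(times) / len(times) if times else 0.0,
        "total_bytes_downloaded": sum(r["content_length"] or 0 for r in ok),
        "status_code_distribution": dist,
        "processing_start": start_iso,
        "processing_end": end_iso,
    }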
Requirements:
- Use only urllib.request for HTTP requests (no requests library)
- Use only standard library modules: sys, json, time, datetime, os, re
- For word counting, consider a word to be any sequence of alphanumeric characters (see the sketch after this list)
- If a request fails (connection error, timeout, etc.), record the error and continue
- Set a timeout of 10 seconds for each request
- If response Content-Type header contains “text”, perform word count; otherwise set to null
- All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
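A minimal sketch of the per-URL fetch logic under these requirements. The helper name fetch_one and the record layout are illustrative; handling 4xx/5xx codes via urllib.error (part of the standard urllib package) is one approach, since urlopen raises HTTPError for those responses:

import re
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

def iso_now():
    # ISO-8601 UTC with 'Z' suffix, e.g. 2025-09-15T23:59:00Z
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def fetch_one(url):
    record = {"url": url, "status_code": None, "response_time_ms": None,
              "content_length": None, "word_count": None,
              "timestamp": iso_now(), "error": None}
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
            record["status_code"] = resp.status
            content_type = resp.headers.get("Content-Type") or ""
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses still carry a status code and a body
        body = e.read()
        record["status_code"] = e.code
        content_type = e.headers.get("Content-Type") or ""
    except Exception as e:
        record["error"] = str(e)
        record["response_time_ms"] = (time.time() - start) * 1000.0
        return record
    record["response_time_ms"] = (time.time() - start) * 1000.0
    record["content_length"] = len(body)
    if "text" in content_type:
        # A word is any run of alphanumeric characters
        record["word_count"] = len(re.findall(r"[A-Za-z0-9]+", body.decode("utf-8", "replace")))
    return record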
Part B: Dockerfile
Create a Dockerfile that packages your Python application.
FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]The Dockerfile must:
- Use python:3.11-slim as the base image (no other base image permitted)
- Set the working directory to /app
- Copy your script to the container
- Create input and output directories at /data/input and /data/output
- Use ENTRYPOINT for the Python interpreter and script
- Use CMD for default arguments (can be overridden at runtime)
Part C: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t http-fetcher:latest .

Create run.sh:
#!/bin/bash
# Check arguments
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_file> <output_directory>"
    exit 1
fi
INPUT_FILE="$1"
OUTPUT_DIR="$2"
# Check if input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE does not exist"
    exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
# Run container
docker run --rm \
    --name http-fetcher \
    -v "$(realpath "$INPUT_FILE")":/data/input/urls.txt:ro \
    -v "$(realpath "$OUTPUT_DIR")":/data/output \
    http-fetcher:latest

Your run.sh script must:
- Accept exactly 2 arguments: input file path and output directory path
- Verify the input file exists before running the container
- Create the output directory if it doesn't exist
- Mount the input file read-only at /data/input/urls.txt
- Mount the output directory at /data/output
- Use --rm to remove the container after execution
- Use --name http-fetcher for the container name
- Use realpath to convert relative paths to absolute paths
Part D: Testing
Create test_urls.txt with the following URLs:
http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com
Your application must handle all these cases correctly:
- Successful responses (200)
- Delayed responses (testing timeout behavior)
- Client errors (404)
- Server errors (500)
- JSON responses (Content-Type: application/json)
- HTML responses (Content-Type: text/html)
- Invalid URLs / DNS failures
Deliverables
Your problem1/ directory must contain exactly:
problem1/
├── fetch_and_process.py
├── Dockerfile
├── build.sh
├── run.sh
└── test_urls.txt
All shell scripts must be executable (chmod +x *.sh).
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh test_urls.txt output/ - must complete without errors
- Checking that output/responses.json, output/summary.json, and output/errors.log exist
- Validating JSON structure and content
- Running with different URL lists to verify correctness
Your container must not require network configuration beyond Docker defaults. Your container must not run as root user (the python:3.11-slim image already handles this correctly).
Problem 2: ArXiv Paper Metadata Processor
Build a containerized application that fetches paper metadata from the ArXiv API, processes it, and generates structured output.
Part A: ArXiv API Client
Create a file arxiv_processor.py that queries the ArXiv API and extracts paper metadata.
Your script must accept exactly three command line arguments:
- Search query string (e.g., “cat:cs.LG” for machine learning papers)
- Maximum number of results to fetch (integer between 1 and 100)
- Path to output directory
Your script must perform the following operations:
- Query the ArXiv API using the search query
- Fetch up to the specified maximum number of results
- Extract and process metadata for each paper
- Generate text analysis statistics
- Write structured output files
ArXiv API endpoint: http://export.arxiv.org/api/query
Query parameters:
- search_query: Your search string
- start: Starting index (0-based)
- max_results: Maximum results to return
Example API call:
http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=10
The API returns XML (parsing guide). You must parse this XML to extract:
- Paper ID (from the <id> tag; extract just the ID portion after the last '/')
- Title (from <title>)
- Authors (from all <author><name> tags)
- Abstract (from <summary>)
- Categories (from the term attribute of all <category> tags)
- Published date (from <published>)
- Updated date (from <updated>)
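A minimal sketch of the query and parsing steps. It assumes the Atom namespace used by the ArXiv feed and uses urllib.parse (part of the standard urllib package) for URL encoding; the function and variable names are illustrative, not required:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the ArXiv feed

def fetch_entries(query, max_results):
    params = urllib.parse.urlencode({"search_query": query, "start": 0,
                                     "max_results": max_results})
    url = "http://export.arxiv.org/api/query?" + params
    with urllib.request.urlopen(url, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    papers = []
    for entry in root.findall(ATOM + "entry"):
        raw_id = entry.findtext(ATOM + "id", default="")
        papers.append({
            "arxiv_id": raw_id.rsplit("/", 1)[-1],   # portion after the last '/'
            "title": (entry.findtext(ATOM + "title") or "").strip(),
            "authors": [a.findtext(ATOM + "name") for a in entry.findall(ATOM + "author")],
            "abstract": (entry.findtext(ATOM + "summary") or "").strip(),
            "categories": [c.get("term") for c in entry.findall(ATOM + "category")],
            "published": entry.findtext(ATOM + "published"),
            "updated": entry.findtext(ATOM + "updated"),
        })
    return papers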
Part B: Text Processing
For each paper’s abstract, compute the following:
Word frequency analysis:
- Total word count
- Unique word count
- Top 20 most frequent words (excluding stopwords)
- Average word length
Sentence analysis:
- Total sentence count (split on ‘.’, ‘!’, ‘?’)
- Average words per sentence
- Longest sentence (by word count)
- Shortest sentence (by word count)
Technical term extraction:
- Extract all words containing uppercase letters (e.g., “LSTM”, “GPU”)
- Extract all words containing numbers (e.g., “3D”, “ResNet50”)
- Extract all hyphenated terms (e.g., “state-of-the-art”, “pre-trained”)
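A sketch of regex-based tokenization covering the frequency and technical-term requirements above. The token pattern is one reasonable interpretation (it keeps internal hyphens so terms like "pre-trained" survive as single words), and the stopwords parameter is the set defined just below; names and return layout are illustrative:

import re

def analyze_abstract(abstract, stopwords):
    # Tokens keep internal hyphens so hyphenated terms stay intact
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", abstract)
    freq = {}
    for t in tokens:
        w = t.lower()                      # frequency counting is case-insensitive
        if w not in stopwords:
            freq[w] = freq.get(w, 0) + 1
    top_20 = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:20]
    return {
        "total_words": len(tokens),
        "unique_words": len({t.lower() for t in tokens}),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        "top_20_words": top_20,
        "uppercase_terms": sorted({t for t in tokens if any(c.isupper() for c in t)}),  # original case preserved
        "numeric_terms": sorted({t for t in tokens if any(c.isdigit() for c in t)}),
        "hyphenated_terms": sorted({t for t in tokens if "-" in t}),
    }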
Use the following stopwords list:
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
             'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
             'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
             'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
             'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
             'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
             'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}

Part C: Output Files
Your script must write three files to the output directory:
File 1: papers.json - Array of paper metadata:
[
  {
    "arxiv_id": "[paper ID]",
    "title": "[paper title]",
    "authors": ["author1", "author2", ...],
    "abstract": "[full abstract text]",
    "categories": ["cat1", "cat2", ...],
    "published": "[ISO-8601 UTC]",
    "updated": "[ISO-8601 UTC]",
    "abstract_stats": {
      "total_words": [integer],
      "unique_words": [integer],
      "total_sentences": [integer],
      "avg_words_per_sentence": [float],
      "avg_word_length": [float]
    }
  },
  ...
]

File 2: corpus_analysis.json - Aggregate analysis across all papers:
{
  "query": "[search query used]",
  "papers_processed": [integer],
  "processing_timestamp": "[ISO-8601 UTC]",
  "corpus_stats": {
    "total_abstracts": [integer],
    "total_words": [integer],
    "unique_words_global": [integer],
    "avg_abstract_length": [float],
    "longest_abstract_words": [integer],
    "shortest_abstract_words": [integer]
  },
  "top_50_words": [
    {"word": "[word1]", "frequency": [count], "documents": [count]},
    ...
  ],
  "technical_terms": {
    "uppercase_terms": ["TERM1", "TERM2", ...],
    "numeric_terms": ["term1", "term2", ...],
    "hyphenated_terms": ["term-1", "term-2", ...]
  },
  "category_distribution": {
    "cs.LG": [count],
    "cs.AI": [count],
    ...
  }
}

File 3: processing.log - Processing log with one line per event:
[ISO-8601 UTC] Starting ArXiv query: [query]
[ISO-8601 UTC] Fetched [N] results from ArXiv API
[ISO-8601 UTC] Processing paper: [arxiv_id]
[ISO-8601 UTC] Completed processing: [N] papers in [X.XX] seconds
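For the top_50_words entries, frequency is the total count across all abstracts and documents is the number of abstracts containing the word. A sketch of one way to compute both, assuming you already have one list of lowercased tokens per abstract; whether to exclude stopwords here mirrors the Part B convention:

def top_50_words(token_lists, stopwords):
    # token_lists: one list of lowercased tokens per abstract
    freq, doc_freq = {}, {}
    for tokens in token_lists:
        for w in tokens:
            if w not in stopwords:
                freq[w] = freq.get(w, 0) + 1
        for w in set(tokens):
            if w not in stopwords:
                doc_freq[w] = doc_freq.get(w, 0) + 1
    ranked = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:50]
    return [{"word": w, "frequency": n, "documents": doc_freq[w]} for w, n in ranked]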
Part D: Error Handling
Your script must handle the following error conditions:
- Network errors: If the ArXiv API is unreachable, write error to log and exit with code 1
- Invalid XML: If the API returns malformed XML, log the error and continue with other papers
- Missing fields: If a paper lacks required fields, skip it and log a warning
- Rate limiting: If you receive HTTP 429, wait 3 seconds and retry (maximum 3 attempts)
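A sketch of the retry behavior for rate limiting; the helper name is illustrative, and the 3-second wait and 3-attempt cap follow the requirement above:

import time
import urllib.error
import urllib.request

def request_with_retry(url, max_attempts=3, wait_seconds=3):
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < max_attempts:
                time.sleep(wait_seconds)   # back off, then retry
                continue
            raise                          # other HTTP errors propagate to the caller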
Requirements:
- Use only standard library modules: sys, json, urllib.request, xml.etree.ElementTree, datetime, time, re, os
- All word processing must be case-insensitive for frequency counting
- Preserve original case in the output
- Handle Unicode properly (ArXiv abstracts often contain mathematical symbols)
Part E: Dockerfile
Create a Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_processor.py /app/
RUN mkdir -p /data/output
ENTRYPOINT ["python", "/app/arxiv_processor.py"]Part F: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t arxiv-processor:latest .

Create run.sh:
#!/bin/bash
# Check arguments
if [ $# -ne 3 ]; then
    echo "Usage: $0 <query> <max_results> <output_directory>"
    echo "Example: $0 'cat:cs.LG' 10 output/"
    exit 1
fi
QUERY="$1"
MAX_RESULTS="$2"
OUTPUT_DIR="$3"
# Validate max_results is a number
if ! [[ "$MAX_RESULTS" =~ ^[0-9]+$ ]]; then
    echo "Error: max_results must be a positive integer"
    exit 1
fi
# Check max_results is in valid range
if [ "$MAX_RESULTS" -lt 1 ] || [ "$MAX_RESULTS" -gt 100 ]; then
    echo "Error: max_results must be between 1 and 100"
    exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
# Run container
docker run --rm \
    --name arxiv-processor \
    -v "$(realpath "$OUTPUT_DIR")":/data/output \
    arxiv-processor:latest \
    "$QUERY" "$MAX_RESULTS" "/data/output"

Part G: Testing
Create test.sh:
#!/bin/bash
# Test 1: Machine Learning papers
./run.sh "cat:cs.LG" 5 output_ml/
# Test 2: Search by author
./run.sh "au:LeCun" 3 output_author/
# Test 3: Search by title keyword
./run.sh "ti:transformer" 10 output_title/
# Test 4: Complex query (ML papers about transformers from 2023)
./run.sh "cat:cs.LG AND ti:transformer AND submittedDate:[202301010000 TO 202312312359]" 5 output_complex/
echo "Test completed. Check output directories for results."Deliverables
Your problem2/ directory must contain exactly:
problem2/
├── arxiv_processor.py
├── Dockerfile
├── build.sh
├── run.sh
└── test.sh
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh "cat:cs.LG" 10 output/ - must fetch 10 ML papers
- Verifying all three output files exist and contain valid JSON
- Checking that word frequencies are accurate
- Testing with various queries to ensure robustness
- Verifying the container handles network errors gracefully
Your container must respect ArXiv’s rate limits and terms of service. Do not make more than 1 request per 3 seconds to avoid being blocked.
Problem 3: Multi-Container Text Processing Pipeline with Docker Compose
Build a multi-container application that processes web content through sequential stages. Containers coordinate through a shared filesystem, demonstrating batch processing patterns used in data pipelines.
Architecture
Three containers process data in sequence:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ fetcher │────▶│ processor │────▶│ analyzer │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
/shared/ /shared/ /shared/
└── raw/ └── processed/ └── analysis/
└── status/ └── status/ └── status/
Containers communicate through filesystem markers:
- Each container monitors /shared/status/ for its input signal
- Processing stages write completion markers when finished
- Data flows through /shared/ subdirectories
Part A: Container 1 - Data Fetcher
Create fetcher/fetch.py:
#!/usr/bin/env python3
import json
import os
import sys
import time
import urllib.request
from datetime import datetime, timezone

def main():
    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher starting", flush=True)
    # Wait for input file
    input_file = "/shared/input/urls.txt"
    # Ensure the input directory exists so run_pipeline.sh can docker cp urls.txt into it
    os.makedirs("/shared/input", exist_ok=True)
    while not os.path.exists(input_file):
        print(f"Waiting for {input_file}...", flush=True)
        time.sleep(2)
    # Read URLs
    with open(input_file, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]
    # Create output directories
    os.makedirs("/shared/raw", exist_ok=True)
    os.makedirs("/shared/status", exist_ok=True)
    # Fetch each URL
    results = []
    for i, url in enumerate(urls, 1):
        output_file = f"/shared/raw/page_{i}.html"
        try:
            print(f"Fetching {url}...", flush=True)
            with urllib.request.urlopen(url, timeout=10) as response:
                content = response.read()
            with open(output_file, 'wb') as f:
                f.write(content)
            results.append({
                "url": url,
                "file": f"page_{i}.html",
                "size": len(content),
                "status": "success"
            })
        except Exception as e:
            results.append({
                "url": url,
                "file": None,
                "error": str(e),
                "status": "failed"
            })
        time.sleep(1)  # Rate limiting
    # Write completion status
    status = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "urls_processed": len(urls),
        "successful": sum(1 for r in results if r["status"] == "success"),
        "failed": sum(1 for r in results if r["status"] == "failed"),
        "results": results
    }
    with open("/shared/status/fetch_complete.json", 'w') as f:
        json.dump(status, f, indent=2)
    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher complete", flush=True)

if __name__ == "__main__":
    main()

Create fetcher/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY fetch.py /app/
CMD ["python", "-u", "/app/fetch.py"]The -u flag disables output buffering to ensure real-time logging.
Part B: Container 2 - HTML Processor
Create processor/process.py that extracts and analyzes text from HTML files.
Required processing operations:
- Wait for /shared/status/fetch_complete.json (see the polling sketch after this list)
- Read all HTML files from /shared/raw/
- Extract text content using regex (not BeautifulSoup)
- Extract all links (href attributes)
- Extract all images (src attributes)
- Count words, sentences, paragraphs
- Save processed data to /shared/processed/
- Create /shared/status/process_complete.json
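A sketch of the polling step referenced in the first item above; the poll interval is illustrative:

import os
import time

def wait_for_marker(path, poll_seconds=2):
    # Block until an upstream stage writes its completion marker
    while not os.path.exists(path):
        print(f"Waiting for {path}...", flush=True)
        time.sleep(poll_seconds)

# Example: block until the fetcher is done, then read its status file
# wait_for_marker("/shared/status/fetch_complete.json")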
Text extraction requirements:
def strip_html(html_content):
    """Remove HTML tags and extract text."""
    # Remove script and style elements
    html_content = re.sub(r'<script[^>]*>.*?</script>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    html_content = re.sub(r'<style[^>]*>.*?</style>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    # Extract links before removing tags
    links = re.findall(r'href=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)
    # Extract images
    images = re.findall(r'src=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', html_content)
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text, links, images

Output format for each processed file (/shared/processed/page_N.json):
{
  "source_file": "page_N.html",
  "text": "[extracted text]",
  "statistics": {
    "word_count": [integer],
    "sentence_count": [integer],
    "paragraph_count": [integer],
    "avg_word_length": [float]
  },
  "links": ["url1", "url2", ...],
  "images": ["src1", "src2", ...],
  "processed_at": "[ISO-8601 UTC]"
}

Create processor/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY process.py /app/
CMD ["python", "-u", "/app/process.py"]Part C: Container 3 - Text Analyzer
Create analyzer/analyze.py that performs corpus-wide analysis.
Required analysis operations:
- Wait for /shared/status/process_complete.json
- Read all processed files from /shared/processed/
- Compute global statistics:
  - Word frequency distribution (top 100 words)
  - Document similarity matrix (Jaccard similarity)
  - N-gram extraction (bigrams and trigrams; see the sketch after this list)
  - Readability metrics
- Save to /shared/analysis/final_report.json
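A sketch of the n-gram extraction referenced above, assuming each document's text has already been lowercased and split into word tokens; the helper names and top-k cutoff are illustrative:

def ngram_counts(words, n):
    # Count contiguous n-grams (e.g. n=2 for bigrams, n=3 for trigrams)
    counts = {}
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts

def top_ngrams(words, n, k=10):
    counts = ngram_counts(words, n)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    key = "bigram" if n == 2 else "trigram"
    return [{key: g, "count": c} for g, c in ranked]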
Similarity calculation:
def jaccard_similarity(doc1_words, doc2_words):
    """Calculate Jaccard similarity between two documents."""
    set1 = set(doc1_words)
    set2 = set(doc2_words)
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union) if union else 0.0

Final report structure (/shared/analysis/final_report.json):
{
  "processing_timestamp": "[ISO-8601 UTC]",
  "documents_processed": [integer],
  "total_words": [integer],
  "unique_words": [integer],
  "top_100_words": [
    {"word": "the", "count": 523, "frequency": 0.042},
    ...
  ],
  "document_similarity": [
    {"doc1": "page_1.json", "doc2": "page_2.json", "similarity": 0.234},
    ...
  ],
  "top_bigrams": [
    {"bigram": "machine learning", "count": 45},
    ...
  ],
  "readability": {
    "avg_sentence_length": [float],
    "avg_word_length": [float],
    "complexity_score": [float]
  }
}

Create analyzer/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY analyze.py /app/
CMD ["python", "-u", "/app/analyze.py"]Part D: Docker Compose Configuration
Create docker-compose.yaml:
version: '3.8'
services:
  fetcher:
    build: ./fetcher
    container_name: pipeline-fetcher
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
  processor:
    build: ./processor
    container_name: pipeline-processor
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - fetcher
  analyzer:
    build: ./analyzer
    container_name: pipeline-analyzer
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - processor
volumes:
  pipeline-data:
    name: pipeline-shared-data

Note: depends_on ensures start order but does not wait for container completion. Your Python scripts must implement proper waiting logic.
Part E: Orchestration Script
Create run_pipeline.sh that manages the complete pipeline execution:
#!/bin/bash
if [ $# -lt 1 ]; then
    echo "Usage: $0 <url1> [url2] [url3] ..."
    echo "Example: $0 https://example.com https://wikipedia.org"
    exit 1
fi
echo "Starting Multi-Container Pipeline"
echo "================================="
# Clean previous runs
docker-compose down -v 2>/dev/null
# Create temporary directory
TEMP_DIR=$(mktemp -d)
trap "rm -rf $TEMP_DIR" EXIT
# Create URL list
for url in "$@"; do
    echo "$url" >> "$TEMP_DIR/urls.txt"
done
echo "URLs to process:"
cat "$TEMP_DIR/urls.txt"
echo ""
# Build containers
echo "Building containers..."
docker-compose build --quiet
# Start pipeline
echo "Starting pipeline..."
docker-compose up -d
# Wait for containers to initialize
sleep 3
# Inject URLs
echo "Injecting URLs..."
docker cp "$TEMP_DIR/urls.txt" pipeline-fetcher:/shared/input/urls.txt
# Monitor completion
echo "Processing..."
MAX_WAIT=300 # 5 minutes timeout
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
    if docker exec pipeline-analyzer test -f /shared/analysis/final_report.json 2>/dev/null; then
        echo "Pipeline complete"
        break
    fi
    sleep 5
    ELAPSED=$((ELAPSED + 5))
done
if [ $ELAPSED -ge $MAX_WAIT ]; then
    echo "Pipeline timeout after ${MAX_WAIT} seconds"
    docker-compose logs
    docker-compose down
    exit 1
fi
# Extract results
mkdir -p output
docker cp pipeline-analyzer:/shared/analysis/final_report.json output/
docker cp pipeline-analyzer:/shared/status output/
# Cleanup
docker-compose down
# Display summary
if [ -f "output/final_report.json" ]; then
echo ""
echo "Results saved to output/final_report.json"
python3 -m json.tool output/final_report.json | head -20
else
echo "Pipeline failed - no output generated"
exit 1
fiPart F: Testing
Create test_urls.txt:
https://www.example.com
https://www.wikipedia.org
https://httpbin.org/html
Create test.sh:
#!/bin/bash
echo "Test 1: Single URL"
./run_pipeline.sh https://www.example.com
echo ""
echo "Test 2: Multiple URLs from file"
./run_pipeline.sh $(cat test_urls.txt)
echo ""
echo "Test 3: Verify output structure"
python3 -c "
import json
with open('output/final_report.json') as f:
    data = json.load(f)
assert 'documents_processed' in data
assert 'top_100_words' in data
assert 'document_similarity' in data
print('Output validation passed')
"

Deliverables
Your problem3/ directory structure:
problem3/
├── docker-compose.yaml
├── run_pipeline.sh
├── test.sh
├── test_urls.txt
├── fetcher/
│ ├── Dockerfile
│ └── fetch.py
├── processor/
│ ├── Dockerfile
│ └── process.py
└── analyzer/
├── Dockerfile
└── analyze.py
Debugging
To diagnose pipeline issues:
View container logs:
docker-compose logs fetcher
docker-compose logs processor
docker-compose logs analyzer

Inspect shared volume:

docker run --rm -v pipeline-shared-data:/shared alpine ls -la /shared/

Check container status:

docker-compose ps

Enter running container:

docker exec -it pipeline-fetcher /bin/bash
Validation
Your implementation will be tested by:
- Running docker-compose build - must complete without errors
- Executing ./run_pipeline.sh with various URLs
- Verifying status files appear in correct sequence
- Validating JSON output structure and content
- Checking that containers properly wait for dependencies
- Testing error handling when URLs fail to download
Submission Requirements
Your GitHub repository must follow this exact structure:
ee547-hw1-[username]/
├── problem1/
│ ├── fetch_and_process.py
│ ├── Dockerfile
│ ├── build.sh
│ ├── run.sh
│ └── test_urls.txt
├── problem2/
│ └── [files for problem 2]
├── problem3/
│ └── [files for problem 3]
└── README.md
The README.md in your repository root must contain:
- Your full name
- USC email address
- Any external libraries used beyond those specified
- Instructions to run each problem if they differ from the assignment specification
Before submitting, ensure:
1. docker build completes without errors for all Dockerfiles
2. All shell scripts are executable and run without modification
3. JSON output is valid and matches the specified format exactly
4. Your repository structure matches the requirement exactly