Homework #2: Docker, ArXiv API, Multi-Container Pipelines

EE 547: Spring 2026

Assignment Details

Assigned: 28 January
Due: Tuesday, 10 February at 23:59

Gradescope: Homework 2 | How to Submit

Requirements
  • Docker Desktop must be installed and running on your machine
  • Use only Python standard library modules unless explicitly permitted

Overview

This assignment introduces containerization using Docker. You will build and run containers, manage data persistence through volumes, and create multi-container applications using Docker Compose.

Getting Started

Download the starter code: hw2-starter.zip

unzip hw2-starter.zip
cd hw2-starter

Problem 1: Docker Basics – HTTP Data Fetcher

Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.

Part A: Python HTTP Fetcher

Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.

Your script must accept exactly two command line arguments:

  1. Path to an input file containing URLs (one per line)
  2. Path to output directory

For each URL in the input file, your script must:

  1. Perform an HTTP GET request to the URL
  2. Measure the response time in milliseconds
  3. Capture the HTTP status code
  4. Calculate the size of the response body in bytes
  5. Count the number of words in the response (for text responses only)

Your script must write three files to the output directory:

File 1: responses.json - Array of response data:

[
  {
    "url": "[URL string]",
    "status_code": [integer],
    "response_time_ms": [float],
    "content_length": [integer],
    "word_count": [integer or null],
    "timestamp": "[ISO-8601 UTC]",
    "error": [null or error message string]
  },
  ...
]

File 2: summary.json - Aggregate statistics:

{
  "total_urls": [integer],
  "successful_requests": [integer],
  "failed_requests": [integer],
  "average_response_time_ms": [float],
  "total_bytes_downloaded": [integer],
  "status_code_distribution": {
    "200": [count],
    "404": [count],
    ...
  },
  "processing_start": "[ISO-8601 UTC]",
  "processing_end": "[ISO-8601 UTC]"
}

File 3: errors.log - One line per error:

[ISO-8601 UTC] [URL]: [error message]

Requirements:

  • Use only urllib.request for HTTP requests (no requests library)
  • Use only standard library modules: sys, json, time, datetime, os, re
  • For word counting, treat a word as any maximal sequence of alphanumeric characters
  • If a request fails (connection error, timeout, etc.), record the error and continue
  • Set a timeout of 10 seconds for each request
  • If the response Content-Type header contains “text”, perform a word count; otherwise set word_count to null
  • All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
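
A minimal sketch of the happy path for one URL, covering the timing, Content-Type check, word-count rule, and timestamp format (function names are illustrative; failure cases are covered by a companion sketch under Part D):

import re
import time
import urllib.request
from datetime import datetime, timezone

def utc_now():
    # ISO-8601 UTC with a 'Z' suffix, e.g. 2026-02-10T17:03:22Z
    return datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

def fetch_one(url):
    start = time.time()
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
        status = resp.status
        content_type = resp.headers.get('Content-Type', '')
    elapsed_ms = (time.time() - start) * 1000.0

    word_count = None
    if 'text' in content_type:
        # A word is a maximal run of alphanumeric characters
        word_count = len(re.findall(r'[A-Za-z0-9]+', body.decode('utf-8', errors='replace')))

    return {
        "url": url,
        "status_code": status,
        "response_time_ms": elapsed_ms,
        "content_length": len(body),
        "word_count": word_count,
        "timestamp": utc_now(),
        "error": None,
    }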

Part B: Dockerfile

Create a Dockerfile that packages your Python application.

FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]

The Dockerfile must:

  • Use python:3.11-slim as the base image (no other base image permitted)
  • Set working directory to /app
  • Copy your script to the container
  • Create input and output directories at /data/input and /data/output
  • Use ENTRYPOINT for the Python interpreter and script
  • Use CMD for default arguments (can be overridden at runtime)

Part C: Building and Running

Build your container image:

docker build -t http-fetcher:latest .

Run your container with input/output volume mounts:

docker run --rm \
    -v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro \
    -v "$(pwd)/output":/data/output \
    http-fetcher:latest

The volume mounts connect your host filesystem to the container:

  • -v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro mounts your input file read-only
  • -v "$(pwd)/output":/data/output mounts the output directory for results
  • --rm removes the container after execution

On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).
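
Because the default arguments live in CMD, you can override them at run time by appending new arguments after the image name; ENTRYPOINT stays fixed. For example, with a hypothetical alternate URL list:

docker run --rm \
    -v "$(pwd)/other_urls.txt":/data/input/other_urls.txt:ro \
    -v "$(pwd)/output":/data/output \
    http-fetcher:latest /data/input/other_urls.txt /data/output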

Part D: Testing

Create test_urls.txt with the following URLs:

http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com

Your application must handle all these cases correctly:

  • Successful responses (200)
  • Delayed responses (testing timeout behavior)
  • Client errors (404)
  • Server errors (500)
  • JSON responses (Content-Type: application/json)
  • HTML responses (Content-Type: text/html)
  • Invalid URLs / DNS failures
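
With urllib, a 4xx/5xx response raises urllib.error.HTTPError (which still carries the status code and body), DNS failures raise urllib.error.URLError, and slow responses beyond the 10-second timeout surface as URLError or a socket timeout. One way to fold these cases into a single helper (a sketch; how you classify and word the error messages is up to you):

import socket
import urllib.error
import urllib.request

def fetch_with_errors(url):
    """Return (status_code, body, error); exactly one of body/error is meaningful."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read(), None
    except urllib.error.HTTPError as e:
        # 4xx/5xx: urllib raises, but the status code and body are still available
        return e.code, e.read(), None
    except urllib.error.URLError as e:
        # DNS failures, refused connections, and most timeouts end up here
        return None, None, str(e.reason)
    except socket.timeout:
        # Timeouts raised while reading the response body
        return None, None, "timeout"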

Validation

Grading Commands

We will validate your submission by running the following commands from your q1/ directory:

docker build -t http-fetcher:latest .
docker run --rm \
    -v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro \
    -v "$(pwd)/output":/data/output \
    http-fetcher:latest

These commands must complete without errors. We will then verify:

  • output/responses.json, output/summary.json, and output/errors.log exist
  • JSON structure and content match the specification
  • Correct behavior with different URL lists

Your container must not require network configuration beyond Docker defaults.

Deliverables

See Submission.

Problem 2: ArXiv Paper Metadata Processor

Build a containerized application that fetches paper metadata from the ArXiv API, processes it, and generates structured output.

Part A: ArXiv API Client

Create a file arxiv_processor.py that queries the ArXiv API and extracts paper metadata.

Your script must accept exactly three command line arguments:

  1. Search query string (e.g., “cat:cs.LG” for machine learning papers)
  2. Maximum number of results to fetch (integer between 1 and 100)
  3. Path to output directory

Your script must perform the following operations:

  1. Query the ArXiv API using the search query
  2. Fetch up to the specified maximum number of results
  3. Extract and process metadata for each paper
  4. Generate text analysis statistics
  5. Write structured output files

ArXiv API endpoint: http://export.arxiv.org/api/query

Query parameters:

  • search_query: Your search string
  • start: Starting index (0-based)
  • max_results: Maximum results to return

Example API call:

http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=10

The API returns XML (parsing guide). You must parse this XML to extract:

  • Paper ID (from the <id> tag, extract just the ID portion after the last ‘/’)
  • Title (from <title>)
  • Authors (from all <author><name> tags)
  • Abstract (from <summary>)
  • Categories (from all <category> tags’ term attribute)
  • Published date (from <published>)
  • Updated date (from <updated>)
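
The feed is Atom, so every element lives in the http://www.w3.org/2005/Atom namespace and xml.etree.ElementTree needs that namespace to find anything. A minimal parsing sketch for the fields above (the namespace URI is standard Atom; everything else here is illustrative):

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = {'atom': 'http://www.w3.org/2005/Atom'}

def fetch_entries(query, max_results):
    params = urllib.parse.urlencode({
        'search_query': query, 'start': 0, 'max_results': max_results})
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        root = ET.fromstring(resp.read())

    papers = []
    for entry in root.findall('atom:entry', ATOM):
        raw_id = entry.findtext('atom:id', default='', namespaces=ATOM)
        papers.append({
            'arxiv_id': raw_id.rsplit('/', 1)[-1],  # text after the last '/'
            'title': entry.findtext('atom:title', default='', namespaces=ATOM).strip(),
            'authors': [a.findtext('atom:name', default='', namespaces=ATOM)
                        for a in entry.findall('atom:author', ATOM)],
            'abstract': entry.findtext('atom:summary', default='', namespaces=ATOM).strip(),
            'categories': [c.get('term') for c in entry.findall('atom:category', ATOM)],
            'published': entry.findtext('atom:published', default='', namespaces=ATOM),
            'updated': entry.findtext('atom:updated', default='', namespaces=ATOM),
        })
    return papers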

Part B: Text Processing

For each paper’s abstract, compute the following:

  1. Word frequency analysis:

    • Total word count
    • Unique word count
    • Top 20 most frequent words (excluding stopwords)
    • Average word length
  2. Sentence analysis:

    • Total sentence count (split on ‘.’, ‘!’, ‘?’)
    • Average words per sentence
    • Longest sentence (by word count)
    • Shortest sentence (by word count)
  3. Technical term extraction:

    • Extract all words containing uppercase letters (e.g., “LSTM”, “GPU”)
    • Extract all words containing numbers (e.g., “3D”, “ResNet50”)
    • Extract all hyphenated terms (e.g., “state-of-the-art”, “pre-trained”)

Use the stopwords list provided in the starter code:

Code: stopwords.py
"""
Stopwords for text analysis.
"""

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
             'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
             'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
             'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
             'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
             'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
             'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}
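
One possible shape for the per-abstract statistics in Part B, using the STOPWORDS set above (the tokenizer below treats hyphenated terms as single words, which is one reasonable reading of the spec, not the only one):

import re
from collections import Counter
from stopwords import STOPWORDS

def tokenize(text):
    # Case is preserved here; lowercase only when counting frequencies
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

def abstract_stats(text):
    words = tokenize(text)
    freq = Counter(w.lower() for w in words if w.lower() not in STOPWORDS)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    return {
        "total_words": len(words),
        "unique_words": len({w.lower() for w in words}),
        "top_20": freq.most_common(20),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "total_sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }

def technical_terms(text):
    words = tokenize(text)
    return {
        "uppercase_terms": sorted({w for w in words if any(c.isupper() for c in w)}),
        "numeric_terms": sorted({w for w in words if any(c.isdigit() for c in w)}),
        "hyphenated_terms": sorted({w for w in words if '-' in w}),
    }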

Part C: Output Files

Your script must write three files to the output directory:

File 1: papers.json - Array of paper metadata:

[
  {
    "arxiv_id": "[paper ID]",
    "title": "[paper title]",
    "authors": ["author1", "author2", ...],
    "abstract": "[full abstract text]",
    "categories": ["cat1", "cat2", ...],
    "published": "[ISO-8601 UTC]",
    "updated": "[ISO-8601 UTC]",
    "abstract_stats": {
      "total_words": [integer],
      "unique_words": [integer],
      "total_sentences": [integer],
      "avg_words_per_sentence": [float],
      "avg_word_length": [float]
    }
  },
  ...
]

File 2: corpus_analysis.json - Aggregate analysis across all papers:

{
  "query": "[search query used]",
  "papers_processed": [integer],
  "processing_timestamp": "[ISO-8601 UTC]",
  "corpus_stats": {
    "total_abstracts": [integer],
    "total_words": [integer],
    "unique_words_global": [integer],
    "avg_abstract_length": [float],
    "longest_abstract_words": [integer],
    "shortest_abstract_words": [integer]
  },
  "top_50_words": [
    {"word": "[word1]", "frequency": [count], "documents": [count]},
    ...
  ],
  "technical_terms": {
    "uppercase_terms": ["TERM1", "TERM2", ...],
    "numeric_terms": ["term1", "term2", ...],
    "hyphenated_terms": ["term-1", "term-2", ...]
  },
  "category_distribution": {
    "cs.LG": [count],
    "cs.AI": [count],
    ...
  }
}

File 3: processing.log - Processing log with one line per event:

[ISO-8601 UTC] Starting ArXiv query: [query]
[ISO-8601 UTC] Fetched [N] results from ArXiv API
[ISO-8601 UTC] Processing paper: [arxiv_id]
[ISO-8601 UTC] Completed processing: [N] papers in [X.XX] seconds

Part D: Error Handling

Your script must handle the following error conditions:

  1. Network errors: If the ArXiv API is unreachable, write the error to the log and exit with code 1
  2. Invalid XML: If the API returns malformed XML, log the error and continue with other papers
  3. Missing fields: If a paper lacks required fields, skip it and log a warning
  4. Rate limiting: If you receive HTTP 429, wait 3 seconds and retry (maximum 3 attempts)
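
For item 4, a small retry wrapper around the API request is enough (a sketch; how you log and surface a final failure is up to you):

import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=3, wait_seconds=3):
    # Retry only on HTTP 429; any other error propagates to the caller
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < attempts:
                time.sleep(wait_seconds)
                continue
            raise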

Requirements:

  • Use only standard library modules: sys, json, urllib.request, xml.etree.ElementTree, datetime, time, re, os
  • Word frequency counting must be case-insensitive
  • Preserve the original case of words in the output
  • Handle Unicode properly (ArXiv abstracts often contain mathematical symbols)
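
For the Unicode requirement, one simple precaution is to write JSON as UTF-8 without ASCII escaping so mathematical symbols and accented names survive intact (an assumption about output encoding, not a stated requirement):

import json

def write_json(path, data):
    # utf-8 plus ensure_ascii=False keeps non-ASCII characters readable in the output
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)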

Part E: Dockerfile

Create a Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY arxiv_processor.py stopwords.py /app/
RUN mkdir -p /data/output
ENTRYPOINT ["python", "/app/arxiv_processor.py"]

Part F: Building and Running

Build your container image:

docker build -t arxiv-processor:latest .

Run your container:

docker run --rm \
    -v "$(pwd)/output":/data/output \
    arxiv-processor:latest \
    "cat:cs.LG" 10 /data/output

The arguments are passed directly to your Python script:

  • "cat:cs.LG" - the search query
  • 10 - maximum results to fetch
  • /data/output - output directory inside the container

On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).

Part G: Testing

Test your container with various queries:

# Machine Learning papers
docker run --rm -v "$(pwd)/output_ml":/data/output \
    arxiv-processor:latest "cat:cs.LG" 5 /data/output

# Search by author
docker run --rm -v "$(pwd)/output_author":/data/output \
    arxiv-processor:latest "au:LeCun" 3 /data/output

# Search by title keyword
docker run --rm -v "$(pwd)/output_title":/data/output \
    arxiv-processor:latest "ti:transformer" 10 /data/output

Validation

Grading Commands

We will validate your submission by running the following commands from your q2/ directory:

docker build -t arxiv-processor:latest .
docker run --rm \
    -v "$(pwd)/output":/data/output \
    arxiv-processor:latest \
    "cat:cs.LG" 10 /data/output

These commands must complete without errors. We will then verify:

  • output/papers.json, output/corpus_analysis.json, and output/processing.log exist
  • JSON structure and content match the specification
  • Word frequencies are accurate
  • Container handles network errors gracefully

Your container must respect ArXiv’s rate limits and terms of service. Do not make more than 1 request per 3 seconds to avoid being blocked.

Deliverables

See Submission.

Problem 3: Multi-Container Text Processing Pipeline with Docker Compose

Build a multi-container application that processes web content through sequential stages. Containers coordinate through a shared filesystem, demonstrating batch processing patterns used in data pipelines.

Architecture

Three containers process data in sequence:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   fetcher   │────▶│  processor  │────▶│  analyzer   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
    /shared/            /shared/            /shared/
    ├── raw/            ├── processed/      ├── analysis/
    └── status/         └── status/         └── status/

Containers communicate through filesystem markers:

  • Each container monitors /shared/status/ for its input signal
  • Processing stages write completion markers when finished
  • Data flows through /shared/ subdirectories
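
The waiting half of this protocol is simply polling for a marker file, the same pattern the provided fetcher uses for its input file. A sketch of a reusable helper (names are illustrative):

import os
import time

def wait_for_marker(path, poll_seconds=2):
    # Block until an upstream stage writes its completion marker to /shared/status/
    while not os.path.exists(path):
        print(f"Waiting for {path}...", flush=True)
        time.sleep(poll_seconds)

# The processor, for example, would start with:
#   wait_for_marker("/shared/status/fetch_complete.json")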

Part A: Container 1 - Data Fetcher

The fetcher is provided in the starter code. Study this code to understand the coordination pattern.

Code: fetcher/fetch.py
#!/usr/bin/env python3
"""
Data Fetcher - downloads URLs and writes to shared volume.
"""
import json
import os
import sys
import time
import urllib.request
from datetime import datetime, timezone

def main():
    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher starting", flush=True)

    # Ensure the input directory exists (run_pipeline.sh copies urls.txt into it),
    # then wait for the input file to appear
    os.makedirs("/shared/input", exist_ok=True)
    input_file = "/shared/input/urls.txt"
    while not os.path.exists(input_file):
        print(f"Waiting for {input_file}...", flush=True)
        time.sleep(2)

    # Read URLs
    with open(input_file, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]

    # Create output directories
    os.makedirs("/shared/raw", exist_ok=True)
    os.makedirs("/shared/status", exist_ok=True)

    # Fetch each URL
    results = []
    for i, url in enumerate(urls, 1):
        output_file = f"/shared/raw/page_{i}.html"
        try:
            print(f"Fetching {url}...", flush=True)
            with urllib.request.urlopen(url, timeout=10) as response:
                content = response.read()
                with open(output_file, 'wb') as f:
                    f.write(content)
            results.append({
                "url": url,
                "file": f"page_{i}.html",
                "size": len(content),
                "status": "success"
            })
        except Exception as e:
            results.append({
                "url": url,
                "file": None,
                "error": str(e),
                "status": "failed"
            })
        time.sleep(1)  # Rate limiting

    # Write completion status
    status = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "urls_processed": len(urls),
        "successful": sum(1 for r in results if r["status"] == "success"),
        "failed": sum(1 for r in results if r["status"] == "failed"),
        "results": results
    }

    with open("/shared/status/fetch_complete.json", 'w') as f:
        json.dump(status, f, indent=2)

    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher complete", flush=True)

if __name__ == "__main__":
    main()

Create fetcher/Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY fetch.py /app/
CMD ["python", "-u", "/app/fetch.py"]

The -u flag disables output buffering to ensure real-time logging.

Part B: Container 2 - HTML Processor

Create processor/process.py that extracts and analyzes text from HTML files.

Required processing operations:

  1. Wait for /shared/status/fetch_complete.json
  2. Read all HTML files from /shared/raw/
  3. Extract text content using regex (not BeautifulSoup)
  4. Extract all links (href attributes)
  5. Extract all images (src attributes)
  6. Count words, sentences, paragraphs
  7. Save processed data to /shared/processed/
  8. Create /shared/status/process_complete.json

Text extraction requirements:

def strip_html(html_content):
    """Remove HTML tags and extract text."""
    # Remove script and style elements
    html_content = re.sub(r'<script[^>]*>.*?</script>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    html_content = re.sub(r'<style[^>]*>.*?</style>', '', html_content, flags=re.DOTALL | re.IGNORECASE)

    # Extract links before removing tags
    links = re.findall(r'href=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)

    # Extract images
    images = re.findall(r'src=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', html_content)

    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text, links, images

Output format for each processed file (/shared/processed/page_N.json):

{
    "source_file": "page_N.html",
    "text": "[extracted text]",
    "statistics": {
        "word_count": [integer],
        "sentence_count": [integer],
        "paragraph_count": [integer],
        "avg_word_length": [float]
    },
    "links": ["url1", "url2", ...],
    "images": ["src1", "src2", ...],
    "processed_at": "[ISO-8601 UTC]"
}
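
Putting the pieces together, the processor's main loop can look roughly like the sketch below. It assumes strip_html from above and a wait helper like the one sketched in the Architecture section; most statistics are elided, and since the contents of process_complete.json are not specified beyond its existence, the fields written there are placeholders:

import glob
import json
import os
from datetime import datetime, timezone

# assumes strip_html() and wait_for_marker() are defined earlier in this file

def main():
    wait_for_marker("/shared/status/fetch_complete.json")
    os.makedirs("/shared/processed", exist_ok=True)

    processed = []
    for html_path in sorted(glob.glob("/shared/raw/*.html")):
        with open(html_path, 'r', encoding='utf-8', errors='replace') as f:
            text, links, images = strip_html(f.read())
        words = text.split()
        record = {
            "source_file": os.path.basename(html_path),
            "text": text,
            "statistics": {
                "word_count": len(words),
                # sentence_count, paragraph_count, avg_word_length computed similarly
            },
            "links": links,
            "images": images,
            "processed_at": datetime.now(timezone.utc).isoformat(),
        }
        out_name = os.path.basename(html_path).replace(".html", ".json")
        with open(f"/shared/processed/{out_name}", 'w') as f:
            json.dump(record, f, indent=2)
        processed.append(out_name)

    with open("/shared/status/process_complete.json", 'w') as f:
        json.dump({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "files_processed": len(processed),
        }, f, indent=2)

if __name__ == "__main__":
    main()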

Create processor/Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY process.py /app/
CMD ["python", "-u", "/app/process.py"]

Part C: Container 3 - Text Analyzer

Create analyzer/analyze.py that performs corpus-wide analysis.

Required analysis operations:

  1. Wait for /shared/status/process_complete.json
  2. Read all processed files from /shared/processed/
  3. Compute global statistics:
    • Word frequency distribution (top 100 words)
    • Document similarity matrix (Jaccard similarity)
    • N-gram extraction (bigrams and trigrams)
    • Readability metrics
  4. Save to /shared/analysis/final_report.json

Similarity calculation:

def jaccard_similarity(doc1_words, doc2_words):
    """Calculate Jaccard similarity between two documents."""
    set1 = set(doc1_words)
    set2 = set(doc2_words)
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union) if union else 0.0
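
For the n-gram requirement, bigrams and trigrams are just counts of adjacent word pairs and triples (a sketch; the tokenization should match whatever you use for word frequencies):

from collections import Counter

def top_ngrams(words, n, k=10):
    # Count n-word windows (n=2 for bigrams, n=3 for trigrams)
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams.most_common(k)

# e.g. top_ngrams(all_corpus_words, 2) -> [("machine learning", 45), ...],
# which you would then reshape into the {"bigram": ..., "count": ...} entries.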

Final report structure (/shared/analysis/final_report.json):

{
    "processing_timestamp": "[ISO-8601 UTC]",
    "documents_processed": [integer],
    "total_words": [integer],
    "unique_words": [integer],
    "top_100_words": [
        {"word": "the", "count": 523, "frequency": 0.042},
        ...
    ],
    "document_similarity": [
        {"doc1": "page_1.json", "doc2": "page_2.json", "similarity": 0.234},
        ...
    ],
    "top_bigrams": [
        {"bigram": "machine learning", "count": 45},
        ...
    ],
    "readability": {
        "avg_sentence_length": [float],
        "avg_word_length": [float],
        "complexity_score": [float]
    }
}

Create analyzer/Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY analyze.py /app/
CMD ["python", "-u", "/app/analyze.py"]

Part D: Docker Compose Configuration

The docker-compose.yaml is provided in the starter code:

Code: docker-compose.yaml
version: '3.8'

services:
  fetcher:
    build: ./fetcher
    container_name: pipeline-fetcher
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1

  processor:
    build: ./processor
    container_name: pipeline-processor
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - fetcher

  analyzer:
    build: ./analyzer
    container_name: pipeline-analyzer
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - processor

volumes:
  pipeline-data:
    name: pipeline-shared-data

Note: depends_on ensures start order but does not wait for container completion. Your Python scripts must implement proper waiting logic.

Part E: Orchestration Script

The run_pipeline.sh orchestration script is provided in the starter code. It handles building containers, starting the pipeline, injecting URLs, monitoring for completion, and extracting results.

Code: run_pipeline.sh
#!/bin/bash
#
# Pipeline orchestration script.
#

if [ $# -lt 1 ]; then
    echo "Usage: $0 <url1> [url2] [url3] ..."
    echo "Example: $0 https://example.com https://wikipedia.org"
    exit 1
fi

echo "Starting Multi-Container Pipeline"
echo "================================="

# Clean previous runs
docker-compose down -v 2>/dev/null

# Create temporary directory
TEMP_DIR=$(mktemp -d)
trap "rm -rf $TEMP_DIR" EXIT

# Create URL list
for url in "$@"; do
    echo "$url" >> "$TEMP_DIR/urls.txt"
done

echo "URLs to process:"
cat "$TEMP_DIR/urls.txt"
echo ""

# Build containers
echo "Building containers..."
docker-compose build --quiet

# Start pipeline
echo "Starting pipeline..."
docker-compose up -d

# Wait for containers to initialize
sleep 3

# Inject URLs
echo "Injecting URLs..."
docker cp "$TEMP_DIR/urls.txt" pipeline-fetcher:/shared/input/urls.txt

# Monitor completion
echo "Processing..."
MAX_WAIT=300  # 5 minutes timeout
ELAPSED=0

while [ $ELAPSED -lt $MAX_WAIT ]; do
    if docker exec pipeline-analyzer test -f /shared/analysis/final_report.json 2>/dev/null; then
        echo "Pipeline complete"
        break
    fi
    sleep 5
    ELAPSED=$((ELAPSED + 5))
done

if [ $ELAPSED -ge $MAX_WAIT ]; then
    echo "Pipeline timeout after ${MAX_WAIT} seconds"
    docker-compose logs
    docker-compose down
    exit 1
fi

# Extract results
mkdir -p output
docker cp pipeline-analyzer:/shared/analysis/final_report.json output/
docker cp pipeline-analyzer:/shared/status output/

# Cleanup
docker-compose down

# Display summary
if [ -f "output/final_report.json" ]; then
    echo ""
    echo "Results saved to output/final_report.json"
    python3 -m json.tool output/final_report.json | head -20
else
    echo "Pipeline failed - no output generated"
    exit 1
fi

On Windows, run this script under WSL2 or Git Bash, or translate the commands to PowerShell.

Part F: Testing

Create test_urls.txt:

https://www.example.com
https://www.wikipedia.org
https://httpbin.org/html

Test your pipeline:

# Single URL
./run_pipeline.sh https://www.example.com

# Multiple URLs
./run_pipeline.sh https://www.example.com https://www.wikipedia.org https://httpbin.org/html

Debugging

To diagnose pipeline issues:

  1. View container logs:

    docker-compose logs fetcher
    docker-compose logs processor
    docker-compose logs analyzer
  2. Inspect shared volume:

    docker run --rm -v pipeline-shared-data:/shared alpine ls -la /shared/
  3. Check container status:

    docker-compose ps
  4. Enter running container:

    docker exec -it pipeline-fetcher /bin/bash

Validation

Grading Commands

We will validate your submission by running the following commands from your q3/ directory:

docker-compose build
./run_pipeline.sh https://www.example.com https://httpbin.org/html

These commands must complete without errors. We will then verify:

  • Status files appear in correct sequence (fetch_complete.json, process_complete.json)
  • output/final_report.json exists and matches the specification
  • Containers properly wait for dependencies before processing
  • Pipeline handles URL fetch failures gracefully

Deliverables

See Submission.


Submission
README.md
q1/
├── fetch_and_process.py
├── Dockerfile
└── test_urls.txt
q2/
├── arxiv_processor.py
├── stopwords.py
└── Dockerfile
q3/
├── docker-compose.yaml
├── test_urls.txt
├── fetcher/
│   ├── Dockerfile
│   └── fetch.py
├── processor/
│   ├── Dockerfile
│   └── process.py
└── analyzer/
    ├── Dockerfile
    └── analyze.py