Homework #1: Docker, ArXiv API, Multi-Container Pipelines
EE 547: Fall 2025
Assigned: 02 September
Due: Monday, 15 September at 23:59
Submission: Gradescope via GitHub repository
- Docker Desktop must be installed and running on your machine
- Use only Python standard library modules unless explicitly permitted
- All shell scripts must be executable (chmod +x)
Overview
This assignment introduces containerization using Docker. You will build and run containers, manage data persistence through volumes, and create multi-container applications using Docker Compose.
Problem 1: Docker Basics – HTTP Data Fetcher
Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.
Part A: Python HTTP Fetcher
Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.
Your script must accept exactly two command line arguments:
- Path to an input file containing URLs (one per line)
- Path to output directory
For each URL in the input file, your script must:
- Perform an HTTP GET request to the URL
- Measure the response time in milliseconds
- Capture the HTTP status code
- Calculate the size of the response body in bytes
- Count the number of words in the response (for text responses only)
Your script must write three files to the output directory:
File 1: responses.json - Array of response data:
[
  {
    "url": "[URL string]",
    "status_code": [integer],
    "response_time_ms": [float],
    "content_length": [integer],
    "word_count": [integer or null],
    "timestamp": "[ISO-8601 UTC]",
    "error": [null or error message string]
  },
  ...
]

File 2: summary.json - Aggregate statistics:
{
  "total_urls": [integer],
  "successful_requests": [integer],
  "failed_requests": [integer],
  "average_response_time_ms": [float],
  "total_bytes_downloaded": [integer],
  "status_code_distribution": {
    "200": [count],
    "404": [count],
    ...
  },
  "processing_start": "[ISO-8601 UTC]",
  "processing_end": "[ISO-8601 UTC]"
}

File 3: errors.log - One line per error:
[ISO-8601 UTC] [URL]: [error message]
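A sketch of one way to assemble the summary.json fields from per-URL records shaped like the responses.json entries above. Here a request counts as successful if it returned any HTTP status code; if your interpretation differs, adjust accordingly. The function name and record layout are illustrative, not required:

def summarize(records, start_iso, end_iso):
    # records: list of dicts shaped like the responses.json entries
    ok = [r for r in records if r["status_code"] is not None]
    dist = {}
    for r in ok:
        key = str(r["status_code"])
        dist[key] = dist.get(key, 0) + 1
    times = [r["response_time_ms"] for r in ok]
    return {
        "total_urls": len(records),
        "successful_requests": len(ok),
        "failed_requests": len(records) - len(ok),
        "average_response_time_ms": sum(times) / len(times) if times else 0.0,
        "total_bytes_downloaded": sum(r["content_length"] or 0 for r in ok),
        "status_code_distribution": dist,
        "processing_start": start_iso,
        "processing_end": end_iso,
    }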
Requirements:
- Use only urllib.request for HTTP requests (no requests library)
- Use only standard library modules: sys, json, time, datetime, os, re
- For word counting, consider a word to be any sequence of alphanumeric characters (see the sketch after this list)
- If a request fails (connection error, timeout, etc.), record the error and continue
- Set a timeout of 10 seconds for each request
- If response Content-Type header contains “text”, perform word count; otherwise set to null
- All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
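A minimal sketch of the per-URL fetch logic under these requirements. The helper name fetch_one and the record layout are illustrative; handling 4xx/5xx codes via urllib.error (part of the standard urllib package) is one approach, since urlopen raises HTTPError for those responses:

import re
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

def iso_now():
    # ISO-8601 UTC with 'Z' suffix, e.g. 2025-09-15T23:59:00Z
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def fetch_one(url):
    record = {"url": url, "status_code": None, "response_time_ms": None,
              "content_length": None, "word_count": None,
              "timestamp": iso_now(), "error": None}
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
            record["status_code"] = resp.status
            content_type = resp.headers.get("Content-Type") or ""
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses still carry a status code and a body
        body = e.read()
        record["status_code"] = e.code
        content_type = e.headers.get("Content-Type") or ""
    except Exception as e:
        record["error"] = str(e)
        record["response_time_ms"] = (time.time() - start) * 1000.0
        return record
    record["response_time_ms"] = (time.time() - start) * 1000.0
    record["content_length"] = len(body)
    if "text" in content_type:
        # A word is any run of alphanumeric characters
        record["word_count"] = len(re.findall(r"[A-Za-z0-9]+", body.decode("utf-8", "replace")))
    return record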
Part B: Dockerfile
Create a Dockerfile that packages your Python application.
FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]The Dockerfile must:
- Use python:3.11-slim as the base image (no other base image permitted)
- Set the working directory to /app
- Copy your script to the container
- Create input and output directories at /data/input and /data/output
- Use ENTRYPOINT for the Python interpreter and script
- Use CMD for default arguments (can be overridden at runtime)
Part C: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t http-fetcher:latest .

Create run.sh:
#!/bin/bash
# Check arguments
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_file> <output_directory>"
    exit 1
fi
INPUT_FILE="$1"
OUTPUT_DIR="$2"
# Check if input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE does not exist"
    exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
# Run container
docker run --rm \
    --name http-fetcher \
    -v "$(realpath "$INPUT_FILE")":/data/input/urls.txt:ro \
    -v "$(realpath "$OUTPUT_DIR")":/data/output \
    http-fetcher:latest

Your run.sh script must:
- Accept exactly 2 arguments: input file path and output directory path
- Verify the input file exists before running the container
- Create the output directory if it doesn't exist
- Mount the input file read-only at /data/input/urls.txt
- Mount the output directory at /data/output
- Use --rm to remove the container after execution
- Use --name http-fetcher for the container name
- Use realpath to convert relative paths to absolute paths
Part D: Testing
Create test_urls.txt with the following URLs:
http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com
Your application must handle all these cases correctly:
- Successful responses (200)
- Delayed responses (testing timeout behavior)
- Client errors (404)
- Server errors (500)
- JSON responses (Content-Type: application/json)
- HTML responses (Content-Type: text/html)
- Invalid URLs / DNS failures
Deliverables
Your problem1/ directory must contain exactly:
problem1/
├── fetch_and_process.py
├── Dockerfile
├── build.sh
├── run.sh
└── test_urls.txt
All shell scripts must be executable (chmod +x *.sh).
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh test_urls.txt output/ - must complete without errors
- Checking that output/responses.json, output/summary.json, and output/errors.log exist
- Validating JSON structure and content
- Running with different URL lists to verify correctness
Your container must not require network configuration beyond Docker defaults. Your container must not run as root user (the python:3.11-slim image already handles this correctly).
Problem 2: ArXiv Paper Metadata Processor
Build a containerized application that fetches paper metadata from the ArXiv API, processes it, and generates structured output.
Part A: ArXiv API Client
Create a file arxiv_processor.py that queries the ArXiv API and extracts paper metadata.
Your script must accept exactly three command line arguments:
- Search query string (e.g., “cat:cs.LG” for machine learning papers)
- Maximum number of results to fetch (integer between 1 and 100)
- Path to output directory
Your script must perform the following operations:
- Query the ArXiv API using the search query
- Fetch up to the specified maximum number of results
- Extract and process metadata for each paper
- Generate text analysis statistics
- Write structured output files
ArXiv API endpoint: http://export.arxiv.org/api/query
Query parameters:
- search_query: Your search string
- start: Starting index (0-based)
- max_results: Maximum results to return
Example API call:
http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=10
The API returns XML (parsing guide). You must parse this XML to extract:
- Paper ID (from the <id> tag; extract just the ID portion after the last '/')
- Title (from <title>)
- Authors (from all <author><name> tags)
- Abstract (from <summary>)
- Categories (from the term attribute of all <category> tags)
- Published date (from <published>)
- Updated date (from <updated>)
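A minimal sketch of the query and parsing steps. It assumes the Atom namespace used by the ArXiv feed and uses urllib.parse (part of the standard urllib package) for URL encoding; the function and variable names are illustrative, not required:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the ArXiv feed

def fetch_entries(query, max_results):
    params = urllib.parse.urlencode({"search_query": query, "start": 0,
                                     "max_results": max_results})
    url = "http://export.arxiv.org/api/query?" + params
    with urllib.request.urlopen(url, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    papers = []
    for entry in root.findall(ATOM + "entry"):
        raw_id = entry.findtext(ATOM + "id", default="")
        papers.append({
            "arxiv_id": raw_id.rsplit("/", 1)[-1],   # portion after the last '/'
            "title": (entry.findtext(ATOM + "title") or "").strip(),
            "authors": [a.findtext(ATOM + "name") for a in entry.findall(ATOM + "author")],
            "abstract": (entry.findtext(ATOM + "summary") or "").strip(),
            "categories": [c.get("term") for c in entry.findall(ATOM + "category")],
            "published": entry.findtext(ATOM + "published"),
            "updated": entry.findtext(ATOM + "updated"),
        })
    return papers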
Part B: Text Processing
For each paper’s abstract, compute the following:
Word frequency analysis:
- Total word count
- Unique word count
- Top 20 most frequent words (excluding stopwords)
- Average word length
Sentence analysis:
- Total sentence count (split on ‘.’, ‘!’, ‘?’)
- Average words per sentence
- Longest sentence (by word count)
- Shortest sentence (by word count)
Technical term extraction:
- Extract all words containing uppercase letters (e.g., “LSTM”, “GPU”)
- Extract all words containing numbers (e.g., “3D”, “ResNet50”)
- Extract all hyphenated terms (e.g., “state-of-the-art”, “pre-trained”)
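A sketch of regex-based tokenization covering the frequency and technical-term requirements above. The token pattern is one reasonable interpretation (it keeps internal hyphens so terms like "pre-trained" survive as single words), and the stopwords parameter is the set defined just below; names and return layout are illustrative:

import re

def analyze_abstract(abstract, stopwords):
    # Tokens keep internal hyphens so hyphenated terms stay intact
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", abstract)
    freq = {}
    for t in tokens:
        w = t.lower()                      # frequency counting is case-insensitive
        if w not in stopwords:
            freq[w] = freq.get(w, 0) + 1
    top_20 = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:20]
    return {
        "total_words": len(tokens),
        "unique_words": len({t.lower() for t in tokens}),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        "top_20_words": top_20,
        "uppercase_terms": sorted({t for t in tokens if any(c.isupper() for c in t)}),  # original case preserved
        "numeric_terms": sorted({t for t in tokens if any(c.isdigit() for c in t)}),
        "hyphenated_terms": sorted({t for t in tokens if "-" in t}),
    }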
Use the following stopwords list:
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
             'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
             'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
             'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
             'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
             'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
             'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}

Part C: Output Files
Your script must write three files to the output directory:
File 1: papers.json - Array of paper metadata:
[
  {
    "arxiv_id": "[paper ID]",
    "title": "[paper title]",
    "authors": ["author1", "author2", ...],
    "abstract": "[full abstract text]",
    "categories": ["cat1", "cat2", ...],
    "published": "[ISO-8601 UTC]",
    "updated": "[ISO-8601 UTC]",
    "abstract_stats": {
      "total_words": [integer],
      "unique_words": [integer],
      "total_sentences": [integer],
      "avg_words_per_sentence": [float],
      "avg_word_length": [float]
    }
  },
  ...
]

File 2: corpus_analysis.json - Aggregate analysis across all papers:
{
  "query": "[search query used]",
  "papers_processed": [integer],
  "processing_timestamp": "[ISO-8601 UTC]",
  "corpus_stats": {
    "total_abstracts": [integer],
    "total_words": [integer],
    "unique_words_global": [integer],
    "avg_abstract_length": [float],
    "longest_abstract_words": [integer],
    "shortest_abstract_words": [integer]
  },
  "top_50_words": [
    {"word": "[word1]", "frequency": [count], "documents": [count]},
    ...
  ],
  "technical_terms": {
    "uppercase_terms": ["TERM1", "TERM2", ...],
    "numeric_terms": ["term1", "term2", ...],
    "hyphenated_terms": ["term-1", "term-2", ...]
  },
  "category_distribution": {
    "cs.LG": [count],
    "cs.AI": [count],
    ...
  }
}

File 3: processing.log - Processing log with one line per event:
[ISO-8601 UTC] Starting ArXiv query: [query]
[ISO-8601 UTC] Fetched [N] results from ArXiv API
[ISO-8601 UTC] Processing paper: [arxiv_id]
[ISO-8601 UTC] Completed processing: [N] papers in [X.XX] seconds
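For the top_50_words entries, frequency is the total count across all abstracts and documents is the number of abstracts containing the word. A sketch of one way to compute both, assuming you already have one list of lowercased tokens per abstract; whether to exclude stopwords here mirrors the Part B convention:

def top_50_words(token_lists, stopwords):
    # token_lists: one list of lowercased tokens per abstract
    freq, doc_freq = {}, {}
    for tokens in token_lists:
        for w in tokens:
            if w not in stopwords:
                freq[w] = freq.get(w, 0) + 1
        for w in set(tokens):
            if w not in stopwords:
                doc_freq[w] = doc_freq.get(w, 0) + 1
    ranked = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:50]
    return [{"word": w, "frequency": n, "documents": doc_freq[w]} for w, n in ranked]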
Part D: Error Handling
Your script must handle the following error conditions:
- Network errors: If the ArXiv API is unreachable, write error to log and exit with code 1
- Invalid XML: If the API returns malformed XML, log the error and continue with other papers
- Missing fields: If a paper lacks required fields, skip it and log a warning
- Rate limiting: If you receive HTTP 429, wait 3 seconds and retry (maximum 3 attempts)
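A sketch of the retry behavior for rate limiting; the helper name is illustrative, and the 3-second wait and 3-attempt cap follow the requirement above:

import time
import urllib.error
import urllib.request

def request_with_retry(url, max_attempts=3, wait_seconds=3):
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < max_attempts:
                time.sleep(wait_seconds)   # back off, then retry
                continue
            raise                          # other HTTP errors propagate to the caller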
Requirements:
- Use only standard library modules: sys, json, urllib.request, xml.etree.ElementTree, datetime, time, re, os
- All word processing must be case-insensitive for frequency counting
- Preserve original case in the output
- Handle Unicode properly (ArXiv abstracts often contain mathematical symbols)
Part E: Dockerfile
Create a Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_processor.py /app/
RUN mkdir -p /data/output
ENTRYPOINT ["python", "/app/arxiv_processor.py"]Part F: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t arxiv-processor:latest .

Create run.sh:
#!/bin/bash
# Check arguments
if [ $# -ne 3 ]; then
    echo "Usage: $0 <query> <max_results> <output_directory>"
    echo "Example: $0 'cat:cs.LG' 10 output/"
    exit 1
fi
QUERY="$1"
MAX_RESULTS="$2"
OUTPUT_DIR="$3"
# Validate max_results is a number
if ! [[ "$MAX_RESULTS" =~ ^[0-9]+$ ]]; then
    echo "Error: max_results must be a positive integer"
    exit 1
fi
# Check max_results is in valid range
if [ "$MAX_RESULTS" -lt 1 ] || [ "$MAX_RESULTS" -gt 100 ]; then
    echo "Error: max_results must be between 1 and 100"
    exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
# Run container
docker run --rm \
    --name arxiv-processor \
    -v "$(realpath "$OUTPUT_DIR")":/data/output \
    arxiv-processor:latest \
    "$QUERY" "$MAX_RESULTS" "/data/output"

Part G: Testing
Create test.sh:
#!/bin/bash
# Test 1: Machine Learning papers
./run.sh "cat:cs.LG" 5 output_ml/
# Test 2: Search by author
./run.sh "au:LeCun" 3 output_author/
# Test 3: Search by title keyword
./run.sh "ti:transformer" 10 output_title/
# Test 4: Complex query (ML papers about transformers from 2023)
./run.sh "cat:cs.LG AND ti:transformer AND submittedDate:[202301010000 TO 202312312359]" 5 output_complex/
echo "Test completed. Check output directories for results."Deliverables
Your problem2/ directory must contain exactly:
problem2/
├── arxiv_processor.py
├── Dockerfile
├── build.sh
├── run.sh
└── test.sh
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh "cat:cs.LG" 10 output/ - must fetch 10 ML papers
- Verifying all three output files exist and contain valid JSON
- Checking that word frequencies are accurate
- Testing with various queries to ensure robustness
- Verifying the container handles network errors gracefully
Your container must respect ArXiv’s rate limits and terms of service. Do not make more than 1 request per 3 seconds to avoid being blocked.
Problem 3: Multi-Container Text Processing Pipeline with Docker Compose
Build a multi-container application that processes web content through sequential stages. Containers coordinate through a shared filesystem, demonstrating batch processing patterns used in data pipelines.
Architecture
Three containers process data in sequence:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ fetcher │────▶│ processor │────▶│ analyzer │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
/shared/ /shared/ /shared/
└── raw/ └── processed/ └── analysis/
└── status/ └── status/ └── status/
Containers communicate through filesystem markers:
- Each container monitors /shared/status/ for its input signal
- Processing stages write completion markers when finished
- Data flows through /shared/ subdirectories
Part A: Container 1 - Data Fetcher
Create fetcher/fetch.py:
#!/usr/bin/env python3
import json
import os
import sys
import time
import urllib.request
from datetime import datetime, timezone

def main():
    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher starting", flush=True)
    # Wait for input file
    input_file = "/shared/input/urls.txt"
    # Ensure the input directory exists so run_pipeline.sh can docker cp urls.txt into it
    os.makedirs("/shared/input", exist_ok=True)
    while not os.path.exists(input_file):
        print(f"Waiting for {input_file}...", flush=True)
        time.sleep(2)
    # Read URLs
    with open(input_file, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]
    # Create output directories
    os.makedirs("/shared/raw", exist_ok=True)
    os.makedirs("/shared/status", exist_ok=True)
    # Fetch each URL
    results = []
    for i, url in enumerate(urls, 1):
        output_file = f"/shared/raw/page_{i}.html"
        try:
            print(f"Fetching {url}...", flush=True)
            with urllib.request.urlopen(url, timeout=10) as response:
                content = response.read()
            with open(output_file, 'wb') as f:
                f.write(content)
            results.append({
                "url": url,
                "file": f"page_{i}.html",
                "size": len(content),
                "status": "success"
            })
        except Exception as e:
            results.append({
                "url": url,
                "file": None,
                "error": str(e),
                "status": "failed"
            })
        time.sleep(1)  # Rate limiting
    # Write completion status
    status = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "urls_processed": len(urls),
        "successful": sum(1 for r in results if r["status"] == "success"),
        "failed": sum(1 for r in results if r["status"] == "failed"),
        "results": results
    }
    with open("/shared/status/fetch_complete.json", 'w') as f:
        json.dump(status, f, indent=2)
    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher complete", flush=True)

if __name__ == "__main__":
    main()

Create fetcher/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY fetch.py /app/
CMD ["python", "-u", "/app/fetch.py"]The -u flag disables output buffering to ensure real-time logging.
Part B: Container 2 - HTML Processor
Create processor/process.py that extracts and analyzes text from HTML files.
Required processing operations:
- Wait for /shared/status/fetch_complete.json (see the polling sketch after this list)
- Read all HTML files from /shared/raw/
- Extract text content using regex (not BeautifulSoup)
- Extract all links (href attributes)
- Extract all images (src attributes)
- Count words, sentences, paragraphs
- Save processed data to /shared/processed/
- Create /shared/status/process_complete.json
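A sketch of the polling step referenced in the first item above; the poll interval is illustrative:

import os
import time

def wait_for_marker(path, poll_seconds=2):
    # Block until an upstream stage writes its completion marker
    while not os.path.exists(path):
        print(f"Waiting for {path}...", flush=True)
        time.sleep(poll_seconds)

# Example: block until the fetcher is done, then read its status file
# wait_for_marker("/shared/status/fetch_complete.json")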
Text extraction requirements:
def strip_html(html_content):
    """Remove HTML tags and extract text."""
    # Remove script and style elements
    html_content = re.sub(r'<script[^>]*>.*?</script>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    html_content = re.sub(r'<style[^>]*>.*?</style>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    # Extract links before removing tags
    links = re.findall(r'href=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)
    # Extract images
    images = re.findall(r'src=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', html_content)
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text, links, images

Output format for each processed file (/shared/processed/page_N.json):
{
  "source_file": "page_N.html",
  "text": "[extracted text]",
  "statistics": {
    "word_count": [integer],
    "sentence_count": [integer],
    "paragraph_count": [integer],
    "avg_word_length": [float]
  },
  "links": ["url1", "url2", ...],
  "images": ["src1", "src2", ...],
  "processed_at": "[ISO-8601 UTC]"
}

Create processor/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY process.py /app/
CMD ["python", "-u", "/app/process.py"]Part C: Container 3 - Text Analyzer
Create analyzer/analyze.py that performs corpus-wide analysis.
Required analysis operations:
- Wait for /shared/status/process_complete.json
- Read all processed files from /shared/processed/
- Compute global statistics:
  - Word frequency distribution (top 100 words)
  - Document similarity matrix (Jaccard similarity)
  - N-gram extraction (bigrams and trigrams; see the sketch after this list)
  - Readability metrics
- Save to /shared/analysis/final_report.json
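A sketch of the n-gram extraction referenced above, assuming each document's text has already been lowercased and split into word tokens; the helper names and top-k cutoff are illustrative:

def ngram_counts(words, n):
    # Count contiguous n-grams (e.g. n=2 for bigrams, n=3 for trigrams)
    counts = {}
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts

def top_ngrams(words, n, k=10):
    counts = ngram_counts(words, n)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    key = "bigram" if n == 2 else "trigram"
    return [{key: g, "count": c} for g, c in ranked]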
Similarity calculation:
def jaccard_similarity(doc1_words, doc2_words):
    """Calculate Jaccard similarity between two documents."""
    set1 = set(doc1_words)
    set2 = set(doc2_words)
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union) if union else 0.0

Final report structure (/shared/analysis/final_report.json):
{
  "processing_timestamp": "[ISO-8601 UTC]",
  "documents_processed": [integer],
  "total_words": [integer],
  "unique_words": [integer],
  "top_100_words": [
    {"word": "the", "count": 523, "frequency": 0.042},
    ...
  ],
  "document_similarity": [
    {"doc1": "page_1.json", "doc2": "page_2.json", "similarity": 0.234},
    ...
  ],
  "top_bigrams": [
    {"bigram": "machine learning", "count": 45},
    ...
  ],
  "readability": {
    "avg_sentence_length": [float],
    "avg_word_length": [float],
    "complexity_score": [float]
  }
}

Create analyzer/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY analyze.py /app/
CMD ["python", "-u", "/app/analyze.py"]Part D: Docker Compose Configuration
Create docker-compose.yaml:
version: '3.8'
services:
  fetcher:
    build: ./fetcher
    container_name: pipeline-fetcher
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
  processor:
    build: ./processor
    container_name: pipeline-processor
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - fetcher
  analyzer:
    build: ./analyzer
    container_name: pipeline-analyzer
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - processor
volumes:
  pipeline-data:
    name: pipeline-shared-data

Note: depends_on ensures start order but does not wait for container completion. Your Python scripts must implement proper waiting logic.
Part E: Orchestration Script
Create run_pipeline.sh that manages the complete pipeline execution:
#!/bin/bash
if [ $# -lt 1 ]; then
    echo "Usage: $0 <url1> [url2] [url3] ..."
    echo "Example: $0 https://example.com https://wikipedia.org"
    exit 1
fi
echo "Starting Multi-Container Pipeline"
echo "================================="
# Clean previous runs
docker-compose down -v 2>/dev/null
# Create temporary directory
TEMP_DIR=$(mktemp -d)
trap "rm -rf $TEMP_DIR" EXIT
# Create URL list
for url in "$@"; do
    echo "$url" >> "$TEMP_DIR/urls.txt"
done
echo "URLs to process:"
cat "$TEMP_DIR/urls.txt"
echo ""
# Build containers
echo "Building containers..."
docker-compose build --quiet
# Start pipeline
echo "Starting pipeline..."
docker-compose up -d
# Wait for containers to initialize
sleep 3
# Inject URLs
echo "Injecting URLs..."
docker cp "$TEMP_DIR/urls.txt" pipeline-fetcher:/shared/input/urls.txt
# Monitor completion
echo "Processing..."
MAX_WAIT=300 # 5 minutes timeout
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
    if docker exec pipeline-analyzer test -f /shared/analysis/final_report.json 2>/dev/null; then
        echo "Pipeline complete"
        break
    fi
    sleep 5
    ELAPSED=$((ELAPSED + 5))
done
if [ $ELAPSED -ge $MAX_WAIT ]; then
    echo "Pipeline timeout after ${MAX_WAIT} seconds"
    docker-compose logs
    docker-compose down
    exit 1
fi
# Extract results
mkdir -p output
docker cp pipeline-analyzer:/shared/analysis/final_report.json output/
docker cp pipeline-analyzer:/shared/status output/
# Cleanup
docker-compose down
# Display summary
if [ -f "output/final_report.json" ]; then
echo ""
echo "Results saved to output/final_report.json"
python3 -m json.tool output/final_report.json | head -20
else
echo "Pipeline failed - no output generated"
exit 1
fiPart F: Testing
Create test_urls.txt:
https://www.example.com
https://www.wikipedia.org
https://httpbin.org/html
Create test.sh:
#!/bin/bash
echo "Test 1: Single URL"
./run_pipeline.sh https://www.example.com
echo ""
echo "Test 2: Multiple URLs from file"
./run_pipeline.sh $(cat test_urls.txt)
echo ""
echo "Test 3: Verify output structure"
python3 -c "
import json
with open('output/final_report.json') as f:
    data = json.load(f)
assert 'documents_processed' in data
assert 'top_100_words' in data
assert 'document_similarity' in data
print('Output validation passed')
"

Deliverables
Your problem3/ directory structure:
problem3/
├── docker-compose.yaml
├── run_pipeline.sh
├── test.sh
├── test_urls.txt
├── fetcher/
│ ├── Dockerfile
│ └── fetch.py
├── processor/
│ ├── Dockerfile
│ └── process.py
└── analyzer/
├── Dockerfile
└── analyze.py
Debugging
To diagnose pipeline issues:
View container logs:
docker-compose logs fetcher
docker-compose logs processor
docker-compose logs analyzer

Inspect shared volume:

docker run --rm -v pipeline-shared-data:/shared alpine ls -la /shared/

Check container status:

docker-compose ps

Enter running container:

docker exec -it pipeline-fetcher /bin/bash
Validation
Your implementation will be tested by:
- Running docker-compose build - must complete without errors
- Executing ./run_pipeline.sh with various URLs
- Verifying status files appear in correct sequence
- Validating JSON output structure and content
- Checking that containers properly wait for dependencies
- Testing error handling when URLs fail to download
Submission Requirements
Your GitHub repository must follow this exact structure:
ee547-hw1-[username]/
├── problem1/
│ ├── fetch_and_process.py
│ ├── Dockerfile
│ ├── build.sh
│ ├── run.sh
│ └── test_urls.txt
├── problem2/
│ └── [files for problem 2]
├── problem3/
│ └── [files for problem 3]
└── README.md
The README.md in your repository root must contain:
- Your full name
- USC email address
- Any external libraries used beyond those specified
- Instructions to run each problem if they differ from the assignment specification
Before submitting, ensure:
1. docker build completes without errors for all Dockerfiles
2. All shell scripts are executable and run without modification
3. JSON output is valid and matches the specified format exactly
4. Your repository structure matches the requirement exactly