hw01-q1

Problem 1: Docker Basics – HTTP Data Fetcher

Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.

Part A: Python HTTP Fetcher

Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.

Your script must accept exactly two command line arguments:

Path to an input file containing URLs (one per line)
Path to output directory
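
As a minimal sketch of the argument handling described above, the helpers below validate the argument count and read the URL list. The function names parse_args and read_urls are hypothetical, and two behaviors are assumptions not stated in the assignment: creating the output directory when it is missing, and skipping blank lines in the input file.

```python
import os


def parse_args(argv):
    """Return (input_path, output_dir) from a sys.argv-style list.

    Raises SystemExit with a usage message unless exactly two
    arguments follow the program name.
    """
    if len(argv) != 3:
        raise SystemExit(
            "Usage: fetch_and_process.py <input_file> <output_directory>"
        )
    input_path, output_dir = argv[1], argv[2]
    # Assumption: create the output directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)
    return input_path, output_dir


def read_urls(input_path):
    """Read one URL per line. Skipping blank lines is an assumption."""
    with open(input_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

In a script, these would typically be called as parse_args(sys.argv) followed by read_urls(input_path).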

For each URL in the input file, your script must:

Perform an HTTP GET request to the URL
Measure the response time in milliseconds
Capture the HTTP status code
Calculate the size of the response body in bytes
Count the number of words in the response (for text responses only)
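
The first four steps above can be sketched with urllib.request and a monotonic clock. This is one possible shape, not a required design: the name timed_get is hypothetical, and error handling is deliberately left to the caller, since urlopen raises an exception for 4xx/5xx responses, DNS failures, and timeouts.

```python
import time
import urllib.request


def timed_get(url, timeout=10):
    """GET `url` and return (status_code, elapsed_ms, body, content_type).

    Exceptions raised by urlopen are not caught here; they propagate
    to the caller, which is expected to record them.
    """
    start = time.perf_counter()  # monotonic clock, suited to intervals
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()  # raw bytes of the response body
        status_code = response.status
        content_type = response.headers.get("Content-Type", "")
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return status_code, elapsed_ms, body, content_type
```

The response size in bytes is then len(body). Measuring until the body has been fully read is a choice made in this sketch; the assignment does not say whether the response time should include the download.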

Your script must write three files to the output directory:

File 1: responses.json - Array of response data:

[
  {
    "url": "[URL string]",
    "status_code": [integer],
    "response_time_ms": [float],
    "content_length": [integer],
    "word_count": [integer or null],
    "timestamp": "[ISO-8601 UTC]",
    "error": [null or error message string]
  },
  ...
]
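
One way to keep every entry in the shape shown above is to build records through a single helper. The sketch below uses hypothetical names (make_record, write_responses) and assumes that fields that cannot be measured for a failed request are written as null; the assignment only states this explicitly for word_count and error.

```python
import json
import os


def make_record(url, timestamp, status_code=None, response_time_ms=None,
                content_length=None, word_count=None, error=None):
    """Return one responses.json entry with every key always present."""
    return {
        "url": url,
        "status_code": status_code,
        "response_time_ms": response_time_ms,
        "content_length": content_length,
        "word_count": word_count,  # None is serialized as JSON null
        "timestamp": timestamp,
        "error": error,
    }


def write_responses(output_dir, records):
    """Write the list of records as a JSON array and return the file path."""
    path = os.path.join(output_dir, "responses.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    return path
```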

File 2: summary.json - Aggregate statistics:

{
  "total_urls": [integer],
  "successful_requests": [integer],
  "failed_requests": [integer],
  "average_response_time_ms": [float],
  "total_bytes_downloaded": [integer],
  "status_code_distribution": {
    "200": [count],
    "404": [count],
    ...
  },
  "processing_start": "[ISO-8601 UTC]",
  "processing_end": "[ISO-8601 UTC]"
}
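
The aggregate fields can be derived from the per-URL records in one pass. The sketch below relies on assumptions the assignment leaves open: a request counts as successful when its "error" field is null, the average covers only requests with a measured response time, and an empty input yields an average of 0.0. The function name summarize is hypothetical.

```python
def summarize(records, processing_start, processing_end):
    """Build the summary.json dictionary from a list of response records."""
    times = [r["response_time_ms"] for r in records
             if r.get("response_time_ms") is not None]

    distribution = {}
    for r in records:
        code = r.get("status_code")
        if code is not None:
            key = str(code)  # JSON object keys must be strings
            distribution[key] = distribution.get(key, 0) + 1

    # Assumption: success means no error was recorded for the request.
    successful = sum(1 for r in records if r.get("error") is None)

    return {
        "total_urls": len(records),
        "successful_requests": successful,
        "failed_requests": len(records) - successful,
        "average_response_time_ms": sum(times) / len(times) if times else 0.0,
        "total_bytes_downloaded": sum(r.get("content_length") or 0
                                      for r in records),
        "status_code_distribution": distribution,
        "processing_start": processing_start,
        "processing_end": processing_end,
    }
```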

File 3: errors.log - One line per error:

[ISO-8601 UTC] [URL]: [error message]
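
A possible way to produce this file is sketched below. The template is ambiguous about whether the square brackets are literal; this sketch treats them as placeholders, as in the JSON templates above, and that reading is an assumption. The helper names are hypothetical.

```python
import os


def format_error_line(timestamp, url, message):
    """Render one errors.log line from its three fields."""
    # Assumption: the brackets in the template are placeholders, not literals.
    return f"{timestamp} {url}: {message}"


def write_error_log(output_dir, records):
    """Write one line per record whose "error" is set; return the line count."""
    lines = [format_error_line(r["timestamp"], r["url"], r["error"])
             for r in records if r.get("error")]
    path = os.path.join(output_dir, "errors.log")
    with open(path, "w", encoding="utf-8") as log:
        for line in lines:
            log.write(line + "\n")
    return len(lines)
```

Opening the file in write mode creates errors.log even when no request failed, so the file exists for every run.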

Requirements:

Use only urllib.request for HTTP requests (no requests library)
Apart from urllib.request, use only these standard library modules: sys, json, time, datetime, os, re
For word counting, consider a word as any sequence of alphanumeric characters
If a request fails (connection error, timeout, etc.), record the error and continue
Set a timeout of 10 seconds for each request
If the response's Content-Type header contains “text”, perform the word count; otherwise set word_count to null
All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
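
The word-count and timestamp rules can be sketched with re and datetime. Several details here are assumptions rather than requirements: "alphanumeric" is read as ASCII letters and digits, the body is decoded as UTF-8 with undecodable bytes replaced, the Content-Type comparison ignores case, and timestamps carry microseconds. All function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumption: "alphanumeric" means ASCII letters and digits only.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")


def count_words(text):
    """Count maximal runs of alphanumeric characters in `text`."""
    return len(WORD_PATTERN.findall(text))


def word_count_for(body, content_type):
    """Return the word count for text responses, otherwise None."""
    if "text" not in (content_type or "").lower():
        return None
    # Assumption: decode as UTF-8 and replace undecodable bytes.
    return count_words(body.decode("utf-8", errors="replace"))


def utc_timestamp(moment=None):
    """Format a timezone-aware datetime (default: now) as ISO-8601 with 'Z'."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
```

Under this reading, "foo_bar" counts as two words because the underscore is not alphanumeric.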

Part B: Dockerfile

Create a Dockerfile that packages your Python application.

FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]

The Dockerfile must:

Use python:3.11-slim as the base image (no other base image permitted)
Set working directory to /app
Copy your script to the container
Create input and output directories at /data/input and /data/output
Use ENTRYPOINT for the Python interpreter and script
Use CMD for default arguments (can be overridden at runtime)

Part C: Build and Run Scripts

Create build.sh:

#!/bin/bash
docker build -t http-fetcher:latest .

Create run.sh:

#!/bin/bash

# Check arguments
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_file> <output_directory>"
    exit 1
fi

INPUT_FILE="$1"
OUTPUT_DIR="$2"

# Check if input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE does not exist"
    exit 1
fi

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Run container
docker run --rm \
    --name http-fetcher \
    -v "$(realpath "$INPUT_FILE")":/data/input/urls.txt:ro \
    -v "$(realpath "$OUTPUT_DIR")":/data/output \
    http-fetcher:latest

Your run.sh script must:

Accept exactly 2 arguments: input file path and output directory path
Verify the input file exists before running the container
Create the output directory if it doesn’t exist
Mount the input file as read-only at /data/input/urls.txt
Mount the output directory at /data/output
Use --rm to remove the container after execution
Use --name http-fetcher for the container name
Use realpath to convert relative paths to absolute paths

Part D: Testing

Create test_urls.txt with the following URLs:

http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com

Your application must handle all these cases correctly:

Successful responses (200)
Delayed responses (a 2-second delay, which must complete within the 10-second timeout)
Client errors (404)
Server errors (500)
JSON responses (Content-Type: application/json)
HTML responses (Content-Type: text/html)
Invalid URLs / DNS failures
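
These cases surface differently in urllib: 4xx and 5xx responses raise urllib.error.HTTPError (which still carries a status code), DNS and connection failures raise urllib.error.URLError, and a timeout while reading raises TimeoutError. The sketch below maps each to a status code and message. The function name describe_failure and the message wording are hypothetical, and treating 4xx/5xx responses as errors to be logged is an assumption.

```python
import urllib.request  # also loads urllib.error, which it imports internally


def describe_failure(exc):
    """Map an exception raised by urlopen to (status_code, error_message)."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, so a status code is available.
        return exc.code, f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        # DNS failures, refused connections, unsupported URL schemes.
        return None, f"URL error: {exc.reason}"
    if isinstance(exc, TimeoutError):
        return None, "Request timed out"
    return None, f"{type(exc).__name__}: {exc}"
```

A caller would wrap the request in try/except Exception, pass the caught exception to describe_failure, record the result, and continue with the next URL.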

Deliverables

Your problem1/ directory must contain exactly:

problem1/
├── fetch_and_process.py
├── Dockerfile
├── build.sh
├── run.sh
└── test_urls.txt

All shell scripts must be executable (chmod +x *.sh).

Validation

We will test your solution by:

Running ./build.sh - must complete without errors
Running ./run.sh test_urls.txt output/ - must complete without errors
Checking that output/responses.json, output/summary.json, and output/errors.log exist
Validating JSON structure and content
Running with different URL lists to verify correctness
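
Before submitting, a local self-check along these lines may help catch structural problems. It is only a sketch under assumptions and does not reproduce the actual grading: the name check_outputs and the specific checks (file presence, required keys, and successful plus failed equal to total) are hypothetical choices based on the templates above.

```python
import json
import os

RESPONSE_KEYS = {"url", "status_code", "response_time_ms", "content_length",
                 "word_count", "timestamp", "error"}
SUMMARY_KEYS = {"total_urls", "successful_requests", "failed_requests",
                "average_response_time_ms", "total_bytes_downloaded",
                "status_code_distribution", "processing_start",
                "processing_end"}


def load_json(path):
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_outputs(output_dir):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for name in ("responses.json", "summary.json", "errors.log"):
        if not os.path.isfile(os.path.join(output_dir, name)):
            problems.append(f"missing file: {name}")
    if problems:
        return problems  # nothing further to inspect

    responses = load_json(os.path.join(output_dir, "responses.json"))
    summary = load_json(os.path.join(output_dir, "summary.json"))

    if not isinstance(responses, list):
        problems.append("responses.json is not a JSON array")
        responses = []
    for index, record in enumerate(responses):
        if not isinstance(record, dict):
            problems.append(f"record {index} is not a JSON object")
            continue
        missing = RESPONSE_KEYS - set(record)
        if missing:
            problems.append(f"record {index} lacks keys: {sorted(missing)}")

    if not isinstance(summary, dict):
        problems.append("summary.json is not a JSON object")
        return problems
    missing = SUMMARY_KEYS - set(summary)
    if missing:
        problems.append(f"summary.json lacks keys: {sorted(missing)}")
    elif (summary["successful_requests"] + summary["failed_requests"]
          != summary["total_urls"]):
        problems.append("successful + failed does not equal total_urls")
    return problems
```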

Your container must not require network configuration beyond Docker defaults. You do not need to add any user configuration beyond the Dockerfile shown in Part B (note that python:3.11-slim runs as root by default unless a USER instruction is added).