Problem 1: Docker Basics – HTTP Data Fetcher

Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.

Part A: Python HTTP Fetcher

Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.

Your script must accept exactly two command line arguments:

  1. Path to an input file containing URLs (one per line)
  2. Path to output directory
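
The argument handling above can be sketched as follows. This is a minimal illustration, not a required structure: the helper names parse_args and read_urls, the exit code 1, and the choice to skip blank lines in the input file are all assumptions.

```python
import sys


def parse_args(argv):
    """Return (input_path, output_dir) from argv, or exit with a usage message."""
    # argv[0] is the script name, so exactly two arguments means len(argv) == 3.
    if len(argv) != 3:
        print(f"Usage: {argv[0]} <input_file> <output_directory>", file=sys.stderr)
        sys.exit(1)
    return argv[1], argv[2]


def read_urls(path):
    """Read one URL per line, ignoring blank lines (an assumption of this sketch)."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

In the script itself this would typically be called as parse_args(sys.argv), followed by os.makedirs(output_dir, exist_ok=True) before any output is written.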

For each URL in the input file, your script must:

  1. Perform an HTTP GET request to the URL
  2. Measure the response time in milliseconds
  3. Capture the HTTP status code
  4. Calculate the size of the response body in bytes
  5. Count the number of words in the response (for text responses only)
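
The per-URL steps above might look like the sketch below. It rests on several assumptions that the assignment text does not settle: the function name fetch_one is hypothetical, fields that cannot be measured for a failed request are left as null, and an HTTP error status such as 404 or 500 is recorded with its status code while "error" stays null. Word counting is left out here to keep the sketch short.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def fetch_one(url, timeout=10):
    """Fetch one URL and return a record shaped like an entry of responses.json."""
    record = {
        "url": url,
        "status_code": None,
        "response_time_ms": None,
        "content_length": None,
        "word_count": None,  # word counting is omitted from this sketch
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "error": None,
    }
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
            record["status_code"] = response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code and a body.
        # HTTPError must be caught before URLError, which is its parent class.
        body = exc.read()
        record["status_code"] = exc.code
    except (urllib.error.URLError, ValueError, OSError) as exc:
        # DNS failures, refused connections, timeouts, malformed URLs.
        record["error"] = str(exc)
        return record
    record["response_time_ms"] = (time.perf_counter() - start) * 1000.0
    record["content_length"] = len(body)
    return record
```

The timer wraps both the request and the body read, so response_time_ms reflects the full download rather than only the time to first byte.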

Your script must write three files to the output directory:

File 1: responses.json - Array of response data:

[
  {
    "url": "[URL string]",
    "status_code": [integer],
    "response_time_ms": [float],
    "content_length": [integer],
    "word_count": [integer or null],
    "timestamp": "[ISO-8601 UTC]",
    "error": [null or error message string]
  },
  ...
]

File 2: summary.json - Aggregate statistics:

{
  "total_urls": [integer],
  "successful_requests": [integer],
  "failed_requests": [integer],
  "average_response_time_ms": [float],
  "total_bytes_downloaded": [integer],
  "status_code_distribution": {
    "200": [count],
    "404": [count],
    ...
  },
  "processing_start": "[ISO-8601 UTC]",
  "processing_end": "[ISO-8601 UTC]"
}
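
One way to derive summary.json from the per-URL records is sketched below. The function name build_summary is hypothetical, and two choices are assumptions rather than requirements: a request counts as successful when its "error" field is null (counting only 2xx responses would be an equally defensible reading), and the average is taken over requests that produced a response time.

```python
def build_summary(records, processing_start, processing_end):
    """Aggregate per-URL records into the summary.json structure."""
    # Assumption: a request is "failed" when its "error" field is not null.
    failed = [r for r in records if r["error"] is not None]
    # Assumption: only requests with a measured time contribute to the average.
    times = [r["response_time_ms"] for r in records
             if r["response_time_ms"] is not None]
    distribution = {}
    for r in records:
        if r["status_code"] is not None:
            key = str(r["status_code"])  # JSON object keys are strings
            distribution[key] = distribution.get(key, 0) + 1
    return {
        "total_urls": len(records),
        "successful_requests": len(records) - len(failed),
        "failed_requests": len(failed),
        "average_response_time_ms": sum(times) / len(times) if times else 0.0,
        "total_bytes_downloaded": sum(r["content_length"] or 0 for r in records),
        "status_code_distribution": distribution,
        "processing_start": processing_start,
        "processing_end": processing_end,
    }
```

The result can be written with json.dump(summary, handle, indent=2) to a summary.json file in the output directory.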

File 3: errors.log - One line per error:

[ISO-8601 UTC] [URL]: [error message]
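
A possible way to produce these lines is sketched below. It assumes the square brackets in the format above are placeholders, as they are in the JSON templates, and are not written literally; if literal brackets are intended, only format_error_line needs to change. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def utc_timestamp(moment=None):
    """Format a datetime as ISO-8601 UTC with a 'Z' suffix."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def format_error_line(timestamp, url, message):
    """Build one errors.log line (brackets treated as placeholders)."""
    return f"{timestamp} {url}: {message}"


def write_error_log(records, path):
    """Write one line for every record whose "error" field is not null."""
    with open(path, "w", encoding="utf-8") as handle:
        for r in records:
            if r["error"] is not None:
                line = format_error_line(r["timestamp"], r["url"], r["error"])
                handle.write(line + "\n")
```

Opening the file in "w" mode means errors.log is created even when no request fails, so the file exists in either case.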

Requirements:

  • Use only urllib.request for HTTP requests (no requests library)
  • Apart from urllib (urllib.request and, for its exception classes, urllib.error), use only these standard library modules: sys, json, time, datetime, os, re
  • For word counting, consider a word as any sequence of alphanumeric characters
  • If a request fails (connection error, timeout, etc.), record the error and continue
  • Set a timeout of 10 seconds for each request
  • If the response Content-Type header contains “text”, perform the word count; otherwise set word_count to null
  • All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
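
The word-count rule can be sketched as below. The function name count_words is hypothetical, and two details are assumptions: "alphanumeric" is read as ASCII letters and digits (so the underscore matched by \w is excluded), and the body is decoded as UTF-8 with undecodable bytes replaced.

```python
import re

# Assumption: a word is a maximal run of ASCII letters and digits.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")


def count_words(body, content_type):
    """Return the word count for text responses, or None otherwise."""
    # The Content-Type header may be missing, so treat None as an empty string.
    if "text" not in (content_type or "").lower():
        return None
    # Assumption: decode as UTF-8 and replace bytes that cannot be decoded.
    text = body.decode("utf-8", errors="replace")
    return len(WORD_PATTERN.findall(text))
```

Under this rule the body "don't stop" counts as three words, because the apostrophe splits "don" and "t", and an application/json response yields null.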

Part B: Dockerfile

Create a Dockerfile that packages your Python application.

FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]

The Dockerfile must:

  • Use python:3.11-slim as the base image (no other base image permitted)
  • Set working directory to /app
  • Copy your script to the container
  • Create input and output directories at /data/input and /data/output
  • Use ENTRYPOINT for the Python interpreter and script
  • Use CMD for default arguments (can be overridden at runtime)

Part C: Build and Run Scripts

Create build.sh:

#!/bin/bash
docker build -t http-fetcher:latest .

Create run.sh:

#!/bin/bash

# Check arguments
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_file> <output_directory>"
    exit 1
fi

INPUT_FILE="$1"
OUTPUT_DIR="$2"

# Check if input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE does not exist"
    exit 1
fi

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Run container
docker run --rm \
    --name http-fetcher \
    -v "$(realpath "$INPUT_FILE")":/data/input/urls.txt:ro \
    -v "$(realpath "$OUTPUT_DIR")":/data/output \
    http-fetcher:latest

Your run.sh script must:

  • Accept exactly 2 arguments: input file path and output directory path
  • Verify the input file exists before running the container
  • Create the output directory if it doesn’t exist
  • Mount the input file as read-only at /data/input/urls.txt
  • Mount the output directory at /data/output
  • Use --rm to remove container after execution
  • Use --name http-fetcher for the container name
  • Use realpath to convert relative paths to absolute paths

Part D: Testing

Create test_urls.txt with the following URLs:

http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com

Your application must handle all these cases correctly:

  • Successful responses (200)
  • Delayed responses (the 2-second delay is within the 10-second timeout, so the request should still complete)
  • Client errors (404)
  • Server errors (500)
  • JSON responses (Content-Type: application/json)
  • HTML responses (Content-Type: text/html)
  • Invalid URLs / DNS failures

Deliverables

Your problem1/ directory must contain exactly:

problem1/
├── fetch_and_process.py
├── Dockerfile
├── build.sh
├── run.sh
└── test_urls.txt

All shell scripts must be executable (chmod +x *.sh).

Validation

We will test your solution by:

  1. Running ./build.sh - must complete without errors
  2. Running ./run.sh test_urls.txt output/ - must complete without errors
  3. Checking that output/responses.json, output/summary.json, and output/errors.log exist
  4. Validating JSON structure and content
  5. Running with different URL lists to verify correctness
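
Before submitting, you may want to check the output structure yourself. The sketch below is a hypothetical self-check, not the validation used for grading: the function name check_outputs and the specific checks are assumptions based only on the file formats described in Part A.

```python
RESPONSE_KEYS = {
    "url", "status_code", "response_time_ms", "content_length",
    "word_count", "timestamp", "error",
}
SUMMARY_KEYS = {
    "total_urls", "successful_requests", "failed_requests",
    "average_response_time_ms", "total_bytes_downloaded",
    "status_code_distribution", "processing_start", "processing_end",
}


def check_outputs(responses, summary):
    """Return a list of problems found; an empty list means all checks passed."""
    if not isinstance(responses, list):
        return ["responses.json must contain a JSON array"]
    if not isinstance(summary, dict):
        return ["summary.json must contain a JSON object"]
    problems = []
    for index, entry in enumerate(responses):
        if not isinstance(entry, dict):
            problems.append(f"entry {index} is not an object")
            continue
        missing = RESPONSE_KEYS - set(entry)
        if missing:
            problems.append(f"entry {index} is missing {sorted(missing)}")
        elif not str(entry["timestamp"]).endswith("Z"):
            problems.append(f"entry {index} timestamp lacks the 'Z' suffix")
    missing = SUMMARY_KEYS - set(summary)
    if missing:
        problems.append(f"summary is missing {sorted(missing)}")
    elif summary["total_urls"] != len(responses):
        problems.append("total_urls does not match the number of responses")
    return problems
```

To use it, load output/responses.json and output/summary.json with json.load and pass the two parsed objects to check_outputs.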

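Your container must not require network configuration beyond Docker defaults. Your container must not run as root user. Note that the python:3.11-slim image runs as root by default, so meeting this requirement takes a USER instruction in the Dockerfile (the example Dockerfile above does not include one).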