Problem 1: Docker Basics – HTTP Data Fetcher
Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.
Part A: Python HTTP Fetcher
Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.
Your script must accept exactly two command line arguments:
- Path to an input file containing URLs (one per line)
- Path to output directory
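The argument handling above could be sketched as follows. The function names parse_args and read_urls are hypothetical, and skipping blank lines in the input file is an assumption the assignment does not state.

```python
import sys


def parse_args(argv):
    """Return (input_path, output_dir), or exit with a usage message."""
    # argv[0] is the script name, so exactly two arguments means len(argv) == 3.
    if len(argv) != 3:
        print(f"Usage: {argv[0]} <input_file> <output_directory>", file=sys.stderr)
        sys.exit(1)
    return argv[1], argv[2]


def read_urls(path):
    """Read one URL per line; blank lines are skipped (an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

A script would typically call parse_args(sys.argv) once at startup and pass the first returned value to read_urls.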
For each URL in the input file, your script must:
- Perform an HTTP GET request to the URL
- Measure the response time in milliseconds
- Capture the HTTP status code
- Calculate the size of the response body in bytes
- Count the number of words in the response (for text responses only)
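One way to approach the per-URL steps is sketched below. The function name fetch_one and the shape of the returned dict are hypothetical. Note that urllib.request.urlopen raises HTTPError for 4xx and 5xx responses; reporting those statuses as errors as well, and importing urllib.error to catch them, are assumptions rather than something the assignment specifies.

```python
import time
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 10  # the per-request timeout stated in the requirements


def fetch_one(url):
    """Fetch one URL and return a dict of raw measurements (hypothetical shape)."""
    start = time.perf_counter()
    status_code, body, content_type, error = None, b"", "", None
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            body = response.read()
            status_code = response.status
            content_type = response.headers.get("Content-Type", "")
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses; the exception still
        # exposes the status code, headers, and body.
        status_code = exc.code
        body = exc.read()
        content_type = exc.headers.get("Content-Type", "")
        # Assumption: HTTP error statuses are also reported as errors.
        error = f"HTTP Error {exc.code}: {exc.reason}"
    except Exception as exc:
        # DNS failures, refused connections, timeouts, malformed URLs, etc.
        error = str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "url": url,
        "status_code": status_code,
        "response_time_ms": elapsed_ms,
        "content_length": len(body),
        "content_type": content_type,  # helper field, not part of responses.json
        "body": body,                  # helper field, needed for word counting
        "error": error,
    }
```

Because every failure path is caught, a loop over the input URLs can call fetch_one for each entry and continue regardless of the outcome.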
Your script must write three files to the output directory:
File 1: responses.json - Array of response data:
[
  {
    "url": "[URL string]",
    "status_code": [integer],
    "response_time_ms": [float],
    "content_length": [integer],
    "word_count": [integer or null],
    "timestamp": "[ISO-8601 UTC]",
    "error": [null or error message string]
  },
  ...
]
File 2: summary.json - Aggregate statistics:
{
  "total_urls": [integer],
  "successful_requests": [integer],
  "failed_requests": [integer],
  "average_response_time_ms": [float],
  "total_bytes_downloaded": [integer],
  "status_code_distribution": {
    "200": [count],
    "404": [count],
    ...
  },
  "processing_start": "[ISO-8601 UTC]",
  "processing_end": "[ISO-8601 UTC]"
}
File 3: errors.log - One line per error:
[ISO-8601 UTC] [URL]: [error message]
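The summary and the error log can both be derived from the per-URL records. The sketch below is one possible reading, with several assumptions: a request counts as successful when its error field is null, the average response time is taken over successful requests only, and the square brackets in the errors.log format are read as placeholders (as in the JSON templates) rather than literal characters. The function names are hypothetical.

```python
import json


def build_summary(records, processing_start, processing_end):
    """Aggregate per-URL records into the summary.json structure."""
    # Assumption: "successful" means no error was recorded for the request.
    successful = [r for r in records if r["error"] is None]
    distribution = {}
    for record in records:
        if record["status_code"] is not None:
            key = str(record["status_code"])
            distribution[key] = distribution.get(key, 0) + 1
    # Assumption: the average covers successful requests only.
    times = [r["response_time_ms"] for r in successful]
    return {
        "total_urls": len(records),
        "successful_requests": len(successful),
        "failed_requests": len(records) - len(successful),
        "average_response_time_ms": sum(times) / len(times) if times else 0.0,
        "total_bytes_downloaded": sum(r["content_length"] for r in records),
        "status_code_distribution": distribution,
        "processing_start": processing_start,
        "processing_end": processing_end,
    }


def error_lines(records):
    """Format one errors.log line per failed record."""
    return [
        f"{r['timestamp']} {r['url']}: {r['error']}"
        for r in records
        if r["error"] is not None
    ]
```

The summary dict can then be written with json.dump(summary, handle, indent=2), and the error lines joined with newlines.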
Requirements:
- Use only urllib.request for HTTP requests (no requests library)
- Use only standard library modules: sys, json, time, datetime, os, re
- For word counting, consider a word as any sequence of alphanumeric characters
- If a request fails (connection error, timeout, etc.), record the error and continue
- Set a timeout of 10 seconds for each request
- If the response's Content-Type header contains “text”, perform the word count; otherwise set word_count to null
- All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
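The word-count and timestamp requirements above might look like this in code. The helper names are hypothetical. Treating "alphanumeric" as ASCII letters and digits, matching "text" case-insensitively, and decoding the body as UTF-8 are all assumptions.

```python
import re
from datetime import datetime, timezone

# Assumption: "alphanumeric" means ASCII letters and digits.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")


def count_words(body, content_type):
    """Return the word count for text responses, or None otherwise."""
    if "text" not in content_type.lower():
        return None
    # Assumption: decode as UTF-8 and replace undecodable bytes.
    text = body.decode("utf-8", errors="replace")
    return len(WORD_PATTERN.findall(text))


def utc_timestamp():
    """Return the current UTC time in ISO-8601 form with a 'Z' suffix."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Under these assumptions a JSON response (Content-Type: application/json) yields None, which json.dump serializes as null.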
Part B: Dockerfile
Create a Dockerfile that packages your Python application.
FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]
The Dockerfile must:
- Use python:3.11-slim as the base image (no other base image permitted)
- Set working directory to /app
- Copy your script to the container
- Create input and output directories at /data/input and /data/output
- Use ENTRYPOINT for the Python interpreter and script
- Use CMD for default arguments (can be overridden at runtime)
Part C: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t http-fetcher:latest .
Create run.sh:
#!/bin/bash
# Check arguments
if [ $# -ne 2 ]; then
    echo "Usage: $0 <input_file> <output_directory>"
    exit 1
fi
INPUT_FILE="$1"
OUTPUT_DIR="$2"
# Check if input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE does not exist"
    exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
# Run container (paths are quoted inside realpath so that spaces are handled)
docker run --rm \
    --name http-fetcher \
    -v "$(realpath "$INPUT_FILE")":/data/input/urls.txt:ro \
    -v "$(realpath "$OUTPUT_DIR")":/data/output \
    http-fetcher:latest
Your run.sh script must:
- Accept exactly 2 arguments: input file path and output directory path
- Verify the input file exists before running the container
- Create the output directory if it doesn’t exist
- Mount the input file as read-only at /data/input/urls.txt
- Mount the output directory at /data/output
- Use --rm to remove container after execution
- Use --name http-fetcher for the container name
- Use realpath to convert relative paths to absolute paths
Part D: Testing
Create test_urls.txt with the following URLs:
http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com
Your application must handle all these cases correctly:
- Successful responses (200)
- Delayed responses (testing timeout behavior)
- Client errors (404)
- Server errors (500)
- JSON responses (Content-Type: application/json)
- HTML responses (Content-Type: text/html)
- Invalid URLs / DNS failures
Deliverables
Your problem1/ directory must contain exactly:
problem1/
├── fetch_and_process.py
├── Dockerfile
├── build.sh
├── run.sh
└── test_urls.txt
All shell scripts must be executable (chmod +x *.sh).
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh test_urls.txt output/ - must complete without errors
- Checking that output/responses.json, output/summary.json, and output/errors.log exist
- Validating JSON structure and content
- Running with different URL lists to verify correctness
Your container must not require network configuration beyond Docker defaults. Your container must not run as root user (the python:3.11-slim image already handles this correctly).
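Before submitting, a local self-check along the lines of the validation steps may be useful. The sketch below is not the graders' actual test; it only checks that the three output files exist and that the JSON files carry the keys listed in Part A. The function name check_outputs is hypothetical.

```python
import json
import os

RESPONSE_KEYS = {
    "url", "status_code", "response_time_ms", "content_length",
    "word_count", "timestamp", "error",
}
SUMMARY_KEYS = {
    "total_urls", "successful_requests", "failed_requests",
    "average_response_time_ms", "total_bytes_downloaded",
    "status_code_distribution", "processing_start", "processing_end",
}


def check_outputs(output_dir):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for name in ("responses.json", "summary.json", "errors.log"):
        if not os.path.isfile(os.path.join(output_dir, name)):
            problems.append(f"missing {name}")
    if problems:
        return problems

    with open(os.path.join(output_dir, "responses.json"), encoding="utf-8") as handle:
        responses = json.load(handle)
    with open(os.path.join(output_dir, "summary.json"), encoding="utf-8") as handle:
        summary = json.load(handle)

    if not isinstance(responses, list):
        problems.append("responses.json is not an array")
    else:
        for index, record in enumerate(responses):
            keys = set(record) if isinstance(record, dict) else set()
            missing = RESPONSE_KEYS - keys
            if missing:
                problems.append(f"responses[{index}] missing keys: {sorted(missing)}")

    summary_keys = set(summary) if isinstance(summary, dict) else set()
    missing = SUMMARY_KEYS - summary_keys
    if missing:
        problems.append(f"summary.json missing keys: {sorted(missing)}")
    return problems
```

After ./run.sh test_urls.txt output/ has finished, calling check_outputs("output") and printing the returned list gives a quick indication of structural problems.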