Problem 1: HTTP API Server for ArXiv Papers
Use only Python standard library modules:
- http.server (standard library)
- urllib.parse (standard library)
- json (standard library)
- re (standard library)
- sys, os, datetime (standard library)
Do not use flask, django, fastapi, requests, or any other web frameworks or external HTTP libraries.
Build a containerized HTTP server that serves ArXiv paper metadata from your HW#1 Problem 2 output.
Part A: Data Source
Your server must load ArXiv paper data from HW#1 Problem 2 output files:
- papers.json - Array of paper metadata with abstracts and statistics
- corpus_analysis.json - Global corpus analysis with word frequencies
Part B: HTTP Server Implementation
Create arxiv_server.py that implements a basic HTTP server with the following endpoints:
Required Endpoints:
GET /papers - Return list of all papers

    [
      {
        "arxiv_id": "2301.12345",
        "title": "Paper Title",
        "authors": ["Author One", "Author Two"],
        "categories": ["cs.LG", "cs.AI"]
      },
      ...
    ]

GET /papers/{arxiv_id} - Return full paper details

    {
      "arxiv_id": "2301.12345",
      "title": "Paper Title",
      "authors": ["Author One", "Author Two"],
      "abstract": "Full abstract text...",
      "categories": ["cs.LG", "cs.AI"],
      "published": "2023-01-15T10:30:00Z",
      "abstract_stats": {
        "total_words": 150,
        "unique_words": 85,
        "total_sentences": 8
      }
    }

GET /search?q={query} - Search papers by title and abstract

    {
      "query": "machine learning",
      "results": [
        {
          "arxiv_id": "2301.12345",
          "title": "Paper Title",
          "match_score": 3,
          "matches_in": ["title", "abstract"]
        }
      ]
    }

GET /stats - Return corpus statistics

    {
      "total_papers": 20,
      "total_words": 15000,
      "unique_words": 2500,
      "top_10_words": [
        {"word": "model", "frequency": 145},
        {"word": "data", "frequency": 132}
      ],
      "category_distribution": {
        "cs.LG": 12,
        "cs.AI": 8
      }
    }
Error Handling:
- Return HTTP 404 for unknown paper IDs or invalid endpoints
- Return HTTP 400 for malformed search queries
- Return HTTP 500 for server errors with JSON error message
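A minimal sketch of the routing and error-handling shape using only the standard library (the in-memory PAPERS dict and the ArxivHandler name are illustrative placeholders, not part of the assignment):

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse

# Placeholder data store; the real server would load papers.json at startup.
PAPERS = {"2301.12345": {"arxiv_id": "2301.12345", "title": "Paper Title"}}

class ArxivHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        """Serialize payload and send it with the given HTTP status."""
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        try:
            path = urlparse(self.path).path
            if path == "/papers":
                self._send_json(200, list(PAPERS.values()))
            elif path.startswith("/papers/"):
                arxiv_id = path[len("/papers/"):]
                if arxiv_id in PAPERS:
                    self._send_json(200, PAPERS[arxiv_id])
                else:
                    self._send_json(404, {"error": f"unknown paper id: {arxiv_id}"})
            else:
                self._send_json(404, {"error": "unknown endpoint"})
        except Exception as exc:  # any unexpected failure -> 500 with a JSON body
            self._send_json(500, {"error": str(exc)})
```

ThreadingHTTPServer (rather than plain HTTPServer) is one way to satisfy the concurrent-request requirement, since it handles each request in its own thread.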
Part C: Implementation Requirements
Your server must:
Command Line Arguments: Accept port number as argument (default 8080)
python arxiv_server.py [port]

Data Loading: Load JSON data at startup, handle missing files gracefully
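These two requirements can be handled in a few lines of startup code; a sketch (the helper names parse_port and load_json are illustrative, not required):

```python
import json
import os
import sys

def parse_port(argv, default=8080):
    """Return the port from argv[1] if present, else the default."""
    if len(argv) > 1:
        try:
            return int(argv[1])
        except ValueError:
            print(f"Invalid port: {argv[1]}", file=sys.stderr)
            sys.exit(1)
    return default

def load_json(path):
    """Load a JSON file, returning None (with a warning) if it is missing."""
    if not os.path.exists(path):
        print(f"Warning: {path} not found; continuing without it")
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```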
Search Implementation:
- Case-insensitive search in titles and abstracts
- Count term frequency as match score
- Support multi-word queries (search for all terms)
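The three search rules above can be combined into one scoring function; a sketch (the function name and return shape beyond match_score/matches_in are illustrative):

```python
def score_paper(paper, query):
    """Score a paper against a multi-word query.

    Every term must appear somewhere in the title or abstract
    (case-insensitively); the score is the total number of term
    occurrences across both fields. Returns None if any term is missing.
    """
    terms = query.lower().split()
    fields = {"title": paper.get("title", "").lower(),
              "abstract": paper.get("abstract", "").lower()}
    score = 0
    matches_in = set()
    for term in terms:
        term_count = 0
        for name, text in fields.items():
            count = text.count(term)
            if count:
                matches_in.add(name)
                term_count += count
        if term_count == 0:
            return None  # one missing term disqualifies the paper
        score += term_count
    return {"match_score": score, "matches_in": sorted(matches_in)}
```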
Logging: Print requests to stdout in format:
[2025-09-16 14:30:22] GET /papers - 200 OK (15 results)
[2025-09-16 14:30:25] GET /papers/invalid-id - 404 Not Found
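One way to produce that line (the format_log helper is an illustrative name; the timestamp format matches the examples above):

```python
from datetime import datetime

def format_log(method, path, status, reason, detail=None):
    """Build one request-log line in the assignment's format."""
    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    line = f"[{ts}] {method} {path} - {status} {reason}"
    if detail:
        line += f" ({detail})"
    return line
```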
Part D: Dockerfile
Create a Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_server.py /app/
COPY sample_data/ /app/sample_data/
EXPOSE 8080
ENTRYPOINT ["python", "/app/arxiv_server.py"]
CMD ["8080"]

Part E: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t arxiv-server:latest .

Create run.sh:
#!/bin/bash
# Check for port argument
PORT=${1:-8080}

# Validate port is numeric
if ! [[ "$PORT" =~ ^[0-9]+$ ]]; then
    echo "Error: Port must be numeric"
    exit 1
fi

# Check port range
if [ "$PORT" -lt 1024 ] || [ "$PORT" -gt 65535 ]; then
    echo "Error: Port must be between 1024 and 65535"
    exit 1
fi

echo "Starting ArXiv API server on port $PORT"
echo "Access at: http://localhost:$PORT"
echo ""
echo "Available endpoints:"
echo "  GET /papers"
echo "  GET /papers/{arxiv_id}"
echo "  GET /search?q={query}"
echo "  GET /stats"
echo ""

# Run container
docker run --rm \
    --name arxiv-server \
    -p "$PORT:8080" \
    arxiv-server:latest

Part F: Testing
Create test.sh:
#!/bin/bash
# Start server in background
./run.sh 8081 &
SERVER_PID=$!

# Wait for startup
echo "Waiting for server startup..."
sleep 3

# Test endpoints
echo "Testing /papers endpoint..."
curl -s http://localhost:8081/papers | python -m json.tool > /dev/null
if [ $? -eq 0 ]; then
    echo "[PASS] /papers endpoint working"
else
    echo "[FAIL] /papers endpoint failed"
fi

echo "Testing /stats endpoint..."
curl -s http://localhost:8081/stats | python -m json.tool > /dev/null
if [ $? -eq 0 ]; then
    echo "[PASS] /stats endpoint working"
else
    echo "[FAIL] /stats endpoint failed"
fi

echo "Testing search endpoint..."
curl -s "http://localhost:8081/search?q=machine" | python -m json.tool > /dev/null
if [ $? -eq 0 ]; then
    echo "[PASS] /search endpoint working"
else
    echo "[FAIL] /search endpoint failed"
fi

echo "Testing 404 handling..."
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8081/invalid)
if [ "$RESPONSE" = "404" ]; then
    echo "[PASS] 404 handling working"
else
    echo "[FAIL] 404 handling failed (got $RESPONSE)"
fi

# Cleanup: stop the container explicitly; killing the run.sh shell
# alone would leave the docker container running
docker stop arxiv-server > /dev/null 2>&1
kill $SERVER_PID 2>/dev/null
echo "Tests complete"

Deliverables
Your problem1/ directory must contain:
problem1/
├── arxiv_server.py
├── Dockerfile
├── build.sh
├── run.sh
├── test.sh
└── sample_data/
└── papers.json
All shell scripts must be executable (chmod +x *.sh).
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh 9000 - server must start on port 9000
- Testing all four endpoints with various queries
- Verifying JSON response structure matches specification
- Testing error handling for invalid requests
- Running concurrent requests to test stability
Your server must handle at least 10 concurrent requests without errors and respond to all endpoints within 2 seconds under normal load.
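A self-contained way to smoke-test the concurrency requirement before grading (PingHandler is a trivial stand-in for your real server; the hammer helper name is illustrative):

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen

class PingHandler(BaseHTTPRequestHandler):
    """Stand-in handler that always returns a small JSON body."""
    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def hammer(url, n=10):
    """Issue n concurrent GETs and return the list of status codes."""
    def fetch(_):
        with urlopen(url) as resp:
            return resp.status
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(fetch, range(n)))

# Spin up the stand-in server on an ephemeral port and fire 10 requests at it.
server = ThreadingHTTPServer(("127.0.0.1", 0), PingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
codes = hammer(f"http://127.0.0.1:{server.server_address[1]}/papers", 10)
server.shutdown()
```

Pointing hammer at http://localhost:8080/papers with your container running gives a quick pass/fail signal: all ten codes should be 200.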