Problem 1: HTTP API Server for ArXiv Papers

Requirements

Use only Python standard library modules:

  • http.server
  • urllib.parse
  • json
  • re
  • sys, os, datetime

Do not use flask, django, fastapi, requests, or any other web frameworks or external HTTP libraries.

Build a containerized HTTP server that serves ArXiv paper metadata from your HW#1 Problem 2 output.

Part A: Data Source

Your server must load ArXiv paper data from HW#1 Problem 2 output files:

  • papers.json - Array of paper metadata with abstracts and statistics
  • corpus_analysis.json - Global corpus analysis with word frequencies

Part B: HTTP Server Implementation

Create arxiv_server.py that implements a basic HTTP server with the following endpoints:

Required Endpoints:

  1. GET /papers - Return list of all papers

    [
      {
        "arxiv_id": "2301.12345",
        "title": "Paper Title",
        "authors": ["Author One", "Author Two"],
        "categories": ["cs.LG", "cs.AI"]
      },
      ...
    ]
  2. GET /papers/{arxiv_id} - Return full paper details

    {
      "arxiv_id": "2301.12345",
      "title": "Paper Title",
      "authors": ["Author One", "Author Two"],
      "abstract": "Full abstract text...",
      "categories": ["cs.LG", "cs.AI"],
      "published": "2023-01-15T10:30:00Z",
      "abstract_stats": {
        "total_words": 150,
        "unique_words": 85,
        "total_sentences": 8
      }
    }
  3. GET /search?q={query} - Search papers by title and abstract

    {
      "query": "machine learning",
      "results": [
        {
          "arxiv_id": "2301.12345",
          "title": "Paper Title",
          "match_score": 3,
          "matches_in": ["title", "abstract"]
        }
      ]
    }
  4. GET /stats - Return corpus statistics

    {
      "total_papers": 20,
      "total_words": 15000,
      "unique_words": 2500,
      "top_10_words": [
        {"word": "model", "frequency": 145},
        {"word": "data", "frequency": 132}
      ],
      "category_distribution": {
        "cs.LG": 12,
        "cs.AI": 8
      }
    }

Error Handling:

  • Return HTTP 404 for unknown paper IDs or invalid endpoints
  • Return HTTP 400 for malformed search queries
  • Return HTTP 500 for server errors with JSON error message
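One way to wire the endpoints and error responses together using only the standard library (the handler class, the `send_json` helper, and the placeholder bodies are illustrative; `/search` scoring and `/stats` content are omitted here):

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ArxivHandler(BaseHTTPRequestHandler):
    papers = []  # populated from papers.json at startup

    def send_json(self, status, payload):
        """Serialize payload and send it with the given HTTP status."""
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/papers":
            self.send_json(200, self.papers)
        elif re.fullmatch(r"/papers/[^/]+", parsed.path):
            arxiv_id = parsed.path.rsplit("/", 1)[1]
            paper = next((p for p in self.papers
                          if p.get("arxiv_id") == arxiv_id), None)
            if paper is None:
                self.send_json(404, {"error": f"Paper {arxiv_id} not found"})
            else:
                self.send_json(200, paper)
        elif parsed.path == "/search":
            query = parse_qs(parsed.query).get("q", [""])[0].strip()
            if not query:
                self.send_json(400, {"error": "Missing or empty query parameter q"})
            else:
                self.send_json(200, {"query": query, "results": []})  # scoring omitted
        elif parsed.path == "/stats":
            self.send_json(200, {})  # fill in from corpus_analysis.json
        else:
            self.send_json(404, {"error": "Unknown endpoint"})
```

Centralizing serialization in one helper keeps every branch, including the error paths, returning consistent JSON.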

Part C: Implementation Requirements

Your server must:

  1. Command Line Arguments: Accept port number as argument (default 8080)

    python arxiv_server.py [port]
  2. Data Loading: Load JSON data at startup, handle missing files gracefully

  3. Search Implementation:

    • Case-insensitive search in titles and abstracts
    • Count term frequency as match score
    • Support multi-word queries (search for all terms)
  4. Logging: Print requests to stdout in format:

    [2025-09-16 14:30:22] GET /papers - 200 OK (15 results)
    [2025-09-16 14:30:25] GET /papers/invalid-id - 404 Not Found
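The scoring rule above (total term frequency across title and abstract, all terms required, case-insensitive) might be sketched as follows; the helper name and return shape are illustrative:

```python
def score_paper(paper, query):
    """Count occurrences of each query term in the title and abstract.

    Returns (match_score, matches_in). A paper matches only if every
    term appears at least once somewhere (multi-word AND semantics).
    """
    terms = query.lower().split()
    title = paper.get("title", "").lower()
    abstract = paper.get("abstract", "").lower()
    score = 0
    matches_in = set()
    for term in terms:
        in_title = title.count(term)
        in_abstract = abstract.count(term)
        if in_title + in_abstract == 0:
            return 0, []  # one term missing -> no match at all
        score += in_title + in_abstract
        if in_title:
            matches_in.add("title")
        if in_abstract:
            matches_in.add("abstract")
    return score, sorted(matches_in)
```

Note that substring counting also matches inside longer words (e.g. "learn" inside "learning"); whether to count whole words only is a design choice left to you.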

Part D: Dockerfile

Create a Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY arxiv_server.py /app/
COPY sample_data/ /app/sample_data/
EXPOSE 8080
ENTRYPOINT ["python", "/app/arxiv_server.py"]
CMD ["8080"]

Part E: Build and Run Scripts

Create build.sh:

#!/bin/bash
docker build -t arxiv-server:latest .

Create run.sh:

#!/bin/bash

# Check for port argument
PORT=${1:-8080}

# Validate port is numeric
if ! [[ "$PORT" =~ ^[0-9]+$ ]]; then
    echo "Error: Port must be numeric"
    exit 1
fi

# Check port range
if [ "$PORT" -lt 1024 ] || [ "$PORT" -gt 65535 ]; then
    echo "Error: Port must be between 1024 and 65535"
    exit 1
fi

echo "Starting ArXiv API server on port $PORT"
echo "Access at: http://localhost:$PORT"
echo ""
echo "Available endpoints:"
echo "  GET /papers"
echo "  GET /papers/{arxiv_id}"
echo "  GET /search?q={query}"
echo "  GET /stats"
echo ""

# Run container
docker run --rm \
    --name arxiv-server \
    -p "$PORT:8080" \
    arxiv-server:latest

Part F: Testing

Create test.sh:

#!/bin/bash

# Start server in background
./run.sh 8081 &
SERVER_PID=$!

# Wait for startup
echo "Waiting for server startup..."
sleep 3

# Test endpoints
echo "Testing /papers endpoint..."
curl -s http://localhost:8081/papers | python3 -m json.tool > /dev/null
if [ $? -eq 0 ]; then
    echo "[PASS] /papers endpoint working"
else
    echo "[FAIL] /papers endpoint failed"
fi

echo "Testing /stats endpoint..."
curl -s http://localhost:8081/stats | python3 -m json.tool > /dev/null
if [ $? -eq 0 ]; then
    echo "[PASS] /stats endpoint working"
else
    echo "[FAIL] /stats endpoint failed"
fi

echo "Testing search endpoint..."
curl -s "http://localhost:8081/search?q=machine" | python3 -m json.tool > /dev/null
if [ $? -eq 0 ]; then
    echo "[PASS] /search endpoint working"
else
    echo "[FAIL] /search endpoint failed"
fi

echo "Testing 404 handling..."
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8081/invalid)
if [ "$RESPONSE" = "404" ]; then
    echo "[PASS] 404 handling working"
else
    echo "[FAIL] 404 handling failed (got $RESPONSE)"
fi

# Cleanup: stop the container explicitly -- killing the background
# run.sh shell alone may leave the docker container running
docker stop arxiv-server > /dev/null 2>&1
kill $SERVER_PID 2>/dev/null
echo "Tests complete"

Deliverables

Your problem1/ directory must contain:

problem1/
├── arxiv_server.py
├── Dockerfile
├── build.sh
├── run.sh
├── test.sh
└── sample_data/
    └── papers.json

All shell scripts must be executable (chmod +x *.sh).

Validation

We will test your solution by:

  1. Running ./build.sh - must complete without errors
  2. Running ./run.sh 9000 - server must start on port 9000
  3. Testing all four endpoints with various queries
  4. Verifying JSON response structure matches the specification
  5. Testing error handling for invalid requests
  6. Running concurrent requests to test stability

Your server must handle at least 10 concurrent requests without errors and respond on all endpoints within 2 seconds under normal load.
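The standard library's ThreadingHTTPServer (http.server, Python 3.7+) serves each request in its own thread, which is usually enough to meet the concurrency requirement without any external dependencies. A sketch, assuming a handler class named ArxivHandler (the placeholder body here just returns an empty JSON object):

```python
import sys
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class ArxivHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder; real routing for /papers, /search, /stats goes here.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b"{}")

def run(port=8080):
    """Bind on all interfaces (required inside the container) and serve."""
    server = ThreadingHTTPServer(("0.0.0.0", port), ArxivHandler)
    print(f"Serving on port {port}")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        server.shutdown()
```

Calling `run(int(sys.argv[1]) if len(sys.argv) > 1 else 8080)` from the entry point also covers the port-argument requirement from Part C.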