Problem 2: ArXiv Paper Metadata Processor

Build a containerized application that fetches paper metadata from the ArXiv API, processes it, and generates structured output.

Part A: ArXiv API Client

Create a file arxiv_processor.py that queries the ArXiv API and extracts paper metadata.

Your script must accept exactly three command line arguments:

  1. Search query string (e.g., “cat:cs.LG” for machine learning papers)
  2. Maximum number of results to fetch (integer between 1 and 100)
  3. Path to output directory

Your script must perform the following operations:

  1. Query the ArXiv API using the search query
  2. Fetch up to the specified maximum number of results
  3. Extract and process metadata for each paper
  4. Generate text analysis statistics
  5. Write structured output files

ArXiv API endpoint: http://export.arxiv.org/api/query

Query parameters:

  • search_query: Your search string
  • start: Starting index (0-based)
  • max_results: Maximum results to return

Example API call:

http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=10

The API returns an Atom XML feed. You must parse this XML to extract the following fields (a parsing sketch follows the list):

  • Paper ID (from the <id> tag, extract just the ID portion after the last ‘/’)
  • Title (from <title>)
  • Authors (from all <author><name> tags)
  • Abstract (from <summary>)
  • Categories (from all <category> tags’ term attribute)
  • Published date (from <published>)
  • Updated date (from <updated>)
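
The feed is namespace-qualified Atom, so xml.etree.ElementTree lookups need the namespace on every tag. The sketch below shows one possible way to fetch and parse a response; the names fetch_feed, parse_entry, and parse_feed are illustrative, not required, and it uses urllib.parse.urlencode to URL-encode the query (part of the urllib package, though not listed verbatim in the Requirements section).

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # ArXiv responses use the Atom namespace

def fetch_feed(query, max_results):
    # Spaces and special characters in the query must be URL-encoded.
    params = urllib.parse.urlencode({
        "search_query": query,
        "start": 0,
        "max_results": max_results,
    })
    url = "http://export.arxiv.org/api/query?" + params
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")

def parse_entry(entry):
    # <id> looks like http://arxiv.org/abs/2301.00001v1; keep the part after the last '/'.
    raw_id = entry.findtext(ATOM + "id", default="")
    return {
        "arxiv_id": raw_id.rsplit("/", 1)[-1],
        "title": (entry.findtext(ATOM + "title") or "").strip(),
        "authors": [a.findtext(ATOM + "name", default="").strip()
                    for a in entry.findall(ATOM + "author")],
        "abstract": (entry.findtext(ATOM + "summary") or "").strip(),
        "categories": [c.get("term") for c in entry.findall(ATOM + "category")],
        "published": entry.findtext(ATOM + "published", default=""),
        "updated": entry.findtext(ATOM + "updated", default=""),
    }

def parse_feed(xml_text):
    root = ET.fromstring(xml_text)
    return [parse_entry(e) for e in root.findall(ATOM + "entry")]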

Part B: Text Processing

For each paper’s abstract, compute the following:

  1. Word frequency analysis:

    • Total word count
    • Unique word count
    • Top 20 most frequent words (excluding stopwords)
    • Average word length
  2. Sentence analysis:

    • Total sentence count (split on ‘.’, ‘!’, ‘?’)
    • Average words per sentence
    • Longest sentence (by word count)
    • Shortest sentence (by word count)
  3. Technical term extraction:

    • Extract all words containing uppercase letters (e.g., “LSTM”, “GPU”)
    • Extract all words containing numbers (e.g., “3D”, “ResNet50”)
    • Extract all hyphenated terms (e.g., “state-of-the-art”, “pre-trained”)

Use the following stopwords list (a sketch combining the computations above appears after it):

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
             'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
             'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
             'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
             'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
             'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
             'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}
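
A minimal sketch combining the three computations above, assuming a simple regex tokenizer that keeps letters, digits, and internal hyphens; how you tokenize (and which edge cases you catch) is up to you, and the function name abstract_stats is illustrative.

import re

def abstract_stats(abstract):
    # Tokenize: letters/digits plus internal hyphens so terms like "pre-trained" survive
    words = re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", abstract)
    lowered = [w.lower() for w in words]

    # Case-insensitive frequency counting, stopwords excluded
    freq = {}
    for w in lowered:
        if w not in STOPWORDS:
            freq[w] = freq.get(w, 0) + 1
    top_20 = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:20]

    sentences = [s.strip() for s in re.split(r"[.!?]", abstract) if s.strip()]
    lengths = [len(s.split()) for s in sentences] or [0]

    return {
        "total_words": len(words),
        "unique_words": len(set(lowered)),
        "top_20_words": top_20,
        "avg_word_length": round(sum(len(w) for w in words) / len(words), 2) if words else 0.0,
        "total_sentences": len(sentences),
        "avg_words_per_sentence": round(len(words) / len(sentences), 2) if sentences else 0.0,
        "longest_sentence_words": max(lengths),
        "shortest_sentence_words": min(lengths),
        # Technical terms preserve their original case, per the Requirements section
        "uppercase_terms": sorted({w for w in words if any(c.isupper() for c in w)}),
        "numeric_terms": sorted({w for w in words if any(c.isdigit() for c in w)}),
        "hyphenated_terms": sorted({w for w in words if "-" in w}),
    }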

Part C: Output Files

Your script must write three files to the output directory:

File 1: papers.json - Array of paper metadata (a record-assembly sketch follows the schema):

[
  {
    "arxiv_id": "[paper ID]",
    "title": "[paper title]",
    "authors": ["author1", "author2", ...],
    "abstract": "[full abstract text]",
    "categories": ["cat1", "cat2", ...],
    "published": "[ISO-8601 UTC]",
    "updated": "[ISO-8601 UTC]",
    "abstract_stats": {
      "total_words": [integer],
      "unique_words": [integer],
      "total_sentences": [integer],
      "avg_words_per_sentence": [float],
      "avg_word_length": [float]
    }
  },
  ...
]
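
One way the per-paper record might be assembled, assuming the illustrative parse_entry and abstract_stats helpers sketched in Parts A and B:

def paper_record(entry_dict, stats):
    # entry_dict comes from the Part A parser, stats from the Part B analysis
    # (both are illustrative helpers, not required names).
    return {
        **entry_dict,
        "abstract_stats": {
            "total_words": stats["total_words"],
            "unique_words": stats["unique_words"],
            "total_sentences": stats["total_sentences"],
            "avg_words_per_sentence": stats["avg_words_per_sentence"],
            "avg_word_length": stats["avg_word_length"],
        },
    }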

File 2: corpus_analysis.json - Aggregate analysis across all papers (an aggregation sketch follows the schema):

{
  "query": "[search query used]",
  "papers_processed": [integer],
  "processing_timestamp": "[ISO-8601 UTC]",
  "corpus_stats": {
    "total_abstracts": [integer],
    "total_words": [integer],
    "unique_words_global": [integer],
    "avg_abstract_length": [float],
    "longest_abstract_words": [integer],
    "shortest_abstract_words": [integer]
  },
  "top_50_words": [
    {"word": "[word1]", "frequency": [count], "documents": [count]},
    ...
  ],
  "technical_terms": {
    "uppercase_terms": ["TERM1", "TERM2", ...],
    "numeric_terms": ["term1", "term2", ...],
    "hyphenated_terms": ["term-1", "term-2", ...]
  },
  "category_distribution": {
    "cs.LG": [count],
    "cs.AI": [count],
    ...
  }
}
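
A sketch of the aggregation, assuming the papers are dicts shaped like the papers.json entries and that the top-50 list excludes stopwords just like the per-abstract top 20 (the spec does not pin this down). It omits processing_timestamp and technical_terms for brevity; build_corpus_analysis and tokenize are illustrative names.

def build_corpus_analysis(query, papers, tokenize):
    # `tokenize` is the same word tokenizer used for the abstracts in Part B.
    word_freq = {}   # word -> total occurrences across all abstracts (stopwords excluded)
    doc_freq = {}    # word -> number of abstracts containing it
    categories = {}  # category -> number of papers
    vocab = set()    # every distinct lowercased word, stopwords included
    lengths = []

    for paper in papers:
        words = [w.lower() for w in tokenize(paper["abstract"])]
        lengths.append(len(words))
        vocab.update(words)
        for w in words:
            if w not in STOPWORDS:
                word_freq[w] = word_freq.get(w, 0) + 1
        for w in set(words):
            if w not in STOPWORDS:
                doc_freq[w] = doc_freq.get(w, 0) + 1
        for cat in paper["categories"]:
            categories[cat] = categories.get(cat, 0) + 1

    top_50 = sorted(word_freq.items(), key=lambda kv: kv[1], reverse=True)[:50]
    return {
        "query": query,
        "papers_processed": len(papers),
        "corpus_stats": {
            "total_abstracts": len(papers),
            "total_words": sum(lengths),
            "unique_words_global": len(vocab),
            "avg_abstract_length": round(sum(lengths) / len(lengths), 2) if lengths else 0.0,
            "longest_abstract_words": max(lengths) if lengths else 0,
            "shortest_abstract_words": min(lengths) if lengths else 0,
        },
        "top_50_words": [{"word": w, "frequency": f, "documents": doc_freq[w]}
                         for w, f in top_50],
        "category_distribution": categories,
    }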

File 3: processing.log - Processing log with one line per event (a timestamp helper sketch follows the example):

[ISO-8601 UTC] Starting ArXiv query: [query]
[ISO-8601 UTC] Fetched [N] results from ArXiv API
[ISO-8601 UTC] Processing paper: [arxiv_id]
[ISO-8601 UTC] Completed processing: [N] papers in [X.XX] seconds
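
A minimal helper for these log lines, using datetime for the ISO-8601 UTC timestamps; whether you render UTC as +00:00 (as below) or as Z is a formatting choice, and log_event is an illustrative name.

from datetime import datetime, timezone

def log_event(log_path, message):
    # One line per event, prefixed with an ISO-8601 UTC timestamp, e.g.
    # 2024-01-15T12:00:00+00:00 Starting ArXiv query: cat:cs.LG
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(f"{stamp} {message}\n")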

Part D: Error Handling

Your script must handle the following error conditions (a retry sketch follows the list):

  1. Network errors: If the ArXiv API is unreachable, write the error to the log and exit with code 1
  2. Invalid XML: If the API returns malformed XML, log the error and continue with other papers
  3. Missing fields: If a paper lacks required fields, skip it and log a warning
  4. Rate limiting: If you receive HTTP 429, wait 3 seconds and retry (maximum 3 attempts)
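
A sketch of the rate-limit retry (condition 4), with the network-error exit (condition 1) shown at the call site; fetch_with_retry is an illustrative name, and log_event is the helper sketched for processing.log.

import sys
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=3, wait_seconds=3):
    # Retry only on HTTP 429 (rate limiting); anything else propagates to the caller.
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code == 429 and attempt < attempts:
                time.sleep(wait_seconds)
                continue
            raise

# Caller-side handling of an unreachable API:
# try:
#     raw = fetch_with_retry(url)
# except urllib.error.URLError as exc:
#     log_event(log_path, f"ERROR: ArXiv API unreachable: {exc}")
#     sys.exit(1)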

Requirements:

  • Use only standard library modules: sys, json, urllib.request, xml.etree.ElementTree, datetime, time, re, os
  • All word processing must be case-insensitive for frequency counting
  • Preserve original case in the output
  • Handle Unicode properly (ArXiv abstracts often contain mathematical symbols)
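
The last two requirements above come together when writing the output files: passing ensure_ascii=False to json.dump keeps mathematical symbols and accented author names intact. A minimal sketch, where write_json is an illustrative helper:

import json
import os

def write_json(output_dir, name, payload):
    # ensure_ascii=False preserves Unicode in abstracts and author names.
    path = os.path.join(output_dir, name)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)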

Part E: Dockerfile

Create a Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY arxiv_processor.py /app/
RUN mkdir -p /data/output
ENTRYPOINT ["python", "/app/arxiv_processor.py"]

Part F: Build and Run Scripts

Create build.sh:

#!/bin/bash
docker build -t arxiv-processor:latest .

Create run.sh:

#!/bin/bash

# Check arguments
if [ $# -ne 3 ]; then
    echo "Usage: $0 <query> <max_results> <output_directory>"
    echo "Example: $0 'cat:cs.LG' 10 output/"
    exit 1
fi

QUERY="$1"
MAX_RESULTS="$2"
OUTPUT_DIR="$3"

# Validate max_results is a number
if ! [[ "$MAX_RESULTS" =~ ^[0-9]+$ ]]; then
    echo "Error: max_results must be a positive integer"
    exit 1
fi

# Check max_results is in valid range
if [ "$MAX_RESULTS" -lt 1 ] || [ "$MAX_RESULTS" -gt 100 ]; then
    echo "Error: max_results must be between 1 and 100"
    exit 1
fi

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Run container
docker run --rm \
    --name arxiv-processor \
    -v "$(realpath $OUTPUT_DIR)":/data/output \
    arxiv-processor:latest \
    "$QUERY" "$MAX_RESULTS" "/data/output"

Part G: Testing

Create test.sh:

#!/bin/bash

# Test 1: Machine Learning papers
./run.sh "cat:cs.LG" 5 output_ml/

# Test 2: Search by author
./run.sh "au:LeCun" 3 output_author/

# Test 3: Search by title keyword
./run.sh "ti:transformer" 10 output_title/

# Test 4: Complex query (ML papers about transformers from 2023)
./run.sh "cat:cs.LG AND ti:transformer AND submittedDate:[202301010000 TO 202312312359]" 5 output_complex/

echo "Test completed. Check output directories for results."

Deliverables

Your problem2/ directory must contain exactly:

problem2/
├── arxiv_processor.py
├── Dockerfile
├── build.sh
├── run.sh
└── test.sh

Validation

We will test your solution by:

  1. Running ./build.sh - must complete without errors
  2. Running ./run.sh "cat:cs.LG" 10 output/ - must fetch 10 ML papers
  3. Verifying all three output files exist and contain valid JSON
  4. Checking that word frequencies are accurate
  5. Testing with various queries to ensure robustness
  6. Verifying the container handles network errors gracefully

Your container must respect ArXiv’s rate limits and terms of service. Do not make more than one request every 3 seconds, or you risk being blocked.