Problem 2: ArXiv Paper Metadata Processor
Build a containerized application that fetches paper metadata from the ArXiv API, processes it, and generates structured output.
Part A: ArXiv API Client
Create a file arxiv_processor.py that queries the ArXiv API and extracts paper metadata.
Your script must accept exactly three command line arguments:
- Search query string (e.g., "cat:cs.LG" for machine learning papers)
- Maximum number of results to fetch (integer between 1 and 100)
- Path to output directory
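A minimal sketch of the argument handling, assuming the three positional arguments arrive via sys.argv; the helper name parse_args is illustrative, not required:

import sys

def parse_args(argv):
    """Validate the three required command line arguments."""
    if len(argv) != 4:
        print("Usage: arxiv_processor.py <query> <max_results> <output_dir>", file=sys.stderr)
        sys.exit(1)
    query, raw_max, output_dir = argv[1], argv[2], argv[3]
    try:
        max_results = int(raw_max)
    except ValueError:
        print("Error: max_results must be an integer", file=sys.stderr)
        sys.exit(1)
    if not 1 <= max_results <= 100:
        print("Error: max_results must be between 1 and 100", file=sys.stderr)
        sys.exit(1)
    return query, max_results, output_dir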
Your script must perform the following operations:
- Query the ArXiv API using the search query
- Fetch up to the specified maximum number of results
- Extract and process metadata for each paper
- Generate text analysis statistics
- Write structured output files
ArXiv API endpoint: http://export.arxiv.org/api/query
Query parameters:
- search_query: Your search string
- start: Starting index (0-based)
- max_results: Maximum results to return
Example API call:
http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=10
The API returns XML (an Atom feed). You must parse this XML to extract:
- Paper ID (from the <id> tag; extract just the ID portion after the last '/')
- Title (from <title>)
- Authors (from all <author><name> tags)
- Abstract (from <summary>)
- Categories (from the term attribute of all <category> tags)
- Published date (from <published>)
- Updated date (from <updated>)
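A minimal sketch of the fetch-and-parse step using only urllib.request and xml.etree.ElementTree. The feed is Atom, so entries live under the http://www.w3.org/2005/Atom namespace; the helper name fetch_entries is illustrative:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # namespace used by the ArXiv Atom feed

def fetch_entries(query, max_results):
    """Query the ArXiv API and return a list of per-paper metadata dicts."""
    params = urllib.parse.urlencode(
        {"search_query": query, "start": 0, "max_results": max_results}
    )
    url = "http://export.arxiv.org/api/query?" + params
    with urllib.request.urlopen(url, timeout=30) as resp:
        feed = ET.fromstring(resp.read())

    papers = []
    for entry in feed.findall("atom:entry", ATOM):
        raw_id = entry.findtext("atom:id", default="", namespaces=ATOM)
        papers.append({
            "arxiv_id": raw_id.rsplit("/", 1)[-1],  # ID portion after the last '/'
            "title": entry.findtext("atom:title", default="", namespaces=ATOM).strip(),
            "authors": [a.findtext("atom:name", default="", namespaces=ATOM)
                        for a in entry.findall("atom:author", ATOM)],
            "abstract": entry.findtext("atom:summary", default="", namespaces=ATOM).strip(),
            "categories": [c.get("term") for c in entry.findall("atom:category", ATOM)],
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM),
        })
    return papers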
Part B: Text Processing
For each paper’s abstract, compute the following:
Word frequency analysis:
- Total word count
- Unique word count
- Top 20 most frequent words (excluding stopwords)
- Average word length
Sentence analysis:
- Total sentence count (split on '.', '!', '?')
- Average words per sentence
- Longest sentence (by word count)
- Shortest sentence (by word count)
Technical term extraction:
- Extract all words containing uppercase letters (e.g., "LSTM", "GPU")
- Extract all words containing numbers (e.g., "3D", "ResNet50")
- Extract all hyphenated terms (e.g., "state-of-the-art", "pre-trained")
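One possible sketch of the per-abstract analysis, assuming the STOPWORDS set given below and a simple ASCII word/sentence tokenization (the exact tokenization rules are up to you):

import re
from collections import Counter

def analyze_abstract(text, stopwords):
    """Per-abstract statistics; stopwords is the STOPWORDS set defined below."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", text)  # original-case tokens
    lowered = [t.lower() for t in tokens]                    # counting is case-insensitive
    freq = Counter(w for w in lowered if w not in stopwords)
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    return {
        "total_words": len(tokens),
        "unique_words": len(set(lowered)),
        "top_20_words": freq.most_common(20),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        "total_sentences": len(sentences),
        "avg_words_per_sentence": len(tokens) / len(sentences) if sentences else 0.0,
        "longest_sentence": max(sentences, key=lambda s: len(s.split()), default=""),
        "shortest_sentence": min(sentences, key=lambda s: len(s.split()), default=""),
        # Technical terms preserve their original case.
        "uppercase_terms": sorted({t for t in tokens if any(c.isupper() for c in t)}),
        "numeric_terms": sorted({t for t in tokens if any(c.isdigit() for c in t)}),
        "hyphenated_terms": sorted({t for t in tokens if "-" in t}),
    }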
Use the following stopwords list:
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}

Part C: Output Files
Your script must write three files to the output directory:
File 1: papers.json - Array of paper metadata:
[
{
"arxiv_id": "[paper ID]",
"title": "[paper title]",
"authors": ["author1", "author2", ...],
"abstract": "[full abstract text]",
"categories": ["cat1", "cat2", ...],
"published": "[ISO-8601 UTC]",
"updated": "[ISO-8601 UTC]",
"abstract_stats": {
"total_words": [integer],
"unique_words": [integer],
"total_sentences": [integer],
"avg_words_per_sentence": [float],
"avg_word_length": [float]
}
},
...
]

File 2: corpus_analysis.json - Aggregate analysis across all papers:
{
"query": "[search query used]",
"papers_processed": [integer],
"processing_timestamp": "[ISO-8601 UTC]",
"corpus_stats": {
"total_abstracts": [integer],
"total_words": [integer],
"unique_words_global": [integer],
"avg_abstract_length": [float],
"longest_abstract_words": [integer],
"shortest_abstract_words": [integer]
},
"top_50_words": [
{"word": "[word1]", "frequency": [count], "documents": [count]},
...
],
"technical_terms": {
"uppercase_terms": ["TERM1", "TERM2", ...],
"numeric_terms": ["term1", "term2", ...],
"hyphenated_terms": ["term-1", "term-2", ...]
},
"category_distribution": {
"cs.LG": [count],
"cs.AI": [count],
...
}
}

File 3: processing.log - Processing log with one line per event:
[ISO-8601 UTC] Starting ArXiv query: [query]
[ISO-8601 UTC] Fetched [N] results from ArXiv API
[ISO-8601 UTC] Processing paper: [arxiv_id]
[ISO-8601 UTC] Completed processing: [N] papers in [X.XX] seconds
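A minimal sketch of the aggregation and output-writing helpers, assuming each abstract's stopword-filtered word counts are kept as collections.Counter objects; helper names are illustrative:

import json
import os
from collections import Counter
from datetime import datetime, timezone

def utc_now():
    """ISO-8601 UTC timestamp used in corpus_analysis.json and processing.log."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def log_event(log_path, message):
    """Append one timestamped line per event to processing.log."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(f"{utc_now()} {message}\n")

def top_50_words(per_paper_counts):
    """per_paper_counts: list of Counter objects, one per abstract."""
    corpus = Counter()
    doc_freq = Counter()
    for counts in per_paper_counts:
        corpus.update(counts)
        doc_freq.update(counts.keys())  # each paper counted at most once per word
    return [{"word": w, "frequency": n, "documents": doc_freq[w]}
            for w, n in corpus.most_common(50)]

def write_json(output_dir, name, payload):
    """Write UTF-8 JSON; ensure_ascii=False preserves math symbols in abstracts."""
    with open(os.path.join(output_dir, name), "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)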
Part D: Error Handling
Your script must handle the following error conditions:
- Network errors: If the ArXiv API is unreachable, write error to log and exit with code 1
- Invalid XML: If the API returns malformed XML, log the error and continue with other papers
- Missing fields: If a paper lacks required fields, skip it and log a warning
- Rate limiting: If you receive HTTP 429, wait 3 seconds and retry (maximum 3 attempts)
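A minimal sketch of the retry behavior, assuming requests go through urllib.request; the helper name fetch_with_retry is illustrative:

import sys
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=3, wait_seconds=3):
    """GET with retry on HTTP 429; exits with code 1 on unrecoverable network errors."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code == 429 and attempt < attempts:
                time.sleep(wait_seconds)  # rate limited: wait, then retry
                continue
            raise
        except urllib.error.URLError as exc:
            # Network unreachable: the caller is expected to log this before exiting.
            print(f"Network error: {exc}", file=sys.stderr)
            sys.exit(1)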
Requirements:
- Use only standard library modules: sys, json, urllib.request, xml.etree.ElementTree, datetime, time, re, os
- All word processing must be case-insensitive for frequency counting
- Preserve original case in the output
- Handle Unicode properly (ArXiv abstracts often contain mathematical symbols)
Part E: Dockerfile
Create a Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_processor.py /app/
RUN mkdir -p /data/output
ENTRYPOINT ["python", "/app/arxiv_processor.py"]

Part F: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t arxiv-processor:latest .

Create run.sh:
#!/bin/bash
# Check arguments
if [ $# -ne 3 ]; then
echo "Usage: $0 <query> <max_results> <output_directory>"
echo "Example: $0 'cat:cs.LG' 10 output/"
exit 1
fi
QUERY="$1"
MAX_RESULTS="$2"
OUTPUT_DIR="$3"
# Validate max_results is a number
if ! [[ "$MAX_RESULTS" =~ ^[0-9]+$ ]]; then
echo "Error: max_results must be a positive integer"
exit 1
fi
# Check max_results is in valid range
if [ "$MAX_RESULTS" -lt 1 ] || [ "$MAX_RESULTS" -gt 100 ]; then
echo "Error: max_results must be between 1 and 100"
exit 1
fi
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
# Run container
docker run --rm \
--name arxiv-processor \
-v "$(realpath $OUTPUT_DIR)":/data/output \
arxiv-processor:latest \
"$QUERY" "$MAX_RESULTS" "/data/output"Part G: Testing
Create test.sh:
#!/bin/bash
# Test 1: Machine Learning papers
./run.sh "cat:cs.LG" 5 output_ml/
# Test 2: Search by author
./run.sh "au:LeCun" 3 output_author/
# Test 3: Search by title keyword
./run.sh "ti:transformer" 10 output_title/
# Test 4: Complex query (ML papers about transformers from 2023)
./run.sh "cat:cs.LG AND ti:transformer AND submittedDate:[202301010000 TO 202312312359]" 5 output_complex/
echo "Test completed. Check output directories for results."Deliverables
Your problem2/ directory must contain exactly:
problem2/
├── arxiv_processor.py
├── Dockerfile
├── build.sh
├── run.sh
└── test.sh
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh "cat:cs.LG" 10 output/ - must fetch 10 ML papers
- Verifying that all three output files exist and contain valid JSON
- Checking that word frequencies are accurate
- Testing with various queries to ensure robustness
- Verifying the container handles network errors gracefully
Your container must respect ArXiv’s rate limits and terms of service. Do not make more than 1 request per 3 seconds to avoid being blocked.
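If your script issues more than one request (for example, paging through results), a simple throttle is one way to stay under that limit; this is a sketch under that assumption, not a required interface:

import time

_last_request = 0.0  # monotonic time of the previous ArXiv request

def throttle(min_interval=3.0):
    """Sleep so consecutive ArXiv requests are at least min_interval seconds apart."""
    global _last_request
    elapsed = time.monotonic() - _last_request
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    _last_request = time.monotonic()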