Problem 2: ArXiv Paper Discovery with DynamoDB
Use only the following packages:
- boto3 (AWS SDK for Python)
- Python standard library modules (json, sys, os, datetime, re, collections)
Do not use other AWS libraries, NoSQL ORMs, or database abstraction layers beyond boto3.
Build a paper discovery system using AWS DynamoDB that efficiently supports multiple access patterns through schema design and denormalization.
Part A: Schema Design for Access Patterns
Design a DynamoDB table schema that efficiently supports these required query patterns:
- Browse recent papers by category (e.g., “Show me latest ML papers”)
- Find all papers by a specific author
- Get full paper details by arxiv_id
- List papers published in a date range within a category
- Search papers by keyword (extracted from abstract)
Design Requirements:
- Define partition key and sort key for main table
- Design Global Secondary Indexes (GSIs) to support all access patterns
- Implement denormalization strategy for efficient queries
- Document trade-offs in your schema design
Example Schema Structure:
# Main Table Item
{
"PK": "CATEGORY#cs.LG",
"SK": "2023-01-15#2301.12345",
"arxiv_id": "2301.12345",
"title": "Paper Title",
"authors": ["Author1", "Author2"],
"abstract": "Full abstract text...",
"categories": ["cs.LG", "cs.AI"],
"keywords": ["keyword1", "keyword2"],
"published": "2023-01-15T10:30:00Z"
}
# GSI1: Author access
{
"GSI1PK": "AUTHOR#Author1",
"GSI1SK": "2023-01-15",
# ... rest of paper data
}
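Paper-ID and keyword lookups can be handled the same way. The item shapes below are illustrative only: the GSI attribute names match the PaperIdIndex and KeywordIndex queries used in Parts B and C, while the sort-key values are assumptions you may adjust.
# GSI2: Paper ID lookup (PaperIdIndex)
{
"GSI2PK": "PAPER#2301.12345",
"GSI2SK": "2023-01-15",
# ... rest of paper data
}
# GSI3: Keyword lookup (KeywordIndex)
{
"GSI3PK": "KEYWORD#transformer",
"GSI3SK": "2023-01-15",
# ... rest of paper data
}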
# Additional GSIs as needed for other access patterns
Part B: Data Loading Script
Create load_data.py that loads ArXiv papers from your HW#1 Problem 2 output (papers.json) into DynamoDB.
Your script must accept these command line arguments:
python load_data.py <papers_json_path> <table_name> [--region REGION]
Required Operations:
- Create DynamoDB table with appropriate partition/sort keys
- Create GSIs for alternate access patterns
- Transform paper data from HW#1 format to DynamoDB items
- Extract keywords from abstracts (top 10 most frequent words, excluding stopwords; a keyword-extraction and batch-write sketch follows this list)
- Implement denormalization:
- Papers in multiple categories → multiple items
- Multiple authors → items for each author (GSI)
- Multiple keywords → items for each keyword (GSI)
- Batch write items to DynamoDB (use batch_write_item for efficiency)
- Report statistics:
- Number of papers loaded
- Total DynamoDB items created
- Denormalization factor (items/paper ratio)
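As a concrete reference for the keyword-extraction and batch-write steps above, here is a minimal sketch. The function names, tokenization regex, and default region are illustrative choices, not part of the required interface; boto3's batch_writer is a thin wrapper that issues batch_write_item requests in groups of up to 25 items and retries unprocessed items.
import re
from collections import Counter

import boto3

def extract_keywords(abstract, stopwords, top_n=10):
    # Lowercase, keep alphabetic tokens, drop stopwords, return the top_n most frequent words.
    words = re.findall(r'[a-z]+', abstract.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]

def batch_write_items(table_name, items, region='us-east-1'):
    # batch_writer groups put requests into batch_write_item calls (25 items max per call).
    table = boto3.resource('dynamodb', region_name=region).Table(table_name)
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)
Dividing the total number of items written by the number of papers loaded gives the denormalization factor reported in the example output.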
Example Output:
Creating DynamoDB table: arxiv-papers
Creating GSIs: AuthorIndex, PaperIdIndex, KeywordIndex
Loading papers from papers.json...
Extracting keywords from abstracts...
Loaded 157 papers
Created 2,345 DynamoDB items (denormalized)
Denormalization factor: 14.9x
Storage breakdown:
- Category items: 314 (2.0 per paper avg)
- Author items: 785 (5.0 per paper avg)
- Keyword items: 1,570 (10.0 per paper avg)
- Paper ID items: 157 (1.0 per paper)
Keyword Extraction: Use the following stopwords list:
STOPWORDS = {
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
'can', 'this', 'that', 'these', 'those', 'we', 'our', 'use', 'using',
'based', 'approach', 'method', 'paper', 'propose', 'proposed', 'show'
}
Part C: Query Implementation
Create query_papers.py that implements queries for all five access patterns.
Your script must support these commands:
# Query 1: Recent papers in category
python query_papers.py recent <category> [--limit 20] [--table TABLE]
# Query 2: Papers by author
python query_papers.py author <author_name> [--table TABLE]
# Query 3: Get paper by ID
python query_papers.py get <arxiv_id> [--table TABLE]
# Query 4: Papers in date range
python query_papers.py daterange <category> <start_date> <end_date> [--table TABLE]
# Query 5: Papers by keyword
python query_papers.py keyword <keyword> [--limit 20] [--table TABLE]
Query Implementations:
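The snippets below assume boto3's resource interface and the Key condition helper are in scope:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')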
def query_recent_in_category(table_name, category, limit=20):
"""
Query 1: Browse recent papers in category.
Uses: Main table partition key query with sort key descending.
"""
response = dynamodb.Table(table_name).query(
KeyConditionExpression=Key('PK').eq(f'CATEGORY#{category}'),
ScanIndexForward=False,
Limit=limit
)
return response['Items']
def query_papers_by_author(table_name, author_name):
"""
Query 2: Find all papers by author.
Uses: GSI1 (AuthorIndex) partition key query.
"""
response = dynamodb.Table(table_name).query(
IndexName='AuthorIndex',
KeyConditionExpression=Key('GSI1PK').eq(f'AUTHOR#{author_name}')
)
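    # Note: a single Query response is capped at 1 MB; for prolific authors,
    # follow LastEvaluatedKey to paginate (omitted here for brevity).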
return response['Items']
def get_paper_by_id(table_name, arxiv_id):
"""
Query 3: Get specific paper by ID.
Uses: GSI2 (PaperIdIndex) for direct lookup.
"""
response = dynamodb.Table(table_name).query(
IndexName='PaperIdIndex',
KeyConditionExpression=Key('GSI2PK').eq(f'PAPER#{arxiv_id}')
)
return response['Items'][0] if response['Items'] else None
def query_papers_in_date_range(table_name, category, start_date, end_date):
"""
Query 4: Papers in category within date range.
Uses: Main table with composite sort key range query.
"""
response = dynamodb.Table(table_name).query(
KeyConditionExpression=(
Key('PK').eq(f'CATEGORY#{category}') &
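            # '#zzzzzzz' sorts after any arxiv_id suffix, so papers published on end_date are included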
Key('SK').between(f'{start_date}#', f'{end_date}#zzzzzzz')
)
)
return response['Items']
def query_papers_by_keyword(table_name, keyword, limit=20):
"""
Query 5: Papers containing keyword.
Uses: GSI3 (KeywordIndex) partition key query.
"""
response = dynamodb.Table(table_name).query(
IndexName='KeywordIndex',
KeyConditionExpression=Key('GSI3PK').eq(f'KEYWORD#{keyword.lower()}'),
ScanIndexForward=False,
Limit=limit
)
return response['Items']
Output Format:
All queries must output JSON to stdout:
{
"query_type": "recent_in_category",
"parameters": {
"category": "cs.LG",
"limit": 20
},
"results": [
{
"arxiv_id": "2301.12345",
"title": "Paper Title",
"authors": ["Author1", "Author2"],
"published": "2023-01-15T10:30:00Z",
"categories": ["cs.LG"]
}
],
"count": 20,
"execution_time_ms": 12
}
Part D: API Server with DynamoDB Backend
Create api_server.py that exposes query functionality via HTTP endpoints.
Required Endpoints:
GET /papers/recent?category={category}&limit={limit}
- Returns recent papers in category
- Default limit: 20
GET /papers/author/{author_name}
- Returns all papers by author
GET /papers/{arxiv_id}
- Returns full paper details by ID
GET /papers/search?category={category}&start={date}&end={date}
- Returns papers in date range
GET /papers/keyword/{keyword}?limit={limit}
- Returns papers matching keyword
- Default limit: 20
Implementation Requirements:
- Use only the Python standard library http.server (no Flask/FastAPI)
- Accept the port number as a command line argument (default 8080)
- Return JSON responses with proper HTTP status codes
- Handle errors gracefully (404 for not found, 500 for server errors)
- Log requests to stdout
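For orientation only, a minimal routing skeleton built on http.server could look like the sketch below. The table name and the import of the Part C functions from query_papers.py are assumptions; only two routes are shown, and the remaining endpoints follow the same pattern.
import json
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs, unquote

# Assumes the Part C functions live in query_papers.py alongside this file
from query_papers import query_recent_in_category, query_papers_by_author

TABLE_NAME = 'arxiv-papers'   # assumed default table name

class PaperHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
        try:
            if parsed.path == '/papers/recent':
                results = query_recent_in_category(
                    TABLE_NAME, params.get('category', ''), int(params.get('limit', 20)))
                self._send(200, {'category': params.get('category', ''),
                                 'papers': results, 'count': len(results)})
            elif parsed.path.startswith('/papers/author/'):
                author = unquote(parsed.path.split('/', 3)[3])
                results = query_papers_by_author(TABLE_NAME, author)
                self._send(200, {'author': author, 'papers': results, 'count': len(results)})
            # ... /papers/search, /papers/keyword/..., and /papers/<arxiv_id> follow the same pattern
            else:
                self._send(404, {'error': 'not found'})
        except Exception as exc:
            self._send(500, {'error': str(exc)})

    def log_message(self, fmt, *args):
        # Log requests to stdout instead of the default stderr
        print(f'{self.client_address[0]} - {fmt % args}')

    def _send(self, status, payload):
        data = json.dumps(payload).encode('utf-8')
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == '__main__':
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8080
    HTTPServer(('', port), PaperHandler).serve_forever()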
Example Request/Response:
curl "http://localhost:8080/papers/recent?category=cs.LG&limit=5"{
"category": "cs.LG",
"papers": [
{
"arxiv_id": "2310.12345",
"title": "Recent ML Paper",
"authors": ["Author One", "Author Two"],
"published": "2023-10-15T10:30:00Z"
}
],
"count": 5
}
Part E: EC2 Deployment
Deploy your API server to AWS EC2 and configure it to use DynamoDB.
Deployment Steps:
Launch EC2 instance:
- Instance type: t3.micro or t3.small
- OS: Amazon Linux 2023 or Ubuntu 22.04
- Security group: Allow inbound HTTP (port 80 or custom port)
Configure IAM role with DynamoDB permissions:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "dynamodb:Query", "dynamodb:Scan", "dynamodb:GetItem", "dynamodb:BatchGetItem" ], "Resource": "arn:aws:dynamodb:*:*:table/arxiv-papers*" } ] }Install dependencies on EC2:
sudo yum install python3 python3-pip -y   # Amazon Linux
pip3 install boto3
Deploy and run server:
# Upload api_server.py to EC2
scp -i key.pem api_server.py ec2-user@<public-ip>:~
# SSH to instance
ssh -i key.pem ec2-user@<public-ip>
# Run server (use screen or systemd for persistence)
python3 api_server.py 8080
Test from local machine:
curl "http://<ec2-public-ip>:8080/papers/recent?category=cs.LG&limit=5"
Create deploy.sh:
#!/bin/bash
if [ $# -ne 2 ]; then
echo "Usage: $0 <key_file> <ec2_public_ip>"
exit 1
fi
KEY_FILE="$1"
EC2_IP="$2"
echo "Deploying to EC2 instance: $EC2_IP"
# Copy files
scp -i "$KEY_FILE" api_server.py ec2-user@"$EC2_IP":~
scp -i "$KEY_FILE" requirements.txt ec2-user@"$EC2_IP":~
# Install dependencies and start server
ssh -i "$KEY_FILE" ec2-user@"$EC2_IP" << 'EOF'
pip3 install -r requirements.txt
# Kill existing server if running
pkill -f api_server.py
# Start server in background
nohup python3 api_server.py 8080 > server.log 2>&1 &
echo "Server started. Check with: curl http://localhost:8080/papers/recent?category=cs.LG"
EOF
echo "Deployment complete"
echo "Test with: curl http://$EC2_IP:8080/papers/recent?category=cs.LG"Part F: Analysis and Documentation
Create README.md in your problem2/ directory that answers:
- Schema Design Decisions:
- Why did you choose your partition key structure?
- How many GSIs did you create and why?
- What denormalization trade-offs did you make?
- Denormalization Analysis:
- Average number of DynamoDB items per paper
- Storage multiplication factor
- Which access patterns caused the most duplication?
- Query Limitations:
- What queries are NOT efficiently supported by your schema?
- Examples: “Count total papers by author”, “Most cited papers globally”
- Why are these difficult in DynamoDB?
- When to Use DynamoDB:
- Based on this exercise, when would you choose DynamoDB over PostgreSQL?
- What are the key trade-offs?
- EC2 Deployment:
- Your EC2 instance public IP
- IAM role ARN used
- Any challenges encountered during deployment
Deliverables
Your problem2/ directory must contain:
problem2/
├── load_data.py
├── query_papers.py
├── api_server.py
├── deploy.sh
├── requirements.txt
└── README.md
requirements.txt:
boto3>=1.28.0
All scripts must be executable and handle errors gracefully.
Validation
We will test your solution by:
- Running load_data.py with sample ArXiv papers
- Testing all five query patterns with query_papers.py
- Starting your API server and testing all endpoints
- Verifying your EC2 deployment is accessible
- Checking your README answers all analysis questions
- Validating denormalization is implemented correctly
- Testing query performance for various access patterns
Your API server must respond to all endpoints within 200ms for queries on tables with up to 500 papers.