Problem 2: ArXiv Paper Discovery with DynamoDB

Requirements

Use only the following packages:

  • boto3 (AWS SDK for Python)
  • Python standard library modules (json, sys, os, datetime, re, collections)

Do not use other AWS libraries, NoSQL ORMs, or database abstraction layers beyond boto3.

Build a paper discovery system using AWS DynamoDB that efficiently supports multiple access patterns through schema design and denormalization.

Part A: Schema Design for Access Patterns

Design a DynamoDB table schema that efficiently supports these required query patterns:

  1. Browse recent papers by category (e.g., “Show me the latest ML papers”)
  2. Find all papers by a specific author
  3. Get full paper details by arxiv_id
  4. List papers published in a date range within a category
  5. Search papers by keyword (extracted from abstract)

Design Requirements:

  • Define partition key and sort key for main table
  • Design Global Secondary Indexes (GSIs) to support all access patterns
  • Implement denormalization strategy for efficient queries
  • Document trade-offs in your schema design

Example Schema Structure:

# Main Table Item
{
  "PK": "CATEGORY#cs.LG",
  "SK": "2023-01-15#2301.12345",
  "arxiv_id": "2301.12345",
  "title": "Paper Title",
  "authors": ["Author1", "Author2"],
  "abstract": "Full abstract text...",
  "categories": ["cs.LG", "cs.AI"],
  "keywords": ["keyword1", "keyword2"],
  "published": "2023-01-15T10:30:00Z"
}

# GSI1: Author access
{
  "GSI1PK": "AUTHOR#Author1",
  "GSI1SK": "2023-01-15",
  # ... rest of paper data
}

# Additional GSIs as needed for other access patterns
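
For the keyword pattern (Query 5), a third GSI would follow the same shape; the item below is illustrative (the GSI3PK/GSI3SK attribute names are placeholders, not requirements):

```
# GSI3: Keyword access (hypothetical KeywordIndex)
{
  "GSI3PK": "KEYWORD#transformer",
  "GSI3SK": "2023-01-15",
  # ... rest of paper data
}
```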

Part B: Data Loading Script

Create load_data.py that loads ArXiv papers from your HW#1 Problem 2 output (papers.json) into DynamoDB.

Your script must accept these command line arguments:

python load_data.py <papers_json_path> <table_name> [--region REGION]

Required Operations:

  1. Create DynamoDB table with appropriate partition/sort keys
  2. Create GSIs for alternate access patterns
  3. Transform paper data from HW#1 format to DynamoDB items
  4. Extract keywords from abstracts (top 10 most frequent words, excluding stopwords)
  5. Implement denormalization:
    • Papers in multiple categories → multiple items
    • Multiple authors → items for each author (GSI)
    • Multiple keywords → items for each keyword (GSI)
  6. Batch write items to DynamoDB (use batch_write_item for efficiency)
  7. Report statistics:
    • Number of papers loaded
    • Total DynamoDB items created
    • Denormalization factor (items/paper ratio)
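
The fan-out in operation 5 can be sketched as a pure transform from one paper to its denormalized items. Everything here (key names, SK layout, GSI attribute names) follows the example schema above and is an assumption, not a mandated implementation:

```python
def paper_to_items(paper, keywords):
    """Fan one paper out into denormalized DynamoDB items (sketch)."""
    date = paper["published"][:10]          # YYYY-MM-DD prefix of ISO timestamp
    base = {**paper, "keywords": keywords}
    items = []
    for cat in paper["categories"]:         # one item per category (main table)
        items.append({**base, "PK": f"CATEGORY#{cat}",
                      "SK": f"{date}#{paper['arxiv_id']}"})
    for author in paper["authors"]:         # one item per author (AuthorIndex)
        items.append({**base, "PK": f"AUTHOR#{author}", "SK": paper["arxiv_id"],
                      "GSI1PK": f"AUTHOR#{author}", "GSI1SK": date})
    for kw in keywords:                     # one item per keyword (KeywordIndex)
        items.append({**base, "PK": f"KEYWORD#{kw}", "SK": paper["arxiv_id"],
                      "GSI3PK": f"KEYWORD#{kw}", "GSI3SK": date})
    items.append({**base, "PK": f"PAPER#{paper['arxiv_id']}", "SK": "META",
                  "GSI2PK": f"PAPER#{paper['arxiv_id']}"})  # direct lookup
    return items
```

With this shape the denormalization factor is simply the total item count divided by the paper count.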

Example Output:

Creating DynamoDB table: arxiv-papers
Creating GSIs: AuthorIndex, PaperIdIndex, KeywordIndex
Loading papers from papers.json...
Extracting keywords from abstracts...
Loaded 157 papers
Created 2,826 DynamoDB items (denormalized)
Denormalization factor: 18.0x

Storage breakdown:
  - Category items: 314 (2.0 per paper avg)
  - Author items: 785 (5.0 per paper avg)
  - Keyword items: 1,570 (10.0 per paper avg)
  - Paper ID items: 157 (1.0 per paper)
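
For operation 6, a minimal loading helper might lean on boto3's table.batch_writer(), which wraps batch_write_item and handles the 25-item chunking and retry of unprocessed items automatically. The table argument is assumed to be a boto3 Table resource (injected here so the sketch is testable without AWS):

```python
def batch_load(table, items):
    """Write items via table.batch_writer(), which batches calls to
    batch_write_item and retries unprocessed items for you."""
    written = 0
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
            written += 1
    return written
```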

Keyword Extraction: Use the following stopwords list:

STOPWORDS = {
    'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
    'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
    'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
    'can', 'this', 'that', 'these', 'those', 'we', 'our', 'use', 'using',
    'based', 'approach', 'method', 'paper', 'propose', 'proposed', 'show'
}
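
A minimal extractor over this list could look like the sketch below; the regex tokenizer and the minimum-length filter are assumptions you may tighten:

```python
import re
from collections import Counter

def extract_keywords(abstract, stopwords, top_n=10):
    """Return the top_n most frequent non-stopword tokens in an abstract."""
    tokens = re.findall(r"[a-z]+", abstract.lower())   # letters-only tokens
    counts = Counter(t for t in tokens
                     if t not in stopwords and len(t) > 2)  # drop short words
    return [word for word, _ in counts.most_common(top_n)]
```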

Part C: Query Implementation

Create query_papers.py that implements queries for all five access patterns.

Your script must support these commands:

# Query 1: Recent papers in category
python query_papers.py recent <category> [--limit 20] [--table TABLE]

# Query 2: Papers by author
python query_papers.py author <author_name> [--table TABLE]

# Query 3: Get paper by ID
python query_papers.py get <arxiv_id> [--table TABLE]

# Query 4: Papers in date range
python query_papers.py daterange <category> <start_date> <end_date> [--table TABLE]

# Query 5: Papers by keyword
python query_papers.py keyword <keyword> [--limit 20] [--table TABLE]
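
The subcommand interface maps naturally onto argparse subparsers; a partial sketch (only two of the five subcommands shown):

```python
import argparse

def build_parser():
    """CLI with one subparser per query pattern (two shown; add the rest)."""
    parser = argparse.ArgumentParser(prog="query_papers.py")
    sub = parser.add_subparsers(dest="command", required=True)

    recent = sub.add_parser("recent")
    recent.add_argument("category")
    recent.add_argument("--limit", type=int, default=20)
    recent.add_argument("--table", default="arxiv-papers")

    get_p = sub.add_parser("get")
    get_p.add_argument("arxiv_id")
    get_p.add_argument("--table", default="arxiv-papers")
    return parser
```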

Query Implementations:

from boto3.dynamodb.conditions import Key
import boto3

dynamodb = boto3.resource('dynamodb')

def query_recent_in_category(table_name, category, limit=20):
    """
    Query 1: Browse recent papers in category.
    Uses: Main table partition key query with sort key descending.
    """
    response = dynamodb.Table(table_name).query(
        KeyConditionExpression=Key('PK').eq(f'CATEGORY#{category}'),
        ScanIndexForward=False,
        Limit=limit
    )
    return response['Items']

def query_papers_by_author(table_name, author_name):
    """
    Query 2: Find all papers by author.
    Uses: GSI1 (AuthorIndex) partition key query.
    """
    response = dynamodb.Table(table_name).query(
        IndexName='AuthorIndex',
        KeyConditionExpression=Key('GSI1PK').eq(f'AUTHOR#{author_name}')
    )
    return response['Items']

def get_paper_by_id(table_name, arxiv_id):
    """
    Query 3: Get specific paper by ID.
    Uses: GSI2 (PaperIdIndex) for direct lookup.
    """
    response = dynamodb.Table(table_name).query(
        IndexName='PaperIdIndex',
        KeyConditionExpression=Key('GSI2PK').eq(f'PAPER#{arxiv_id}')
    )
    return response['Items'][0] if response['Items'] else None

def query_papers_in_date_range(table_name, category, start_date, end_date):
    """
    Query 4: Papers in category within date range.
    Uses: Main table with composite sort key range query.
    """
    response = dynamodb.Table(table_name).query(
        KeyConditionExpression=(
            Key('PK').eq(f'CATEGORY#{category}') &
            # '#zzzzzzz' sorts after any '{date}#{arxiv_id}' suffix, so papers
            # published on end_date itself are included in the range.
            Key('SK').between(f'{start_date}#', f'{end_date}#zzzzzzz')
        )
    )
    return response['Items']

def query_papers_by_keyword(table_name, keyword, limit=20):
    """
    Query 5: Papers containing keyword.
    Uses: GSI3 (KeywordIndex) partition key query.
    """
    response = dynamodb.Table(table_name).query(
        IndexName='KeywordIndex',
        KeyConditionExpression=Key('GSI3PK').eq(f'KEYWORD#{keyword.lower()}'),
        ScanIndexForward=False,
        Limit=limit
    )
    return response['Items']

Output Format:

All queries must output JSON to stdout:

{
  "query_type": "recent_in_category",
  "parameters": {
    "category": "cs.LG",
    "limit": 20
  },
  "results": [
    {
      "arxiv_id": "2301.12345",
      "title": "Paper Title",
      "authors": ["Author1", "Author2"],
      "published": "2023-01-15T10:30:00Z",
      "categories": ["cs.LG"]
    }
  ],
  "count": 20,
  "execution_time_ms": 12
}
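
One way to produce this envelope is a small wrapper that times the query call and serializes the result; the run_query helper name and shape are illustrative:

```python
import json
import time

def run_query(query_fn, query_type, **params):
    """Execute a query function, timing it and wrapping results as JSON."""
    start = time.perf_counter()
    results = query_fn(**params)
    elapsed_ms = round((time.perf_counter() - start) * 1000)
    return json.dumps({
        "query_type": query_type,
        "parameters": params,
        "results": results,
        "count": len(results),
        "execution_time_ms": elapsed_ms,
    }, indent=2)
```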

Part D: API Server with DynamoDB Backend

Create api_server.py that exposes query functionality via HTTP endpoints.

Required Endpoints:

  1. GET /papers/recent?category={category}&limit={limit}
    • Returns recent papers in category
    • Default limit: 20
  2. GET /papers/author/{author_name}
    • Returns all papers by author
  3. GET /papers/{arxiv_id}
    • Returns full paper details by ID
  4. GET /papers/search?category={category}&start={date}&end={date}
    • Returns papers in date range
  5. GET /papers/keyword/{keyword}?limit={limit}
    • Returns papers matching keyword
    • Default limit: 20

Implementation Requirements:

  • Use only Python standard library http.server (no Flask/FastAPI)
  • Accept port number as command line argument (default 8080)
  • Return JSON responses with proper HTTP status codes
  • Handle errors gracefully (404 for not found, 500 for server errors)
  • Log requests to stdout
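
A routing skeleton meeting these requirements with only http.server might look like this sketch; the /papers/recent branch is stubbed where the Part C query function would be called:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class PaperHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        qs = parse_qs(parsed.query)
        try:
            if parsed.path == "/papers/recent":
                category = qs.get("category", [None])[0]
                # ...call the Part C query function here...
                self._send(200, {"category": category, "papers": [], "count": 0})
            else:
                self._send(404, {"error": "not found"})
        except Exception as exc:
            self._send(500, {"error": str(exc)})

    def _send(self, status, payload):
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        print("%s - %s" % (self.address_string(), fmt % args))  # log to stdout

# In api_server.py's main:
#   HTTPServer(("", port), PaperHandler).serve_forever()
```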

Example Request/Response:

curl "http://localhost:8080/papers/recent?category=cs.LG&limit=5"
{
  "category": "cs.LG",
  "papers": [
    {
      "arxiv_id": "2310.12345",
      "title": "Recent ML Paper",
      "authors": ["Author One", "Author Two"],
      "published": "2023-10-15T10:30:00Z"
    }
  ],
  "count": 5
}

Part E: EC2 Deployment

Deploy your API server to AWS EC2 and configure it to use DynamoDB.

Deployment Steps:

  1. Launch EC2 instance:

    • Instance type: t3.micro or t3.small
    • OS: Amazon Linux 2023 or Ubuntu 22.04
    • Security group: Allow inbound HTTP (port 80 or custom port)
  2. Configure IAM role with DynamoDB permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "dynamodb:Query",
            "dynamodb:Scan",
            "dynamodb:GetItem",
            "dynamodb:BatchGetItem"
          ],
          "Resource": "arn:aws:dynamodb:*:*:table/arxiv-papers*"
        }
      ]
    }
  3. Install dependencies on EC2:

    sudo dnf install -y python3 python3-pip  # Amazon Linux 2023 (use apt on Ubuntu)
    pip3 install boto3
  4. Deploy and run server:

    # Upload api_server.py to EC2
    scp -i key.pem api_server.py ec2-user@<public-ip>:~
    
    # SSH to instance
    ssh -i key.pem ec2-user@<public-ip>
    
    # Run server (use screen or systemd for persistence)
    python3 api_server.py 8080
  5. Test from local machine:

    curl "http://<ec2-public-ip>:8080/papers/recent?category=cs.LG&limit=5"

Create deploy.sh:

#!/bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 <key_file> <ec2_public_ip>"
    exit 1
fi

KEY_FILE="$1"
EC2_IP="$2"

echo "Deploying to EC2 instance: $EC2_IP"

# Copy files
scp -i "$KEY_FILE" api_server.py ec2-user@"$EC2_IP":~
scp -i "$KEY_FILE" requirements.txt ec2-user@"$EC2_IP":~

# Install dependencies and start server
ssh -i "$KEY_FILE" ec2-user@"$EC2_IP" << 'EOF'
  pip3 install -r requirements.txt

  # Kill existing server if running
  pkill -f api_server.py || true

  # Start server in background
  nohup python3 api_server.py 8080 > server.log 2>&1 &

  echo "Server started. Check with: curl http://localhost:8080/papers/recent?category=cs.LG"
EOF

echo "Deployment complete"
echo "Test with: curl http://$EC2_IP:8080/papers/recent?category=cs.LG"

Part F: Analysis and Documentation

Create README.md in your problem2/ directory that answers:

  1. Schema Design Decisions:
    • Why did you choose your partition key structure?
    • How many GSIs did you create and why?
    • What denormalization trade-offs did you make?
  2. Denormalization Analysis:
    • Average number of DynamoDB items per paper
    • Storage multiplication factor
    • Which access patterns caused the most duplication?
  3. Query Limitations:
    • What queries are NOT efficiently supported by your schema?
    • Examples: “Count total papers by author”, “Most cited papers globally”
    • Why are these difficult in DynamoDB?
  4. When to Use DynamoDB:
    • Based on this exercise, when would you choose DynamoDB over PostgreSQL?
    • What are the key trade-offs?
  5. EC2 Deployment:
    • Your EC2 instance public IP
    • IAM role ARN used
    • Any challenges encountered during deployment

Deliverables

Your problem2/ directory must contain:

problem2/
├── load_data.py
├── query_papers.py
├── api_server.py
├── deploy.sh
├── requirements.txt
└── README.md

requirements.txt:

boto3>=1.28.0

All scripts must be executable and handle errors gracefully.

Validation

We will test your solution by:

  1. Running load_data.py with sample ArXiv papers
  2. Testing all five query patterns with query_papers.py
  3. Starting your API server and testing all endpoints
  4. Verifying your EC2 deployment is accessible
  5. Checking your README answers all analysis questions
  6. Validating denormalization is implemented correctly
  7. Testing query performance for various access patterns

Your API server must respond to all endpoints within 200ms for queries on tables with up to 500 papers.