Homework #2: Docker, ArXiv API, Multi-Container Pipelines

EE 547: Spring 2026

Assignment Details

Assigned: 28 January
Due: Tuesday, 10 February at 23:59

Gradescope: Homework 2 | How to Submit

Requirements
  • Docker Desktop must be installed and running on your machine
  • Use only Python standard library modules unless explicitly permitted

Overview

This assignment introduces containerization using Docker. You will build and run containers, manage data persistence through volumes, and create multi-container applications using Docker Compose.

Getting Started

Download the starter code: hw2-starter.zip

unzip hw2-starter.zip
cd hw2-starter

Problem 1: Docker Basics – HTTP Data Fetcher

Write a Python application that fetches data from HTTP endpoints, processes the responses, and outputs structured results. You will containerize this application using Docker.

Part A: Python HTTP Fetcher

Create a file fetch_and_process.py that fetches data from URLs and computes statistics about the responses.

Your script must accept exactly two command line arguments:

  1. Path to an input file containing URLs (one per line)
  2. Path to output directory

For each URL in the input file, your script must:

  1. Perform an HTTP GET request to the URL
  2. Measure the response time in milliseconds
  3. Capture the HTTP status code
  4. Calculate the size of the response body in bytes
  5. Count the number of words in the response (for text responses only)

Your script must write three files to the output directory:

File 1: responses.json - Array of response data:

[
  {
    "url": "[URL string]",
    "status_code": [integer],
    "response_time_ms": [float],
    "content_length": [integer],
    "word_count": [integer or null],
    "timestamp": "[ISO-8601 UTC]",
    "error": [null or error message string]
  },
  ...
]

File 2: summary.json - Aggregate statistics:

{
  "total_urls": [integer],
  "successful_requests": [integer],
  "failed_requests": [integer],
  "average_response_time_ms": [float],
  "total_bytes_downloaded": [integer],
  "status_code_distribution": {
    "200": [count],
    "404": [count],
    ...
  },
  "processing_start": "[ISO-8601 UTC]",
  "processing_end": "[ISO-8601 UTC]"
}

File 3: errors.log - One line per error:

[ISO-8601 UTC] [URL]: [error message]

Requirements:

  • Use only urllib.request for HTTP requests (no requests library)
  • Use only standard library modules: sys, json, time, datetime, os, re
  • For word counting, treat a word as any maximal sequence of alphanumeric characters
  • If a request fails (connection error, timeout, etc.), record the error and continue
  • Set a timeout of 10 seconds for each request
  • If the response Content-Type header contains “text”, perform a word count; otherwise set word_count to null
  • All timestamps must be UTC in ISO-8601 format with ‘Z’ suffix
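
A minimal sketch of the happy path for one URL, covering the timing, Content-Type check, word-count rule, and timestamp format (function names are illustrative; failure cases are covered by a companion sketch under Part D):

import re
import time
import urllib.request
from datetime import datetime, timezone

def utc_now():
    # ISO-8601 UTC with a 'Z' suffix, e.g. 2026-02-10T17:03:22Z
    return datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

def fetch_one(url):
    start = time.time()
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
        status = resp.status
        content_type = resp.headers.get('Content-Type', '')
    elapsed_ms = (time.time() - start) * 1000.0

    word_count = None
    if 'text' in content_type:
        # A word is a maximal run of alphanumeric characters
        word_count = len(re.findall(r'[A-Za-z0-9]+', body.decode('utf-8', errors='replace')))

    return {
        "url": url,
        "status_code": status,
        "response_time_ms": elapsed_ms,
        "content_length": len(body),
        "word_count": word_count,
        "timestamp": utc_now(),
        "error": None,
    }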

Part B: Dockerfile

Create a Dockerfile that packages your Python application.

FROM python:3.11-slim
WORKDIR /app
COPY fetch_and_process.py /app/
RUN mkdir -p /data/input /data/output
ENTRYPOINT ["python", "/app/fetch_and_process.py"]
CMD ["/data/input/urls.txt", "/data/output"]

The Dockerfile must:

  • Use python:3.11-slim as the base image (no other base image permitted)
  • Set working directory to /app
  • Copy your script to the container
  • Create input and output directories at /data/input and /data/output
  • Use ENTRYPOINT for the Python interpreter and script
  • Use CMD for default arguments (can be overridden at runtime)

Part C: Building and Running

Build your container image:

docker build -t http-fetcher:latest .

Run your container with input/output volume mounts:

docker run --rm \
    -v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro \
    -v "$(pwd)/output":/data/output \
    http-fetcher:latest

The volume mounts connect your host filesystem to the container:

  • -v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro mounts your input file read-only
  • -v "$(pwd)/output":/data/output mounts the output directory for results
  • --rm removes the container after execution

On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).
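
Because the default arguments live in CMD, you can override them at run time by appending new arguments after the image name; ENTRYPOINT stays fixed. For example, with a hypothetical alternate URL list:

docker run --rm \
    -v "$(pwd)/other_urls.txt":/data/input/other_urls.txt:ro \
    -v "$(pwd)/output":/data/output \
    http-fetcher:latest /data/input/other_urls.txt /data/output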

Part D: Testing

Create test_urls.txt with the following URLs:

http://httpbin.org/status/200
http://httpbin.org/delay/2
http://httpbin.org/status/404
http://httpbin.org/json
http://httpbin.org/html
https://www.example.com
http://httpbin.org/status/500
http://invalid.url.that.does.not.exist.com

Your application must handle all these cases correctly:

  • Successful responses (200)
  • Delayed responses (testing timeout behavior)
  • Client errors (404)
  • Server errors (500)
  • JSON responses (Content-Type: application/json)
  • HTML responses (Content-Type: text/html)
  • Invalid URLs / DNS failures
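
With urllib, a 4xx/5xx response raises urllib.error.HTTPError (which still carries the status code and body), DNS failures raise urllib.error.URLError, and slow responses beyond the 10-second timeout surface as URLError or a socket timeout. One way to fold these cases into a single helper (a sketch; how you classify and word the error messages is up to you):

import socket
import urllib.error
import urllib.request

def fetch_with_errors(url):
    """Return (status_code, body, error); exactly one of body/error is meaningful."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read(), None
    except urllib.error.HTTPError as e:
        # 4xx/5xx: urllib raises, but the status code and body are still available
        return e.code, e.read(), None
    except urllib.error.URLError as e:
        # DNS failures, refused connections, and most timeouts end up here
        return None, None, str(e.reason)
    except socket.timeout:
        # Timeouts raised while reading the response body
        return None, None, "timeout"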

Validation

Grading Commands

We will validate your submission by running the following commands from your q1/ directory:

docker build -t http-fetcher:latest .
docker run --rm \
    -v "$(pwd)/test_urls.txt":/data/input/urls.txt:ro \
    -v "$(pwd)/output":/data/output \
    http-fetcher:latest

These commands must complete without errors. We will then verify:

  • output/responses.json, output/summary.json, and output/errors.log exist
  • JSON structure and content match the specification
  • Correct behavior with different URL lists

Your container must not require network configuration beyond Docker defaults.

Deliverables

See Submission.

Problem 2: ArXiv Paper Metadata Processor

Build a containerized application that fetches paper metadata from the ArXiv API, processes it, and generates structured output.

Part A: ArXiv API Client

Create a file arxiv_processor.py that queries the ArXiv API and extracts paper metadata.

Your script must accept exactly three command line arguments:

  1. Search query string (e.g., “cat:cs.LG” for machine learning papers)
  2. Maximum number of results to fetch (integer between 1 and 100)
  3. Path to output directory

Your script must perform the following operations:

  1. Query the ArXiv API using the search query
  2. Fetch up to the specified maximum number of results
  3. Extract and process metadata for each paper
  4. Generate text analysis statistics
  5. Write structured output files

ArXiv API endpoint: http://export.arxiv.org/api/query

Query parameters:

  • search_query: Your search string
  • start: Starting index (0-based)
  • max_results: Maximum results to return

Example API call:

http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=10

The API returns XML (parsing guide). You must parse this XML to extract:

  • Paper ID (from the <id> tag, extract just the ID portion after the last ‘/’)
  • Title (from <title>)
  • Authors (from all <author><name> tags)
  • Abstract (from <summary>)
  • Categories (from all <category> tags’ term attribute)
  • Published date (from <published>)
  • Updated date (from <updated>)
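
The feed is Atom, so every element lives in the http://www.w3.org/2005/Atom namespace and xml.etree.ElementTree needs that namespace to find anything. A minimal parsing sketch for the fields above (the namespace URI is standard Atom; everything else here is illustrative):

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = {'atom': 'http://www.w3.org/2005/Atom'}

def fetch_entries(query, max_results):
    params = urllib.parse.urlencode({
        'search_query': query, 'start': 0, 'max_results': max_results})
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        root = ET.fromstring(resp.read())

    papers = []
    for entry in root.findall('atom:entry', ATOM):
        raw_id = entry.findtext('atom:id', default='', namespaces=ATOM)
        papers.append({
            'arxiv_id': raw_id.rsplit('/', 1)[-1],  # text after the last '/'
            'title': entry.findtext('atom:title', default='', namespaces=ATOM).strip(),
            'authors': [a.findtext('atom:name', default='', namespaces=ATOM)
                        for a in entry.findall('atom:author', ATOM)],
            'abstract': entry.findtext('atom:summary', default='', namespaces=ATOM).strip(),
            'categories': [c.get('term') for c in entry.findall('atom:category', ATOM)],
            'published': entry.findtext('atom:published', default='', namespaces=ATOM),
            'updated': entry.findtext('atom:updated', default='', namespaces=ATOM),
        })
    return papers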

Part B: Text Processing

For each paper’s abstract, compute the following:

  1. Word frequency analysis:

    • Total word count
    • Unique word count
    • Top 20 most frequent words (excluding stopwords)
    • Average word length
  2. Sentence analysis:

    • Total sentence count (split on ‘.’, ‘!’, ‘?’)
    • Average words per sentence
    • Longest sentence (by word count)
    • Shortest sentence (by word count)
  3. Technical term extraction:

    • Extract all words containing uppercase letters (e.g., “LSTM”, “GPU”)
    • Extract all words containing numbers (e.g., “3D”, “ResNet50”)
    • Extract all hyphenated terms (e.g., “state-of-the-art”, “pre-trained”)

Use the stopwords list provided in the starter code:

Code: stopwords.py
"""
Stopwords for text analysis.
"""

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
             'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
             'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
             'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
             'can', 'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
             'we', 'they', 'what', 'which', 'who', 'when', 'where', 'why', 'how',
             'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other', 'some',
             'such', 'as', 'also', 'very', 'too', 'only', 'so', 'than', 'not'}
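
One possible shape for the per-abstract statistics in Part B, using the STOPWORDS set above (the tokenizer below treats hyphenated terms as single words, which is one reasonable reading of the spec, not the only one):

import re
from collections import Counter
from stopwords import STOPWORDS

def tokenize(text):
    # Case is preserved here; lowercase only when counting frequencies
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

def abstract_stats(text):
    words = tokenize(text)
    freq = Counter(w.lower() for w in words if w.lower() not in STOPWORDS)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    return {
        "total_words": len(words),
        "unique_words": len({w.lower() for w in words}),
        "top_20": freq.most_common(20),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "total_sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }

def technical_terms(text):
    words = tokenize(text)
    return {
        "uppercase_terms": sorted({w for w in words if any(c.isupper() for c in w)}),
        "numeric_terms": sorted({w for w in words if any(c.isdigit() for c in w)}),
        "hyphenated_terms": sorted({w for w in words if '-' in w}),
    }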

Part C: Output Files

Your script must write three files to the output directory:

File 1: papers.json - Array of paper metadata:

[
  {
    "arxiv_id": "[paper ID]",
    "title": "[paper title]",
    "authors": ["author1", "author2", ...],
    "abstract": "[full abstract text]",
    "categories": ["cat1", "cat2", ...],
    "published": "[ISO-8601 UTC]",
    "updated": "[ISO-8601 UTC]",
    "abstract_stats": {
      "total_words": [integer],
      "unique_words": [integer],
      "total_sentences": [integer],
      "avg_words_per_sentence": [float],
      "avg_word_length": [float]
    }
  },
  ...
]

File 2: corpus_analysis.json - Aggregate analysis across all papers:

{
  "query": "[search query used]",
  "papers_processed": [integer],
  "processing_timestamp": "[ISO-8601 UTC]",
  "corpus_stats": {
    "total_abstracts": [integer],
    "total_words": [integer],
    "unique_words_global": [integer],
    "avg_abstract_length": [float],
    "longest_abstract_words": [integer],
    "shortest_abstract_words": [integer]
  },
  "top_50_words": [
    {"word": "[word1]", "frequency": [count], "documents": [count]},
    ...
  ],
  "technical_terms": {
    "uppercase_terms": ["TERM1", "TERM2", ...],
    "numeric_terms": ["term1", "term2", ...],
    "hyphenated_terms": ["term-1", "term-2", ...]
  },
  "category_distribution": {
    "cs.LG": [count],
    "cs.AI": [count],
    ...
  }
}

File 3: processing.log - Processing log with one line per event:

[ISO-8601 UTC] Starting ArXiv query: [query]
[ISO-8601 UTC] Fetched [N] results from ArXiv API
[ISO-8601 UTC] Processing paper: [arxiv_id]
[ISO-8601 UTC] Completed processing: [N] papers in [X.XX] seconds

Part D: Error Handling

Your script must handle the following error conditions:

  1. Network errors: If the ArXiv API is unreachable, write the error to the log and exit with code 1
  2. Invalid XML: If the API returns malformed XML, log the error and continue with other papers
  3. Missing fields: If a paper lacks required fields, skip it and log a warning
  4. Rate limiting: If you receive HTTP 429, wait 3 seconds and retry (maximum 3 attempts)
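
For item 4, a small retry wrapper around the API request is enough (a sketch; how you log and surface a final failure is up to you):

import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=3, wait_seconds=3):
    # Retry only on HTTP 429; any other error propagates to the caller
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < attempts:
                time.sleep(wait_seconds)
                continue
            raise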

Requirements:

  • Use only standard library modules: sys, json, urllib.request, xml.etree.ElementTree, datetime, time, re, os
  • Word frequency counting must be case-insensitive
  • Preserve the original case of words in the output
  • Handle Unicode properly (ArXiv abstracts often contain mathematical symbols)
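
For the Unicode requirement, one simple precaution is to write JSON as UTF-8 without ASCII escaping so mathematical symbols and accented names survive intact (an assumption about output encoding, not a stated requirement):

import json

def write_json(path, data):
    # utf-8 plus ensure_ascii=False keeps non-ASCII characters readable in the output
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)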

Part E: Dockerfile

Create a Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY arxiv_processor.py stopwords.py /app/
RUN mkdir -p /data/output
ENTRYPOINT ["python", "/app/arxiv_processor.py"]

Part F: Building and Running

Build your container image:

docker build -t arxiv-processor:latest .

Run your container:

docker run --rm \
    -v "$(pwd)/output":/data/output \
    arxiv-processor:latest \
    "cat:cs.LG" 10 /data/output

The arguments are passed directly to your Python script:

  • "cat:cs.LG" - the search query
  • 10 - maximum results to fetch
  • /data/output - output directory inside the container

On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).

Part G: Testing

Test your container with various queries:

# Machine Learning papers
docker run --rm -v "$(pwd)/output_ml":/data/output \
    arxiv-processor:latest "cat:cs.LG" 5 /data/output

# Search by author
docker run --rm -v "$(pwd)/output_author":/data/output \
    arxiv-processor:latest "au:LeCun" 3 /data/output

# Search by title keyword
docker run --rm -v "$(pwd)/output_title":/data/output \
    arxiv-processor:latest "ti:transformer" 10 /data/output

Validation

Grading Commands

We will validate your submission by running the following commands from your q2/ directory:

docker build -t arxiv-processor:latest .
docker run --rm \
    -v "$(pwd)/output":/data/output \
    arxiv-processor:latest \
    "cat:cs.LG" 10 /data/output

These commands must complete without errors. We will then verify:

  • output/papers.json, output/corpus_analysis.json, and output/processing.log exist
  • JSON structure and content match the specification
  • Word frequencies are accurate
  • Container handles network errors gracefully

Your container must respect ArXiv’s rate limits and terms of service. Do not make more than 1 request per 3 seconds to avoid being blocked.

Deliverables

See Submission.

Problem 3: Multi-Container Text Processing Pipeline with Docker Compose

Build a multi-container application that processes web content through sequential stages. Containers coordinate through a shared filesystem, demonstrating batch processing patterns used in data pipelines.

Architecture

Three containers process data in sequence:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   fetcher   │────▶│  processor  │────▶│  analyzer   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
    /shared/            /shared/            /shared/
    ├── raw/            ├── processed/      ├── analysis/
    └── status/         └── status/         └── status/

Containers communicate through filesystem markers:

  • Each container monitors /shared/status/ for its input signal
  • Processing stages write completion markers when finished
  • Data flows through /shared/ subdirectories
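
The waiting half of this protocol is simply polling for a marker file, the same pattern the provided fetcher uses for its input file. A sketch of a reusable helper (names are illustrative):

import os
import time

def wait_for_marker(path, poll_seconds=2):
    # Block until an upstream stage writes its completion marker to /shared/status/
    while not os.path.exists(path):
        print(f"Waiting for {path}...", flush=True)
        time.sleep(poll_seconds)

# The processor, for example, would start with:
#   wait_for_marker("/shared/status/fetch_complete.json")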

Part A: Container 1 - Data Fetcher

The fetcher is provided in the starter code. Study this code to understand the coordination pattern.

Code: fetcher/fetch.py
#!/usr/bin/env python3
"""
Data Fetcher - downloads URLs and writes to shared volume.
"""
import json
import os
import sys
import time
import urllib.request
from datetime import datetime, timezone

def main():
    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher starting", flush=True)

    # Ensure the input directory exists (run_pipeline.sh copies urls.txt into it),
    # then wait for the input file to appear
    os.makedirs("/shared/input", exist_ok=True)
    input_file = "/shared/input/urls.txt"
    while not os.path.exists(input_file):
        print(f"Waiting for {input_file}...", flush=True)
        time.sleep(2)

    # Read URLs
    with open(input_file, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]

    # Create output directories
    os.makedirs("/shared/raw", exist_ok=True)
    os.makedirs("/shared/status", exist_ok=True)

    # Fetch each URL
    results = []
    for i, url in enumerate(urls, 1):
        output_file = f"/shared/raw/page_{i}.html"
        try:
            print(f"Fetching {url}...", flush=True)
            with urllib.request.urlopen(url, timeout=10) as response:
                content = response.read()
                with open(output_file, 'wb') as f:
                    f.write(content)
            results.append({
                "url": url,
                "file": f"page_{i}.html",
                "size": len(content),
                "status": "success"
            })
        except Exception as e:
            results.append({
                "url": url,
                "file": None,
                "error": str(e),
                "status": "failed"
            })
        time.sleep(1)  # Rate limiting

    # Write completion status
    status = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "urls_processed": len(urls),
        "successful": sum(1 for r in results if r["status"] == "success"),
        "failed": sum(1 for r in results if r["status"] == "failed"),
        "results": results
    }

    with open("/shared/status/fetch_complete.json", 'w') as f:
        json.dump(status, f, indent=2)

    print(f"[{datetime.now(timezone.utc).isoformat()}] Fetcher complete", flush=True)

if __name__ == "__main__":
    main()

Create fetcher/Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY fetch.py /app/
CMD ["python", "-u", "/app/fetch.py"]

The -u flag disables output buffering to ensure real-time logging.

Part B: Container 2 - HTML Processor

Create processor/process.py that extracts and analyzes text from HTML files.

Required processing operations:

  1. Wait for /shared/status/fetch_complete.json
  2. Read all HTML files from /shared/raw/
  3. Extract text content using regex (not BeautifulSoup)
  4. Extract all links (href attributes)
  5. Extract all images (src attributes)
  6. Count words, sentences, paragraphs
  7. Save processed data to /shared/processed/
  8. Create /shared/status/process_complete.json

Text extraction requirements:

def strip_html(html_content):
    """Remove HTML tags and extract text."""
    # Remove script and style elements
    html_content = re.sub(r'<script[^>]*>.*?</script>', '', html_content, flags=re.DOTALL | re.IGNORECASE)
    html_content = re.sub(r'<style[^>]*>.*?</style>', '', html_content, flags=re.DOTALL | re.IGNORECASE)

    # Extract links before removing tags
    links = re.findall(r'href=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)

    # Extract images
    images = re.findall(r'src=[\'"]?([^\'" >]+)', html_content, flags=re.IGNORECASE)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', html_content)

    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text, links, images

Output format for each processed file (/shared/processed/page_N.json):

{
    "source_file": "page_N.html",
    "text": "[extracted text]",
    "statistics": {
        "word_count": [integer],
        "sentence_count": [integer],
        "paragraph_count": [integer],
        "avg_word_length": [float]
    },
    "links": ["url1", "url2", ...],
    "images": ["src1", "src2", ...],
    "processed_at": "[ISO-8601 UTC]"
}
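
Putting the pieces together, the processor's main loop can look roughly like the sketch below. It assumes strip_html from above and a wait helper like the one sketched in the Architecture section; most statistics are elided, and since the contents of process_complete.json are not specified beyond its existence, the fields written there are placeholders:

import glob
import json
import os
from datetime import datetime, timezone

# assumes strip_html() and wait_for_marker() are defined earlier in this file

def main():
    wait_for_marker("/shared/status/fetch_complete.json")
    os.makedirs("/shared/processed", exist_ok=True)

    processed = []
    for html_path in sorted(glob.glob("/shared/raw/*.html")):
        with open(html_path, 'r', encoding='utf-8', errors='replace') as f:
            text, links, images = strip_html(f.read())
        words = text.split()
        record = {
            "source_file": os.path.basename(html_path),
            "text": text,
            "statistics": {
                "word_count": len(words),
                # sentence_count, paragraph_count, avg_word_length computed similarly
            },
            "links": links,
            "images": images,
            "processed_at": datetime.now(timezone.utc).isoformat(),
        }
        out_name = os.path.basename(html_path).replace(".html", ".json")
        with open(f"/shared/processed/{out_name}", 'w') as f:
            json.dump(record, f, indent=2)
        processed.append(out_name)

    with open("/shared/status/process_complete.json", 'w') as f:
        json.dump({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "files_processed": len(processed),
        }, f, indent=2)

if __name__ == "__main__":
    main()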

Create processor/Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY process.py /app/
CMD ["python", "-u", "/app/process.py"]

Part C: Container 3 - Text Analyzer

Create analyzer/analyze.py that performs corpus-wide analysis.

Required analysis operations:

  1. Wait for /shared/status/process_complete.json
  2. Read all processed files from /shared/processed/
  3. Compute global statistics:
    • Word frequency distribution (top 100 words)
    • Document similarity matrix (Jaccard similarity)
    • N-gram extraction (bigrams and trigrams)
    • Readability metrics
  4. Save to /shared/analysis/final_report.json

Similarity calculation:

def jaccard_similarity(doc1_words, doc2_words):
    """Calculate Jaccard similarity between two documents."""
    set1 = set(doc1_words)
    set2 = set(doc2_words)
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union) if union else 0.0
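
For the n-gram requirement, bigrams and trigrams are just counts of adjacent word pairs and triples (a sketch; the tokenization should match whatever you use for word frequencies):

from collections import Counter

def top_ngrams(words, n, k=10):
    # Count n-word windows (n=2 for bigrams, n=3 for trigrams)
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams.most_common(k)

# e.g. top_ngrams(all_corpus_words, 2) -> [("machine learning", 45), ...],
# which you would then reshape into the {"bigram": ..., "count": ...} entries.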

Final report structure (/shared/analysis/final_report.json):

{
    "processing_timestamp": "[ISO-8601 UTC]",
    "documents_processed": [integer],
    "total_words": [integer],
    "unique_words": [integer],
    "top_100_words": [
        {"word": "the", "count": 523, "frequency": 0.042},
        ...
    ],
    "document_similarity": [
        {"doc1": "page_1.json", "doc2": "page_2.json", "similarity": 0.234},
        ...
    ],
    "top_bigrams": [
        {"bigram": "machine learning", "count": 45},
        ...
    ],
    "readability": {
        "avg_sentence_length": [float],
        "avg_word_length": [float],
        "complexity_score": [float]
    }
}

Create analyzer/Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY analyze.py /app/
CMD ["python", "-u", "/app/analyze.py"]

Part D: Docker Compose Configuration

The docker-compose.yaml is provided in the starter code:

Code: docker-compose.yaml
version: '3.8'

services:
  fetcher:
    build: ./fetcher
    container_name: pipeline-fetcher
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1

  processor:
    build: ./processor
    container_name: pipeline-processor
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - fetcher

  analyzer:
    build: ./analyzer
    container_name: pipeline-analyzer
    volumes:
      - pipeline-data:/shared
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - processor

volumes:
  pipeline-data:
    name: pipeline-shared-data

Note: depends_on ensures start order but does not wait for container completion. Your Python scripts must implement proper waiting logic.

Part E: Orchestration Script

The run_pipeline.sh orchestration script is provided in the starter code. It handles building containers, starting the pipeline, injecting URLs, monitoring for completion, and extracting results.

Code: run_pipeline.sh
#!/bin/bash
#
# Pipeline orchestration script.
#

if [ $# -lt 1 ]; then
    echo "Usage: $0 <url1> [url2] [url3] ..."
    echo "Example: $0 https://example.com https://wikipedia.org"
    exit 1
fi

echo "Starting Multi-Container Pipeline"
echo "================================="

# Clean previous runs
docker-compose down -v 2>/dev/null

# Create temporary directory
TEMP_DIR=$(mktemp -d)
trap "rm -rf $TEMP_DIR" EXIT

# Create URL list
for url in "$@"; do
    echo "$url" >> "$TEMP_DIR/urls.txt"
done

echo "URLs to process:"
cat "$TEMP_DIR/urls.txt"
echo ""

# Build containers
echo "Building containers..."
docker-compose build --quiet

# Start pipeline
echo "Starting pipeline..."
docker-compose up -d

# Wait for containers to initialize
sleep 3

# Inject URLs
echo "Injecting URLs..."
docker cp "$TEMP_DIR/urls.txt" pipeline-fetcher:/shared/input/urls.txt

# Monitor completion
echo "Processing..."
MAX_WAIT=300  # 5 minutes timeout
ELAPSED=0

while [ $ELAPSED -lt $MAX_WAIT ]; do
    if docker exec pipeline-analyzer test -f /shared/analysis/final_report.json 2>/dev/null; then
        echo "Pipeline complete"
        break
    fi
    sleep 5
    ELAPSED=$((ELAPSED + 5))
done

if [ $ELAPSED -ge $MAX_WAIT ]; then
    echo "Pipeline timeout after ${MAX_WAIT} seconds"
    docker-compose logs
    docker-compose down
    exit 1
fi

# Extract results
mkdir -p output
docker cp pipeline-analyzer:/shared/analysis/final_report.json output/
docker cp pipeline-analyzer:/shared/status output/

# Cleanup
docker-compose down

# Display summary
if [ -f "output/final_report.json" ]; then
    echo ""
    echo "Results saved to output/final_report.json"
    python3 -m json.tool output/final_report.json | head -20
else
    echo "Pipeline failed - no output generated"
    exit 1
fi

On Windows, run this script under WSL2 or Git Bash, or translate the commands to PowerShell.

Part F: Testing

Create test_urls.txt:

https://www.example.com
https://www.wikipedia.org
https://httpbin.org/html

Test your pipeline:

# Single URL
./run_pipeline.sh https://www.example.com

# Multiple URLs
./run_pipeline.sh https://www.example.com https://www.wikipedia.org https://httpbin.org/html

Debugging

To diagnose pipeline issues:

  1. View container logs:

    docker-compose logs fetcher
    docker-compose logs processor
    docker-compose logs analyzer
  2. Inspect shared volume:

    docker run --rm -v pipeline-shared-data:/shared alpine ls -la /shared/
  3. Check container status:

    docker-compose ps
  4. Enter running container:

    docker exec -it pipeline-fetcher /bin/bash

Validation

Grading Commands

We will validate your submission by running the following commands from your q3/ directory:

docker-compose build
./run_pipeline.sh https://www.example.com https://httpbin.org/html

These commands must complete without errors. We will then verify:

  • Status files appear in correct sequence (fetch_complete.json, process_complete.json)
  • output/final_report.json exists and matches the specification
  • Containers properly wait for dependencies before processing
  • Pipeline handles URL fetch failures gracefully

Deliverables

See Submission.


Submission
README.md
q1/
├── fetch_and_process.py
├── Dockerfile
└── test_urls.txt
q2/
├── arxiv_processor.py
├── stopwords.py
└── Dockerfile
q3/
├── docker-compose.yaml
├── test_urls.txt
├── fetcher/
│   ├── Dockerfile
│   └── fetch.py
├── processor/
│   ├── Dockerfile
│   └── process.py
└── analyzer/
    ├── Dockerfile
    └── analyze.py