Homework #2: HTTP Services, ML Embeddings, and AWS Fundamentals
EE 547: Fall 2025
Assigned: 16 September
Due: Monday, 29 September at 23:59
Submission: Gradescope via GitHub repository
Prerequisites:
- Docker Desktop must be installed and running on your machine
- AWS CLI configured with valid credentials for Problem 3
- PyTorch installed for Problem 2 (can be inside container)
- Use data files from HW#1 Problem 2 output (or provided sample data)
Overview
This assignment introduces HTTP services, machine learning embeddings, and AWS fundamentals. You will build API servers, train embedding models, interact with AWS services, and integrate ML inference into web services.
Problem 1: HTTP API Server for ArXiv Papers
Use only Python standard library modules:
- http.server (standard library)
- urllib.parse (standard library)
- json (standard library)
- re (standard library)
- sys, os, datetime (standard library)
Do not use flask, django, fastapi, requests, or any other web frameworks or external HTTP libraries.
Build a containerized HTTP server that serves ArXiv paper metadata from your HW#1 Problem 2 output.
Part A: Data Source
Your server must load ArXiv paper data from HW#1 Problem 2 output files:
- papers.json - Array of paper metadata with abstracts and statistics
- corpus_analysis.json - Global corpus analysis with word frequencies
Part B: HTTP Server Implementation
Create arxiv_server.py that implements a basic HTTP server with the following endpoints:
Required Endpoints:
GET /papers - Return list of all papers

[
{
"arxiv_id": "2301.12345",
"title": "Paper Title",
"authors": ["Author One", "Author Two"],
"categories": ["cs.LG", "cs.AI"]
},
...
]

GET /papers/{arxiv_id} - Return full paper details

{
"arxiv_id": "2301.12345",
"title": "Paper Title",
"authors": ["Author One", "Author Two"],
"abstract": "Full abstract text...",
"categories": ["cs.LG", "cs.AI"],
"published": "2023-01-15T10:30:00Z",
"abstract_stats": {
"total_words": 150,
"unique_words": 85,
"total_sentences": 8
}
}

GET /search?q={query} - Search papers by title and abstract

{
"query": "machine learning",
"results": [
{
"arxiv_id": "2301.12345",
"title": "Paper Title",
"match_score": 3,
"matches_in": ["title", "abstract"]
}
]
}

GET /stats - Return corpus statistics

{
"total_papers": 20,
"total_words": 15000,
"unique_words": 2500,
"top_10_words": [
{"word": "model", "frequency": 145},
{"word": "data", "frequency": 132}
],
"category_distribution": {
"cs.LG": 12,
"cs.AI": 8
}
}
Error Handling:
- Return HTTP 404 for unknown paper IDs or invalid endpoints
- Return HTTP 400 for malformed search queries
- Return HTTP 500 for server errors with JSON error message
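All routing can be done with http.server and urllib.parse alone. The following is a minimal, illustrative sketch only (handler and variable names such as ArxivHandler and PAPERS are not required, and the /search and /stats routes are omitted):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse

PAPERS = {}  # arxiv_id -> paper dict, loaded from papers.json at startup

class ArxivHandler(BaseHTTPRequestHandler):
    def _send_json(self, obj, status=200):
        body = json.dumps(obj).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        path = urlparse(self.path).path
        parts = [p for p in path.split("/") if p]
        if path == "/papers":
            # Summary list: only the fields required by the specification
            summaries = [{"arxiv_id": p["arxiv_id"], "title": p["title"],
                          "authors": p["authors"], "categories": p["categories"]}
                         for p in PAPERS.values()]
            self._send_json(summaries)
        elif len(parts) == 2 and parts[0] == "papers":
            paper = PAPERS.get(parts[1])
            if paper is None:
                self._send_json({"error": "paper not found"}, status=404)
            else:
                self._send_json(paper)
        else:
            self._send_json({"error": "unknown endpoint"}, status=404)

if __name__ == "__main__":
    HTTPServer(("", 8080), ArxivHandler).serve_forever()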
Part C: Implementation Requirements
Your server must:
Command Line Arguments: Accept port number as argument (default 8080)
python arxiv_server.py [port]

Data Loading: Load JSON data at startup, handle missing files gracefully
Search Implementation:
- Case-insensitive search in titles and abstracts
- Count term frequency as match score
- Support multi-word queries (search for all terms); see the scoring sketch at the end of this part
Logging: Print requests to stdout in format:
[2025-09-16 14:30:22] GET /papers - 200 OK (15 results)
[2025-09-16 14:30:25] GET /papers/invalid-id - 404 Not Found
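The search scoring and log format above could be implemented as in this sketch (the helper names score_paper and log_request are illustrative, not required):

import re
from datetime import datetime

def score_paper(query, paper):
    """Count occurrences of each query term in the title and abstract."""
    terms = [t for t in query.lower().split() if t]
    fields = {"title": paper["title"].lower(),
              "abstract": paper.get("abstract", "").lower()}
    score, matched = 0, set()
    for term in terms:
        for name, text in fields.items():
            hits = len(re.findall(re.escape(term), text))
            if hits:
                score += hits
                matched.add(name)
    # Multi-word queries: every term must appear somewhere to count as a match
    if terms and not all(any(t in text for text in fields.values()) for t in terms):
        return 0, []
    return score, sorted(matched)

def log_request(method, path, status, detail=""):
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    suffix = f" ({detail})" if detail else ""
    print(f"[{stamp}] {method} {path} - {status}{suffix}")

For example, log_request("GET", "/papers", "200 OK", "15 results") reproduces the first log line shown above.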
Part D: Dockerfile
Create a Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_server.py /app/
COPY sample_data/ /app/sample_data/
EXPOSE 8080
ENTRYPOINT ["python", "/app/arxiv_server.py"]
CMD ["8080"]Part E: Build and Run Scripts
Create build.sh:
#!/bin/bash
docker build -t arxiv-server:latest .

Create run.sh:
#!/bin/bash
# Check for port argument
PORT=${1:-8080}
# Validate port is numeric
if ! [[ "$PORT" =~ ^[0-9]+$ ]]; then
echo "Error: Port must be numeric"
exit 1
fi
# Check port range
if [ "$PORT" -lt 1024 ] || [ "$PORT" -gt 65535 ]; then
echo "Error: Port must be between 1024 and 65535"
exit 1
fi
echo "Starting ArXiv API server on port $PORT"
echo "Access at: http://localhost:$PORT"
echo ""
echo "Available endpoints:"
echo " GET /papers"
echo " GET /papers/{arxiv_id}"
echo " GET /search?q={query}"
echo " GET /stats"
echo ""
# Run container
docker run --rm \
--name arxiv-server \
-p "$PORT:8080" \
arxiv-server:latest

Part F: Testing
Create test.sh:
#!/bin/bash
# Start server in background
./run.sh 8081 &
SERVER_PID=$!
# Wait for startup
echo "Waiting for server startup..."
sleep 3
# Test endpoints
echo "Testing /papers endpoint..."
curl -s http://localhost:8081/papers | python -m json.tool > /dev/null
if [ $? -eq 0 ]; then
echo "[PASS] /papers endpoint working"
else
echo "[FAIL] /papers endpoint failed"
fi
echo "Testing /stats endpoint..."
curl -s http://localhost:8081/stats | python -m json.tool > /dev/null
if [ $? -eq 0 ]; then
echo "[PASS] /stats endpoint working"
else
echo "[FAIL] /stats endpoint failed"
fi
echo "Testing search endpoint..."
curl -s "http://localhost:8081/search?q=machine" | python -m json.tool > /dev/null
if [ $? -eq 0 ]; then
echo "[PASS] /search endpoint working"
else
echo "[FAIL] /search endpoint failed"
fi
echo "Testing 404 handling..."
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8081/invalid)
if [ "$RESPONSE" = "404" ]; then
echo "[PASS] 404 handling working"
else
echo "[FAIL] 404 handling failed (got $RESPONSE)"
fi
# Cleanup: stop the container (killing the background script alone does not stop it)
docker stop arxiv-server >/dev/null 2>&1
kill $SERVER_PID 2>/dev/null
echo "Tests complete"Deliverables
Your problem1/ directory must contain:
problem1/
├── arxiv_server.py
├── Dockerfile
├── build.sh
├── run.sh
├── test.sh
└── sample_data/
└── papers.json
All shell scripts must be executable (chmod +x *.sh).
Validation
We will test your solution by:
- Running ./build.sh - must complete without errors
- Running ./run.sh 9000 - server must start on port 9000
- Testing all four endpoints with various queries
- Verifying JSON response structure matches specification
- Testing error handling for invalid requests
- Running concurrent requests to test stability
Your server must handle at least 10 concurrent requests without errors and respond to all endpoints within 2 seconds under normal load.
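One standard-library way to meet the concurrency requirement is http.server.ThreadingHTTPServer (Python 3.7+), which handles each request on its own thread. A minimal sketch; the stub handler stands in for your full ArxivHandler:

import sys
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    # Replace with your full handler; this stub only demonstrates the threading setup.
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

if __name__ == "__main__":
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8080
    # Each incoming request is served on its own thread, so a slow client
    # does not block the other endpoints.
    ThreadingHTTPServer(("", port), Handler).serve_forever()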
Problem 2: Text Embedding Training with Autoencoders
Use only the following packages:
- PyTorch (torch, torch.nn, torch.optim)
- Python standard library modules (json, sys, os, re, datetime, collections)
- Basic text processing: You may implement your own tokenization or use simple word splitting
Do not use transformers, sentence-transformers, scikit-learn, numpy (PyTorch tensors only), or pre-trained embedding models.
Train a text autoencoder to generate embeddings for ArXiv paper abstracts. There is a strict parameter limit to encourage efficient architectures.
Part A: Parameter Limit Calculation
Your encoder must have no more than 2,000,000 total parameters (weights and biases combined).
Example Calculation for Planning:
Assumptions for parameter budget:
- Vocabulary size: ~10,000 words (typical for technical abstracts)
- Suggested embedding dimension: 64-256 (your choice)
- Architecture: Input → Hidden → Bottleneck → Hidden → Output
Example architecture (512 → 128 → 512):
- Input layer: 10,000 × 512 + 512 bias = 5,120,512 parameters
- Encoder: 512 × 128 + 128 bias = 65,664 parameters
- Decoder: 128 × 512 + 512 bias = 66,048 parameters
- Output: 512 × 10,000 + 10,000 bias = 5,130,000 parameters
Total: ~10.4M parameters (TOO LARGE)
Better architecture (vocabulary reduced to 5,000; 256 → 64 → 256):
- Input layer: 5,000 × 256 + 256 bias = 1,280,256 parameters
- Encoder: 256 × 64 + 64 bias = 16,448 parameters
- Decoder: 64 × 256 + 256 bias = 16,640 parameters
- Output: 256 × 5,000 + 5,000 bias = 1,285,000 parameters
Total: ~2.6M parameters for the full autoencoder; the encoder alone (first two layers) is ~1.3M, WITHIN the 2,000,000 limit from Part A
Design Constraints:
- Smaller vocabulary (limit to top-K most frequent words)
- Smaller hidden layers
- Efficient embedding dimension (64-256 range suggested)
Your script must print the total parameter count and verify it’s under the limit.
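A short sketch of the parameter check (count_parameters is an illustrative name; PyTorch's numel() does the counting):

import torch.nn as nn

PARAM_LIMIT = 2_000_000  # from Part A

def count_parameters(module: nn.Module) -> int:
    # numel() counts every individual weight and bias entry in each tensor
    return sum(p.numel() for p in module.parameters())

# Example usage, assuming `model` is your TextAutoencoder instance:
# print(f"Total parameters: {count_parameters(model):,}")
# print(f"Encoder parameters: {count_parameters(model.encoder):,} (limit {PARAM_LIMIT:,})")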
Part B: Data Preprocessing
Create train_embeddings.py that loads ArXiv abstracts from HW#1 Problem 2 output.
Required preprocessing steps:
Text cleaning:
def clean_text(text):
    # Convert to lowercase
    # Remove non-alphabetic characters except spaces
    # Split into words
    # Remove very short words (< 2 characters)
    return words

Vocabulary building:
- Extract all unique words from abstracts
- Keep only the top 5,000 most frequent words (parameter budget constraint)
- Create word-to-index mapping
- Reserve index 0 for unknown words
Sequence encoding:
- Convert abstracts to sequences of word indices
- Pad or truncate to fixed length (e.g., 100-200 words)
- Create bag-of-words representation for autoencoder input/output
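A possible end-to-end preprocessing sketch covering the three steps above (the function names clean_text, build_vocab, and bag_of_words are illustrative; only the allowed modules are used):

import re
from collections import Counter
import torch

def clean_text(text):
    text = re.sub(r"[^a-z ]+", " ", text.lower())    # keep letters and spaces only
    return [w for w in text.split() if len(w) >= 2]  # drop very short words

def build_vocab(abstracts, max_words=5000):
    counts = Counter(w for a in abstracts for w in clean_text(a))
    # Index 0 is reserved for unknown words, so keep max_words - 1 real words
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(max_words - 1))}

def bag_of_words(abstract, vocab_to_idx, vocab_size=5000):
    vec = torch.zeros(vocab_size)
    for w in clean_text(abstract):
        vec[vocab_to_idx.get(w, 0)] = 1.0  # binary presence; unknown words map to index 0
    return vec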
Part C: Autoencoder Architecture
Design a simple autoencoder. You may follow this vanilla pattern:
class TextAutoencoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, embedding_dim):
        super().__init__()
        # Encoder: vocab_size → hidden_dim → embedding_dim
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim)
        )
        # Decoder: embedding_dim → hidden_dim → vocab_size
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
            nn.Sigmoid()  # Output probabilities
        )

    def forward(self, x):
        # Encode to bottleneck
        embedding = self.encoder(x)
        # Decode back to vocabulary space
        reconstruction = self.decoder(embedding)
        return reconstruction, embedding

Architecture Requirements:
- Input/output: Bag-of-words vectors (size = vocabulary size)
- Bottleneck layer: Your chosen embedding dimension
- Activation functions: ReLU for hidden layers, Sigmoid for output
- Loss function: Binary cross-entropy (treating as multi-label classification)
Part D: Training Implementation
Your script must accept these command line arguments:
python train_embeddings.py <input_papers.json> <output_dir> [--epochs 50] [--batch_size 32]

Training requirements:
- Data loading: Load abstracts from HW#1 format JSON
- Batch processing: Process data in batches for memory efficiency
- Training loop:
- Forward pass: input bag-of-words → reconstruction + embedding
- Loss: Binary cross-entropy between input and reconstruction
- Backpropagation and parameter updates
- Progress logging: Print loss every epoch
- Parameter counting: Verify and print total parameters at startup
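A minimal training-loop sketch consistent with the requirements above (it assumes the TextAutoencoder from Part C and a tensor X whose rows are bag-of-words vectors; hyperparameter values are illustrative):

import torch
import torch.nn as nn
import torch.optim as optim

def train(model, X, epochs=50, batch_size=32, lr=1e-3):
    criterion = nn.BCELoss()                  # binary cross-entropy on multi-hot targets
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(1, epochs + 1):
        perm = torch.randperm(X.size(0))      # shuffle each epoch
        epoch_loss = 0.0
        for start in range(0, X.size(0), batch_size):
            batch = X[perm[start:start + batch_size]]
            reconstruction, _ = model(batch)           # forward pass
            loss = criterion(reconstruction, batch)    # reconstruct the input
            optimizer.zero_grad()
            loss.backward()                            # backpropagation
            optimizer.step()                           # parameter update
            epoch_loss += loss.item() * batch.size(0)
        print(f"Epoch {epoch}/{epochs}, Loss: {epoch_loss / X.size(0):.4f}")
    return model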
Example training output:
Loading abstracts from papers.json...
Found 157 abstracts
Building vocabulary from 23,450 words...
Vocabulary size: 5000 words
Model architecture: 5000 → 256 → 64 → 256 → 5000
Total parameters: 1,598,720 (under 2,000,000 limit)
Training autoencoder...
Epoch 10/50, Loss: 0.2847
Epoch 20/50, Loss: 0.1923
Epoch 30/50, Loss: 0.1654
...
Training complete in 127.3 seconds
Part E: Output Generation
Your script must save the following files to the output directory:
File 1: model.pth - Trained PyTorch model
torch.save({
'model_state_dict': model.state_dict(),
'vocab_to_idx': vocab_to_idx,
'model_config': {
'vocab_size': vocab_size,
'hidden_dim': hidden_dim,
'embedding_dim': embedding_dim
}
}, 'model.pth')

File 2: embeddings.json - Generated embeddings for all papers
[
{
"arxiv_id": "2301.12345",
"embedding": [0.123, -0.456, 0.789, ...], // 64-256 dimensional
"reconstruction_loss": 0.0234
},
...
]

File 3: vocabulary.json - Vocabulary mapping
{
"vocab_to_idx": {"word1": 1, "word2": 2, ...},
"idx_to_vocab": {"1": "word1", "2": "word2", ...},
"vocab_size": 5000,
"total_words": 23450
}

File 4: training_log.json - Training metadata
{
"start_time": "2025-09-16T14:30:00Z",
"end_time": "2025-09-16T14:32:07Z",
"epochs": 50,
"final_loss": 0.1234,
"total_parameters": 1598720,
"papers_processed": 157,
"embedding_dimension": 64
}

Part F: Docker Configuration
Create Dockerfile:
FROM python:3.11-slim
# Install PyTorch (CPU only for smaller image)
RUN pip install torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
WORKDIR /app
COPY train_embeddings.py /app/
COPY requirements.txt /app/
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "/app/train_embeddings.py"]Create requirements.txt:
# PyTorch installed separately in Dockerfile
# Add any other minimal dependencies here
Part G: Build and Run Scripts
Create build.sh:
#!/bin/bash
echo "Building autoencoder training container..."
docker build -t arxiv-embeddings:latest .
echo "Build complete"Create run.sh:
#!/bin/bash
if [ $# -lt 2 ]; then
echo "Usage: $0 <input_papers.json> <output_dir> [epochs] [batch_size]"
echo "Example: $0 ../problem1/sample_data/papers.json output/ 50 32"
exit 1
fi
INPUT_FILE="$1"
OUTPUT_DIR="$2"
EPOCHS="${3:-50}"
BATCH_SIZE="${4:-32}"
# Validate input file exists
if [ ! -f "$INPUT_FILE" ]; then
echo "Error: Input file $INPUT_FILE not found"
exit 1
fi
# Create output directory
mkdir -p "$OUTPUT_DIR"
echo "Training embeddings with the following settings:"
echo " Input: $INPUT_FILE"
echo " Output: $OUTPUT_DIR"
echo " Epochs: $EPOCHS"
echo " Batch size: $BATCH_SIZE"
echo ""
# Run training container
docker run --rm \
--name arxiv-embeddings \
-v "$(realpath $INPUT_FILE)":/data/input/papers.json:ro \
-v "$(realpath $OUTPUT_DIR)":/data/output \
arxiv-embeddings:latest \
/data/input/papers.json /data/output --epochs "$EPOCHS" --batch_size "$BATCH_SIZE"
echo ""
echo "Training complete. Output files:"
ls -la "$OUTPUT_DIR"Deliverables
Your problem2/ directory must contain:
problem2/
├── train_embeddings.py
├── Dockerfile
├── requirements.txt
├── build.sh
└── run.sh
Validation
We will test your implementation by:
- Running ./build.sh - must complete without errors
- Running with sample data: ./run.sh ../problem1/sample_data/papers.json output/
- Verifying parameter count is under 2,000,000
- Checking all output files are generated with correct formats
- Validating embeddings have consistent dimensions
- Testing reconstruction loss decreases during training
- Verifying model can be loaded and used for inference
Your model must train successfully on the provided sample data within 10 minutes on a standard laptop CPU.
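For the last validation item, the checkpoint saved in Part E must be reloadable. A sketch, assuming the checkpoint layout from Part E and that TextAutoencoder can be imported from your train_embeddings.py (guard your script's top-level code with if __name__ == "__main__" so the import has no side effects):

import torch
from train_embeddings import TextAutoencoder

checkpoint = torch.load("model.pth", map_location="cpu")
cfg = checkpoint["model_config"]
model = TextAutoencoder(cfg["vocab_size"], cfg["hidden_dim"], cfg["embedding_dim"])
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

with torch.no_grad():
    bow = torch.zeros(cfg["vocab_size"])   # stand-in for a real bag-of-words vector
    reconstruction, embedding = model(bow.unsqueeze(0))
    print("Embedding dimension:", embedding.shape[-1])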
Problem 3: AWS Resource Inspector
Use only the following packages:
- boto3 (AWS SDK for Python)
- Python standard library modules (json, sys, datetime, argparse, os)
Do not use other AWS libraries, CLI wrappers, or third-party AWS tools beyond boto3.
Create a Python script that lists and inspects AWS resources across your account, providing insight into IAM users, EC2 instances, S3 buckets, and security groups.
Part A: Authentication Setup
Your script must support AWS credential authentication through:
AWS CLI credentials (primary method):
aws configure
# OR
aws configure set aws_access_key_id YOUR_KEY
aws configure set aws_secret_access_key YOUR_SECRET
aws configure set region us-east-1

Environment variables (fallback):
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
Your script must verify authentication at startup using sts:GetCallerIdentity.
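A sketch of this startup check (botocore ships with boto3, so catching its exceptions adds no dependency; the function name verify_credentials is illustrative):

import sys
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

def verify_credentials(region=None):
    try:
        sts = boto3.client("sts", region_name=region)
        identity = sts.get_caller_identity()
        print(f"Authenticated as {identity['Arn']} (account {identity['Account']})")
        return identity
    except (ClientError, NoCredentialsError) as exc:
        print(f"[ERROR] AWS authentication failed: {exc}", file=sys.stderr)
        sys.exit(1)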
Part B: Required AWS Permissions
Your script needs these permissions (minimum required):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sts:GetCallerIdentity",
"iam:ListUsers",
"iam:GetUser",
"iam:ListAttachedUserPolicies",
"ec2:DescribeInstances",
"ec2:DescribeImages",
"ec2:DescribeSecurityGroups",
"s3:ListAllMyBuckets",
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": "*"
}
]
}

Part C: Script Implementation
Create aws_inspector.py with the following command line interface:
python aws_inspector.py [--region REGION] [--output OUTPUT_FILE] [--format json|table]

Arguments:
--region: AWS region to inspect (default: from credentials/config)
--output: Output file path (default: print to stdout)
--format: Output format - 'json' or 'table' (default: json)
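This interface maps directly onto argparse from the standard library; a sketch (help strings and the parse_args name are illustrative):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="List and inspect AWS resources")
    parser.add_argument("--region", default=None,
                        help="AWS region to inspect (default: from credentials/config)")
    parser.add_argument("--output", default=None,
                        help="Output file path (default: print to stdout)")
    parser.add_argument("--format", choices=["json", "table"], default="json",
                        help="Output format")
    return parser.parse_args()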
Part D: Resource Collection
Your script must collect information for these resource types:
1. IAM Users
For each user, collect:
{
"username": "user-name",
"user_id": "AIDACKEXAMPLE",
"arn": "arn:aws:iam::123456789012:user/user-name",
"create_date": "2025-01-15T10:30:00Z",
"last_activity": "2025-09-10T14:20:00Z", # PasswordLastUsed if available
"attached_policies": [
{
"policy_name": "PowerUserAccess",
"policy_arn": "arn:aws:iam::aws:policy/PowerUserAccess"
}
]
}

2. EC2 Instances
For each instance, collect:
{
"instance_id": "i-1234567890abcdef0",
"instance_type": "t3.micro",
"state": "running",
"public_ip": "54.123.45.67",
"private_ip": "10.0.1.100",
"availability_zone": "us-east-1a",
"launch_time": "2025-09-15T08:00:00Z",
"ami_id": "ami-0abcdef1234567890",
"ami_name": "Amazon Linux 2023 AMI",
"security_groups": ["sg-12345678", "sg-87654321"],
"tags": {
"Name": "my-instance",
"Environment": "development"
}
}

3. S3 Buckets
For each bucket, collect:
{
"bucket_name": "my-example-bucket",
"creation_date": "2025-08-20T12:00:00Z",
"region": "us-east-1",
"object_count": 47, # Approximate from ListObjects
"size_bytes": 1024000 # Approximate total size
}

4. Security Groups
For each security group, collect:
{
"group_id": "sg-12345678",
"group_name": "default",
"description": "Default security group",
"vpc_id": "vpc-12345678",
"inbound_rules": [
{
"protocol": "tcp",
"port_range": "22-22",
"source": "0.0.0.0/0"
}
],
"outbound_rules": [
{
"protocol": "all",
"port_range": "all",
"destination": "0.0.0.0/0"
}
]
}

Part E: Output Formats
JSON Format (Default)
{
"account_info": {
"account_id": "123456789012",
"user_arn": "arn:aws:iam::123456789012:user/student",
"region": "us-east-1",
"scan_timestamp": "2025-09-16T14:30:00Z"
},
"resources": {
"iam_users": [...],
"ec2_instances": [...],
"s3_buckets": [...],
"security_groups": [...]
},
"summary": {
"total_users": 3,
"running_instances": 2,
"total_buckets": 5,
"security_groups": 8
}
}

Table Format
AWS Account: 123456789012 (us-east-1)
Scan Time: 2025-09-16 14:30:00 UTC
IAM USERS (3 total)
Username Create Date Last Activity Policies
student-user 2025-01-15 2025-09-10 2
admin-user 2025-02-01 2025-09-15 1
EC2 INSTANCES (2 running, 1 stopped)
Instance ID Type State Public IP Launch Time
i-1234567890abcdef0 t3.micro running 54.123.45.67 2025-09-15 08:00
i-0987654321fedcba0 t3.small stopped - 2025-09-10 12:30
S3 BUCKETS (5 total)
Bucket Name Region Created Objects Size (MB)
my-example-bucket us-east-1 2025-08-20 47 ~1.0
data-backup-bucket us-west-2 2025-07-15 234 ~15.2
SECURITY GROUPS (8 total)
Group ID Name VPC ID Inbound Rules
sg-12345678 default vpc-12345678 1
sg-87654321 web-servers vpc-12345678 2
Part F: Error Handling
Your script must handle these error conditions gracefully:
- Authentication failures: Print clear error message and exit
- Permission denied: Skip resource type, log warning, continue
- Network timeouts: Retry once, then skip resource
- Invalid regions: Validate region exists before proceeding
- Empty resources: Handle accounts with no resources of a type
Example error output:
[WARNING] Access denied for IAM operations - skipping user enumeration
[WARNING] No EC2 instances found in us-east-1
[ERROR] Failed to access S3 bucket 'private-bucket': Access Denied
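One way to implement this policy is to wrap each collector in a helper that retries transient failures once and downgrades permission errors to warnings. A sketch (collect_safely is an illustrative name):

import botocore.exceptions

def collect_safely(label, collect_fn):
    """Run collect_fn(); warn and return [] on access or network problems."""
    for attempt in (1, 2):
        try:
            return collect_fn()
        except botocore.exceptions.ClientError as exc:
            code = exc.response.get("Error", {}).get("Code", "")
            if code in ("AccessDenied", "AccessDeniedException", "UnauthorizedOperation"):
                print(f"[WARNING] Access denied for {label} - skipping")
                return []
            if attempt == 1:
                continue       # one retry for transient failures
            print(f"[ERROR] Failed to collect {label}: {code}")
            return []
        except botocore.exceptions.EndpointConnectionError:
            if attempt == 1:
                continue
            print(f"[ERROR] Network problem while collecting {label} - skipping")
            return []

Example use: users = collect_safely("IAM users", lambda: iam.list_users()["Users"]).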
Part G: Testing Script
Create test.sh:
#!/bin/bash
echo "Testing AWS Inspector Script"
echo "============================"
# Test 1: Verify authentication
echo "Test 1: Authentication check"
python aws_inspector.py --region us-east-1 --format json > /dev/null
if [ $? -eq 0 ]; then
echo "[PASS] Authentication successful"
else
echo "[FAIL] Authentication failed"
exit 1
fi
# Test 2: JSON output format
echo "Test 2: JSON output format"
python aws_inspector.py --region us-east-1 --format json --output test_output.json
if [ -f "test_output.json" ]; then
python -m json.tool test_output.json > /dev/null
if [ $? -eq 0 ]; then
echo "[PASS] Valid JSON output generated"
else
echo "[FAIL] Invalid JSON output"
fi
rm test_output.json
else
echo "[FAIL] Output file not created"
fi
# Test 3: Table output format
echo "Test 3: Table output format"
python aws_inspector.py --region us-east-1 --format table | head -10
echo "[PASS] Table format displayed"
# Test 4: Invalid region handling
echo "Test 4: Invalid region handling"
python aws_inspector.py --region invalid-region 2>/dev/null
if [ $? -ne 0 ]; then
echo "[PASS] Invalid region properly rejected"
else
echo "[FAIL] Invalid region accepted"
fi
echo ""
echo "Testing complete. Review output above for any failures."Deliverables
Your problem3/ directory must contain:
problem3/
├── aws_inspector.py
├── requirements.txt
└── test.sh
Create requirements.txt:
boto3>=1.26.0
Validation
We will test your implementation by:
- Running with valid AWS credentials in multiple regions
- Testing both JSON and table output formats
- Verifying all resource types are collected with correct fields
- Testing error handling with restricted permissions
- Checking output file generation and format validity
- Testing with accounts containing no resources
- Validating authentication error handling
Your script must complete scanning within 60 seconds for accounts with moderate resource counts (< 50 resources total) and handle rate limiting gracefully.
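Paginators and botocore's built-in retry modes are one way to keep scans fast while absorbing throttling; a sketch (retry values are illustrative):

import boto3
from botocore.config import Config

# Adaptive retry mode backs off automatically when AWS throttles requests
retry_config = Config(retries={"max_attempts": 5, "mode": "adaptive"})
ec2 = boto3.client("ec2", config=retry_config)

instances = []
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        instances.extend(reservation["Instances"])
print(f"Found {len(instances)} instances")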
Submission Requirements
Your GitHub repository must follow this exact structure:
ee547-hw2-[username]/
├── problem1/
│ ├── arxiv_server.py
│ ├── Dockerfile
│ ├── build.sh
│ ├── run.sh
│ ├── test.sh
│ └── sample_data/
│ └── papers.json
├── problem2/
│ ├── train_embeddings.py
│ ├── Dockerfile
│ ├── build.sh
│ ├── run.sh
│ └── requirements.txt
├── problem3/
│ ├── aws_inspector.py
│ ├── requirements.txt
│ └── test.sh
└── README.md
The README.md in your repository root must contain:
- Your full name and USC email address
- Instructions to run each problem if they differ from specifications
- Any assumptions made about input data formats
- Brief description of your embedding architecture (Problem 2)
Before submitting, ensure:
- All Docker builds complete without errors
- HTTP servers respond to all required endpoints
- AWS script runs without authentication errors
- JSON output matches specified formats exactly
- All shell scripts are executable (chmod +x *.sh)