Homework #3: HTTP Services, ML Embeddings, and AWS Fundamentals
EE 547: Spring 2026
Assigned: 11 February
Due: Tuesday, 24 February at 23:59
Gradescope: Homework 3 | How to Submit
- Docker Desktop must be installed and running on your machine
- Python 3.11+ required
- Use only Python standard library modules unless explicitly permitted
Overview
This assignment introduces HTTP services, machine learning embeddings, and AWS fundamentals. You will build API servers, train embedding models, and interact with AWS services.
Getting Started
Download the starter code: hw3-starter.zip
unzip hw3-starter.zip
cd hw3-starter
Problem 1: HTTP API Server for ArXiv Papers
Build a containerized HTTP server that serves ArXiv paper metadata from your HW#2 Problem 2 output.
Use only Python standard library modules: http.server, urllib.parse, json, re, sys, os, datetime. Do not use flask, django, fastapi, requests, or any other web frameworks or external HTTP libraries.
Part A: Data Source
Your server must load ArXiv paper data from HW#2 Problem 2 output files:
- papers.json - Array of paper metadata with abstracts and statistics
- corpus_analysis.json - Global corpus analysis with word frequencies
Part B: HTTP Server Implementation
Create arxiv_server.py that implements a basic HTTP server with the following endpoints:
Required Endpoints:
GET /papers - Return list of all papers
[
  {
    "arxiv_id": "2301.12345",
    "title": "Paper Title",
    "authors": ["Author One", "Author Two"],
    "categories": ["cs.LG", "cs.AI"]
  },
  ...
]
GET /papers/{arxiv_id} - Return full paper details
{
  "arxiv_id": "2301.12345",
  "title": "Paper Title",
  "authors": ["Author One", "Author Two"],
  "abstract": "Full abstract text...",
  "categories": ["cs.LG", "cs.AI"],
  "published": "2023-01-15T10:30:00Z",
  "abstract_stats": {
    "total_words": 150,
    "unique_words": 85,
    "total_sentences": 8
  }
}
GET /search?q={query} - Search papers by title and abstract
{
  "query": "machine learning",
  "results": [
    {
      "arxiv_id": "2301.12345",
      "title": "Paper Title",
      "match_score": 3,
      "matches_in": ["title", "abstract"]
    }
  ]
}
GET /stats - Return corpus statistics
{
  "total_papers": 20,
  "total_words": 15000,
  "unique_words": 2500,
  "top_10_words": [
    {"word": "model", "frequency": 145},
    {"word": "data", "frequency": 132}
  ],
  "category_distribution": {
    "cs.LG": 12,
    "cs.AI": 8
  }
}
Error Handling:
- Return HTTP 404 for unknown paper IDs or invalid endpoints
- Return HTTP 400 for malformed search queries
- Return HTTP 500 for server errors with JSON error message
Part C: Implementation Requirements
Your server must:
Command Line Arguments: Accept port number as argument (default 8080)
python arxiv_server.py [port]
Data Loading: Load JSON data at startup, handle missing files gracefully
Search Implementation:
- Case-insensitive search in titles and abstracts
- Count term frequency as match score
- Support multi-word queries (search for all terms)
Logging: Print requests to stdout in format:
[2026-02-16 14:30:22] GET /papers - 200 OK (15 results)
[2026-02-16 14:30:25] GET /papers/invalid-id - 404 Not Found
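To see how these requirements fit together, here is a minimal, non-authoritative sketch using only http.server and urllib.parse. The data file path, the helper names (load_papers, match_score), and the trimmed response fields are illustrative; your server must implement all four endpoints with the full structures specified above and return a JSON error body for 500s.
# Sketch only: routing, search scoring, request logging, and graceful data loading.
import json
import os
import sys
from datetime import datetime
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs

PAPERS = []  # loaded once at startup

def load_papers(path):
    # Handle a missing file gracefully instead of crashing the server.
    if not os.path.exists(path):
        print(f"[WARNING] {path} not found; starting with empty dataset")
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def match_score(query, paper):
    # Case-insensitive term-frequency score over title and abstract.
    terms = query.lower().split()
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    return sum(text.count(t) for t in terms)

class ArxivHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/papers":
            self._send_json(200, [{"arxiv_id": p["arxiv_id"], "title": p["title"]}
                                  for p in PAPERS])
        elif parsed.path == "/search":
            q = parse_qs(parsed.query).get("q", [""])[0].strip()
            if not q:
                self._send_json(400, {"error": "missing query parameter q"})
                return
            scored = [(p, match_score(q, p)) for p in PAPERS]
            self._send_json(200, {"query": q,
                                  "results": [{"arxiv_id": p["arxiv_id"], "match_score": s}
                                              for p, s in scored if s > 0]})
        else:
            self._send_json(404, {"error": "not found"})

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_request(self, code="-", size="-"):
        # Request log in the required style: timestamp, method, path, status.
        print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {self.command} {self.path} - {code}")

if __name__ == "__main__":
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8080
    PAPERS = load_papers("sample_data/papers.json")
    # ThreadingHTTPServer serves requests concurrently; plain HTTPServer is serial.
    ThreadingHTTPServer(("", port), ArxivHandler).serve_forever()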
Part D: Dockerfile
A Dockerfile is provided in the starter code:
Code: Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_server.py /app/
COPY sample_data/ /app/sample_data/
EXPOSE 8080
ENTRYPOINT ["python", "/app/arxiv_server.py"]
CMD ["8080"]
Part E: Building and Running
Build your container image:
docker build -t arxiv-server:latest .
Run your container:
docker run --rm \
-p 8080:8080 \
arxiv-server:latest
The server will be available at http://localhost:8080. To use a different host port:
docker run --rm \
-p 9000:8080 \
arxiv-server:latest
Part F: Testing
With the server running, test your endpoints using curl:
curl -s http://localhost:8080/papers | python -m json.tool
curl -s http://localhost:8080/stats | python -m json.tool
curl -s "http://localhost:8080/search?q=machine" | python -m json.tool
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/invalid
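Grading also exercises concurrent requests, so a local smoke test is worthwhile. The snippet below is one way to do it with the standard library only; the endpoint URL and worker count are examples, not requirements.
# Fire 10 parallel GETs and print the status codes (expect ten 200s).
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://localhost:8080/papers"  # adjust host port if you remapped it

def fetch(_):
    with urlopen(URL, timeout=5) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=10) as pool:
    print(list(pool.map(fetch, range(10))))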
Deliverables
See Submission.
We will validate your submission by running the following commands from your q1/ directory:
docker build -t arxiv-server:latest .
docker run --rm \
-p 8080:8080 \
arxiv-server:latest
These commands must complete without errors. We will then verify:
- All four endpoints respond with correct JSON structure
- Error handling for invalid requests (404, 400)
- Server handles at least 10 concurrent requests without errors
- All endpoints respond within 2 seconds under normal load
Problem 2: Text Embedding Training with Autoencoders
Train a text autoencoder to generate embeddings for ArXiv paper abstracts. There is a strict parameter limit to encourage efficient architectures.
This problem requires PyTorch (torch, torch.nn, torch.optim) and Python standard library modules (json, sys, os, re, datetime, collections). You may implement your own tokenization or use simple word splitting. Do not use transformers, sentence-transformers, scikit-learn, numpy (PyTorch tensors only), or pre-trained embedding models.
Part A: Parameter Limit Calculation
Your autoencoder (encoder and decoder combined) must have no more than 2,000,000 total parameters (weights and biases combined).
Example Calculation for Planning:
Assumptions for parameter budget:
- Vocabulary size: ~10,000 words (typical for technical abstracts)
- Suggested embedding dimension: 64-256 (your choice)
- Architecture: Input → Hidden → Bottleneck → Hidden → Output
Example architecture (512 → 128 → 512):
- Input layer: 10,000 × 512 + 512 bias = 5,120,512 parameters
- Encoder: 512 × 128 + 128 bias = 65,664 parameters
- Decoder: 128 × 512 + 512 bias = 66,048 parameters
- Output: 512 × 10,000 + 10,000 bias = 5,130,000 parameters
Total: ~10.4M parameters (TOO LARGE)
Better architecture (128 → 64 → 128) with a 5,000-word vocabulary:
- Input layer: 5,000 × 128 + 128 bias = 640,128 parameters
- Encoder: 128 × 64 + 64 bias = 8,256 parameters
- Decoder: 64 × 128 + 128 bias = 8,320 parameters
- Output: 128 × 5,000 + 5,000 bias = 645,000 parameters
Total: ~1.3M parameters (WITHIN LIMIT)
Design Constraints:
- Smaller vocabulary (limit to top-K most frequent words)
- Smaller hidden layers
- Efficient embedding dimension (64-256 range suggested)
Your script must print the total parameter count and verify it’s under the limit.
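One straightforward way to meet this requirement is to sum p.numel() over model.parameters(). The nn.Sequential below only illustrates the counting, using the dimensions from the example calculation above; substitute your own model.
import torch.nn as nn

def count_parameters(model):
    # Count every weight and bias tensor in the model.
    return sum(p.numel() for p in model.parameters())

# Illustrative layers for a 5,000-word vocabulary, 128 hidden units, 64-dim embedding.
example = nn.Sequential(
    nn.Linear(5000, 128), nn.ReLU(), nn.Linear(128, 64),   # encoder
    nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 5000),   # decoder
)
total = count_parameters(example)
print(f"Total parameters: {total:,}")   # -> Total parameters: 1,301,704
assert total <= 2_000_000, "parameter budget exceeded"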
Part B: Data Preprocessing
Create train_embeddings.py that loads ArXiv abstracts from HW#2 Problem 2 output.
Required preprocessing steps:
Text cleaning:
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove non-alphabetic characters except spaces
    text = re.sub(r"[^a-z ]+", " ", text)
    # Split into words and remove very short words (< 2 characters)
    words = [w for w in text.split() if len(w) >= 2]
    return words
Vocabulary building:
- Extract all unique words from abstracts
- Keep only the top 5,000 most frequent words (parameter budget constraint)
- Create word-to-index mapping
- Reserve index 0 for unknown words
Sequence encoding:
- Convert abstracts to sequences of word indices
- Pad or truncate to fixed length (e.g., 100-200 words)
- Create bag-of-words representation for autoencoder input/output
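A possible shape for the vocabulary and bag-of-words steps is sketched below. The <UNK> token at index 0 and the multi-hot (word present/absent) encoding are assumptions consistent with the binary cross-entropy loss in Part C; a count-based encoding is also acceptable if you adjust the loss accordingly.
import torch
from collections import Counter

def build_vocab(tokenized_abstracts, top_k=5000):
    # Keep the most frequent words; index 0 is reserved for unknown words.
    counts = Counter(w for words in tokenized_abstracts for w in words)
    vocab_to_idx = {"<UNK>": 0}
    for word, _ in counts.most_common(top_k - 1):
        vocab_to_idx[word] = len(vocab_to_idx)
    return vocab_to_idx

def bag_of_words(words, vocab_to_idx):
    # Multi-hot vector: 1.0 if the word appears in the abstract, else 0.0.
    vec = torch.zeros(len(vocab_to_idx))
    for w in words:
        vec[vocab_to_idx.get(w, 0)] = 1.0
    return vec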
Part C: Autoencoder Architecture
Design a simple autoencoder. You may follow this vanilla pattern:
import torch.nn as nn

class TextAutoencoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, embedding_dim):
        super().__init__()
        # Encoder: vocab_size → hidden_dim → embedding_dim
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim)
        )
        # Decoder: embedding_dim → hidden_dim → vocab_size
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
            nn.Sigmoid()  # Output probabilities
        )

    def forward(self, x):
        # Encode to bottleneck
        embedding = self.encoder(x)
        # Decode back to vocabulary space
        reconstruction = self.decoder(embedding)
        return reconstruction, embedding
Architecture Requirements:
- Input/output: Bag-of-words vectors (size = vocabulary size)
- Bottleneck layer: Your chosen embedding dimension
- Activation functions: ReLU for hidden layers, Sigmoid for output
- Loss function: Binary cross-entropy (treating as multi-label classification)
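As a quick sanity check, you can instantiate the class above with illustrative dimensions and confirm the output shapes; the batch size and dimensions below are examples only.
import torch

model = TextAutoencoder(vocab_size=5000, hidden_dim=128, embedding_dim=64)
batch = torch.zeros(32, 5000)                 # batch of bag-of-words vectors
reconstruction, embedding = model(batch)
print(reconstruction.shape, embedding.shape)  # torch.Size([32, 5000]) torch.Size([32, 64])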
Part D: Training Implementation
Your script must accept these command line arguments:
python train_embeddings.py <input_papers.json> <output_dir> [--epochs 50] [--batch_size 32]
Training requirements:
- Data loading: Load abstracts from HW#2 format JSON
- Batch processing: Process data in batches for memory efficiency
- Training loop:
- Forward pass: input bag-of-words → reconstruction + embedding
- Loss: Binary cross-entropy between input and reconstruction
- Backpropagation and parameter updates
- Progress logging: Print loss every epoch
- Parameter counting: Verify and print total parameters at startup
Example training output:
Loading abstracts from papers.json...
Found 157 abstracts
Building vocabulary from 23,450 words...
Vocabulary size: 5000 words
Model architecture: 5000 → 128 → 64 → 128 → 5000
Total parameters: 1,301,704 (under 2,000,000 limit)
Training autoencoder...
Epoch 10/50, Loss: 0.2847
Epoch 20/50, Loss: 0.1923
Epoch 30/50, Loss: 0.1654
...
Training complete in 127.3 seconds
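A minimal training loop consistent with these requirements might look like the sketch below. The tensor X (one bag-of-words row per abstract), the Adam optimizer, and the learning rate are assumptions, not requirements.
import time
import torch
import torch.nn as nn
import torch.optim as optim

# X: float tensor of shape (num_papers, vocab_size), one bag-of-words row per abstract.
def train(model, X, epochs=50, batch_size=32, lr=1e-3):
    criterion = nn.BCELoss()                       # inputs and targets both lie in [0, 1]
    optimizer = optim.Adam(model.parameters(), lr=lr)
    start = time.time()
    for epoch in range(1, epochs + 1):
        perm = torch.randperm(X.size(0))
        epoch_loss = 0.0
        for i in range(0, X.size(0), batch_size):
            batch = X[perm[i:i + batch_size]]
            reconstruction, _ = model(batch)       # forward pass
            loss = criterion(reconstruction, batch)
            optimizer.zero_grad()
            loss.backward()                        # backpropagation
            optimizer.step()                       # parameter update
            epoch_loss += loss.item() * batch.size(0)
        print(f"Epoch {epoch}/{epochs}, Loss: {epoch_loss / X.size(0):.4f}")
    print(f"Training complete in {time.time() - start:.1f} seconds")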
Part E: Output Generation
Your script must save the following files to the output directory:
File 1: model.pth - Trained PyTorch model
torch.save({
'model_state_dict': model.state_dict(),
'vocab_to_idx': vocab_to_idx,
'model_config': {
'vocab_size': vocab_size,
'hidden_dim': hidden_dim,
'embedding_dim': embedding_dim
}
}, 'model.pth')
File 2: embeddings.json - Generated embeddings for all papers
[
{
"arxiv_id": "2301.12345",
"embedding": [0.123, -0.456, 0.789, ...],
"reconstruction_loss": 0.0234
},
...
]
File 3: vocabulary.json - Vocabulary mapping
{
"vocab_to_idx": {"word1": 1, "word2": 2, ...},
"idx_to_vocab": {"1": "word1", "2": "word2", ...},
"vocab_size": 5000,
"total_words": 23450
}
File 4: training_log.json - Training metadata
{
"start_time": "2026-02-16T14:30:00Z",
"end_time": "2026-02-16T14:32:07Z",
"epochs": 50,
"final_loss": 0.1234,
"total_parameters": 1598720,
"papers_processed": 157,
"embedding_dimension": 64
}
Part F: Docker Configuration
A Dockerfile and requirements.txt are provided in the starter code:
Code: Dockerfile
FROM python:3.11-slim
# Install PyTorch (CPU only for smaller image)
RUN pip install torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
WORKDIR /app
COPY train_embeddings.py /app/
COPY requirements.txt /app/
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "/app/train_embeddings.py"]
Code: requirements.txt
# PyTorch installed separately in Dockerfile
# Add any other minimal dependencies here
Part G: Building and Running
Build your container image:
docker build -t arxiv-embeddings:latest .
Run your container:
docker run --rm \
-v "$(pwd)/papers.json":/data/input/papers.json:ro \
-v "$(pwd)/output":/data/output \
arxiv-embeddings:latest \
/data/input/papers.json /data/output --epochs 50 --batch_size 32
The volume mounts connect your host filesystem to the container:
-v "$(pwd)/papers.json":/data/input/papers.json:romounts your input file read-only-v "$(pwd)/output":/data/outputmounts the output directory for results
On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).
Deliverables
See Submission.
We will validate your submission by running the following commands from your q2/ directory:
docker build -t arxiv-embeddings:latest .
docker run --rm \
-v "$(pwd)/papers.json":/data/input/papers.json:ro \
-v "$(pwd)/output":/data/output \
arxiv-embeddings:latest \
/data/input/papers.json /data/output --epochs 50 --batch_size 32
These commands must complete without errors. We will then verify:
- Parameter count is under 2,000,000
- All output files are generated with correct formats
- Embeddings have consistent dimensions
- Reconstruction loss decreases during training
- Model can be loaded and used for inference
- Training completes within 10 minutes on a standard laptop CPU
Problem 3: AWS Resource Inspector
Create a Python script that lists and inspects AWS resources across your account, providing insight into IAM users, EC2 instances, S3 buckets, and security groups.
This problem requires boto3 (AWS SDK for Python) and Python standard library modules (json, sys, datetime, argparse, os). Do not use other AWS libraries, CLI wrappers, or third-party AWS tools beyond boto3. AWS CLI must be configured with valid credentials.
Part A: Authentication Setup
Your script must support AWS credential authentication through:
AWS CLI credentials (primary method):
aws configure
# OR
aws configure set aws_access_key_id YOUR_KEY
aws configure set aws_secret_access_key YOUR_SECRET
aws configure set region us-west-2
Environment variables (fallback):
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
Your script must verify authentication at startup using sts:GetCallerIdentity.
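A startup check might look like the sketch below; it uses botocore's exception classes, which are installed alongside boto3, and the error-message wording is up to you.
import sys
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

def verify_credentials():
    # Fail fast with a clear message if credentials are missing or invalid.
    try:
        identity = boto3.client("sts").get_caller_identity()
    except (ClientError, NoCredentialsError) as exc:
        print(f"[ERROR] AWS authentication failed: {exc}", file=sys.stderr)
        sys.exit(1)
    return identity  # contains 'Account', 'Arn', and 'UserId'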
Part B: Required AWS Permissions
Your script needs these permissions (minimum required):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sts:GetCallerIdentity",
"iam:ListUsers",
"iam:GetUser",
"iam:ListAttachedUserPolicies",
"ec2:DescribeInstances",
"ec2:DescribeImages",
"ec2:DescribeSecurityGroups",
"s3:ListAllMyBuckets",
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": "*"
}
]
}
Part C: Script Implementation
Create aws_inspector.py with the following command line interface:
python aws_inspector.py [--region REGION] [--output OUTPUT_FILE] [--format json|table]
Arguments:
- --region: AWS region to inspect (default: from credentials/config)
- --output: Output file path (default: print to stdout)
- --format: Output format - 'json' or 'table' (default: json)
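A matching argparse setup could be as simple as the following sketch:
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Inspect AWS resources in an account")
    parser.add_argument("--region", default=None,
                        help="AWS region to inspect (default: from credentials/config)")
    parser.add_argument("--output", default=None,
                        help="output file path (default: print to stdout)")
    parser.add_argument("--format", choices=["json", "table"], default="json",
                        help="output format")
    return parser.parse_args()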
Part D: Resource Collection
Your script must collect information for these resource types:
1. IAM Users
For each user, collect:
{
"username": "user-name",
"user_id": "AIDACKEXAMPLE",
"arn": "arn:aws:iam::123456789012:user/user-name",
"create_date": "2026-01-15T10:30:00Z",
"last_activity": "2026-02-10T14:20:00Z",
"attached_policies": [
{
"policy_name": "PowerUserAccess",
"policy_arn": "arn:aws:iam::aws:policy/PowerUserAccess"
}
]
}
2. EC2 Instances
For each instance, collect:
{
"instance_id": "i-1234567890abcdef0",
"instance_type": "t3.micro",
"state": "running",
"public_ip": "54.123.45.67",
"private_ip": "10.0.1.100",
"availability_zone": "us-west-2a",
"launch_time": "2026-02-15T08:00:00Z",
"ami_id": "ami-0abcdef1234567890",
"ami_name": "Amazon Linux 2023 AMI",
"security_groups": ["sg-12345678", "sg-87654321"],
"tags": {
"Name": "my-instance",
"Environment": "development"
}
}
3. S3 Buckets
For each bucket, collect:
{
"bucket_name": "my-example-bucket",
"creation_date": "2026-01-20T12:00:00Z",
"region": "us-west-2",
"object_count": 47,
"size_bytes": 1024000
}
4. Security Groups
For each security group, collect:
{
"group_id": "sg-12345678",
"group_name": "default",
"description": "Default security group",
"vpc_id": "vpc-12345678",
"inbound_rules": [
{
"protocol": "tcp",
"port_range": "22-22",
"source": "0.0.0.0/0"
}
],
"outbound_rules": [
{
"protocol": "all",
"port_range": "all",
"destination": "0.0.0.0/0"
}
]
}
Part E: Output Formats
JSON Format (Default)
{
"account_info": {
"account_id": "123456789012",
"user_arn": "arn:aws:iam::123456789012:user/student",
"region": "us-west-2",
"scan_timestamp": "2026-02-16T14:30:00Z"
},
"resources": {
"iam_users": [...],
"ec2_instances": [...],
"s3_buckets": [...],
"security_groups": [...]
},
"summary": {
"total_users": 3,
"running_instances": 2,
"total_buckets": 5,
"security_groups": 8
}
}
Table Format
AWS Account: 123456789012 (us-west-2)
Scan Time: 2026-02-16 14:30:00 UTC
IAM USERS (3 total)
Username       Create Date   Last Activity    Policies
student-user   2026-01-15    2026-02-10       2
admin-user     2026-02-01    2026-02-15       1

EC2 INSTANCES (2 running, 1 stopped)
Instance ID           Type       State     Public IP       Launch Time
i-1234567890abcdef0   t3.micro   running   54.123.45.67    2026-02-15 08:00
i-0987654321fedcba0   t3.small   stopped   -               2026-02-10 12:30

S3 BUCKETS (5 total)
Bucket Name           Region      Created      Objects   Size (MB)
my-example-bucket     us-west-2   2026-01-20   47        ~1.0
data-backup-bucket    us-west-2   2026-01-15   234       ~15.2

SECURITY GROUPS (8 total)
Group ID      Name          VPC ID          Inbound Rules
sg-12345678   default       vpc-12345678    1
sg-87654321   web-servers   vpc-12345678    2
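One simple way to produce aligned columns like these is fixed-width f-string formatting (or str.ljust). The sketch below assumes the EC2 instance dictionaries described in Part D and is only illustrative; timestamp trimming and the other tables follow the same pattern.
def print_ec2_table(instances):
    # `instances` is the list of dicts described in Part D.
    print(f"{'Instance ID':<22}{'Type':<11}{'State':<10}{'Public IP':<16}{'Launch Time'}")
    for inst in instances:
        print(f"{inst['instance_id']:<22}{inst['instance_type']:<11}"
              f"{inst['state']:<10}{inst.get('public_ip') or '-':<16}{inst['launch_time']}")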
Part F: Error Handling
Your script must handle these error conditions gracefully:
- Authentication failures: Print clear error message and exit
- Permission denied: Skip resource type, log warning, continue
- Network timeouts: Retry once, then skip resource
- Invalid regions: Validate region exists before proceeding
- Empty resources: Handle accounts with no resources of a type
Example error output:
[WARNING] Access denied for IAM operations - skipping user enumeration
[WARNING] No EC2 instances found in us-west-2
[ERROR] Failed to access S3 bucket 'private-bucket': Access Denied
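The sketch below shows one way to combine pagination with this error-handling policy for a single resource type (IAM users). The warning text matches the example above; the function name and the returned field set are illustrative.
import sys
import boto3
from botocore.exceptions import ClientError

def collect_iam_users(session):
    # Paginate through all users; on AccessDenied, warn and return an empty list
    # so the rest of the scan can continue. Usage: collect_iam_users(boto3.Session()).
    users = []
    try:
        paginator = session.client("iam").get_paginator("list_users")
        for page in paginator.paginate():
            for user in page["Users"]:
                users.append({
                    "username": user["UserName"],
                    "user_id": user["UserId"],
                    "arn": user["Arn"],
                    "create_date": user["CreateDate"].strftime("%Y-%m-%dT%H:%M:%SZ"),
                })
    except ClientError as exc:
        if exc.response["Error"]["Code"] in ("AccessDenied", "AccessDeniedException"):
            print("[WARNING] Access denied for IAM operations - skipping user enumeration",
                  file=sys.stderr)
            return []
        raise
    return users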
Part G: Testing
Test your script with various options:
# JSON output to stdout
python aws_inspector.py --region us-west-2 --format json
# JSON output to file
python aws_inspector.py --region us-west-2 --format json --output output.json
# Table output
python aws_inspector.py --region us-west-2 --format table
# Verify JSON is valid
python -m json.tool output.json > /dev/null
Deliverables
See Submission. Your README should explain:
- Your approach to error handling and permission failures
- Any assumptions about AWS account configuration
A requirements.txt is provided in the starter code:
Code: requirements.txt
boto3>=1.26.0
We will validate your submission by running the following commands from your q3/ directory:
pip install -r requirements.txt
python aws_inspector.py --region us-west-2 --format json --output output.json
python aws_inspector.py --region us-west-2 --format table
These commands must complete without errors. We will then verify:
- Both JSON and table output formats are correct
- All resource types are collected with correct fields
- Error handling works with restricted permissions
- Script completes within 60 seconds for accounts with moderate resource counts
- Authentication error handling works correctly
Submission
Your submission must follow this directory structure:
README.md
q1/
├── arxiv_server.py
├── Dockerfile
└── sample_data/
└── papers.json
q2/
├── train_embeddings.py
├── Dockerfile
└── requirements.txt
q3/
├── aws_inspector.py
└── requirements.txt