Homework #3: HTTP Services, ML Embeddings, and AWS Fundamentals

EE 547: Spring 2026

Assignment Details

Assigned: 11 February
Due: Tuesday, 24 February at 23:59

Gradescope: Homework 3 | How to Submit

Requirements
  • Docker Desktop must be installed and running on your machine
  • Python 3.11+ required
  • Use only Python standard library modules unless explicitly permitted

Overview

This assignment introduces HTTP services, machine learning embeddings, and AWS fundamentals. You will build API servers, train embedding models, and interact with AWS services.

Getting Started

Download the starter code: hw3-starter.zip

unzip hw3-starter.zip
cd hw3-starter

Problem 1: HTTP API Server for ArXiv Papers

Build a containerized HTTP server that serves ArXiv paper metadata from your HW#2 Problem 2 output.

Use only Python standard library modules: http.server, urllib.parse, json, re, sys, os, datetime. Do not use flask, django, fastapi, requests, or any other web frameworks or external HTTP libraries.

Part A: Data Source

Your server must load ArXiv paper data from HW#2 Problem 2 output files:

  • papers.json - Array of paper metadata with abstracts and statistics
  • corpus_analysis.json - Global corpus analysis with word frequencies

Part B: HTTP Server Implementation

Create arxiv_server.py that implements a basic HTTP server with the following endpoints:

Required Endpoints:

  1. GET /papers - Return list of all papers

    [
      {
        "arxiv_id": "2301.12345",
        "title": "Paper Title",
        "authors": ["Author One", "Author Two"],
        "categories": ["cs.LG", "cs.AI"]
      },
      ...
    ]
  2. GET /papers/{arxiv_id} - Return full paper details

    {
      "arxiv_id": "2301.12345",
      "title": "Paper Title",
      "authors": ["Author One", "Author Two"],
      "abstract": "Full abstract text...",
      "categories": ["cs.LG", "cs.AI"],
      "published": "2023-01-15T10:30:00Z",
      "abstract_stats": {
        "total_words": 150,
        "unique_words": 85,
        "total_sentences": 8
      }
    }
  3. GET /search?q={query} - Search papers by title and abstract

    {
      "query": "machine learning",
      "results": [
        {
          "arxiv_id": "2301.12345",
          "title": "Paper Title",
          "match_score": 3,
          "matches_in": ["title", "abstract"]
        }
      ]
    }
  4. GET /stats - Return corpus statistics

    {
      "total_papers": 20,
      "total_words": 15000,
      "unique_words": 2500,
      "top_10_words": [
        {"word": "model", "frequency": 145},
        {"word": "data", "frequency": 132}
      ],
      "category_distribution": {
        "cs.LG": 12,
        "cs.AI": 8
      }
    }

Error Handling:

  • Return HTTP 404 for unknown paper IDs or invalid endpoints
  • Return HTTP 400 for malformed search queries
  • Return HTTP 500 for server errors with JSON error message
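
If you are unsure how to structure the server, the sketch below shows one possible routing layout using only the permitted standard-library modules. All names (ArxivHandler, PAPERS_BY_ID, search_papers) are illustrative placeholders, not requirements; ThreadingHTTPServer is one way to satisfy the concurrency check in the grading criteria.

import json
import sys
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs

# Placeholders: populate these from papers.json / corpus_analysis.json at startup.
PAPERS_BY_ID = {}      # arxiv_id -> full paper record
PAPERS_SUMMARY = []    # list of {arxiv_id, title, authors, categories}
CORPUS_STATS = {}      # corpus-level statistics

def search_papers(query):
    # Placeholder: a real implementation scores matches in titles and abstracts.
    return {"query": query, "results": []}

class ArxivHandler(BaseHTTPRequestHandler):
    def _send_json(self, payload, status=200):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        parsed = urlparse(self.path)
        parts = [p for p in parsed.path.split("/") if p]
        if parsed.path == "/papers":
            self._send_json(PAPERS_SUMMARY)
        elif len(parts) == 2 and parts[0] == "papers":
            paper = PAPERS_BY_ID.get(parts[1])
            if paper is None:
                self._send_json({"error": "paper not found"}, status=404)
            else:
                self._send_json(paper)
        elif parsed.path == "/search":
            query = parse_qs(parsed.query).get("q", [""])[0].strip()
            if not query:
                self._send_json({"error": "missing or empty query"}, status=400)
            else:
                self._send_json(search_papers(query))
        elif parsed.path == "/stats":
            self._send_json(CORPUS_STATS)
        else:
            self._send_json({"error": "unknown endpoint"}, status=404)

if __name__ == "__main__":
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8080
    ThreadingHTTPServer(("", port), ArxivHandler).serve_forever()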

Part C: Implementation Requirements

Your server must:

  1. Command Line Arguments: Accept port number as argument (default 8080)

    python arxiv_server.py [port]
  2. Data Loading: Load JSON data at startup, handle missing files gracefully

  3. Search Implementation (see the sketch after this list):

    • Case-insensitive search in titles and abstracts
    • Count term frequency as match score
    • Support multi-word queries (search for all terms)
  4. Logging: Print requests to stdout in format:

    [2026-02-16 14:30:22] GET /papers - 200 OK (15 results)
    [2026-02-16 14:30:25] GET /papers/invalid-id - 404 Not Found
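
One possible reading of the search requirements above, as a sketch (the function name score_paper and the exact scoring are illustrative, not required):

def score_paper(query, paper):
    # Score one paper against a multi-word query; case-insensitive term counting.
    terms = query.lower().split()
    title = paper["title"].lower()
    abstract = paper["abstract"].lower()
    score = 0
    matches_in = set()
    for term in terms:
        title_hits = title.count(term)
        abstract_hits = abstract.count(term)
        if title_hits == 0 and abstract_hits == 0:
            return None  # "search for all terms": every term must appear somewhere
        score += title_hits + abstract_hits
        if title_hits:
            matches_in.add("title")
        if abstract_hits:
            matches_in.add("abstract")
    return {
        "arxiv_id": paper["arxiv_id"],
        "title": paper["title"],
        "match_score": score,
        "matches_in": sorted(matches_in),
    }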

Part D: Dockerfile

A Dockerfile is provided in the starter code:

Code: Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY arxiv_server.py /app/
COPY sample_data/ /app/sample_data/
EXPOSE 8080
ENTRYPOINT ["python", "/app/arxiv_server.py"]
CMD ["8080"]

Part E: Building and Running

Build your container image:

docker build -t arxiv-server:latest .

Run your container:

docker run --rm \
    -p 8080:8080 \
    arxiv-server:latest

The server will be available at http://localhost:8080. To use a different host port:

docker run --rm \
    -p 9000:8080 \
    arxiv-server:latest

Part F: Testing

With the server running, test your endpoints using curl:

curl -s http://localhost:8080/papers | python -m json.tool
curl -s http://localhost:8080/stats | python -m json.tool
curl -s "http://localhost:8080/search?q=machine" | python -m json.tool
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/invalid

Deliverables

See Submission.

Grading Commands

We will validate your submission by running the following commands from your q1/ directory:

docker build -t arxiv-server:latest .
docker run --rm \
    -p 8080:8080 \
    arxiv-server:latest

These commands must complete without errors. We will then verify:

  • All four endpoints respond with correct JSON structure
  • Error handling for invalid requests (404, 400)
  • Server handles at least 10 concurrent requests without errors
  • All endpoints respond within 2 seconds under normal load

Problem 2: Text Embedding Training with Autoencoders

Train a text autoencoder to generate embeddings for ArXiv paper abstracts. There is a strict parameter limit to encourage efficient architectures.

This problem requires PyTorch (torch, torch.nn, torch.optim) and Python standard library modules (json, sys, os, re, datetime, collections). You may implement your own tokenization or use simple word splitting. Do not use transformers, sentence-transformers, scikit-learn, numpy (use PyTorch tensors instead), or any pre-trained embedding models.

Part A: Parameter Limit Calculation

Your autoencoder must have no more than 2,000,000 total parameters (weights and biases combined).

Example Calculation for Planning:

Assumptions for parameter budget:
- Vocabulary size: ~10,000 words (typical for technical abstracts)
- Suggested embedding dimension: 64-256 (your choice)
- Architecture: Input → Hidden → Bottleneck → Hidden → Output

Example architecture (512 → 128 → 512):
- Input layer: 10,000 × 512 + 512 bias = 5,120,512 parameters
- Encoder: 512 × 128 + 128 bias = 65,664 parameters
- Decoder: 128 × 512 + 512 bias = 66,048 parameters
- Output: 512 × 10,000 + 10,000 bias = 5,130,000 parameters
Total: ~10.4M parameters (TOO LARGE)

Better architecture (vocabulary 5,000, layers 128 → 64 → 128):
- Input layer: 5,000 × 128 + 128 bias = 640,128 parameters
- Encoder: 128 × 64 + 64 bias = 8,256 parameters
- Decoder: 64 × 128 + 128 bias = 8,320 parameters
- Output: 128 × 5,000 + 5,000 bias = 645,000 parameters
Total: ~1.3M parameters (WITHIN LIMIT)

Design Constraints:

  • Smaller vocabulary (limit to top-K most frequent words)
  • Smaller hidden layers
  • Efficient embedding dimension (64-256 range suggested)

Your script must print the total parameter count and verify it’s under the limit.
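
A simple way to verify the budget in PyTorch is to sum the element counts of all parameter tensors. The sketch below is illustrative, not required code; the demo architecture matches the 5,000 → 128 → 64 → 128 → 5,000 example above.

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the number of elements in every weight and bias tensor.
    return sum(p.numel() for p in model.parameters())

if __name__ == "__main__":
    # Illustrative check against the 2,000,000 budget.
    demo = nn.Sequential(
        nn.Linear(5000, 128), nn.ReLU(), nn.Linear(128, 64),
        nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 5000),
    )
    total = count_parameters(demo)
    print(f"Total parameters: {total:,}")   # 1,301,704
    assert total <= 2_000_000, "parameter budget exceeded"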

Part B: Data Preprocessing

Create train_embeddings.py that loads ArXiv abstracts from HW#2 Problem 2 output.

Required preprocessing steps:

  1. Text cleaning:

    import re

    def clean_text(text):
        # Convert to lowercase
        text = text.lower()
        # Remove non-alphabetic characters except spaces
        text = re.sub(r"[^a-z ]+", " ", text)
        # Split into words and remove very short words (< 2 characters)
        return [word for word in text.split() if len(word) >= 2]
  2. Vocabulary building:

    • Extract all unique words from abstracts
    • Keep only the top 5,000 most frequent words (parameter budget constraint)
    • Create word-to-index mapping
    • Reserve index 0 for unknown words
  3. Sequence encoding (see the sketch after this list):

    • Convert abstracts to sequences of word indices
    • Pad or truncate to fixed length (e.g., 100-200 words)
    • Create bag-of-words representation for autoencoder input/output
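
A sketch of steps 2 and 3 (helper names are illustrative; this version uses an input dimension of vocab_size + 1 because slot 0 holds unknown words, so adopt whatever convention you prefer as long as it is consistent):

from collections import Counter
import torch

def build_vocab(abstract_word_lists, max_vocab=5000):
    # Count word frequencies across all cleaned abstracts and keep the top-K words.
    counts = Counter(w for words in abstract_word_lists for w in words)
    top_words = [w for w, _ in counts.most_common(max_vocab)]
    # Indices start at 1; index 0 is reserved for unknown words.
    return {w: i + 1 for i, w in enumerate(top_words)}

def bag_of_words(words, vocab_to_idx):
    # Binary bag-of-words vector; slot 0 collects out-of-vocabulary words.
    vec = torch.zeros(len(vocab_to_idx) + 1)
    for w in words:
        vec[vocab_to_idx.get(w, 0)] = 1.0
    return vec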

Part C: Autoencoder Architecture

Design a simple autoencoder. You may follow this vanilla pattern:

class TextAutoencoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, embedding_dim):
        super().__init__()
        # Encoder: vocab_size → hidden_dim → embedding_dim
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim)
        )

        # Decoder: embedding_dim → hidden_dim → vocab_size
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
            nn.Sigmoid()  # Output probabilities
        )

    def forward(self, x):
        # Encode to bottleneck
        embedding = self.encoder(x)
        # Decode back to vocabulary space
        reconstruction = self.decoder(embedding)
        return reconstruction, embedding

Architecture Requirements:

  • Input/output: Bag-of-words vectors (size = vocabulary size)
  • Bottleneck layer: Your chosen embedding dimension
  • Activation functions: ReLU for hidden layers, Sigmoid for output
  • Loss function: Binary cross-entropy (treating as multi-label classification)

Part D: Training Implementation

Your script must accept these command line arguments:

python train_embeddings.py <input_papers.json> <output_dir> [--epochs 50] [--batch_size 32]

Training requirements:

  1. Data loading: Load abstracts from HW#2 format JSON
  2. Batch processing: Process data in batches for memory efficiency
  3. Training loop (see the sketch after this list):
    • Forward pass: input bag-of-words → reconstruction + embedding
    • Loss: Binary cross-entropy between input and reconstruction
    • Backpropagation and parameter updates
  4. Progress logging: Print loss every epoch
  5. Parameter counting: Verify and print total parameters at startup
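
A bare-bones training loop consistent with these requirements might look like the following sketch (variable names and the manual batching are illustrative, not required):

import torch
import torch.nn as nn
import torch.optim as optim

def train(model, bow_matrix, epochs=50, batch_size=32, lr=1e-3):
    # bow_matrix: float tensor of shape (num_papers, vocab_size)
    criterion = nn.BCELoss()                  # reconstruction loss on 0/1 bag-of-words
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(1, epochs + 1):
        perm = torch.randperm(bow_matrix.size(0))
        epoch_loss, batches = 0.0, 0
        for start in range(0, bow_matrix.size(0), batch_size):
            batch = bow_matrix[perm[start:start + batch_size]]
            reconstruction, _ = model(batch)  # forward pass -> (reconstruction, embedding)
            loss = criterion(reconstruction, batch)
            optimizer.zero_grad()
            loss.backward()                   # backpropagation
            optimizer.step()                  # parameter update
            epoch_loss += loss.item()
            batches += 1
        print(f"Epoch {epoch}/{epochs}, Loss: {epoch_loss / batches:.4f}")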

Example training output:

Loading abstracts from papers.json...
Found 157 abstracts
Building vocabulary from 23,450 words...
Vocabulary size: 5000 words
Model architecture: 5000 → 128 → 64 → 128 → 5000
Total parameters: 1,301,704 (under 2,000,000 limit)

Training autoencoder...
Epoch 10/50, Loss: 0.2847
Epoch 20/50, Loss: 0.1923
Epoch 30/50, Loss: 0.1654
...
Training complete in 127.3 seconds

Part E: Output Generation

Your script must save the following files to the output directory:

File 1: model.pth - Trained PyTorch model

torch.save({
    'model_state_dict': model.state_dict(),
    'vocab_to_idx': vocab_to_idx,
    'model_config': {
        'vocab_size': vocab_size,
        'hidden_dim': hidden_dim,
        'embedding_dim': embedding_dim
    }
}, 'model.pth')

File 2: embeddings.json - Generated embeddings for all papers

[
  {
    "arxiv_id": "2301.12345",
    "embedding": [0.123, -0.456, 0.789, ...],
    "reconstruction_loss": 0.0234
  },
  ...
]
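
One way to produce this file, assuming the same bag-of-words tensors used during training (the helper name export_embeddings is illustrative, not required):

import json
import torch
import torch.nn as nn

def export_embeddings(model, papers, bow_matrix, path):
    # Write one embedding plus per-paper reconstruction loss per paper.
    criterion = nn.BCELoss()
    records = []
    model.eval()
    with torch.no_grad():
        for paper, bow in zip(papers, bow_matrix):
            reconstruction, embedding = model(bow.unsqueeze(0))
            records.append({
                "arxiv_id": paper["arxiv_id"],
                "embedding": embedding.squeeze(0).tolist(),
                "reconstruction_loss": criterion(reconstruction, bow.unsqueeze(0)).item(),
            })
    with open(path, "w") as f:
        json.dump(records, f, indent=2)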

File 3: vocabulary.json - Vocabulary mapping

{
  "vocab_to_idx": {"word1": 1, "word2": 2, ...},
  "idx_to_vocab": {"1": "word1", "2": "word2", ...},
  "vocab_size": 5000,
  "total_words": 23450
}

File 4: training_log.json - Training metadata

{
  "start_time": "2026-02-16T14:30:00Z",
  "end_time": "2026-02-16T14:32:07Z",
  "epochs": 50,
  "final_loss": 0.1234,
  "total_parameters": 1598720,
  "papers_processed": 157,
  "embedding_dimension": 64
}

Part F: Docker Configuration

A Dockerfile and requirements.txt are provided in the starter code:

Code: Dockerfile
FROM python:3.11-slim

# Install PyTorch (CPU only for smaller image)
RUN pip install torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

WORKDIR /app
COPY train_embeddings.py /app/
COPY requirements.txt /app/
RUN pip install -r requirements.txt

ENTRYPOINT ["python", "/app/train_embeddings.py"]
Code: requirements.txt
# PyTorch installed separately in Dockerfile
# Add any other minimal dependencies here

Part G: Building and Running

Build your container image:

docker build -t arxiv-embeddings:latest .

Run your container:

docker run --rm \
    -v "$(pwd)/papers.json":/data/input/papers.json:ro \
    -v "$(pwd)/output":/data/output \
    arxiv-embeddings:latest \
    /data/input/papers.json /data/output --epochs 50 --batch_size 32

The volume mounts connect your host filesystem to the container:

  • -v "$(pwd)/papers.json":/data/input/papers.json:ro mounts your input file read-only
  • -v "$(pwd)/output":/data/output mounts the output directory for results

On Windows, replace $(pwd) with the full path or %cd% (cmd) / ${PWD} (PowerShell).

Deliverables

See Submission.

Grading Commands

We will validate your submission by running the following commands from your q2/ directory:

docker build -t arxiv-embeddings:latest .
docker run --rm \
    -v "$(pwd)/papers.json":/data/input/papers.json:ro \
    -v "$(pwd)/output":/data/output \
    arxiv-embeddings:latest \
    /data/input/papers.json /data/output --epochs 50 --batch_size 32

These commands must complete without errors. We will then verify:

  • Parameter count is under 2,000,000
  • All output files are generated with correct formats
  • Embeddings have consistent dimensions
  • Reconstruction loss decreases during training
  • Model can be loaded and used for inference
  • Training completes within 10 minutes on a standard laptop CPU

Problem 3: AWS Resource Inspector

Create a Python script that lists and inspects AWS resources across your account, providing insight into IAM users, EC2 instances, S3 buckets, and security groups.

This problem requires boto3 (AWS SDK for Python) and Python standard library modules (json, sys, datetime, argparse, os). Do not use other AWS libraries, CLI wrappers, or third-party AWS tools beyond boto3. AWS CLI must be configured with valid credentials.

Part A: Authentication Setup

Your script must support AWS credential authentication through:

  1. AWS CLI credentials (primary method):

    aws configure
    # OR
    aws configure set aws_access_key_id YOUR_KEY
    aws configure set aws_secret_access_key YOUR_SECRET
    aws configure set region us-west-2
  2. Environment variables (fallback):

    export AWS_ACCESS_KEY_ID=your_key
    export AWS_SECRET_ACCESS_KEY=your_secret
    export AWS_DEFAULT_REGION=us-west-2

Your script must verify authentication at startup using sts:GetCallerIdentity.
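
The startup check might look like the sketch below; boto3 raises NoCredentialsError or ClientError when credentials are missing or invalid (botocore ships with boto3, so this import does not add a dependency). The helper name is illustrative.

import sys
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

def verify_credentials(region=None):
    # Fail fast if no valid credentials are available.
    try:
        sts = boto3.client("sts", region_name=region)
        identity = sts.get_caller_identity()
    except (NoCredentialsError, ClientError) as exc:
        print(f"[ERROR] AWS authentication failed: {exc}", file=sys.stderr)
        sys.exit(1)
    return identity  # contains Account, Arn, UserId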

Part B: Required AWS Permissions

Your script needs these permissions (minimum required):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sts:GetCallerIdentity",
                "iam:ListUsers",
                "iam:GetUser",
                "iam:ListAttachedUserPolicies",
                "ec2:DescribeInstances",
                "ec2:DescribeImages",
                "ec2:DescribeSecurityGroups",
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": "*"
        }
    ]
}

Part C: Script Implementation

Create aws_inspector.py with the following command line interface:

python aws_inspector.py [--region REGION] [--output OUTPUT_FILE] [--format json|table]

Arguments:

  • --region: AWS region to inspect (default: from credentials/config)
  • --output: Output file path (default: print to stdout)
  • --format: Output format - ‘json’ or ‘table’ (default: json)
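
A minimal argparse setup matching this interface (sketch; the help strings are up to you):

import argparse

def parse_args():
    # Command line interface matching the usage shown above.
    parser = argparse.ArgumentParser(description="Inspect AWS resources in an account")
    parser.add_argument("--region", default=None,
                        help="AWS region to inspect (default: from credentials/config)")
    parser.add_argument("--output", default=None,
                        help="Output file path (default: print to stdout)")
    parser.add_argument("--format", choices=["json", "table"], default="json",
                        help="Output format")
    return parser.parse_args()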

Part D: Resource Collection

Your script must collect information for these resource types:

1. IAM Users

For each user, collect:

{
    "username": "user-name",
    "user_id": "AIDACKEXAMPLE",
    "arn": "arn:aws:iam::123456789012:user/user-name",
    "create_date": "2026-01-15T10:30:00Z",
    "last_activity": "2026-02-10T14:20:00Z",
    "attached_policies": [
        {
            "policy_name": "PowerUserAccess",
            "policy_arn": "arn:aws:iam::aws:policy/PowerUserAccess"
        }
    ]
}
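
A boto3 sketch that could produce records in this shape (the helper name and field handling are illustrative; note that PasswordLastUsed is only present for users who have signed in):

import boto3

def collect_iam_users():
    # Gather IAM users and their attached managed policies.
    iam = boto3.client("iam")
    users = []
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            policies = iam.list_attached_user_policies(UserName=user["UserName"])
            last_used = user.get("PasswordLastUsed")
            users.append({
                "username": user["UserName"],
                "user_id": user["UserId"],
                "arn": user["Arn"],
                "create_date": user["CreateDate"].isoformat(),
                "last_activity": last_used.isoformat() if last_used else None,
                "attached_policies": [
                    {"policy_name": p["PolicyName"], "policy_arn": p["PolicyArn"]}
                    for p in policies["AttachedPolicies"]
                ],
            })
    return users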

2. EC2 Instances

For each instance, collect:

{
    "instance_id": "i-1234567890abcdef0",
    "instance_type": "t3.micro",
    "state": "running",
    "public_ip": "54.123.45.67",
    "private_ip": "10.0.1.100",
    "availability_zone": "us-west-2a",
    "launch_time": "2026-02-15T08:00:00Z",
    "ami_id": "ami-0abcdef1234567890",
    "ami_name": "Amazon Linux 2023 AMI",
    "security_groups": ["sg-12345678", "sg-87654321"],
    "tags": {
        "Name": "my-instance",
        "Environment": "development"
    }
}
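
A sketch for instance collection (illustrative; the AMI name requires a follow-up ec2.describe_images call, which this sketch omits):

import boto3

def collect_ec2_instances(region):
    # Gather EC2 instance details for one region.
    ec2 = boto3.client("ec2", region_name=region)
    instances = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                instances.append({
                    "instance_id": inst["InstanceId"],
                    "instance_type": inst["InstanceType"],
                    "state": inst["State"]["Name"],
                    "public_ip": inst.get("PublicIpAddress"),   # absent when stopped
                    "private_ip": inst.get("PrivateIpAddress"),
                    "availability_zone": inst["Placement"]["AvailabilityZone"],
                    "launch_time": inst["LaunchTime"].isoformat(),
                    "ami_id": inst["ImageId"],
                    "security_groups": [sg["GroupId"] for sg in inst.get("SecurityGroups", [])],
                    "tags": {t["Key"]: t["Value"] for t in inst.get("Tags", [])},
                })
    return instances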

3. S3 Buckets

For each bucket, collect:

{
    "bucket_name": "my-example-bucket",
    "creation_date": "2026-01-20T12:00:00Z",
    "region": "us-west-2",
    "object_count": 47,
    "size_bytes": 1024000
}
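
A sketch for bucket collection (illustrative). Counting objects with the list_objects_v2 paginator can be slow for large buckets, which matters for the 60-second grading target:

import boto3

def collect_s3_buckets():
    # Gather bucket metadata plus object count and total size.
    s3 = boto3.client("s3")
    buckets = []
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # LocationConstraint is None for buckets in us-east-1.
        region = s3.get_bucket_location(Bucket=name)["LocationConstraint"] or "us-east-1"
        count, size = 0, 0
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=name):
            for obj in page.get("Contents", []):
                count += 1
                size += obj["Size"]
        buckets.append({
            "bucket_name": name,
            "creation_date": bucket["CreationDate"].isoformat(),
            "region": region,
            "object_count": count,
            "size_bytes": size,
        })
    return buckets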

4. Security Groups

For each security group, collect:

{
    "group_id": "sg-12345678",
    "group_name": "default",
    "description": "Default security group",
    "vpc_id": "vpc-12345678",
    "inbound_rules": [
        {
            "protocol": "tcp",
            "port_range": "22-22",
            "source": "0.0.0.0/0"
        }
    ],
    "outbound_rules": [
        {
            "protocol": "all",
            "port_range": "all",
            "destination": "0.0.0.0/0"
        }
    ]
}

Part E: Output Formats

JSON Format (Default)

{
    "account_info": {
        "account_id": "123456789012",
        "user_arn": "arn:aws:iam::123456789012:user/student",
        "region": "us-west-2",
        "scan_timestamp": "2026-02-16T14:30:00Z"
    },
    "resources": {
        "iam_users": [...],
        "ec2_instances": [...],
        "s3_buckets": [...],
        "security_groups": [...]
    },
    "summary": {
        "total_users": 3,
        "running_instances": 2,
        "total_buckets": 5,
        "security_groups": 8
    }
}

Table Format

AWS Account: 123456789012 (us-west-2)
Scan Time: 2026-02-16 14:30:00 UTC

IAM USERS (3 total)
Username            Create Date          Last Activity        Policies
student-user        2026-01-15           2026-02-10           2
admin-user          2026-02-01           2026-02-15           1

EC2 INSTANCES (2 running, 1 stopped)
Instance ID          Type        State      Public IP        Launch Time
i-1234567890abcdef0  t3.micro    running    54.123.45.67     2026-02-15 08:00
i-0987654321fedcba0  t3.small    stopped    -                2026-02-10 12:30

S3 BUCKETS (5 total)
Bucket Name              Region      Created       Objects    Size (MB)
my-example-bucket        us-west-2   2026-01-20    47         ~1.0
data-backup-bucket       us-west-2   2026-01-15    234        ~15.2

SECURITY GROUPS (8 total)
Group ID         Name           VPC ID          Inbound Rules
sg-12345678      default        vpc-12345678    1
sg-87654321      web-servers    vpc-12345678    2

Part F: Error Handling

Your script must handle these error conditions gracefully:

  1. Authentication failures: Print clear error message and exit
  2. Permission denied: Skip resource type, log warning, continue
  3. Network timeouts: Retry once, then skip resource
  4. Invalid regions: Validate region exists before proceeding
  5. Empty resources: Handle accounts with no resources of a type

Example error output:

[WARNING] Access denied for IAM operations - skipping user enumeration
[WARNING] No EC2 instances found in us-west-2
[ERROR] Failed to access S3 bucket 'private-bucket': Access Denied
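
One way to turn permission failures into warnings is to wrap each collector in a helper that catches botocore's ClientError (the helper name safe_collect is illustrative):

from botocore.exceptions import ClientError

def safe_collect(label, collect_fn):
    # Run one collector, converting permission errors into warnings.
    try:
        return collect_fn()
    except ClientError as exc:
        code = exc.response["Error"]["Code"]
        if code in ("AccessDenied", "AccessDeniedException", "UnauthorizedOperation"):
            print(f"[WARNING] Access denied for {label} - skipping")
            return []
        raise  # unexpected errors still surface

Usage would then look like iam_users = safe_collect("IAM operations", collect_iam_users).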

Part G: Testing

Test your script with various options:

# JSON output to stdout
python aws_inspector.py --region us-west-2 --format json

# JSON output to file
python aws_inspector.py --region us-west-2 --format json --output output.json

# Table output
python aws_inspector.py --region us-west-2 --format table

# Verify JSON is valid
python -m json.tool output.json > /dev/null

Deliverables

See Submission. Your README should explain:

  • Your approach to error handling and permission failures
  • Any assumptions about AWS account configuration

A requirements.txt is provided in the starter code:

Code: requirements.txt
boto3>=1.26.0

Grading Commands

We will validate your submission by running the following commands from your q3/ directory:

pip install -r requirements.txt
python aws_inspector.py --region us-west-2 --format json --output output.json
python aws_inspector.py --region us-west-2 --format table

These commands must complete without errors. We will then verify:

  • Both JSON and table output formats are correct
  • All resource types are collected with correct fields
  • Error handling works with restricted permissions
  • Script completes within 60 seconds for accounts with moderate resource counts
  • Authentication error handling works correctly

Submission
README.md
q1/
├── arxiv_server.py
├── Dockerfile
└── sample_data/
    └── papers.json
q2/
├── train_embeddings.py
├── Dockerfile
└── requirements.txt
q3/
├── aws_inspector.py
└── requirements.txt