Problem 2: Text Embedding Training with Autoencoders

Requirements

Use only the following packages:

  • PyTorch (torch, torch.nn, torch.optim)
  • Python standard library modules (json, sys, os, re, datetime, collections)
  • Basic text processing: You may implement your own tokenization or use simple word splitting

Do not use transformers, sentence-transformers, scikit-learn, numpy (use PyTorch tensors instead), or pre-trained embedding models.

Train a text autoencoder to generate embeddings for ArXiv paper abstracts. There is a strict parameter limit to encourage efficient architectures.

Part A: Parameter Limit Calculation

Your full model (encoder plus decoder) must have no more than 2,000,000 total parameters (weights and biases combined).

Example Calculation for Planning:

Assumptions for parameter budget:
- Vocabulary size: ~10,000 unique words in raw technical abstracts, capped to the top 5,000 for the parameter budget (see Part B)
- Suggested embedding dimension: 64-256 (your choice)
- Architecture: Input → Hidden → Bottleneck → Hidden → Output

Example architecture (10,000 → 512 → 128 → 512 → 10,000):
- Input layer: 10,000 × 512 + 512 bias = 5,120,512 parameters
- Encoder: 512 × 128 + 128 bias = 65,664 parameters  
- Decoder: 128 × 512 + 512 bias = 66,048 parameters
- Output: 512 × 10,000 + 10,000 bias = 5,130,000 parameters
Total: ~10.4M parameters (TOO LARGE)

Better architecture (5,000 → 128 → 64 → 128 → 5,000):
- Input layer: 5,000 × 128 + 128 bias = 640,128 parameters
- Encoder: 128 × 64 + 64 bias = 8,256 parameters
- Decoder: 64 × 128 + 128 bias = 8,320 parameters
- Output: 128 × 5,000 + 5,000 bias = 645,000 parameters
Total: ~1.3M parameters (WITHIN LIMIT)

Design Constraints:

  • Smaller vocabulary (limit to top-K most frequent words)
  • Smaller hidden layers
  • Efficient embedding dimension (64-256 range suggested)

Your script must print the total parameter count and verify it’s under the limit.
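
One way to implement this check, shown as a minimal sketch (count_parameters is an illustrative helper name, and model is assumed to be your instantiated autoencoder from Part C):

def count_parameters(model):
    # Sum the element counts of every trainable weight and bias tensor
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

total = count_parameters(model)
print(f"Total parameters: {total:,}")
assert total <= 2_000_000, "Parameter budget exceeded"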

Part B: Data Preprocessing

Create train_embeddings.py that loads ArXiv abstracts from HW#1 Problem 2 output.

Required preprocessing steps:

  1. Text cleaning:

    import re

    def clean_text(text):
        # Convert to lowercase
        text = text.lower()
        # Replace non-alphabetic characters with spaces
        text = re.sub(r'[^a-z]+', ' ', text)
        # Split into words and drop very short words (< 2 characters)
        words = [word for word in text.split() if len(word) >= 2]
        return words
  2. Vocabulary building (steps 2 and 3 are sketched after this list):

    • Extract all unique words from abstracts
    • Keep only the top 5,000 most frequent words (parameter budget constraint)
    • Create word-to-index mapping
    • Reserve index 0 for unknown words
  3. Sequence encoding:

    • Convert abstracts to sequences of word indices
    • Pad or truncate to fixed length (e.g., 100-200 words)
    • Create bag-of-words representation for autoencoder input/output
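
A minimal sketch of the vocabulary and bag-of-words parts of steps 2 and 3, assuming tokenized_abstracts is a list of word lists produced by clean_text; build_vocab and to_bow are illustrative helper names, and whether the unknown-word slot counts toward the 5,000 is your choice:

from collections import Counter
import torch

def build_vocab(tokenized_abstracts, max_size=5000):
    # Count word frequencies across all cleaned abstracts
    counts = Counter(word for words in tokenized_abstracts for word in words)
    # Index 0 is reserved for unknown / out-of-vocabulary words
    vocab_to_idx = {"<unk>": 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab_to_idx[word] = len(vocab_to_idx)
    return vocab_to_idx

def to_bow(words, vocab_to_idx):
    # Multi-hot bag-of-words vector: 1.0 if the word occurs in the abstract
    vec = torch.zeros(len(vocab_to_idx))
    for word in words:
        vec[vocab_to_idx.get(word, 0)] = 1.0
    return vec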

Part C: Autoencoder Architecture

Design a simple autoencoder. You may follow this vanilla pattern:

import torch.nn as nn

class TextAutoencoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, embedding_dim):
        super().__init__()
        # Encoder: vocab_size → hidden_dim → embedding_dim
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim)
        )
        
        # Decoder: embedding_dim → hidden_dim → vocab_size  
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
            nn.Sigmoid()  # Output probabilities
        )
    
    def forward(self, x):
        # Encode to bottleneck
        embedding = self.encoder(x)
        # Decode back to vocabulary space
        reconstruction = self.decoder(embedding)
        return reconstruction, embedding
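
For example, a quick sanity check of the tensor shapes on an untrained model; the dimensions here are just one choice that matches the budget example in Part A:

import torch

model = TextAutoencoder(vocab_size=5000, hidden_dim=128, embedding_dim=64)
fake_batch = torch.rand(32, 5000).round()   # stand-in multi-hot bag-of-words batch
reconstruction, embedding = model(fake_batch)
print(embedding.shape)        # torch.Size([32, 64])
print(reconstruction.shape)   # torch.Size([32, 5000])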

Architecture Requirements:

  • Input/output: Bag-of-words vectors (size = vocabulary size)
  • Bottleneck layer: Your chosen embedding dimension
  • Activation functions: ReLU for hidden layers, Sigmoid for output
  • Loss function: Binary cross-entropy (treating reconstruction as multi-label classification over the vocabulary)

Part D: Training Implementation

Your script must accept these command line arguments:

python train_embeddings.py <input_papers.json> <output_dir> [--epochs 50] [--batch_size 32]
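
One way to read these arguments using only sys from the allowed module list (a sketch; parse_args is an illustrative helper and error handling is omitted):

import sys

def parse_args(argv):
    # Positional arguments: input JSON path and output directory
    input_path, output_dir = argv[1], argv[2]
    # Optional flags with defaults matching the usage line above
    epochs, batch_size = 50, 32
    if "--epochs" in argv:
        epochs = int(argv[argv.index("--epochs") + 1])
    if "--batch_size" in argv:
        batch_size = int(argv[argv.index("--batch_size") + 1])
    return input_path, output_dir, epochs, batch_size

input_path, output_dir, epochs, batch_size = parse_args(sys.argv)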

Training requirements:

  1. Data loading: Load abstracts from HW#1 format JSON
  2. Batch processing: Process data in batches for memory efficiency
  3. Training loop (sketched after this list):
    • Forward pass: input bag-of-words → reconstruction + embedding
    • Loss: Binary cross-entropy between input and reconstruction
    • Backpropagation and parameter updates
  4. Progress logging: Print loss every epoch
  5. Parameter counting: Verify and print total parameters at startup
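
A minimal sketch of such a loop, assuming model is the TextAutoencoder from Part C, bows is a (num_papers, vocab_size) tensor of bag-of-words vectors built in Part B, and epochs / batch_size come from the command line:

import torch
import torch.nn as nn
import torch.optim as optim

criterion = nn.BCELoss()                          # binary cross-entropy on multi-hot targets
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(epochs):
    total_loss, num_batches = 0.0, 0
    perm = torch.randperm(bows.size(0))           # shuffle papers each epoch
    for start in range(0, bows.size(0), batch_size):
        batch = bows[perm[start:start + batch_size]]
        reconstruction, _ = model(batch)
        loss = criterion(reconstruction, batch)   # reconstruct the input bag-of-words
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1
    print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss / num_batches:.4f}")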

Example training output:

Loading abstracts from papers.json...
Found 157 abstracts
Building vocabulary from 23,450 words...
Vocabulary size: 5000 words
Model architecture: 5000 → 128 → 64 → 128 → 5000
Total parameters: 1,301,704 (under 2,000,000 limit)

Training autoencoder...
Epoch 10/50, Loss: 0.2847
Epoch 20/50, Loss: 0.1923
Epoch 30/50, Loss: 0.1654
...
Training complete in 127.3 seconds

Part E: Output Generation

Your script must save the following files to the output directory:

File 1: model.pth - Trained PyTorch model

torch.save({
    'model_state_dict': model.state_dict(),
    'vocab_to_idx': vocab_to_idx,
    'model_config': {
        'vocab_size': vocab_size,
        'hidden_dim': hidden_dim, 
        'embedding_dim': embedding_dim
    }
}, os.path.join(output_dir, 'model.pth'))
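
For reference, a checkpoint saved this way can be reloaded for inference (see the Validation section) roughly like this, assuming the same output_dir and the TextAutoencoder class from Part C:

import os
import torch

checkpoint = torch.load(os.path.join(output_dir, 'model.pth'))
config = checkpoint['model_config']
model = TextAutoencoder(config['vocab_size'], config['hidden_dim'], config['embedding_dim'])
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()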

File 2: embeddings.json - Generated embeddings for all papers

[
  {
    "arxiv_id": "2301.12345",
    "embedding": [0.123, -0.456, 0.789, ...],  // 64-256 dimensional
    "reconstruction_loss": 0.0234
  },
  ...
]
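
A sketch of how these records might be produced after training; papers is the list loaded from the input JSON (the field names depend on your HW#1 schema), and clean_text / to_bow are the helpers sketched in Part B:

import json
import os
import torch
import torch.nn as nn

criterion = nn.BCELoss()
records = []
model.eval()
with torch.no_grad():
    for paper in papers:
        # Key names ("arxiv_id", "abstract") assume the HW#1 JSON format
        bow = to_bow(clean_text(paper["abstract"]), vocab_to_idx).unsqueeze(0)
        reconstruction, embedding = model(bow)
        records.append({
            "arxiv_id": paper["arxiv_id"],
            "embedding": embedding.squeeze(0).tolist(),
            "reconstruction_loss": criterion(reconstruction, bow).item()
        })

with open(os.path.join(output_dir, "embeddings.json"), "w") as f:
    json.dump(records, f, indent=2)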

File 3: vocabulary.json - Vocabulary mapping

{
  "vocab_to_idx": {"word1": 1, "word2": 2, ...},
  "idx_to_vocab": {"1": "word1", "2": "word2", ...},
  "vocab_size": 5000,
  "total_words": 23450
}

File 4: training_log.json - Training metadata

{
  "start_time": "2025-09-16T14:30:00Z",
  "end_time": "2025-09-16T14:32:07Z",
  "epochs": 50,
  "final_loss": 0.1234,
  "total_parameters": 1598720,
  "papers_processed": 157,
  "embedding_dimension": 64
}
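
The timestamps can be produced with the standard datetime module; a sketch, where start_time, final_loss, epochs, papers, embedding_dim, and count_parameters are assumed to have been captured or defined earlier in your script:

import json
import os
from datetime import datetime, timezone

def utc_now():
    # ISO-8601 UTC timestamp with a trailing "Z", e.g. "2025-09-16T14:30:00Z"
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

log = {
    "start_time": start_time,                      # captured with utc_now() before training
    "end_time": utc_now(),
    "epochs": epochs,
    "final_loss": final_loss,
    "total_parameters": count_parameters(model),   # helper sketched in Part A
    "papers_processed": len(papers),
    "embedding_dimension": embedding_dim
}
with open(os.path.join(output_dir, "training_log.json"), "w") as f:
    json.dump(log, f, indent=2)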

Part F: Docker Configuration

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install CPU-only PyTorch (keeps the image small)
RUN pip install --no-cache-dir torch==2.0.1 --index-url https://download.pytorch.org/whl/cpu

# Install remaining minimal dependencies before copying the script,
# so code changes do not invalidate the dependency layers
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt

COPY train_embeddings.py /app/

ENTRYPOINT ["python", "/app/train_embeddings.py"]

Create requirements.txt:

# PyTorch installed separately in Dockerfile
# Add any other minimal dependencies here

Part G: Build and Run Scripts

Create build.sh:

#!/bin/bash
echo "Building autoencoder training container..."
docker build -t arxiv-embeddings:latest .
echo "Build complete"

Create run.sh:

#!/bin/bash

if [ $# -lt 2 ]; then
    echo "Usage: $0 <input_papers.json> <output_dir> [epochs] [batch_size]"
    echo "Example: $0 ../problem1/sample_data/papers.json output/ 50 32"
    exit 1
fi

INPUT_FILE="$1"
OUTPUT_DIR="$2"  
EPOCHS="${3:-50}"
BATCH_SIZE="${4:-32}"

# Validate input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE not found"
    exit 1
fi

# Create output directory
mkdir -p "$OUTPUT_DIR"

echo "Training embeddings with the following settings:"
echo "  Input: $INPUT_FILE"
echo "  Output: $OUTPUT_DIR" 
echo "  Epochs: $EPOCHS"
echo "  Batch size: $BATCH_SIZE"
echo ""

# Run training container
docker run --rm \
    --name arxiv-embeddings \
    -v "$(realpath $INPUT_FILE)":/data/input/papers.json:ro \
    -v "$(realpath $OUTPUT_DIR)":/data/output \
    arxiv-embeddings:latest \
    /data/input/papers.json /data/output --epochs "$EPOCHS" --batch_size "$BATCH_SIZE"

echo ""
echo "Training complete. Output files:"
ls -la "$OUTPUT_DIR"

Deliverables

Your problem2/ directory must contain:

problem2/
├── train_embeddings.py
├── Dockerfile  
├── requirements.txt
├── build.sh
└── run.sh

Validation

We will test your implementation by:

  1. Running ./build.sh - must complete without errors
  2. Running with sample data: ./run.sh ../problem1/sample_data/papers.json output/
  3. Verifying parameter count is under 2,000,000
  4. Checking all output files are generated with correct formats
  5. Validating embeddings have consistent dimensions
  6. Testing reconstruction loss decreases during training
  7. Verifying model can be loaded and used for inference

Your model must train successfully on the provided sample data within 10 minutes on a standard laptop CPU.