Problem 2: Text Embedding Training with Autoencoders
Use only the following packages:
- PyTorch (torch, torch.nn, torch.optim)
- Python standard library modules (json, sys, os, re, datetime, collections)
- Basic text processing: You may implement your own tokenization or use simple word splitting
Do not use transformers, sentence-transformers, scikit-learn, numpy (PyTorch tensors only), or pre-trained embedding models.
Train a text autoencoder to generate embeddings for ArXiv paper abstracts. There is a strict parameter limit to encourage efficient architectures.
Part A: Parameter Limit Calculation
Your model (encoder and decoder combined) must have no more than 2,000,000 total parameters (weights and biases combined).
Example Calculation for Planning:
Assumptions for parameter budget:
- Vocabulary size: ~10,000 words (typical for technical abstracts)
- Suggested embedding dimension: 64-256 (your choice)
- Architecture: Input → Hidden → Bottleneck → Hidden → Output
Example architecture (512 → 128 → 512):
- Input layer: 10,000 × 512 + 512 bias = 5,120,512 parameters
- Encoder: 512 × 128 + 128 bias = 65,664 parameters
- Decoder: 128 × 512 + 512 bias = 66,048 parameters
- Output: 512 × 10,000 + 10,000 bias = 5,130,000 parameters
Total: ~10.4M parameters (TOO LARGE)
Better architecture (vocabulary capped at top 5,000 words, 128 → 64 → 128):
- Input layer: 5,000 × 128 + 128 bias = 640,128 parameters
- Encoder: 128 × 64 + 64 bias = 8,256 parameters
- Decoder: 64 × 128 + 128 bias = 8,320 parameters
- Output: 128 × 5,000 + 5,000 bias = 645,000 parameters
Total: ~1.3M parameters (WITHIN LIMIT)
Design Constraints:
- Smaller vocabulary (limit to top-K most frequent words)
- Smaller hidden layers
- Efficient embedding dimension (64-256 range suggested)
Your script must print the total parameter count and verify it’s under the limit.
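For example, a check along these lines would satisfy this requirement (a sketch; model stands for your instantiated autoencoder, and the print format is only illustrative):

def count_parameters(model):
    # Count every weight and bias element in the model
    return sum(p.numel() for p in model.parameters())

total = count_parameters(model)
print(f"Total parameters: {total:,}")
assert total <= 2_000_000, "Parameter limit exceeded"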
Part B: Data Preprocessing
Create train_embeddings.py that loads ArXiv abstracts from HW#1 Problem 2 output.
Required preprocessing steps:
Text cleaning:
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove non-alphabetic characters except spaces
    text = re.sub(r"[^a-z ]+", " ", text)
    # Split into words
    words = text.split()
    # Remove very short words (< 2 characters)
    return [w for w in words if len(w) >= 2]

Vocabulary building:
- Extract all unique words from abstracts
- Keep only the top 5,000 most frequent words (parameter budget constraint)
- Create word-to-index mapping
- Reserve index 0 for unknown words
Sequence encoding:
- Convert abstracts to sequences of word indices
- Pad or truncate to fixed length (e.g., 100-200 words)
- Create bag-of-words representation for autoencoder input/output (a sketch of these steps follows this list)
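The following sketch illustrates one way to implement the vocabulary and bag-of-words steps above; it assumes the clean_text function from the text-cleaning step, and the helper names are illustrative, not required:

from collections import Counter
import torch

def build_vocab(abstracts, max_words=5000):
    # Count word frequencies across all cleaned abstracts
    counts = Counter(w for text in abstracts for w in clean_text(text))
    # Index 0 is reserved for unknown words
    vocab_to_idx = {"<unk>": 0}
    for word, _ in counts.most_common(max_words - 1):
        vocab_to_idx[word] = len(vocab_to_idx)
    return vocab_to_idx

def bag_of_words(text, vocab_to_idx):
    # Multi-hot vector: 1.0 if the word appears in the abstract, else 0.0
    vec = torch.zeros(len(vocab_to_idx))
    for w in clean_text(text):
        vec[vocab_to_idx.get(w, 0)] = 1.0
    return vec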
Part C: Autoencoder Architecture
Design a simple autoencoder. You may follow this vanilla pattern:
class TextAutoencoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, embedding_dim):
        super().__init__()
        # Encoder: vocab_size → hidden_dim → embedding_dim
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim)
        )
        # Decoder: embedding_dim → hidden_dim → vocab_size
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
            nn.Sigmoid()  # Output probabilities
        )

    def forward(self, x):
        # Encode to bottleneck
        embedding = self.encoder(x)
        # Decode back to vocabulary space
        reconstruction = self.decoder(embedding)
        return reconstruction, embedding

Architecture Requirements:
- Input/output: Bag-of-words vectors (size = vocabulary size)
- Bottleneck layer: Your chosen embedding dimension
- Activation functions: ReLU for hidden layers, Sigmoid for output
- Loss function: Binary cross-entropy (treating reconstruction as multi-label classification); a setup sketch follows this list
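Putting these requirements together, a minimal setup sketch (the dimensions follow the Part A example and are suggestions only):

import torch
import torch.nn as nn

model = TextAutoencoder(vocab_size=5000, hidden_dim=128, embedding_dim=64)
criterion = nn.BCELoss()  # binary cross-entropy against the multi-hot input

# Quick shape check with a dummy batch of bag-of-words vectors
x = torch.zeros(4, 5000)
reconstruction, embedding = model(x)
print(reconstruction.shape, embedding.shape)  # torch.Size([4, 5000]) torch.Size([4, 64])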
Part D: Training Implementation
Your script must accept these command line arguments:
python train_embeddings.py <input_papers.json> <output_dir> [--epochs 50] [--batch_size 32]

Training requirements:
- Data loading: Load abstracts from HW#1 format JSON
- Batch processing: Process data in batches for memory efficiency
- Training loop:
- Forward pass: input bag-of-words → reconstruction + embedding
- Loss: Binary cross-entropy between input and reconstruction
- Backpropagation and parameter updates
- Progress logging: Print loss every epoch
- Parameter counting: Verify and print total parameters at startup (a training-loop sketch follows this list)
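One possible shape for the argument parsing and training loop, shown as a sketch rather than a required structure (argument names are illustrative, and X stands for the num_papers × vocab_size bag-of-words tensor built in Part B):

import argparse
import torch
import torch.nn as nn
import torch.optim as optim

parser = argparse.ArgumentParser()
parser.add_argument("input_papers")
parser.add_argument("output_dir")
parser.add_argument("--epochs", type=int, default=50)
parser.add_argument("--batch_size", type=int, default=32)
args = parser.parse_args()

# X: (num_papers, vocab_size) bag-of-words matrix from Part B preprocessing
model = TextAutoencoder(vocab_size=X.shape[1], hidden_dim=128, embedding_dim=64)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(args.epochs):
    perm = torch.randperm(X.shape[0])  # shuffle each epoch
    epoch_loss = 0.0
    for start in range(0, X.shape[0], args.batch_size):
        batch = X[perm[start:start + args.batch_size]]
        optimizer.zero_grad()
        reconstruction, _ = model(batch)
        loss = criterion(reconstruction, batch)  # reconstruct the input bag-of-words
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * batch.shape[0]
    print(f"Epoch {epoch + 1}/{args.epochs}, Loss: {epoch_loss / X.shape[0]:.4f}")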
Example training output:
Loading abstracts from papers.json...
Found 157 abstracts
Building vocabulary from 23,450 words...
Vocabulary size: 5000 words
Model architecture: 5000 → 128 → 64 → 128 → 5000
Total parameters: 1,301,704 (under 2,000,000 limit)
Training autoencoder...
Epoch 10/50, Loss: 0.2847
Epoch 20/50, Loss: 0.1923
Epoch 30/50, Loss: 0.1654
...
Training complete in 127.3 seconds
Part E: Output Generation
Your script must save the following files to the output directory:
File 1: model.pth - Trained PyTorch model
torch.save({
    'model_state_dict': model.state_dict(),
    'vocab_to_idx': vocab_to_idx,
    'model_config': {
        'vocab_size': vocab_size,
        'hidden_dim': hidden_dim,
        'embedding_dim': embedding_dim
    }
}, os.path.join(output_dir, 'model.pth'))

File 2: embeddings.json - Generated embeddings for all papers
[
  {
    "arxiv_id": "2301.12345",
    "embedding": [0.123, -0.456, 0.789, ...],  // 64-256 dimensional
    "reconstruction_loss": 0.0234
  },
  ...
]

File 3: vocabulary.json - Vocabulary mapping
{
  "vocab_to_idx": {"word1": 1, "word2": 2, ...},
  "idx_to_vocab": {"1": "word1", "2": "word2", ...},
  "vocab_size": 5000,
  "total_words": 23450
}

File 4: training_log.json - Training metadata
{
  "start_time": "2025-09-16T14:30:00Z",
  "end_time": "2025-09-16T14:32:07Z",
  "epochs": 50,
  "final_loss": 0.1234,
  "total_parameters": 1301704,
  "papers_processed": 157,
  "embedding_dimension": 64
}

Part F: Docker Configuration
Create Dockerfile:
FROM python:3.11-slim
# Install PyTorch (CPU only for smaller image)
RUN pip install torch==2.0.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
WORKDIR /app
COPY requirements.txt /app/
RUN pip install -r requirements.txt
COPY train_embeddings.py /app/
ENTRYPOINT ["python", "/app/train_embeddings.py"]

Create requirements.txt:
# PyTorch installed separately in Dockerfile
# Add any other minimal dependencies here
Part G: Build and Run Scripts
Create build.sh:
#!/bin/bash
echo "Building autoencoder training container..."
docker build -t arxiv-embeddings:latest .
echo "Build complete"Create run.sh:
#!/bin/bash
if [ $# -lt 2 ]; then
echo "Usage: $0 <input_papers.json> <output_dir> [epochs] [batch_size]"
echo "Example: $0 ../problem1/sample_data/papers.json output/ 50 32"
exit 1
fi
INPUT_FILE="$1"
OUTPUT_DIR="$2"
EPOCHS="${3:-50}"
BATCH_SIZE="${4:-32}"
# Validate input file exists
if [ ! -f "$INPUT_FILE" ]; then
echo "Error: Input file $INPUT_FILE not found"
exit 1
fi
# Create output directory
mkdir -p "$OUTPUT_DIR"
echo "Training embeddings with the following settings:"
echo " Input: $INPUT_FILE"
echo " Output: $OUTPUT_DIR"
echo " Epochs: $EPOCHS"
echo " Batch size: $BATCH_SIZE"
echo ""
# Run training container
docker run --rm \
--name arxiv-embeddings \
-v "$(realpath $INPUT_FILE)":/data/input/papers.json:ro \
-v "$(realpath $OUTPUT_DIR)":/data/output \
arxiv-embeddings:latest \
/data/input/papers.json /data/output --epochs "$EPOCHS" --batch_size "$BATCH_SIZE"
echo ""
echo "Training complete. Output files:"
ls -la "$OUTPUT_DIR"Deliverables
Your problem2/ directory must contain:
problem2/
├── train_embeddings.py
├── Dockerfile
├── requirements.txt
├── build.sh
└── run.sh
Validation
We will test your implementation by:
- Running ./build.sh - must complete without errors
- Running with sample data: ./run.sh ../problem1/sample_data/papers.json output/
- Verifying parameter count is under 2,000,000
- Checking all output files are generated with correct formats
- Validating embeddings have consistent dimensions
- Testing reconstruction loss decreases during training
- Verifying model can be loaded and used for inference
Your model must train successfully on the provided sample data within 10 minutes on a standard laptop CPU.
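For reference, the load-and-infer check can be exercised with a sketch like the following; it assumes the model.pth layout from Part E, the TextAutoencoder class from Part C, the bag_of_words helper from the Part B sketch, and an illustrative output path:

import torch

checkpoint = torch.load("output/model.pth")  # path is an example
config = checkpoint["model_config"]
model = TextAutoencoder(config["vocab_size"], config["hidden_dim"], config["embedding_dim"])
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

vocab_to_idx = checkpoint["vocab_to_idx"]
with torch.no_grad():
    # Encode a single abstract to its embedding using only the encoder half
    vec = bag_of_words("We study neural embeddings for paper abstracts", vocab_to_idx)
    embedding = model.encoder(vec.unsqueeze(0)).squeeze(0)
print(embedding.shape)  # e.g. torch.Size([64])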