
EE 547 - Unit 1
Fall 2025

Each GPU contains multiple Streaming Multiprocessors (SMs); the A100 has 108 SMs:
- Warp: 32 threads executing in lockstep
- Shared Memory: 164 KB per SM - fast local storage
- Tensor Cores: specialized units for matrix multiply (4x4 matrices)
Matrix Multiply Example:
C = A × B where A is 1000×1000, B is 1000×1000
CPU approach: 3 nested loops
N = 1000
C = [[0.0] * N for _ in range(N)]          # assumes A and B are N x N lists
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]
Time: 1 billion loop iterations (2 billion FLOPs) executed sequentially
GPU approach: All operations in parallel
Each output C[i][j] computed by separate thread
Time: ~1000 operations per thread (depth of the computation)
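
As a rough sketch of the same comparison in PyTorch (assumes PyTorch is installed and a CUDA GPU is present; actual timings depend entirely on your hardware):

import time
import torch

N = 1000
A = torch.randn(N, N)
B = torch.randn(N, N)

# CPU: a single matmul call, still O(N^3) work but vectorized across cores
t0 = time.perf_counter()
C_cpu = A @ B
cpu_ms = (time.perf_counter() - t0) * 1000

# GPU: every output element is handled by its own thread
if torch.cuda.is_available():
    A_gpu, B_gpu = A.cuda(), B.cuda()
    torch.cuda.synchronize()                  # wait for the host-to-device copies
    t0 = time.perf_counter()
    C_gpu = A_gpu @ B_gpu
    torch.cuda.synchronize()                  # kernels launch asynchronously; wait
    gpu_ms = (time.perf_counter() - t0) * 1000
    print(f"CPU: {cpu_ms:.2f} ms, GPU: {gpu_ms:.2f} ms")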


FLOP: Floating Point Operation
Basic operations (1 FLOP each):
- Addition: a + b
- Multiplication: a * b
- Fused Multiply-Add (FMA): a * b + c (counts as 2 FLOPs)
A100 GPU: 312 TFLOPS (FP16)
- 312 trillion operations per second
- But only for perfectly parallel operations
- Real utilization: 30-80% depending on the problem
Multiplying two N×N matrices:
FLOPs = 2N³ (N multiply-adds for each of the N² outputs)
1000×1000 matrices:
2 × 1000³ = 2 billion FLOPs
Time on A100: 2×10⁹ / 312×10¹² = 6.4 microseconds (theoretical)
Actual time: ~50 microseconds (memory bound)
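
A quick sketch of the arithmetic above, reusing the numbers quoted in these notes (the 312 TFLOPS peak and ~50 microsecond measured time come from the text, not from a new measurement):

# Back-of-the-envelope FLOP accounting for an N x N matrix multiply
N = 1000
flops = 2 * N**3                     # 2e9 FLOPs
peak_flops_per_s = 312e12            # A100 FP16 peak from the notes

theoretical_s = flops / peak_flops_per_s
measured_s = 50e-6                   # ~50 us observed (memory bound), from the notes

print(f"theoretical: {theoretical_s * 1e6:.1f} us")        # ~6.4 us
print(f"utilization: {theoretical_s / measured_s:.0%}")    # ~13% of peak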
FP32 (Single): [Sign:1][Exponent:8][Mantissa:23]
Range: ±1.2×10⁻³⁸ to ±3.4×10³⁸
Precision: ~7 decimal digits
FP16 (Half): [Sign:1][Exponent:5][Mantissa:10]
Range: ±6.1×10⁻⁵ to ±65,504
Precision: ~3 decimal digits
BF16 (BFloat16): [Sign:1][Exponent:8][Mantissa:7]
Range: ±1.2×10⁻³⁸ to ±3.4×10³⁸ (same as FP32)
Precision: ~2 decimal digits
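
A small precision/range demo using PyTorch dtypes (assumes PyTorch; the printed values follow IEEE rounding and should be reproducible):

import torch

x = torch.tensor(1.0 / 3.0)
print(x.to(torch.float32).item())     # 0.3333333432674408  (~7 good digits)
print(x.to(torch.float16).item())     # 0.333251953125      (~3 good digits)
print(x.to(torch.bfloat16).item())    # 0.333984375         (~2 good digits)

# Range: 70,000 overflows FP16 (max 65,504) but stays finite in BF16/FP32
big = torch.tensor(70000.0)
print(big.to(torch.float16).item())   # inf
print(big.to(torch.bfloat16).item())  # 70144.0 (rounded, but finite)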
Operation              Energy (pJ)   Relative Cost
32-bit FP Multiply         3.7            1x
32-bit Register Read       2.0            0.5x
32-bit L1 Cache Read       5.0            1.4x
32-bit L2 Cache Read      20              5.4x
32-bit DRAM Read         200             54x
32-bit Read from SSD  10,000          2,700x
Moving data from DRAM costs 54x more energy than computing on it!


One Computer:
┌──────────────────┐
│ CPU: 16 cores │
│ RAM: 128 GB │
│ GPU: 1× A100 │
│ Disk: 2 TB │
└──────────────────┘
Multiple Computers Working Together:
┌─────────┐ Network ┌─────────┐
│ Node 1 │←─────────→│ Node 2 │
└─────────┘ └─────────┘
↑ ↑
└──────────┬──────────┘
┌─────────┐
│ Node 3 │
└─────────┘
Primary Challenge: Coordination and communication
Scale-Up (Vertical)
- Buy a bigger machine
- Limits: physics and cost
- 128-core CPU: $50,000
- 8× A100 server: $200,000

Scale-Out (Horizontal)
- Add more machines
- Limits: communication overhead
- 8× single-GPU machines: $80,000
- Can add incrementally

Broadcast - one to all:
       [A]
      / | \
 [A] [A] [A]
Use: Distribute hyperparameters

Scatter - divide among processes:
[A,B,C,D]
    ↓
[A] [B] [C] [D]
Use: Distribute dataset

Gather - collect from all:
[A] [B] [C] [D]
    ↓
[A,B,C,D]
Use: Collect metrics

All-Reduce - reduce and broadcast:
[A] [B] [C] [D]
    ↓
[A+B+C+D] on all
Use: Gradient synchronization
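
A minimal all-reduce sketch using torch.distributed (the backend choice and script name are assumptions; a launcher such as torchrun sets the required environment variables for each process):

import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT per process
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU clusters
    rank = dist.get_rank()

    # Each process contributes its own tensor (stand-in for local gradients)
    t = torch.tensor([float(rank)])

    # After all_reduce, every process holds the sum 0 + 1 + ... + (world_size - 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, for example, torchrun --nproc_per_node=4 allreduce_demo.py (the script name is hypothetical).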

# Common shapes
batch_tensor = [B, S, H]
# B = Batch size (independent samples)
# S = Sequence length (tokens, time)
# H = Hidden dimension (features)
weight_matrix = [H_in, H_out]
# H_in = Input features
# H_out = Output features
image_batch = [B, C, H, W]
# B = Batch, C = Channels
# H = Height, W = Width
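
A quick shape check of the conventions above (the specific sizes are arbitrary, chosen only for illustration):

import torch

B, S, H_in, H_out = 8, 128, 512, 256    # example sizes

batch = torch.randn(B, S, H_in)         # [B, S, H] activations
weight = torch.randn(H_in, H_out)       # [H_in, H_out] weight matrix
out = batch @ weight                    # matmul broadcasts over batch/sequence dims
print(out.shape)                        # torch.Size([8, 128, 256])

images = torch.randn(B, 3, 224, 224)    # [B, C, H, W] image batch
print(images.shape)                     # torch.Size([8, 3, 224, 224])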
Your laptop: Python 3.9, CUDA 11.2, PyTorch 1.10
Production: Python 3.8, CUDA 11.6, PyTorch 1.12
Result: Configuration mismatch leads to deployment failures

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
# Install Python
RUN apt-get update && apt-get install -y python3-pip
# Install ML libraries
RUN pip3 install torch torchvision numpy pandas
# Copy code
COPY train.py /app/train.py
WORKDIR /app
# Run training
CMD ["python3", "train.py"]
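The image is typically built with docker build and run with docker run --gpus all so the container can see the host GPUs; the --gpus flag requires the NVIDIA Container Toolkit on the host.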
Training script needs to know:
- Where is the parameter server?
- What are the worker IP addresses?
- Which GPU should I use?
- Where is the shared storage?
But in dynamic clusters:
- Nodes can fail and restart
- IPs change
- Services move
Common approaches to service discovery:
- Environment Variables
- DNS-Based
- Configuration Service
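
A sketch of the environment-variable approach (the variable names follow the torch.distributed convention and are assumptions here; a launcher or scheduler would inject them into each worker's environment):

import os

master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", 29500))
world_size  = int(os.environ.get("WORLD_SIZE", 1))
rank        = int(os.environ.get("RANK", 0))
local_rank  = int(os.environ.get("LOCAL_RANK", 0))   # which GPU to use on this node

print(f"worker {rank}/{world_size} -> {master_addr}:{master_port}, GPU {local_rank}")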



