
EE 547 - Unit 1
Fall 2025

Each GPU contains multiple Streaming Multiprocessors (SMs); the A100 has 108 SMs:
- Warp: 32 threads executing in lockstep
- Shared Memory: 164 KB per SM - fast local storage
- Tensor Cores: specialized units for matrix multiply (4x4 matrices)
Matrix Multiply Example:
C = A × B where A is 1000×1000, B is 1000×1000
CPU approach: 3 nested loops
N = 1000
C = [[0.0] * N for _ in range(N)]          # assumes A and B are N x N lists
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]
Time: 1 billion loop iterations (2 billion FLOPs) executed sequentially
GPU approach: All operations in parallel
Each output C[i][j] computed by separate thread
Time: ~1000 operations per thread (depth of the computation)
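
As a rough sketch of the same comparison in PyTorch (assumes PyTorch is installed and a CUDA GPU is present; actual timings depend entirely on your hardware):

import time
import torch

N = 1000
A = torch.randn(N, N)
B = torch.randn(N, N)

# CPU: a single matmul call, still O(N^3) work but vectorized across cores
t0 = time.perf_counter()
C_cpu = A @ B
cpu_ms = (time.perf_counter() - t0) * 1000

# GPU: every output element is handled by its own thread
if torch.cuda.is_available():
    A_gpu, B_gpu = A.cuda(), B.cuda()
    torch.cuda.synchronize()                  # wait for the host-to-device copies
    t0 = time.perf_counter()
    C_gpu = A_gpu @ B_gpu
    torch.cuda.synchronize()                  # kernels launch asynchronously; wait
    gpu_ms = (time.perf_counter() - t0) * 1000
    print(f"CPU: {cpu_ms:.2f} ms, GPU: {gpu_ms:.2f} ms")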


FLOP: Floating Point Operation
Basic operations (1 FLOP each):
- Addition: a + b
- Multiplication: a * b
- Fused Multiply-Add (FMA): a * b + c (counts as 2 FLOPs)
A100 GPU: 312 TFLOPS (FP16)
- 312 trillion operations per second
- But only for perfectly parallel operations
- Real utilization: 30-80% depending on the problem
Multiplying two N×N matrices:
FLOPs = 2N³ (N multiply-adds for each of the N² outputs)
1000×1000 matrices:
2 × 1000³ = 2 billion FLOPs
Time on A100: 2×10⁹ / 312×10¹² = 6.4 microseconds (theoretical)
Actual time: ~50 microseconds (memory bound)
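
A quick sketch of the arithmetic above, reusing the numbers quoted in these notes (the 312 TFLOPS peak and ~50 microsecond measured time come from the text, not from a new measurement):

# Back-of-the-envelope FLOP accounting for an N x N matrix multiply
N = 1000
flops = 2 * N**3                     # 2e9 FLOPs
peak_flops_per_s = 312e12            # A100 FP16 peak from the notes

theoretical_s = flops / peak_flops_per_s
measured_s = 50e-6                   # ~50 us observed (memory bound), from the notes

print(f"theoretical: {theoretical_s * 1e6:.1f} us")        # ~6.4 us
print(f"utilization: {theoretical_s / measured_s:.0%}")    # ~13% of peak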
FP32 (Single): [Sign:1][Exponent:8][Mantissa:23]
Range: ±1.2×10⁻³⁸ to ±3.4×10³⁸
Precision: ~7 decimal digits
FP16 (Half): [Sign:1][Exponent:5][Mantissa:10]
Range: ±6.1×10⁻⁵ to ±65,504
Precision: ~3 decimal digits
BF16 (BFloat16): [Sign:1][Exponent:8][Mantissa:7]
Range: ±1.2×10⁻³⁸ to ±3.4×10³⁸ (same as FP32)
Precision: ~2 decimal digits
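
A small precision/range demo using PyTorch dtypes (assumes PyTorch; the printed values follow IEEE rounding and should be reproducible):

import torch

x = torch.tensor(1.0 / 3.0)
print(x.to(torch.float32).item())     # 0.3333333432674408  (~7 good digits)
print(x.to(torch.float16).item())     # 0.333251953125      (~3 good digits)
print(x.to(torch.bfloat16).item())    # 0.333984375         (~2 good digits)

# Range: 70,000 overflows FP16 (max 65,504) but stays finite in BF16/FP32
big = torch.tensor(70000.0)
print(big.to(torch.float16).item())   # inf
print(big.to(torch.bfloat16).item())  # 70144.0 (rounded, but finite)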
Operation              Energy (pJ)   Relative Cost
32-bit FP Multiply         3.7            1x
32-bit Register Read       2.0            0.5x
32-bit L1 Cache Read       5.0            1.4x
32-bit L2 Cache Read      20              5.4x
32-bit DRAM Read         200             54x
32-bit Read from SSD  10,000          2,700x
Moving data from DRAM costs 54x more energy than computing on it!


One Computer:
┌──────────────────┐
│ CPU: 16 cores │
│ RAM: 128 GB │
│ GPU: 1× A100 │
│ Disk: 2 TB │
└──────────────────┘
Multiple Computers Working Together:
┌─────────┐ Network ┌─────────┐
│ Node 1 │←─────────→│ Node 2 │
└─────────┘ └─────────┘
↑ ↑
└──────────┬──────────┘
┌─────────┐
│ Node 3 │
└─────────┘
Primary Challenge: Coordination and communication
Scale-Up (Vertical)
- Buy a bigger machine
- Limits: physics and cost
- 128-core CPU: $50,000
- 8× A100 server: $200,000

Scale-Out (Horizontal)
- Add more machines
- Limits: communication overhead
- 8× single-GPU machines: $80,000
- Can add incrementally

Broadcast - one to all:
       [A]
      / | \
 [A] [A] [A]
Use: Distribute hyperparameters

Scatter - divide among processes:
[A,B,C,D]
    ↓
[A] [B] [C] [D]
Use: Distribute dataset

Gather - collect from all:
[A] [B] [C] [D]
    ↓
[A,B,C,D]
Use: Collect metrics

All-Reduce - reduce and broadcast:
[A] [B] [C] [D]
    ↓
[A+B+C+D] on all
Use: Gradient synchronization
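
A minimal all-reduce sketch using torch.distributed (the backend choice and script name are assumptions; a launcher such as torchrun sets the required environment variables for each process):

import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT per process
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU clusters
    rank = dist.get_rank()

    # Each process contributes its own tensor (stand-in for local gradients)
    t = torch.tensor([float(rank)])

    # After all_reduce, every process holds the sum 0 + 1 + ... + (world_size - 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, for example, torchrun --nproc_per_node=4 allreduce_demo.py (the script name is hypothetical).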

# Common shapes
batch_tensor = [B, S, H]
# B = Batch size (independent samples)
# S = Sequence length (tokens, time)
# H = Hidden dimension (features)
weight_matrix = [H_in, H_out]
# H_in = Input features
# H_out = Output features
image_batch = [B, C, H, W]
# B = Batch, C = Channels
# H = Height, W = Width
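
A quick shape check of the conventions above (the specific sizes are arbitrary, chosen only for illustration):

import torch

B, S, H_in, H_out = 8, 128, 512, 256    # example sizes

batch = torch.randn(B, S, H_in)         # [B, S, H] activations
weight = torch.randn(H_in, H_out)       # [H_in, H_out] weight matrix
out = batch @ weight                    # matmul broadcasts over batch/sequence dims
print(out.shape)                        # torch.Size([8, 128, 256])

images = torch.randn(B, 3, 224, 224)    # [B, C, H, W] image batch
print(images.shape)                     # torch.Size([8, 3, 224, 224])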
Your laptop: Python 3.9, CUDA 11.2, PyTorch 1.10
Production: Python 3.8, CUDA 11.6, PyTorch 1.12
Result: Configuration mismatch leads to deployment failures

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
# Install Python
RUN apt-get update && apt-get install -y python3-pip
# Install ML libraries
RUN pip3 install torch torchvision numpy pandas
# Copy code
COPY train.py /app/train.py
WORKDIR /app
# Run training
CMD ["python3", "train.py"]
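The image is typically built with docker build and run with docker run --gpus all so the container can see the host GPUs; the --gpus flag requires the NVIDIA Container Toolkit on the host.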
Training script needs to know:
- Where is the parameter server?
- What are the worker IP addresses?
- Which GPU should I use?
- Where is the shared storage?
But in dynamic clusters:
- Nodes can fail and restart
- IPs change
- Services move
Common approaches to service discovery:
- Environment Variables
- DNS-Based
- Configuration Service
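
A sketch of the environment-variable approach (the variable names follow the torch.distributed convention and are assumptions here; a launcher or scheduler would inject them into each worker's environment):

import os

master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", 29500))
world_size  = int(os.environ.get("WORLD_SIZE", 1))
rank        = int(os.environ.get("RANK", 0))
local_rank  = int(os.environ.get("LOCAL_RANK", 0))   # which GPU to use on this node

print(f"worker {rank}/{world_size} -> {master_addr}:{master_port}, GPU {local_rank}")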



