High Bias, Low Variance                     Low Bias, High Variance
(Underfitting)                              (Overfitting)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Simple model (linear) ◄──────────► Complex model (deep net)
Poor on train & test      Sweet spot      Great on train, bad on test

Fix underfitting:                  Fix overfitting:
• More features                    • More data
• More complex model               • Regularization (L1/L2, dropout)
• Less regularization              • Early stopping
                                   • Data augmentation
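The overfitting fixes above map to concrete knobs in PyTorch. A minimal sketch (layer sizes, dropout rate, and weight-decay value are placeholders, not recommendations):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # regularization: randomly zeros activations during training
    nn.Linear(64, 2),
)

# weight_decay adds an L2 penalty on the weights at each update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

model.eval()   # dropout is disabled in eval mode, so inference is deterministic
```

Calling `model.train()` re-enables dropout before the next training epoch.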
PyTorch
Tensors & Basics
import torch
import torch.nn as nn
# Tensor creation
x = torch.tensor([1, 2, 3], dtype=torch.float32)
z = torch.zeros(3, 4)
r = torch.randn(3, 4) # standard normal
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
# Operations
y = x @ W.T # matrix multiply
y = x.view(-1, 784) # reshape
y = x.unsqueeze(0) # add batch dim
y = torch.cat([a, b], dim=0) # concatenate

# Autograd
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print(x.grad) # dy/dx = 2x + 3 = 7.0
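The same autograd machinery drives a manual gradient-descent loop. A minimal sketch minimizing the function from the example above (the learning rate and step count are illustrative):

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
lr = 0.1
for _ in range(50):
    y = x ** 2 + 3 * x          # minimum at x = -1.5
    y.backward()                # populates x.grad with dy/dx = 2x + 3
    with torch.no_grad():       # update must not be tracked by autograd
        x -= lr * x.grad
    x.grad.zero_()              # clear gradient; .backward() accumulates otherwise

print(x.item())                 # converges toward -1.5
```

Forgetting `x.grad.zero_()` is a classic bug: gradients from every iteration would pile up in `x.grad`.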
Experiment Tracking (MLflow)
import mlflow
mlflow.set_experiment("my_experiment")
with mlflow.start_run():
    mlflow.log_param("lr", 1e-3)
    mlflow.log_param("epochs", 10)
    # ... training ...
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("f1", 0.93)
    mlflow.pytorch.log_model(model, "model")
Model Serving (FastAPI)
from fastapi import FastAPI
import torch
app = FastAPI()
model = torch.load("model.pt")
model.eval()
@app.post("/predict")
async defpredict(data: dict):
tensor = torch.tensor(data["features"])
with torch.no_grad():
pred = model(tensor)
return {"prediction": pred.tolist()}
GPU & Performance
# Check GPU
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.get_device_name(0)
# Mixed precision training (2x speed, less memory)
scaler = torch.cuda.amp.GradScaler()
for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(batch)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
# Multi-GPU (DataParallel — simple)
model = nn.DataParallel(model)
# Memory tips
torch.cuda.empty_cache()
# Use gradient accumulation for large effective batch size
# Use gradient checkpointing to trade compute for memory
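Gradient accumulation, mentioned above, runs several small "micro-batches" through `.backward()` before each optimizer step; gradients sum in `.grad`, so the update sees a larger effective batch. A minimal sketch on random data (model, batch sizes, and `accum_steps` are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
accum_steps = 4   # effective batch = accum_steps × micro-batch size

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(2, 10)                 # micro-batch of 2
    target = torch.randn(2, 1)
    loss = criterion(model(x), target) / accum_steps  # scale so grads average
    loss.backward()                        # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:      # step only every accum_steps batches
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps the accumulated gradient comparable to one large-batch gradient.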
Math Cheat Sheet
Concept    Formula / Description                                     Used In
Sigmoid    σ(x) = 1 / (1 + e⁻ˣ) → maps to (0,1)                      Binary classification output
Softmax    softmax(xᵢ) = eˣⁱ / Σeˣʲ → probability distribution