CNNs and Computer Vision¶

Convolutional Neural Networks exploit spatial structure in images through local pattern detection and translation invariance. From classification to generation, CNNs revolutionized visual understanding.

Image as Data¶

Image = 3D tensor (height x width x channels). Grayscale: 1 channel. RGB: 3 channels. Each pixel = integer [0, 255].

For simple models: flatten to 1D (28x28 -> 784 features). Problem: loses spatial structure. CNNs solve this.

Convolution Layer¶

Slide small filter (kernel) across input. Each filter detects one pattern (edge, texture, shape).

Kernel size: 3x3, 5x5 typical. Small kernels stacked = large receptive field
Stride: step size. Stride 2 halves spatial dimensions
Padding: "same" preserves size, "valid" reduces
Channels: input channels matched by filter depth; output channels = number of filters
1x1 convolution: linear combination of channels (bottleneck, dimensionality reduction)

Pooling¶

Reduce spatial dimensions. Max pooling (take max in window) most common. Typical: 2x2, stride 2.

CNN Pattern¶

Input -> [Conv -> BN -> ReLU -> (Pool)] x N -> Flatten -> Dense -> Output

Architecture Evolution¶

Architecture	Year	Key Innovation
AlexNet	2012	First deep CNN to win ImageNet. ReLU, dropout, data augmentation
VGG	2014	Only 3x3 convs stacked deeply. Simple, uniform
GoogLeNet/Inception	2014	Parallel convs of different sizes, concatenated. 1x1 bottlenecks
ResNet	2015	Skip connections: output = F(x) + x. 50-152 layers
DenseNet	2017	Each layer connected to all previous layers
MobileNet	2017	Depthwise separable convolutions for mobile/edge
EfficientNet	2019	Compound scaling (width, depth, resolution)
ViT	2020	Vision Transformer - apply transformer to image patches

ResNet Skip Connections¶

Solve vanishing gradient for deep networks:

input -> Conv -> BN -> ReLU -> Conv -> BN -> (+input) -> ReLU

The identity shortcut lets gradients flow directly through the network.

Transfer Learning¶

Use pre-trained model (ImageNet: 1.2M images, 1000 classes) as starting point.

import torchvision.models as models
import torch.nn as nn

model = models.resnet50(pretrained=True)

# Option A: Feature extraction (freeze everything, replace head)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(2048, num_classes)

# Option B: Fine-tune last layers
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(2048, num_classes)

Rule of thumb: smaller dataset = freeze more layers; larger dataset = fine-tune more.

Object Detection¶

Locate objects with bounding boxes + class labels.

Two-Stage Detectors¶

R-CNN: region proposals (selective search) -> CNN per region -> classify
Fast R-CNN: CNN on full image, extract features per region from feature map
Faster R-CNN: learned Region Proposal Network (RPN). Fully end-to-end

One-Stage Detectors¶

YOLO: divide image into SxS grid, each cell predicts boxes + classes. Single forward pass. Real-time
SSD: multi-scale feature maps for objects at different sizes

Detection Metrics¶

IoU: intersection / union of predicted and ground truth box. >= 0.5 = correct
[email protected]: mean Average Precision at IoU threshold 0.5
[email protected]:0.95: averaged over IoU thresholds 0.5 to 0.95 (step 0.05)
NMS (Non-Maximum Suppression): remove overlapping detections, keep highest confidence

Segmentation¶

Semantic Segmentation¶

Classify every pixel. No instance distinction. - FCN: replace FC layers with convolutions, upsample back - U-Net: encoder-decoder with skip connections. Excellent for medical imaging - DeepLab: atrous (dilated) convolutions for larger receptive field

Instance Segmentation¶

Detect objects AND segment pixel boundaries. - Mask R-CNN: Faster R-CNN + mask prediction branch per detected box

Segmentation Metrics¶

mIoU: mean IoU across classes (primary metric)
Dice coefficient: 2*|A intersect B| / (|A| + |B|). Equivalent to F1

Generative Models¶

GAN: Generator vs Discriminator adversarial training. Challenges: mode collapse, instability
VAE: encode to latent distribution, sample, decode. Smooth interpolation
Diffusion: iteratively denoise from pure noise. Current state-of-the-art for image quality
CycleGAN: unpaired image-to-image translation (style transfer, domain adaptation)

3D Vision¶

Depth estimation: predict distance per pixel from 2D image
Point cloud processing: PointNet for 3D point data
NeRF: learn 3D scene from 2D images, render novel views

Gotchas¶

Always normalize images (ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
Data augmentation is almost always beneficial for vision tasks
Larger input resolution = better accuracy but quadratic compute cost
Pre-trained models expect specific input sizes and normalization
YOLO versions vary widely - check which variant for fair comparison