CNNs and Computer Vision¶

★★★★★ Advanced

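Convolutional Neural Networks (CNNs) exploit the spatial structure of images through local pattern detection and weight sharing. Convolution is translation-equivariant (a shifted input gives a shifted feature map), and pooling adds a degree of translation invariance. From classification to generation, CNNs reshaped visual understanding.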

Image as Data¶

Image = 3D tensor (height x width x channels). Grayscale: 1 channel. RGB: 3 channels. In a standard 8-bit image, each pixel value is an integer in [0, 255].

For simple models: flatten to 1D (28x28 -> 784 features). Problem: loses spatial structure. CNNs solve this.
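A minimal NumPy sketch of the flattening problem, assuming a random 28x28 grayscale image (the array `img` and the helper `flat_index` are illustrative names, not from any dataset or library). It suggests why spatial structure is lost: all values survive, but vertically adjacent pixels end up 28 positions apart in the feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 28x28 grayscale image with 8-bit intensities
img = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

flat = img.reshape(-1)  # row-major flatten: 28 * 28 = 784 features

def flat_index(row, col, width=28):
    """Position of pixel (row, col) in the row-major flattened vector."""
    return row * width + col

# Horizontal neighbours stay adjacent; vertical neighbours are 28 apart
horizontal_gap = flat_index(10, 11) - flat_index(10, 10)
vertical_gap = flat_index(11, 10) - flat_index(10, 10)
```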

Convolution Layer¶

Slide small filter (kernel) across input. Each filter detects one pattern (edge, texture, shape). Convolution reduces to two operations: multiply element-wise, then sum. Three objects: input image, filter (kernel), output image (feature map).

Kernel size: 3x3, 5x5 typical. Small kernels stacked = large receptive field
Stride: step size. Stride 2 halves spatial dimensions
Padding: "same" preserves size, "valid" reduces
Channels: input channels matched by filter depth; output channels = number of filters
1x1 convolution: linear combination of channels (bottleneck, dimensionality reduction)
Bias term: each filter has a scalar bias added after the convolution sum, before activation
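The effect of kernel size, stride, and padding on output size can be sketched with the standard formula floor((n + 2p - k) / s) + 1. The helper names below are assumptions made for illustration; they are not part of any framework.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output size along one spatial dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def same_padding(kernel):
    """Padding that preserves size for stride 1 and an odd kernel size."""
    return (kernel - 1) // 2

def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of stacked stride-1 convolutions with equal kernel size."""
    return 1 + num_layers * (kernel - 1)

valid_out = conv_output_size(32, kernel=3)                            # "valid": shrinks
same_out = conv_output_size(32, kernel=3, padding=same_padding(3))    # "same": preserved
strided_out = conv_output_size(32, kernel=3, stride=2, padding=1)     # stride 2: halved
```

Under these assumptions, two stacked 3x3 layers see a 5x5 region and three see a 7x7 region, which is the sense in which small stacked kernels give a large receptive field.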

Convolution as Feature Detection¶

Different kernels detect different features:

Blur kernel: averaging filter smooths the image (noise reduction)
Edge detection: Sobel-style filters highlight boundaries between regions
Sharpen kernel: amplifies differences between adjacent pixels
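A small NumPy sketch of hand-crafted kernels, assuming a toy 6x6 image with a single vertical edge. `conv2d_valid` is an illustrative helper written for this example (a plain loop implementation of "valid" cross-correlation, which is the operation deep learning libraries call convolution), not a library function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply element-wise, then sum
    return out

# Toy image: dark left half (0), bright right half (1) -> one vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
blur = np.full((3, 3), 1.0 / 9.0)

edges = conv2d_valid(image, sobel_x)  # non-zero only where the window straddles the edge
smooth = conv2d_valid(image, blur)    # the sharp step becomes a gradual ramp
```

On this input the Sobel response is zero in flat regions and positive only around the boundary, while the averaging kernel spreads the step across several columns.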

In a trained CNN, kernels are not hand-crafted - they are learned via backpropagation. Early layers learn edge/texture detectors; deeper layers learn complex pattern detectors (eyes, wheels, text).

Pooling¶

Reduce spatial dimensions (downsampling). Max pooling (take max in window) most common. Typical: 2x2, stride 2.

Pool size 2 with stride 2: 100x100 -> 50x50 (halves each dimension)
Adds a degree of local translation invariance - shifts smaller than the pooling window often leave the pooled output unchanged
Average pooling: take mean instead of max. Used in some architectures (GoogLeNet)
Global Average Pooling: reduce entire feature map to single value per channel. Replaces flatten+dense in modern architectures (fewer parameters, less overfitting)
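The pooling variants above can be sketched in NumPy as follows. Both helpers are illustrative names for this example, assuming a single 2D feature map for max pooling and a channels-last (H, W, C) tensor for global average pooling.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2D feature map (no padding)."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

def global_average_pool(feature_maps):
    """Reduce (H, W, C) feature maps to one mean value per channel: shape (C,)."""
    return feature_maps.mean(axis=(0, 1))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]])
pooled = max_pool2d(fmap)                                # 4x4 -> 2x2
gap = global_average_pool(np.ones((7, 7, 512)))          # 7x7x512 -> 512 values
```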

CNN Two-Stage Architecture¶

A CNN has two stages:

Feature extraction stage: alternating Conv + Pool layers. Hierarchical - each layer detects features from the previous layer's output. This stage acts as a learned, image-specific feature extractor
Classification stage: standard dense (fully connected) layers that take extracted features and perform classification/regression

Input -> [Conv -> BN -> ReLU -> Pool] x N -> Flatten -> Dense -> Output

This separation explains why transfer learning works: stage 1 learns general image features (edges, textures, shapes) while stage 2 learns task-specific classification.

Architecture Evolution¶

Architecture	Year	Key Innovation
AlexNet	2012	First deep CNN to win ImageNet. ReLU, dropout, data augmentation
VGG	2014	Only 3x3 convs stacked deeply. Simple, uniform
GoogLeNet/Inception	2014	Parallel convs of different sizes, concatenated. 1x1 bottlenecks
ResNet	2015	Skip connections: output = F(x) + x. 50-152 layers
DenseNet	2017	Each layer connected to all previous layers
MobileNet	2017	Depthwise separable convolutions for mobile/edge
EfficientNet	2019	Compound scaling (width, depth, resolution)
ViT	2020	Vision Transformer - apply transformer to image patches

ResNet Skip Connections¶

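Make very deep networks trainable by easing gradient flow (this mitigates vanishing gradients and the accuracy degradation seen when plain networks get deeper):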

input -> Conv -> BN -> ReLU -> Conv -> BN -> (+input) -> ReLU

The identity shortcut lets gradients flow directly through the network. ResNet demonstrated that with skip connections, training a 152-layer network is feasible and outperforms shallower alternatives.
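A toy NumPy sketch of the residual pattern, under simplifying assumptions: dense layers stand in for the Conv -> BN pairs, batch normalization is omitted, and `residual_block` is a hypothetical helper name. It illustrates one property of the shortcut: when the residual branch outputs zero, the block reduces to the identity, so added layers do not have to hurt the mapping already learned.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block on feature vectors: relu(F(x) + x)."""
    fx = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(fx + x)      # identity shortcut added before the final ReLU

rng = np.random.default_rng(0)
x = relu(rng.normal(size=(4, 8)))   # non-negative activations, batch of 4

# With a zeroed residual branch the block is exactly the identity mapping
w_zero = np.zeros((8, 8))
identity_out = residual_block(x, w_zero, w_zero)

# With small random weights the output is a perturbation of the input
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
out = residual_block(x, w1, w2)
```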

Why Study Architecture Case Studies¶

Neural network architectures that work well on one computer vision task typically transfer to others. Key cross-pollination patterns:

ResNet's skip connections appeared in transformer architectures
Inception's multi-scale processing influenced feature pyramid networks
ViT showed that transformer attention can replace convolution entirely

Reading architecture papers builds intuition similar to how programmers learn by reading others' code.

Transfer Learning¶

Use pre-trained model (ImageNet: 1.2M images, 1000 classes) as starting point.

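import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Load ImageNet weights; the weights= argument replaces the deprecated
# pretrained=True (torchvision >= 0.13)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Option A: Feature extraction (freeze everything, replace head)
for param in model.parameters():
    param.requires_grad = False
# The new head is created after freezing, so it stays trainable
model.fc = nn.Linear(model.fc.in_features, num_classes)  # in_features is 2048 for ResNet-50

# Option B: Fine-tune last layers (start from Option A, then also unfreeze layer4)
for param in model.layer4.parameters():
    param.requires_grad = True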

Rule of thumb: smaller dataset = freeze more layers; larger dataset = fine-tune more.

Object Detection¶

Locate objects with bounding boxes + class labels.

Two-Stage Detectors¶

R-CNN: region proposals (selective search) -> CNN per region -> classify
Fast R-CNN: CNN on full image, extract features per region from feature map
Faster R-CNN: learned Region Proposal Network (RPN). Fully end-to-end

One-Stage Detectors¶

YOLO: divide image into SxS grid, each cell predicts boxes + classes. Single forward pass. Real-time
SSD: multi-scale feature maps for objects at different sizes

Detection Metrics¶

IoU: intersection / union of predicted and ground-truth box. A detection typically counts as correct when IoU >= 0.5
mAP@0.5: mean Average Precision at IoU threshold 0.5
mAP@0.5:0.95: mean Average Precision averaged over IoU thresholds 0.5 to 0.95 (step 0.05)
NMS (Non-Maximum Suppression): remove overlapping detections, keep highest confidence
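A plain-Python sketch of IoU and greedy NMS as described above, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The function names and the example boxes are illustrative; production code would normally use a vectorized library routine instead.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two heavily overlapping detections and one separate detection
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed and only the first and third boxes survive.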

Segmentation¶

Semantic Segmentation¶

Classify every pixel. No instance distinction.

FCN: replace FC layers with convolutions, upsample back
U-Net: encoder-decoder with skip connections. Excellent for medical imaging
DeepLab: atrous (dilated) convolutions for larger receptive field

Instance Segmentation¶

Detect objects AND segment pixel boundaries.

Mask R-CNN: Faster R-CNN + mask prediction branch per detected box

Segmentation Metrics¶

mIoU: mean IoU across classes (primary metric)
Dice coefficient: 2*|A intersect B| / (|A| + |B|). Equivalent to F1
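The two metrics can be sketched on boolean masks with NumPy. The helper names and the toy masks are assumptions for illustration; in particular, treating two empty masks as a perfect match is one convention among several (some implementations skip classes absent from both prediction and target).

```python
import numpy as np

def mask_iou(pred, target):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union > 0 else 1.0  # both empty: counted as a match here

def dice(pred, target):
    """Dice coefficient: 2*|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2.0 * inter / total if total > 0 else 1.0

def mean_iou(pred_labels, target_labels, num_classes):
    """mIoU: average the per-class IoU over all class ids."""
    ious = [mask_iou(pred_labels == c, target_labels == c) for c in range(num_classes)]
    return float(np.mean(ious))

pred = np.zeros((4, 4), dtype=bool)
target = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True     # 6 predicted pixels
target[0:2, 1:4] = True   # 6 ground-truth pixels, 4 of them overlap pred

iou_value = mask_iou(pred, target)   # 4 / 8
dice_value = dice(pred, target)      # 8 / 12
```

For a single pair of masks the two are related by Dice = 2 * IoU / (1 + IoU), so they rank predictions identically even though the values differ.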

Generative Models¶

GAN: Generator vs Discriminator adversarial training. Challenges: mode collapse, instability
VAE: encode to latent distribution, sample, decode. Smooth interpolation
Diffusion: iteratively denoise from pure noise. Current state-of-the-art for image quality
CycleGAN: unpaired image-to-image translation (style transfer, domain adaptation)

3D Vision¶

Depth estimation: predict distance per pixel from 2D image
Point cloud processing: PointNet for 3D point data
NeRF: learn 3D scene from 2D images, render novel views

Convolution on Color Images (3D Kernels)¶

Grayscale convolution: 2D filter slides over 2D image. Color images: 3D filter slides over 3D input (H x W x C).

Input: (H, W, 3) for RGB. Kernel: (k, k, 3) - same depth as input channels
Sum across all three channels at each position -> single scalar output
A 3D kernel acts as a color pattern detector: a filter for "red circle" will not match blue circles
Multiple filters -> multiple output channels: N filters of (k, k, 3) produce output of (H', W', N)
Each successive conv layer: input channels = previous layer's filter count, not 3

This is why the first conv layer has few parameters (3 input channels) but deeper layers have many (64, 128, 256 input channels from previous filter counts).
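The parameter growth can be checked with a short calculation: each filter holds k * k * C_in weights plus one bias, and there are C_out filters. The helper name and the 64/128/256 channel progression below are assumptions chosen to mirror the example counts in the text.

```python
def conv2d_params(kernel, in_channels, out_channels, bias=True):
    """Parameter count of a 2D conv layer with a square kernel."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)

first_layer = conv2d_params(3, in_channels=3, out_channels=64)      # RGB input
second_layer = conv2d_params(3, in_channels=64, out_channels=128)   # 64 input channels
third_layer = conv2d_params(3, in_channels=128, out_channels=256)   # 128 input channels
```

With these settings the counts are 1,792, then 73,856, then 295,168, even though the kernel size never changes.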

CNNs for Time Series Classification¶

CNNs are not limited to images - they work on any data with local spatial/temporal structure.

1D Convolution for Time Series¶

Input shape: (batch, timesteps, features) = (N, T, D). Use Conv1D instead of Conv2D.

import tensorflow as tf

num_classes = 6  # placeholder: set to the number of classes in your dataset

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, kernel_size=5, activation='relu',
                           input_shape=(128, 9)),  # T=128, D=9
    tf.keras.layers.MaxPooling1D(pool_size=3),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),  # alternatives: Flatten, GlobalAveragePooling1D
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

Kernel size selection: for longer sequences (T=128+), use larger initial kernels (5-7). For short sequences (T~10), smaller kernels (3). Larger T justifies larger kernels, same principle as larger images.
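To make the 1D case concrete, here is a NumPy sketch of what a single Conv1D layer computes, assuming "valid" padding and stride 1. `conv1d_valid` is an illustrative helper for this example; the shapes follow the T=128, D=9 setting used in the text, and the random inputs are placeholders.

```python
import numpy as np

def conv1d_valid(x, kernels, bias=None):
    """'Valid' 1D convolution (cross-correlation) along the time axis.

    x:       (T, D)    one sequence with D features per timestep
    kernels: (k, D, F) F filters, each spanning k timesteps and all D features
    returns: (T - k + 1, F)
    """
    k, _, f = kernels.shape
    t_out = x.shape[0] - k + 1
    out = np.zeros((t_out, f))
    for t in range(t_out):
        window = x[t:t + k]  # (k, D) slice of consecutive timesteps
        # Multiply element-wise with every filter and sum over time and features
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    if bias is not None:
        out = out + bias
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 9))           # T=128 timesteps, D=9 features
kernels = rng.normal(size=(5, 9, 32))   # kernel_size=5, 32 filters
features = conv1d_valid(x, kernels)     # one 32-dim feature vector per window position
```

Each filter covers all 9 features at once and moves only along time, so the output has 124 positions rather than 128.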

Human Activity Recognition (HAR)¶

Classic CNN application on sensor time series. Input: accelerometer/gyroscope readings from smartphone (T=128 timesteps, D=9 channels for x/y/z from 3 sensors).

Multi-input architecture: combine time-series CNN with tabular features (statistics, FFT features) via concatenation:

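import tensorflow as tf

num_classes = 6            # placeholder: number of activity classes in your dataset
num_tabular_features = 20  # placeholder: number of hand-crafted features per sample

# Time series branch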
ts_input = tf.keras.Input(shape=(128, 9))
x = tf.keras.layers.Conv1D(32, 5, activation='relu')(ts_input)
x = tf.keras.layers.MaxPooling1D(3)(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x)

# Tabular branch
tab_input = tf.keras.Input(shape=(num_tabular_features,))
y = tf.keras.layers.Dense(64, activation='relu')(tab_input)

# Combine
combined = tf.keras.layers.Concatenate()([x, y])
output = tf.keras.layers.Dense(num_classes, activation='softmax')(combined)
model = tf.keras.Model(inputs=[ts_input, tab_input], outputs=output)

CNN for Time Series Forecasting¶

Use CNN as feature extractor, then dense layer for prediction. Reshape univariate series into (T, 1) input.

window_size = 24  # placeholder: number of past timesteps in each input window

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, kernel_size=3, activation='relu',
                           input_shape=(window_size, 1)),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)  # single step forecast
])

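Note: sequence lengths do not always divide evenly through pooling layers. With T=128, a kernel-5 "valid" convolution leaves 124 steps, and pooling with size 3 gives 41 (the last step is dropped). Always check model.summary() to verify output shapes at each layer.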
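The shape bookkeeping in the note above can be sketched without a framework. The helpers assume "valid" padding and a pooling stride equal to the pool size, which match the Keras defaults; the function names are made up for this example.

```python
def conv1d_len(t, kernel, stride=1):
    """Output length of a 'valid' 1D convolution."""
    return (t - kernel) // stride + 1

def pool1d_len(t, pool_size, stride=None):
    """Output length of 'valid' 1D pooling; stride defaults to pool_size."""
    stride = pool_size if stride is None else stride
    return (t - pool_size) // stride + 1

# Trace the time dimension through Conv1D(k=5) -> MaxPooling1D(3) -> Conv1D(k=3)
t = 128
after_conv1 = conv1d_len(t, kernel=5)
after_pool = pool1d_len(after_conv1, pool_size=3)   # 124 is not a multiple of 3
after_conv2 = conv1d_len(after_pool, kernel=3)
```

The trace gives 128 -> 124 -> 41 -> 39, which should agree with the output shapes reported by model.summary() for a model with those layers.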

Gotchas¶

Always normalize images. For ImageNet-pretrained models, scale pixels to [0, 1] first, then apply the ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
Data augmentation is almost always beneficial for vision tasks
Larger input resolution usually improves accuracy, but compute grows roughly quadratically with the image side length
Pre-trained models expect specific input sizes and normalization
YOLO versions vary widely - check which variant for fair comparison
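Conv1D for time series: the kernel slides along the time axis only and spans all feature channels at each step. Input shape is (batch, timesteps, features), versus (batch, height, width, channels) for Conv2D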
GlobalMaxPooling vs Flatten: GlobalMaxPooling reduces to fixed-size regardless of input length. Flatten requires fixed input size
Multi-input models need tf.keras.Model (functional API), not Sequential
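As an illustration of the normalization gotcha, here is a NumPy sketch that applies the ImageNet statistics by hand. It assumes a channels-last uint8 RGB image; the random image and the helper name `normalize_imagenet` are placeholders, and in practice a framework's own preprocessing transform would usually do this step.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_imagenet(img_uint8):
    """Scale an (H, W, 3) uint8 RGB image to [0, 1], then standardize per channel."""
    scaled = img_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD  # broadcasts over the channel axis

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # hypothetical input
normalized = normalize_imagenet(img)   # still (H, W, 3), channels last
chw = normalized.transpose(2, 0, 1)    # (3, H, W), the layout PyTorch models expect
```

Skipping the division by 255 before subtracting the mean is a common source of silently poor accuracy with pre-trained weights.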