
Generative Models

Intermediate

Models that learn to generate new data samples from a learned distribution. From GANs to diffusion models - the technology behind image generation, style transfer, and data augmentation.

GANs (Generative Adversarial Networks)

Two networks competing:

  • Generator G: creates fake samples from random noise
  • Discriminator D: distinguishes real from fake

Training: D is trained to classify real and generated samples correctly, while G is trained to make D label its samples as real. This adversarial game pushes both networks to improve.
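The two objectives can be sketched with binary cross-entropy on the discriminator's outputs. This is a minimal NumPy illustration, not a training loop: the function names and the example probabilities are assumptions, and the generator loss shown is the commonly used non-saturating form rather than the only possible choice.

```python
import numpy as np

def bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)))

def discriminator_loss(d_real, d_fake):
    """D wants to output 1 on real samples and 0 on generated ones."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    """Non-saturating form: G wants D to output 1 on generated samples."""
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs (probability of "real") for one batch.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.3, 0.2])

print("D loss:", discriminator_loss(d_real, d_fake))  # low: D is currently winning
print("G loss:", generator_loss(d_fake))              # high: G is not fooling D yet
```

In an actual GAN the two losses are minimized in alternation, each with respect to its own network's parameters.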

GAN Variants

Variant          | Key Idea
Conditional GAN  | Generator and discriminator conditioned on class label
CycleGAN         | Unpaired image-to-image translation (A->B and B->A)
StyleGAN         | Style-based generator for high-quality face synthesis
Pix2Pix          | Paired image-to-image translation

CycleGAN

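Learns a bidirectional mapping between two domains without paired data by enforcing cycle consistency: translating A->B->A should recover the original image. Applications: age progression, style transfer, domain adaptation.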
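The cycle-consistency term that makes unpaired training workable can be sketched as follows. The generators here are toy functions standing in for neural networks (an assumption for illustration), and the full CycleGAN objective also includes adversarial losses for both directions, which are omitted.

```python
import numpy as np

def cycle_consistency_loss(G, F, x_a, x_b):
    """L1 cycle loss: F(G(a)) should return a, and G(F(b)) should return b."""
    loss_a = np.mean(np.abs(F(G(x_a)) - x_a))  # A -> B -> A
    loss_b = np.mean(np.abs(G(F(x_b)) - x_b))  # B -> A -> B
    return float(loss_a + loss_b)

# Toy stand-ins for the two generators (hypothetical, not real networks).
G = lambda x: 2.0 * x + 1.0        # maps domain A to domain B
F = lambda y: (y - 1.0) / 2.0      # maps B back to A, exact inverse of G
F_bad = lambda y: y / 2.0          # an inconsistent reverse mapping

rng = np.random.default_rng(0)
x_a = rng.normal(size=(4, 8))      # unpaired batch from domain A
x_b = rng.normal(size=(4, 8))      # unpaired batch from domain B

print(cycle_consistency_loss(G, F, x_a, x_b))      # near zero: mappings are consistent
print(cycle_consistency_loss(G, F_bad, x_a, x_b))  # clearly positive
```

Note that the two batches never need to correspond to each other; each sample is only compared with its own round trip.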

GAN Challenges

  • Mode collapse: the generator produces only a limited variety of samples
  • Training instability: oscillating losses, failure to converge
  • Catastrophic forgetting: when categories are learned sequentially, training on new ones erases the ability to generate old ones. Common fixes:
      • Generative Replay: replay previously generated samples of old categories during training
      • EWC (Elastic Weight Consolidation): penalize changes to weights that were important for earlier categories
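The EWC idea of penalizing changes to important weights can be sketched as a quadratic penalty weighted by a per-parameter importance estimate (typically the diagonal of the Fisher information). The parameter vectors, importance values, and `lam` below are made-up illustration values.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return float(0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2))

# Hypothetical parameters after the old task and their estimated importance.
theta_old = np.array([1.0, -2.0, 0.5])
fisher = np.array([10.0, 0.1, 0.0])   # first weight matters most, last not at all

move_important = theta_old + np.array([0.1, 0.0, 0.0])
move_unimportant = theta_old + np.array([0.0, 0.0, 0.1])

print(ewc_penalty(move_important, theta_old, fisher))    # penalized
print(ewc_penalty(move_unimportant, theta_old, fisher))  # free to change
```

During training on new categories, this penalty would be added to the usual loss so that only unimportant weights move freely.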

VAE (Variational Autoencoder)

The encoder maps each input to a latent distribution (mean and variance), a latent vector is sampled from that distribution, and the decoder reconstructs the input from the sample.

Loss = reconstruction loss + KL divergence (keeps each encoded latent distribution close to the standard normal prior N(0, I))
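A minimal NumPy sketch of the loss, assuming a diagonal Gaussian encoder that outputs `mu` and `log_var`, and squared error as the reconstruction term (binary cross-entropy is another common choice). The `beta` weight and function names are illustrative assumptions.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps so the sample stays differentiable in mu, sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over batch."""
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return float(np.mean(kl))

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Reconstruction error plus (optionally weighted) KL term."""
    recon = float(np.mean(np.sum((x - x_recon) ** 2, axis=1)))
    return recon + beta * kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.zeros((2, 3))
log_var = np.zeros((2, 3))
z = reparameterize(mu, log_var, rng)        # shape (2, 3)

print(kl_to_standard_normal(mu, log_var))   # zero when the encoder outputs exactly N(0, I)
```

The KL term grows as the encoder's means drift from zero or its variances drift from one, which is what keeps the latent space compact enough to sample from.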

Advantages over GANs: stable training, smooth latent space interpolation, and an explicit likelihood model (trained through a lower bound on the data likelihood). Disadvantage: outputs tend to be blurrier than GAN outputs.

Diffusion Models

Iteratively denoise from pure Gaussian noise to generate samples. Current state-of-the-art for image quality.

Forward process: gradually add noise to data over T steps. Reverse process: learn to denoise at each step. Neural network predicts noise to subtract.
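The forward process has a closed form, so a noisy sample at any step t can be drawn directly from the clean data. The sketch below assumes a DDPM-style linear noise schedule; the schedule endpoints, `T`, and function names are illustrative assumptions, and the denoising network itself is not shown.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # fraction of signal variance left at step t

def q_sample(x0, t, noise):
    """Draw x_t from q(x_t | x_0) in one shot instead of t sequential steps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

def noise_prediction_loss(predicted_noise, noise):
    """Training target for the reverse process: predict the noise that was added."""
    return float(np.mean((predicted_noise - noise) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16,))             # stand-in for a data sample
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 0, noise)        # almost identical to x0
x_late = q_sample(x0, T - 1, noise)     # almost pure noise
```

At generation time the process runs in reverse: start from pure noise and repeatedly subtract the network's noise estimate, which is why sampling needs many network evaluations.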

Key models: DDPM, Stable Diffusion, DALL-E 2 (the original DALL-E was autoregressive rather than diffusion-based), Midjourney.

Advantages: superior sample quality, stable training, flexible conditioning. Disadvantage: slow generation (many denoising steps), though distillation methods help.

3D Generative Models

  • NeRF (Neural Radiance Fields): learn 3D scene from 2D images, render novel views
  • Point cloud generation: PointNet-based generative models
  • 3D-aware GANs: generate 3D-consistent images

Neural Style Transfer

Combines content from one image with artistic style from another. Uses pre-trained CNN feature representations at different layers.

Three components:

  • Content image (C): the photo to transform
  • Style image (S): the artistic reference (e.g., Van Gogh's Starry Night)
  • Generated image (G): output combining C's structure with S's style

How it works:

  1. Extract features from a pre-trained CNN (VGG-19) at multiple layers
  2. Content loss: difference between C and G features at deeper layers (captures structure/objects)
  3. Style loss: difference between Gram matrices of S and G features at multiple layers (captures textures/patterns/colors)
  4. Total loss: weighted sum of content and style loss
  5. Optimize G by gradient descent on the pixel values
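The loss terms in these steps can be sketched with NumPy. Random arrays stand in for VGG-19 feature maps here, and the normalization constant and the `alpha`/`beta` weights are assumptions; the point is to show why Gram matrices capture texture but not layout.

```python
import numpy as np

def gram_matrix(features):
    """(C, H, W) feature map -> (C, C) matrix of channel co-activations."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def content_loss(f_content, f_generated):
    """Mean squared difference between feature maps at one (deep) layer."""
    return float(np.mean((f_generated - f_content) ** 2))

def style_loss(style_feats, generated_feats):
    """Sum over layers of squared Gram-matrix differences."""
    return float(sum(np.mean((gram_matrix(g) - gram_matrix(s)) ** 2)
                     for s, g in zip(style_feats, generated_feats)))

def total_loss(f_c, style_feats, g_c, g_style_feats, alpha=1.0, beta=1e3):
    """Weighted sum of content and style terms (weights are illustrative)."""
    return alpha * content_loss(f_c, g_c) + beta * style_loss(style_feats, g_style_feats)

# Stand-in feature map, plus a copy with its spatial positions shuffled.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 4))
perm = rng.permutation(16)
shuffled = feats.reshape(3, 16)[:, perm].reshape(3, 4, 4)

print(style_loss([feats], [shuffled]))  # near zero: Gram matrices ignore layout
print(content_loss(feats, shuffled))    # positive: content loss is layout-sensitive
```

Because the Gram matrix sums over all spatial positions, it keeps which features co-occur but discards where they occur, which is why it works as a style descriptor.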

Shallow CNN layers capture edges, textures, and colors. Deeper layers capture objects, faces, and composition. Style transfer exploits this hierarchy: match Gram-matrix statistics across several layers, starting from the shallow ones (style), while preserving deep-layer activations (content).

Applications

  • Image generation: faces, art, product images
  • Data augmentation: generate training samples for rare classes
  • Neural style transfer: apply artistic style to photos using CNN features
  • Super-resolution: upscale low-resolution images
  • Inpainting: fill missing regions in images
  • Text-to-image: generate images from text descriptions

Gotchas

  • GAN training requires careful hyperparameter tuning and monitoring
  • Generated images can contain artifacts (extra fingers, text distortion)
  • Evaluation is hard: FID (Fréchet Inception Distance) is the standard metric but imperfect
  • Copyright and ethical concerns with training data
  • Diffusion models need significant GPU memory and time for generation
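For the evaluation point above, the Fréchet distance behind FID can be sketched as follows. Real FID computes it on Inception network features of real and generated images; the random feature arrays below are stand-ins, and the eigenvalue-based trace of the matrix square root is one possible way to compute it without extra dependencies.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 8))   # stand-in for features of real images
shifted_feats = real_feats + 3.0         # same spread, different mean

print(frechet_distance(real_feats, real_feats))     # near zero
print(frechet_distance(real_feats, shifted_feats))  # large: means differ
```

The metric only compares means and covariances of features, which is one reason it is imperfect: two image sets can match on these statistics and still differ visibly.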

See Also