Neural Networks and Deep Learning

Deep learning learns features automatically from raw data. It excels on images, text, audio, and sequences, where manual feature engineering is impractical.

Building Blocks

Dense (Fully Connected) Layer

Every input connected to every output: y = Wx + b
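As a quick sketch of what this means in PyTorch, `nn.Linear` stores W transposed, so the call evaluates y = x @ W.T + b:

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=3, out_features=2)  # W is 2x3, b has length 2
x = torch.randn(4, 3)                             # batch of 4 inputs
y = layer(x)                                      # computes x @ W.T + b

# Same computation by hand:
manual = x @ layer.weight.T + layer.bias
print(y.shape)                          # torch.Size([4, 2])
print(torch.allclose(y, manual))        # True
```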

Activation Functions

| Function  | Formula                         | Range          | Use Case                     |
|-----------|---------------------------------|----------------|------------------------------|
| ReLU      | max(0, x)                       | [0, inf)       | Default for hidden layers    |
| LeakyReLU | max(0.01x, x)                   | (-inf, inf)    | Fixes dead neuron problem    |
| Sigmoid   | 1/(1+e^(-x))                    | (0, 1)         | Binary classification output |
| Tanh      | (e^x - e^(-x))/(e^x + e^(-x))   | (-1, 1)        | Zero-centered alternative    |
| Softmax   | e^(x_i) / sum_j(e^(x_j))        | (0, 1), sum=1  | Multi-class output           |

Without nonlinear activations, stacking linear layers collapses into a single linear layer.
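This collapse can be verified directly; a minimal sketch that folds two stacked linear layers into one equivalent layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 3))  # two linear layers, no activation

# f(x) = W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W1, b1 = f[0].weight, f[0].bias
W2, b2 = f[1].weight, f[1].bias
combined = nn.Linear(4, 3)
with torch.no_grad():
    combined.weight.copy_(W2 @ W1)
    combined.bias.copy_(W2 @ b1 + b2)

x = torch.randn(5, 4)
print(torch.allclose(f(x), combined(x), atol=1e-5))  # True
```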

Training

Loss Functions

  • Binary Cross-Entropy: L = -[ylog(p) + (1-y)log(1-p)]
  • Categorical Cross-Entropy: L = -sum(y_i * log(p_i))
  • MSE: L = (1/n) * sum((y_i - y_hat_i)^2). For regression
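The cross-entropy formulas above match what the library functions compute; a sketch checking both by hand (note that PyTorch's `cross_entropy` expects raw logits, not softmax outputs):

```python
import torch
import torch.nn.functional as F

# Binary cross-entropy on a single probability
y, p = 1.0, 0.9
manual_bce = -(y * torch.log(torch.tensor(p)) + (1 - y) * torch.log(torch.tensor(1 - p)))
lib_bce = F.binary_cross_entropy(torch.tensor([p]), torch.tensor([y]))
print(torch.allclose(manual_bce, lib_bce))  # True

# Categorical cross-entropy: -log of the softmax probability of the true class
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])                  # true class index
probs = F.softmax(logits, dim=1)
manual_cce = -torch.log(probs[0, 0])
lib_cce = F.cross_entropy(logits, target)   # takes logits directly
print(torch.allclose(manual_cce, lib_cce))  # True
```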

Training Loop (PyTorch)

for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:
        predictions = model(batch_x)       # forward pass
        loss = criterion(predictions, batch_y)
        optimizer.zero_grad()               # clear gradients
        loss.backward()                     # backpropagation
        optimizer.step()                    # update weights

Optimizers

  • SGD: w -= lr * gradient. Simple, good with momentum
  • Adam: adaptive per-parameter learning rates. Default choice for most tasks
  • Learning rate: most important hyperparameter. Try 1e-3, 3e-4, 1e-4
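The SGD update rule above can be checked against the optimizer itself; a minimal sketch (no momentum):

```python
import torch

# One SGD step by hand vs. torch.optim.SGD
w = torch.tensor([2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # dloss/dw = 2w = 4.0
opt.zero_grad()
loss.backward()
opt.step()              # w <- 2.0 - 0.1 * 4.0
print(w.item())         # ~1.6

# Switching to Adam is a one-line change:
# opt = torch.optim.Adam([w], lr=1e-3)
```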

Regularization

  • Dropout: randomly zero out fraction of neurons during training (typical: 0.2-0.5)
  • Batch Normalization: normalize layer inputs within mini-batch. Allows higher learning rates
  • Early Stopping: monitor validation loss, stop when it increases
  • Data Augmentation: random transforms (flip, rotate, crop for images)
  • Weight Decay: L2 regularization as optimizer parameter
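The techniques above mostly translate into one-liners in PyTorch; an illustrative sketch combining dropout, batch normalization, and weight decay:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.BatchNorm1d(128),  # normalize activations within the mini-batch
    nn.ReLU(),
    nn.Dropout(0.3),      # zero out 30% of activations during training
    nn.Linear(128, 10),
)

# Weight decay (L2 regularization) is an optimizer argument in PyTorch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Dropout and BatchNorm behave differently at train vs. eval time:
model.train()  # dropout active, batch statistics used
model.eval()   # dropout off, running statistics used
```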

PyTorch Example

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = SimpleNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Keras/TensorFlow Example

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

Hyperparameters

| Parameter     | Typical Range                    | Notes                                           |
|---------------|----------------------------------|-------------------------------------------------|
| Learning rate | 1e-4 to 1e-2                     | Most important - tune first                     |
| Batch size    | 32, 64, 128, 256                 | Larger = faster, smaller = better generalization |
| Layers/units  | Task-dependent                   | Start small, increase if underfitting           |
| Dropout rate  | 0.1-0.5                          | Higher for larger models                        |
| Weight init   | He (ReLU), Xavier (tanh/sigmoid) | Usually handled by framework                    |
| Epochs        | Use early stopping               | Don't fix a number                              |

Gotchas

  • Vanishing gradients: deep networks with sigmoid/tanh. Fix: use ReLU, skip connections, BatchNorm
  • Exploding gradients: gradient clipping helps. Common in RNNs
  • Dead ReLU neurons: if a unit's pre-activation stays negative, its gradient is zero and it never updates. Fix: LeakyReLU
  • Neural networks need MUCH more data than tree-based models for tabular data
  • Always normalize inputs (StandardScaler or similar)
  • GPU is practically required for non-trivial models
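Two of these fixes in code form; a sketch of per-feature input normalization and gradient clipping (the model and data here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x = torch.randn(32, 10)

# Normalize inputs: zero mean, unit variance per feature
x = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)

loss = criterion(model(x), torch.randn(32, 1))
loss.backward()

# Gradient clipping guards against exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(float(total_norm))  # at most ~1.0 after clipping
```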

See Also