Transformer Architecture¶

The transformer is the foundational architecture behind all modern LLMs. Introduced in 2017 ("Attention Is All You Need"), it replaced recurrent approaches with parallelizable self-attention, enabling massive scaling of model size and training data.

Evolution: RNN -> LSTM -> Transformer¶

RNN: Processes tokens sequentially, maintaining hidden state. Suffers from vanishing/exploding gradients on long sequences. Cannot parallelize.

LSTM: Added gates (forget, input, output) to control information flow. Better at long-range dependencies but still sequential - cannot parallelize training.

Transformer: Fully parallelizable via attention mechanism. Processes all tokens simultaneously. Scales with compute and data. Foundation for GPT, BERT, Claude, Llama, and all modern LLMs.

Core Architecture¶

Encoder-Decoder Structure¶

The original transformer was designed for translation with both components:

Encoder: processes input sequence, builds contextual representations
Decoder: generates output sequence using encoder representations via cross-attention

Modern variants specialize: - Encoder-only (BERT): bidirectional attention, best for classification, NER, embeddings - Decoder-only (GPT, Claude, Llama): autoregressive (causal), best for text generation

Self-Attention Mechanism¶

Each token attends to all other tokens to determine relevance.

Query, Key, Value (Q, K, V): 1. Each token produces three vectors: Query (what am I looking for?), Key (what do I contain?), Value (what information do I provide?) 2. Attention score = Q * K^T (relevance of each token) 3. Scale by sqrt(d_k) to prevent gradient issues in softmax 4. Softmax produces attention weights (sum to 1) 5. Weighted sum of Value vectors = output

Formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Intuition: For "The cat sat on the mat", when processing "sat", the model gives high attention to "cat" (subject), medium to "mat" (location), low to "the" (article).

Multi-Head Attention¶

Run H parallel attention "heads", each with separate Q, K, V weight matrices. Each head learns different relationship types (syntactic, semantic, positional). Outputs are concatenated and projected.

Typical head counts: 12 (BERT-base), 32 (GPT-3), 96 (large frontier models).

Feed-Forward Network (FFN)¶

After attention, each token passes through a 2-layer network:

FFN(x) = max(0, xW1 + b1)W2 + b2

First layer expands dimension (typically 4x hidden size)
ReLU/GELU activation
Second layer projects back to hidden size
This is where much of the model's "knowledge" is stored

Residual Connections and Layer Normalization¶

Residual connections: output = LayerNorm(x + Sublayer(x)) - prevents vanishing gradients in deep networks
Layer normalization: normalizes activations across feature dimension
Pre-norm (normalize before sublayer) is more stable for deep models than post-norm

BERT vs GPT¶

Feature	BERT	GPT
Architecture	Encoder-only	Decoder-only
Attention	Bidirectional (full context)	Causal (sees only past tokens)
Training	Masked Language Model (predict masked tokens)	Autoregressive (predict next token)
Best for	Classification, NER, Q&A, embeddings	Text generation, chat, code

Positional Encoding¶

Transformers have no inherent sense of token order. Position information is injected via:

Type	Mechanism	Models	Extrapolation
Sinusoidal	Fixed sin/cos patterns	GPT-1, BERT, original Transformer	Poor beyond training length
Learned	Trainable position embeddings	GPT-2	Limited to trained length
RoPE (Rotary)	Rotates embeddings by position-dependent angle	LLaMA, Mistral, GPT-4	Good with NTK/YaRN extensions
ALiBi	Distance-based penalty on attention scores	Mistral variants	Good - works beyond training length

Attention Optimizations¶

FlashAttention: GPU-friendly tiled computation. Same accuracy, 2-4x faster, much less memory. Breaks Q/K/V into tiles instead of computing full attention matrix.
MQA (Multi-Query Attention): One K/V pair shared across all heads. Saves memory and compute with slight expressiveness tradeoff.
GQA (Grouped-Query Attention): Heads divided into groups, each sharing K/V. Used in Llama 2 70B. Middle ground between MHA and MQA.
MoE (Mixture of Experts): Only activate subset of parameters per token. Mixtral uses 2 of 8 experts per token (47B total, ~13B active).

Key Architectural Parameters¶

Parameter	BERT-base	GPT-3	Effect
Hidden size (d_model)	768	4096	Model capacity
Layers	12	96	Depth of reasoning
Attention heads	12	96	Attention pattern diversity
FFN inner dim	3072	16384	Knowledge storage
Vocab size	30K	50K	Language coverage
Max sequence	512	2048+	Context window

Gotchas¶

Attention complexity is O(n^2) with sequence length - this is why context windows are expensive
KV-cache memory grows as O(n * layers * d_model) per token during generation
"Lost in the middle" problem: information in the middle of long prompts is less likely to be used than beginning/end
Extending context beyond trained window risks quality degradation even with RoPE