Attention Mechanisms and Transformers for Data Science¶

Attention allows models to focus on relevant parts of the input when producing each output element. Self-attention (within same sequence) is the core of Transformers. Originally NLP, now applied to tabular data, time series, vision, and multi-modal learning.

Self-Attention Mechanism¶

For each element, compute how much attention to pay to every other element:

Project input into Query (Q), Key (K), Value (V) matrices
Attention weights = softmax(Q * K^T / sqrt(d_k))
Output = weighted sum of V

import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) * (self.head_dim ** -0.5)
        attn = attn.softmax(dim=-1)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out(x)

Multi-head attention: run multiple attention heads in parallel with different learned projections. Each head can attend to different aspects of the input.

Transformer Block¶

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = SelfAttention(embed_dim, num_heads)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(embed_dim * mlp_ratio, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # residual + pre-norm
        x = x + self.mlp(self.norm2(x))
        return x

NLP Transformer Models¶

Encoder-Only (BERT-style)¶

Bidirectional context. Best for classification, NER, embeddings.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Machine learning is powerful", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # (batch, seq_len, 768)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token

Decoder-Only (GPT-style)¶

Autoregressive generation. Causal masking - each token only attends to previous tokens.

Encoder-Decoder (T5, BART)¶

Seq2seq tasks: translation, summarization, question answering.

Transformers for Tabular Data¶

TabTransformer¶

Embed categorical features as tokens, apply transformer layers:

# Conceptual TabTransformer
class TabTransformer(nn.Module):
    def __init__(self, cat_dims, num_continuous, num_classes, dim=32, depth=6):
        super().__init__()
        # Categorical embeddings
        self.cat_embeds = nn.ModuleList([
            nn.Embedding(n, dim) for n in cat_dims
        ])
        # Transformer on categorical features
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8),
            num_layers=depth
        )
        # Combine with numerical features
        total_dim = len(cat_dims) * dim + num_continuous
        self.head = nn.Linear(total_dim, num_classes)

    def forward(self, x_cat, x_num):
        cat_embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeds)]
        cat_tensor = torch.stack(cat_embs, dim=1)
        cat_out = self.transformer(cat_tensor).flatten(1)
        combined = torch.cat([cat_out, x_num], dim=1)
        return self.head(combined)

FT-Transformer¶

State-of-art for tabular. Treats ALL features (numerical + categorical) as tokens.

Key finding: on many tabular benchmarks, gradient boosting still beats transformers. FT-Transformer is competitive but not consistently better than XGBoost/CatBoost.

Transformers for Time Series¶

# Informer-style time series forecasting (simplified)
class TimeSeriesTransformer(nn.Module):
    def __init__(self, input_dim, d_model, nhead, num_layers, pred_len):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=d_model*4),
            num_layers=num_layers
        )
        self.decoder = nn.Linear(d_model, pred_len)

    def forward(self, x):
        x = self.input_proj(x)
        x = self.pos_encoding(x)
        x = self.encoder(x)
        return self.decoder(x[:, -1, :])  # predict from last token

Positional Encoding¶

Transformers have no inherent notion of order. Add position information:

Sinusoidal: fixed, works well for moderate sequence lengths
Learned: trainable embeddings per position
Rotary (RoPE): relative position, extrapolates to longer sequences
ALiBi: attention bias based on distance, no extra parameters

Gotchas¶

Quadratic memory in sequence length: self-attention is O(n^2) in memory and compute. For sequences > 2048 tokens, use efficient attention variants (Flash Attention, linear attention) or chunk the input
Transformers need more data than CNNs/trees: without large datasets or pretraining, transformers underperform simpler architectures. For tabular data with < 10K samples, gradient boosting almost always wins. Use pretrained models when possible
Learning rate warmup is critical: transformers are sensitive to initial learning rate. Standard practice: linear warmup for first 5-10% of steps, then cosine or linear decay. Without warmup, training often diverges