NLP and Text Processing¶

NLP has evolved from manual feature engineering (bag-of-words) to pre-trained language models (transformers). The modern default is to fine-tune a pre-trained model rather than build one from scratch.

Text Preprocessing Pipeline¶

Tokenization¶

# Simple
tokens = text.split()

# NLTK (needs the Punkt tokenizer data, fetched once with nltk.download)
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

# Subword (BPE/WordPiece) - used by modern models
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(text)
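
To make the subword idea concrete without downloading a model, here is a minimal sketch of WordPiece-style greedy longest-match splitting. The vocabulary `TOY_VOCAB` and the helper names `wordpiece` and `tokenize` are hypothetical; real tokenizers add normalization, punctuation handling, and a learned vocabulary of tens of thousands of pieces.

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first over a tiny,
# hypothetical vocabulary. Not the actual BERT implementation.
TOY_VOCAB = {"un", "play", "##play", "##able", "##ing", "token", "##ization", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB):
    """Split one lowercase word into subword pieces, or [UNK] if impossible."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text):
    """Whitespace-split, lowercase, then split each word into pieces."""
    return [piece for word in text.lower().split() for piece in wordpiece(word)]


tokens = tokenize("Unplayable tokenization")
print(tokens)  # ['un', '##play', '##able', 'token', '##ization']
```

Two words become five tokens here, which is why token counts and word counts diverge with subword tokenizers.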

Cleaning¶

import re
text = text.lower()
text = re.sub(r'[^a-zA-Z\s]', '', text)  # remove non-alpha
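
As a rough illustration of what this kind of cleaning does to real input, the sketch below wraps the same two steps in a function and adds whitespace collapsing. The function name `clean_text` and the sample string are made up for the example.

```python
import re


def clean_text(text):
    """Lowercase, drop non-letter characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


cleaned = clean_text("Hello, World!  It's 2024...\n")
print(cleaned)  # hello world its
```

Note the side effects: digits disappear and "it's" becomes "its". That is acceptable for bag-of-words features but is usually skipped for transformer models, whose tokenizers expect raw text.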

Stop Words and Lemmatization¶

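import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
# The stop word list is lowercase, so compare in lowercase
tokens = [w for w in tokens if w.lower() not in stop_words]

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('running', pos='v')  # 'run'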

Stemming is crude suffix removal; lemmatization is dictionary-based and more accurate. With the adjective tag (pos='a'), the lemmatizer maps "better" -> "good", while a stemmer leaves "better" unchanged because no suffix rule applies.
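
The contrast can be sketched with two toy functions. Both the suffix list `SUFFIXES` and the lookup table `LEMMA_TABLE` are hypothetical stand-ins for a real stemmer (such as Porter) and a real dictionary (such as WordNet); the point is only the difference in mechanism.

```python
# Toy contrast between stemming and lemmatization (illustrative only).
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse", "running": "run"}


def crude_stem(word):
    """Strip the first matching suffix; no dictionary knowledge involved."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word."""
    return LEMMA_TABLE.get(word, word)


for w in ["running", "better", "mice"]:
    print(w, crude_stem(w), toy_lemmatize(w))
# running runn run
# better better good
# mice mice mouse
```

The stemmer produces a non-word ("runn") and cannot handle irregular forms, while the lookup handles both but only for words it knows.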

Text Vectorization¶

Bag of Words (BoW)¶

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)  # sparse matrix
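
To show what the resulting matrix contains, here is a small hand-rolled bag-of-words using only the standard library. The three sample documents and the helper `bow_vector` are invented for illustration; `CountVectorizer` additionally lowercases and applies its own token pattern.

```python
from collections import Counter

docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({token for doc in docs for token in doc.split()})


def bow_vector(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]


matrix = [bow_vector(doc) for doc in docs]
print(vocab)  # ['bark', 'cat', 'dogs', 'mat', 'on', 'sat', 'the']
for row in matrix:
    print(row)
# [0, 1, 0, 0, 0, 1, 1]
# [0, 1, 0, 1, 1, 1, 2]
# [1, 0, 1, 0, 0, 0, 0]
```

Most entries are zero even in this tiny case, which is why real vectorizers return sparse matrices. Word order is discarded entirely.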

TF-IDF¶

Weight words by importance: high TF-IDF = word is distinctive for this document.

TF(t, d) = count(t in d) / total_words(d)
IDF(t) = log(total_docs / docs_containing(t))
TF-IDF(t, d) = TF(t, d) * IDF(t)

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X = tfidf.fit_transform(texts)
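
The textbook formulas above can be worked through by hand. This is a minimal sketch on a made-up three-document corpus; the function names `tf`, `idf`, and `tf_idf` are illustrative. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat purred".split(),
]


def tf(term, doc):
    """Term frequency: share of the document's tokens that equal `term`."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Inverse document frequency; assumes `term` occurs in at least one doc."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


print(tf_idf("the", docs[0], docs))  # 0.0, since "the" is in every document
print(tf_idf("mat", docs[0], docs))  # (1/6) * ln(3), about 0.183
```

A word that appears everywhere gets weight zero regardless of how often it occurs, while a word unique to one document gets the largest IDF.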

Word Embeddings¶

Dense vector representations. Similar words have similar vectors.

import gensim.downloader as api
model = api.load('word2vec-google-news-300')
vector = model['king']  # 300-dim
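model.most_similar('king')  # nearest neighbours by cosine similarity
# king - man + woman ~ queen
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)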

Word2Vec: CBOW (predict word from context) or Skip-gram (predict context from word)
GloVe: trained on global co-occurrence statistics
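
What "similar vectors" and the analogy arithmetic mean can be sketched with hand-made vectors. The 3-d values below are hypothetical, chosen so the dimensions loosely read as royalty, maleness, and femaleness; real embeddings are learned and have no such labelled axes.

```python
import math

# Hypothetical 3-d vectors picked by hand for illustration.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.1, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed component-wise.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Exclude the query words themselves, as analogy lookups conventionally do.
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```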

Sequence Models¶

RNN / LSTM / GRU¶

Process tokens sequentially, maintaining hidden state.

Simple RNN: vanishing gradients for long sequences
LSTM: forget/input/output gates mitigate vanishing gradients. In practice effective up to a few hundred tokens (roughly 200-500)
GRU: simplified LSTM with two gates. Similar performance, fewer parameters
Bidirectional: process forward AND backward, concatenate

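import torch.nn as nn
# input_size=300 matches 300-dim word embeddings;
# batch_first=True means inputs are shaped (batch, seq_len, features)
lstm = nn.LSTM(input_size=300, hidden_size=128, num_layers=2,
               bidirectional=True, batch_first=True)
# Output feature size is 2 * hidden_size = 256 because bidirectional=True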
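
To illustrate "process tokens sequentially, maintaining hidden state", here is a minimal sketch of a simple (Elman) RNN cell in plain Python. The weights `w_xh`, `w_hh`, `b` and the input sequence are fixed, hypothetical values; a real layer learns them and uses tensor operations.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One simple RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h_new = []
    for j in range(len(h_prev)):
        total = b[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


# Hypothetical fixed weights: 2-d inputs, 2-d hidden state.
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0]  # initial hidden state
states = []
for x in sequence:
    h = rnn_step(x, h, w_xh, w_hh, b)  # the hidden state carries context forward
    states.append(h)

print(states[-1])  # final state, often used as the sequence representation
```

Each step depends on the previous one, so the loop cannot be parallelized across time. LSTM and GRU cells keep this structure but add gates that control what is kept or overwritten.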

Transformer Architecture¶

Transformers have largely replaced RNNs in NLP. Key innovation: self-attention relates any two positions in a single step, regardless of the distance between them.

Self-Attention¶

For each token, compute attention weights over ALL tokens in the sequence (including itself).

Q, K, V = linear projections of input
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Multi-Head Attention: parallel attention with different projections. Each head captures different relationship types.
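
The attention formula can be traced on a tiny example. This is a minimal single-head sketch in plain Python; the matrices `Q`, `K`, `V` are supplied directly as hypothetical values, whereas a real layer derives them from the input through learned linear projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


# Three tokens with d_k = 2 (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, attn = attention(Q, K, V)
for row in attn:
    print([round(w, 3) for w in row])
```

Every row of `attn` sums to 1, and every token's output mixes information from all positions at once. Multi-head attention runs several such computations in parallel with different projections and concatenates the results.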

Components¶

Positional Encoding: inject position information (self-attention has no inherent notion of order)
Layer Normalization: stabilize training
Feed-Forward Network: per-position MLP after attention
Residual Connections: around every sub-layer
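
As one example of these components, the sketch below computes the fixed sinusoidal positional encoding from the original Transformer design. This particular scheme is an assumption for illustration, since the text above does not name one; BERT, for instance, uses learned position embeddings instead. The function name `sinusoidal_encoding` is hypothetical.

```python
import math


def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of position encodings.

    PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
    PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_encoding(num_positions=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Each row is added to the token embedding at that position, so identical words at different positions get different inputs.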

BERT¶

Pre-trained bidirectional encoder. Fine-tune for downstream tasks.

Pre-training objectives:
1. Masked Language Model (MLM): predict the roughly 15% of input tokens that are randomly masked
2. Next Sentence Prediction (NSP): predict whether sentence B follows sentence A
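
The MLM objective can be sketched as a data-preparation step. This is a simplified illustration with a hypothetical helper `mask_tokens`: it replaces every selected token with `[MASK]`, whereas BERT's actual recipe replaces only 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random ~15% of tokens with [MASK]; return inputs and labels."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    inputs = ["[MASK]" if i in positions else tok
              for i, tok in enumerate(tokens)]
    # Labels hold the original token at masked positions and None elsewhere,
    # so the loss is computed only where the model has to predict.
    labels = [tok if i in positions else None
              for i, tok in enumerate(tokens)]
    return inputs, labels


tokens = ("the quick brown fox jumps over the lazy dog "
          "near the quiet river bank today").split()
inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

With 15 tokens, two positions are masked. Because the model sees context on both sides of each mask, the learned representations are bidirectional.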

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Great movie!", return_tensors="pt",
                   padding=True, truncation=True, max_length=512)

# Fine-tuning with Trainer (train_ds and eval_ds are tokenized datasets prepared beforehand)
from transformers import Trainer, TrainingArguments
args = TrainingArguments(output_dir='./results', num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()

BERT Variants¶

DistilBERT: 6 layers, 40% smaller, 60% faster, retains about 97% of BERT's performance
RoBERTa: better pre-training (more data, no NSP)
ALBERT: parameter sharing, smaller model
XLNet: permutation-based bidirectional context

Common NLP Tasks¶

Text Classification: sentiment, spam, topic categorization
NER (Named Entity Recognition): extract persons, organizations, locations
Question Answering: extract answer span from context
Machine Translation: sequence-to-sequence with attention
Summarization: extractive (select sentences) or abstractive (generate)
Text Generation: GPT-family autoregressive models

Practical Tips¶

Start with pre-trained models - almost always better than from-scratch
BERT fine-tuning LR: 2e-5 to 5e-5 (much lower than training from scratch)
Max sequence length: BERT = 512 tokens, including special tokens. Truncate or chunk longer texts
Tokenizer must match model (BERT tokenizer with BERT model)
For simple tasks + small data: TF-IDF + logistic regression is surprisingly competitive
Use HuggingFace Transformers library for unified API
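
For the chunking tip above, one common approach is a sliding window with overlap, so text cut at a boundary still appears with context in the next chunk. The helper `chunk_token_ids` and its default `overlap` of 128 are hypothetical choices, not a library API. The sketch also ignores special tokens; with BERT you would reserve two slots for `[CLS]` and `[SEP]`.

```python
def chunk_token_ids(token_ids, max_len=512, overlap=128):
    """Split a long token-id list into overlapping windows of at most max_len."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the text
    return chunks


ids = list(range(1200))  # stand-in for real token ids
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # [512, 512, 432]
```

Predictions from the individual chunks then need to be combined, for example by averaging class probabilities or taking the highest-scoring answer span.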

Gotchas¶

Tokenizer mismatch fails silently: no error is raised, but token IDs map to the wrong embeddings and predictions degrade
Subword tokenization means token count != word count
BERT is encoder-only (classification, NER), GPT is decoder-only (generation)
Fine-tuning on tiny datasets (< 1000 samples) may not improve over TF-IDF + classical ML
Multilingual models (mBERT, XLM-R) work, but often underperform language-specific models