Probabilistic Language Models¶
Statistical models that assign probabilities to sequences of words. Foundation for text generation, spelling correction, speech recognition, and cipher breaking. Understanding n-gram models is prerequisite for appreciating why transformers were transformative.
N-gram Models¶
An n-gram language model estimates the probability of the next word given the previous (n-1) words:
- Unigram: P(w) - word frequency, no context
- Bigram: P(w_i | w_{i-1}) - one word of context
- Trigram: P(w_i | w_{i-2}, w_{i-1}) - two words of context
from collections import Counter, defaultdict
def build_bigram_model(corpus):
"""Build bigram probabilities from tokenized corpus."""
bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
tokens = sentence.lower().split()
for i in range(len(tokens)):
unigram_counts[tokens[i]] += 1
if i > 0:
bigram_counts[(tokens[i-1], tokens[i])] += 1
# P(w2 | w1) = count(w1, w2) / count(w1)
model = defaultdict(dict)
for (w1, w2), count in bigram_counts.items():
model[w1][w2] = count / unigram_counts[w1]
return model
Sentence Probability¶
The probability of a sentence is the product of conditional probabilities:
In practice, use log probabilities to avoid underflow:
import math
def sentence_log_prob(sentence, model):
tokens = sentence.lower().split()
log_prob = 0
for i in range(1, len(tokens)):
prev, curr = tokens[i-1], tokens[i]
prob = model.get(prev, {}).get(curr, 1e-10) # smoothing
log_prob += math.log(prob)
return log_prob
Higher log probability = sentence is more likely to be real English. This is the core insight used in cipher decryption and article spinning detection.
Cipher Decryption with Language Models¶
A substitution cipher replaces each letter with another. To break it without the key:
- Build a language model of the target language (character-level bigrams or trigrams)
- Use a genetic/evolutionary algorithm to evolve candidate decryption keys
- Score each candidate by the language model probability of the decrypted text
- Select and mutate top candidates, repeat
import random
import string
def fitness(decrypted_text, language_model):
"""Score how 'English-like' the decrypted text is."""
return sentence_log_prob(decrypted_text, language_model)
def mutate(key):
"""Swap two random letter mappings."""
key = list(key)
i, j = random.sample(range(26), 2)
key[i], key[j] = key[j], key[i]
return ''.join(key)
# Evolutionary loop
population = [random_key() for _ in range(pool_size)]
for epoch in range(num_epochs):
scored = [(fitness(decrypt(cipher, k), lm), k) for k in population]
scored.sort(reverse=True)
parents = [k for _, k in scored[:top_n]]
population = [mutate(random.choice(parents)) for _ in range(pool_size)]
The genetic algorithm explores the key space (26! permutations) using mutation + selection guided by the language model score.
Article Spinning and Detection¶
Article spinning replaces words with synonyms to create "unique" content. Trigram-based context checking catches bad replacements:
| Original | Naive Spin | Problem |
|---|---|---|
| quantum gate | quantum door | "gate" = logic gate, not physical gate |
| complex Hilbert space | complicated Hilbert space | "complex" = complex numbers, not difficulty |
Simple synonym replacement without context fails because many words have domain-specific meanings. Context-aware models (deep learning) perform better, but even trigram-level checking catches the worst errors.
Markov Models to PageRank¶
N-gram language models are Markov chains: the next state depends only on the previous N-1 states. This same principle underlies:
- PageRank: web pages as states, links as transitions. Stationary distribution = page importance
- TextRank: sentences as states, similarity as transition probability
- Hidden Markov Models (HMM): for speech recognition, POS tagging, biological sequence analysis
Historical Context¶
- 1900s: Andrei Markov applied chain models to vowel/consonant patterns in Russian text
- 1948: Claude Shannon used Markov models for text generation in his foundational information theory paper
- 2000s: HMMs dominated speech recognition until deep learning
- Modern: transformers replaced n-grams for most NLP tasks, but n-gram models remain useful for:
- Keyboard prediction (mobile)
- Spelling correction
- Language detection
- Low-resource environments
Gotchas¶
- Smoothing is essential. Unseen n-grams get zero probability, making entire sentence probability zero. Laplace smoothing (add-1), Kneser-Ney, or backoff are standard solutions.
- Character-level models are better for cipher decryption than word-level models. Word-level models have sparse coverage; character-level bigram/trigram models capture the structure of the language more robustly for this task.
- Article spinning IS detectable. Even simple bigram analysis flags unnatural word combinations. More sophisticated plagiarism tools use document similarity + perplexity analysis.
Cross-References¶
- nlp text processing - tokenization, TF-IDF, text preprocessing
- text summarization - TextRank uses Markov-chain-like scoring
- bayesian methods - probabilistic foundations
- monte carlo simulation - sampling from probability distributions