Text Summarization
Automatic generation of condensed versions of documents. There are two fundamental approaches: extractive (select existing sentences) and abstractive (generate new text). The extractive methods covered here are unsupervised: they require no training data and work from a single document.
Extractive vs Abstractive
| Approach | Method | Pros | Cons |
|---|---|---|---|
| Extractive | Select top-scoring sentences from document | Simple, no training data, faithful to source | Limited to existing sentences, may lack coherence |
| Abstractive | Generate new text expressing main ideas | More natural, can paraphrase | Requires seq2seq/transformer models, may hallucinate |
TF-IDF Sentence Scoring
The simplest extractive method: score sentences by the importance of their words.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import nltk

def summarize_tfidf(text, top_n=3):
    # Step 1: Split into sentences
    sentences = nltk.sent_tokenize(text)
    # Step 2: Build TF-IDF matrix (sentences = documents)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Step 3: Score each sentence = mean of non-zero TF-IDF values
    scores = []
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i].toarray().flatten()
        non_zero = row[row > 0]
        score = non_zero.mean() if len(non_zero) > 0 else 0
        scores.append(score)
    # Step 4: Select top sentences (maintain original order)
    ranked_idx = np.argsort(scores)[-top_n:]
    ranked_idx = sorted(ranked_idx)  # preserve document order
    return ' '.join([sentences[i] for i in ranked_idx])
Why mean of non-zero values, not sum?
- Sum biases toward longer sentences (more terms = higher sum).
- Mean divides by the number of terms present, which normalizes for sentence length.
- Non-zero only: the TF-IDF matrix is very sparse. Including zeros would dilute meaningful scores with zero-valued (absent) terms.
Why not mean of the whole vector?
- Every row has the same length (the vocabulary size), so the whole-vector mean is just the sum divided by a constant. It ranks sentences exactly as the sum does and inherits the same bias toward long sentences.
- We want the sentence with the most important words on average, not the one with the most words.
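To make the comparison concrete, here is a minimal sketch using two hand-made TF-IDF rows. The weights and the helper name `sentence_scores` are invented for illustration; they are not output from `TfidfVectorizer`.

```python
def sentence_scores(weights):
    """Return (sum, mean of non-zero values, mean of whole vector) for one row."""
    non_zero = [w for w in weights if w > 0]
    total = sum(weights)
    mean_non_zero = total / len(non_zero) if non_zero else 0.0
    mean_all = total / len(weights) if weights else 0.0
    return total, mean_non_zero, mean_all

# Hypothetical rows over a 6-term vocabulary (values made up for this example).
short_focused = [0.9, 0.8, 0.0, 0.0, 0.0, 0.0]  # two high-weight terms
long_diluted = [0.4, 0.4, 0.4, 0.4, 0.4, 0.0]   # five low-weight terms

a = sentence_scores(short_focused)
b = sentence_scores(long_diluted)

print("sum:           ", a[0], b[0])  # the longer row wins
print("mean non-zero: ", a[1], b[1])  # the focused row wins
print("mean all:      ", a[2], b[2])  # same ordering as the sum
```

The sum and the whole-vector mean both prefer the longer row, while the mean of non-zero values prefers the row with the heavier terms. One caveat when applying this to real output: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), which ties the size of individual weights to the number of terms in the sentence, so passing `norm=None` may be worth trying if short sentences dominate the ranking.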
TextRank (PageRank for Sentences)
Treats sentences as nodes in a graph. Edge weight = similarity between sentences. High-scoring sentences are those similar to many other important sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import nltk

def textrank_summarize(text, top_n=3, damping=0.85, iterations=50):
    sentences = nltk.sent_tokenize(text)
    n = len(sentences)
    # Build similarity matrix (sentences = documents)
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(sentences)
    sim_matrix = cosine_similarity(tfidf)
    # Drop self-similarity so a sentence cannot vote for itself
    np.fill_diagonal(sim_matrix, 0.0)
    # Normalize rows (transition probabilities)
    for i in range(n):
        row_sum = sim_matrix[i].sum()
        if row_sum > 0:
            sim_matrix[i] /= row_sum
    # Power iteration (PageRank), starting from a uniform distribution
    scores = np.ones(n) / n
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * sim_matrix.T @ scores
    # Top sentences in original order
    top_idx = sorted(np.argsort(scores)[-top_n:])
    return ' '.join([sentences[i] for i in top_idx])
TextRank is based on Google's PageRank: a sentence is important if it is similar to many other important sentences (recursive definition solved by eigenvector computation / power iteration).
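The recursive definition can be seen on a toy graph. The following is a minimal pure-Python sketch of power iteration with a convergence check; the function name `pagerank`, the tolerance, and the similarity values are assumptions made for this example.

```python
def pagerank(transition, damping=0.85, tol=1e-8, max_iter=200):
    """Power iteration on a row-stochastic matrix given as a list of lists."""
    n = len(transition)
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Each node collects score from the nodes that link to it.
        new_scores = [
            (1 - damping) / n
            + damping * sum(transition[j][i] * scores[j] for j in range(n))
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:  # stop once the scores no longer change
            break
    return scores

# Hypothetical similarities among three sentences (diagonal already zeroed).
similarity = [
    [0.0, 0.6, 0.2],
    [0.6, 0.0, 0.5],
    [0.2, 0.5, 0.0],
]
# Row-normalize so each row is a probability distribution.
transition = [[v / sum(row) for v in row] for row in similarity]

scores = pagerank(transition)
print(scores)
```

Sentence 1 is the most similar to the other two, so it should end up with the highest score. Stopping on a tolerance rather than a fixed iteration count is a design choice; a fixed count is simpler and usually sufficient for short documents.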
Selection Strategies
After scoring, there are several ways to choose which sentences appear in the summary:
| Strategy | Description | Use Case |
|---|---|---|
| Top-N sentences | Take N highest-scoring | Fixed-length summaries |
| Top-N words/chars | Take sentences until word/char budget met | Search engine snippets |
| Percentage | Top X% of sentences | Proportional to document length |
| Threshold | Sentences scoring above mean * factor | Variable length, adapts to content |
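
The four strategies can be sketched as small functions over a list of per-sentence scores. All function names and the sample data below are hypothetical; each function returns sentence indices in document order.

```python
def select_top_n(scores, n):
    """Indices of the n highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])

def select_by_word_budget(sentences, scores, max_words):
    """Add sentences by descending score while they fit in the word budget.

    Design choice in this sketch: a sentence that does not fit is skipped
    and lower-ranked, shorter sentences are still considered.
    """
    chosen, used = [], 0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    return sorted(chosen)

def select_by_percentage(scores, fraction):
    """Top fraction of sentences (at least one)."""
    n = max(1, round(len(scores) * fraction))
    return select_top_n(scores, n)

def select_by_threshold(scores, factor=1.0):
    """Sentences scoring above mean * factor."""
    if not scores:
        return []
    cutoff = factor * sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]

sentences = [
    "Alpha beta gamma delta.",
    "Short one.",
    "A much longer sentence with many more words in it.",
    "Final remark here.",
]
scores = [0.9, 0.2, 0.7, 0.4]

print(select_top_n(scores, 2))                      # [0, 2]
print(select_by_word_budget(sentences, scores, 8))  # [0, 3]
print(select_by_percentage(scores, 0.5))            # [0, 2]
print(select_by_threshold(scores, 1.0))             # [0, 2]
```

With the word budget, the 10-word sentence is skipped because it would exceed the limit, and the shorter fourth sentence is taken instead.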
Libraries
# sumy: multiple algorithms
# (`text` is assumed to hold the document as a plain string)
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()
# Returns Sentence objects; convert with str() to get plain text
summary = summarizer(parser.document, sentences_count=3)
summary_text = ' '.join(str(sentence) for sentence in summary)

# gensim (older versions only)
from gensim.summarization import summarize  # module removed in gensim 4.0
summary = summarize(text, word_count=100)
Gotchas
- Extractive summaries can be incoherent. Selected sentences may reference entities introduced in skipped sentences. Post-processing (pronoun resolution) can help but adds complexity.
- TF-IDF scoring requires enough sentences. With fewer than 5-6 sentences, the IDF component becomes unstable. For very short texts, simple sentence length + keyword overlap works better.
- TextRank can select redundant sentences. Two very similar sentences may both score high. Apply MMR (Maximal Marginal Relevance) to penalize selecting sentences too similar to already-selected ones.
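As an illustration of the MMR idea from the last point, here is a minimal greedy sketch. It assumes per-sentence relevance scores and a pairwise similarity matrix are already available; the function name `mmr_select`, the trade-off parameter `lam`, and the sample numbers are all made up for this example.

```python
def mmr_select(scores, similarity, top_n=3, lam=0.7):
    """Greedy MMR: balance relevance against similarity to already selected items.

    lam=1.0 ignores redundancy entirely; lower values penalize it more.
    """
    candidates = list(range(len(scores)))
    selected = []
    while candidates and len(selected) < top_n:
        def mmr(i):
            # Redundancy = highest similarity to anything already selected.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # document order

# Hypothetical data: sentences 0 and 1 are near-duplicates.
scores = [0.9, 0.85, 0.6, 0.3]
similarity = [
    [1.0, 0.95, 0.1, 0.2],
    [0.95, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.3],
    [0.2, 0.2, 0.3, 1.0],
]

print(mmr_select(scores, similarity, top_n=2))           # [0, 2]
print(mmr_select(scores, similarity, top_n=2, lam=1.0))  # [0, 1]
```

With the redundancy penalty active, the near-duplicate sentence 1 loses to the less relevant but more novel sentence 2. Setting `lam=1.0` reduces the selection to plain top-N by score.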
Cross-References
- nlp text processing - tokenization, TF-IDF fundamentals
- attention mechanisms - abstractive summarization uses attention
- rnn sequences - seq2seq abstractive models