Tokenization¶

Tokenization converts raw text into sequences of integer token IDs that transformer models can process. The choice of tokenizer affects vocabulary size, sequence length, multilingual capability, and API cost.

Key Facts¶

A token is roughly 4 characters or 0.75 words in English
Non-English text (Russian, Chinese, Japanese) uses 2-4x more tokens per word - meaning higher API costs
Each LLM family has its own tokenizer - tokens from GPT differ from Llama's
Modern subword tokenizers virtually eliminate out-of-vocabulary (OOV) issues
Context window = maximum tokens the model processes (input + output combined)

Tokenization Algorithms¶

BPE (Byte Pair Encoding)¶

Iteratively merges the most frequent character/subword pairs. Simple frequency heuristic.

Used in: GPT family, RoBERTa, BART

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000)
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.encode("Hello world").tokens)

WordPiece¶

Optimizes corpus likelihood rather than simple frequency. Uses ## prefix for continuation subwords ("playing" -> "play" + "##ing").

Used in: BERT

SentencePiece¶

Treats spaces as characters (_ symbol). Language-agnostic - works on raw text without pre-tokenization. Supports BPE and Unigram modes.

Used in: T5, mBART, LLaMA

import sentencepiece as spm
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=tok_sp --vocab_size=1000'
)
sp = spm.SentencePieceProcessor(model_file='tok_sp.model')

Special Tokens¶

Token	Purpose	BERT	GPT	T5
[CLS] / \<s>	Classification aggregate	Yes	No	\<s>
[SEP] / \</s>	Sentence separator	Yes	`\n\n`	\</s>
[PAD]	Batch padding	Yes	\<eos>	\<pad>
[MASK]	Masked language modeling	Yes	No	\<extra_id_N>
\<unk>	Unknown token	Yes	No (BPE covers all)	Yes

Multimodal markers: <image>, <video>, <speech> used in Flamingo, LLaVA, Kosmos-1.

Token Counting¶

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello world")
print(len(tokens))  # exact token count

# Cost estimation
input_tokens = len(enc.encode(prompt))
cost = input_tokens * 2.50 / 1_000_000  # GPT-4o input price

Context Windows¶

Model	Context Window
BERT	512 tokens
GPT-4	8K / 128K tokens
Claude 3	200K tokens
Gemini 1.5 Pro	1M tokens (up to 2M preview)
Llama 3	8K (extended variants)
Mistral Large	128K tokens

Extending Effective Context¶

Truncation: cut left/right context for local tasks
Sliding window + aggregation: process in windows, merge results
Map-reduce summarization: summarize chunks, merge summaries
RAG: index corpus, inject only relevant fragments
KV-cache reuse: cache static instructions separately from variable content

Gotchas¶

Token boundaries don't align with word boundaries - LLMs struggle with character-level tasks like string reversal
Same text tokenizes differently across models - always count with the specific model's tokenizer
Code and JSON are often more token-expensive than prose
System prompt, function schemas, and chat history all count toward context window
Output token limits are separate from context window (typically 4K-16K per response)
"Lost in the middle" problem: information in the middle of long contexts is less likely to be attended to