Skip to content

Tokenization

Tokenization converts raw text into sequences of integer token IDs that transformer models can process. The choice of tokenizer affects vocabulary size, sequence length, multilingual capability, and API cost.

Key Facts

  • A token is roughly 4 characters or 0.75 words in English
  • Non-English text (Russian, Chinese, Japanese) uses 2-4x more tokens per word - meaning higher API costs
  • Each LLM family has its own tokenizer - tokens from GPT differ from Llama's
  • Modern subword tokenizers virtually eliminate out-of-vocabulary (OOV) issues
  • Context window = maximum tokens the model processes (input + output combined)

Tokenization Algorithms

BPE (Byte Pair Encoding)

Iteratively merges the most frequent character/subword pairs. Simple frequency heuristic.

Used in: GPT family, RoBERTa, BART

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000)
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.encode("Hello world").tokens)

WordPiece

Optimizes corpus likelihood rather than simple frequency. Uses ## prefix for continuation subwords ("playing" -> "play" + "##ing").

Used in: BERT

SentencePiece

Treats spaces as characters (_ symbol). Language-agnostic - works on raw text without pre-tokenization. Supports BPE and Unigram modes.

Used in: T5, mBART, LLaMA

import sentencepiece as spm
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=tok_sp --vocab_size=1000'
)
sp = spm.SentencePieceProcessor(model_file='tok_sp.model')

Special Tokens

Token Purpose BERT GPT T5
[CLS] / \<s> Classification aggregate Yes No \<s>
[SEP] / \</s> Sentence separator Yes \n\n \</s>
[PAD] Batch padding Yes \<eos> \<pad>
[MASK] Masked language modeling Yes No \<extra_id_N>
\<unk> Unknown token Yes No (BPE covers all) Yes

Multimodal markers: <image>, <video>, <speech> used in Flamingo, LLaVA, Kosmos-1.

Token Counting

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello world")
print(len(tokens))  # exact token count

# Cost estimation
input_tokens = len(enc.encode(prompt))
cost = input_tokens * 2.50 / 1_000_000  # GPT-4o input price

Context Windows

Model Context Window
BERT 512 tokens
GPT-4 8K / 128K tokens
Claude 3 200K tokens
Gemini 1.5 Pro 1M tokens (up to 2M preview)
Llama 3 8K (extended variants)
Mistral Large 128K tokens

Extending Effective Context

  • Truncation: cut left/right context for local tasks
  • Sliding window + aggregation: process in windows, merge results
  • Map-reduce summarization: summarize chunks, merge summaries
  • RAG: index corpus, inject only relevant fragments
  • KV-cache reuse: cache static instructions separately from variable content

Gotchas

  • Token boundaries don't align with word boundaries - LLMs struggle with character-level tasks like string reversal
  • Same text tokenizes differently across models - always count with the specific model's tokenizer
  • Code and JSON are often more token-expensive than prose
  • System prompt, function schemas, and chat history all count toward context window
  • Output token limits are separate from context window (typically 4K-16K per response)
  • "Lost in the middle" problem: information in the middle of long contexts is less likely to be attended to

See Also