How LLMs Convert Text to Tokens
Tokenization Deep Dive — Complete Summary
Course: Stanford CS336 — Language Models From Scratch
Student: Gaurav
Date: January 2026
1. What is Tokenization?
Tokenization is the process of converting raw text into a sequence of integers (tokens) that a language model can process.
- Encode: strings → tokens (integers)
- Decode: tokens → strings
- Vocabulary size: number of possible tokens
Why Do We Need It?
Language models work with numbers, not text. We need a systematic way to:
- Convert any text into a fixed vocabulary of integers
- Convert those integers back to text (losslessly)
- Keep sequences short enough for the model's context window
```python
# Basic ASCII examples
print(f"'h' → {ord('h')}")    # 104
print(f"'i' → {ord('i')}")    # 105
print(f"104 → '{chr(104)}'")  # reverse operation

# Simple ASCII string to bytes
s = "hi"
print(list(s.encode("utf-8")))  # [104, 105]
```
2. UTF-8 — Handling the Whole World
ASCII covers only 128 characters (basic English letters, digits, and punctuation). UTF-8 extends this to all Unicode characters using a variable-length encoding:
- ASCII characters (a-z, 0-9, etc.): 1 byte
- European accents, Greek, Cyrillic: 2 bytes
- Chinese, Japanese, Korean: 3 bytes
- Emojis: 4 bytes
```python
# UTF-8 uses multiple bytes for non-ASCII characters
print(f"'你好' → {list('你好'.encode('utf-8'))}")  # 6 bytes (3 per character)!
print(f"'🌍' → {list('🌍'.encode('utf-8'))}")  # 4 bytes!
```
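The 1-to-4-byte pattern above can be checked directly (a quick sketch; the sample characters are arbitrary picks from each class):

```python
# One character from each UTF-8 byte-length class
for ch in ["a", "é", "你", "🌍"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s) → {list(encoded)}")
# 'a': 1 byte, 'é': 2 bytes, '你': 3 bytes, '🌍': 4 bytes
```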
3. Tokenization Approaches Compared
| Approach | Vocab Size | Compression | Issues |
|---|---|---|---|
| Character | ~150K Unicode chars | ~1.0 | Large vocab, many rare chars |
| Byte | 256 | 1.0 | Terrible — sequences too long |
| Word | Huge (corpus-dependent) | Good | No fixed vocab, rare words → UNK |
| BPE | Configurable | Good | ✓ Best tradeoff |
The Core Trade-off
- Small vocabulary → Long sequences (model struggles with context)
- Large vocabulary → Rare tokens (model struggles to learn them)
BPE finds the sweet spot: automatically learns which character sequences are common and deserve their own token.
4. Byte Pair Encoding (BPE) — The Algorithm
Origin Story
- 1994: Philip Gage invented BPE for data compression
- 2015: Sennrich+ adapted it for neural machine translation
- 2019: GPT-2 popularized it for language models
🏛️ The Medieval Scribe Analogy
Imagine you're a medieval scribe copying books by hand. You notice you write "the" thousands of times. Wouldn't it be clever to invent a single symbol that means "the"? One stroke instead of three!
That's exactly what BPE does — but automatically, by looking at data.
BPE Training: Step-by-Step Walkthrough
Corpus: "the cat in the hat"
Step 1: Convert to Bytes
```python
text = "the cat in the hat"
tokens = list(text.encode("utf-8"))
print(tokens)
# [116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]
#  t    h    e    _   c   a   t    _   i    n    _   t    h    e    _   h    a   t
```
We start with 18 tokens (one per byte).
Step 2: Count Adjacent Pairs
Go through the sequence and tally every adjacent pair:
| Pair | Characters | Count |
|---|---|---|
| (116, 104) | "th" | 2 |
| (104, 101) | "he" | 2 |
| (101, 32) | "e " | 2 |
| (97, 116) | "at" | 2 |
| (32, 99) | " c" | 1 |
| ... | ... | ... |
Winner: (116, 104) = "th" (first pair with highest count)
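The tally above can be reproduced in a few lines (a sketch using `collections.Counter`):

```python
from collections import Counter

tokens = list("the cat in the hat".encode("utf-8"))

# Count every adjacent pair; Counter preserves first-seen order,
# so ties keep the pair that appears earliest in the text
pair_counts = Counter(zip(tokens, tokens[1:]))
print(pair_counts.most_common(4))
```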
Step 3: Merge!
Create a new token (256) for "th" and replace all occurrences:
Before (18 tokens):
```
[116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]
```
After (16 tokens):
```
[256, 101, 32, 99, 97, 116, 32, 105, 110, 32, 256, 101, 32, 104, 97, 116]
 th   e    _   c   a   t    _   i    n    _   th   e    _   h    a   t
```
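The replacement step can be written as a small helper (a sketch; the name `merge` is chosen here for illustration):

```python
def merge(tokens, pair, new_token):
    """Replace every adjacent occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("the cat in the hat".encode("utf-8"))
tokens = merge(tokens, (116, 104), 256)  # "th" → 256
print(len(tokens))  # 16
```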
Step 4: Repeat!
Count pairs again. Now (256, 101) = "the" appears twice.
Create token 257 = "the":
After merge 2 (14 tokens):
```
[257, 32, 99, 97, 116, 32, 105, 110, 32, 257, 32, 104, 97, 116]
 the  _   c   a   t    _   i    n    _   the  _   h    a   t
```
We compressed 18 tokens → 14 tokens with just 2 merges!
Key Insight
The algorithm discovered that "the" is a common word and deserves its own token — no human told it about English words. It figured it out from pure statistics!
BPE Training Output
After training, we have:
- Vocabulary: mapping from token index → bytes
  - 0-255: individual bytes
  - 256: b"th"
  - 257: b"the"
  - ...
- Merges: ordered list of merge rules
  - Merge 1: (116, 104) → 256
  - Merge 2: (256, 101) → 257
  - ...
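Putting the loop together, a minimal trainer reproduces the walkthrough (a sketch; `train_bpe` is a name chosen here, and real implementations add pre-tokenization and careful tie-breaking):

```python
from collections import Counter

def train_bpe(text, num_merges):
    tokens = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    for i in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties → first-seen pair
        new_token = 256 + i
        vocab[new_token] = vocab[best[0]] + vocab[best[1]]
        merges.append((best, new_token))
        out, j = [], 0
        while j < len(tokens):  # replace every occurrence of `best`
            if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == best:
                out.append(new_token)
                j += 2
            else:
                out.append(tokens[j])
                j += 1
        tokens = out
    return vocab, merges, tokens

vocab, merges, tokens = train_bpe("the cat in the hat", 2)
print(vocab[256], vocab[257], len(tokens))  # b'th' b'the' 14
```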
5. BPE Encoding (Using the Trained Tokenizer)
Given a new string, apply merges in order:
```python
def encode(string, merges):
    tokens = list(string.encode("utf-8"))  # Start with raw bytes
    for (pair, new_token) in merges:
        # replace_all: replace every adjacent occurrence of `pair`
        # with `new_token` (the merge step from Section 4)
        tokens = replace_all(tokens, pair, new_token)
    return tokens
```
6. BPE Decoding
Simply look up each token in the vocabulary and concatenate:
```python
def decode(tokens, vocab):
    byte_strings = [vocab[t] for t in tokens]
    # Note: an arbitrary token sequence may not decode to valid UTF-8;
    # real implementations often pass errors="replace"
    return b"".join(byte_strings).decode("utf-8")
```
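With the vocabulary from the walkthrough, the round trip works (a small self-contained check):

```python
def decode(tokens, vocab):
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

vocab = {i: bytes([i]) for i in range(256)}
vocab[256] = b"th"   # merge 1 from the walkthrough
vocab[257] = b"the"  # merge 2

print(decode([257, 32, 104, 97, 116], vocab))  # the hat
```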
7. GPT-2 Pre-tokenization
GPT-2 adds a pre-tokenization step using a regex to split text into chunks before BPE:
```python
# Compiling this needs the third-party `regex` module
# (the stdlib `re` does not support \p{L} / \p{N})
GPT2_REGEX = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
```
This ensures:
- Words and their preceding space stay together (" hello")
- Contractions are split sensibly ("I'll" → "I" + "'ll")
- Punctuation is separate
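The full pattern needs the third-party `regex` module for `\p{L}`/`\p{N}`, but a simplified ASCII-only stand-in (a hypothetical approximation, not the real GPT-2 pattern) gives a feel for the splits:

```python
import re

# ASCII-only approximation: letters/digits replace \p{L}/\p{N}
PAT = re.compile(
    r"'(?:[sdmt]|ll|ve|re)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

print(PAT.findall("I'll say hello, world 123!"))
# ['I', "'ll", ' say', ' hello', ',', ' world', ' 123', '!']
```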
8. Real-World Tools
| Library | Used By | Notes |
|---|---|---|
| tiktoken | OpenAI (GPT-2/3/4) | Fast Rust implementation |
| SentencePiece | Google, Meta | Supports BPE + Unigram |
| tokenizers | HuggingFace | Very flexible, fast |
```python
# Using tiktoken (GPT-2 tokenizer)
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello, world!")
print(f"Tokens: {tokens}")
print(f"Decoded: {enc.decode(tokens)}")
```
9. Limitations & Trade-offs
BPE Limitations
- Language bias: Trained on English-heavy corpora → English words get single tokens, Chinese characters may need multiple
- Numbers: Often split awkwardly ("123456" → multiple tokens)
- Spelling sensitivity: "hello" and "Hello" may have different tokenizations
- No semantic awareness: Purely statistical, doesn't understand meaning
The Tokenizer-Free Future?
Promising research on working directly with bytes:
- ByT5 (2021)
- MEGABYTE (2023)
- Byte Latent Transformer (2024)
These haven't been scaled to frontier models yet, but may eliminate tokenization someday.
10. Quick Reference
Key Formulas
Compression ratio: $\frac{\text{num\_bytes}}{\text{num\_tokens}}$
Higher is better — means fewer tokens for same content.
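For the walkthrough corpus, the ratio works out as (a quick arithmetic check):

```python
text = "the cat in the hat"
num_bytes = len(text.encode("utf-8"))  # 18
num_tokens = 14                        # after the two merges in Section 4
print(num_bytes / num_tokens)          # ≈ 1.29 bytes per token
```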
BPE Training Algorithm (Pseudocode)
```
tokens = bytes(corpus)
vocab = {0: b'\x00', 1: b'\x01', ..., 255: b'\xff'}
merges = []
for i in range(num_merges):
    counts = count_adjacent_pairs(tokens)
    best_pair = argmax(counts)
    new_token = 256 + i
    vocab[new_token] = vocab[best_pair[0]] + vocab[best_pair[1]]
    merges.append((best_pair, new_token))
    tokens = replace_all(tokens, best_pair, new_token)
```
Typical Vocabulary Sizes
| Model | Vocab Size |
|---|---|
| GPT-2 | 50,257 |
| GPT-4 | ~100,000 |
| Llama 2 | 32,000 |
| Llama 3 | 128,000 |
TL;DR
- Tokenization converts text → integers for LLMs
- BPE is the dominant approach: starts with bytes, iteratively merges common pairs
- Key insight: BPE automatically discovers meaningful units (words, subwords) from pure statistics
- Trade-off: vocabulary size vs. sequence length
- Future: Byte-level models may eliminate tokenization entirely