How LLMs Convert Text to Tokens

2026-01-24
Tags: NLP, Tokenization

Tokenization Deep Dive — Complete Summary

Course: Stanford CS336 — Language Models From Scratch
Student: Gaurav
Date: January 2026


1. What is Tokenization?

Tokenization is the process of converting raw text into a sequence of integers (tokens) that a language model can process.

  • Encode: strings → tokens (integers)
  • Decode: tokens → strings
  • Vocabulary size: number of possible tokens

Why Do We Need It?

Language models work with numbers, not text. We need a systematic way to:

  1. Convert any text into a fixed vocabulary of integers
  2. Convert those integers back to text (losslessly)
  3. Keep sequences short enough for the model's context window

In Python, ord() and chr() map single characters to and from integers:

# Basic ASCII examples
print(f"'h' → {ord('h')}")  # 104
print(f"'i' → {ord('i')}")  # 105
print(f"104 → '{chr(104)}'")  # reverse operation
# Simple ASCII string to bytes
s = "hi"
print(list(s.encode("utf-8")))  # [104, 105]

2. UTF-8 — Handling the Whole World

ASCII only covers English characters. UTF-8 extends this to handle all Unicode characters using variable-length encoding:

  • ASCII characters (a-z, 0-9, etc.): 1 byte
  • European accents, Greek, Cyrillic: 2 bytes
  • Chinese, Japanese, Korean: 3 bytes
  • Emojis: 4 bytes
# UTF-8 uses multiple bytes for non-ASCII characters
print(f"'你好' → {list('你好'.encode('utf-8'))}")  # 6 bytes!
print(f"'🌍' → {list('🌍'.encode('utf-8'))}")      # 4 bytes!

3. Tokenization Approaches Compared

Approach    Vocab Size                Compression   Issues
Character   ~150K (all Unicode)       ~1.0          Large vocab, many rare chars
Byte        256                       1.0           Sequences far too long
Word        Huge (corpus-dependent)   Good          No fixed vocab; rare words → UNK
BPE         Configurable              Good          Best tradeoff ✓

The Core Trade-off

  • Small vocabulary → Long sequences (model struggles with context)
  • Large vocabulary → Rare tokens (model struggles to learn them)

BPE finds the sweet spot: automatically learns which character sequences are common and deserve their own token.


4. Byte Pair Encoding (BPE) — The Algorithm

Origin Story

  • 1994: Philip Gage invented BPE for data compression
  • 2015: Sennrich et al. adapted it for neural machine translation
  • 2019: GPT-2 popularized it for language models

🏛️ The Medieval Scribe Analogy

Imagine you're a medieval scribe copying books by hand. You notice you write "the" thousands of times. Wouldn't it be clever to invent a single symbol that means "the"? One stroke instead of three!

That's exactly what BPE does — but automatically, by looking at data.

BPE Training: Step-by-Step Walkthrough

Corpus: "the cat in the hat"


Step 1: Convert to Bytes

text = "the cat in the hat"
tokens = list(text.encode("utf-8"))
print(tokens)
# [116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]
#   t    h    e   _   c   a   t   _   i    n   _   t    h    e   _   h   a   t

We start with 18 tokens (one per byte).


Step 2: Count Adjacent Pairs

Go through the sequence and tally every adjacent pair:

Pair         Characters   Count
(116, 104)   "th"         2
(104, 101)   "he"         2
(101, 32)    "e "         2
(97, 116)    "at"         2
(32, 99)     " c"         1
...          ...          ...

Winner: (116, 104) = "th" (first pair with highest count)
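
The tally above can be reproduced in a few lines with collections.Counter (a sketch; first-seen order breaks ties, matching the walkthrough):

```python
from collections import Counter

# Byte tokens for "the cat in the hat"
tokens = [116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32,
          116, 104, 101, 32, 104, 97, 116]

counts = Counter(zip(tokens, tokens[1:]))
best_pair = max(counts, key=counts.get)  # first pair seen wins ties
print(best_pair, counts[best_pair])  # (116, 104) 2
```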


Step 3: Merge!

Create a new token (256) for "th" and replace all occurrences:

Before (18 tokens):

[116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]

After (16 tokens):

[256, 101, 32, 99, 97, 116, 32, 105, 110, 32, 256, 101, 32, 104, 97, 116]
 th   e   _   c   a   t   _   i    n   _   th   e   _   h   a   t

Step 4: Repeat!

Count pairs again. Now (256, 101) = "the" appears twice.

Create token 257 = "the":

After merge 2 (14 tokens):

[257, 32, 99, 97, 116, 32, 105, 110, 32, 257, 32, 104, 97, 116]
 the  _   c   a   t   _   i    n   _  the  _   h   a   t

We compressed 18 tokens → 14 tokens with just 2 merges!

Key Insight

The algorithm discovered that "the" is a common word and deserves its own token — no human told it about English words. It figured it out from pure statistics!


BPE Training Output

After training, we have:

  1. Vocabulary: mapping from token index → bytes

    • 0-255: individual bytes
    • 256: b"th"
    • 257: b"the"
    • ...
  2. Merges: ordered list of merge rules

    • Merge 1: (116, 104) → 256
    • Merge 2: (256, 101) → 257
    • ...

5. BPE Encoding (Using the Trained Tokenizer)

Given a new string, apply merges in order:

def encode(string, merges):
    tokens = list(string.encode("utf-8"))  # Start with bytes
    for (pair, new_token) in merges:
        tokens = replace_all(tokens, pair, new_token)
    return tokens
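
The snippet above leans on a replace_all helper the notes don't define. A minimal sketch that makes it runnable, assuming a single left-to-right pass per merge:

```python
def replace_all(tokens, pair, new_token):
    # Replace every occurrence of `pair` with `new_token` in one pass
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(string, merges):
    tokens = list(string.encode("utf-8"))  # Start with bytes
    for pair, new_token in merges:
        tokens = replace_all(tokens, pair, new_token)
    return tokens

# Merges learned in the section-4 walkthrough
merges = [((116, 104), 256), ((256, 101), 257)]
print(encode("the hat", merges))  # [257, 32, 104, 97, 116]
```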

6. BPE Decoding

Simply look up each token in the vocabulary and concatenate:

def decode(tokens, vocab):
    byte_strings = [vocab[t] for t in tokens]
    # errors="replace" guards against token sequences that split a multi-byte character
    return b"".join(byte_strings).decode("utf-8", errors="replace")
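
Using the toy vocabulary from the section-4 walkthrough (256 = b"th", 257 = b"the"), decoding inverts encoding:

```python
vocab = {i: bytes([i]) for i in range(256)}  # 0-255: single bytes
vocab[256] = b"th"
vocab[257] = b"the"

def decode(tokens, vocab):
    byte_strings = [vocab[t] for t in tokens]
    # errors="replace" guards against sequences that split a multi-byte character
    return b"".join(byte_strings).decode("utf-8", errors="replace")

print(decode([257, 32, 104, 97, 116], vocab))  # the hat
```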

7. GPT-2 Pre-tokenization

GPT-2 adds a pre-tokenization step using a regex to split text into chunks before BPE:

GPT2_REGEX = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

This ensures:

  • Words and their preceding space stay together (" hello")
  • Contractions are split sensibly ("I'll" → "I" + "'ll")
  • Punctuation is separate
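
The real pattern needs the third-party regex module for \p{L} and \p{N}. A simplified ASCII-only stand-in (hypothetical, for illustration only) shows the same splitting behavior with the standard library:

```python
import re

# ASCII-only stand-in for the GPT-2 pattern; [A-Za-z] and [0-9]
# approximate \p{L} and \p{N} from the `regex` module
SIMPLE_GPT2_REGEX = r"'(?:[sdmt]|ll|ve|re)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"

chunks = re.findall(SIMPLE_GPT2_REGEX, "I'll say hello, world!")
print(chunks)  # ['I', "'ll", ' say', ' hello', ',', ' world', '!']
```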

8. Real-World Tools

Library         Used By              Notes
tiktoken        OpenAI (GPT-2/3/4)   Fast Rust implementation
SentencePiece   Google, Meta         Supports BPE + Unigram
tokenizers      Hugging Face         Very flexible, fast
# Using tiktoken (GPT-2 tokenizer)
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello, world!")
print(f"Tokens: {tokens}")
print(f"Decoded: {enc.decode(tokens)}")

9. Limitations & Trade-offs

BPE Limitations

  1. Language bias: Trained on English-heavy corpora → English words get single tokens, Chinese characters may need multiple
  2. Numbers: Often split awkwardly ("123456" → multiple tokens)
  3. Spelling sensitivity: "hello" and "Hello" may have different tokenizations
  4. No semantic awareness: Purely statistical, doesn't understand meaning

The Tokenizer-Free Future?

Promising research on working directly with bytes:

  • ByT5 (2021)
  • MEGABYTE (2023)
  • Byte Latent Transformer (2024)

These haven't been scaled to frontier models yet, but may eliminate tokenization someday.


10. Quick Reference

Key Formulas

Compression ratio: $\frac{\text{num\_bytes}}{\text{num\_tokens}}$

Higher is better — means fewer tokens for same content.
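
For the section-4 walkthrough, the arithmetic works out to:

```python
text = "the cat in the hat"
num_bytes = len(text.encode("utf-8"))  # 18 bytes
num_tokens = 14                        # token count after the two merges above
ratio = num_bytes / num_tokens
print(round(ratio, 2))  # 1.29
```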

BPE Training Algorithm (Pseudocode)

tokens = list(corpus.encode("utf-8"))
vocab = {i: bytes([i]) for i in range(256)}  # 0-255: single bytes
merges = []

for i in range(num_merges):
    counts = count_adjacent_pairs(tokens)
    best_pair = argmax(counts)
    new_token = 256 + i
    vocab[new_token] = vocab[best_pair[0]] + vocab[best_pair[1]]
    merges.append((best_pair, new_token))
    tokens = replace_all(tokens, best_pair, new_token)
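
A runnable version of the pseudocode, assuming greedy first-seen tie-breaking (a sketch, not GPT-2's exact implementation):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    tokens = list(corpus.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    for i in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        best_pair = max(counts, key=counts.get)  # first-seen pair wins ties
        new_token = 256 + i
        vocab[new_token] = vocab[best_pair[0]] + vocab[best_pair[1]]
        merges.append((best_pair, new_token))
        # Replace every occurrence of best_pair with new_token
        out, j = [], 0
        while j < len(tokens):
            if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == best_pair:
                out.append(new_token)
                j += 2
            else:
                out.append(tokens[j])
                j += 1
        tokens = out
    return vocab, merges, tokens

vocab, merges, tokens = train_bpe("the cat in the hat", 2)
print(merges)       # [((116, 104), 256), ((256, 101), 257)]
print(vocab[257])   # b'the'
print(len(tokens))  # 14
```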

Typical Vocabulary Sizes

Model     Vocab Size
GPT-2     50,257
GPT-4     ~100,000
Llama 2   32,000
Llama 3   ~128,000

TL;DR

  1. Tokenization converts text → integers for LLMs
  2. BPE is the dominant approach: starts with bytes, iteratively merges common pairs
  3. Key insight: BPE automatically discovers meaningful units (words, subwords) from pure statistics
  4. Trade-off: vocabulary size vs. sequence length
  5. Future: Byte-level models may eliminate tokenization entirely