How LLMs Convert Text to Tokens

2026-01-24
Tags: NLP, Tokenization

Tokenization Deep Dive — Complete Summary

Course: Stanford CS336 — Language Models From Scratch
Student: Gaurav
Date: January 2026


1. What is Tokenization?

Tokenization is the process of converting raw text into a sequence of integers (tokens) that a language model can process.

  • Encode: strings → tokens (integers)
  • Decode: tokens → strings
  • Vocabulary size: number of possible tokens

Why Do We Need It?

Language models work with numbers, not text. We need a systematic way to:

  1. Convert any text into a fixed vocabulary of integers
  2. Convert those integers back to text (losslessly)
  3. Keep sequences short enough for the model's context window

In Python, ord() and chr() map single characters to and from integers:

# Basic ASCII examples
print(f"'h' → {ord('h')}")  # 104
print(f"'i' → {ord('i')}")  # 105
print(f"104 → '{chr(104)}'")  # reverse operation
# Simple ASCII string to bytes
s = "hi"
print(list(s.encode("utf-8")))  # [104, 105]

2. UTF-8 — Handling the Whole World

ASCII only covers English characters. UTF-8 extends this to handle all Unicode characters using variable-length encoding:

  • ASCII characters (a-z, 0-9, etc.): 1 byte
  • European accents, Greek, Cyrillic: 2 bytes
  • Chinese, Japanese, Korean: 3 bytes
  • Emojis: 4 bytes
# UTF-8 uses multiple bytes for non-ASCII characters
print(f"'你好' → {list('你好'.encode('utf-8'))}")  # 6 bytes!
print(f"'🌍' → {list('🌍'.encode('utf-8'))}")      # 4 bytes!

3. Tokenization Approaches Compared

Approach    Vocab Size                Compression   Issues
Character   ~150K (all Unicode)       ~1.0          Large vocab, many rare chars
Byte        256                       1.0           Sequences far too long
Word        Huge (corpus-dependent)   Good          No fixed vocab; rare words → UNK
BPE         Configurable              Good          Best tradeoff ✓

The Core Trade-off

  • Small vocabulary → Long sequences (model struggles with context)
  • Large vocabulary → Rare tokens (model struggles to learn them)

BPE finds the sweet spot: automatically learns which character sequences are common and deserve their own token.


4. Byte Pair Encoding (BPE) — The Algorithm

Origin Story

  • 1994: Philip Gage invented BPE for data compression
  • 2015: Sennrich et al. adapted it for neural machine translation
  • 2019: GPT-2 popularized it for language models

🏛️ The Medieval Scribe Analogy

Imagine you're a medieval scribe copying books by hand. You notice you write "the" thousands of times. Wouldn't it be clever to invent a single symbol that means "the"? One stroke instead of three!

That's exactly what BPE does — but automatically, by looking at data.

BPE Training: Step-by-Step Walkthrough

Corpus: "the cat in the hat"


Step 1: Convert to Bytes

text = "the cat in the hat"
tokens = list(text.encode("utf-8"))
print(tokens)
# [116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]
#   t    h    e   _   c   a   t   _   i    n   _   t    h    e   _   h   a   t

We start with 18 tokens (one per byte).


Step 2: Count Adjacent Pairs

Go through the sequence and tally every adjacent pair:

Pair         Characters   Count
(116, 104)   "th"         2
(104, 101)   "he"         2
(101, 32)    "e "         2
(97, 116)    "at"         2
(32, 99)     " c"         1
...          ...          ...

Winner: (116, 104) = "th" (first pair with highest count)
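
The tally above can be reproduced in a few lines with collections.Counter (a sketch; first-seen order breaks ties, matching the walkthrough):

```python
from collections import Counter

# Byte tokens for "the cat in the hat"
tokens = [116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32,
          116, 104, 101, 32, 104, 97, 116]

counts = Counter(zip(tokens, tokens[1:]))
best_pair = max(counts, key=counts.get)  # first pair seen wins ties
print(best_pair, counts[best_pair])  # (116, 104) 2
```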


Step 3: Merge!

Create a new token (256) for "th" and replace all occurrences:

Before (18 tokens):

[116, 104, 101, 32, 99, 97, 116, 32, 105, 110, 32, 116, 104, 101, 32, 104, 97, 116]

After (16 tokens):

[256, 101, 32, 99, 97, 116, 32, 105, 110, 32, 256, 101, 32, 104, 97, 116]
 th   e   _   c   a   t   _   i    n   _   th   e   _   h   a   t

Step 4: Repeat!

Count pairs again. Now (256, 101) = "the" appears twice.

Create token 257 = "the":

After merge 2 (14 tokens):

[257, 32, 99, 97, 116, 32, 105, 110, 32, 257, 32, 104, 97, 116]
 the  _   c   a   t   _   i    n   _  the  _   h   a   t

We compressed 18 tokens → 14 tokens with just 2 merges!

Key Insight

The algorithm discovered that "the" is a common word and deserves its own token — no human told it about English words. It figured it out from pure statistics!


BPE Training Output

After training, we have:

  1. Vocabulary: mapping from token index → bytes

    • 0-255: individual bytes
    • 256: b"th"
    • 257: b"the"
    • ...
  2. Merges: ordered list of merge rules

    • Merge 1: (116, 104) → 256
    • Merge 2: (256, 101) → 257
    • ...

5. BPE Encoding (Using the Trained Tokenizer)

Given a new string, apply merges in order:

def encode(string, merges):
    tokens = list(string.encode("utf-8"))  # Start with bytes
    for (pair, new_token) in merges:
        tokens = replace_all(tokens, pair, new_token)
    return tokens
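
The snippet above leans on a replace_all helper the notes don't define. A minimal sketch that makes it runnable, assuming a single left-to-right pass per merge:

```python
def replace_all(tokens, pair, new_token):
    # Replace every occurrence of `pair` with `new_token` in one pass
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(string, merges):
    tokens = list(string.encode("utf-8"))  # Start with bytes
    for pair, new_token in merges:
        tokens = replace_all(tokens, pair, new_token)
    return tokens

# Merges learned in the section-4 walkthrough
merges = [((116, 104), 256), ((256, 101), 257)]
print(encode("the hat", merges))  # [257, 32, 104, 97, 116]
```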

6. BPE Decoding

Simply look up each token in the vocabulary and concatenate:

def decode(tokens, vocab):
    byte_strings = [vocab[t] for t in tokens]
    # errors="replace" guards against token sequences that split a multi-byte character
    return b"".join(byte_strings).decode("utf-8", errors="replace")
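
Using the toy vocabulary from the section-4 walkthrough (256 = b"th", 257 = b"the"), decoding inverts encoding:

```python
vocab = {i: bytes([i]) for i in range(256)}  # 0-255: single bytes
vocab[256] = b"th"
vocab[257] = b"the"

def decode(tokens, vocab):
    byte_strings = [vocab[t] for t in tokens]
    # errors="replace" guards against sequences that split a multi-byte character
    return b"".join(byte_strings).decode("utf-8", errors="replace")

print(decode([257, 32, 104, 97, 116], vocab))  # the hat
```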

7. GPT-2 Pre-tokenization

GPT-2 adds a pre-tokenization step using a regex to split text into chunks before BPE:

GPT2_REGEX = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

This ensures:

  • Words and their preceding space stay together (" hello")
  • Contractions are split sensibly ("I'll" → "I" + "'ll")
  • Punctuation is separate
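
The real pattern needs the third-party regex module for \p{L} and \p{N}. A simplified ASCII-only stand-in (hypothetical, for illustration only) shows the same splitting behavior with the standard library:

```python
import re

# ASCII-only stand-in for the GPT-2 pattern; [A-Za-z] and [0-9]
# approximate \p{L} and \p{N} from the `regex` module
SIMPLE_GPT2_REGEX = r"'(?:[sdmt]|ll|ve|re)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"

chunks = re.findall(SIMPLE_GPT2_REGEX, "I'll say hello, world!")
print(chunks)  # ['I', "'ll", ' say', ' hello', ',', ' world', '!']
```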

8. Real-World Tools

Library         Used By              Notes
tiktoken        OpenAI (GPT-2/3/4)   Fast Rust implementation
SentencePiece   Google, Meta         Supports BPE + Unigram
tokenizers      Hugging Face         Very flexible, fast
# Using tiktoken (GPT-2 tokenizer)
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello, world!")
print(f"Tokens: {tokens}")
print(f"Decoded: {enc.decode(tokens)}")

9. Limitations & Trade-offs

BPE Limitations

  1. Language bias: Trained on English-heavy corpora → English words get single tokens, Chinese characters may need multiple
  2. Numbers: Often split awkwardly ("123456" → multiple tokens)
  3. Spelling sensitivity: "hello" and "Hello" may have different tokenizations
  4. No semantic awareness: Purely statistical, doesn't understand meaning

The Tokenizer-Free Future?

Promising research on working directly with bytes:

  • ByT5 (2021)
  • MEGABYTE (2023)
  • Byte Latent Transformer (2024)

These haven't been scaled to frontier models yet, but may eliminate tokenization someday.


10. Quick Reference

Key Formulas

Compression ratio: $\frac{\text{num\_bytes}}{\text{num\_tokens}}$

Higher is better — means fewer tokens for same content.
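
For the section-4 walkthrough, the arithmetic works out to:

```python
text = "the cat in the hat"
num_bytes = len(text.encode("utf-8"))  # 18 bytes
num_tokens = 14                        # token count after the two merges above
ratio = num_bytes / num_tokens
print(round(ratio, 2))  # 1.29
```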

BPE Training Algorithm (Pseudocode)

tokens = list(corpus.encode("utf-8"))
vocab = {i: bytes([i]) for i in range(256)}  # 0-255: single bytes
merges = []

for i in range(num_merges):
    counts = count_adjacent_pairs(tokens)
    best_pair = argmax(counts)
    new_token = 256 + i
    vocab[new_token] = vocab[best_pair[0]] + vocab[best_pair[1]]
    merges.append((best_pair, new_token))
    tokens = replace_all(tokens, best_pair, new_token)
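
A runnable version of the pseudocode, assuming greedy first-seen tie-breaking (a sketch, not GPT-2's exact implementation):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    tokens = list(corpus.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    for i in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        best_pair = max(counts, key=counts.get)  # first-seen pair wins ties
        new_token = 256 + i
        vocab[new_token] = vocab[best_pair[0]] + vocab[best_pair[1]]
        merges.append((best_pair, new_token))
        # Replace every occurrence of best_pair with new_token
        out, j = [], 0
        while j < len(tokens):
            if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == best_pair:
                out.append(new_token)
                j += 2
            else:
                out.append(tokens[j])
                j += 1
        tokens = out
    return vocab, merges, tokens

vocab, merges, tokens = train_bpe("the cat in the hat", 2)
print(merges)       # [((116, 104), 256), ((256, 101), 257)]
print(vocab[257])   # b'the'
print(len(tokens))  # 14
```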

Typical Vocabulary Sizes

Model     Vocab Size
GPT-2     50,257
GPT-4     ~100,000
Llama 2   32,000
Llama 3   ~128,000

TL;DR

  1. Tokenization converts text → integers for LLMs
  2. BPE is the dominant approach: starts with bytes, iteratively merges common pairs
  3. Key insight: BPE automatically discovers meaningful units (words, subwords) from pure statistics
  4. Trade-off: vocabulary size vs. sequence length
  5. Future: Byte-level models may eliminate tokenization entirely