Article Issue #5168

Tokenization

What to know

  • Tokenization is the preprocessing step that converts a string of text into a sequence of integer token IDs, and the corresponding postprocessing step that decodes those IDs back into text.
  • Most production LLMs use Byte Pair Encoding or its variants.
  • Builders need to count tokens before sending requests to avoid exceeding context limits and to estimate costs accurately.

Tokenization is the preprocessing step that converts a string of text into a sequence of integer token IDs, and the corresponding postprocessing step that decodes those IDs back into text. Every LLM has a fixed vocabulary and a paired tokenizer; using a mismatched tokenizer produces incorrect token counts and malformed inputs.
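The round trip and the mismatch problem can be illustrated with a toy example. This is only a sketch: the two vocabularies and the greedy longest-match strategy are assumptions made for illustration, and real tokenizers use learned vocabularies with tens of thousands of entries.

```python
# Two hypothetical vocabularies with the same pieces in a different order.
# The list index of each piece is its token ID; ID 0 is reserved for unknowns.
VOCAB_A = ["<unk>", "token", "ization", " is", " fun"]
VOCAB_B = ["<unk>", " fun", " is", "ization", "token"]


def toy_encode(text: str, vocab: list[str]) -> list[int]:
    """Greedy longest-match encoding (illustrative only, not BPE)."""
    pieces = sorted(vocab[1:], key=len, reverse=True)
    ids = []
    i = 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                ids.append(vocab.index(piece))
                i += len(piece)
                break
        else:
            ids.append(0)  # no piece matches: emit the unknown token
            i += 1
    return ids


def toy_decode(ids: list[int], vocab: list[str]) -> str:
    """Map each ID back to its piece and concatenate."""
    return "".join(vocab[i] for i in ids)


ids = toy_encode("tokenization is fun", VOCAB_A)
matched = toy_decode(ids, VOCAB_A)     # same vocabulary: text is recovered
mismatched = toy_decode(ids, VOCAB_B)  # wrong vocabulary: scrambled text
```

Decoding with the paired vocabulary restores the input, while the same IDs read through the other vocabulary produce different text, which is the failure the paragraph above warns about.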

How it works

Most production LLMs use Byte Pair Encoding (BPE) or one of its variants. Training starts with individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. At inference time, the tokenizer applies the learned merge rules in the order they were learned, so the same text always splits into the same tokens.
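The merge loop can be sketched in a few lines of Python. This is a minimal byte-level illustration, not any production tokenizer's implementation: the function names, the tie-breaking rule (first pair seen wins), and the choice to stop once no pair occurs at least twice are assumptions of this sketch.

```python
from collections import Counter


def apply_merge(tokens: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merge rules, starting from single bytes."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def bpe_encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Split new text by replaying the merge rules in learned order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


merges = train_bpe("aaabdaaabac", num_merges=10)
pieces = bpe_encode("aaab", merges)
```

A real tokenizer would also map each merged byte string to an integer ID and usually pre-splits text (for example on whitespace) before merging; both steps are left out here to keep the core loop visible.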

Key facts

  • Algorithms: Byte-level BPE (GPT models) and WordPiece (BERT) are the dominant algorithms. SentencePiece, used by T5 and Gemma, is a library that implements BPE and Unigram tokenization rather than a separate algorithm.
  • Special tokens: Tokenizers include control tokens such as BOS (beginning of sequence), EOS (end of sequence), and PAD (padding) that the model uses for structure.
  • Byte fallback: Many modern tokenizers can represent any Unicode character as a sequence of raw UTF-8 byte tokens, which prevents unknown-token failures.
  • Tools: OpenAI’s tiktoken and Hugging Face’s tokenizers library are widely used for counting tokens before API calls.
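Byte fallback can be illustrated with a small sketch. The ID layout (0 to 255 reserved for raw bytes, learned pieces numbered from 256) and the two-entry `PIECES` list are assumptions made for this example, not the layout of any particular tokenizer.

```python
BYTE_IDS = 256                # IDs 0..255 stand for raw bytes (assumed layout)
PIECES = ["hello", " world"]  # hypothetical learned pieces: IDs 256 and 257


def encode_with_fallback(text: str) -> list[int]:
    """Use a learned piece when one matches, otherwise emit raw UTF-8 bytes."""
    ids = []
    i = 0
    while i < len(text):
        for offset, piece in enumerate(PIECES):
            if text.startswith(piece, i):
                ids.append(BYTE_IDS + offset)
                i += len(piece)
                break
        else:
            # No piece matches: fall back to the bytes of this one character.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


def decode_with_fallback(ids: list[int]) -> str:
    """Rebuild the byte string from piece IDs and byte IDs, then decode it."""
    out = bytearray()
    for token_id in ids:
        if token_id < BYTE_IDS:
            out.append(token_id)
        else:
            out.extend(PIECES[token_id - BYTE_IDS].encode("utf-8"))
    return out.decode("utf-8")


sample = "hello world \U0001F642"  # ends with an emoji not in PIECES
sample_ids = encode_with_fallback(sample)
```

Because every string has a UTF-8 byte representation, the encoder never needs an unknown token: the emoji simply costs four byte tokens instead of one.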

For builders

Builders need to count tokens before sending requests to avoid exceeding context limits and to estimate costs accurately. Tokenization also affects how text is split for RAG chunking: because the number of tokens per character varies with the content, chunks sized by character count can overshoot or underuse the token budget, and their boundaries can fall mid-sentence in ways that degrade retrieval quality.
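A minimal sketch of these checks follows. The helper names, the whitespace stand-in tokenizer, and the price used in the usage note are all hypothetical; in practice the `tokenize` and `detokenize` callables would wrap the encode and decode methods of the tokenizer paired with the target model.

```python
from typing import Callable, Sequence


def fits_context(prompt_tokens: int, max_output_tokens: int, context_limit: int) -> bool:
    """True if the prompt plus the reserved output budget fits the context window."""
    return prompt_tokens + max_output_tokens <= context_limit


def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Estimated cost in USD for `tokens` at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], Sequence] = str.split,       # stand-in tokenizer
    detokenize: Callable[[Sequence], str] = " ".join,      # stand-in decoder
) -> list[str]:
    """Split text into chunks of at most `max_tokens` tokens each."""
    tokens = tokenize(text)
    return [
        detokenize(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


chunks = chunk_by_tokens("a b c d e", max_tokens=2)
```

For example, `fits_context(900, 200, 1000)` is false because the reserved output pushes the request past the limit, and `estimate_cost(500_000, 3.0)` gives 1.5 under an assumed price of 3 USD per million tokens. This sketch cuts at fixed token offsets; a retrieval pipeline would typically also prefer sentence or paragraph boundaries within the token budget.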

Administrator · 41 published guides · Joined 2016
