Article Issue #5168

Tokenization

What to know

Tokenization is the preprocessing step that converts a string of text into a sequence of integer token IDs, and the corresponding postprocessing step that decodes those IDs back into text.
Most production LLMs use Byte Pair Encoding or its variants.
Builders need to count tokens before sending requests to avoid exceeding context limits and to estimate costs accurately.

Wikiwalls Team Administrator

May 15, 2026 2 min read


Tokenization is the preprocessing step that converts a string of text into a sequence of integer token IDs, and the corresponding postprocessing step that decodes those IDs back into text. Every LLM has a fixed vocabulary and a paired tokenizer. Because the model interprets each ID against its own vocabulary, using a mismatched tokenizer produces incorrect token counts and malformed inputs.

How it works

Most production LLMs use Byte Pair Encoding (BPE) or its variants. Training starts with individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. At inference time, the tokenizer applies the learned merge rules in the order they were learned, so the same text always splits into the same tokens.
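The training loop and the deterministic encoding step described above can be sketched in a few lines of Python. This is a toy, byte-level illustration and not the implementation of any particular library: the function names (`train_bpe`, `encode`, `decode`, `merge_pair`) and the choice to number new tokens from 256 upward are assumptions made for the example, and production tokenizers add pre-tokenization, special tokens, and much faster data structures.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id; insertion order is merge priority
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        if pairs[best] < 2:
            break  # nothing repeats, so further merges would not compress anything
        new_id = 256 + len(merges)  # assumed numbering: new tokens follow the bytes
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges


def encode(text, merges):
    """Apply the learned merges in training order; the result is deterministic."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand each token ID back to its bytes and decode the result as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = train_bpe("low lower lowest low low", num_merges=3)
ids = encode("lowest low", merges)
print(ids, decode(ids, merges))
```

Encoding followed by decoding returns the original string, and the encoded sequence is shorter than the raw byte sequence because frequent pairs such as "l" + "o" have been merged into single tokens.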

Key facts

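Algorithms: BPE (GPT models), WordPiece (BERT), and the unigram model are the dominant approaches. SentencePiece, used by T5 and Gemma, is a library that implements both BPE and unigram rather than a separate algorithm.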
Special tokens: Tokenizers include control tokens like BOS, EOS, and PAD that the model uses for structure.
Byte fallback: Byte-level and byte-fallback tokenizers can represent any Unicode character as raw byte tokens, preventing unknown-token failures.
Tools: OpenAI’s tiktoken and Hugging Face’s tokenizers library are standard for counting tokens before API calls.
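To make the special-token and byte-fallback facts above concrete, here is a minimal sketch. Everything in it is hypothetical and chosen for illustration: the three-word vocabulary, the ID layout (special tokens first, then 256 byte tokens, then words), and the function names do not correspond to any real model's tokenizer.

```python
import re

# Hypothetical ID layout: 0-2 are special tokens, 3-258 are the 256 raw bytes,
# and IDs from 259 upward are whole-piece vocabulary entries.
PAD, BOS, EOS = 0, 1, 2
BYTE_OFFSET = 3
WORDS = {"hello": 259, "world": 260, " ": 261}
ID_TO_WORD = {token_id: word for word, token_id in WORDS.items()}


def encode_with_fallback(text):
    """Wrap the text in BOS/EOS and fall back to byte tokens for unknown pieces."""
    ids = [BOS]
    # Split into lowercase words and single characters (a deliberately crude rule).
    for piece in re.findall(r"[a-z]+|.", text, flags=re.DOTALL):
        if piece in WORDS:
            ids.append(WORDS[piece])
        else:
            # Byte fallback: any piece outside the vocabulary becomes its UTF-8 bytes.
            ids.extend(BYTE_OFFSET + b for b in piece.encode("utf-8"))
    ids.append(EOS)
    return ids


def decode_with_fallback(ids):
    """Drop special tokens and rebuild the text from word and byte tokens."""
    buf = bytearray()
    for token_id in ids:
        if token_id in (PAD, BOS, EOS):
            continue
        if token_id in ID_TO_WORD:
            buf += ID_TO_WORD[token_id].encode("utf-8")
        else:
            buf.append(token_id - BYTE_OFFSET)
    return buf.decode("utf-8")


ids = encode_with_fallback("hello 🌍")
print(ids)
print(decode_with_fallback(ids))
```

The emoji is not in the toy vocabulary, so it is emitted as four byte tokens (its UTF-8 encoding) instead of an unknown-token placeholder, and the round trip still recovers the original string.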

For builders

Builders need to count tokens before sending requests to avoid exceeding context limits and to estimate costs accurately. Tokenization also matters for RAG chunking: chunks sized by character count have unpredictable token lengths, so they can waste context space or be cut mid-sentence in ways that degrade retrieval quality.
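The budgeting and chunking concerns above can be sketched as three small helpers. This is an illustration under stated assumptions, not a recommended implementation: the helper names are invented for the example, `price_per_million` is a placeholder you would replace with your provider's actual rate, the context check assumes the limit covers prompt and output tokens together, and the byte-level `byte_encode`/`byte_decode` pair is only a stand-in so the block runs without external dependencies.

```python
def fits_context(prompt_tokens, max_output_tokens, context_limit):
    """Assumes the context limit covers prompt and generated tokens together."""
    return prompt_tokens + max_output_tokens <= context_limit


def estimate_cost(n_tokens, price_per_million):
    """Cost for `n_tokens` at a hypothetical price per one million tokens."""
    return n_tokens / 1_000_000 * price_per_million


def chunk_by_tokens(text, encode, decode, max_tokens, overlap=0):
    """Split `text` into chunks of at most `max_tokens` tokens each.

    `encode` and `decode` must come from the same tokenizer as the target model.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    ids = encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(decode(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):
            break  # the last window already reached the end of the text
    return chunks


# Stand-in tokenizer for the demo: one token per UTF-8 byte.
def byte_encode(text):
    return list(text.encode("utf-8"))


def byte_decode(ids):
    return bytes(ids).decode("utf-8", errors="replace")


print(chunk_by_tokens("abcdefghij", byte_encode, byte_decode, max_tokens=4, overlap=1))
print(fits_context(prompt_tokens=900, max_output_tokens=200, context_limit=1000))
```

In practice you would pass the `encode` and `decode` methods of the tokenizer paired with your model (for example, a tiktoken encoding object) in place of the byte-level stand-in. Slicing on token boundaries can still cut a sentence or a multi-byte character in half, so a real pipeline might combine this with sentence-aware splitting.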


