Tokenization
Tokenization is the preprocessing step that converts a string of text into a sequence of integer token IDs, and the corresponding postprocessing step that decodes those IDs back into text. Every LLM has a fixed vocabulary and a paired tokenizer; using a mismatched tokenizer produces incorrect token counts and malformed inputs.
How it works
Most production LLMs use Byte Pair Encoding (BPE) or one of its variants. Training starts with individual bytes or characters and repeatedly merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. At inference time, the tokenizer applies the learned merge rules in the order they were learned, so the same text always yields the same tokens.
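The merge loop described above can be sketched in a few lines of Python. This is a toy, character-level illustration, not how any production tokenizer is implemented: the function names (train_bpe, apply_merge, encode) are hypothetical, there is no pre-tokenization or byte-level handling, and ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter

def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text` (toy sketch)."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so further merges gain nothing
            break
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges

def encode(text, merges):
    """Split new text by replaying the learned merges in training order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

merges = train_bpe("aaabdaaabac", 3)
print(merges)
print(encode("aaab", merges))
```

Because encode replays a fixed list of merges in a fixed order, the same input always produces the same split, which is the determinism the paragraph above refers to.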
Key facts
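- Algorithm: BPE (GPT models), WordPiece (BERT), and the unigram language model are the dominant approaches. SentencePiece (T5, Gemma) is a toolkit that implements both BPE and unigram rather than a separate algorithm.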
- Special tokens: Tokenizers include control tokens such as BOS (beginning of sequence), EOS (end of sequence), and PAD (padding) that mark structure for the model.
- Byte fallback: Byte-level and byte-fallback tokenizers can represent any Unicode character as a sequence of raw UTF-8 byte tokens, which prevents unknown-token failures.
- Tools: OpenAI’s tiktoken and Hugging Face’s tokenizers library are the standard choices for counting tokens before API calls.
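To make the byte-fallback idea concrete, here is a minimal sketch in Python. It assumes a toy scheme in which token IDs 0 to 255 are reserved for raw bytes and every vocabulary entry is a single character with an ID of 256 or higher. The function names and the vocabulary are hypothetical; real tokenizers lay out their ID space differently.

```python
def encode_with_byte_fallback(text, vocab):
    """Map each character to its vocab ID, or to one ID per UTF-8 byte.

    Assumption of this sketch: IDs 0..255 are reserved for raw bytes,
    and all IDs in `vocab` are >= 256.
    """
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            ids.extend(ch.encode("utf-8"))  # each byte value is its own token ID
    return ids

def decode_with_byte_fallback(ids, vocab):
    """Invert the encoding, reassembling runs of byte tokens into characters."""
    inverse = {i: ch for ch, i in vocab.items()}
    out = []
    pending = bytearray()  # byte tokens waiting to be decoded as UTF-8
    for i in ids:
        if i < 256:
            pending.append(i)
            continue
        if pending:
            out.append(pending.decode("utf-8"))
            pending.clear()
        out.append(inverse[i])
    if pending:
        out.append(pending.decode("utf-8"))
    return "".join(out)

toy_vocab = {"h": 256, "i": 257, " ": 258}  # hypothetical vocabulary
ids = encode_with_byte_fallback("hi \u00e9", toy_vocab)
print(ids)
print(decode_with_byte_fallback(ids, toy_vocab))
```

The character outside the vocabulary costs two tokens here instead of one, which illustrates why text in scripts that are poorly covered by a vocabulary tends to use more tokens.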
For builders
Builders need to count tokens before sending requests to avoid exceeding context limits and to estimate costs accurately. Tokenization also affects how text is split for retrieval-augmented generation (RAG) chunking: splitting by character count rather than token count produces chunks whose token length is unpredictable, so some chunks waste context space while others exceed the limit, and boundaries that cut mid-sentence can degrade retrieval quality.
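A token-based chunker can be sketched as follows. The function chunk_by_tokens and its parameters are hypothetical, and the encode and decode callables are placeholders for whichever tokenizer matches the target model. The usage example substitutes whitespace splitting as a stand-in so the snippet runs without external libraries, so its counts are not real model token counts.

```python
def chunk_by_tokens(text, encode, decode, max_tokens, overlap=0):
    """Split `text` into chunks of at most `max_tokens` tokens.

    `encode` maps text to a list of tokens and `decode` maps a list of
    tokens back to text. Consecutive chunks share `overlap` tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    ids = encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(decode(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):
            break  # the last window already reached the end of the text
    return chunks

# Stand-in tokenizer (an assumption for illustration): one token per word.
chunks = chunk_by_tokens(
    "tokens not characters decide how much text fits",
    encode=str.split,
    decode=" ".join,
    max_tokens=3,
    overlap=1,
)
for chunk in chunks:
    print(chunk)
```

Every chunk is guaranteed to fit the token budget, which a character-count splitter cannot promise. Aligning chunk boundaries to sentence ends is a separate concern that this sketch does not address.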