Article Issue #5167

Token (AI)

What to know

A token is the atomic unit of text processed by a language model. Byte Pair Encoding (BPE) tokenizers, including the BPE mode of SentencePiece, build their vocabulary by iteratively merging the most frequent pair of adjacent symbols until a fixed vocabulary size is reached. Token awareness matters for cost control and context window planning.

Wikiwalls Team Administrator

May 15, 2026 · 2 min read


Token (AI) is the atomic unit of text processed by a language model. Rather than operating on characters or whole words, models operate on subword pieces produced by a tokenizer whose vocabulary is learned from a text corpus. A single English word may be one token, may be split into several tokens, or may share a token with adjacent whitespace or punctuation, depending on how often it appeared in that corpus.

How it works

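Tokenizers based on Byte Pair Encoding (BPE), including the BPE mode of SentencePiece, build their vocabulary by iteratively merging the most frequent pair of adjacent symbols, starting from single characters or bytes, until a fixed vocabulary size is reached. New text is then segmented by applying the learned merges. The model receives the resulting token IDs, maps them to vectors through an embedding layer, and outputs a probability distribution over the vocabulary at each generation step.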
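The merge loop described above can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not how any production tokenizer is implemented: it works on characters rather than bytes, skips pre-tokenization and special tokens, and the function names (train_bpe, merge_pair, encode) are made up for this example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Sequence[str], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: Sequence[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges from a list of words (toy BPE training)."""
    # Each word starts as a tuple of characters, weighted by how often it occurs.
    vocab = Counter(tuple(word) for word in words)
    merges: List[Pair] = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word: str, merges: Sequence[Pair]) -> List[str]:
    """Segment a word by applying the learned merges in training order."""
    symbols: Tuple[str, ...] = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Training two merges on the words low, low, lower, lowest, new, newer and then calling encode("lowly", merges) returns ['low', 'l', 'y']: the frequent stem has become a single token, while the rarer suffix is still split into characters.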

Key facts

Average ratio: Roughly 1 token per 4 English characters, or about 75 words per 100 tokens (around 133 tokens per 100 words).
Vocabulary size: Most modern LLMs use vocabularies of roughly 32,000 to 128,000 tokens, and some exceed 200,000.
Non-English languages: Languages underrepresented in the tokenizer's training data typically need more tokens per word, which raises cost and uses up the context window faster.
Pricing: API costs are typically quoted per million tokens, with separate rates for input and output tokens.
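The ratio and pricing facts above combine into a quick back-of-envelope estimate. The sketch below assumes the 4-characters-per-token rule of thumb for English text, and the prices passed in are placeholders for illustration, not any provider's real rates. The function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English rule of thumb)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one call given per-million-token prices for input and output."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical prices: 2.00 USD per million input tokens, 8.00 per million output.
prompt = "Summarize the following support ticket in two sentences. " * 20
cost = estimate_cost_usd(estimate_tokens(prompt), 150, 2.00, 8.00)
```

The character heuristic is only a starting point: code and non-English text often tokenize more densely, so an exact count from the model's own tokenizer is preferable before committing to a budget.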

For builders

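Token awareness matters for cost control and context window planning. Builders should profile token counts in staging before scaling, especially for multilingual apps or code-heavy prompts where token density differs from English prose. Libraries like tiktoken, which implements the encodings used by OpenAI models, let you count tokens locally before making an API call. Other model families ship their own tokenizers, so count with the one that matches the model you call.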
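A minimal profiling sketch along these lines is shown below. It assumes tiktoken is installed and that the cl100k_base encoding matches the target model, which may not be true for your setup. The helper names (tiktoken_encoder, profile_prompts, fits_context) are hypothetical, and the encoder is passed in as a plain function so a different tokenizer can be swapped in.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[int]]


def tiktoken_encoder(encoding_name: str = "cl100k_base") -> Encoder:
    """Return an encode function backed by tiktoken (third-party package)."""
    import tiktoken  # imported lazily so the other helpers work without it

    return tiktoken.get_encoding(encoding_name).encode


def profile_prompts(prompts: Dict[str, str], encode: Encoder) -> Dict[str, int]:
    """Map each named prompt to its token count under the given encoder."""
    return {name: len(encode(text)) for name, text in prompts.items()}


def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Check that the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= context_window
```

In a staging script this might be used as counts = profile_prompts({"english": "...", "german": "...", "code": "..."}, tiktoken_encoder()), followed by a fits_context check against the model's documented window. Note that tiktoken may need to download the encoding data the first time a given encoding is requested.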



