Article Issue #5167

Token (AI)

What to know

  • Token (AI) is the atomic unit of text processed by a language model.
  • Subword tokenization algorithms such as Byte Pair Encoding (BPE) build a vocabulary by iteratively merging the most frequent adjacent symbol pairs until a fixed vocabulary size is reached.
  • Token awareness matters for cost control and context window planning.

Token (AI) is the atomic unit of text processed by a language model. Rather than operating on characters or whole words, most models operate on subword pieces produced by a tokenizer, which is itself trained on a text corpus. A single English word may map to one token or be split into several, and a token often includes a leading space or attached punctuation. Strings that are frequent in the tokenizer's training data tend to get a token of their own.

How it works

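Subword tokenization algorithms such as Byte Pair Encoding (BPE) build a vocabulary by starting from individual characters or bytes and iteratively merging the most frequent adjacent pair of symbols until a fixed vocabulary size is reached. Libraries such as SentencePiece implement BPE along with alternatives like the unigram language model, which selects a vocabulary by a different procedure. At inference time the tokenizer maps text to token IDs; the model converts those IDs to vectors through an embedding layer, processes them, and outputs a probability distribution over the vocabulary at each generation step.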
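The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration of character-level BPE training, not the implementation used by any particular library: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are hypothetical, the corpus is split on whitespace only, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # ties: first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # learned merge rules, in order
print(words)   # corpus words re-expressed with merged symbols
```

In a real tokenizer the loop would run until the vocabulary reaches its target size rather than for a fixed number of merges, and production implementations usually operate on bytes and apply pre-tokenization rules that this sketch omits.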

Key facts

  • Average ratio: Roughly 1 token per 4 English characters, or about 75 words per 100 tokens (around 133 tokens per 100 words).
  • Vocabulary size: Many modern LLMs use vocabularies of roughly 32,000 to 128,000 tokens, and some are larger.
  • Non-English languages: Languages underrepresented in the tokenizer's training data typically need more tokens per word, which increases cost and fills the context window faster.
  • Pricing: API usage is typically priced per million tokens, often with separate rates for input and output.
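The rules of thumb above can be turned into a rough back-of-the-envelope estimator. The sketch below assumes the 4-characters-per-token ratio for English prose; the function names and the prices in the usage example are hypothetical placeholders, not any provider's actual rates.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English prose; other languages and code differ


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Rough token estimate from character count, rounded up."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Cost of one call, given per-million-token prices for input and output."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


prompt = "Summarize the following report in three bullet points. " * 20
tokens_in = estimate_tokens(prompt)
# Hypothetical prices: 2.00 USD per million input tokens, 6.00 USD per million output tokens.
print(tokens_in, estimate_cost_usd(tokens_in, 300, 2.0, 6.0))
```

An estimate like this is only useful for budgeting; exact counts require running the target model's own tokenizer.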

For builders

Token awareness matters for cost control and context window planning. Builders should profile token counts in staging before scaling, especially for multilingual apps or code-heavy prompts, where the number of tokens per character differs from English prose. Libraries such as tiktoken (for OpenAI models) let you count tokens locally before making an API call. Counts vary between tokenizers, so use the one that matches the target model.
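A minimal sketch of local token counting with tiktoken follows. It assumes the `cl100k_base` encoding, which may not match the model you are targeting, and the helper name `count_tokens` is hypothetical. If tiktoken is not installed, or its encoding data cannot be loaded, the sketch falls back to the 4-characters-per-token heuristic and reports which method it used.

```python
def count_tokens(text, encoding_name="cl100k_base"):
    """Return (token_count, method), where method is 'tiktoken' or 'heuristic'."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text)), "tiktoken"
    except Exception:
        # tiktoken missing or its encoding data unavailable (for example, offline):
        # fall back to roughly 4 characters per token, rounded up.
        return -(-len(text) // 4), "heuristic"


samples = {
    "english": "Tokenization splits text into subword pieces.",
    "code": "for i in range(10): print(i ** 2)",
}
for label, text in samples.items():
    n, method = count_tokens(text)
    print(f"{label}: {n} tokens ({method})")
```

Running a helper like this over representative prompts in staging gives a distribution of token counts per request, which is more useful for planning than a single average.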

Administrator · 41 published guides · Joined 2016
