Token (AI)
A token is the atomic unit of text processed by a language model. Rather than operating on characters or whole words, most models operate on subword pieces produced by a tokenizer, which is itself trained on a text corpus before the model is. A single English word may be one token, may be split into several tokens, or may share a token with a leading space or adjacent punctuation, depending on how frequently that character sequence appeared in the tokenizer's training data.
How it works
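Subword tokenizers are typically built with algorithms such as Byte Pair Encoding (BPE), often through libraries such as SentencePiece. During tokenizer training, BPE starts from individual characters or bytes and iteratively merges the most frequent adjacent pair of symbols until a fixed vocabulary size is reached. At inference time, the learned merges are applied to segment input text into tokens, each of which maps to an integer ID. The model receives these token IDs, converts them to vectors in an embedding layer, processes them through its network layers, and outputs a probability distribution over the vocabulary at each generation step.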
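The merge procedure described above can be sketched in a few lines of Python. This is a toy, character-level illustration under simplifying assumptions (no byte fallback, no whitespace markers, a hand-made word list); the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical and do not correspond to any library's API.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn an ordered list of merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hand-made corpus: word frequencies drive which pairs get merged first.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))  # -> ['low', 'est']
```

Note that "lowest" never appears in the corpus, yet it is segmented into two pieces learned from other words. That is the practical appeal of subword vocabularies: unseen words decompose into known tokens instead of becoming an unknown symbol.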
Key facts
- Average ratio: Roughly 1 token per 4 English characters, or about 100 tokens per 75 words (around 1.3 tokens per word).
- Vocabulary size: Many modern LLMs use vocabularies of roughly 32,000 to 128,000 tokens, and some use larger ones.
- Non-English languages: Languages underrepresented in the tokenizer's training data typically need more tokens to express the same text, which increases cost and consumes more of the context window.
- Pricing: API costs are typically quoted per million tokens, often with separate rates for input and output tokens.
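As a rough illustration of how the character ratio and per-million pricing combine, the sketch below estimates a request's cost. The function names are hypothetical, and the prices in the example (3.00 and 15.00 USD per million tokens) are made-up placeholders, not any provider's actual rates.

```python
def estimate_tokens_from_chars(text, chars_per_token=4.0):
    """Rough token estimate using the ~4 characters per token rule of thumb.

    This is only a heuristic for English prose; a real tokenizer is needed
    for exact counts.
    """
    return round(len(text) / chars_per_token)


def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Cost of one request when prices are quoted per million tokens."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical request: 2,000 input tokens and 500 output tokens,
# at placeholder prices of $3.00 (input) and $15.00 (output) per million.
cost = estimate_cost_usd(2_000, 500, 3.00, 15.00)
print(f"Estimated cost per request: ${cost:.4f}")  # -> $0.0135
```

Multiplying the per-request figure by expected traffic gives a first-order budget; because output tokens are often priced higher than input tokens, capping response length can matter as much as trimming prompts.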
For builders
Token awareness matters for cost control and context window planning. Builders should profile token counts in staging before scaling, especially for multilingual apps or code-heavy prompts, where token density differs from English prose. Libraries such as tiktoken (for OpenAI models) let you count tokens locally before making an API call. Other model families use their own tokenizers, so counts from one tokenizer do not transfer exactly to another.
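A minimal sketch of local token counting with tiktoken follows. It assumes the `cl100k_base` encoding is appropriate for the target model, which should be checked against the provider's documentation. The helper names, the fallback heuristic, and the 8,000-token context window in the example are illustrative assumptions.

```python
def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens locally with tiktoken, falling back to a rough estimate.

    The fallback (about 4 characters per token, rounded up) is used when
    tiktoken is not installed, when its encoding data cannot be loaded, or
    when encoding fails (for example, on text containing special-token
    strings). Treat the fallback as an approximation only.
    """
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        return -(-len(text) // 4)  # ceiling division


def fits_context(prompt_tokens, max_output_tokens, context_window):
    """Check that the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= context_window


prompt = "Summarize the following support ticket in two sentences."
n = count_tokens(prompt)
# 8_000 is a placeholder context size, not a specific model's limit.
print(n, fits_context(n, max_output_tokens=500, context_window=8_000))
```

Running a check like this over a sample of real prompts in staging, including non-English and code-heavy ones, shows how far actual token counts drift from the English-prose rule of thumb.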