AI Inference
AI inference is the computational step where a trained model processes an input and produces an output, as opposed to training, which adjusts model weights. For LLMs, inference feeds a tokenized prompt through the model’s layers in a forward pass that yields a probability distribution over the next token. A token is sampled from that distribution and appended to the context, and the process repeats autoregressively until an end condition is met, such as an end-of-sequence token or a maximum output length.
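The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not any particular model's implementation: the logits are made-up numbers standing in for a model's output over a four-token vocabulary, and the `temperature` parameter is an assumption added here to show one common way of reshaping the distribution before sampling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token id according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits over a 4-token vocabulary (not real model output).
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

# Greedy decoding picks the most likely token; sampling draws at random.
greedy = max(range(len(probs)), key=probs.__getitem__)
sampled = sample_token(logits, temperature=0.8, rng=random.Random(0))
```

Greedy selection always returns the highest-probability token, while sampling can return any token in proportion to its probability. A real generation loop would repeat this step once per output token.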
How it works
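During autoregressive inference, the model generates one token at a time. Each new token is appended to the context and the forward pass runs again. A KV (key-value) cache stores the attention keys and values already computed for the existing context, so each decoding step only has to process the newest token instead of recomputing the whole sequence. The prompt is processed in one parallel pass (often called prefill), while output tokens must be produced sequentially, so input length and output length affect latency and throughput differently.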
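The effect of the cache can be illustrated with a toy single-head attention layer in plain Python. Everything here is an assumption for illustration: the 2-dimensional embeddings, the weight matrices `W_Q`, `W_K`, `W_V`, and the `KVCache` class are hypothetical and far simpler than a real transformer. The sketch only shows that caching keys and values gives the same attention output as recomputing them, with fewer projections.

```python
import math

def project(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * matrix[i][j] for i, v in enumerate(vec))
            for j in range(len(matrix[0]))]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Hypothetical toy projection weights (illustrative values only).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[0.3, 0.9], [0.8, 0.4]]

class KVCache:
    """Keeps keys and values of earlier positions so they are computed once."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of positions projected to K/V

    def step(self, x):
        """Process one new token embedding, reusing cached keys and values."""
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        self.projections += 1
        return attend(project(x, W_Q), self.keys, self.values)

def recompute(xs):
    """Baseline without a cache: re-project every position at every step."""
    keys = [project(x, W_K) for x in xs]
    values = [project(x, W_V) for x in xs]
    return attend(project(xs[-1], W_Q), keys, values), len(xs)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = KVCache()
uncached_work = 0
for t in range(1, len(embeddings) + 1):
    cached_out = cache.step(embeddings[t - 1])
    uncached_out, work = recompute(embeddings[:t])
    uncached_work += work
```

For four tokens, the cached path projects each position once, while the uncached baseline projects 1 + 2 + 3 + 4 positions across the four steps. The gap grows quadratically with sequence length in this toy setup.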
Key facts
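- Latency components: Time-to-first-token (TTFT) and tokens-per-second (TPS) are the two primary latency metrics. TTFT measures the delay before the first output token arrives, and TPS measures the rate at which tokens are generated after that.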
- Hardware: GPUs and custom accelerators (TPUs, Trainium) dominate inference workloads due to matrix multiplication throughput.
- Batching: Running multiple requests simultaneously on shared hardware improves GPU utilization and reduces per-token cost.
- Quantization: Reducing weight precision from FP16 to INT8 or INT4 lowers memory requirements and can speed up inference, typically with a small quality loss that depends on the model and the quantization method.
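The memory effect of quantization can be made concrete with a short calculation, followed by a toy round trip through INT8. This is a sketch under stated assumptions: the 7-billion-parameter model size is hypothetical, the footprint counts weights only (no KV cache or activations), and the symmetric per-tensor scheme shown is one simple quantization method among several.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]

PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

# Weights-only footprint at each precision.
footprints = {
    name: weight_memory_gb(PARAMS, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

# Toy weight values to show the rounding error introduced by INT8.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Under these assumptions the weights take 14 GB in FP16, 7 GB in INT8, and 3.5 GB in INT4. The round-trip error per weight is bounded by half the scale factor, which is the source of the quality loss.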
For builders
Understanding inference mechanics helps builders choose between self-hosted and managed inference providers based on latency, throughput, and cost profiles. Streaming completions expose the token-by-token generation process to users, improving perceived responsiveness. Batch inference is appropriate for asynchronous, non-interactive workloads where latency is less critical than cost.
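When comparing providers, TTFT and TPS can be measured directly from a streaming response. The sketch below times any iterator of tokens. `fake_stream` is a hypothetical stand-in for a streaming completion API, with made-up prefill and per-token delays, and `measure_stream` is an illustrative helper rather than part of any SDK.

```python
import time

def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and report TTFT and decode tokens-per-second."""
    start = clock()
    first_token_at = None
    last_token_at = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1
    if count == 0:
        return {"tokens": 0, "ttft_s": None, "tps": None}
    decode_time = last_token_at - first_token_at
    # TPS covers tokens after the first, so the initial wait is excluded.
    tps = (count - 1) / decode_time if decode_time > 0 else None
    return {"tokens": count, "ttft_s": first_token_at - start, "tps": tps}

def fake_stream(n_tokens, prefill_s=0.05, per_token_s=0.01):
    """Hypothetical stand-in for a streaming API (delays are made up)."""
    time.sleep(prefill_s)  # simulated wait before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # simulated per-token generation time
        yield f"tok{i}"

stats = measure_stream(fake_stream(20))
```

To measure a real provider, pass the iterator returned by its streaming client in place of `fake_stream`. Keeping TTFT separate from TPS matters because the two respond to different factors, such as prompt length for the first and output length for the second.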