Article Issue #5174

AI Inference

What to know

  • AI Inference is the computational step where a trained model processes an input and produces an output, as opposed to training, which adjusts model weights.
  • During autoregressive inference, the model generates one token at a time.
  • Understanding inference mechanics helps builders choose between self-hosted and managed inference providers based on latency, throughput, and cost profiles.

AI Inference is the computational step where a trained model processes an input and produces an output, as opposed to training, which adjusts model weights. For LLMs, inference involves feeding a tokenized prompt through the model’s layers in a forward pass to produce a probability distribution over the next token, sampling a token from that distribution, and repeating the process autoregressively until an end condition is met (for example, an end-of-sequence token or a length limit).
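The step from model scores to a sampled token can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the logits, the four-token vocabulary, and the helper names `softmax` and `sample_token` are all assumptions made for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores (logits) into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(probs, rng):
    """Draw one token id at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a 4-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

rng = random.Random(0)  # seeded so repeated runs draw the same token
next_token = sample_token(probs, rng)
```

Lower `temperature` values concentrate probability on the highest-scoring token, while higher values flatten the distribution and make the output more varied.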

How it works

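During autoregressive inference, the model generates one token at a time. Each new token is appended to the context and the forward pass runs again. A KV (key-value) cache stores the attention keys and values already computed for the existing context, so each step only has to process the newest token instead of recomputing the whole sequence. The prompt is processed once, in parallel (the prefill phase), while output tokens are produced sequentially (the decode phase), which is why long outputs usually add more latency than long inputs.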
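The effect of the cache can be seen in a toy loop. This sketch is not a real transformer: `project` is a hypothetical stand-in for the per-token key/value computation, and `next_token` is a stand-in for attention over the cached entries. It only counts how much per-token work is done with and without a cache.

```python
def project(token_id):
    """Hypothetical stand-in for computing one token's key/value entry."""
    project.calls += 1
    return (token_id * 31 + 7) % 101

project.calls = 0  # counts how many per-token projections were computed

def next_token(kv_entries, vocab_size=50):
    """Toy 'attention': combine all entries into the next token id."""
    return sum(kv_entries) % vocab_size

def generate(prompt, max_new_tokens, eos_id=None, use_cache=True):
    tokens = list(prompt)
    # Prefill: compute entries for the whole prompt once.
    cache = [project(t) for t in tokens] if use_cache else None
    for _ in range(max_new_tokens):
        if use_cache:
            entries = cache
        else:
            # Without a cache, every step recomputes the full context.
            entries = [project(t) for t in tokens]
        tok = next_token(entries)
        tokens.append(tok)
        if tok == eos_id:
            break
        if use_cache:
            cache.append(project(tok))  # only the new token is processed
    return tokens

project.calls = 0
with_cache = generate([1, 2, 3], max_new_tokens=5)
cached_calls = project.calls

project.calls = 0
without_cache = generate([1, 2, 3], max_new_tokens=5, use_cache=False)
uncached_calls = project.calls
```

Both runs produce the same tokens, but the cached run performs 8 projections while the uncached run performs 25, and the gap widens as the sequence grows.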

Key facts

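  • Latency components: Time-to-first-token (TTFT) measures how long a user waits before output begins, and tokens-per-second (TPS) measures how quickly output is generated after that. These are the two primary latency metrics.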
  • Hardware: GPUs and custom accelerators (TPUs, Trainium) dominate inference workloads due to matrix multiplication throughput.
  • Batching: Running multiple requests simultaneously on shared hardware improves GPU utilization and reduces per-token cost.
  • Quantization: Reducing the numerical precision of model weights from FP16 to INT8 or INT4 lowers memory requirements and can speed up inference, often with only a small loss in output quality.
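The memory saving from quantization follows from simple arithmetic on bytes per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; it ignores the KV cache, activations, and the small overhead of quantization scale factors.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9

# Hypothetical model size, chosen only for illustration.
num_params = 7e9
footprint = {p: weight_memory_gb(num_params, p) for p in BYTES_PER_WEIGHT}

for precision, gb in footprint.items():
    print(f"{precision}: {gb:.1f} GB")
```

Under these assumptions the weights take about 14 GB at FP16, 7 GB at INT8, and 3.5 GB at INT4, which can decide whether a model fits on a single accelerator.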

For builders

Understanding inference mechanics helps builders choose between self-hosted and managed inference providers based on latency, throughput, and cost profiles. Streaming completions expose the token-by-token generation process to users, improving perceived responsiveness. Batch inference is appropriate for asynchronous, non-interactive workloads where latency is less critical than cost.
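When comparing providers, it can help to measure latency on a streamed response directly. The following is a rough sketch: `fake_stream` is a hypothetical stand-in for a provider's streaming response (real client libraries differ), and its delays are invented values; `measure_stream` works on any iterable that yields tokens as they arrive.

```python
import time

def fake_stream(n_tokens, first_delay=0.05, per_token_delay=0.01):
    """Hypothetical stand-in for a streaming completion."""
    time.sleep(first_delay)  # simulated prompt processing
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_delay)  # simulated per-token decode time
        yield f"tok{i}"

def measure_stream(stream):
    """Return token count, time-to-first-token, and decode tokens/second."""
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in stream:
        if first_at is None:
            first_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    ttft = first_at - start if first_at is not None else None
    decode_time = end - first_at if first_at is not None else 0.0
    # Rate over the tokens that arrived after the first one.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"tokens": count, "ttft_s": ttft, "decode_tps": tps}

stats = measure_stream(fake_stream(10))
```

Reporting TTFT separately from the decode rate keeps prompt-processing delay from being averaged into generation speed, which matters when prompts are long and outputs are short.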

Administrator · 41 published guides · Joined 2016
