Article Issue #5174

AI Inference

What to know

AI Inference is the computational step where a trained model processes an input and produces an output, as opposed to training, which adjusts model weights.
During autoregressive inference, the model generates one token at a time.
Understanding inference mechanics helps builders choose between self-hosted and managed inference providers based on latency, throughput, and cost profiles.

Wikiwalls Team Administrator

May 15, 2026 2 min read


AI Inference is the computational step where a trained model processes an input and produces an output, as opposed to training, which adjusts model weights. For LLMs, inference involves feeding a tokenized prompt through the model’s layers in a forward pass to produce a probability distribution over the next token, sampling a token from that distribution, and repeating the process autoregressively until an end condition is met, such as an end-of-sequence token or a length limit.
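The sampling step can be illustrated with a minimal sketch. It assumes the model has already produced raw scores (logits) for a hypothetical four-token vocabulary; the `softmax` and `sample_token` helpers and the `temperature` parameter are illustrative names, not part of any specific library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token id according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits over a four-token vocabulary (not from a real model).
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
token_id = sample_token(logits, rng=random.Random(0))
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution and make the output more varied.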

How it works

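During autoregressive inference, the model generates one token at a time. Each new token is appended to the context and the forward pass runs again. A KV (key-value) cache stores the attention keys and values already computed for the existing context, so each step only has to process the newest token instead of recomputing the whole sequence. The prompt can be processed in parallel in one pass, while output tokens must be generated sequentially, which is why output length usually affects total latency more than input length does.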
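A toy sketch of the caching idea, assuming a single attention head and hand-written vectors in place of learned projections. The `KVCache` class and `attend` function are hypothetical names used only for illustration; the point is that appending one key/value pair per step gives the same result as recomputing attention over the full prefix.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    """Stores the key and value vector of every position seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest position's key/value are added; earlier ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Hypothetical per-token vectors; a real model derives these from learned projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [cache.step(t, t, t) for t in tokens]

# Reference: recompute attention over the full prefix at every step (no cache).
recomputed = [
    attend(tokens[i], tokens[: i + 1], tokens[: i + 1]) for i in range(len(tokens))
]
```

The cache trades memory for compute: it grows with every token in the context, which is one reason long contexts are memory-intensive to serve.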

Key facts

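Latency components: Time-to-first-token (TTFT) measures the delay before the first output token appears, and tokens-per-second (TPS) measures generation speed after that. Together they are the two primary latency metrics.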
Hardware: GPUs and custom accelerators (TPUs, Trainium) dominate inference workloads due to matrix multiplication throughput.
Batching: Running multiple requests simultaneously on shared hardware improves GPU utilization and reduces per-token cost.
Quantization: Reducing weight precision from FP16 to INT8 or INT4 lowers memory requirements and usually speeds inference. Quality loss is often small, but it depends on the model and the quantization method.
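The memory effect of quantization is simple arithmetic, sketched below for a hypothetical 7-billion-parameter model. The estimate covers weights only and ignores quantization scales, the KV cache, and activations. The `quantize_int8` helper shows one basic scheme (symmetric, per-tensor), offered as an illustration rather than the method any particular provider uses.

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]

params = 7_000_000_000  # hypothetical model size
footprint = {p: weight_memory_gb(params, p) for p in BYTES_PER_WEIGHT}
# FP16 needs about 14 GB, INT8 about 7 GB, INT4 about 3.5 GB.

weights = [0.12, -0.5, 0.33, 0.9]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why tensors with a few very large outliers tend to quantize less gracefully under a per-tensor scheme.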

For builders

Understanding inference mechanics helps builders choose between self-hosted and managed inference providers based on latency, throughput, and cost profiles. Streaming completions expose the token-by-token generation process to users, improving perceived responsiveness. Batch inference is appropriate for asynchronous, non-interactive workloads where latency is less critical than cost.
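When comparing providers, the two latency metrics can be computed from token arrival times on a streaming response. The sketch below assumes timestamps in seconds have already been recorded by the client; `latency_metrics` is a hypothetical helper, and the arrival times are made up for illustration.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT and decode tokens-per-second from arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_time
    decode_window = token_times[-1] - token_times[0]
    # TPS is measured over tokens after the first, so TTFT is not counted twice.
    decode_tokens = len(token_times) - 1
    tps = decode_tokens / decode_window if decode_window > 0 else float("nan")
    return ttft, tps

# Hypothetical stream: request sent at t=0.0, first token at 0.5 s,
# then one further token every 0.25 s.
arrivals = [0.5 + 0.25 * i for i in range(5)]
ttft, tps = latency_metrics(0.0, arrivals)
```

Measuring the two separately matters because a provider can have a fast first token but slow generation, or the reverse, and interactive and batch workloads weigh those differently.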



Wikiwalls Team

Administrator · 41 published guides · Joined 2016
