Article Issue #5196

Streaming Completions

What to know

  • Streaming Completions is a delivery mode for LLM API responses where the server sends each generated token or token chunk to the client as soon as it is produced, using Server-Sent Events (SSE) or chunked HTTP transfer encoding.
  • The client sets stream=true in the API request.
  • Streaming should be the default for any user-facing generation feature where responses are longer than a few sentences.


Streaming Completions is a delivery mode for LLM API responses in which the server sends each generated token or token chunk to the client as soon as it is produced, using Server-Sent Events (SSE) or chunked HTTP transfer encoding. Because the client can begin rendering output before generation has finished, long responses feel far more responsive even though total generation time is unchanged.

How it works

The client sets stream=true in the API request. The server sends a series of delta events, each containing one or more newly generated tokens, followed by a final done event. The client concatenates the deltas in order to reconstruct the full response. For structured output and function calling, the payload is not valid JSON until the last token arrives, so the client must either buffer the partial JSON until the stream completes or parse it incrementally.
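The event flow described above can be sketched as follows. This is a minimal illustration that runs on an in-memory list of lines rather than a real connection. The wire format is an assumption: each event is taken to be a single `data:` line carrying a JSON object with a hypothetical `delta` field, and the done event is taken to be the literal `[DONE]`. Real providers use their own event shapes, and the SSE specification also allows multi-line `data:` fields, which this sketch ignores.

```python
import json
from typing import Any, Iterable, Iterator, List, Optional

DONE_SENTINEL = "[DONE]"  # assumed end-of-stream marker; real APIs differ


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of every `data:` line; blank separator lines are skipped."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


def accumulate_deltas(lines: Iterable[str]) -> str:
    """Concatenate the hypothetical `delta` field of each event until the done event."""
    parts: List[str] = []
    for payload in iter_sse_data(lines):
        if payload == DONE_SENTINEL:
            break
        event = json.loads(payload)
        parts.append(event.get("delta", ""))
    return "".join(parts)


def try_parse_json(buffer: str) -> Optional[Any]:
    """Return the parsed value, or None while the buffer is still incomplete.

    Assumes the model emits a top-level JSON object, so no strict prefix of the
    final text is itself valid JSON.
    """
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


def to_sse(deltas: Iterable[str]) -> List[str]:
    """Build a fake SSE line stream (test fixture only, not part of any real API)."""
    lines: List[str] = []
    for delta in deltas:
        lines.append("data: " + json.dumps({"delta": delta}))
        lines.append("")  # a blank line terminates each SSE event
    lines.append("data: " + DONE_SENTINEL)
    lines.append("")
    return lines


# Plain-text response: deltas are simply joined.
text = accumulate_deltas(to_sse(["Hel", "lo, ", "world"]))
print(text)

# Structured response: the JSON only parses once the last delta has arrived.
json_text = accumulate_deltas(to_sse(['{"city": "Par', 'is"}']))
parsed = try_parse_json(json_text)
print(parsed)
```

Retrying `json.loads` on the growing buffer is the simplest buffering strategy. A client that needs to show fields as they arrive would swap in an incremental parser instead.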

Key facts

  • Protocol: Server-Sent Events (SSE) over HTTP/1.1 or HTTP/2 is the standard delivery mechanism.
  • TTFT vs. TPS: Streaming exposes time-to-first-token (TTFT) as a latency metric distinct from total generation time, while tokens per second (TPS) describes how fast output arrives once generation has started.
  • Structured output: Streaming JSON responses requires handling partial JSON parsing; most SDKs provide helpers for this.
  • Cancellation: Clients can cancel the stream mid-generation, which reduces costs by avoiding charges for tokens that would be generated after the user has dismissed the response.
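To make the TTFT and TPS distinction concrete, here is a small sketch of how a client might measure both while consuming a stream. It assumes one token per chunk, which real APIs do not guarantee, and it defines TPS as decode throughput after the first token, which is one of several definitions in use. The function name `measure_stream` and the injectable `clock` parameter are hypothetical choices made so the example is deterministic.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a stream and report TTFT, total time, and tokens per second.

    Assumption: every chunk counts as exactly one token.
    """
    start = clock()  # taken when the request is (notionally) sent
    first_at: Optional[float] = None
    last_at = start
    count = 0
    for _ in chunks:
        now = clock()
        if first_at is None:
            first_at = now  # arrival of the first token
        last_at = now
        count += 1

    if first_at is None:
        # Nothing arrived, so TTFT and throughput are undefined.
        return {"ttft_s": None, "total_s": last_at - start, "tokens_per_s": None}

    decode_time = last_at - first_at
    # Throughput after the first token: (count - 1) intervals over decode_time.
    tps = (count - 1) / decode_time if decode_time > 0 else None
    return {
        "ttft_s": first_at - start,
        "total_s": last_at - start,
        "tokens_per_s": tps,
    }


# Deterministic demo: a fake clock stands in for real arrival times.
fake_times = iter([0.0, 0.5, 0.75, 1.0, 1.25])
metrics = measure_stream(["Str", "eam", "ing", "!"], clock=lambda: next(fake_times))
print(metrics)
```

In this demo the first token arrives after 0.5 s and the remaining three arrive over 0.75 s, so the two numbers describe different parts of the wait: TTFT is what the user notices before anything appears, and TPS is how quickly the text fills in afterwards.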

For builders

Streaming should be the default for any user-facing generation feature where responses are longer than a few sentences. Users see the first words almost immediately instead of waiting for the whole response, which improves the experience and makes them less likely to abandon the request. For backend pipelines that process completions programmatically rather than displaying them, non-streaming is simpler and equally efficient. Always implement stream cancellation so that you do not pay for a full completion when the user navigates away.
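One way to structure cancellation is sketched below. It uses a generator as a stand-in for a streaming HTTP response, so no network or real SDK is involved, and every name in it (`fake_token_stream`, `consume`, `render`) is hypothetical. In a real client the equivalent of `stream.close()` would be closing the HTTP response or connection. Whether the provider stops generating and billing at that point depends on the provider.

```python
import threading
from typing import Callable, Generator, Iterable, List


def fake_token_stream(
    tokens: Iterable[str], log: List[str]
) -> Generator[str, None, None]:
    """Stand-in for a streaming response; records what the 'server' produced."""
    try:
        for token in tokens:
            log.append("generated:" + token)
            yield token
    finally:
        # Runs when the consumer closes the generator early. A real client
        # would close the HTTP connection at this point.
        log.append("closed")


def consume(
    stream: Generator[str, None, None],
    cancel: threading.Event,
    on_token: Callable[[str], None],
) -> str:
    """Render tokens as they arrive and stop as soon as `cancel` is set."""
    received: List[str] = []
    try:
        for token in stream:
            received.append(token)
            on_token(token)
            if cancel.is_set():
                break  # stop pulling tokens from the stream
    finally:
        stream.close()  # always release the stream, cancelled or not
    return "".join(received)


cancel = threading.Event()
log: List[str] = []
shown: List[str] = []


def render(token: str) -> None:
    shown.append(token)
    if len(shown) == 2:
        cancel.set()  # simulate the user navigating away after two tokens


partial = consume(fake_token_stream(["a", "b", "c", "d"], log), cancel, render)
print(partial)
print(log)
```

Using a `threading.Event` as the cancel signal means a UI thread or request handler can set it from outside the loop that reads the stream. The `finally` block ensures the stream is released even if rendering raises.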

Administrator · 41 published guides · Joined 2016
