Streaming Completions
Streaming Completions is a delivery mode for LLM API responses where the server sends each generated token or token chunk to the client as soon as it is produced, using Server-Sent Events (SSE) or chunked HTTP transfer encoding. The client can begin rendering output to users before generation is finished, dramatically improving perceived responsiveness for long responses.
How it works
The client sets stream=true in the API request. The server sends a series of delta events, each containing one or more newly generated tokens, followed by a final done event. The client concatenates the deltas in order to reconstruct the full response. For structured output and function calling, the JSON payload is incomplete, and therefore not valid JSON, until the final token arrives, so the client must either buffer the deltas until the end or parse the partial JSON incrementally.
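The accumulation step above can be sketched in a few lines. This is a minimal illustration, not any provider's actual wire format: the `delta` field name, the `[DONE]` end marker, and the sample stream are assumptions made for the example, and real APIs name their events and fields differently.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each SSE `data:` line, skipping blanks and comments."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # blank event separator or SSE comment (e.g. keep-alive)
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


def accumulate(lines: Iterable[str]) -> str:
    """Concatenate delta text until the (assumed) end-of-stream marker arrives."""
    parts = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # hypothetical end marker; provider-specific
            break
        event = json.loads(payload)
        parts.append(event.get("delta", ""))  # hypothetical field name
    return "".join(parts)


# Simulated wire data standing in for a real HTTP response body.
SAMPLE_STREAM = [
    'data: {"delta": "Hel"}\n',
    "\n",
    'data: {"delta": "lo, "}\n',
    "\n",
    ": keep-alive\n",
    'data: {"delta": "world"}\n',
    "\n",
    "data: [DONE]\n",
]

if __name__ == "__main__":
    print(accumulate(SAMPLE_STREAM))
```

In a user-facing client, each delta would typically be rendered as soon as it is parsed rather than only joined at the end; the accumulated string is what gets stored as the final response.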
Key facts
- Protocol: Server-Sent Events (SSE) over HTTP/1.1 or HTTP/2 is the standard delivery mechanism.
- TTFT vs. TPS: Streaming exposes time-to-first-token (TTFT) as a latency metric distinct from total generation time, which is governed by the generation rate in tokens per second (TPS).
- Structured output: Streaming JSON responses requires handling partial JSON parsing; most SDKs provide helpers for this.
- Cancellation: Clients can cancel the stream mid-generation, for example when the user dismisses the response, to avoid paying for tokens that would otherwise still be generated.
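The two latency metrics in the list above can be measured on the client with timestamps taken as chunks arrive. The sketch below is illustrative only: `measure_stream` and `fake_stream` are hypothetical names, and it assumes each chunk carries exactly one token, which real streams do not guarantee.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Return TTFT and decode-phase throughput for one streamed response."""
    start = clock()
    first = None
    last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival time of the first chunk
        last = now
        count += 1
    if first is None:
        return {"ttft_s": float("nan"), "tokens_per_s": 0.0, "chunks": 0}
    decode_time = last - first
    # Throughput counts only chunks after the first, over the time they took.
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": first - start, "tokens_per_s": tps, "chunks": count}


def fake_stream() -> Iterator[str]:
    """Simulated server: a slow first token, then faster steady output."""
    time.sleep(0.05)
    yield "Hello"
    for tok in [",", " world", "!"]:
        time.sleep(0.01)
        yield tok


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

Injecting the clock as a parameter keeps the measurement testable with fixed timestamps; in production the default `time.perf_counter` is used.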
For builders
Streaming should be the default for any user-facing generation feature where responses are longer than a few sentences. The reduction in perceived latency significantly improves user experience and completion rates. For backend pipelines that process completions programmatically rather than displaying them, non-streaming is simpler and equally efficient. Always implement stream cancellation so that you do not incur the full completion cost when users navigate away.
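One way to structure client-side cancellation is to wrap the stream in a generator whose cleanup closes the connection, and to check a cancel flag between tokens. The sketch below uses a stand-in `FakeConnection` class rather than a real HTTP client; all names are hypothetical. Whether and how quickly the server stops generating (and billing) after a disconnect is provider-specific and is not modeled here.

```python
import threading
from typing import Callable, Iterator, List


class FakeConnection:
    """Stand-in for an HTTP response; records whether it was closed."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.closed = False

    def close(self) -> None:
        self.closed = True


def stream_tokens(conn: FakeConnection) -> Iterator[str]:
    """Yield tokens and close the connection when the stream ends or is abandoned."""
    try:
        for tok in conn.tokens:
            yield tok
    finally:
        conn.close()  # runs on normal exhaustion and on generator.close()


def consume(
    conn: FakeConnection,
    cancelled: threading.Event,
    on_token: Callable[[str], None],
) -> List[str]:
    """Render tokens until the stream ends or the cancel flag is set."""
    received: List[str] = []
    stream = stream_tokens(conn)
    try:
        for tok in stream:
            if cancelled.is_set():
                break  # user navigated away: stop reading immediately
            received.append(tok)
            on_token(tok)
    finally:
        stream.close()  # triggers the generator's finally block
    return received


if __name__ == "__main__":
    conn = FakeConnection(["a", "b", "c", "d"])
    flag = threading.Event()
    shown: List[str] = []

    def render(tok: str) -> None:
        shown.append(tok)
        if len(shown) == 2:
            flag.set()  # simulate the user dismissing the response

    print(consume(conn, flag, render), conn.closed)
```

A `threading.Event` is used so that the flag can be set from another thread, such as a UI handler, while the consumer loop is reading.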
Sources
- Zhao, W. X., et al. (2023). A Survey of Large Language Models. arXiv:2303.18223. arxiv.org
- Yu, G., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI. usenix.org
- Kwon, W., et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. vllm-project. github.com
- Anthropic. Claude API documentation. docs.anthropic.com
- OpenAI. API reference. platform.openai.com