Streaming Response
Sending an LLM's answer token by token so the user sees the text appear immediately.
What is a streaming response?
A streaming response is a way for an LLM to return its answer not as one block, but token by token, as the model generates it. The user watches the text "type itself" in real time, as in ChatGPT.
How streaming works
- The client sends an HTTP request with the `stream: true` parameter to the API.
- The server opens a long-lived connection (Server-Sent Events, SSE) and pushes tokens as soon as they are generated.
- The client receives the chunks and renders them progressively.
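The steps above can be sketched on the client side. The snippet below parses a simulated SSE stream; the `data:` line format and `[DONE]` sentinel follow the OpenAI-style convention, but the exact payload schema varies by provider, so treat the field names as assumptions:

```python
import json

# Hypothetical SSE payload as the client would receive it, line by line.
# OpenAI-style APIs send 'data: {...}' lines and a final 'data: [DONE]'
# sentinel; other providers use different schemas.
sse_lines = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", "}}]}',
    'data: {"choices": [{"delta": {"content": "world!"}}]}',
    'data: [DONE]',
]

def iter_tokens(lines):
    """Yield token chunks extracted from SSE 'data:' lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# A real client would render each chunk as it arrives; here we just join them.
text = "".join(iter_tokens(sse_lines))
```

In a browser client the same parsing would typically be done with `EventSource` or a `fetch` reader over the response body.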
Benefits of streaming
- Perceived speed: The user sees the first words in milliseconds instead of waiting several seconds
- Better chatbot UX: Conversations feel more natural, users can interrupt an answer
- Lower time to first token (TTFT): a key latency metric for chatbots and assistants
Drawbacks
- More complex client-side handling (progressive parsing)
- For structured output (e.g., JSON), you either wait for the full response or deal with incremental parsing
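The simplest way to handle streamed JSON is to buffer chunks and retry a full parse after each one; it only succeeds once the object is complete. A minimal sketch with a hypothetical chunked JSON answer:

```python
import json

# Chunks of a JSON answer as they might arrive over a stream; the object
# is only parseable once the stream is complete.
chunks = ['{"city": "Par', 'is", "temp', '": 18}']

buffer = ""
result = None
for chunk in chunks:
    buffer += chunk
    try:
        result = json.loads(buffer)  # succeeds only when the JSON is complete
    except json.JSONDecodeError:
        continue  # partial JSON: keep buffering
```

True incremental parsing (emitting fields before the stream ends) needs a streaming JSON parser and is considerably more involved.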