Streaming Response
Sending an LLM's answer token by token so the user sees the text appear immediately.
What is a streaming response?
A streaming response is a way for an LLM to return its answer not as one block, but token by token, as the model generates it. The user watches the text "type itself" in real time, as in ChatGPT.
How streaming works
- The client sends an HTTP request with the `stream: true` parameter to the API.
- The server opens a long-lived connection (Server-Sent Events, SSE) and pushes tokens as soon as they are generated.
- The client receives the chunks and renders them progressively.
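The steps above can be sketched on the client side. The snippet below parses a simulated SSE stream; the `data:` line format and `[DONE]` sentinel follow the OpenAI-style convention, but the exact payload schema varies by provider, so treat the field names as assumptions:

```python
import json

# Hypothetical SSE payload as the client would receive it, line by line.
# OpenAI-style APIs send 'data: {...}' lines and a final 'data: [DONE]'
# sentinel; other providers use different schemas.
sse_lines = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", "}}]}',
    'data: {"choices": [{"delta": {"content": "world!"}}]}',
    'data: [DONE]',
]

def iter_tokens(lines):
    """Yield token chunks extracted from SSE 'data:' lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# A real client would render each chunk as it arrives; here we just join them.
text = "".join(iter_tokens(sse_lines))
```

In a browser client the same parsing would typically be done with `EventSource` or a `fetch` reader over the response body.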
Benefits of streaming
- Perceived speed: The user sees the first words in milliseconds instead of waiting several seconds
- Better chatbot UX: Conversations feel more natural, users can interrupt an answer
- Lower time to first token (TTFT): a key latency metric for chatbots and assistants
Drawbacks
- More complex client-side handling (progressive parsing)
- For structured output (e.g., JSON), you either wait for the full response or deal with incremental parsing
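The simplest way to handle streamed JSON is to buffer chunks and retry a full parse after each one; it only succeeds once the object is complete. A minimal sketch with a hypothetical chunked JSON answer:

```python
import json

# Chunks of a JSON answer as they might arrive over a stream; the object
# is only parseable once the stream is complete.
chunks = ['{"city": "Par', 'is", "temp', '": 18}']

buffer = ""
result = None
for chunk in chunks:
    buffer += chunk
    try:
        result = json.loads(buffer)  # succeeds only when the JSON is complete
    except json.JSONDecodeError:
        continue  # partial JSON: keep buffering
```

True incremental parsing (emitting fields before the stream ends) needs a streaming JSON parser and is considerably more involved.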