Daniel Hladik AI Automation Engineer


Streaming Response

Sending an LLM's answer token by token so the user sees the text appear immediately.

What is a streaming response?

A streaming response is a way for an LLM to return its answer not as one block, but token by token, as the model generates it. The user watches the text "type itself" in real time, much like in ChatGPT.

How streaming works

  1. The client sends an HTTP request with the stream: true parameter to the API.
  2. The server opens a long-lived connection (Server-Sent Events - SSE) and pushes tokens as soon as they are generated.
  3. The client receives the chunks and renders them progressively.
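The client side of these three steps can be sketched in a few lines. This is a minimal illustration of parsing an SSE stream, not any specific provider's API: the `data: {json}` event format and the `[DONE]` sentinel follow a common convention, but the payload shape (`"token"` field) is an assumption made up for this example.

```python
# Hedged sketch: parsing Server-Sent Events from a streaming LLM endpoint.
# The "data: {json}" lines and "[DONE]" sentinel follow a common SSE
# convention; the {"token": ...} payload shape is illustrative only.
import json

def parse_sse_stream(raw_lines):
    """Yield tokens from an iterable of SSE lines as they arrive."""
    for line in raw_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines, comments, keep-alive pings
        payload = line[len("data: "):]
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        event = json.loads(payload)
        yield event["token"]

# Simulated server output: one SSE event per generated token.
stream = [
    'data: {"token": "Hello"}',
    'data: {"token": ", "}',
    'data: {"token": "world"}',
    "data: [DONE]",
]

text = ""
for token in parse_sse_stream(stream):
    text += token  # a real client would render each token immediately
print(text)  # → Hello, world
```

In a real client, `raw_lines` would come from the long-lived HTTP connection opened in step 2, and each yielded token would be appended to the UI as it arrives.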

Benefits of streaming

  • Perceived speed: The user sees the first words within a fraction of a second instead of waiting several seconds for the full answer
  • Better chatbot UX: Conversations feel more natural, and users can interrupt an answer mid-generation
  • Time to first token (TTFT): The key latency metric for chatbots and assistants - streaming minimizes the delay the user actually perceives
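The TTFT benefit is easy to see in a toy measurement. The sketch below simulates a model emitting tokens with a fixed per-token delay (the generator and its timings are invented for illustration) and compares when the first token lands versus when the full answer is done:

```python
# Hedged sketch: time to first token (TTFT) vs. total generation time,
# measured over a simulated token generator (names and delays are made up).
import time

def fake_generation(n_tokens=5, delay=0.01):
    for i in range(n_tokens):
        time.sleep(delay)  # stand-in for per-token model latency
        yield f"tok{i} "

start = time.perf_counter()
first_token_at = None
tokens = []
for tok in fake_generation():
    if first_token_at is None:
        first_token_at = time.perf_counter() - start  # TTFT
    tokens.append(tok)
total = time.perf_counter() - start

# With streaming, the user starts reading after ~TTFT;
# without streaming, they stare at a spinner for ~total.
print(f"TTFT: {first_token_at:.3f}s, total: {total:.3f}s")
```

The gap between the two numbers is exactly the "perceived speed" win: streaming turns a multi-second wait into a near-immediate first word.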

Drawbacks

  • More complex client-side handling (progressive parsing)
  • For structured output (e.g., JSON), you either wait for the full response or deal with incremental parsing
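The structured-output drawback deserves a concrete illustration. The simplest workaround is the "wait for the full response" option: buffer chunks and only attempt a parse once the JSON is complete. The chunk boundaries and payload below are invented for the example:

```python
# Hedged sketch of the structured-output tradeoff: partial JSON is not
# valid JSON, so the client buffers chunks until a parse succeeds.
import json

# Streamed fragments of one JSON answer (boundaries are illustrative).
chunks = ['{"answer": ', '"42", ', '"sources": [1, 2]}']

buffer = ""
result = None
for chunk in chunks:
    buffer += chunk
    try:
        result = json.loads(buffer)  # succeeds only once the JSON is whole
        break
    except json.JSONDecodeError:
        continue  # still incomplete: keep buffering

print(result)  # → {'answer': '42', 'sources': [1, 2]}
```

The alternative, incremental parsing, renders fields as they complete but requires a streaming JSON parser and careful handling of half-finished strings and arrays, which is the extra client-side complexity noted above.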