
TTFT — Time to First Token

The most critical metric for perceived AI responsiveness

TTFT · Latency · Streaming · Performance · Inference

Overview

Time to First Token (TTFT) measures the latency between the moment your application sends a request to an LLM and the moment the first token of the response arrives. In streaming applications, this is the "time to first word" the user sees: the instant the interface stops appearing idle and starts showing visible progress.

A 500ms difference in TTFT can be the difference between a user perceiving your AI feature as "instant" or "laggy." While total response time matters, TTFT is the single most critical metric for user perception in streaming applications because it controls when users see the first sign of progress.
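Measuring TTFT correctly means timing from request dispatch to the arrival of the first streamed token, not to the end of the response. A minimal sketch, assuming a generic token iterator (the `_fake_stream` generator below simulates a streaming LLM client and is purely illustrative):

```python
import time

def measure_ttft(stream):
    """Time-to-first-token and total latency for a token stream.

    `stream` is any iterable that yields response tokens as they
    arrive (e.g. the chunk iterator of a streaming LLM client).
    """
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for _token in stream:
        if ttft is None:
            # First token observed: this gap is the TTFT.
            ttft = time.perf_counter() - start
        tokens += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "tokens": tokens}

def _fake_stream(prefill_s=0.12, per_token_s=0.02, n=5):
    """Simulated stream: 120 ms 'prefill', then 20 ms per token."""
    time.sleep(prefill_s)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token_s)

stats = measure_ttft(_fake_stream())
print(f"TTFT: {stats['ttft_s'] * 1000:.0f} ms over {stats['tokens']} tokens")
```

The same harness works against a real streaming API: pass the client's chunk iterator in place of `_fake_stream()` and the first-chunk timestamp gives you TTFT directly.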

Key Concepts

  • Three components of TTFT: TTFT is not a single operation — it is a sequence. (1) Network latency: travel time from your client to the API and back. (2) Request queuing: time waiting for available compute under load. (3) Prefill phase: the model processes the entire prompt and builds its internal key-value cache before generating the first output token. The prefill phase is usually the dominant factor, especially with long prompts.
  • TTFT vs. total latency: Total latency = TTFT + (number of output tokens × time-per-token). For streaming applications, TTFT is more critical than total latency because it determines when the user first sees feedback and sets the initial perception of responsiveness.
  • Streaming is non-negotiable: TTFT is invisible in non-streaming calls — the full response is buffered before delivery. Always use streaming for accurate TTFT measurement and for delivering responsive user experiences. Streaming does not reduce the underlying TTFT, but it exposes it directly to users rather than hiding it.
  • Prompt length impact: Every token in your prompt must be processed during the prefill phase before a single output token can be generated. A 10,000-token prompt can add 500ms or more to TTFT regardless of model or hardware. In RAG applications, trimming retrieved context is often the highest-leverage TTFT optimisation.
  • Model selection and infrastructure: Smaller, faster models (e.g. GPT-4o mini, Claude 3.5 Haiku) offer substantially lower TTFT than larger models at a fraction of the cost. Dedicated capacity (Provisioned Throughput Units / PTUs) eliminates queuing delays under load. Separating interactive workloads from batch jobs prevents queue time from compounding.
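The total-latency decomposition above is simple enough to write down directly. The illustrative numbers below are assumptions, not benchmarks:

```python
def total_latency_ms(ttft_ms: float, output_tokens: int, itl_ms: float) -> float:
    """Total latency = TTFT + (output tokens x inter-token latency)."""
    return ttft_ms + output_tokens * itl_ms

# Same model speed (20 ms/token, 200 tokens), different prompt handling:
# the 400 ms TTFT difference is what the user feels first, even though
# generation time is identical.
fast = total_latency_ms(ttft_ms=300, output_tokens=200, itl_ms=20)
slow = total_latency_ms(ttft_ms=700, output_tokens=200, itl_ms=20)
print(fast, slow)  # 4300.0 4700.0
```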

Key Facts

  • Applications with TTFT under 500ms see 23% higher session completion rates compared to slower equivalents, according to Azure OpenAI research.
  • Industry SLO targets: interactive chat at 500ms p95, copilot and autocomplete features at 300ms p95. Batch processing has no strict TTFT SLO but should be monitored for anomalies.
  • Optimising TTFT often involves infrastructure changes that also reduce total token processing costs by 15–30%, making it a cost and UX win simultaneously.
  • A 300ms average TTFT can hide 2-second p99 spikes. Always monitor p95 and p99 percentiles and track goodput — the percentage of requests meeting your SLO — rather than average TTFT alone.
  • Azure OpenAI's content safety filters add overhead to every request. For low-risk applications, modified content filtering policies can reduce TTFT meaningfully.
  • TTFT should always be measured alongside Inter-Token Latency (ITL) and Tokens Per Second (TPS) — TTFT tells you when the first token arrives, but not how fast the model generates subsequent tokens.
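The average-versus-tail point is easy to demonstrate numerically. A minimal sketch of a percentile-and-goodput report, with an illustrative (invented) sample distribution:

```python
import statistics

def ttft_report(samples_ms: list[float], slo_ms: float = 500) -> dict:
    """Summarise TTFT samples: the mean hides tail spikes, so report
    p95/p99 and goodput (the share of requests meeting the SLO)."""
    n = len(samples_ms)
    ordered = sorted(samples_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile, clamped to the last sample.
        return ordered[min(n - 1, int(p / 100 * n))]

    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "goodput": sum(s <= slo_ms for s in samples_ms) / n,
    }

# 98 fast requests plus 2 slow outliers: the 336 ms mean looks healthy,
# while p99 exposes the 2-second-plus spikes users actually hit.
report = ttft_report([300.0] * 98 + [2000.0, 2200.0])
print(report)
```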