Latency
The delay between a caller's speech and the AI's audible response. Under 400ms is considered natural.
Latency in AI voice agents refers to the time elapsed between when a caller finishes speaking and when they first hear the AI's response. It is one of the most critical metrics for perceived conversation quality.
Why latency matters
Human conversations naturally have response gaps of 200–500 milliseconds. Delays beyond 700ms register as awkward pauses; the listener instinctively senses that the other party is struggling. At 1,000ms (1 second) or more, callers assume the call has dropped or that something is wrong. These delays are a primary reason early AI phone systems felt robotic and unnatural.
Components of end-to-end latency
| Stage | Typical Duration |
|---|---|
| STT processing | 50–150ms (streaming) |
| Network round-trip (STT) | 20–80ms |
| LLM generation (first token) | 100–300ms |
| TTS synthesis (first audio chunk) | 50–150ms |
| Network round-trip (TTS) | 20–80ms |
| Total (optimized) | 240–760ms |
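The total row is the sum of the per-stage minimums and maximums. A small sketch makes the arithmetic explicit; the stage names and values below simply mirror the table and are illustrative, not measured:

```python
# Illustrative latency budget: sums the per-stage ranges from the
# table above to get best-case and worst-case end-to-end totals.
STAGES_MS = {
    "stt_processing": (50, 150),
    "stt_network_rtt": (20, 80),
    "llm_first_token": (100, 300),
    "tts_first_chunk": (50, 150),
    "tts_network_rtt": (20, 80),
}

def total_latency_ms(stages):
    """Return (best_case, worst_case) total latency in milliseconds."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

best, worst = total_latency_ms(STAGES_MS)
print(f"End-to-end: {best}-{worst}ms")  # 240-760ms, matching the table
```

Note that these sums assume the stages run sequentially; the optimization strategies below reduce the effective total by overlapping stages rather than shrinking any single one.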
Optimization strategies
- Streaming STT begins processing before the caller finishes speaking
- Streaming TTS begins generating audio before the LLM has finished producing the full response
- Co-location of STT, LLM, and TTS services on the same compute cluster eliminates inter-service network hops
- Predictive turn-taking uses acoustic cues to anticipate when the caller will finish speaking
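The streaming-TTS strategy above can be sketched with a simple pipeline: flush text to the synthesizer at sentence boundaries so playback starts long before the full reply exists. This is a hypothetical illustration (the stream and `speak` stand-ins are invented here, not any vendor's API):

```python
import asyncio

async def fake_llm_stream():
    # Stand-in for an LLM token stream; real tokens arrive incrementally.
    for token in ["Sure, ", "I can ", "help. ", "What ", "time ", "works?"]:
        await asyncio.sleep(0.05)   # simulated per-token generation delay
        yield token

async def speak(chunk, spoken):
    # Stand-in for a streaming TTS call; a real system would emit audio frames.
    await asyncio.sleep(0.02)
    spoken.append(chunk)

async def stream_response():
    spoken, buffer = [], ""
    async for token in fake_llm_stream():
        buffer += token
        # Flush to TTS at sentence-ish boundaries so audio playback
        # begins while the LLM is still generating the rest.
        if buffer.rstrip().endswith((".", "?", "!")):
            await speak(buffer, spoken)
            buffer = ""
    if buffer:
        await speak(buffer, spoken)
    return spoken

chunks = asyncio.run(stream_response())
print(chunks)  # → ['Sure, I can help. ', 'What time works?']
```

The caller hears the first sentence while the second is still being generated, which is where the bulk of the perceived latency savings comes from.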
TurboCall achieves sub-400ms end-to-end latency using all four optimization strategies simultaneously.