Definition

AI Voice Agent

Software that conducts real-time phone conversations using STT, an LLM, and TTS.

An AI voice agent is software that conducts real-time, two-way phone conversations with humans using artificial intelligence. It listens to what a caller says (speech-to-text), processes the meaning using a large language model, and responds in natural-sounding speech (text-to-speech) — all within a fraction of a second.

Unlike traditional IVR systems that force callers through rigid menu trees, an AI voice agent understands natural language. A caller can say "I need to reschedule my appointment to sometime next Thursday afternoon" and the agent parses the intent, asks a clarifying question if needed, and takes action — no menu prompts required.

Core pipeline stages

  1. STT (Speech-to-Text): Converts the caller's audio into text in real time using a speech recognition service or model such as Deepgram or OpenAI's Whisper.
  2. LLM (Large Language Model): Interprets the text, determines the caller's intent, generates a response, and decides whether to take an action (book a slot, look up an order, transfer to a human).
  3. TTS (Text-to-Speech): Converts the generated response back into natural-sounding audio using a neural voice engine.
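
The three stages above can be sketched as a single conversational turn. This is a minimal illustration, not any vendor's implementation: the names AgentTurn, handle_turn, fake_stt, fake_llm, and fake_tts are hypothetical, and each stub stands in for a real STT, LLM, or TTS service. The "audio" here is plain UTF-8 bytes so the example runs without external dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class AgentTurn:
    transcript: str          # text produced by the STT stage
    reply_text: str          # response text produced by the LLM stage
    action: Optional[str]    # action the LLM stage chose, if any
    reply_audio: bytes       # audio produced by the TTS stage


def fake_stt(audio: bytes) -> str:
    # Hypothetical stand-in for an STT service: treats the bytes as UTF-8 text.
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> Tuple[str, Optional[str]]:
    # Hypothetical stand-in for an LLM: a keyword check, not real understanding.
    if "reschedule" in transcript.lower():
        return "Sure, which day works best for you?", "reschedule_appointment"
    return "Could you tell me a bit more about what you need?", None


def fake_tts(text: str) -> bytes:
    # Hypothetical stand-in for a TTS engine: encodes the text as bytes.
    return text.encode("utf-8")


def handle_turn(
    audio: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], Tuple[str, Optional[str]]] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> AgentTurn:
    """Run one caller utterance through STT, then LLM, then TTS."""
    transcript = stt(audio)
    reply_text, action = llm(transcript)
    reply_audio = tts(reply_text)
    return AgentTurn(transcript, reply_text, action, reply_audio)


if __name__ == "__main__":
    turn = handle_turn(b"I need to reschedule my appointment")
    print(turn.transcript)
    print(turn.action)
    print(turn.reply_text)
```

Passing the stages in as parameters keeps the sketch independent of any particular provider; a production system would typically stream partial results between stages rather than wait for each one to finish.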

The time from the moment the caller stops speaking to the moment the agent begins responding is called end-to-end latency. TurboCall achieves latency under 400 ms by co-locating all three stages on the same inference cluster.
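
One way to reason about end-to-end latency is to time each stage and compare the total against a budget. The sketch below assumes the stages run strictly one after another, which is a simplification: real systems often stream and overlap stages. The names timed_pipeline, within_budget, and LATENCY_BUDGET_MS are hypothetical, and the 400 ms figure is simply the target quoted above, not a measured result.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

LATENCY_BUDGET_MS = 400.0  # assumed budget, taken from the target quoted in the text


def timed_pipeline(
    stages: List[Tuple[str, Callable[[Any], Any]]],
    payload: Any,
) -> Tuple[Any, Dict[str, float]]:
    """Run stages in order, recording each stage's wall-clock time in ms."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return payload, timings


def within_budget(
    timings: Dict[str, float],
    budget_ms: float = LATENCY_BUDGET_MS,
) -> bool:
    """Return True if the summed stage times fit inside the budget."""
    return sum(timings.values()) <= budget_ms


if __name__ == "__main__":
    # Placeholder stages; real ones would call STT, LLM, and TTS services.
    stages = [
        ("stt", lambda audio: audio.decode("utf-8")),
        ("llm", lambda text: "You said: " + text),
        ("tts", lambda text: text.encode("utf-8")),
    ]
    _, timings = timed_pipeline(stages, b"hello")
    for name, ms in timings.items():
        print(f"{name}: {ms:.3f} ms")
    print("within budget:", within_budget(timings))
```

Per-stage timings make it easier to see which stage dominates the total when the budget is exceeded.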

Business use cases: Inbound call handling, outbound lead qualification, appointment scheduling, order status lookups, payment collection, appointment reminders, and post-call surveys.