Definition

STT (Speech-to-Text)

Technology that converts spoken audio into written text in real time.

Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), is the technology that converts a caller's spoken words into written text. It is the first stage in an AI voice agent's processing pipeline.

Modern STT engines use deep neural networks trained on hundreds of thousands of hours of speech data. Leading implementations include Deepgram, AssemblyAI, and OpenAI Whisper. State-of-the-art STT achieves over 95% word accuracy (a word error rate below 5%) on clear, standard English speech. Accuracy drops for heavy accents, domain-specific jargon, and noisy audio environments.
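Word accuracy figures like the one above are usually derived from word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that calculation. The function name and the simple lowercase-and-split normalization are assumptions made for this example, not the scoring method of any particular vendor.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "i would like to book an appointment for next tuesday"
transcript = "i would like to book an appointment for next thursday"
wer = word_error_rate(reference, transcript)
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")
```

One wrong word out of ten gives a WER of 10%, so a transcript needs fewer than one error per twenty words to clear the 95% mark.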

Streaming vs. batch STT

  • Streaming STT transcribes audio in real time as the caller speaks, allowing the language model to begin processing before the caller has finished their sentence. This reduces end-to-end latency.
  • Batch STT waits until the caller stops speaking before processing the full audio. It is simpler to implement but adds hundreds of milliseconds of latency.

TurboCall uses streaming STT to achieve sub-400 ms total response latency.
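The latency difference between the two modes can be illustrated with a toy model. Everything in this sketch is an assumption chosen for illustration: the function names, the 3-second utterance, the 100 ms chunk size, and a "real-time factor" of 0.1 (the engine needs 0.1 ms of compute per 1 ms of audio). These are not TurboCall measurements, and the model ignores network transfer, end-of-speech detection, and language model time.

```python
def batch_latency_ms(utterance_ms: float, rtf: float) -> float:
    """Delay after the caller stops speaking when the whole
    utterance is transcribed only once it is complete."""
    return utterance_ms * rtf


def streaming_latency_ms(utterance_ms: float, chunk_ms: float, rtf: float) -> float:
    """Delay after the caller stops speaking when each chunk is
    transcribed as soon as it has been captured."""
    arrived_at = 0.0   # time at which the current chunk is fully captured
    finished_at = 0.0  # time at which the engine finished the previous chunk
    remaining = utterance_ms
    while remaining > 0:
        size = min(chunk_ms, remaining)
        arrived_at += size
        # The engine starts a chunk once it has arrived and the
        # previous chunk is done, whichever happens later.
        start = max(arrived_at, finished_at)
        finished_at = start + size * rtf
        remaining -= size
    return finished_at - utterance_ms


utterance_ms, chunk_ms, rtf = 3000, 100, 0.1  # illustrative values only
print(f"batch:     {batch_latency_ms(utterance_ms, rtf):.0f} ms after end of speech")
print(f"streaming: {streaming_latency_ms(utterance_ms, chunk_ms, rtf):.0f} ms after end of speech")
```

Under these assumptions the batch path still has the whole utterance to process when the caller stops, while the streaming path only has the final chunk left, which is where the latency saving comes from.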

STT accuracy factors

  • Background noise
  • Speaker accent and dialect
  • Domain-specific vocabulary (medical, legal, technical)
  • Audio quality and sample rate
  • Microphone placement
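Of the factors above, sample rate is the easiest to check programmatically. The sketch below uses Python's standard `wave` module to read the properties of a WAV recording. The helper names and the 16 kHz threshold are assumptions for illustration: telephone audio is commonly 8 kHz, and the rate a given STT engine expects should be taken from its own documentation.

```python
import io
import wave


def wav_properties(data: bytes) -> dict:
    """Read basic format properties from WAV file bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def meets_sample_rate(props: dict, minimum_hz: int = 16000) -> bool:
    """Assumed threshold: flag audio below 16 kHz for review."""
    return props["sample_rate_hz"] >= minimum_hz


def make_silent_wav(sample_rate_hz: int, seconds: float = 1.0) -> bytes:
    """Build a mono 16-bit silent WAV in memory, used here as test input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * int(sample_rate_hz * seconds))
    return buffer.getvalue()


phone_audio = wav_properties(make_silent_wav(8000))
print(phone_audio["sample_rate_hz"], meets_sample_rate(phone_audio))
```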

Custom vocabulary: Most enterprise STT systems support custom vocabulary files that improve recognition accuracy for brand names, product terms, and industry jargon. TurboCall's pronunciation customization system works at both the STT and text-to-speech (TTS) layers.
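The effect of a custom vocabulary can be approximated with a small post-processing step. STT engines typically apply the vocabulary inside the recognizer itself, so the sketch below is only a stand-in that shows the idea: it snaps near-miss words in a finished transcript to the closest vocabulary term using Python's standard `difflib` module. The function name, the 0.8 similarity cutoff, and the word-by-word matching are assumptions for this example and do not describe TurboCall's system or any vendor's API.

```python
import difflib


def apply_custom_vocabulary(transcript: str, vocabulary: list[str],
                            cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a vocabulary term with that term.

    Simplification: matches single whitespace-separated words only and
    does not handle punctuation or multi-word terms.
    """
    # Map lowercase forms back to the preferred spelling and casing.
    lookup = {term.lower(): term for term in vocabulary}
    corrected = []
    for word in transcript.split():
        match = difflib.get_close_matches(
            word.lower(), list(lookup), n=1, cutoff=cutoff
        )
        corrected.append(lookup[match[0]] if match else word)
    return " ".join(corrected)


vocabulary = ["TurboCall", "Deepgram"]
print(apply_custom_vocabulary("welcome to turbocal support", vocabulary))
# prints: welcome to TurboCall support
```

A lower cutoff catches more misrecognitions but raises the risk of rewriting ordinary words, so the threshold would need tuning against real transcripts.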