Definition

STT (Speech-to-Text)

Technology that converts spoken audio into written text, often in real time.

Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), is the technology that converts a caller's spoken words into written text. It is the first stage in an AI voice agent's processing pipeline.

Modern STT engines use deep neural networks trained on hundreds of thousands of hours of speech data. Widely used engines and models include Deepgram, AssemblyAI, and OpenAI Whisper. State-of-the-art STT achieves over 95% word accuracy on standard English speech; accuracy drops for heavy accents, domain-specific jargon, and noisy audio environments.
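
Word accuracy is commonly reported as one minus the word error rate (WER): the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is a minimal illustration of that calculation; the function name and sample sentences are made up for this example, and it is not taken from any particular vendor's evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of eight.
reference = "please book an appointment for tuesday at ten"
hypothesis = "please book an appointment for tuesday at tin"

wer = word_error_rate(reference, hypothesis)
word_accuracy = 1.0 - wer
print(f"WER: {wer:.3f}, word accuracy: {word_accuracy:.3f}")
# prints "WER: 0.125, word accuracy: 0.875"
```

Real evaluations usually normalize punctuation and number formats before scoring, which this sketch leaves out.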

Streaming vs. batch STT

Streaming STT transcribes audio in real time as the caller speaks, so the language model can begin processing before the caller has finished their sentence. This reduces end-to-end latency.
Batch STT waits until the caller stops speaking before processing the full audio. It is simpler to implement but adds hundreds of milliseconds of latency.
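
The latency difference can be illustrated with a toy timing model. It assumes audio arrives in fixed-size chunks and that the recognizer needs a fixed amount of time per chunk; the chunk size, processing time, and function names are hypothetical values chosen for the example, not measurements of any real engine.

```python
def batch_delay_ms(num_chunks: int, process_ms_per_chunk: float) -> float:
    """Delay after end of speech when all audio is processed at the end."""
    return num_chunks * process_ms_per_chunk


def streaming_delay_ms(
    num_chunks: int, chunk_ms: float, process_ms_per_chunk: float
) -> float:
    """Delay after end of speech when each chunk is processed on arrival."""
    finish = 0.0
    for i in range(1, num_chunks + 1):
        arrival = i * chunk_ms        # time at which chunk i is fully received
        start = max(arrival, finish)  # wait for the previous chunk if needed
        finish = start + process_ms_per_chunk
    return finish - num_chunks * chunk_ms


# Hypothetical numbers: 3 seconds of speech in 100 ms chunks,
# with 20 ms of recognition work per chunk.
chunks, chunk_ms, work_ms = 30, 100.0, 20.0

print(batch_delay_ms(chunks, work_ms))               # prints 600.0
print(streaming_delay_ms(chunks, chunk_ms, work_ms))  # prints 20.0
```

In this model the batch delay grows with the length of the utterance, while the streaming delay stays close to the cost of the final chunk as long as processing keeps up with incoming audio. Real systems also spend time detecting that the caller has stopped speaking, which the model ignores.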

TurboCall uses streaming STT to achieve sub-400ms total response latency.

STT accuracy factors

Background noise
Speaker accent and dialect
Domain-specific vocabulary (medical, legal, technical)
Audio quality and sample rate
Microphone placement

Custom vocabulary: Most enterprise STT systems support custom vocabulary files that improve recognition accuracy for brand names, product terms, and industry jargon. TurboCall's pronunciation customization system works at both the STT and TTS layers.
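
One generic way a custom vocabulary can influence recognition is by boosting candidate transcripts that contain listed terms. The sketch below illustrates that idea on a hypothetical list of candidates with made-up scores; it does not describe how TurboCall or any specific STT vendor implements custom vocabulary, and the function name and boost value are assumptions for this example.

```python
from typing import Iterable, List, Tuple


def rescore_with_vocabulary(
    hypotheses: List[Tuple[str, float]],
    vocabulary: Iterable[str],
    boost: float = 1.0,
) -> str:
    """Return the candidate text with the highest vocabulary-boosted score."""
    vocab = {term.lower() for term in vocabulary}

    def boosted(item: Tuple[str, float]) -> float:
        text, score = item
        # Count how many words in the candidate are custom vocabulary terms.
        hits = sum(1 for word in text.lower().split() if word in vocab)
        return score + boost * hits

    return max(hypotheses, key=boosted)[0]


# Hypothetical candidates with made-up log-probability-style scores
# (higher is better).
n_best = [
    ("thanks for calling turbo call", -4.1),
    ("thanks for calling turbocall", -4.6),
]

print(rescore_with_vocabulary(n_best, []))             # prints "thanks for calling turbo call"
print(rescore_with_vocabulary(n_best, ["TurboCall"]))  # prints "thanks for calling turbocall"
```

Without the vocabulary entry the higher-scoring but wrong spelling wins; with it, the brand name is preferred.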

Related Terms

TTS
AI Voice Agent
Latency

Related Resources

TurboCall AI Voice Agent
How AI Voice Agents Work
