AI Infrastructure Stack

Voice AI Stack

Transcription, text-to-speech, and voice agents. Whether you are adding voice features to an existing product or building a standalone audio pipeline, these are the building blocks.

🎤 Speech-to-text 🔊 Text-to-speech 🎙️ Voice agents
[Illustration: a voice AI pipeline]

Things to keep in mind

  • For voice agents, the whole pipeline (STT → LLM → TTS) needs to stay under ~1 second total for the conversation to feel natural. Each component adds latency, so pick providers with low streaming latency and test the full round trip.
  • Self-hosting Whisper is viable for batch transcription, but managed APIs are hard to beat on streaming latency and accuracy. If real time is not a requirement, Whisper large-v3-turbo is a good self-hosted option.
  • Voice cloning and custom voices are available from ElevenLabs, Cartesia, and LMNT. If your product has a brand voice, this matters.
  • Open-source TTS has improved significantly. Kokoro-82M and Fish Speech are self-hostable with good quality. Worth evaluating if you need to control costs at scale.
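The ~1 second budget above is easiest to reason about as simple arithmetic over the stages. The per-stage numbers below are illustrative assumptions for the sketch, not measurements of any specific provider:

```python
# Illustrative latency budget for one voice agent turn. All numbers are
# assumptions for the sketch, not measurements of any specific provider.
BUDGET_MS = 1000  # target: whole round trip under ~1 second

stage_latency_ms = {
    "vad_endpointing": 200,   # deciding the user has stopped speaking
    "stt_final":       150,   # streaming STT emits the final transcript
    "llm_first_token": 350,   # LLM time-to-first-token
    "tts_first_audio": 150,   # TTS time-to-first-audio
    "network_misc":    100,   # transport overhead between components
}

total = sum(stage_latency_ms.values())
headroom = BUDGET_MS - total
print(f"total: {total} ms, headroom: {headroom} ms")
# total: 950 ms, headroom: 50 ms
```

The point of the exercise: a single slow stage (say, a 700 ms LLM time-to-first-token) blows the whole budget, which is why each component's streaming latency has to be tested individually.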

Frequently asked questions

What is the best speech-to-text API for real-time use?

Deepgram (Nova-3) is the most widely used API for real-time voice agents, with streaming latency under 300 ms and good noise robustness. AssemblyAI offers strong accuracy. For EU hosting, Gladia (France) is an option.
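Streaming STT APIs typically push a mix of revisable interim results and stable final results over a websocket. The sketch below uses hard-coded events and a hypothetical event shape, not any provider's actual schema, to show how a client usually separates the two:

```python
# Sketch of consuming a streaming STT feed. The events are hard-coded
# stand-ins for what a real streaming API would push over a websocket;
# the event shape is a hypothetical simplification, not any provider's schema.
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    text: str
    is_final: bool  # interim results may still be revised; finals are stable

def handle_stream(events):
    """Show interim results as they arrive; keep only the final transcripts."""
    finals = []
    for ev in events:
        if ev.is_final:
            finals.append(ev.text)
        else:
            print(f"(interim) {ev.text}")
    return " ".join(finals)

events = [
    TranscriptEvent("turn on", is_final=False),
    TranscriptEvent("turn on the", is_final=False),
    TranscriptEvent("turn on the lights", is_final=True),
]
print(handle_stream(events))  # turn on the lights
```

Interim results are what let a voice agent start the LLM call before the user's sentence is fully finalized, which is one of the main levers for staying inside the latency budget.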

Which text-to-speech API has the lowest latency?

Cartesia (Sonic 3) and ElevenLabs (Flash v2.5) both offer very low time-to-first-audio for voice agent use cases. For the highest voice quality without latency constraints, ElevenLabs Eleven v3 is a common choice.
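Time-to-first-audio (TTFA) is the metric that matters for agents: how long until the first playable chunk arrives, not how long until the whole utterance is synthesized. A minimal way to measure it, with a stub generator standing in for a real streaming TTS call (the 50 ms chunk delay is an arbitrary assumption for the demo):

```python
# Measuring time-to-first-audio (TTFA) for a streaming TTS call. synthesize()
# is a stub standing in for a real streaming API; the 0.05 s per-chunk delay
# is an arbitrary assumption for the demo.
import time

def synthesize(text):
    """Stub streaming TTS: yields fake audio chunks with simulated delay."""
    for word in text.split():
        time.sleep(0.05)        # pretend network + synthesis time
        yield b"\x00" * 320     # fake 20 ms PCM frame (16 kHz, 16-bit mono)

start = time.perf_counter()
stream = synthesize("hello from the voice agent")
first_chunk = next(stream)      # TTFA = time until this call returns
ttfa_ms = (time.perf_counter() - start) * 1000
print(f"time to first audio: {ttfa_ms:.0f} ms")
```

Against a real API the same pattern applies: start the timer at request time and stop it when the first chunk is yielded, since playback can begin while the rest streams in.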

How do you build a real-time voice agent?

A voice agent pipeline runs microphone → speech-to-text → LLM → text-to-speech → speaker, all in real time. LiveKit Agents is the most popular open-source framework for this. The target is under 1 second total round-trip latency.
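The data flow of one agent turn can be sketched with each stage stubbed out. Real frameworks such as LiveKit Agents run these stages concurrently over streams; this sequential sketch only shows the shape of the pipeline, and every function body is a placeholder:

```python
# Shape of one voice agent turn: STT → LLM → TTS. Each stage is a stub;
# a production framework streams between stages instead of running them
# sequentially like this.
def speech_to_text(audio: bytes) -> str:
    return "what's the weather"         # stub transcript

def llm_reply(transcript: str) -> str:
    return f"You asked: {transcript}."  # stub response

def text_to_speech(reply: str) -> bytes:
    return reply.encode("utf-8")        # stub "audio" payload

def agent_turn(mic_audio: bytes) -> bytes:
    transcript = speech_to_text(mic_audio)  # 1. transcribe the user
    reply = llm_reply(transcript)           # 2. generate a response
    return text_to_speech(reply)            # 3. speak it back

speaker_audio = agent_turn(b"...pcm frames...")
print(speaker_audio.decode("utf-8"))  # You asked: what's the weather.
```

Swapping any stub for a real provider leaves the structure unchanged, which is why STT, LLM, and TTS vendors are largely interchangeable inside agent frameworks.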

Can I self-host speech-to-text?

Yes. Whisper large-v3-turbo is 6x faster than the original large-v3 with only a slight accuracy drop. NVIDIA Parakeet is faster still for streaming. Self-hosting is practical for batch transcription, but managed APIs are hard to beat on streaming latency.

Last updated: April 2026
