🔊 Audio

Generative AI audio models produce realistic text-to-speech voices and music. They give applications a more human feel, improve accessibility, and enrich user experiences with audio interactions.

Read our comparison guide →

21 tools

Text to Speech 13

MusicGPT

AI audio API for generating songs, speech, and sound, with stem splitting, voice conversion, and mastering

Free Trial

SpeechifyAI

Text-to-speech API with sub-100ms streaming, zero-shot voice cloning, and 30+ locales

Free Trial

VoxCPM

Tokenizer-free open-source text-to-speech with voice cloning across 30 languages

Open Source Free Trial

Samtal

Swedish-hosted voice AI API with TTS, ASR, voice cloning, conversational agents, and ElevenLabs compatibility

PlayHT

AI voice generator acquired by Meta (July 2025) and shut down (December 2025). See alternatives for text-to-speech.

LemonFox

Affordable speech-to-text and text-to-speech API with 100+ language support

Free Trial

Rime AI

Text-to-speech API with 200+ voices, sub-200ms latency, and on-premise deployment

Free Trial

LMNT

Low-latency text-to-speech API built for real-time conversational AI

Free Trial

Fish Audio

Open-source text-to-speech and voice cloning with low latency in 13+ languages

Open Source Free Trial

Cartesia

Real-time voice AI with ultra-low latency text-to-speech and voice cloning in 40+ languages

Free Trial

Resemble AI

Generative Voice AI built for Enterprise.

Free Trial

ElevenLabs

Natural Text to Speech & AI Voice Generator.

Free Trial

Suno

Make a song with Suno.

Speech to Text 4

Gladia

Fast speech-to-text API with real-time transcription and speaker diarization

Free Trial

Speechmatics

Enterprise speech-to-text API supporting 55+ languages with high accuracy

Free Trial

AssemblyAI

Speech-to-text APIs with audio intelligence, speaker diarization, and real-time streaming

Free Trial

Deepgram

Build Voice AI into your apps.

Free Trial

Other 4

LiveKit Agents

Open-source framework for building real-time voice and multimodal AI agents over WebRTC

Open Source Free Trial

Hume AI

Empathic voice AI that detects and responds to human emotion in real time

Free Trial

Cekura

Testing and monitoring platform for AI voice and chat agents

Free Trial

OpenAI

API access to GPT, o-series reasoning, DALL-E, and Whisper models

Free Trial

Audio overview

AI audio tools cover text-to-speech synthesis, speech-to-text transcription, voice cloning, and music generation. These models have reached production quality, with synthetic voices that are nearly indistinguishable from human speech and transcription accuracy that rivals professional services.

The tools in this category serve diverse use cases: podcast production, accessibility features, real-time voice assistants, content localization, and audio content creation. Many offer APIs that integrate directly into applications, while others provide studio-like interfaces for content creators.

Key differentiators include voice quality, language coverage, real-time streaming capability, customization options (voice cloning, emotion control), and pricing models. Latency matters for conversational applications, while batch processing efficiency matters for content production workflows.

Related stacks

See how audio tools fit into a full infrastructure stack.

🎙️ Voice AI Stack

Frequently Asked Questions

What is AI text-to-speech?

AI text-to-speech (TTS) uses neural network models to convert written text into natural-sounding audio. Modern TTS systems produce voices with realistic intonation, pacing, and emotion, going far beyond the robotic-sounding synthesis of earlier generations.

How accurate is AI speech-to-text transcription?

Leading AI transcription models achieve word error rates below 5% for clear English audio, approaching human-level accuracy. Performance varies by language, accent, audio quality, and domain-specific vocabulary. Most providers offer custom vocabulary features to improve accuracy for specialized terms.
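
For readers who want to check a provider's accuracy on their own audio, word error rate can be computed from a reference transcript and the model's output. The snippet below is a minimal sketch, not any provider's official scoring method: it counts word-level substitutions, insertions, and deletions with a standard edit-distance table and divides by the number of reference words. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for illustration; production evaluations usually also strip punctuation and normalize numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference length.

    Hypothetical helper for illustration: normalization here is only
    lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over lazy dog"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In this example the hypothesis has one substituted word and one dropped word against a nine-word reference, so the score is 2/9, roughly 22%. A result under 5% on your own recordings would match the figure quoted above for clear English audio.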

Can I clone a voice with AI?

Yes, several platforms support voice cloning from audio samples. The amount of reference audio needed ranges from a few seconds to several minutes depending on the provider and desired quality. Voice cloning raises ethical considerations, so most platforms require consent verification and prohibit impersonation.
