Audio & Voice

Modern speech systems span three core areas: recognizing spoken language (ASR), synthesizing voices (TTS), and converting or cloning speaker identity. Articles here cover current model families (Whisper, Qwen3-ASR, NVIDIA Canary, F5-TTS, CosyVoice), latency budgets for real-time voice agents, fine-tuning infrastructure on rented GPUs, and multilingual deployment trade-offs. Each piece is a dense reference rather than a tutorial: architecture diagrams, commands, benchmarks, and integration gotchas.

Speech & Recognition

speech recognition - ASR models, transcription, pronunciation assessment

Text-to-Speech

tts models - TTS model comparison, latency benchmarks, multilingual support
voice cloning - Voice cloning, voice mixing, naturalness benchmarks
voice conversion - Voice conversion techniques and pipelines
audio generation - Audio generation models and workflows

Voice Applications

voice agent pipelines - Voice agent pipelines and frameworks for real-time applications
podcast processing - Podcast processing, transcription, and analysis
