Audio & Voice

Modern speech systems span three core areas: recognizing spoken language (ASR), synthesizing voices (TTS), and converting or cloning speaker identity. Articles here cover current model families (Whisper, Qwen3-ASR, NVIDIA Canary, F5-TTS, CosyVoice), latency budgets for real-time voice agents, fine-tuning infrastructure on rented GPUs, and multilingual deployment trade-offs. Each piece is a dense reference rather than a tutorial: architecture diagrams, commands, benchmarks, and integration gotchas.

Speech & Recognition

Text-to-Speech

Voice Applications