Audio & Voice
Modern speech systems span three core areas: recognizing spoken language (ASR), synthesizing speech from text (TTS), and converting or cloning speaker identity. Articles here cover current model families (Whisper, Qwen3-ASR, NVIDIA Canary, F5-TTS, CosyVoice), latency budgets for real-time voice agents, fine-tuning infrastructure on rented GPUs, and multilingual deployment trade-offs. Each piece is a dense reference, not a tutorial: architecture diagrams, commands, benchmarks, and integration gotchas.
Speech & Recognition
- speech recognition - ASR models, transcription, pronunciation assessment
Text-to-Speech
- tts models - TTS model comparison, latency benchmarks, multilingual support
- voice cloning - Voice cloning, voice mixing, naturalness benchmarks
- voice conversion - Voice conversion techniques and pipelines
- audio generation - Audio generation models and workflows
Voice Applications
- voice agent pipelines - Pipelines and frameworks for real-time voice agents
- podcast processing - Podcast processing, transcription, and analysis