Speech Recognition¶
Automatic Speech Recognition (ASR) converts spoken audio to text. As of early 2026, the field has moved past Whisper dominance - Qwen3-ASR, NVIDIA Canary-Qwen, and Voxtral Realtime all significantly outperform Whisper on accuracy and/or speed. Whisper large-v3 remains a reliable baseline but has not been updated.
Key Facts¶
- Qwen3-ASR (Jan 2026) is the new open-source SOTA for multilingual ASR - 52 languages, Apache 2.0, trained on ~40M hours
- MAI-Transcribe-1 (Apr 2026) is the new commercial SOTA - 3.8% WER on FLEURS top-25 languages, proprietary
- Whisper large-v3 remains viable but has not received a v4 update - the ecosystem has moved past it
- Streaming ASR is now production-ready with Voxtral Realtime (13 langs) and Nemotron Speech (EN, <24ms)
- Edge ASR viable at <1GB RAM with Moonshine v2 (27M params, 50ms) and Parakeet.cpp (pure C++)
- SALM (Speech-Augmented Language Model) architecture combines ASR encoder + LLM decoder for transcription + reasoning in one model
Model Comparison¶
| Model | Architecture | Params | Languages | Streaming | Key WER | Strength |
|---|---|---|---|---|---|---|
| Qwen3-ASR 1.7B | AuT + Qwen3 LLM | 1.7B | 52 | Yes (dynamic) | TEDLIUM 4.50 | Best open multilingual |
| MAI-Transcribe-1 | Enc-Dec Transformer | ? | 25 | No | FLEURS 3.8% | Best commercial accuracy |
| Canary-Qwen 2.5B | FastConformer + Qwen3 | 2.5B | EN (+LLM) | No | LS-Clean 1.6% | ASR + summarization/QA |
| Voxtral Realtime 4B | LM + encoder | 4B | 13 | Yes (80ms-2.4s) | Near offline quality | Configurable latency |
| Whisper large-v3 | Enc-Dec Transformer | 1.55B | 100+ | No | LS 2.7% | Most languages, stable |
| Parakeet-TDT-V3 | FastConformer-TDT | 0.6B | 25 EU | Yes | RTFx ~2000 | Fastest open EN ASR |
| Canary-1B-V2 | Multi-task CTC | 1B | 25 EU (incl. RU) | No | - | Translation built-in; 10x faster than 3x-larger models |
| Nemotron Speech | Cache-Aware Conformer | 0.6B | EN | Yes (<24ms) | 7.2-7.8% | 560 streams/H100 |
| Moonshine v2 Tiny | Ergodic Streaming Enc | 27M | EN | Yes (50ms) | On par with 6x-larger models | Edge, <500MB RAM |
| Fun-ASR-Nano | End-to-end | ? | 31 (Asia focus) | Yes | 93% in noise | Chinese dialects/accents |
| Whisper.cpp | GGML Whisper | 1.55B | 100+ | Chunked | ~3% | CPU/edge deployment |
Qwen3-ASR (Recommended Open-Source, 2026)¶
Best open-source multilingual ASR as of early 2026. Apache 2.0. Trained on ~40M hours of pseudo-labeled data.
Architecture:
- AuT (Audio Transformer) encoder -> learned projector -> Qwen3 LLM decoder
- 0.6B variant: AuT 180M params (hidden 896) + Qwen3 decoder
- 1.7B variant: AuT 300M params (hidden 1024) + Qwen3 decoder

Key benchmarks (1.7B):
- TEDLIUM (EN): 4.50 WER (vs GPT-4o 7.69, Whisper-lv3 6.84)
- WenetSpeech (ZH): 4.97 (vs GPT-4o 15.30, Whisper-lv3 9.86)
- Language ID (30 langs): 97.9% (vs Whisper-lv3 94.1%)

Runtime characteristics:
- Streaming: dynamic flash attention, window 1s-8s
- Latency (0.6B): 92ms TTFT, 2000x throughput at concurrency 128
- Long audio: up to 20 minutes per pass
- VRAM: ~2GB (0.6B), ~6GB (1.7B)
Also released: Qwen3-ForcedAligner-0.6B for text-speech alignment in 11 languages.
Whisper Usage (Legacy Baseline)¶
Basic Transcription¶
```python
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3", language="en")
print(result["text"])

for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```
Faster-Whisper (Production)¶
```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, vad_filter=True)

print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
Word-Level Timestamps¶
```python
# Faster-Whisper word timestamps
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"  {word.start:.2f}s - {word.end:.2f}s: '{word.word}' (p={word.probability:.2f})")
```
Streaming ASR¶
As of 2026, native streaming ASR is production-ready in multiple languages.
Native streaming models (recommended):

1. Voxtral Realtime 4B - 13 languages (incl. EN/ZH/RU), configurable latency
   - At 480ms: competitive with Whisper and ElevenLabs Scribe v2
   - At 960ms: surpasses both
   - At 2400ms: within 1% of offline quality
   - Open-source, Apache 2.0, ~8-10GB VRAM
2. Nemotron Speech Streaming 600M - English only, <24ms final transcripts
   - Cache-Aware FastConformer-RNNT, chunk sizes 80-1120ms
   - 560 concurrent streams on a single H100 at 320ms chunk
   - 3x throughput vs traditional buffered streaming
3. Qwen3-ASR 0.6B - 52 languages, streaming via dynamic flash attention
   - Window sizes 1s-8s, 92ms TTFT
   - Best multilingual streaming option
4. Moonshine v2 - 27M params (Tiny) up to Medium-size variants, edge streaming
   - 50ms (Tiny) to 258ms (Medium) latency
   - <1GB memory, designed for edge processors

Legacy approaches (still valid for Whisper):

5. Chunked Whisper - split into overlapping 30s chunks
   - Latency: chunk_duration + inference (2-5 sec)
6. VAD + Whisper - transcribe complete utterances
   - Latency: utterance_end + inference
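The chunked approach boils down to computing overlapping windows and transcribing each independently. A minimal sketch of the boundary arithmetic, with chunk and overlap lengths as illustrative defaults (the transcript-merging step is omitted):

```python
def chunk_spans(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) spans covering the audio with fixed overlap.

    Each chunk is transcribed independently; the overlapping text is then
    deduplicated, e.g. by matching the end of one transcript against the
    start of the next.
    """
    step = chunk_s - overlap_s
    spans, start = [], 0.0
    while start < total_s:
        spans.append((start, min(start + chunk_s, total_s)))
        start += step
    return spans
```

The overlap exists precisely to work around the 30s boundary problem noted in the Gotchas below: a word cut at one chunk's edge is intact in the next.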
VAD-Assisted Pipeline¶
```python
# Silero VAD + Faster-Whisper pipeline
import torch
from faster_whisper import WhisperModel

vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio('audio.wav', sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=16000)
# Each timestamp = a speech segment to transcribe independently
```
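Raw VAD output usually needs light post-processing before transcription: merging segments separated by tiny gaps and padding edges so word onsets aren't clipped. A small helper, assuming Silero's default sample-index dict format; the gap and padding values are illustrative starting points, not tuned:

```python
def merge_vad_segments(ts, sr=16000, max_gap_s=0.3, pad_s=0.1, total=None):
    """Merge near-adjacent VAD segments and pad their boundaries.

    `ts` is a list of {'start': sample, 'end': sample} dicts (Silero format).
    Merging avoids transcribing many tiny utterances; padding preserves
    word onsets/offsets that VAD tends to clip.
    """
    if not ts:
        return []
    max_gap, pad = int(max_gap_s * sr), int(pad_s * sr)
    merged = [dict(ts[0])]
    for seg in ts[1:]:
        if seg['start'] - merged[-1]['end'] <= max_gap:
            merged[-1]['end'] = seg['end']  # close gap: extend previous segment
        else:
            merged.append(dict(seg))
    for seg in merged:
        seg['start'] = max(0, seg['start'] - pad)
        seg['end'] = seg['end'] + pad if total is None else min(total, seg['end'] + pad)
    return merged
```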
WhisperX - Enhanced Alignment¶
WhisperX adds forced phoneme alignment for precise word boundaries, essential for speech editing and subtitle generation.
WhisperX pipeline:
1. Whisper transcription (batch mode for speed)
2. VAD-based segmentation (cut on silence)
3. Forced alignment via wav2vec2 / MMS alignment model
4. Speaker diarization (optional, via pyannote)
Output: word-level timestamps with <50ms accuracy
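Step 3 is what separates WhisperX from naive timing. The crude fallback it replaces simply spreads a segment's duration over its words in proportion to character length, which drifts whenever the speaking rate changes or the speaker pauses. A sketch of that fallback, for contrast:

```python
def naive_word_times(text: str, start: float, end: float):
    """Distribute a segment's time span over its words by character length.

    This is the fallback forced alignment replaces: it assumes a constant
    speaking rate, so errors grow with pauses and rate variation.
    """
    words = text.split()
    total_chars = sum(len(w) for w in words) or 1
    span, t, times = end - start, start, []
    for w in words:
        dur = span * len(w) / total_chars
        times.append((w, round(t, 3), round(t + dur, 3)))
        t += dur
    return times
```

Forced alignment instead matches phoneme posteriors from a wav2vec2/MMS model against the transcript frame by frame, which is how WhisperX reaches <50ms boundaries.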
Multilingual ASR Selection Guide¶
Best model per language combination (open-source, 2026):
| Target | Best Model | Fallback | Notes |
|---|---|---|---|
| ZH + EN + RU | Qwen3-ASR 1.7B | Voxtral Realtime | Qwen3 far outperforms Whisper on Chinese |
| EN only | Canary-Qwen 2.5B | Parakeet-TDT V3 | 1.6% WER + LLM reasoning mode |
| EN streaming | Nemotron Speech | Voxtral Realtime | <24ms, 560 streams/H100 |
| Multi streaming | Voxtral Realtime | Qwen3-ASR 0.6B | 13 langs, configurable latency |
| Chinese dialects | Fun-ASR-Nano | Qwen3-ASR | 7 dialects, 26 accents |
| Edge / mobile | Moonshine v2 Tiny | Sherpa-ONNX | 27M params, <500MB RAM |
| Commercial SOTA | MAI-Transcribe-1 | Qwen3.5-Omni ASR | 3.8% WER FLEURS, API only |
Language-Specific Notes¶
- Russian: Qwen3-ASR and Voxtral Realtime both support RU. Canary-1B-V2 includes RU in its 25 European languages. Whisper still viable for RU-only
- Chinese: Qwen3-ASR is the clear winner - WER 4.97 on WenetSpeech vs Whisper's 9.86. Fun-ASR-Nano excels at Chinese dialects (Wu, Cantonese, Hokkien, etc.)
- Code-switching (mixing languages): Qwen3-ASR and Fun-ASR-Nano handle free language switching better than Whisper
- Accented speech: Fun-ASR-Nano covers 26 Chinese regional accent varieties. For other languages, Whisper's diverse training data still provides reasonable accent robustness
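For routing in code, the selection guide reduces to a lookup table. A hypothetical helper mirroring the table above (the scope/mode keys are made up for illustration; only the model names come from this page):

```python
# Illustrative routing table; model names match the selection guide above.
RECOMMENDATIONS = {
    ("multilingual", "offline"): ("Qwen3-ASR 1.7B", "Voxtral Realtime"),
    ("multilingual", "streaming"): ("Voxtral Realtime", "Qwen3-ASR 0.6B"),
    ("en", "offline"): ("Canary-Qwen 2.5B", "Parakeet-TDT V3"),
    ("en", "streaming"): ("Nemotron Speech", "Voxtral Realtime"),
    ("edge", "streaming"): ("Moonshine v2 Tiny", "Sherpa-ONNX"),
}

def pick_asr(scope: str, mode: str):
    """Return (primary, fallback) model names for a deployment scenario.

    Whisper remains the catch-all default for unlisted combinations,
    since it covers the most languages.
    """
    return RECOMMENDATIONS.get((scope, mode), ("Whisper large-v3", "whisper.cpp"))
```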
On-Device / Edge ASR¶
Deployment options (2026):
- Moonshine v2 Tiny: 27M params, <500MB RAM, 50ms latency
- Parakeet.cpp: pure C++, no Python/ONNX, Metal GPU 27ms/10s audio
- Sherpa-ONNX: Android/iOS/HarmonyOS/RPi/RISC-V, 12 languages
- Qwen3-ASR 0.6B: MLX port for Apple Silicon (mlx-qwen3-asr)
- whisper.cpp: CPU/Metal, still viable for Whisper models
Edge deployment guidelines:
- INT4/INT8 quantization essential for mobile
- Streaming architectures reduce memory 40%+ vs standard transformers
- Tiny/Base variants: <500MB RAM target
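The RAM targets above follow from simple arithmetic: weight memory is parameter count times bytes per weight, plus runtime overhead. A back-of-envelope estimator, with the 1.3x overhead factor an assumption rather than a measured value:

```python
def model_ram_mb(params_m: float, bits: int, overhead: float = 1.3) -> float:
    """Rough RAM estimate for an on-device model.

    params_m: parameter count in millions; bits: weight precision
    (16 = fp16, 8 = INT8, 4 = INT4). The overhead multiplier for
    activations and runtime buffers is an assumed ballpark, not measured.
    """
    weight_mb = params_m * 1e6 * bits / 8 / 2**20
    return weight_mb * overhead
```

This is why Moonshine v2 Tiny (27M params) fits in tens of MB even at INT8, while Whisper large-v3 (1550M) needs several GB at fp16 before quantization.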
Pronunciation Assessment¶
Open-source pronunciation scoring remains a gap. Proprietary APIs (Azure Speech, SpeechSuper) dominate.
Best open-source path: Qwen3-ASR + forced alignment (Qwen3-ForcedAligner-0.6B) + goodness-of-pronunciation scoring.
Available resources:
- SpeechOcean762 - public dataset (5000 EN sentences with quality scores)
- Kaldi GOP - Goodness of Pronunciation metric, mature
- Qwen3-ForcedAligner-0.6B - text-speech alignment in 11 languages, building block for pronunciation scoring
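The Kaldi-style GOP metric compares the posterior of the canonical phone against the best competing phone, averaged over the frames forced alignment assigned to it. A toy sketch over per-frame phone posteriors (real implementations derive these from acoustic model likelihoods after alignment):

```python
import math

def gop_score(frame_posteriors, canonical_phone):
    """Goodness of Pronunciation for one aligned phone segment.

    frame_posteriors: list of {phone: probability} dicts, one per frame
    that forced alignment assigned to canonical_phone.
    GOP = mean over frames of log P(canonical) - log max P(any phone).
    0 means the canonical phone was the top hypothesis in every frame;
    more negative suggests mispronunciation.
    """
    total = 0.0
    for post in frame_posteriors:
        total += math.log(post[canonical_phone]) - math.log(max(post.values()))
    return total / len(frame_posteriors)
```

A per-phone threshold on this score (calibrated on data such as SpeechOcean762) is the usual basis for accept/reject pronunciation feedback.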
Gotchas¶
- Whisper hallucinates on silence - if input contains long silent segments, Whisper generates phantom text (repeated phrases, random words). Always apply VAD filtering before transcription (`vad_filter=True` in faster-whisper). Newer models (Qwen3-ASR, Voxtral) are more robust to this
- 30-second window boundary cuts words - Whisper's fixed 30s context window can split words at boundaries. Native streaming models (Voxtral Realtime, Nemotron) avoid this entirely with sliding-window architectures
- Language detection is per-file, not per-segment - Whisper detects language once for the entire audio. Qwen3-ASR has per-segment language ID with 97.9% accuracy across 30 languages
- Timestamp accuracy varies by speech rate - word timestamps from Whisper cross-attention are approximate (+-200ms). For precise alignment, use WhisperX forced alignment or Qwen3-ForcedAligner
- Whisper is no longer the best choice for most tasks - as of early 2026, Qwen3-ASR outperforms Whisper on accuracy for supported languages, Voxtral Realtime beats it on streaming, and Moonshine v2 beats it on edge. Whisper's advantage is only language count (100+ vs 52)
- VRAM vs accuracy tradeoff shifted - Qwen3-ASR 0.6B at ~2GB VRAM outperforms Whisper large-v3 at ~10GB on Chinese. Check benchmarks before defaulting to larger models
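The hallucination gotcha can also be mitigated after the fact: Whisper-style segments carry `no_speech_prob` and `avg_logprob` fields, and segments that score poorly on both are usually phantom text. A post-filter sketch with illustrative, untuned thresholds:

```python
def drop_hallucinated(segments, no_speech_thr=0.6, logprob_thr=-1.0):
    """Filter out likely hallucinated Whisper segments.

    Whisper-style segments expose no_speech_prob (probability the window
    was silence) and avg_logprob (decoder confidence). A segment that is
    both "probably silence" and low-confidence is usually phantom text.
    Thresholds are illustrative starting points, not tuned values.
    """
    return [
        s for s in segments
        if not (s["no_speech_prob"] > no_speech_thr and s["avg_logprob"] < logprob_thr)
    ]
```

Requiring both conditions avoids discarding quiet but genuine speech, which often has a high `no_speech_prob` alone.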
See Also¶
- podcast processing - full pipeline using ASR + diarization + editing
- tts models - TTS models that use ASR for quality evaluation
- voice cloning - WhisperX alignment used in speech editing workflows