Speech Recognition

Advanced

Automatic Speech Recognition (ASR) converts spoken audio to text. As of early 2026, the field has moved past Whisper dominance - Qwen3-ASR, NVIDIA Canary-Qwen, and Voxtral Realtime all significantly outperform Whisper on accuracy and/or speed. Whisper large-v3 remains a reliable baseline but has not been updated.

Key Facts

  • Qwen3-ASR (Jan 2026) is the new open-source SOTA for multilingual ASR - 52 languages, Apache 2.0, trained on ~40M hours
  • MAI-Transcribe-1 (Apr 2026) is the new commercial SOTA - 3.8% WER on FLEURS top-25 languages, proprietary
  • Whisper large-v3 remains viable but has not received a v4 update - the ecosystem has moved past it
  • Streaming ASR is now production-ready with Voxtral Realtime (13 langs) and Nemotron Speech (EN, <24ms)
  • Edge ASR viable at <1GB RAM with Moonshine v2 (27M params, 50ms) and Parakeet.cpp (pure C++)
  • SALM (Speech-Augmented Language Model) architecture combines ASR encoder + LLM decoder for transcription + reasoning in one model

Model Comparison

| Model | Architecture | Params | Languages | Streaming | Key WER / metric | Strength |
|---|---|---|---|---|---|---|
| Qwen3-ASR 1.7B | AuT + Qwen3 LLM | 1.7B | 52 | Yes (dynamic) | TEDLIUM 4.50 | Best open multilingual |
| MAI-Transcribe-1 | Enc-Dec Transformer | ? | 25 | No | FLEURS 3.8% | Best commercial accuracy |
| Canary-Qwen 2.5B | FastConformer + Qwen3 | 2.5B | EN (+LLM) | No | LS-Clean 1.6% | ASR + summarization/QA |
| Voxtral Realtime | 4B LM + encoder | 4B | 13 | Yes (80ms-2.4s) | Near offline quality | Configurable latency |
| Whisper large-v3 | Enc-Dec Transformer | 1.55B | 100+ | No | LS 2.7% | Most languages, stable |
| Parakeet-TDT-V3 | FastConformer-TDT | 0.6B | 25 EU | Yes | RTFx ~2000 | Fastest open EN ASR |
| Canary-1B-V2 | Multi-task CTC | 1B | 25 EU (incl. RU) | No | 10x faster than 3x larger | Translation built-in |
| Nemotron Speech | Cache-Aware Conformer | 0.6B | EN | Yes (<24ms) | 7.2-7.8% | 560 streams/H100 |
| Moonshine v2 Tiny | Ergodic Streaming Enc | 27M | EN | Yes (50ms) | On-par 6x size | Edge, <500MB RAM |
| Fun-ASR-Nano | End-to-end | ? | 31 (Asia focus) | Yes | 93% in noise | Chinese dialects/accents |
| whisper.cpp | GGML Whisper | 1.55B | 100+ | Chunked | ~3% | CPU/edge deployment |
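
Most figures in the table are word error rates (WER). As a reminder of what that number measures, here is a minimal sketch that computes WER as word-level edit distance divided by reference length. The function name is illustrative, and the only normalization applied is lowercasing, which is an assumption: published benchmarks typically also normalize punctuation and numerals, so this will not reproduce leaderboard numbers exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.3f}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words, which is exactly what hallucinated output on silence looks like.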

Qwen3-ASR is the best open-source multilingual ASR model as of early 2026. It is released under Apache 2.0 and was trained on ~40M hours of pseudo-labeled data.

Architecture:
  AuT (Audio Transformer) encoder -> learned projector -> Qwen3 LLM decoder

  0.6B variant: AuT 180M params (hidden 896) + Qwen3 decoder
  1.7B variant: AuT 300M params (hidden 1024) + Qwen3 decoder

Key benchmarks (1.7B):
  TEDLIUM (EN): 4.50 WER  (vs GPT-4o 7.69, Whisper-lv3 6.84)
  WenetSpeech (ZH): 4.97  (vs GPT-4o 15.30, Whisper-lv3 9.86)
  Language ID (30 langs): 97.9% (vs Whisper-lv3 94.1%)

Streaming: dynamic flash attention, window 1s-8s
Latency (0.6B): 92ms TTFT, 2000x real-time throughput at concurrency 128
Long audio: up to 20 minutes per pass
VRAM: ~2GB (0.6B), ~6GB (1.7B)

Also released: Qwen3-ForcedAligner-0.6B for text-speech alignment in 11 languages.
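
To put the throughput figure in practical terms, the sketch below converts a real-time factor into wall-clock time for a batch job. It assumes the 2000x figure is aggregate throughput summed over all 128 concurrent requests (2000 seconds of audio per second of compute); the function name and the 1000-hour workload are made up for illustration.

```python
def transcription_seconds(audio_hours: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given real-time factor."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_hours * 3600.0 / rtfx


# Hypothetical backlog: 1000 hours of audio at 2000x aggregate throughput.
batch_s = transcription_seconds(1000, 2000)
print(f"{batch_s / 60:.0f} min")  # 30 min

# Under the same assumption, each of the 128 concurrent streams runs at:
print(f"{2000 / 128:.1f}x real time per stream")
```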

Whisper Usage (Legacy Baseline)

Basic Transcription

import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3", language="en")

print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")

Faster-Whisper (Production)

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, vad_filter=True)

print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Word-Level Timestamps

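# Faster-Whisper word timestamps
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    # segment.words is populated only when word_timestamps=True
    for word in segment.words:
        print(f"  {word.start:.2f}s - {word.end:.2f}s: '{word.word}' (p={word.probability:.2f})")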

Streaming ASR

As of 2026, native streaming ASR is production-ready in multiple languages.

Native streaming models (recommended):
  1. Voxtral Realtime 4B - 13 languages (incl. EN/ZH/RU), configurable latency
     At 480ms: competitive with Whisper + ElevenLabs Scribe v2
     At 960ms: surpasses both
     At 2400ms: within 1% of offline quality
     Open-source, Apache 2.0, ~8-10GB VRAM

  2. Nemotron Speech Streaming 600M - English only, <24ms final transcripts
     Cache-Aware FastConformer-RNNT, chunk sizes 80-1120ms
     560 concurrent streams on single H100 at 320ms chunk
     3x throughput vs traditional buffered streaming

  3. Qwen3-ASR 0.6B - 52 languages, streaming via dynamic flash attention
     Window sizes 1s-8s, 92ms TTFT
     Best multilingual streaming option

  4. Moonshine v2 - Tiny (27M params) to Medium variants, edge streaming
     50ms (Tiny) to 258ms (Medium) latency
     <1GB memory, designed for edge processors

Legacy approaches (still valid for Whisper):
  5. Chunked Whisper - split into overlapping 30s chunks
     Latency: chunk_duration + inference (2-5 sec)

  6. VAD + Whisper - transcribe complete utterances
     Latency: utterance_end + inference
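
The chunked approach in item 5 comes down to computing overlapping windows over the audio timeline. A minimal sketch follows; the 5-second overlap is an assumed value (the text above only says the chunks overlap), and `chunk_windows` is a hypothetical helper rather than part of any Whisper package.

```python
def chunk_windows(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) windows in seconds covering [0, total_s] with overlap."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")
    step = chunk_s - overlap_s  # how far each new window advances
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # last window reached the end of the audio
            break
        start += step
    return windows


for start, end in chunk_windows(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

The overlap gives a word that straddles one window boundary a chance to appear whole in the next window. The transcripts of adjacent windows then need de-duplication in the overlapped region, which this sketch does not attempt.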

VAD-Assisted Pipeline

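# Silero VAD + Faster-Whisper pipeline
import torch
from faster_whisper import WhisperModel

vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio('audio.wav', sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=16000)

# Each timestamp is a speech segment (start/end in samples) to transcribe independently
asr_model = WhisperModel("large-v3", device="cuda", compute_type="float16")
for ts in speech_timestamps:
    chunk = wav[ts['start']:ts['end']].numpy()  # float32 mono at 16 kHz
    segments, _ = asr_model.transcribe(chunk)
    text = "".join(s.text for s in segments).strip()
    print(f"[{ts['start'] / 16000:.2f}s -> {ts['end'] / 16000:.2f}s] {text}")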

WhisperX - Enhanced Alignment

WhisperX adds forced phoneme alignment for precise word boundaries, essential for speech editing and subtitle generation.

WhisperX pipeline:
  1. Whisper transcription (batch mode for speed)
  2. VAD-based segmentation (cut on silence)
  3. Forced alignment via wav2vec2 / MMS alignment model
  4. Speaker diarization (optional, via pyannote)

Output: word-level timestamps with <50ms accuracy
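
Once words carry aligned timestamps, subtitle generation is mostly a grouping problem. The sketch below turns a list of aligned words into SRT cues, assuming each word is a dict with "word", "start", and "end" keys (seconds). The grouping limits (7 words per cue, a new cue after a 0.6 s pause) are arbitrary illustrative defaults, and both helpers are hypothetical, not part of WhisperX.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 7, max_gap_s: float = 0.6) -> str:
    """Group aligned words into cues; break on a long pause or when a cue is full."""
    cues, current = [], []
    for w in words:
        cue_full = len(current) >= max_words
        long_pause = bool(current) and w["start"] - current[-1]["end"] > max_gap_s
        if current and (cue_full or long_pause):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(w["word"].strip() for w in cue)
        span = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{i}\n{span}\n{text}")
    return "\n\n".join(blocks) + ("\n" if blocks else "")


demo_words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.0, "end": 2.5},
]
print(words_to_srt(demo_words))
```

Breaking on pauses rather than on a fixed word count alone keeps cue boundaries close to natural phrase boundaries, which is where precise word timing pays off.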

Multilingual ASR Selection Guide

Best model per language combination (open-source, 2026):

| Target | Best Model | Fallback | Notes |
|---|---|---|---|
| ZH + EN + RU | Qwen3-ASR 1.7B | Voxtral Realtime | Qwen3-ASR roughly halves Whisper's WER on Chinese (WenetSpeech 4.97 vs 9.86) |
| EN only | Canary-Qwen 2.5B | Parakeet-TDT V3 | 1.6% WER + LLM reasoning mode |
| EN streaming | Nemotron Speech | Voxtral Realtime | <24ms, 560 streams/H100 |
| Multi streaming | Voxtral Realtime | Qwen3-ASR 0.6B | 13 langs, configurable latency |
| Chinese dialects | Fun-ASR-Nano | Qwen3-ASR | 7 dialects, 26 accents |
| Edge / mobile | Moonshine v2 Tiny | Sherpa-ONNX | 27M params, <500MB RAM |
| Commercial SOTA | MAI-Transcribe-1 | Qwen3.5-Omni ASR | 3.8% WER FLEURS, API only |

Language-Specific Notes

  • Russian: Qwen3-ASR and Voxtral Realtime both support RU. Canary-1B-V2 includes RU in its 25 European languages. Whisper still viable for RU-only
  • Chinese: Qwen3-ASR is the clear winner - WER 4.97 on WenetSpeech vs Whisper's 9.86. Fun-ASR-Nano excels at Chinese dialects (Wu, Cantonese, Hokkien, etc.)
  • Code-switching (mixing languages): Qwen3-ASR and Fun-ASR-Nano handle free language switching better than Whisper
  • Accented speech: Fun-ASR-Nano covers 26 Chinese regional accent varieties. For other languages, Whisper's diverse training data still provides reasonable accent robustness

On-Device / Edge ASR

Deployment options (2026):
  Moonshine v2 Tiny:  27M params, <500MB RAM, 50ms latency
  Parakeet.cpp:       Pure C++, no Python/ONNX, Metal GPU 27ms/10s audio
  Sherpa-ONNX:        Android/iOS/HarmonyOS/RPi/RISC-V, 12 languages
  Qwen3-ASR 0.6B:     MLX port for Apple Silicon (mlx-qwen3-asr)
  whisper.cpp:         CPU/Metal, still viable for Whisper models

Edge deployment guidelines:
  - INT4/INT8 quantization essential for mobile
  - Streaming architectures reduce memory 40%+ vs standard transformers
  - Tiny/Base variants: <500MB RAM target
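
To see why quantization matters for the RAM targets above, the following back-of-the-envelope sketch estimates weight storage from parameter count and numeric precision. It covers weights only: activations, decoder caches, runtime overhead, and quantization scale factors are ignored, so real footprints will be higher. Parameter counts are taken from this page; the helper name is illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_mb(n_params: float, dtype: str) -> float:
    """Approximate weight storage in MB (10^6 bytes) for a given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


models = [
    ("Moonshine v2 Tiny", 27e6),
    ("Qwen3-ASR 0.6B", 0.6e9),
    ("Whisper large-v3", 1.55e9),
]
for name, n_params in models:
    sizes = ", ".join(
        f"{dtype}: {weight_memory_mb(n_params, dtype):,.0f} MB"
        for dtype in ("fp16", "int8", "int4")
    )
    print(f"{name:18s} {sizes}")
```

By this estimate, Whisper large-v3 weights alone exceed a 500 MB budget even at INT4, while a 27M-parameter model fits comfortably at any precision.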

Pronunciation Assessment

Open-source pronunciation scoring remains a gap. Proprietary APIs (Azure Speech, SpeechSuper) dominate.

Best open-source path: Qwen3-ASR + forced alignment (Qwen3-ForcedAligner-0.6B) + goodness-of-pronunciation scoring.

Available resources:
  • SpeechOcean762 - public dataset (5000 EN sentences with quality scores)
  • Kaldi GOP - Goodness of Pronunciation metric, mature
  • Qwen3-ForcedAligner-0.6B - text-speech alignment in 11 languages, building block for pronunciation scoring
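
As a rough illustration of the scoring step, the sketch below implements one common posterior-ratio form of Goodness of Pronunciation: for the frames a forced aligner assigns to a phoneme, average the log ratio between the posterior of the expected phoneme and the best-scoring phoneme. This is a simplified variant, not Kaldi's exact recipe, and the input format (one dict of phoneme posteriors per frame) is an assumption made for the example.

```python
import math


def gop_score(frame_posteriors, target: str) -> float:
    """Mean over frames of log(P(target) / max_q P(q)).

    0.0 means the expected phoneme was the top candidate in every frame;
    more negative values indicate a likely mispronunciation.
    """
    if not frame_posteriors:
        raise ValueError("need at least one aligned frame")
    floor = 1e-10  # avoids log(0) when the target has zero posterior
    total = 0.0
    for frame in frame_posteriors:
        p_target = max(frame.get(target, 0.0), floor)
        p_best = max(max(frame.values()), floor)
        total += math.log(p_target / p_best)
    return total / len(frame_posteriors)


# Hypothetical posteriors for three frames aligned to the phoneme "ae".
frames = [
    {"ae": 0.8, "eh": 0.2},
    {"ae": 0.6, "eh": 0.4},
    {"ae": 0.3, "eh": 0.7},
]
print(f"GOP(ae) = {gop_score(frames, 'ae'):.3f}")
print(f"GOP(eh) = {gop_score(frames, 'eh'):.3f}")
```

In a full pipeline the aligner supplies the frame ranges per phoneme and an acoustic model supplies the posteriors. Mapping raw GOP values to human-style quality scores would still need calibration against labeled data such as SpeechOcean762.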

Gotchas

  • Whisper hallucinates on silence - if input contains long silent segments, Whisper generates phantom text (repeated phrases, random words). Always apply VAD filtering before transcription (vad_filter=True in faster-whisper). Newer models (Qwen3-ASR, Voxtral) are more robust to this
  • 30-second window boundary cuts words - Whisper's fixed 30s context window can split words at boundaries. Native streaming models (Voxtral Realtime, Nemotron) avoid this entirely with sliding-window architectures
  • Language detection is per-file, not per-segment - Whisper detects the language once, from the first 30 seconds of audio, and applies it to the whole file. Qwen3-ASR has per-segment language ID with 97.9% accuracy across 30 languages
  • Timestamp accuracy varies by speech rate - word timestamps from Whisper cross-attention are approximate (±200ms). For precise alignment, use WhisperX forced alignment or Qwen3-ForcedAligner
  • Whisper is no longer the best choice for most tasks - as of early 2026, Qwen3-ASR outperforms Whisper on accuracy for supported languages, Voxtral Realtime beats it on streaming, and Moonshine v2 beats it on edge. Whisper's advantage is only language count (100+ vs 52)
  • VRAM vs accuracy tradeoff shifted - Qwen3-ASR 0.6B at ~2GB VRAM outperforms Whisper large-v3 at ~10GB on Chinese. Check benchmarks before defaulting to larger models
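
For pipelines that still run Whisper, a post-filter can catch some of the hallucinated output described in the first gotcha, complementing VAD. The sketch below drops segments that look like silence or degenerate repetition, using the per-segment `no_speech_prob`, `avg_logprob`, and `compression_ratio` fields that faster-whisper segments expose. The thresholds mirror Whisper's decoding defaults (0.6, -1.0, 2.4), but using them as a hard post-filter is an assumption of this example, and the demo segments are fabricated stand-ins.

```python
from types import SimpleNamespace


def drop_suspect_segments(segments, no_speech_max=0.6, logprob_min=-1.0,
                          compression_max=2.4):
    """Keep segments that look like real speech.

    A segment is dropped when it is probably silence (high no-speech
    probability combined with low decoding confidence) or when its text is
    highly compressible, which usually means a phrase repeated in a loop.
    """
    kept = []
    for seg in segments:
        silent = seg.no_speech_prob > no_speech_max and seg.avg_logprob < logprob_min
        repetitive = seg.compression_ratio > compression_max
        if not (silent or repetitive):
            kept.append(seg)
    return kept


# Stand-in segments with the same attribute names (values are made up).
demo_segments = [
    SimpleNamespace(text="Welcome to the meeting.", no_speech_prob=0.02,
                    avg_logprob=-0.21, compression_ratio=1.3),
    SimpleNamespace(text="Thank you. Thank you. Thank you.", no_speech_prob=0.10,
                    avg_logprob=-0.40, compression_ratio=3.1),
    SimpleNamespace(text="Subtitles by the community", no_speech_prob=0.91,
                    avg_logprob=-1.45, compression_ratio=1.1),
]
for seg in drop_suspect_segments(demo_segments):
    print(seg.text)
```

Tightening the thresholds trades missed hallucinations for dropped quiet or fast speech, so they are worth tuning on a sample of real recordings.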

See Also