Text-to-Speech Models¶

★★★★★ Intermediate

Modern TTS has moved from concatenative and parametric approaches to neural end-to-end models. Three architectures dominate: autoregressive (token-by-token generation, flexible but slow), non-autoregressive flow matching (a fixed number of steps, faster inference), and diffusion language models (text to multi-codebook acoustic tokens).

Key Facts¶

Zero-shot TTS = clone any voice from a short reference clip (3-30 sec) without fine-tuning
Flow matching (FM) models run a fixed number of function evaluations (NFE, typically 16-32), making inference time predictable
Autoregressive (AR) codec models generate audio tokens sequentially: better prosody, but variable latency
Diffusion-based TTS adds Gaussian noise to mel-spectrograms during training and learns to reverse the process
Diffusion Language Models (OmniVoice) map text directly to multi-codebook acoustic tokens - a third paradigm combining flow-based and AR strengths
Tokenizer-free architectures (VoxCPM2) skip discrete codec tokens entirely, preserving prosodic nuance that tokenization destroys
Most modern TTS outputs mel-spectrograms or audio codec tokens, then a vocoder (Vocos, HiFi-GAN) reconstructs waveform
Sampling rate matters: 16kHz (telephony), 24kHz (standard), 44.1/48kHz (studio quality)
End-to-end omni-models (Qwen3.5-Omni) combine ASR + TTS + LLM reasoning in a single architecture with real-time conversation support

Architecture Families¶

Flow Matching (F5-TTS lineage)¶

Non-autoregressive, fixed-step generation. Reference audio + text -> masked mel-spectrogram -> flow matching fills the mask.

Pipeline:
  reference_audio -> mel_spectrogram -> [MASK target region]
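  text -> text_embedding, padded to the target mel length
  (F5-TTS itself uses no phoneme aligner or duration predictor; only the total length is estimated up front)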
  flow_matching(masked_mel, text_embedding, NFE=32) -> full_mel
  vocoder(full_mel) -> waveform
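
The flow_matching step in the pipeline above amounts to integrating an ODE from noise to data in a fixed number of steps. The sketch below is a toy illustration, not F5-TTS code: `euler_sample`, `toy_velocity`, and `TARGET_MEL_FRAME` are hypothetical names, and the "network" is replaced by the exact velocity of a straight noise-to-target path. It shows why latency is predictable: the velocity function is called exactly NFE times regardless of content.

```python
import random


def euler_sample(velocity_fn, x0, nfe=32):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with `nfe` Euler steps."""
    x = list(x0)
    dt = 1.0 / nfe
    calls = 0
    for step in range(nfe):
        t = step * dt
        v = velocity_fn(x, t)  # one network forward pass in a real model
        calls += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, calls


# Toy stand-in for the trained network (assumption): the exact velocity of a
# straight path from the current point to a known target frame.
TARGET_MEL_FRAME = [0.2, -1.0, 0.5, 0.0]


def toy_velocity(x, t):
    return [(yi - xi) / (1.0 - t) for xi, yi in zip(x, TARGET_MEL_FRAME)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET_MEL_FRAME]
mel, calls = euler_sample(toy_velocity, noise, nfe=32)
```

A real model conditions the velocity on the text embedding and the unmasked reference frames, and many implementations use a non-uniform time grid or a higher-order solver instead of plain Euler.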

Key models:

F5-TTS - foundational flow-matching TTS, high quality, multilingual
LEMAS-TTS (0.3B) - F5-TTS based, 10 languages including Russian, 150K+ hours training data
CosyVoice 2 (Alibaba) - streaming-capable flow matching, Mandarin-focused

Autoregressive Codec¶

Generate discrete audio tokens left-to-right, then decode with codec decoder.

Pipeline:
  text -> LLM backbone -> audio_codec_tokens (e.g. EnCodec, DAC)
  codec_decoder(tokens) -> waveform
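
To make the contrast with fixed-step models concrete, here is a minimal sketch of an autoregressive token loop. Everything in it is hypothetical: `toy_next_token` stands in for the LLM backbone, `EOS` is an assumed end-of-speech token id, and the 1024-entry codebook is an arbitrary choice. The point is only that the number of sequential steps depends on when the model decides to stop, which is why latency varies.

```python
import random

EOS = 0  # hypothetical end-of-speech token id


def toy_next_token(history, rng):
    """Stand-in for the LLM backbone: returns the next codec token id."""
    # The chance of stopping grows with length, so utterance length varies.
    if rng.random() < min(1.0, len(history) / 50):
        return EOS
    return rng.randint(1, 1023)  # assumed codebook of 1024 ids, 0 reserved


def generate_tokens(seed, max_tokens=200):
    """Generate token ids one at a time until EOS or the length cap."""
    rng = random.Random(seed)
    tokens = []
    while len(tokens) < max_tokens:
        token = toy_next_token(tokens, rng)
        if token == EOS:
            break
        tokens.append(token)
    return tokens


# Each seed may stop after a different number of sequential steps.
lengths = [len(generate_tokens(seed)) for seed in range(5)]
```

The `max_tokens` cap is the usual guard against the looping failure mode: without it, a model that never emits the stop token would generate forever.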

Key models:

VoxCPM2 (2B) - tokenizer-free (listed here for its autoregressive backbone, although it skips discrete codec tokens), 4-stage pipeline (LocEnc -> TSLM -> RALM -> LocDiT), diffusion-AR hybrid on MiniCPM-4 backbone, 30+ languages, 48kHz native with built-in super-resolution, Apache 2.0
Orpheus TTS (Canopy AI) - LLM-native, emotional control via tags
Fish Speech - fast AR codec, good CJK support

Diffusion Language Model¶

Maps text directly to multi-codebook acoustic tokens via diffusion process - avoids single-codebook bottleneck of AR models.

Pipeline:
  text -> diffusion_language_model -> multi_codebook_acoustic_tokens
  codec_decoder(multi_codebook_tokens) -> waveform

  Key advantage: multi-codebook = richer acoustic representation
  than single-stream AR, while still being faster than flow matching

Key models:

OmniVoice (k2-fsa) - 600+ languages, RTF 0.025 (40x realtime), zero-shot cross-lingual cloning, noise-robust reference intake, Apache 2.0
Voxtral 4B (Mistral) - 4B params, 9 languages (EN/FR/DE/ES/NL/PT/IT/HI/AR, no RU/ZH), 70ms latency, cloning from 3 sec, captures accent and disfluencies, open weights
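
The RTF figure quoted for OmniVoice follows the usual definition of real-time factor: synthesis time divided by the duration of the audio produced. A small sketch of the arithmetic, with hypothetical helper names and made-up timings chosen to reproduce the 0.025 figure:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF < 1 means synthesis is faster than playback."""
    return synthesis_seconds / audio_seconds


def speedup(rtf):
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


# Illustrative numbers only: 0.25 s of compute for 10 s of audio.
rtf = real_time_factor(synthesis_seconds=0.25, audio_seconds=10.0)
times_realtime = speedup(rtf)  # roughly 40
```

RTF depends heavily on hardware, batch size, and utterance length, so published figures are only comparable when measured under the same conditions.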

End-to-End Omni-Models¶

Combined ASR + TTS + LLM reasoning in a single model. Input: audio or text. Output: text + speech simultaneously.

Architecture (Qwen3.5-Omni):
  Input audio/text -> Thinker module (reasoning, MoE)
  Thinker -> Talker module (speech synthesis)
  ARIA alignment: dynamic text-speech synchronization mid-generation

  Capabilities: real-time conversation, turn-taking detection,
  mid-conversation control (volume, tempo, emotion)

Key models:

Qwen3.5-Omni (Alibaba) - Thinker-Talker dual-module, 113 languages ASR / 36 languages TTS / 55 voices, real-time turn-taking, WER 6.24 on seed-hard (vs GPT-Audio 8.19), API only (not open-source), 3 variants: Plus/Flash/Light

Hybrid / Other¶

XTTS v2 (Coqui) - GPT-based + HiFi-GAN vocoder, proven multilingual (RU + EN), voice cloning from 6 sec
Kokoro-82M - tiny model (82M params), 100x realtime on CPU, English-focused
StyleTTS 2 - style diffusion + duration predictor, fast inference
Chatterbox (Resemble AI) - emotion control, cloning from short samples
Dia (Nari Labs) - dialogue-focused, multi-speaker generation

Model Comparison¶

Model	Params	Languages	Sample Rate	Architecture	Strength
OmniVoice	?	600+	?	Diffusion LM	Widest language coverage, fastest RTF
VoxCPM2	2B	30+	48kHz	Tokenizer-free Diff-AR	Studio quality, voice design
Qwen3.5-Omni	Large	36 TTS / 113 ASR	?	Omni (Thinker-Talker)	End-to-end conversation
LEMAS-TTS	0.3B	10	24kHz	Flow matching	Multilingual + word-level edit
Voxtral 4B	4B	9	?	Streaming AR	Low latency, accent capture
XTTS v2	~0.5B	17	24kHz	GPT + vocoder	Proven, stable
Kokoro-82M	82M	EN mainly	24kHz	StyleTTS-like	Speed, CPU-friendly
F5-TTS	0.3B	Multi	24kHz	Flow matching	Base for many forks
CosyVoice 2	~0.5B	Multi	22.05kHz	Flow matching	Streaming support
Fish Speech	~0.5B	Multi	44.1kHz	AR codec	Fast, good CJK

Inference Parameters¶

Common TTS parameters:
  NFE steps (flow matching): 16-32, higher = better quality, slower
  CFG (classifier-free guidance) strength: 1.0-3.0, higher = closer adherence to the text and reference, at some cost to naturalness
  Temperature: 0.5-1.0, controls variation in AR models
  Speed: 0.8-1.2x, pitch-preserving time stretch
  Top-k / Top-p: AR sampling parameters, same as LLM text generation
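
As a rough illustration of how temperature, top-k, and top-p interact when an AR model picks its next audio token, the sketch below applies them in sequence to a list of logits. `sample_token` is a hypothetical helper, not any library's API, and it simplifies one detail: implementations differ on whether probabilities are renormalized between the top-k and top-p filters.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Pick one token index from raw logits."""
    # Temperature: scale logits before softmax; lower = more deterministic.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among survivors; choices() normalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [0.1, 2.0, -1.0, 0.5]
token = sample_token(logits, rng=random.Random(0))
```

Setting `top_k=1` reduces this to greedy decoding, which tends to sound flat in speech models; that is one reason the temperature range above stays well away from zero.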

Speech Editing¶

LEMAS-Edit and VoiceCraft enable word-level editing - replace specific words in a recording without regenerating the entire utterance. Two backends:

Flow-matching backend: faster, 10 languages
AR codec backend: 7 languages, requires WhisperX + MMS alignment for word boundaries
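
Word-level editing needs the word boundaries translated into the region of the mel-spectrogram to regenerate. The sketch below shows that conversion under stated assumptions: the timestamps come from some external aligner, the 24 kHz sample rate and hop length of 256 are illustrative defaults rather than values confirmed for LEMAS-Edit or VoiceCraft, and both function names are hypothetical.

```python
import math


def word_span_to_frames(start_s, end_s, sample_rate=24000, hop_length=256, pad_s=0.0):
    """Convert a word's time span in seconds to a [first, last) mel frame range."""
    frames_per_second = sample_rate / hop_length
    first = max(0, math.floor((start_s - pad_s) * frames_per_second))
    last = math.ceil((end_s + pad_s) * frames_per_second)
    return first, last


def build_mask(num_frames, first, last):
    """True marks frames to regenerate; False marks frames kept from the recording."""
    last = min(last, num_frames)
    return [first <= i < last for i in range(num_frames)]


# Replace a word spoken between 1.0 s and 1.5 s in a 200-frame utterance.
first, last = word_span_to_frames(1.0, 1.5)
mask = build_mask(200, first, last)
```

In practice a small `pad_s` on each side helps avoid audible seams, since aligner boundaries are rarely exact and coarticulation spills across word edges.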

Evaluation Metrics¶

MOS (Mean Opinion Score) - human rating 1-5, gold standard but expensive
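MUSHRA - listeners rate several systems side by side against a hidden reference and anchor; more sensitive than MOS for comparing models
WER (Word Error Rate) - transcribe the generated speech with an ASR model and compare to the input text; measures intelligibility
Speaker similarity - cosine similarity of speaker embeddings between reference and generated
PESQ / POLQA - objective perceptual quality scores that correlate with MOS; both need a clean reference recording of the same utterance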
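
The two automatic metrics in the list above that are easiest to compute yourself are WER and speaker similarity. A minimal sketch follows, assuming you already have an ASR transcript of the generated audio and speaker embeddings from some encoder; both of those steps are outside this example, and the function names are illustrative.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
sim = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
```

Real evaluations normalize text more aggressively before scoring (punctuation, numerals, casing), and the WER you measure reflects the ASR model's own errors as well as the TTS model's.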

Voice Design (Text-Described Voice Creation)¶

Some models can create a voice from a text description, no reference audio needed:

VoxCPM2 voice design:
  Input: "A warm female voice, age 30, slight accent, cheerful"
  -> Voice design module generates synthetic speaker embedding
  -> TTS generates speech with described characteristics

OmniVoice attribute control:
  Attributes: gender, age, pitch, dialect, whisper, speaking rate
  -> Fine-grained parametric control over synthetic voice
  -> Can combine: "young female, whispering, fast pace"

Gotchas¶

Reference audio quality is everything - noisy, reverberant, or music-contaminated references produce poor clones regardless of model quality. Apply UVR5 or DeepFilterNet denoising before use. Exception: OmniVoice is explicitly noise-robust and accepts noisy samples
Multilingual zero-shot has uneven quality - most models excel at English/Chinese but degrade on lower-resource languages. OmniVoice (600+ languages) and VoxCPM2 (30+ languages) have the widest coverage, but always test your target language specifically
Flow matching NFE tradeoff is non-linear - going from 16 to 32 steps improves quality noticeably, but 32 to 64 shows diminishing returns while doubling latency
Codec-based models hallucinate on long inputs - AR models can loop, stutter, or skip words on texts longer than about 500 characters. Split long texts into sentence-level chunks
API-only models lock you in - Qwen3.5-Omni offers the best end-to-end experience but is API-only. Plan for fallback to open models (OmniVoice, VoxCPM2) if cost or availability becomes an issue
Tokenizer-free vs codec tradeoff - tokenizer-free models (VoxCPM2) preserve prosodic detail that codec-based models lose during tokenization, but they tend to be larger and slower
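
The advice above to split long texts into sentence-level chunks can be sketched as follows. This is a minimal example with a hypothetical `split_for_tts` helper; the regex only recognizes `.`, `!`, and `?` as sentence ends, so abbreviations and scripts without those marks (including CJK punctuation) would need a proper sentence segmenter.

```python
import re


def split_for_tts(text, max_chars=500):
    """Group whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        # A single sentence longer than max_chars is kept whole, not cut.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = split_for_tts("One. Two! Three?", max_chars=9)
```

Synthesize each chunk separately and concatenate the audio, ideally with the same reference clip and a short silence between chunks so the voice and pacing stay consistent.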