Skip to content

Voice Cloning

Intermediate

Voice cloning reproduces a target speaker's voice characteristics (timbre, pitch, rhythm) from a reference audio sample. Modern zero-shot approaches need only 5-30 seconds of clean audio, eliminating the need for hours of training data per speaker.

Key Facts

  • Zero-shot cloning extracts a speaker embedding from reference audio and conditions TTS generation on it
  • Speaker embeddings are typically 256-512 dimensional vectors from models like ECAPA-TDNN, WavLM, or Resemblyzer
  • Cross-lingual cloning preserves voice identity while generating speech in a language the speaker never recorded
  • Fine-tuned cloning (speaker adaptation) produces higher quality but requires 1-5 hours of target audio + GPU training
  • Quality depends heavily on: reference audio cleanliness, recording conditions, speaker consistency in reference

Approaches

Zero-Shot (No Training)

Extract speaker embedding from reference, inject into TTS decoder. No fine-tuning required.

Reference audio (5-30 sec)
  -> Speaker encoder (ECAPA-TDNN / WavLM)
  -> Speaker embedding (256-512 dim vector)
  -> TTS decoder conditioned on embedding + text
  -> Generated speech in target voice

Models supporting zero-shot cloning: - LEMAS-TTS, F5-TTS, CosyVoice 2 (flow matching) - XTTS v2, VoxCPM2, Fish Speech (autoregressive) - OmniVoice - 600+ languages, tag control for emotions

Speaker Adaptation (Fine-Tuning)

Fine-tune a base TTS model on target speaker data. Higher quality, but requires compute.

Base TTS model + 1-5 hours of target speaker audio
  -> Fine-tune decoder layers (freeze encoder)
  -> Speaker-specific model checkpoint
  -> Generate with text input only (no reference needed at inference)

Typically 500-2000 training steps, 10-30 minutes on a single GPU.

Voice Design (Text-Described)

Multiple models now support generating a voice from a text description rather than audio reference.

VoxCPM2 voice design:
  "A warm, deep male voice with slight British accent, age 40-50"
  -> Voice design module (built into tokenizer-free pipeline)
  -> Synthetic speaker embedding
  -> TTS generation at 48kHz studio quality

OmniVoice attribute-based design:
  Attributes: gender=male, age=40-50, pitch=low, dialect=british, style=warm
  -> Parametric attribute control (not free-text)
  -> Can specify: whisper, shouting, specific dialect
  -> Works across 600+ languages

Comparison: - VoxCPM2: free-text description, more flexible, 30+ languages - OmniVoice: attribute-based, more predictable results, 600+ languages

Cross-Lingual Cloning

The speaker's voice identity transfers across languages they never spoke. Quality varies by language distance.

Quality hierarchy for cross-lingual transfer:
  Same language family (EN -> DE):     High quality
  Same script (EN -> FR):              Good quality  
  Different family (EN -> ZH):         Moderate quality
  Low-resource target (EN -> Swahili): Lower quality

Key challenge: prosody patterns are language-specific. A cloned English speaker in Mandarin may sound foreign in rhythm even if timbre is perfect. VoxCPM2 and OmniVoice specifically optimize for this.

Cloning from minimal reference (2026 state of the art):

Model Min Reference Languages Cross-Lingual Notes
OmniVoice Few seconds 600+ Yes, zero-shot Noise-robust intake
VoxCPM2 Short clip 30+ Yes Diffusion-AR preserves emotion
Voxtral 4B 3 seconds 9 Limited Captures accent + disfluencies
Qwen3.5-Omni Voice upload 36 TTS Yes, 20 languages tested WER 1.87 across languages, cosine sim 0.79

Qwen3.5-Omni achieves the best measured cloning fidelity (WER 1.87 across 20 languages) but is API-only.

Emotion and Style Control

Modern cloning models support tag-based emotion injection:

Tag-controlled generation (OmniVoice, Orpheus):
  Input: "I can't believe it! [surprise] This is amazing [joy]"
  Tags: [laugh], [surprise], [sadness], [whisper], [shouting]

  The model generates speech with corresponding emotional inflection
  at tagged positions, while maintaining target speaker's voice identity.

Reference Audio Preparation

Optimal reference characteristics:
  Duration:    10-30 seconds (sweet spot for most models)
  Content:     Natural speech, varied intonation, no music/effects
  Quality:     Clean recording, low noise floor, no reverb
  Format:      WAV 16-bit, 24kHz+ sample rate, mono
  Speaker:     Single speaker, consistent voice (no whispering then shouting)

Preprocessing pipeline:
  1. Source separation (UVR5 / Demucs) - remove background music
  2. Noise reduction (DeepFilterNet / RNNoise) - reduce ambient noise
  3. Trim silence (ffmpeg silenceremove) - remove dead air
  4. Normalize loudness (ffmpeg loudnorm) - consistent level
  5. Resample to model's expected rate (usually 24kHz)

Voice Mixing: Creating Novel Voice Identities

Modern TTS models expose speaker embeddings as vectors. Mixing embeddings creates voices that don't correspond to any real person.

Linear Interpolation (Cartesia Sonic)

Cartesia Sonic uses 192-dimensional speaker embeddings (floats -1 to 1):

import numpy as np

def mix_voices(embedding_a: np.ndarray, embedding_b: np.ndarray, 
               weight_a: float = 0.5) -> np.ndarray:
    """Linear interpolation between two speaker embeddings."""
    weight_b = 1.0 - weight_a
    return weight_a * embedding_a + weight_b * embedding_b

# Via Cartesia API:
import requests

new_voice = requests.post("https://api.cartesia.ai/voices", json={
    "name": "Custom Mix",
    "embedding": mix_voices(voice_a_embedding, voice_b_embedding, 0.6).tolist()
}).json()

Warning: Perception does NOT change linearly with interpolation coefficient. A 50/50 mix may require non-equal weights (e.g. 0.7/0.3) to achieve perceptually balanced output. Use prototype embeddings (speed/emotion controls) to fine-tune characteristics.

Spherical Linear Interpolation (Slerp) - Research Method

Slerp preserves vector magnitude on the high-dimensional hypersphere, minimizing artifacts compared to linear interpolation. Used in VoxMorph (ICASSP 2026):

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two embedding vectors."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)

    dot = np.clip(np.dot(a_norm, b_norm), -1.0, 1.0)
    theta = np.arccos(dot)

    if np.abs(theta) < 1e-6:
        return a  # vectors nearly identical

    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

VoxMorph advantage: Disentangles prosody embedding (rhythm, pitch) from timbre embedding (vocal tract, formants). Can mix speaking style from speaker A with voice texture from speaker B independently. Input: 5 seconds per source speaker, zero-shot.

INSIDE Method (Novel Identity Generation)

Selects pairs of nearby speaker embeddings and computes intermediate identities via Slerp. Creates voices that "fill gaps" in embedding space between real speakers. Validated: +5.24% speaker verification improvement, +13.44% gender classification gain.

What Makes Synthetic Voices Sound Unnatural

The most common naturalness failures and which models address them:

Problem Symptom Best Solutions
Flat prosody Monotone regardless of content Sesame CSM (context-aware), Fish S2 (15K+ tags)
Missing micro-pauses Uniform word spacing Sesame CSM (natural umms/uhhs), VoxCPM2 (tokenizer-free)
Absent breathing No breath intakes between sentences Orpheus TTS (<sigh>, natural breathing)
Emotion flatness Same intonation for joke vs tragedy Fish S2 Pro (sub-word emotion), IndexTTS-2.5 (8D vector)
Pitch range compression Narrower pitch than humans VoxCPM2 48kHz (no tokenization loss), CosyVoice 3 (RL)
Context ignorance Words read, not understood Sesame CSM (conversational context modeling)
Robotic transitions "Concatenated" word boundaries RL-aligned models (Fish S2, CosyVoice 3)

Practical workarounds: - Inject commas, ellipses, micro-pause markers to guide prosody - Use model-specific tags ([laugh], [excited], [whisper]) for expression - Provide conversational history for context-aware models (Sesame CSM) - RL-aligned post-training improves naturalness without architecture changes

Naturalness Benchmarks (2026)

Model MOS Win Rate Notes
Sesame CSM 4.7 near-human (without context) English only
Cartesia Sonic 2 4.7 ~90ms TTFB
Orpheus TTS 4.6 Llama-3B based
Fish S2 Pro (EN) 4.50 exp / 4.21 nat Bradley-Terry 3.07 #1 vs all major competitors
Fish S2 Pro (ZH) 4.94 exp / 4.40 nat Strongest in Chinese
Human speech ~4.5-4.8 Reference Varies by recording

Key trend: Quality gap between top open-source and commercial TTS shrunk from ~1.0 MOS (2023) to 0.1-0.2 MOS (2026). Fish S2 Pro beats gpt-4o-mini-tts on EmergentTTS-Eval (81.88% win rate). Voxtral beats ElevenLabs Flash v2.5 in blind tests.

Safety and Ethics

  • Consent: voice cloning without speaker consent is illegal in many jurisdictions
  • Deepfake detection: generated speech can be detected via spectral analysis, watermarking (AudioSeal by Meta), or artifact detection
  • Watermarking: some models (LEMAS, CosyVoice) embed inaudible watermarks in generated audio
  • Rate limiting: production deployments should limit cloning requests per user/API key
  • Speaker verification: pair cloning APIs with speaker verification to confirm the uploader is the voice owner

Gotchas

  • Short references degrade on rare phonemes - a 5-second clip may not contain enough phonetic diversity. The model invents missing sounds, sometimes inconsistently. Use 15-30 sec references with diverse content. Exception: Voxtral 4B is specifically designed for 3-second references
  • Background noise in reference transfers to output - even small amounts of room reverb or air conditioning hum get encoded in the speaker embedding and appear in all generated speech. Always denoise first. Exception: OmniVoice explicitly handles noisy reference samples
  • Cross-lingual accent bleed - if the reference audio is in English, generated Chinese speech often has an English-accented rhythm. Providing reference audio in the target language (even a few seconds) dramatically improves results
  • API-only cloning = vendor lock-in - Qwen3.5-Omni has the best measured cloning quality but is API-only. Build fallback paths to open models (OmniVoice, VoxCPM2) for production systems

See Also