Skip to content

Voice Conversion

Intermediate

Voice conversion (VC) transforms the speaker identity in existing audio while preserving linguistic content, prosody, and timing. Unlike voice cloning (which generates new speech from text), VC works on existing recordings - the input is audio, not text.

Key Facts

  • Voice conversion operates on existing audio: input speech -> output speech with different voice identity
  • Key distinction from TTS cloning: VC preserves the original timing, intonation, and emotion
  • RVC (Retrieval-based Voice Conversion) is the dominant open-source approach, based on VITS + retrieval
  • Real-time VC achievable at <20ms latency on GPU, enabling live voice changing
  • Singing voice conversion (SVC) is a major application - apply any singer's voice to a recording
  • VC models typically need 10-30 minutes of target speaker training data

Architecture

RVC (Retrieval-based Voice Conversion)

RVC pipeline:
  Input audio
    -> Pitch extraction (RMVPE / CREPE / Harvest)
    -> Content encoder (HuBERT / ContentVec) -> content features
    -> FAISS index retrieval -> find similar features from target speaker
    -> Blend retrieved features with encoder output (index_rate: 0.0-1.0)
    -> VITS decoder (conditioned on target speaker embedding)
    -> Pitch-shifted, voice-converted output

Training: 10-30 min of target speaker audio, ~20 min on GPU
Inference: real-time capable on modern GPU

SO-VITS-SVC

Specialized for singing voice conversion.

SO-VITS-SVC pipeline:
  Input singing audio
    -> Content encoder (ContentVec)
    -> F0 (pitch) extraction and explicit pitch curve
    -> VITS decoder with speaker conditioning
    -> Output: same melody, different voice

Key difference from RVC: explicit pitch modeling makes it
better for singing where pitch accuracy is critical.
Training: 1-4 hours of target singer audio recommended.

Real-Time Voice Conversion

Requirements for real-time VC (<50ms latency):
  - GPU: RTX 3060+ for comfortable real-time
  - Buffer size: 128-512 samples (3-11ms at 44.1kHz)
  - Lookahead: minimal (adds latency but improves quality)
  - Model: RVC with ONNX export or TensorRT optimization

Practical latency breakdown:
  Audio capture:     ~5ms (WASAPI/ASIO)
  Feature extraction: ~3ms (HuBERT on GPU)
  Conversion:        ~5ms (VITS decoder on GPU)
  Audio output:      ~5ms (WASAPI/ASIO)
  Total:             ~18ms (imperceptible)

Streaming Pipeline

# Conceptual real-time VC pipeline
import sounddevice as sd
import numpy as np

BLOCK_SIZE = 512  # ~11ms at 44.1kHz
SAMPLE_RATE = 44100

def callback(indata, outdata, frames, time, status):
    # indata: captured microphone audio
    audio_chunk = indata[:, 0]

    # Extract content features (HuBERT)
    features = content_encoder(audio_chunk)

    # Extract pitch (RMVPE)
    f0 = pitch_extractor(audio_chunk)

    # Convert voice identity
    converted = vits_decoder(features, f0, target_speaker_embedding)

    outdata[:, 0] = converted

stream = sd.Stream(
    samplerate=SAMPLE_RATE,
    blocksize=BLOCK_SIZE,
    channels=1,
    callback=callback
)

Voice Conversion vs Voice Cloning

Aspect Voice Conversion Voice Cloning (TTS)
Input Audio recording Text
Preserves Timing, prosody, emotion Nothing (generates from scratch)
Output timing Same as input Model-determined
Use case Change speaker in existing audio Generate new speech
Training data 10-30 min target speaker 5 sec (zero-shot) to hours
Real-time Yes (with optimization) Depends on model
Singing Excellent (SVC models) Poor (TTS models can't sing well)

Training a VC Model

RVC training workflow:
  1. Collect 10-30 min of clean target speaker audio
  2. Preprocess: denoise, normalize, segment into 5-15s clips
  3. Extract features: HuBERT content features + pitch (F0)
  4. Build FAISS index from target speaker features
  5. Train VITS decoder: 200-500 epochs, ~20 min on RTX 3090
  6. Test: convert a sample, check for artifacts

Quality tips:
  - More diverse training data (different emotions, speeds) = better model
  - Clean audio matters more than quantity
  - Pitch extraction method: RMVPE > CREPE > Harvest (quality vs speed)

Applications

  • Singing voice: apply any singer's timbre to your own recordings
  • Dubbing: replace voice actor identity while keeping original performance timing
  • Privacy: anonymize speaker identity in recordings
  • Accessibility: convert speech to a voice the listener finds easier to understand
  • Live streaming: real-time voice changing for content creators

Gotchas

  • Pitch range mismatch causes artifacts - converting a deep male voice to a high female voice (or vice versa) produces warbling and robotic artifacts at pitch extremes. Keep source and target within 1 octave of each other, or use explicit pitch shifting as a preprocessing step
  • RVC index_rate is a critical but poorly documented parameter - at 0.0, no retrieval is used (pure encoder output, less like target). At 1.0, full retrieval (more like target but can sound choppy). Sweet spot is typically 0.3-0.6 depending on how different source and target voices are
  • Background music bleeds through conversion - VC models are trained on clean speech. Music/noise in input passes through unchanged or creates artifacts. Always separate vocals first (Demucs/UVR5), convert, then remix

See Also

  • voice cloning - text-to-speech based voice cloning, complementary approach
  • tts models - TTS architectures that share components with VC (VITS, HuBERT)
  • audio generation - music generation, relevant for SVC applications