Voice Conversion

★★★★★ Intermediate

Voice conversion (VC) transforms the speaker identity in existing audio while preserving linguistic content, prosody, and timing. Unlike voice cloning (which generates new speech from text), VC works on existing recordings - the input is audio, not text.

Key Facts

Voice conversion operates on existing audio: input speech -> output speech with different voice identity
Key distinction from TTS cloning: VC preserves the original timing, intonation, and emotion
RVC (Retrieval-based Voice Conversion) is the dominant open-source approach, based on VITS + retrieval
Real-time VC is achievable at roughly 20ms end-to-end latency on a GPU with a tuned audio setup, enabling live voice changing
Singing voice conversion (SVC) is a major application - apply any singer's voice to a recording
RVC models typically need 10-30 minutes of target speaker training data; SO-VITS-SVC recommends more (1-4 hours)

Architecture

RVC (Retrieval-based Voice Conversion)

RVC pipeline:
  Input audio
    -> Pitch extraction (RMVPE / CREPE / Harvest)
    -> Content encoder (HuBERT / ContentVec) -> content features
    -> FAISS index retrieval -> find similar features from target speaker
    -> Blend retrieved features with encoder output (index_rate: 0.0-1.0)
    -> VITS decoder (conditioned on target speaker embedding)
    -> Pitch-shifted, voice-converted output

Training: 10-30 min of target speaker audio; about 20 min of training time on an RTX 3090
Inference: real-time capable on modern GPU
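
The retrieval-and-blend step in the pipeline above can be sketched as follows. This is a minimal illustration, not RVC's actual code: a brute-force NumPy nearest-neighbour search stands in for the FAISS index, and the function name `blend_with_retrieval`, the neighbour count `k`, and the plain averaging of neighbours are assumptions made for the example.

```python
import numpy as np


def blend_with_retrieval(features, target_bank, index_rate=0.5, k=4):
    """Blend encoder output with features retrieved from the target speaker.

    features:    (T, D) content features of the input audio
    target_bank: (N, D) content features collected from the target speaker
    index_rate:  0.0 = encoder output only, 1.0 = retrieved features only
    """
    # Squared Euclidean distance from every input frame to every bank entry
    diffs = features[:, None, :] - target_bank[None, :, :]
    dists = (diffs ** 2).sum(axis=2)

    # Indices of the k closest bank entries per frame
    # (brute-force stand-in for a FAISS index search)
    nearest = np.argsort(dists, axis=1)[:, :k]

    # One retrieved vector per frame: the mean of its neighbours
    retrieved = target_bank[nearest].mean(axis=1)

    # Linear blend controlled by index_rate
    return index_rate * retrieved + (1.0 - index_rate) * features
```

With `index_rate=0.0` the function returns the encoder features unchanged, and with `index_rate=1.0` it returns only the retrieved target-speaker features, which matches the two extremes of the parameter described in this document.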

SO-VITS-SVC

Specialized for singing voice conversion.

SO-VITS-SVC pipeline:
  Input singing audio
    -> Content encoder (ContentVec)
    -> F0 (pitch) extraction and explicit pitch curve
    -> VITS decoder with speaker conditioning
    -> Output: same melody, different voice

Key difference from RVC: explicit pitch modeling makes it
better for singing where pitch accuracy is critical.
Training: 1-4 hours of target singer audio recommended.

Real-Time Voice Conversion

Requirements for real-time VC (<50ms latency):
  - GPU: RTX 3060+ for comfortable real-time
  - Buffer size: 128-512 samples (2.9-11.6ms at 44.1kHz)
  - Lookahead: keep it minimal (more lookahead improves quality but adds latency)
  - Model: RVC with ONNX export or TensorRT optimization

Practical latency breakdown:
  Audio capture:     ~5ms (WASAPI/ASIO)
  Feature extraction: ~3ms (HuBERT on GPU)
  Conversion:        ~5ms (VITS decoder on GPU)
  Audio output:      ~5ms (WASAPI/ASIO)
  Total:             ~18ms (well under the 50ms target)
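
The arithmetic behind the buffer sizes and the latency total can be checked with a few lines of Python. This is only a worked calculation using the figures quoted above; the function name `buffer_latency_ms` and the `STAGES_MS` dictionary are introduced here for illustration.

```python
def buffer_latency_ms(block_size, sample_rate):
    """Time needed to fill one audio buffer, in milliseconds."""
    return 1000.0 * block_size / sample_rate


# Per-stage estimates taken from the breakdown above (milliseconds)
STAGES_MS = {
    "audio_capture": 5.0,
    "feature_extraction": 3.0,
    "conversion": 5.0,
    "audio_output": 5.0,
}

total_ms = sum(STAGES_MS.values())

for block in (128, 256, 512):
    print(f"{block} samples at 44.1 kHz: {buffer_latency_ms(block, 44100):.1f} ms")
print(f"Pipeline total: {total_ms:.0f} ms")
```

Note that a 512-sample buffer alone takes about 11.6 ms to fill, so the ~5 ms capture figure corresponds to a buffer closer to 220 samples at 44.1 kHz.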

Streaming Pipeline

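# Conceptual real-time VC pipeline (a sketch, not a working converter)
import numpy as np
import sounddevice as sd

BLOCK_SIZE = 512  # ~11.6ms at 44.1kHz
SAMPLE_RATE = 44100

# content_encoder, pitch_extractor, vits_decoder and target_speaker_embedding
# are placeholders for already-loaded models; they are not defined here.

def callback(indata, outdata, frames, time, status):
    if status:
        print(status)  # report buffer underruns/overruns

    # indata: captured microphone audio, shape (frames, channels)
    audio_chunk = indata[:, 0]

    # Extract content features (HuBERT)
    features = content_encoder(audio_chunk)

    # Extract pitch (RMVPE)
    f0 = pitch_extractor(audio_chunk)

    # Convert voice identity; the result must contain exactly `frames` samples
    converted = vits_decoder(features, f0, target_speaker_embedding)

    outdata[:, 0] = converted

stream = sd.Stream(
    samplerate=SAMPLE_RATE,
    blocksize=BLOCK_SIZE,
    channels=1,
    dtype="float32",
    callback=callback,
)

# The callback only runs while the stream is open
with stream:
    sd.sleep(10_000)  # keep the stream alive for 10 seconds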
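
One practical detail the conceptual pipeline glosses over: HuBERT-style encoders emit roughly one feature frame per 20 ms of 16 kHz audio, so a single 512-sample block is too short to encode on its own. A common workaround is to keep a rolling window of recent audio and run the models on that window. The sketch below shows only the buffering part; the class name `RollingContext` and the window length are assumptions for illustration, not part of RVC or sounddevice.

```python
import numpy as np


class RollingContext:
    """Fixed-length window holding the most recent audio samples."""

    def __init__(self, context_samples):
        self.size = context_samples
        self.buffer = np.zeros(context_samples, dtype=np.float32)

    def push(self, block):
        """Append a new block and return the updated window."""
        block = np.asarray(block, dtype=np.float32)
        # Join old and new audio, then keep only the newest `size` samples
        joined = np.concatenate([self.buffer, block])
        self.buffer = joined[-self.size:]
        return self.buffer


# Example: a 4-sample window fed with two small blocks
ctx = RollingContext(4)
ctx.push([1.0, 2.0])
window = ctx.push([3.0, 4.0, 5.0])
```

Inside an audio callback, the models would run on the window returned by `push(indata[:, 0])`, and only the newest `frames` converted samples would be written to `outdata`.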

Voice Conversion vs Voice Cloning

Aspect	Voice Conversion	Voice Cloning (TTS)
Input	Audio recording	Text
Preserves	Timing, prosody, emotion	Nothing (generates from scratch)
Output timing	Same as input	Model-determined
Use case	Change speaker in existing audio	Generate new speech
Training data	10-30 min target speaker	5 sec (zero-shot) to hours
Real-time	Yes (with optimization)	Depends on model
Singing	Excellent (SVC models)	Poor (TTS models can't sing well)

Training a VC Model

RVC training workflow:
  1. Collect 10-30 min of clean target speaker audio
  2. Preprocess: denoise, normalize, segment into 5-15s clips
  3. Extract features: HuBERT content features + pitch (F0)
  4. Build FAISS index from target speaker features
  5. Train VITS decoder: 200-500 epochs, ~20 min on RTX 3090
  6. Test: convert a sample, check for artifacts
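
The segmentation part of step 2 can be sketched as below. This is a simplified, time-based split that assumes the recording is already denoised and normalized. Real preprocessing tools usually cut at silences instead, and the function name `segment_bounds` and the 10-second target length are assumptions for the example.

```python
def segment_bounds(total_s, target_s=10.0, min_s=5.0, max_s=15.0):
    """Split a recording of total_s seconds into clips of min_s..max_s seconds.

    Returns a list of (start, end) times in seconds.
    """
    bounds = []
    start = 0.0
    while total_s - start >= min_s:
        remaining = total_s - start
        # If what is left fits in one clip, take it all so that no short
        # tail is left over; otherwise cut a clip of the target length.
        length = remaining if remaining <= max_s else target_s
        bounds.append((start, start + length))
        start += length
    return bounds


clips = segment_bounds(32.0)
```

With these defaults every clip falls in the 5-15 s range as long as the recording is at least 5 s long; a 32-second file yields two 10-second clips and one 12-second clip.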

Quality tips:
  - More diverse training data (different emotions, speeds) = better model
  - Clean audio matters more than quantity
  - Pitch extraction method, ranked by output quality: RMVPE > CREPE > Harvest (weigh quality against extraction speed)

Applications

Singing voice: apply any singer's timbre to your own recordings
Dubbing: replace voice actor identity while keeping original performance timing
Privacy: anonymize speaker identity in recordings
Accessibility: convert speech to a voice the listener finds easier to understand
Live streaming: real-time voice changing for content creators

Gotchas

Pitch range mismatch causes artifacts: converting a deep male voice to a high female voice (or vice versa) produces warbling and robotic artifacts at pitch extremes. Keep source and target within 1 octave of each other, or use explicit pitch shifting as a preprocessing step.
RVC index_rate is a critical but poorly documented parameter: at 0.0, no retrieval is used (pure encoder output, less like the target). At 1.0, retrieval is used in full (more like the target, but it can sound choppy). The sweet spot is typically 0.3-0.6, depending on how different the source and target voices are.
Background music bleeds through conversion: VC models are trained on clean speech, so music or noise in the input passes through unchanged or creates artifacts. Always separate vocals first (Demucs/UVR5), convert, then remix.
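
For the pitch range gotcha, the size of the required shift can be estimated from the median F0 of the source and target voices. The sketch below assumes F0 curves in Hz with unvoiced frames marked as 0; the helper names are hypothetical, and how the shift is applied (a transpose setting or a separate pitch-shifting step) depends on the tool in use.

```python
import math
from statistics import median


def median_voiced_f0(f0_hz):
    """Median F0 over voiced frames; unvoiced frames are marked with 0."""
    voiced = [f for f in f0_hz if f > 0]
    return median(voiced)


def semitone_shift(source_f0_hz, target_f0_hz):
    """Whole-semitone shift that moves the source median onto the target median."""
    return round(12 * math.log2(target_f0_hz / source_f0_hz))


def shift_f0(f0_hz, semitones):
    """Transpose an F0 curve, leaving unvoiced (0 Hz) frames untouched."""
    ratio = 2 ** (semitones / 12)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


source_median = median_voiced_f0([0.0, 100.0, 110.0, 120.0, 0.0])
shift = semitone_shift(source_median, 220.0)
```

Here a source median of 110 Hz and a target median of 220 Hz give a shift of +12 semitones, exactly one octave. An absolute shift above 12 suggests the pair falls outside the one-octave guideline.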