TTS Fine-Tuning Infrastructure and GPU Platforms

Intermediate

GPU rental platform comparison and deployment patterns for fine-tuning and serving 2B-4B TTS models. Based on April 2026 pricing; rates change frequently — verify before committing.

GPU Rental Platform Comparison

On-Demand Pods

| Provider | GPU | USD/hr | Billing | Notes |
|---|---|---|---|---|
| RunPod Community | H100 80GB PCIe | $1.99 | per-second | Best price/stability for fine-tune |
| RunPod Community | H100 SXM 80GB | $2.69 | per-second | NVLink for multi-GPU |
| RunPod Community | A100 80GB | $1.19-1.59 | per-second | Base option |
| RunPod Secure | H100 80GB | Community price + $0.10-0.40 | per-second | SOC2 compliance |
| Vast.ai | H100 80GB (spot) | $0.90-1.65 | per-minute | Interruptible; risky for long runs |
| Vast.ai | RTX 4090 24GB (spot) | $0.29-0.59 | per-minute | Cost-optimal for LoRA on 2B models |
| Lambda Labs | H100 SXM 80GB | $3.78 | per-hour | Zero interruptions, guaranteed |
| Lambda Labs | A100 80GB | $1.48 | per-hour | Most stable A100 option |
| Nebius | H100 80GB | $2.95 | per-hour | Up to 35% commit discount |
| Nebius | L40S 48GB | $1.55-1.82 | per-hour | 48GB covers all 2B-4B TTS models |

Serverless / Scale-to-Zero

| Provider | GPU | Effective USD/hr | Cold start | Notes |
|---|---|---|---|---|
| RunPod Serverless Flex | H100 80GB | $2.72 | 1-2s (FlashBoot) | Best DX/cost for inference |
| RunPod Serverless Active | H100 80GB | $3.35 | ~0 | When >25% uptime expected |
| Modal | H100 | $3.95 | <5s | Clean Python API, no Docker/YAML |
| Modal | A100 80GB | $2.50 | <5s | Sufficient for VoxCPM2 inference |
| Modal | A100 40GB | $2.10 | <5s | Minimum for 2B model inference |

Selection Guide

| Scenario | Recommended | Reason |
|---|---|---|
| First LoRA run, 2B model | RunPod Community H100 | $1.99/hr, per-second billing, no interruption risk |
| Cheapest LoRA (established pipeline) | Vast.ai spot RTX 4090 | $0.29-0.59/hr; save checkpoints every N steps |
| Guaranteed compute for full fine-tune | Lambda Labs A100 | Zero-interruption SLA |
| Production inference (moderate traffic) | RunPod Serverless Flex | Scale-to-zero + FlashBoot |
| Rapid prototype inference | Modal A100 40GB | Clean Python API, per-second billing |
| Cost-optimized inference (48GB needed) | Nebius L40S | From $1.55/hr, underrated tier |
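
The choice between an always-on pod and scale-to-zero serverless mostly comes down to how busy the GPU is. The sketch below estimates the break-even utilization from the hourly rates in the tables above. The function names and the 730-hour month are illustrative assumptions, and the model ignores cold starts, idle timeouts, and storage fees.

```python
def break_even_utilization(pod_rate: float, serverless_rate: float) -> float:
    """Busy fraction at which an always-on pod costs the same as serverless.

    Assumes the pod is billed for every hour and serverless only while busy.
    """
    if pod_rate <= 0 or serverless_rate <= 0:
        raise ValueError("rates must be positive")
    return min(1.0, pod_rate / serverless_rate)


def monthly_cost(rate_per_hour: float, busy_fraction: float = 1.0,
                 hours_per_month: float = 730.0) -> float:
    """Monthly cost at a given busy fraction (730 h/month is an assumption)."""
    return rate_per_hour * hours_per_month * busy_fraction


# Rates taken from the tables above: Community H100 pod vs Serverless Flex H100.
threshold = break_even_utilization(pod_rate=1.99, serverless_rate=2.72)
print(f"Pod becomes cheaper above {threshold:.0%} utilization")
print(f"Serverless at 25% busy: ${monthly_cost(2.72, 0.25):.0f}/month")
print(f"Always-on pod:          ${monthly_cost(1.99):.0f}/month")
```

Below the threshold, paying the higher serverless rate only while busy is cheaper; above it, the persistent pod wins.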

Fine-Tune Cost Estimates

VoxCPM2 (2B, tokenizer-free diffusion autoregressive)

LoRA adds <1% parameters. GPU requirements: ≥8GB VRAM minimum, 24GB for normal batch sizes.

Dataset scale: 60-120 min of clean speech is a solid LoRA tier, 2-4× the upper end of the recommended 5-30 min minimum. Full fine-tuning is overkill for a single speaker.

| Scenario | GPU | Steps | Time | Cost |
|---|---|---|---|---|
| LoRA, 60 min dataset | RTX 4090 (Vast spot) | 6-8K | 4-6h | $2-4 |
| LoRA, 120 min dataset | RTX 4090 (Vast spot) | 10-15K | 6-10h | $2-6 |
| LoRA, 120 min dataset | H100 (RunPod) | 10-15K | 2-4h | $4-8 |
| Full fine-tune, 120 min dataset | H100 (RunPod) | 20-30K | 10-18h | $20-36 |
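
The cost column is simply run time multiplied by the hourly rate. A minimal sketch of that arithmetic, using the RunPod Community H100 rate of $1.99/hr from the platform comparison (the helper name is illustrative):

```python
def cost_range(hours: tuple[float, float],
               rate_per_hour: tuple[float, float]) -> tuple[float, float]:
    """Return (low, high) cost in USD for a run-time range and a rate range."""
    low = hours[0] * rate_per_hour[0]
    high = hours[1] * rate_per_hour[1]
    return round(low, 2), round(high, 2)


H100_RATE = (1.99, 1.99)  # fixed on-demand rate, so low == high

print("LoRA, 120 min on H100:     ", cost_range((2, 4), H100_RATE))
print("Full fine-tune on H100:    ", cost_range((10, 18), H100_RATE))
```

For spot instances, pass the low and high ends of the price range instead of a fixed rate, and budget extra time for restarts after interruptions.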

Fish-Speech S2 Pro (4B)

Expect 50-80% longer training than VoxCPM2. An H100 80GB is recommended for LoRA; an A100 80GB works with gradient accumulation.

| Scenario | GPU | Time | Cost |
|---|---|---|---|
| LoRA, 120 min dataset | H100 | 4-7h | $8-14 |
| Full fine-tune, 120 min dataset | H100 | 18-30h | $36-60 |

Inference Real-Time Factor (RTF)

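RTF = inference time / audio duration, so RTF < 1 means faster than real time. The table below reports the inverse, the real-time speed-up (audio duration / inference time = 1/RTF), so higher is faster.

| System | GPU | Speed-up (1/RTF) | Notes |
|---|---|---|---|
| NVIDIA Riva (TensorRT compiled) | A100 | 61.4× | Compiled TRT engines; not representative of general models |
| NVIDIA Riva | V100 | 33.7× | |
| Qwen3-TTS (StaticCache + CUDAGraph) | RTX 4090 | 5.6× | Streaming |
| Qwen3-TTS | H100 | 4.2× | |
| Typical 2B-4B model (no deep optimization) | A100/H100 | 2-8× | 125-500 ms of compute per second of audio |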
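
RTF and the "×" speed-up figures are reciprocals of each other, which is easy to mix up. A small sketch of the conversions (helper names are illustrative):

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: compute time per second of audio (lower is faster)."""
    return inference_seconds / audio_seconds


def speedup(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time speed-up, the reciprocal of RTF (higher is faster)."""
    return audio_seconds / inference_seconds


def latency_ms_per_audio_second(speedup_factor: float) -> float:
    """Milliseconds of compute needed for one second of audio."""
    return 1000.0 / speedup_factor


# 0.5 s of compute for 2 s of audio
print(rtf(0.5, 2.0), speedup(0.5, 2.0))
# The 2-8x range for a typical 2B-4B model
print(latency_ms_per_audio_second(8), latency_ms_per_audio_second(2))
```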

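Cost per 1M characters (rough, ~18.5 hours of audio). Speed-ups are the values implied by 18.5 h of audio divided by the GPU-hours:

| GPU | Speed-up (× real-time) | GPU-hours | Cost |
|---|---|---|---|
| H100 | ~5× | ~3.7h | ~$7.3 (RunPod Community H100, $1.99/hr) |
| A100 80GB | ~3× | ~6.2h | |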

With batching and LoRA hot-swap, the effective cost is 2-3× lower.
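
The per-character cost follows from three numbers: speaking rate, real-time speed-up, and GPU price. The sketch below reproduces the estimate under two assumptions that are not stated explicitly in the text: roughly 15 characters per second of speech (which is what 1M characters ≈ 18.5 hours implies) and a 5× speed-up on H100 (which reproduces the ~3.7 GPU-hours figure).

```python
CHARS_PER_SECOND = 15.0  # assumption implied by 1M chars ~ 18.5 h of audio


def audio_hours(num_chars: int, chars_per_second: float = CHARS_PER_SECOND) -> float:
    """Hours of synthesized audio for a given amount of text."""
    return num_chars / chars_per_second / 3600.0


def gpu_hours(num_chars: int, speedup: float,
              chars_per_second: float = CHARS_PER_SECOND) -> float:
    """GPU time needed at a given real-time speed-up (1/RTF), without batching."""
    return audio_hours(num_chars, chars_per_second) / speedup


def synthesis_cost(num_chars: int, speedup: float, rate_per_hour: float) -> float:
    """USD cost of synthesizing num_chars at the given GPU hourly rate."""
    return gpu_hours(num_chars, speedup) * rate_per_hour


print(f"audio:     {audio_hours(1_000_000):.1f} h")
print(f"GPU time:  {gpu_hours(1_000_000, speedup=5.0):.1f} h")
print(f"cost:      ${synthesis_cost(1_000_000, speedup=5.0, rate_per_hour=1.99):.2f}")
```

Dividing the result by the batching gain (2-3× per the text) gives the effective cost under load.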

Multi-LoRA Serving

vLLM Multi-LoRA

For TTS models with compatible architectures:

# Enable runtime LoRA loading
VLLM_ALLOW_RUNTIME_LORA_UPDATING=True python -m vllm.entrypoints.openai.api_server \
    --model base_tts_model \
    --enable-lora \
    --max-lora-rank 32

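# Select the adapter per request: the OpenAI-compatible server takes the adapter name
# in the request's `model` field; the offline LLM.generate() API takes `lora_request`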

Limitation: vLLM targets LLM architectures. Diffusion AR (VoxCPM2) and dual-AR (Fish S2 Pro) may not be compatible out of the box — verify per model.
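
A minimal sketch of the runtime-loading flow, assuming the server above listens on `localhost:8000` and exposes vLLM's `/v1/load_lora_adapter` endpoint, which is enabled by `VLLM_ALLOW_RUNTIME_LORA_UPDATING`. The adapter name, path, and prompt are hypothetical, and whether a given TTS model returns usable output through the completions endpoint is model-specific. The code only builds the requests; pass them to `urllib.request.urlopen` to send them.

```python
import json
import urllib.request


def _post(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_adapter_request(base_url: str, lora_name: str, lora_path: str) -> urllib.request.Request:
    """Ask the running server to load an adapter from a path it can read."""
    return _post(f"{base_url}/v1/load_lora_adapter",
                 {"lora_name": lora_name, "lora_path": lora_path})


def completion_request(base_url: str, lora_name: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Route a request to a loaded adapter by naming it in the `model` field."""
    return _post(f"{base_url}/v1/completions",
                 {"model": lora_name, "prompt": prompt, "max_tokens": max_tokens})


# Hypothetical adapter name and path, for illustration only.
load_req = load_adapter_request("http://localhost:8000", "teacher_42", "/volume/adapters/teacher_42")
gen_req = completion_request("http://localhost:8000", "teacher_42", "Hello, class.")
print(load_req.full_url, gen_req.full_url)
```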

PEFT Manual Switching

For non-vLLM-compatible TTS architectures:

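from peft import PeftModel

# Assumption: load_base_model() is your own loader and returns the base TTS model
# wrapped as a PeftModel, so load_adapter() and set_adapter() are available.
base_model = load_base_model()  # loaded once, stays in VRAM
loaded_adapters = set()  # names of adapters already resident in VRAM

def get_model_for_teacher(teacher_id: str):
    # Read adapter weights from disk only on first use; later calls just switch.
    if teacher_id not in loaded_adapters:
        base_model.load_adapter(f"adapters/{teacher_id}", adapter_name=teacher_id)
        loaded_adapters.add(teacher_id)

    # Switch time: ~50-200ms depending on adapter size
    base_model.set_adapter(teacher_id)
    return base_model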
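
Keeping every adapter resident eventually exhausts VRAM when there are many speakers. One option is a small LRU cache that evicts the least recently used adapter. This is a sketch rather than a production implementation: the class name and the `max_loaded` default are illustrative, it assumes the model object provides PEFT-style `load_adapter`, `set_adapter`, and `delete_adapter` methods, and it is not thread-safe.

```python
from collections import OrderedDict


class AdapterLRU:
    """Keep at most `max_loaded` LoRA adapters attached to one base model."""

    def __init__(self, model, max_loaded: int = 8, adapter_dir: str = "adapters"):
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self.model = model
        self.max_loaded = max_loaded
        self.adapter_dir = adapter_dir
        self._loaded = OrderedDict()  # adapter name -> path, least recent first

    def activate(self, teacher_id: str):
        """Make teacher_id the active adapter, loading or evicting as needed."""
        if teacher_id in self._loaded:
            self._loaded.move_to_end(teacher_id)  # mark as most recently used
        else:
            if len(self._loaded) >= self.max_loaded:
                evicted, _ = self._loaded.popitem(last=False)
                self.model.delete_adapter(evicted)  # free the oldest adapter
            path = f"{self.adapter_dir}/{teacher_id}"
            self.model.load_adapter(path, adapter_name=teacher_id)
            self._loaded[teacher_id] = path
        self.model.set_adapter(teacher_id)
        return self.model
```

Usage would look like `cache = AdapterLRU(base_model, max_loaded=16)` followed by `model = cache.activate(teacher_id)` per request; a lock around `activate` is needed if requests are served concurrently.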

Storage per Adapter

LoRA rank 16-32 for TTS → 20-100 MB per adapter (compressible). 100 adapters ≈ 5 GB on network volume ≈ $0.35/month on RunPod. Negligible at scale.

Deployment Patterns

Pattern A: Persistent Pod (first production version)

# RunPod Pod with custom FastAPI inference server
# Persistent network volume: $0.07/GB/mo
# Structure:
# /volume/
#   base_model/          # Base TTS weights (~10-20 GB)
#   adapters/            # LoRA adapters (100 × 50MB = 5 GB)
#   datasets/            # Training data (optional)

# HTTP endpoint via pod's public URL
# One H100 handles 10-30 parallel audio generation streams

Pattern B: Serverless Flex (scale-to-zero)

# RunPod Serverless: custom Docker image
# - Base model + adapters mounted from network volume
# - Handler returns base64-encoded audio or file URL
# - FlashBoot: 1-2s cold start with cached image
# - Active workers: add N always-on workers when traffic is regular
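
A minimal sketch of what the serverless handler could look like. `synthesize` here is a stand-in that returns silent WAV audio so the example runs without a model; in a real worker it would call the TTS model with the selected adapter. The input field names (`text`, `lora_id`) and the response shape are assumptions, not a fixed contract.

```python
import base64
import io
import wave


def synthesize(text: str, lora_id: str, sample_rate: int = 16000) -> bytes:
    """Placeholder TTS: returns 0.1 s of 16-bit mono silence per character."""
    num_frames = int(0.1 * sample_rate) * max(len(text), 1)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


def handler(job: dict) -> dict:
    """Validate the job input and return base64-encoded WAV audio."""
    params = job.get("input", {})
    text = params.get("text")
    lora_id = params.get("lora_id")
    if not text or not lora_id:
        return {"error": "both 'text' and 'lora_id' are required"}
    audio = synthesize(text, lora_id)
    return {"audio_base64": base64.b64encode(audio).decode("ascii"), "format": "wav"}


# Inside the worker image, the handler is registered with the RunPod SDK:
#   import runpod
#   runpod.serverless.start({"handler": handler})
```

For long outputs, uploading the file to object storage and returning a URL avoids large base64 payloads in the response.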

Pattern C: Modal

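import modal

app = modal.App("tts-inference")
volume = modal.Volume.from_name("tts-models")  # pre-populated with base model and adapters

@app.function(
    gpu="A100",
    volumes={"/models": volume},
    timeout=120,  # seconds per call
)
def synthesize(text: str, lora_id: str) -> bytes:
    # load_with_adapter is a placeholder for your own loader. As written it reloads
    # the model on every call; cache it per container for production use.
    model = load_with_adapter(f"/models/adapters/{lora_id}")
    return model.synthesize(text)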

Voice Safety Infrastructure

Non-optional controls for voice cloning systems:

# At dataset upload time:
# 1. Server generates random phrase with 60s TTL
# 2. User reads phrase aloud (microphone, not file upload)
# 3. ASR verifies phrase content
# 4. Speaker verification confirms same voice as training data
# 5. Log consent event with timestamp, user ID, audio hash
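
Steps 1 and 3 of the flow above can be sketched with the standard library. The word list, the in-memory `store`, and the function names are illustrative assumptions; a real deployment would use a shared store such as a database, take the transcript from an ASR system, and add the speaker verification from step 4.

```python
import secrets
import time
from typing import Optional

WORDS = ["amber", "river", "candle", "orbit", "meadow", "lantern", "pebble", "harbor"]
TTL_SECONDS = 60


def issue_challenge(store: dict, user_id: str, now: Optional[float] = None,
                    num_words: int = 5) -> str:
    """Generate a random phrase for the user to read aloud and record its expiry."""
    now = time.time() if now is None else now
    phrase = " ".join(secrets.choice(WORDS) for _ in range(num_words))
    store[user_id] = (phrase, now + TTL_SECONDS)
    return phrase


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def verify_challenge(store: dict, user_id: str, transcript: str,
                     now: Optional[float] = None) -> bool:
    """Check the ASR transcript against the issued phrase; each phrase is single-use."""
    now = time.time() if now is None else now
    entry = store.pop(user_id, None)
    if entry is None:
        return False
    phrase, expires_at = entry
    if now > expires_at:
        return False
    return _normalize(transcript) == phrase
```

Popping the entry on every attempt means a failed or expired check forces a fresh phrase, which limits replay of a previously recorded reading.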

Audio Watermarking

Embed an imperceptible watermark in every synthesized audio output:

  • AudioSeal (Meta, open-source): neural watermark, verifiable via API
  • Resemble Neural Watermarking: commercial, traceable to account
  • ElevenLabs embeds watermarks in all generations by default

Embed at vocoder output stage — zero perceptible quality degradation.

Audit Trail

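import hashlib
from datetime import datetime, timezone

# Every synthesis request. audio_samples is the generated sample array and
# audio_output is the encoded audio as bytes.
audit_log.append({
    "timestamp": datetime.now(timezone.utc),  # timezone-aware; utcnow() is deprecated
    "account_id": user_id,
    "lora_id": adapter_id,
    "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    "audio_hash": hashlib.sha256(audio_output).hexdigest(),
    "duration_seconds": len(audio_samples) / sample_rate,
})
# Retain for 30-90 days for forensic attribution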
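
The retention window can be enforced with a periodic purge. A minimal sketch, assuming each entry carries a `timestamp` that is a `datetime` (naive values are treated as UTC) and that the log fits in memory; a database-backed log would do the same with a delete query.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def purge_expired(audit_log: list[dict], retention_days: int = 90,
                  now: Optional[datetime] = None) -> list[dict]:
    """Return only the entries that are still inside the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)

    def as_utc(ts: datetime) -> datetime:
        # Treat naive timestamps as UTC so old and new entries compare safely.
        return ts if ts.tzinfo is not None else ts.replace(tzinfo=timezone.utc)

    return [entry for entry in audit_log if as_utc(entry["timestamp"]) >= cutoff]
```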

Gotchas

  • Vast.ai spot interruption risk: interruptible instances lose all non-checkpointed work without warning. Never start a 6-hour fine-tune on spot without checkpoint-every-N-steps configured. Use Vast.ai only after the pipeline is validated on non-interruptible hardware.
  • vLLM incompatibility with non-LLM TTS: vLLM's Multi-LoRA is designed for decoder-only transformer LLMs. VoxCPM2 (tokenizer-free diffusion AR) and Fish-Speech S2 Pro (dual-AR) architectures may require custom serving code. Test explicitly before assuming vLLM will work.
  • 60-120 min clean speech = LoRA, not full fine-tune: with this dataset size, full fine-tune overfits and does not improve over LoRA. Full fine-tune requires 1-5 hours of clean single-speaker audio. Using full fine-tune on a 60-min dataset wastes 2-4× compute with worse results.
  • ConTree microVMs have no confirmed GPU access: Nebius ConTree is a code execution sandbox for agents, not a GPU compute service. The separate Nebius cloud has H100 and L40S instances. Do not confuse the two when planning compute budget.
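
The checkpoint-every-N-steps pattern from the first gotcha can be sketched without any ML framework. The training step below is a stand-in and the state is a small JSON file; a real run would save model and optimizer state with the framework's own serialization, ideally to a network volume that survives the instance.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file and rename, so an interruption never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, every_n: int,
          fail_at: Optional[int] = None) -> dict:
    """Resume from the last checkpoint and save every `every_n` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated spot interruption")
        state = {"step": step, "loss": 1.0 / step}  # stand-in for an optimizer step
        if step % every_n == 0 or step == total_steps:
            save_checkpoint(path, state)
    return state
```

With `every_n=5`, an interruption at step 13 loses only steps 11 and 12; rerunning `train` with the same path picks up from step 10.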

See Also