TTS Fine-Tuning Infrastructure and GPU Platforms

★★★★★ Intermediate

GPU rental platform comparison and deployment patterns for fine-tuning and serving 2B-4B TTS models. Based on April 2026 pricing; rates change frequently — verify before committing.

GPU Rental Platform Comparison

On-Demand Pods

Provider	GPU	USD/hr	Billing	Notes
RunPod Community	H100 80GB PCIe	$1.99	per-second	Best price/stability for fine-tune
RunPod Community	H100 SXM 80GB	$2.69	per-second	NVLink for multi-GPU
RunPod Community	A100 80GB	$1.19-1.59	per-second	Base option
RunPod Secure	H100 80GB	+$0.10-0.40/hr	per-second	SOC2 compliance
Vast.ai	H100 80GB (spot)	$0.90-1.65	per-minute	Interruptible — risky for long runs
Vast.ai	RTX 4090 24GB (spot)	$0.29-0.59	per-minute	Cost-optimal for LoRA 2B models
Lambda Labs	H100 SXM 80GB	$3.78	per-hour	Zero interruptions, guaranteed
Lambda Labs	A100 80GB	$1.48	per-hour	Most stable A100 option
Nebius	H100 80GB	$2.95	per-hour	Up to 35% commit discount
Nebius	L40S 48GB	$1.55-1.82	per-hour	48GB covers all 2B-4B TTS models

Serverless / Scale-to-Zero

Provider	GPU	USD/hr effective	Cold Start	Notes
RunPod Serverless Flex	H100 80GB	$2.72	1-2s (FlashBoot)	Best DX/cost for inference
RunPod Serverless Active	H100 80GB	$3.35	~0	When >25% uptime expected
Modal	H100	$3.95	<5s	Clean Python API, no Docker/YAML
Modal	A100 80GB	$2.50	<5s	Sufficient for VoxCPM2 inference
Modal	A100 40GB	$2.10	<5s	Min for 2B model inference

Selection Guide

Scenario	Recommended	Reason
First LoRA run, 2B model	RunPod Community H100	$1.99/hr, per-second, no interruption risk
Cheapest LoRA (established pipeline)	Vast.ai spot RTX 4090	$0.29-0.59/hr, save checkpoints every N steps
Guaranteed compute for full fine-tune	Lambda Labs A100	Zero interruption SLA
Production inference (moderate traffic)	RunPod Serverless Flex	Scale-to-zero + FlashBoot
Rapid prototype inference	Modal A100 40GB	Clean Python, per-second billing
Cost-optimized inference (48GB needed)	Nebius L40S	$1.55/hr, underrated tier

Fine-Tune Cost Estimates

VoxCPM2 (2B, tokenizer-free diffusion autoregressive)

LoRA adds <1% parameters. GPU requirements: ≥8GB VRAM minimum, 24GB for normal batch sizes.

Dataset scale: 60-120 min of clean speech is a solid LoRA tier, 2-4× the upper end of the recommended 5-30 min minimum. Full fine-tune is overkill for a single speaker.

Scenario	GPU	Steps	Time	Cost
LoRA, 60 min dataset	RTX 4090 (Vast spot)	6-8K	4-6h	$2-4
LoRA, 120 min dataset	RTX 4090 (Vast spot)	10-15K	6-10h	$2-6
LoRA, 120 min	H100 RunPod	10-15K	2-4h	$4-8
Full fine-tune, 120 min	H100 RunPod	20-30K	10-18h	$20-36

Fish-Speech S2 Pro (4B)

Expect 50-80% more training time than VoxCPM2. H100 80GB is recommended for LoRA; A100 80GB also works with gradient accumulation.

Scenario	GPU	Time	Cost
LoRA, 120 min	H100	4-7h	$8-14
Full fine-tune, 120 min	H100	18-30h	$36-60
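
The cost columns in the fine-tune tables above are consistent with training hours multiplied by the hourly rate from the on-demand table. A minimal sketch of that arithmetic follows; the function name `cost_range` and the choice of the $1.99/hr RunPod Community H100 rate for the Fish-Speech rows are assumptions made for illustration, not something the tables state explicitly.

```python
def cost_range(hours, usd_per_hour):
    """Return (low, high) cost in USD.

    hours        -- (low, high) estimated training time in hours
    usd_per_hour -- (low, high) hourly GPU rate in USD
    """
    (h_lo, h_hi), (r_lo, r_hi) = hours, usd_per_hour
    return (h_lo * r_lo, h_hi * r_hi)


# Assumed rate: RunPod Community H100 80GB PCIe, fixed price (low == high).
H100_RUNPOD = (1.99, 1.99)

# Fish-Speech S2 Pro rows from the table above.
fish_lora = cost_range((4, 7), H100_RUNPOD)
fish_full = cost_range((18, 30), H100_RUNPOD)

print(f"LoRA: ${fish_lora[0]:.2f} to ${fish_lora[1]:.2f}")
print(f"Full fine-tune: ${fish_full[0]:.2f} to ${fish_full[1]:.2f}")
```

For spot instances, pass the price band as the rate tuple, for example `cost_range((4, 6), (0.29, 0.59))` for the Vast.ai RTX 4090 rows.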

Inference Real-Time Factor (RTF)

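RTF = inference time / audio duration, so RTF < 1 means faster than real time. The table below reports the inverse, real-time speedup (audio duration / inference time): 5× means one second of audio takes 200 ms to generate, i.e. RTF = 0.2.

System	GPU	Speedup (1/RTF)	Notes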
NVIDIA Riva (TensorRT compiled)	A100	61.4×	Compiled TRT engines, not general
NVIDIA Riva	V100	33.7×
Qwen3-TTS (StaticCache + CUDAGraph)	RTX 4090	5.6×	Streaming
Qwen3-TTS	H100	4.2×
Typical 2B-4B model (no deep optimization)	A100/H100	2-8×	125-500ms per second of audio

Cost per 1M characters (rough, ~18.5 hours of audio):

GPU	Speedup	GPU-hours	Cost (RunPod H100)
H100	5×	~3.7h	~$7.3
A100 80GB	3×	~6.2h	—

With batching and LoRA hot-swap, the effective cost is 2-3× lower.
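
The per-1M-character figures above can be reproduced with a short calculation, treating the × values in the table as real-time speedup (audio duration divided by inference time, the inverse of RTF as defined earlier). This is a sketch only: the function names are hypothetical, and the 18.5 hours and $1.99/hr inputs are taken from this page.

```python
def gpu_hours(audio_hours: float, speedup: float) -> float:
    """GPU time needed to synthesize audio_hours of speech at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours / speedup


def synthesis_cost(audio_hours: float, speedup: float, usd_per_gpu_hour: float) -> float:
    """Unbatched cost in USD: GPU-hours multiplied by the hourly rate."""
    return gpu_hours(audio_hours, speedup) * usd_per_gpu_hour


AUDIO_HOURS_PER_1M_CHARS = 18.5  # rough figure from the text above

h100_hours = gpu_hours(AUDIO_HOURS_PER_1M_CHARS, 5)
h100_cost = synthesis_cost(AUDIO_HOURS_PER_1M_CHARS, 5, 1.99)
a100_hours = gpu_hours(AUDIO_HOURS_PER_1M_CHARS, 3)

print(f"H100: {h100_hours:.1f} GPU-hours, ${h100_cost:.2f}")
print(f"A100 80GB: {a100_hours:.1f} GPU-hours")
```

Dividing the result by the 2-3× batching factor mentioned above gives the effective cost under load.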

Multi-LoRA Serving

vLLM Multi-LoRA

For TTS models with compatible architectures:

# Enable runtime LoRA loading
VLLM_ALLOW_RUNTIME_LORA_UPDATING=True python -m vllm.entrypoints.openai.api_server \
    --model base_tts_model \
    --enable-lora \
    --max-lora-rank 32

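# Register adapters at runtime via POST /v1/load_lora_adapter (lora_name, lora_path),
# then select one per request by passing its name in the request's "model" field.
# In the offline Python API, the equivalent is the lora_request parameter.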

Limitation: vLLM targets LLM architectures. Diffusion AR (VoxCPM2) and dual-AR (Fish S2 Pro) may not be compatible out of the box — verify per model.

PEFT Manual Switching

For non-vLLM-compatible TTS architectures:

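from peft import PeftModel

# load_base_model() is a placeholder for the model-specific loader.
base_model = load_base_model()  # loaded once, stays in VRAM
peft_model = None               # PeftModel wrapper, created on first adapter load
loaded_adapters = set()         # adapter names already attached to peft_model

def get_model_for_teacher(teacher_id: str):
    global peft_model
    # Read adapter weights from disk only the first time a teacher is requested.
    if teacher_id not in loaded_adapters:
        path = f"adapters/{teacher_id}"
        if peft_model is None:
            peft_model = PeftModel.from_pretrained(base_model, path, adapter_name=teacher_id)
        else:
            peft_model.load_adapter(path, adapter_name=teacher_id)
        loaded_adapters.add(teacher_id)

    # Switch time: ~50-200ms depending on adapter size
    peft_model.set_adapter(teacher_id)
    return peft_model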

Storage per Adapter

LoRA rank 16-32 for TTS → 20-100 MB per adapter (compressible). 100 adapters ≈ 5 GB on network volume ≈ $0.35/month on RunPod. Negligible at scale.

Deployment Patterns

Pattern A: Persistent Pod (first production version)

# RunPod Pod with custom FastAPI inference server
# Persistent network volume: $0.07/GB/mo
# Structure:
# /volume/
#   base_model/          # Base TTS weights (~10-20 GB)
#   adapters/            # LoRA adapters (100 × 50MB = 5 GB)
#   datasets/            # Training data (optional)

# HTTP endpoint via pod's public URL
# One H100 handles 10-30 parallel audio generation streams

Pattern B: Serverless Flex (scale-to-zero)

# RunPod Serverless: custom Docker image
# - Base model + adapters mounted from network volume
# - Handler returns base64-encoded audio or file URL
# - FlashBoot: 1-2s cold start with cached image
# - Active workers: add N always-on workers when traffic is regular

# The same scale-to-zero pattern on Modal
import modal

app = modal.App("tts-inference")
volume = modal.Volume.from_name("tts-models")

@app.function(
    gpu="A100",
    volumes={"/models": volume},
    timeout=120,
)
def synthesize(text: str, lora_id: str) -> bytes:
    # load_with_adapter is a placeholder for the model-specific loader
    model = load_with_adapter(f"/models/adapters/{lora_id}")
    return model.synthesize(text)

Voice Safety Infrastructure

Non-optional controls for voice cloning systems:

# At dataset upload time:
# 1. Server generates random phrase with 60s TTL
# 2. User reads phrase aloud (microphone, not file upload)
# 3. ASR verifies phrase content
# 4. Speaker verification confirms same voice as training data
# 5. Log consent event with timestamp, user ID, audio hash
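
Steps 1 and 3 of the consent flow above (a random phrase with a 60 s TTL, then a check of the transcribed reading) might look roughly like the sketch below. Everything named here is hypothetical: the word list, `issue_challenge`, and `verify_challenge` are illustrative, and the transcript is assumed to come from whatever ASR system is in use. Speaker verification and consent logging (steps 4 and 5) are not covered.

```python
import secrets
import time

# Hypothetical word list; a real deployment would use a much larger vocabulary.
WORDS = ["amber", "river", "candle", "forest", "marble", "silver", "garden", "window"]
TTL_SECONDS = 60


def issue_challenge(now=None, n_words=5):
    """Step 1: generate a random phrase that expires after TTL_SECONDS."""
    now = time.time() if now is None else now
    phrase = " ".join(secrets.choice(WORDS) for _ in range(n_words))
    return {"phrase": phrase, "expires_at": now + TTL_SECONDS}


def _tokens(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return cleaned.split()


def verify_challenge(challenge, transcript, now=None):
    """Step 3: accept only an unexpired challenge whose words match the ASR transcript."""
    now = time.time() if now is None else now
    if now > challenge["expires_at"]:
        return False
    return _tokens(transcript) == _tokens(challenge["phrase"])
```

Using `secrets` rather than `random` keeps the phrase unpredictable, which matters because the point of the challenge is to rule out pre-recorded audio.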

Audio Watermarking

Embed an imperceptible watermark in every synthesized audio output:

- AudioSeal (Meta, open-source): neural watermark, verifiable via API
- Resemble Neural Watermarking: commercial, traceable to account
- ElevenLabs embeds watermarks in all generation by default

Embed at vocoder output stage — zero perceptible quality degradation.

Audit Trail

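import hashlib
from datetime import datetime, timezone

# Every synthesis request. Assumed inputs: text is a str, audio_output is the
# encoded audio as bytes, audio is the decoded sample array at sample_rate.
audit_log.append({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "account_id": user_id,
    "lora_id": adapter_id,
    "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    "audio_hash": hashlib.sha256(audio_output).hexdigest(),
    "duration_seconds": len(audio) / sample_rate,
})
# Retain for 30-90 days for forensic attribution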

Gotchas

- Vast.ai spot interruption risk: Interruptible instances lose all non-checkpointed work without warning. Never start a 6-hour fine-tune on spot without checkpoint-every-N-steps configured. Use Vast.ai only after the pipeline is validated on non-interruptible hardware.
- vLLM incompatibility with non-LLM TTS: vLLM's Multi-LoRA is designed for decoder-only transformer LLMs. VoxCPM2 (tokenizer-free diffusion AR) and Fish-Speech S2 Pro (dual-AR) architectures may require custom serving code. Test explicitly before assuming vLLM will work.
- 60-120 min clean speech = LoRA, not full fine-tune: With this dataset size, full fine-tune overfits and does not improve over LoRA. Full fine-tune requires 1-5 hours of clean single-speaker audio. Using full fine-tune on a 60-min dataset wastes 2-4× compute with worse results.
- ConTree microVMs have no confirmed GPU access: Nebius ConTree is a code execution sandbox for agents, not a GPU compute service. The separate Nebius cloud has H100 and L40S instances. Do not confuse the two when planning compute budget.
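
The checkpoint-every-N-steps advice for spot instances can be sketched with the standard library alone. This is an illustration under stated assumptions, not a training framework's API: `train_step` stands in for the real optimizer step, the JSON state holds only a step counter (a real run would also save model and optimizer state), and all function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so an interruption mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic replace on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(total_steps, ckpt_path, every_n=500, train_step=lambda step: None):
    """Resume from the last checkpoint and save again every every_n steps."""
    state = load_checkpoint(ckpt_path)
    for step in range(state["step"], total_steps):
        train_step(step)  # placeholder for the real forward/backward/update
        state["step"] = step + 1
        if state["step"] % every_n == 0:
            save_checkpoint(ckpt_path, state)
    save_checkpoint(ckpt_path, state)
    return state["step"]
```

After an interruption, restarting the same command resumes from the last saved step, so at most `every_n` steps of work are lost. Place `ckpt_path` on storage that survives the instance, such as a network volume.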