X-Dub¶
Visual dubbing model that edits lip movements in video to match new audio, preserving identity and pose. Built on Wan2.2-TI2V-5B backbone. Key innovation: mask-free editing instead of masked inpainting. Handles videos >1 minute without quality drift.
Paper: arXiv:2512.25066 (Dec 2025). Code released March 2026. Authors: Kling Team (Kuaishou) + Tsinghua + Beihang + HKUST + CUHK.
Important: released model is NOT the internal paper model. Internal used proprietary 1B DiT + LoRA. Public uses Wan2.2-5B + multi-stage SFT. Internal model cannot be released due to company policy.
Architecture¶
Two-Phase Self-Bootstrapping¶
Phase 1 — Generator (training only): a DiT inpainter creates "lip-altered counterparts" — the same video with different lip movements. This yields supervised training pairs: input = lip-altered video + original audio, target = original video.
Phase 2 — Editor (the released model): a DiT learns mask-free editing from the generated pairs. At inference it takes the full video + new audio → edits only the lips.
```
Reference Video → VAE Encode → ref_latents ──┐
                                             ├─ concat along channels → DiT Blocks → VAE Decode → Dubbed Video
Random Noise ───────────────→ noise_latents ─┘
                                                  ↑ cross-attention
Audio → Whisper + Wav2Vec2 → AudioProjModule ─────┘
Text prompt → umt5-xxl → text embeddings (cross-attn, target only)
```
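A shape-level sketch of the channel-concat conditioning. The latent dimensions here are illustrative placeholders, not the real Wan2.2 VAE config:

```python
import numpy as np

# Hypothetical shapes: C latent channels, T latent frames, H x W latent grid.
C, T, H, W = 48, 20, 64, 64  # illustrative values only

ref_latents = np.zeros((C, T, H, W), dtype=np.float32)          # VAE-encoded reference video
noise_latents = np.random.randn(C, T, H, W).astype(np.float32)  # denoising input

# Conditioning by channel concatenation: the DiT's input layer
# sees 2*C channels instead of C.
dit_input = np.concatenate([ref_latents, noise_latents], axis=0)
print(dit_input.shape)  # → (96, 20, 64, 64)
```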
Core Components¶
| Component | Implementation | Size |
|---|---|---|
| DiT backbone | Wan2.2-TI2V-5B (modified) | ~5B params |
| VAE | Wan2.2_VAE | Video autoencoder |
| Text encoder | umt5-xxl | Same as Wan2.2 |
| Audio encoder 1 | Whisper Large-v2 | ~3 GB |
| Audio encoder 2 | Wav2Vec2-base-960h | ~360 MB |
| Pose detection | DWPose (YOLOX + RTMPose) | For face cropping |
Total download: ~30.2 GB.
Audio Processing¶
Dual audio encoder — both features concatenated and projected:
```python
# Combined audio features
whisper_features  # dim=1280, high-level speech
wav2vec_features  # dim=768, low-level audio
combined = concat([whisper, wav2vec])  # dim=2048

# AudioProjModule: 4 layers of attention + FFN
#   intermediate_dim=1536, 24 heads
#   Output dim=3072 (matches DiT dimension)
#   Injected via cross-attention in DiT blocks
```
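A minimal sketch of the dimension bookkeeping, with a single random linear map standing in for the real 4-layer AudioProjModule (the audio frame count is arbitrary):

```python
import numpy as np

T = 50  # audio frames, illustrative
whisper_features = np.random.randn(T, 1280)  # high-level speech features
wav2vec_features = np.random.randn(T, 768)   # low-level acoustic features

# Concatenate along the feature dim: 1280 + 768 = 2048
combined = np.concatenate([whisper_features, wav2vec_features], axis=-1)

# Stand-in for AudioProjModule: one linear map to the DiT width.
# (The real module is 4 attention+FFN layers; this only checks shapes.)
W_proj = np.random.randn(2048, 3072) * 0.02
audio_tokens = combined @ W_proj
print(audio_tokens.shape)  # → (50, 3072)
```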
Inpainting → Editing (Key Insight)¶
Traditional lip-sync (Wav2Lip, VideoReTalking): mask mouth region → inpaint. Problem: partial context → artifacts at boundaries, inconsistent lighting/texture.
X-Dub editor: sees entire frame without masks. Treats dubbing as full-frame editing. The model decides what to change based on audio signal, not based on a mask. Result: coherent lighting, no boundary artifacts, handles occlusions naturally.
Long Video Handling¶
Sliding-window clip-based inference with temporal stitching:
| Parameter | Value |
|---|---|
| Clip length | 77 frames |
| Overlap | 5 frames (motion conditioning) |
| Resolution | 512×512 (face crop) |
| FPS | 25 (fixed) |
Continuity mechanisms:
1. Motion overlap: last 5 frames of the previous clip → VAE-encode → inject as first frames of the next clip's denoising
2. Latent stitching: latents at clip boundaries are averaged (`smooth_transition_latent`)
3. Border replacement: at each denoising step, border pixels replaced with noised reference latents
4. Color correction: applied per clip, prevents color drift
5. Single VAE decode: all latent segments concatenated temporally → decoded in one pass
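The clip scheduling can be sketched as a window generator; the function name and the handling of the final short clip are assumptions, but the 77-frame / 5-frame-overlap arithmetic follows the table above:

```python
def clip_windows(n_frames: int, clip_len: int = 77, overlap: int = 5):
    """Return (start, end) frame ranges; each clip after the first reuses
    the last `overlap` frames of the previous clip as motion conditioning."""
    windows = []
    start = 0
    while start < n_frames:
        end = min(start + clip_len, n_frames)
        windows.append((start, end))
        if end == n_frames:
            break
        start = end - overlap  # step back so consecutive clips share frames
    return windows

# 25 fps * 60 s = 1500 frames → each step advances 72 new frames
print(clip_windows(1500)[:3])  # → [(0, 77), (72, 149), (144, 221)]
```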
Identity Preservation¶
- Reference latent concat: `[ref_latents, noise_latents]` along the channel dim — full appearance info
- Asymmetric attention: self-attention across ref+target (identity flows), but text/audio cross-attention on target only
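A toy illustration of the asymmetric attention pattern: self-attention spans reference and target tokens, while the audio/text cross-attention query mask selects target tokens only (token counts are arbitrary, and the mask construction is a sketch, not the real implementation):

```python
import numpy as np

n_ref, n_tgt = 4, 4        # toy token counts
n_tok = n_ref + n_tgt

# Self-attention: every token attends to all ref + target tokens,
# so identity information flows from reference into the edit.
self_attn_mask = np.ones((n_tok, n_tok), dtype=bool)

# Audio/text cross-attention: only target tokens act as queries;
# reference tokens are left untouched by the new audio/text signal.
cross_attn_query_mask = np.array([False] * n_ref + [True] * n_tgt)

print(self_attn_mask.all(), int(cross_attn_query_mask.sum()))
```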
- Triple CFG with dynamic schedule:
```python
# Three noise predictions:
pred_nega     # no ref, no audio
pred_posi_r   # ref only
pred_posi_ra  # ref + audio

# Combine:
final = (pred_nega
         + ref_scale * (pred_posi_r - pred_nega)
         + audio_scale * (pred_posi_ra - pred_posi_r))

# Dynamic schedule through denoising:
#   ref_scale: cosine decay 100% → 30% (early: prioritize identity)
#   audio_scale: t^1.5 ramp-up (later: prioritize lip sync)
```
Default: ref_cfg_scale=2.5, audio_cfg_scale=10.0.
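The triple-CFG combination with the dynamic schedule can be sketched as below. The exact curve shapes are guesses that match the description (cosine decay for ref, t^1.5 ramp for audio); only the combination formula is taken directly from the snippet above:

```python
import math

def cfg_scales(progress: float, ref_cfg: float = 2.5, audio_cfg: float = 10.0):
    """progress: 0.0 at the first denoising step, 1.0 at the last."""
    # ref guidance: cosine decay from 100% down to 30% of ref_cfg
    ref_scale = ref_cfg * (0.3 + 0.7 * 0.5 * (1 + math.cos(math.pi * progress)))
    # audio guidance: t^1.5 ramp-up toward full strength
    audio_scale = audio_cfg * progress ** 1.5
    return ref_scale, audio_scale

def combine(pred_nega, pred_posi_r, pred_posi_ra, ref_scale, audio_scale):
    return (pred_nega
            + ref_scale * (pred_posi_r - pred_nega)
            + audio_scale * (pred_posi_ra - pred_posi_r))

r0, a0 = cfg_scales(0.0)  # early: identity dominates (≈2.5, 0.0)
r1, a1 = cfg_scales(1.0)  # late: lip sync dominates (≈0.75, 10.0)
print(r0, a0, r1, a1)
```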
Inference¶
VRAM: ~21 GB (aggressive model offloading between stages). 30 denoising steps at 512×512. Slow — no distillation or caching in public release.
Gotchas¶
- No license file — GitHub shows `license: null`; pyproject.toml says Apache-2.0 for the DiffSynth framework portion only. Usage rights ambiguous.
- Not the paper's model — different backbone, different training strategy. Quantitative comparison is TODO.
- Single-person only — multi-person support on roadmap.
- ~2% noisy frame rate — small percentage of outputs have severely noisy frames.
- Face cropping simplified — DWPose landmark-based vs paper's FLAME-mesh. Can cause jitter with rapid head movement.
- No training code — inference only.
- Heavy dependencies — OpenMMLab stack (mmengine, mmcv, mmdet, mmpose) + DiffSynth-Studio + Whisper + Wav2Vec2.
- 512×512 fixed — face crop resolution, pasted back into original video.
- Works on non-human characters (cartoons, animals) — Wan-5B version is actually better at this than internal.
Results¶
Key advantages over Wav2Lip / VideoReTalking:
- No mask artifacts (mask-free editing)
- Works on stylized/animated characters
- Handles occlusions and challenging lighting
- >1 minute videos without drift
Public Wan-5B vs internal model:
- Slightly weaker temporal stability (occasional flickering)
- Slightly weaker identity consistency
- ~2x slower inference
- Better generalization to non-human characters
Relation to Other Models¶
- Built on [[Wan2.2]] (Alibaba video generation model) — same VAE, text encoder, DiT architecture
- Uses DiffSynth-Studio framework (same as [[Step1X-Edit|Qwen-Image-Edit]])
- Audio approach similar to AniPortrait, DreamTalk but with dual encoder (Whisper + Wav2Vec2)
Key Links¶
- GitHub: github.com/KlingAIResearch/X-Dub
- HF weights: huggingface.co/KlingTeam/X-Dub
- Project page: hjrphoebus.github.io/X-Dub
- Paper: arxiv.org/abs/2512.25066