ATI (Any Trajectory Instruction)

Trajectory-based motion control for image-to-video (I2V) generation. A lightweight Gaussian motion injector module is added to a pretrained video DiT. Camera motion and object motion are both controlled through the same point-trajectory representation.

Paper: arXiv:2505.22944 (May 2025). Authors: ByteDance.

Architecture

Adds a motion injector module between the preprocessing and patchify stages of a frozen I2V DiT:

Input image → VAE encode → latent L_I (H×W×C)
                                ↓
For each trajectory point: bilinear interpolation → C-dim feature vector
                                ↓
Gaussian distribution across frames:
  P = exp(-||position_t - (i,j)||^2 / (2σ)),   σ = 1/440
                                ↓
Soft spatial guidance signals → injected into latent space
                                ↓
Frozen Wan2.1-I2V-14B → generated video following trajectories
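
The two middle steps of the pipeline above (bilinear feature sampling and Gaussian spreading) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the ATI implementation: the function names, the nested-list latent layout `latent[row][col][channel]`, the `(x, y)` point convention, and the `scale` parameter that sets the coordinate units are all hypothetical. Only the Gaussian formula and σ = 1/440 come from the notes above.

```python
import math

SIGMA = 1 / 440  # value quoted in the notes; coordinate units are an assumption


def bilinear_sample(latent, x, y):
    """Return the C-dim feature of latent[row][col][channel] at continuous (x, y)."""
    h, w = len(latent), len(latent[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the latent grid
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * latent[y0][x0][c] + ax * latent[y0][x1][c])
        + ay * ((1 - ax) * latent[y1][x0][c] + ax * latent[y1][x1][c])
        for c in range(len(latent[0][0]))
    ]


def guidance_frame(latent, start, pos_t, sigma=SIGMA, scale=1.0):
    """Spread the feature sampled at `start` around `pos_t` for one frame.

    Each cell (i, j) receives feature * exp(-||pos_t - (i, j)||^2 / (2 * sigma)).
    `scale` divides coordinates before the distance is taken (hypothetical knob).
    """
    feat = bilinear_sample(latent, start[0], start[1])
    h, w = len(latent), len(latent[0])
    out = []
    for i in range(h):          # i indexes rows (y)
        row = []
        for j in range(w):      # j indexes columns (x)
            d2 = ((pos_t[0] - j) / scale) ** 2 + ((pos_t[1] - i) / scale) ** 2
            p = math.exp(-d2 / (2 * sigma))
            row.append([p * f for f in feat])
        out.append(row)
    return out
```

Calling `guidance_frame` once per frame with that frame's trajectory position yields one soft guidance map per frame. How the maps for several trajectories are combined and injected into the latent is not specified in these notes, so it is left out of the sketch.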

Unified Representation

Camera motion = coordinated point trajectories (radial expansion=zoom, uniform translation=pan). Object motion = point trajectories anchored to objects. Same mechanism for both. No separate encoders per motion type.
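
To make the camera-as-trajectories idea concrete, here is a small sketch that builds pan and zoom trajectories from a grid of anchor points. The helper names, the `(x, y)` pixel convention, and the list-of-tracks output layout are assumptions made for illustration; the actual trajectory file format used by ATI is not described in these notes.

```python
def grid_points(n, width, height):
    """n x n anchor points spread evenly over the image (hypothetical helper)."""
    return [
        ((c + 0.5) * width / n, (r + 0.5) * height / n)
        for r in range(n)
        for c in range(n)
    ]


def pan(points, frames, dx, dy):
    """Uniform translation: every point moves by the same (dx, dy) per frame."""
    return [[(x + t * dx, y + t * dy) for t in range(frames)] for x, y in points]


def zoom(points, frames, center, rate):
    """Radial expansion: distance to `center` grows by (1 + rate) each frame."""
    cx, cy = center
    return [
        [
            (cx + (x - cx) * (1 + rate) ** t, cy + (y - cy) * (1 + rate) ** t)
            for t in range(frames)
        ]
        for x, y in points
    ]


# One track per anchor point, one (x, y) position per frame.
anchors = grid_points(2, 100, 100)
pan_tracks = pan(anchors, frames=3, dx=2.0, dy=0.0)
zoom_tracks = zoom(anchors, frames=3, center=(50.0, 50.0), rate=1.0)
```

An object trajectory would use the same track layout, with a single point placed on the object instead of a full grid, which is why one injector can serve both cases.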

Training

2.4M video clips (filtered from 5M), with 120 points/frame tracked by TAP-Net
50K iterations on 64 GPUs (80GB each)
1-20 randomly selected points per video during training
"Tail Dropout Regularizer" (p=0.2): randomly truncates trajectories during training to prevent occlusion hallucination
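
The point sampling and tail dropout steps listed above can be sketched as follows. This is an illustration under assumptions, not the training code: the function names are hypothetical, the cut frame is drawn uniformly, and truncated frames are marked with `None`. The notes only state the 1-20 point range and p=0.2.

```python
import random


def sample_tracks(tracks, rng=random, max_points=20):
    """Keep between 1 and max_points randomly chosen tracks for one video."""
    k = rng.randint(1, min(max_points, len(tracks)))
    return rng.sample(tracks, k)


def tail_dropout(track, p=0.2, rng=random):
    """With probability p, cut the track at a random frame.

    Frames after the cut are replaced by None (assumed marker for "no guidance").
    """
    if len(track) > 1 and rng.random() < p:
        cut = rng.randrange(1, len(track))  # always keep at least the first frame
        return track[:cut] + [None] * (len(track) - cut)
    return list(track)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible when debugging a data pipeline.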

Base Model

Wan2.1-I2V-14B-480P (primary). Also validated on Seaweed-7B (internal ByteDance). Model-agnostic injector.

License

Apache 2.0 (commercial use permitted).

Gotchas

Output 480P only
Very rapid movements (half image width in 2 frames) → failure
Requires full Wan2.1 14B model + ATI weights + VAE/T5/CLIP copied manually
No confirmed Wan 2.2 support yet
Trajectory editor is meant to run on localhost only (exposing it on a remote host is a security risk)
ComfyUI nodes available via Kijai (ComfyUI-WanVideoWrapper)

Key Links

GitHub: github.com/bytedance/ATI
HF: huggingface.co/bytedance-research/ATI
ComfyUI: docs.comfy.org/tutorials/video/wan/wan-ati