Klear: AI that makes video and sound move in sync
Meet Klear, a research system for generating video and audio together — with lips, speech, and actions that line up.
The team tackles three common pain points in A/V generation: out-of-sync audio, weak lip–speech alignment, and quality drops when one modality dominates.
- One unified model: a single-tower design with full audio–video attention keeps timing tight and scales cleanly (see the joint-attention sketch after this list).
- Smarter training: progressive multitask learning (with modality masking and a curriculum) builds robust, aligned A/V representations and avoids unimodal collapse.
- Better data: a new large-scale set of tightly aligned audio–video clips with dense captions, built by an automated pipeline that filters millions of examples (a toy filtering sketch also follows).
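To make the single-tower idea concrete, here is a minimal sketch (assumptions throughout, not the authors' code): one transformer block in which video and audio tokens share full self-attention, plus a simple modality-masking step of the kind the progressive multitask training relies on. The names and dimensions (`AVBlock`, `d_model`, `p_drop`) are placeholders, and the block omits the text and timestep conditioning a diffusion model would add.

```python
# Hedged sketch, not the authors' implementation.
import torch
import torch.nn as nn


class AVBlock(nn.Module):
    """One joint transformer block: audio and video tokens attend to each other."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Concatenate along the sequence axis so every audio token can attend to
        # every video token and vice versa ("full audio-video attention").
        x = torch.cat([video_tokens, audio_tokens], dim=1)
        x = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.mlp(self.norm2(x))
        n_v = video_tokens.shape[1]
        return x[:, :n_v], x[:, n_v:]


def mask_modality(video: torch.Tensor, audio: torch.Tensor, p_drop: float = 0.3):
    """Randomly zero one modality so the model also learns audio-only and
    video-only generation instead of collapsing onto a single modality."""
    if torch.rand(()) < p_drop:
        if torch.rand(()) < 0.5:
            video = torch.zeros_like(video)
        else:
            audio = torch.zeros_like(audio)
    return video, audio


if __name__ == "__main__":
    block = AVBlock()
    v = torch.randn(2, 64, 512)  # video patch tokens
    a = torch.randn(2, 32, 512)  # audio frame tokens
    v, a = mask_modality(v, a)
    v_out, a_out = block(v, a)
    print(v_out.shape, a_out.shape)  # (2, 64, 512) (2, 32, 512)
```

Sharing one attention over both token streams is the plausible reason the bullet above says timing stays tight: sync cues can flow directly between modalities rather than through a separate cross-attention bridge.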
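The data bullet can be illustrated the same way. Below is a toy version of an automated filtering step: score each clip for audio–video synchrony and caption density, then keep only tightly aligned, well-captioned examples. The `Clip` fields, score source, and thresholds are hypothetical and are not taken from the paper's pipeline.

```python
# Hedged sketch of a clip-filtering step; fields and thresholds are assumptions.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Clip:
    path: str
    sync_score: float  # e.g. from an off-the-shelf A/V sync scorer (assumed)
    caption: str       # dense caption from a captioning model (assumed)


def filter_clips(
    clips: Iterable[Clip],
    min_sync: float = 0.8,
    min_caption_words: int = 15,
) -> List[Clip]:
    """Keep clips whose audio and video are tightly aligned and whose captions
    are dense enough to supervise instruction-following generation."""
    return [
        c for c in clips
        if c.sync_score >= min_sync and len(c.caption.split()) >= min_caption_words
    ]


if __name__ == "__main__":
    sample = [
        Clip("a.mp4", 0.93, "a woman speaks to the camera while gesturing with both hands in a bright kitchen as rain taps the window"),
        Clip("b.mp4", 0.41, "music video"),
    ]
    print([c.path for c in filter_clips(sample)])  # ['a.mp4']
```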
The authors report high-fidelity, instruction-following generation in joint, audio-only, and video-only modes, strong generalization to new scenarios, and large gains over prior open methods, with performance comparable to Veo 3.
Paper: https://arxiv.org/abs/2601.04151
Register: https://www.AiFeta.com
#AI #GenerativeAI #Video #Audio #Multimodal #Research #LipSync #Diffusion