Klear: AI that makes video and sound move in sync

Meet Klear, a research system that generates video and audio together, so that lips, speech, and actions line up.

The team tackles three common pain points in audio-video generation: sound that drifts out of sync, poor lip-speech alignment, and quality drops when one modality dominates.

One unified model: a single-tower design with full audio–video attention keeps timing tight and scales cleanly.
Smarter training: progressive multitask learning (with modality masking and a curriculum) builds robust, aligned A/V representations and avoids unimodal collapse.
Better data: a new large-scale set of tightly aligned audio–video clips with dense captions, built by an automated pipeline that filters millions of examples.
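
To make the first two ideas concrete, here is a minimal sketch in Python. It is not the authors' code: the function names, the linear schedule, and the 0.2 to 0.8 ramp are all assumptions made for illustration. It treats video and audio tokens as one concatenated sequence (the single-tower idea), lets every token attend to every other token in joint mode, hides one modality in the unimodal modes (modality masking), and makes joint tasks more likely as training proceeds (a simple curriculum).

```python
import random

MODES = ("joint", "video_only", "audio_only")


def joint_probability(step, total_steps, start=0.2, end=0.8):
    """Chance of sampling a joint audio+video task at this step.

    Hypothetical linear ramp; the start/end values are illustrative only.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def sample_mode(step, total_steps, rng):
    """Pick the training task for this step (assumed curriculum)."""
    if rng.random() < joint_probability(step, total_steps):
        return "joint"
    return rng.choice(("video_only", "audio_only"))


def attention_mask(n_video, n_audio, mode):
    """Boolean mask over one concatenated [video | audio] token sequence.

    mask[i][j] is True when token i may attend to token j.
    In joint mode every entry is True (full audio-video attention);
    in a unimodal mode the other modality's tokens are masked out.
    """
    is_video = [True] * n_video + [False] * n_audio
    if mode == "joint":
        visible = [True] * (n_video + n_audio)
    elif mode == "video_only":
        visible = is_video
    elif mode == "audio_only":
        visible = [not v for v in is_video]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [[vi and vj for vj in visible] for vi in visible]


if __name__ == "__main__":
    rng = random.Random(0)
    mode = sample_mode(step=500, total_steps=1000, rng=rng)
    for row in attention_mask(n_video=2, n_audio=1, mode=mode):
        print(row)
```

A real system would apply such a mask inside a transformer's attention layers and would likely use a more elaborate schedule; the sketch only shows how one model can serve joint, video-only, and audio-only modes by changing which tokens are visible.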

The authors report high-fidelity, instruction-following generation in joint, audio-only, or video-only modes, strong generalization to new scenarios, and large gains over prior open methods, with performance they describe as comparable to Veo 3.

Paper: https://arxiv.org/abs/2601.04151

AI GenerativeAI Video Audio Multimodal Research LipSync Diffusion
