Molmo2: Open Video-Language AI with Pixel-Level Grounding

Most top video-language models are closed. Molmo2 opens the door with open weights and open datasets, built to understand videos and to ground that understanding by pointing to and tracking objects in the pixels.

Data you can build on: 7 new video datasets and 2 multi-image sets, including rich video captions, free-form video Q&A, complex object tracking, and a new video pointing set, all collected without closed models.
Training recipe: efficient sequence packing, message-tree encoding, bi-directional attention over vision tokens, and a novel token-weighting strategy.
Results: the 8B Molmo2 leads open models on short-video understanding, counting, and captioning, and is competitive on long videos.
Grounding wins: Molmo2 beats open models like Qwen3-VL on video counting (35.5 vs 29.6) and surpasses the proprietary Gemini 3 Pro on some tasks (video pointing F1: 38.4 vs 20.0; video tracking J&F: 56.2 vs 41.1).
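
To make "bi-directional attention over vision tokens" from the training recipe concrete, here is a minimal sketch of one way such a mask could be built. It is an illustration, not Molmo2's actual implementation: the function name `build_attention_mask` and the toy sequence are hypothetical, and it assumes text tokens attend causally while vision tokens can also see every other vision token in the sequence. A real system would likely restrict this to tokens from the same example or image.

```python
def build_attention_mask(is_vision):
    """Return mask[i][j] = True when position i may attend to position j.

    is_vision: list of bools, True where the token is a vision token.
    Assumption (not from the paper): text tokens use a causal mask (j <= i),
    and vision tokens additionally attend to all other vision tokens,
    including ones that come later in the sequence.
    """
    n = len(is_vision)
    # Start from a standard causal mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Open up vision-to-vision attention in both directions.
    for i in range(n):
        if not is_vision[i]:
            continue
        for j in range(n):
            if is_vision[j]:
                mask[i][j] = True
    return mask


# Hypothetical sequence: three vision tokens followed by two text tokens.
mask = build_attention_mask([True, True, True, False, False])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the top-left 3x3 block is fully filled (vision tokens see each other in both directions), while the text rows keep the usual lower-triangular causal pattern.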

Why it matters: developers and researchers finally get transparent, high-quality building blocks for video-language systems that not only describe what is happening, but can show you where it happens. Paper: https://arxiv.org/abs/2601.10611v1
