Molmo2: Open Video-Language AI with Pixel-Level Grounding
Most top video AIs are locked up. Molmo2 opens the door: open weights and open datasets, built to understand videos and ground that understanding by pointing to and tracking objects in the pixels.
- Data you can build on: 7 new video datasets and 2 multi-image sets, including rich video captions, free-form video Q&A, complex object tracking, and a new video pointing set, all collected without closed models.
- Training recipe: efficient sequence packing, message-tree encoding, bi-directional attention over vision tokens, and a novel token-weighting strategy (a rough packing/weighting sketch follows this list).
- Results: the 8B Molmo2 leads open models on short-video understanding, counting, and captioning, and is competitive on long videos.
- Grounding wins: beats open models like Qwen3-VL on video counting (35.5 vs 29.6), and surpasses Gemini 3 Pro on some tasks (video pointing F1: 38.4 vs 20.0; video tracking J&F: 56.2 vs 41.1).
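To make the training-recipe bullet concrete, here is a minimal sketch of sequence packing combined with per-token loss weighting. This is not the paper's code: the function names, the greedy packing policy, and the weighting scheme are illustrative assumptions.

```python
# Minimal sketch (not Molmo2's actual code): pack several short samples into one
# training sequence and apply per-token loss weights. Names and policies here
# (pack_examples, weighted_loss, greedy fill) are assumptions for illustration.
from typing import List, Tuple
import torch

def pack_examples(examples: List[Tuple[torch.Tensor, torch.Tensor]],
                  max_len: int, pad_id: int = 0):
    """Greedily concatenate (token_ids, loss_weights) pairs up to max_len, then pad.

    Returns packed token ids, per-token loss weights, and segment ids so an
    attention mask can keep packed samples from attending to each other.
    """
    tokens, weights, segments = [], [], []
    for seg, (ids, w) in enumerate(examples):
        if len(tokens) + len(ids) > max_len:
            break
        tokens.extend(ids.tolist())
        weights.extend(w.tolist())
        segments.extend([seg] * len(ids))
    pad = max_len - len(tokens)
    tokens += [pad_id] * pad
    weights += [0.0] * pad        # padding contributes nothing to the loss
    segments += [-1] * pad
    return (torch.tensor(tokens),
            torch.tensor(weights, dtype=torch.float),
            torch.tensor(segments))

def weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                  weights: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy: e.g. down-weight boilerplate tokens, keep answer tokens."""
    ce = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
    return (ce * weights).sum() / weights.sum().clamp(min=1.0)
```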
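For the pointing result above, an F1 score needs a rule for matching predicted points to ground-truth points. The paper's exact matching rule is not specified here; this sketch uses a common choice (Hungarian matching within a pixel radius) purely to illustrate the metric, and the radius is an assumption.

```python
# Illustrative pointing-F1 (assumed matching rule, not necessarily Molmo2's):
# match predicted points to ground-truth points within a pixel radius, then
# compute precision, recall, and F1 over the matches.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pointing_f1(pred_pts, gt_pts, radius: float = 10.0) -> float:
    """pred_pts: (N, 2) and gt_pts: (M, 2) arrays of pixel coordinates."""
    pred_pts = np.asarray(pred_pts, dtype=float)
    gt_pts = np.asarray(gt_pts, dtype=float)
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        # Both empty counts as perfect; one empty counts as a miss.
        return 1.0 if len(pred_pts) == 0 and len(gt_pts) == 0 else 0.0
    # Pairwise distances, optimal one-to-one matching, then threshold by radius.
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dists)
    tp = int(np.sum(dists[rows, cols] <= radius))
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pts)
    recall = tp / len(gt_pts)
    return 2 * precision * recall / (precision + recall)
```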
Why it matters: developers and researchers finally get transparent, high-quality building blocks for video-language systems that not only describe what is happening, but can show you where it happens.
Paper: https://arxiv.org/abs/2601.10611v1
Register: https://www.AiFeta.com
#AI #OpenSource #VisionLanguage #VideoUnderstanding #MachineLearning #ComputerVision #VLM #Research #Molmo2 #Grounding #Tracking #Datasets