Omni-R1: AI that draws its thoughts
What if AI could think with pictures?
Omni-R1 is a new multimodal AI that doesn't just "talk through" problems: it draws its intermediate steps. Instead of relying on one fixed reasoning style, it unifies many skills (zooming into regions, pointing to objects, marking paths) by generating small helper images mid-reasoning.
- Unified generative reasoning: one paradigm for many vision-language tasks.
- Two-stage training: supervised fine-tuning + reinforcement learning with a perception alignment loss and perception reward to make the generated visuals actually useful.
- Omni-R1-Zero: learns the same trick without multimodal labels by bootstrapping visual steps from text-only reasoning, and often matches or beats Omni-R1.
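To make the two-stage recipe concrete, here is a minimal sketch of how an RL objective might combine task success with a "perception reward" that scores whether the generated helper visual actually grounds the answer. All function names, the IoU-based scoring, and the weighting are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical reward shaping for RL fine-tuning with a perception reward.
# Assumption: the helper visual is scored by overlap (IoU) with a reference
# region; the paper's actual perception reward may differ.

def perception_reward(iou: float, threshold: float = 0.5) -> float:
    """Score the generated helper visual (e.g., a drawn box) by its
    overlap with a reference region; zero below the threshold."""
    return iou if iou >= threshold else 0.0

def total_reward(task_correct: bool, iou: float, alpha: float = 0.5) -> float:
    """Combine task success with perception quality, so the policy is
    rewarded only for visuals that are both used and well-aligned."""
    r_task = 1.0 if task_correct else 0.0
    return r_task + alpha * perception_reward(iou)

# A correct answer backed by a well-aligned helper image earns the
# task reward plus a scaled perception bonus; a misaligned helper
# image (IoU below threshold) earns only the task reward.
print(total_reward(True, iou=0.8))
print(total_reward(True, iou=0.2))
```

The point of the threshold is that a sloppy helper image contributes nothing, so the model cannot game the bonus with low-quality visuals.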
Why it matters: more general, transparent multimodal reasoning that can show its work across diverse tasks.
Paper: https://arxiv.org/abs/2601.09536v1 (cs.AI). Authors: Dongjie Cheng et al.
Register: https://www.AiFeta.com
#AI #Multimodal #MLLM #ComputerVision #GenerativeAI #ReinforcementLearning #Research