Can video AIs guide real robots? Meet the WoW-World-Eval Embodied Turing Test
Robots need 'world models' that predict what happens next. A new benchmark, WoW-World-Eval (Wow, wo, val), tests video foundation models on 609 robot manipulation samples across five skills: perception, planning, prediction, generalization, and execution.
- The overall score, built from 22 metrics, correlates strongly with human judgment (Pearson > 0.93), enabling a reliable Human Turing Test.
- Long-horizon planning is weak: average score 17.27.
- Physical and temporal realism are limited: best physical consistency reaches 68.02.
- Real-world execution, with generated videos turned into robot actions by an inverse dynamics model: most models reach about 0% success; one model (WoW) reaches 40.74%.
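For readers unfamiliar with the Pearson figure quoted above, here is a minimal sketch of how agreement between an automatic score and human ratings can be measured. The numbers and the names `metric_scores` and `human_ratings` are made up for illustration; they are not data from the paper, and this is not the authors' evaluation code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: one benchmark score and one human rating per model.
metric_scores = [12.0, 35.5, 48.0, 61.2, 70.3]
human_ratings = [1.5, 2.8, 3.1, 4.2, 4.6]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the automatic score ranks and spaces models much as human raters do, which is the property the benchmark relies on when it uses its score as a stand-in for human judgment.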
Bottom line: today’s video models still struggle to produce plans and physics that hold up in the real world, so stronger, standardized evaluation is needed before using them as universal priors for embodied agents.
Paper: https://arxiv.org/abs/2601.04137v1
Register: https://www.AiFeta.com
AI Robotics EmbodiedAI WorldModels Benchmark TuringTest VideoAI Research