WoW-World-Eval: A Turing Test for Robot-Ready Video AI
What’s new
As AI video models are used as “world models” for robots, we need to know if their imagined futures match reality. WoW-World-Eval (Wow, wo, val) is a public benchmark that stress-tests these models before we trust them on real machines.
- Five skills: perception, planning, prediction, generalization, execution.
- 22 metrics whose scores track human preference closely (Pearson correlation > 0.93).
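For readers unfamiliar with the Pearson figure, here is a minimal sketch of how agreement between a benchmark metric and human ratings can be measured. The `pearson` helper and all numbers are made up for illustration; they are not the benchmark's code or data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: one automatic score and one human rating per model.
metric_scores = [20.0, 35.0, 50.0, 65.0, 80.0]
human_ratings = [1.5, 2.8, 3.1, 4.2, 4.7]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the automatic score ranks models much as human raters do, which is the property the benchmark reports for its metrics.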
Results on 609 real robot manipulation scenes reveal big gaps: models score only 17.27 on long-horizon planning and at best 68.02 on physical consistency, both signs of shaky spatiotemporal reasoning.
In a “Turing Test” for execution using an inverse dynamics model, most video models drop to about 0% real-world task success. The WoW model holds 40.74%, but there’s still a long way to go.
Bottom line: today’s video world models can look convincing yet fail to guide robots reliably. WoW-World-Eval offers a standardized yardstick to close that reality gap.
Paper: https://arxiv.org/abs/2601.04137v1
Register: https://www.AiFeta.com
EmbodiedAI WorldModels Robotics VideoAI Benchmark TuringTest AIResearch arXiv