Escaping the Verifier: Learning to Reason via Demonstrations
LLMs can learn to reason—without task verifiers
Many real-world problems don’t have automatic checkers to grade answers, even though we have lots of expert solutions. RARO (Relativistic Adversarial Reasoning Optimization) shows how to train reasoning skills from those examples alone.
How it works:
- A policy (the model) tries to generate answers like the experts.
- A relativistic critic learns to tell expert and model answers apart by direct comparison.
- Both are trained jointly in an adversarial loop (RL for the policy, inverse RL to recover the reward signal), with stabilizers that keep training steady.
Why it matters: RARO beat strong verifier-free baselines on Countdown (math puzzles), DeepMath (challenging math problems), and Poetry Writing, and it scales as reliably as standard RL on tasks that do have verifiers. In short, you can unlock strong reasoning from demonstrations alone, with no task-specific checker required.
Paper by Locke Cai and Ivan Provilkov: https://arxiv.org/abs/2511.21667v1
Register: https://www.AiFeta.com
#AI #LLM #ReinforcementLearning #InverseRL #Reasoning #MachineLearning #NLP #Research