Escaping the Verifier: Learning to Reason via Demonstrations

LLMs can learn to reason, without task verifiers

Many real-world problems don’t have automatic checkers to grade answers, even though we have lots of expert solutions. RARO (Relativistic Adversarial Reasoning Optimization) shows how to train reasoning skills from those examples alone.

How it works:

  • A policy (the model) tries to generate answers like the experts.
  • A relativistic critic learns to tell expert and model answers apart by direct comparison.
  • Both are trained together via reinforcement learning and inverse RL, with stabilizers that keep training steady.
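The adversarial loop above can be sketched in a toy form. The snippet below is a minimal illustration, not the paper's actual method: the "answers" are 1-D numbers, the "policy" is a Gaussian with a learnable mean, the relativistic critic is a linear scorer trained to rank expert samples above policy samples by direct comparison, and the policy is updated with REINFORCE using the critic's relative score as reward. All names, hyperparameters, and the weight-decay "stabilizer" are invented for this sketch.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy setup (illustrative only): "expert demonstrations" are numbers
# near 5.0; the policy is a Gaussian with a learnable mean that tries
# to produce samples the critic cannot tell apart from the experts.
EXPERT_MEAN, SIGMA = 5.0, 1.0
mu = 0.0          # policy parameter (starts far from the experts)
w = 0.0           # linear relativistic critic: score(x) = w * x
LR_PI, LR_C, DECAY = 0.05, 0.1, 0.99

for step in range(3000):
    x_e = random.gauss(EXPERT_MEAN, SIGMA)  # expert demonstration
    x_m = random.gauss(mu, SIGMA)           # policy sample

    # Relativistic critic: judge the pair by direct comparison,
    # ascending log sigma(score(expert) - score(model)).
    p = sigmoid(w * (x_e - x_m))
    w = DECAY * w + LR_C * (1.0 - p) * (x_e - x_m)
    # (the weight decay is a crude stand-in for training stabilizers)

    # Policy: REINFORCE, rewarded when the critic scores the model's
    # sample at least as high as the expert's.
    reward = w * (x_m - x_e)
    mu += LR_PI * reward * (x_m - mu)  # Gaussian score-function gradient

print(f"learned policy mean: {mu:.2f} (experts are near {EXPERT_MEAN})")
```

In the paper the policy is an LLM, the critic compares full expert and model responses, and both are trained jointly with RL and inverse RL; this scalar toy only shows the core dynamic, a relativistic comparison driving imitation with no task verifier anywhere in the loop.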

Why it matters: RARO beat strong verifier-free baselines on Countdown (math puzzles), DeepMath (theorem steps), and Poetry Writing, and it scales as reliably as standard RL on tasks that do have verifiers. In short, you can unlock strong reasoning from demonstrations alone, with no task-specific checker required.

Paper by Locke Cai and Ivan Provilkov: https://arxiv.org/abs/2511.21667v1

Register: https://www.AiFeta.com

#AI #LLM #ReinforcementLearning #InverseRL #Reasoning #MachineLearning #NLP #Research
