Escaping the Verifier: Learning to Reason via Demonstrations
TL;DR: RARO (Relativistic Adversarial Reasoning Optimization) teaches language models to reason from expert demonstrations, with no task-specific verifier needed. Many real tasks don’t have an automatic "checker," so RARO uses inverse reinforcement learning to learn from examples instead. How it works: a policy tries to produce answers