Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
ROVER: a minimalist RLVR method that recovers optimal actions from a random policy’s Q-values. LLM reasoning with RL and verifiable rewards often relies on heavy policy-optimization loops (e.g., PPO, GRPO) that suffer from training instability and diversity collapse. This paper reframes math reasoning as a specialized finite-horizon MDP with deterministic