Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
ROVER replaces PPO loops with uniform‑policy Q‑values—boosting quality and diversity in math reasoning
Popular RLVR methods for LLM reasoning lean on generalized policy iteration (e.g., PPO/GRPO) but suffer from instability and diversity collapse. This paper reframes math RLVR as a specialized finite‑horizon MDP with deterministic transitions, tree‑structured dynamics, and binary terminal rewards: a setting far simpler than the general control problems these algorithms were designed for.
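For concreteness, here is a minimal, hypothetical sketch (not from the paper) of the kind of MDP described above: states are partial token sequences, each action deterministically appends a token so the dynamics form a tree, and the only reward is a binary verifier check at the horizon. The names (`VOCAB`, `HORIZON`, `step`, `verify`, `reward`) are illustrative assumptions.

```python
# Toy RLVR-style MDP: deterministic, tree-structured, binary terminal reward.
# All names here (VOCAB, HORIZON, step, verify, reward) are illustrative.
from typing import Tuple

VOCAB: Tuple[str, ...] = ("a", "b", "c")  # toy token/action set
HORIZON = 3                               # fixed generation length

def step(state: Tuple[str, ...], action: str) -> Tuple[str, ...]:
    """Deterministic transition: appending a token yields a unique child state,
    so the reachable states form a tree rooted at the empty sequence."""
    return state + (action,)

def verify(state: Tuple[str, ...]) -> float:
    """Stand-in for a verifier: 1.0 iff the completed sequence 'is correct'."""
    return 1.0 if state and state[-1] == "c" else 0.0

def reward(state: Tuple[str, ...]) -> float:
    """Binary reward, paid only at terminal (length-HORIZON) states."""
    return verify(state) if len(state) == HORIZON else 0.0
```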
The key theoretical insight: you can recover the optimal action from the Q‑function of a fixed uniformly random policy. That means no alternating policy evaluation/improvement loop is required. Building on this, the authors propose ROVER—Random Policy Valuation for Diverse Reasoning—which samples actions from a softmax over uniform‑policy Q‑values.
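A rough sketch of that idea, building on the toy MDP above (this is an assumption-laden illustration, not the authors' implementation): evaluate the Q‑function of the fixed uniform policy by averaging, rather than maximizing, over children, then sample actions from a softmax over those Q‑values. The temperature `tau`, the exhaustive recursion, and the function names are illustrative.

```python
# Minimal sketch of "random policy valuation": compute Q of the fixed uniform
# policy by backward recursion over the tree, then act via softmax over it.
# Assumes the toy MDP (VOCAB, HORIZON, step, reward) defined above.
import math
import random
from functools import lru_cache
from typing import Tuple

@lru_cache(maxsize=None)
def q_uniform(state: Tuple[str, ...], action: str) -> float:
    """Q^{pi_uniform}(s, a): expected terminal reward when taking `action` now
    and following the uniformly random policy afterwards."""
    next_state = step(state, action)
    if len(next_state) == HORIZON:
        return reward(next_state)
    # Uniform policy: average (not max) over the children's Q-values.
    return sum(q_uniform(next_state, a) for a in VOCAB) / len(VOCAB)

def rover_style_sample(state: Tuple[str, ...], tau: float = 0.1) -> str:
    """Sample an action from a softmax over uniform-policy Q-values.
    As tau -> 0 this becomes greedy, which per the paper's insight recovers the
    optimal action; tau > 0 keeps alternative high-Q branches alive for diversity."""
    qs = [q_uniform(state, a) for a in VOCAB]
    weights = [math.exp(q / tau) for q in qs]
    total = sum(weights)
    return random.choices(VOCAB, weights=[w / total for w in weights])[0]

# Roll out one trajectory with the softmax-over-Q sampling rule.
s: Tuple[str, ...] = ()
while len(s) < HORIZON:
    s = step(s, rover_style_sample(s))
print(s, reward(s))
```

In a real LLM setting the exhaustive recursion above is of course infeasible; the toy exists only to make the average-then-softmax structure concrete, and to show why no policy evaluation/improvement alternation appears anywhere in the loop.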
Despite its minimalist design, ROVER preserves exploration throughout training and avoids brittle heuristics. Empirically, across multiple base models and standard math reasoning benchmarks, it delivers notable improvements: +8.2 on pass@1, +16.8 on pass@256, and +17.6% diversity, as reported by the authors.
Why it matters: if random‑policy valuation suffices for structured RLVR regimes, we can streamline training, reduce tuning burden, and maintain diversity—crucial for discovering alternative correct pathways in reasoning trees.
Who should care: teams scaling math‑centric RLVR, researchers probing the limits of policy iteration in LLMs, and practitioners battling instability and mode collapse.
Paper: arXiv: ROVER
Register: AiFeta
#RLVR #LLM #MathReasoning #ReinforcementLearning #Qlearning #Exploration #AITraining