Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
ROVER: a minimalist RLVR method that recovers optimal actions from a random policy’s Q-values.
Training LLMs to reason with RL from verifiable rewards typically leans on heavy policy-optimization loops (e.g., PPO/GRPO) that are prone to instability and diversity collapse. This paper reframes math reasoning as a specialized finite-horizon MDP with deterministic transitions, tree-structured dynamics, and binary terminal rewards—and exploits that structure to radically simplify training.
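A hedged sketch of why that structure matters (notation is ours, not the paper's; f(s, a) denotes the deterministic transition): when the only reward is a binary terminal reward, the Q-value of a fixed uniform random policy at (s, a) equals the probability that a blind random continuation from the successor reaches a correct answer, which is strictly positive exactly when the successor's subtree contains one:

$$
Q^{\pi_{\mathrm{u}}}(s,a) \;=\; \mathbb{E}_{\pi_{\mathrm{u}}}\!\left[\, r(s_T) \;\middle|\; s_0 = s,\ a_0 = a \,\right] \;>\; 0
\quad\Longleftrightarrow\quad
\text{the subtree below } f(s,a) \text{ contains a terminal state with } r = 1 .
$$

So acting greedily (or via a softmax) on these random-policy Q-values steers into a solvable subtree whenever one exists—no policy improvement loop required.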
The core insight: you can recover the optimal action from the Q-function of a fixed uniform random policy. ROVER (Random Policy Valuation for Diverse Reasoning) operationalizes this by sampling actions from a softmax over those Q-values, bypassing generalized policy iteration and its heuristics.
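A minimal toy sketch of that mechanism, not the authors' implementation: on a small deterministic tree MDP with binary terminal rewards, it evaluates the uniform random policy's Q-values exactly by recursion (the paper instead learns these estimates with an LLM) and samples an action from a softmax over them. The helpers `children`, `is_terminal`, and `reward` are hypothetical callables you would supply.

```python
import numpy as np

def q_uniform(state, action, children, is_terminal, reward):
    """Q^{pi_uniform}(s, a): expected terminal reward when taking `action`
    at `state` and following the uniform random policy afterwards.
    Toy exact recursion on a small deterministic tree MDP."""
    next_state = children(state)[action]   # deterministic transition
    if is_terminal(next_state):
        return reward(next_state)          # binary terminal reward (0 or 1)
    n_actions = len(children(next_state))
    # Uniform policy: value of the successor is the mean of its action Q-values.
    return np.mean([q_uniform(next_state, a, children, is_terminal, reward)
                    for a in range(n_actions)])

def rover_style_sample(state, children, is_terminal, reward,
                       temperature=1.0, rng=None):
    """Sample an action from a softmax over the uniform-policy Q-values."""
    rng = rng or np.random.default_rng()
    qs = np.array([q_uniform(state, a, children, is_terminal, reward)
                   for a in range(len(children(state)))])
    logits = qs / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(qs), p=probs)
```

The softmax (rather than a hard argmax) is what keeps multiple valid reasoning paths in play, which is where the diversity claims come from.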
The result is both simpler and stronger. Across base models and benchmarks, ROVER improves pass@1 by +8.2, pass@256 by +16.8, and solution-path diversity by +17.6%, while keeping training stable. Because it preserves diverse trajectories throughout learning, ROVER sustains exploration of multiple valid reasoning paths—key for robustness and generalization.
Why it matters: When problem structure allows, minimalism beats complexity. ROVER shows that for verifiable math reasoning, careful valuation can replace heavyweight policy updates—delivering better quality and diversity with fewer moving parts.
Paper: arXiv: ROVER
Register: https://www.AiFeta.com
#RLVR #Reasoning #Math #QValues #Exploration #Stability #LLM #Minimalism