Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
ROVER replaces PPO loops with uniform‑policy Q‑values—boosting quality and diversity in math reasoning
Popular RLVR methods for LLM reasoning lean on generalized policy iteration (e.g., PPO/GRPO) but suffer from instability and diversity collapse. This paper reframes math RLVR as a specialized finite‑horizon MDP with deterministic transitions, tree‑structured dynamics, and binary terminal rewards: a setting far simpler than the general control problems these algorithms were designed for.
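For concreteness, here is a minimal, hypothetical sketch (not from the paper) of the kind of MDP described above: states are partial token sequences, each action deterministically appends a token so the dynamics form a tree, and the only reward is a binary verifier check at the horizon. The names (`VOCAB`, `HORIZON`, `step`, `verify`, `reward`) are illustrative assumptions.

```python
# Toy RLVR-style MDP: deterministic, tree-structured, binary terminal reward.
# All names here (VOCAB, HORIZON, step, verify, reward) are illustrative.
from typing import Tuple

VOCAB: Tuple[str, ...] = ("a", "b", "c")  # toy token/action set
HORIZON = 3                               # fixed generation length

def step(state: Tuple[str, ...], action: str) -> Tuple[str, ...]:
    """Deterministic transition: appending a token yields a unique child state,
    so the reachable states form a tree rooted at the empty sequence."""
    return state + (action,)

def verify(state: Tuple[str, ...]) -> float:
    """Stand-in for a verifier: 1.0 iff the completed sequence 'is correct'."""
    return 1.0 if state and state[-1] == "c" else 0.0

def reward(state: Tuple[str, ...]) -> float:
    """Binary reward, paid only at terminal (length-HORIZON) states."""
    return verify(state) if len(state) == HORIZON else 0.0
```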
The key theoretical insight: you can recover the optimal action from the Q‑function of a fixed uniformly random policy. That means no alternating policy evaluation/improvement loop is required. Building on this, the authors propose ROVER—Random Policy Valuation for Diverse Reasoning—which samples actions from a softmax over uniform‑policy Q‑values.
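A rough sketch of that idea, building on the toy MDP above (this is an assumption-laden illustration, not the authors' implementation): evaluate the Q‑function of the fixed uniform policy by averaging, rather than maximizing, over children, then sample actions from a softmax over those Q‑values. The temperature `tau`, the exhaustive recursion, and the function names are illustrative.

```python
# Minimal sketch of "random policy valuation": compute Q of the fixed uniform
# policy by backward recursion over the tree, then act via softmax over it.
# Assumes the toy MDP (VOCAB, HORIZON, step, reward) defined above.
import math
import random
from functools import lru_cache
from typing import Tuple

@lru_cache(maxsize=None)
def q_uniform(state: Tuple[str, ...], action: str) -> float:
    """Q^{pi_uniform}(s, a): expected terminal reward when taking `action` now
    and following the uniformly random policy afterwards."""
    next_state = step(state, action)
    if len(next_state) == HORIZON:
        return reward(next_state)
    # Uniform policy: average (not max) over the children's Q-values.
    return sum(q_uniform(next_state, a) for a in VOCAB) / len(VOCAB)

def rover_style_sample(state: Tuple[str, ...], tau: float = 0.1) -> str:
    """Sample an action from a softmax over uniform-policy Q-values.
    As tau -> 0 this becomes greedy, which per the paper's insight recovers the
    optimal action; tau > 0 keeps alternative high-Q branches alive for diversity."""
    qs = [q_uniform(state, a) for a in VOCAB]
    weights = [math.exp(q / tau) for q in qs]
    total = sum(weights)
    return random.choices(VOCAB, weights=[w / total for w in weights])[0]

# Roll out one trajectory with the softmax-over-Q sampling rule.
s: Tuple[str, ...] = ()
while len(s) < HORIZON:
    s = step(s, rover_style_sample(s))
print(s, reward(s))
```

In a real LLM setting the exhaustive recursion above is of course infeasible; the toy exists only to make the average-then-softmax structure concrete, and to show why no policy evaluation/improvement alternation appears anywhere in the loop.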
Despite its minimalist design, ROVER preserves exploration throughout training and avoids brittle heuristics. Empirically, across multiple base models and standard math reasoning benchmarks, it delivers notable improvements: +8.2 on pass@1, +16.8 on pass@256, and +17.6% diversity, as reported by the authors.
Why it matters: if random‑policy valuation suffices for structured RLVR regimes, we can streamline training, reduce tuning burden, and maintain diversity—crucial for discovering alternative correct pathways in reasoning trees.
Who should care: teams scaling math‑centric RLVR, researchers probing the limits of policy iteration in LLMs, and practitioners battling instability and mode collapse.
Paper: arXiv: ROVER
Register: AiFeta
#RLVR #LLM #MathReasoning #ReinforcementLearning #Qlearning #Exploration #AITraining