Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
ROVER: a minimalist RLVR method that recovers optimal actions from a random policy’s Q-values.
Training LLMs to reason with RL from verifiable rewards typically leans on heavy policy-optimization loops (e.g., PPO/GRPO) that are prone to instability and diversity collapse. This paper reframes math reasoning as a specialized finite-horizon MDP with deterministic transitions, tree-structured dynamics, and binary terminal rewards—and exploits that structure to radically simplify training.
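A hedged sketch of why that structure matters (notation is ours, not the paper's; f(s, a) denotes the deterministic transition): when the only reward is a binary terminal reward, the Q-value of a fixed uniform random policy at (s, a) equals the probability that a blind random continuation from the successor reaches a correct answer, which is strictly positive exactly when the successor's subtree contains one:

$$
Q^{\pi_{\mathrm{u}}}(s,a) \;=\; \mathbb{E}_{\pi_{\mathrm{u}}}\!\left[\, r(s_T) \;\middle|\; s_0 = s,\ a_0 = a \,\right] \;>\; 0
\quad\Longleftrightarrow\quad
\text{the subtree below } f(s,a) \text{ contains a terminal state with } r = 1 .
$$

So acting greedily (or via a softmax) on these random-policy Q-values steers into a solvable subtree whenever one exists—no policy improvement loop required.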
The core insight: you can recover the optimal action from the Q-function of a fixed uniform random policy. ROVER (Random Policy Valuation for Diverse Reasoning) operationalizes this by sampling actions from a softmax over those Q-values, bypassing generalized policy iteration and its heuristics.
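A minimal toy sketch of that mechanism, not the authors' implementation: on a small deterministic tree MDP with binary terminal rewards, it evaluates the uniform random policy's Q-values exactly by recursion (the paper instead learns these estimates with an LLM) and samples an action from a softmax over them. The helpers `children`, `is_terminal`, and `reward` are hypothetical callables you would supply.

```python
import numpy as np

def q_uniform(state, action, children, is_terminal, reward):
    """Q^{pi_uniform}(s, a): expected terminal reward when taking `action`
    at `state` and following the uniform random policy afterwards.
    Toy exact recursion on a small deterministic tree MDP."""
    next_state = children(state)[action]   # deterministic transition
    if is_terminal(next_state):
        return reward(next_state)          # binary terminal reward (0 or 1)
    n_actions = len(children(next_state))
    # Uniform policy: value of the successor is the mean of its action Q-values.
    return np.mean([q_uniform(next_state, a, children, is_terminal, reward)
                    for a in range(n_actions)])

def rover_style_sample(state, children, is_terminal, reward,
                       temperature=1.0, rng=None):
    """Sample an action from a softmax over the uniform-policy Q-values."""
    rng = rng or np.random.default_rng()
    qs = np.array([q_uniform(state, a, children, is_terminal, reward)
                   for a in range(len(children(state)))])
    logits = qs / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(qs), p=probs)
```

The softmax (rather than a hard argmax) is what keeps multiple valid reasoning paths in play, which is where the diversity claims come from.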
The result is both simpler and stronger. Across base models and benchmarks, ROVER improves pass@1 by +8.2, pass@256 by +16.8, and solution-path diversity by +17.6%, while keeping training stable. Because it preserves diverse trajectories throughout learning, ROVER sustains exploration of multiple valid reasoning paths—key for robustness and generalization.
Why it matters: When problem structure allows, minimalism beats complexity. ROVER shows that for verifiable math reasoning, careful valuation can replace heavyweight policy updates—delivering better quality and diversity with fewer moving parts.
Paper: arXiv: ROVER
Register: https://www.AiFeta.com
#RLVR #Reasoning #Math #QValues #Exploration #Stability #LLM #Minimalism