CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

A dynamic online curriculum that reshapes problems to match—and grow—model ability.

Online RL with verifiable rewards (RLVR) is a leading recipe for boosting LLM reasoning, but most methods treat all problems equally—wasting time on what’s already mastered and offering little guidance on what’s just out of reach. CLPO (Curriculum-guided Learning for Policy Optimization) fixes this by creating a real-time pedagogical loop that evolves with the model.

CLPO continuously assesses difficulty from the model’s own rollouts to build an Online Curriculum. It then performs Adaptive Problem Restructuring: diversifying medium-difficulty problems to promote generalization while simplifying hard problems to make progress tractable. In effect, the model becomes its own teacher, shaping the training frontier to maximize learning efficiency and final capability.
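For intuition, here is a minimal sketch of what a rollout-based curriculum step of this kind could look like in Python. The thresholds and the diversify/simplify hooks are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of an online-curriculum step, assuming difficulty is
# estimated from rollout pass rates under a verifiable reward. All names and
# thresholds below are assumptions for the sake of the example.
from typing import Callable, List

def pass_rate(problem: str, rollout: Callable[[str], bool], n: int = 8) -> float:
    """Estimate difficulty as the fraction of n sampled rollouts that pass
    the verifiable-reward check (True = correct)."""
    return sum(rollout(problem) for _ in range(n)) / n

def curriculum_step(problems: List[str],
                    rollout: Callable[[str], bool],
                    diversify: Callable[[str], str],
                    simplify: Callable[[str], str],
                    easy_cut: float = 0.8,
                    hard_cut: float = 0.2) -> List[str]:
    """Build the next training batch: skip mastered problems, keep and
    diversify medium ones, and simplify the ones the model cannot yet solve."""
    batch = []
    for p in problems:
        r = pass_rate(p, rollout)
        if r >= easy_cut:
            continue                    # already mastered: little learning signal
        elif r <= hard_cut:
            batch.append(simplify(p))   # too hard: restructure into an easier variant
        else:
            batch.append(p)
            batch.append(diversify(p))  # medium: add a diversified variant
    return batch
```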

Across eight demanding math and general reasoning benchmarks, CLPO delivers state-of-the-art results, with an average pass@1 improvement of +6.96% over strong baselines. The approach converts static training into a dynamic, co-evolving process—one that better aligns signal with need.

Why it matters: Curriculum and RL are natural partners. By targeting the right difficulty at the right time, CLPO reduces inefficient exploration and raises the ceiling on what small and mid-sized models can reliably solve.

Paper: arXiv: CLPO
Register: https://www.AiFeta.com

#CurriculumLearning #PolicyOptimization #RLVR #LLMReasoning #MathReasoning #RL #NLP #Efficiency
