CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

A dynamic online curriculum that reshapes problems to match—and grow—model ability.

Online RL with verifiable rewards (RLVR) is a leading recipe for boosting LLM reasoning, but most methods treat all problems equally—wasting time on what’s already mastered and offering little guidance on what’s just out of reach. CLPO (Curriculum-guided Learning for Policy Optimization) fixes this by creating a real-time pedagogical loop that evolves with the model.

CLPO continuously assesses difficulty from the model’s own rollouts to build an Online Curriculum. It then performs Adaptive Problem Restructuring: diversifying medium-difficulty problems to promote generalization while simplifying hard problems to make progress tractable. In effect, the model becomes its own teacher, shaping the training frontier to maximize learning efficiency and final capability.
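For intuition, here is a minimal sketch of what a rollout-based curriculum step of this kind could look like in Python. The thresholds and the diversify/simplify hooks are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of an online-curriculum step, assuming difficulty is
# estimated from rollout pass rates under a verifiable reward. All names and
# thresholds below are assumptions for the sake of the example.
from typing import Callable, List

def pass_rate(problem: str, rollout: Callable[[str], bool], n: int = 8) -> float:
    """Estimate difficulty as the fraction of n sampled rollouts that pass
    the verifiable-reward check (True = correct)."""
    return sum(rollout(problem) for _ in range(n)) / n

def curriculum_step(problems: List[str],
                    rollout: Callable[[str], bool],
                    diversify: Callable[[str], str],
                    simplify: Callable[[str], str],
                    easy_cut: float = 0.8,
                    hard_cut: float = 0.2) -> List[str]:
    """Build the next training batch: skip mastered problems, keep and
    diversify medium ones, and simplify the ones the model cannot yet solve."""
    batch = []
    for p in problems:
        r = pass_rate(p, rollout)
        if r >= easy_cut:
            continue                    # already mastered: little learning signal
        elif r <= hard_cut:
            batch.append(simplify(p))   # too hard: restructure into an easier variant
        else:
            batch.append(p)
            batch.append(diversify(p))  # medium: add a diversified variant
    return batch
```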

Across eight demanding math and general reasoning benchmarks, CLPO delivers state-of-the-art results, with an average pass@1 improvement of +6.96% over strong baselines. The approach converts static training into a dynamic, co-evolving process—one that better aligns signal with need.

Why it matters: Curriculum and RL are natural partners. By targeting the right difficulty at the right time, CLPO reduces inefficient exploration and raises the ceiling on what small and mid-sized models can reliably solve.

Paper: arXiv: CLPO
Register: https://www.AiFeta.com

#CurriculumLearning #PolicyOptimization #RLVR #LLMReasoning #MathReasoning #RL #NLP #Efficiency
