CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

A dynamic, self‑paced curriculum that restructures problems to match model ability in RLVR

Online Reinforcement Learning with Verifiable Rewards (RLVR) has boosted LLM reasoning, but most methods treat all problems equally, wasting effort on items the model already solves and making little progress on those beyond its current capability. CLPO fixes that with a dynamic pedagogy: it continuously assesses difficulty from rollouts, then restructures the training set so the model learns at its frontier.

Two pillars drive CLPO. First, Online Curriculum: real‑time difficulty estimation steers sampling toward the most informative items. Second, Adaptive Problem Restructuring: the model acts as its own teacher—diversifying medium items to promote generalization while simplifying hard ones into attainable steps. This transforms static optimization into a co‑evolving loop where data and policy advance together.
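To make the loop concrete, here is a minimal Python sketch of one curriculum step under these ideas; it is not the authors' implementation. The `Problem` class, the `rollout`, `diversify`, and `simplify` callables, and the easy/hard thresholds are hypothetical stand-ins for the paper's online difficulty estimation and model-as-teacher rewriting.

```python
"""Minimal sketch of a CLPO-style curriculum step (illustrative, not the paper's code)."""
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Problem:
    prompt: str
    answer: str
    pass_rate: float = 0.0               # estimated online from rollouts
    variants: List[str] = field(default_factory=list)


def estimate_difficulty(problems: List[Problem],
                        rollout: Callable[[str, str, int], int],
                        k: int = 8) -> None:
    """Re-estimate each problem's pass rate from k fresh rollouts of the current policy."""
    for p in problems:
        p.pass_rate = rollout(p.prompt, p.answer, k) / k


def restructure(problems: List[Problem],
                diversify: Callable[[str], str],
                simplify: Callable[[str], str],
                easy: float = 0.9,
                hard: float = 0.1) -> List[Problem]:
    """Build the next training pool at the model's frontier.

    - Easy items (pass_rate >= easy) are dropped: little learning signal remains.
    - Medium items are kept and gain a rewritten variant to promote generalization.
    - Hard items (pass_rate <= hard) are replaced by a simplified, attainable version.
    """
    pool: List[Problem] = []
    for p in problems:
        if p.pass_rate >= easy:
            continue
        if p.pass_rate <= hard:
            pool.append(Problem(simplify(p.prompt), p.answer))
        else:
            pool.append(p)
            pool.append(Problem(diversify(p.prompt), p.answer))
    return pool
```

A policy-optimization step on verifiable rewards would then train on the returned pool, and repeating the estimate-restructure-train cycle is what lets data and policy co-evolve as described above.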

Across eight challenging math and general reasoning benchmarks, CLPO achieves state‑of‑the‑art results, delivering an average pass@1 improvement of 6.96% over competitive baselines. The gains suggest that pacing and structure—core tenets of human pedagogy—translate into more efficient exploration and higher ceilings for LLM reasoning under verifiable signals.

Why it matters: curriculum isn’t just a data trick—it’s a control layer over learning dynamics. For practitioners, CLPO offers a principled way to concentrate compute where it counts and unlock harder problems sooner.

Paper: arXiv: CLPO
Register: AiFeta

#RLVR #CurriculumLearning #LLM #Reasoning #PolicyOptimization #MathAI #AITraining
