From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones

Evidence that RL teaches genuinely new, compositional skills—beyond mere reweighting.

Does RL truly endow LLMs with new capabilities, or does it merely reweight what is already there? This study provides concrete evidence for genuine skill acquisition via composition. In a controlled synthetic setup, a “skill” is defined as computing a string transformation f(x). When an LLM already knows f and g before RL, the authors show that RL enables it to learn the unseen composition h(x) = g(f(x)), and that this ability generalizes to compositions of more than two functions never observed during RL.
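For intuition, here is a minimal Python sketch of what such a compositional task looks like. The specific transformations (string reversal, a letter shift) are illustrative assumptions, not the paper's actual function set; the point is only that each atomic skill is simple on its own, while the evaluation asks for an unseen composition.

```python
# Two hypothetical atomic skills the model is assumed to know individually.
def f(x: str) -> str:
    """Atomic skill 1: reverse the string."""
    return x[::-1]

def g(x: str) -> str:
    """Atomic skill 2: shift each lowercase letter forward by one (a -> b, ..., z -> a)."""
    return "".join(
        chr((ord(c) - ord("a") + 1) % 26 + ord("a")) if c.islower() else c
        for c in x
    )

def compose(*fns):
    """Right-to-left composition: compose(g, f)(x) == g(f(x))."""
    def h(x):
        for fn in reversed(fns):
            x = fn(x)
        return x
    return h

h = compose(g, f)        # the unseen two-step skill h(x) = g(f(x))
deep = compose(g, f, g)  # deeper compositions probe generalization beyond two functions

print(h("abc"))     # f: "cba" -> g: "dcb"
print(deep("abc"))  # g: "bcd" -> f: "dcb" -> g: "edc"
```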

To avoid confounds such as data contamination, the synthetic framework gives precise control over task complexity and exposure. Surprisingly, the compositional skill learned on a source task transfers to a different target task: as long as the model already knows the target's atomic skills, no compositional training on the target is required. Qualitative analyses show that RL shifts the model's reasoning behavior, whereas next-token prediction training on the same data does not produce these effects.

Implication: Build base models with fundamental atomic skills, then use RL to incentivize higher-order compositions that solve complex problems. The results clarify RL’s role in post-training: not just alignment or policy shaping, but a mechanism for acquiring advanced, generalizable skills through the structured reuse of simpler ones.

Paper: arXiv: Skill Composition in RL
Register: https://www.AiFeta.com

#ReinforcementLearning #SkillComposition #Generalization #Reasoning #PostTraining #LLM #NLP #Cognition
