StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
Co-evolving a policy LLM and a generative process reward model for OR
Solving Operations Research problems with LLMs demands more than final-answer rewards. StepORLM introduces generative process supervision that evaluates the entire modeling and reasoning pipeline. At its core is a co-evolution loop: a policy model learns to solve OR tasks while a Generative Process Reward Model (GenPRM) learns to holistically assess solution steps. Two complementary signals drive progress—definitive outcome feedback from an external solver and nuanced, process-level scoring from GenPRM.
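To make the loop concrete, here is a minimal Python sketch of one round of co-evolution under the dual-feedback scheme described above. The `policy.solve`, `solver.verify`, `gen_prm.assess`, `policy.update_wdpo`, and `gen_prm.refine` interfaces are hypothetical stand-ins, not the authors' code; they only illustrate how solver outcomes and GenPRM process scores could combine into preference pairs.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    steps: list[str]       # modeling/reasoning steps emitted by the policy
    outcome_ok: bool       # did the external solver verify the final answer?
    process_score: float   # GenPRM's holistic score for the step sequence

def collect_preferences(problems, policy, gen_prm, solver, k=4):
    """Sample k solutions per problem and score each with both signals."""
    pairs = []
    for prob in problems:
        cands = []
        for _ in range(k):
            steps, answer = policy.solve(prob)              # assumed API
            cands.append(Candidate(
                steps=steps,
                outcome_ok=solver.verify(prob, answer),     # assumed API
                process_score=gen_prm.assess(prob, steps),  # assumed API
            ))
        # Rank by (solver correctness, process quality); pair best vs. worst.
        cands.sort(key=lambda c: (c.outcome_ok, c.process_score), reverse=True)
        if cands[0].outcome_ok and cands[0].process_score > cands[-1].process_score:
            pairs.append((prob, cands[0], cands[-1]))
    return pairs

def co_evolve(problems, policy, gen_prm, solver, rounds=3):
    """One reading of the loop: alternate policy and GenPRM updates."""
    for _ in range(rounds):
        pairs = collect_preferences(problems, policy, gen_prm, solver)
        policy.update_wdpo(pairs)   # W-DPO step on preference pairs (assumed API)
        gen_prm.refine(pairs)       # GenPRM refined on the same rollouts (assumed API)
```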
Training uses Weighted Direct Preference Optimization (W-DPO) to align the policy with both outcome and process preferences, while the GenPRM is refined in the same loop. The result is an 8B-parameter model that sets a new state of the art across six benchmarks, outperforming much larger generalist models, agentic pipelines, and specialized baselines. Beyond training, the co-evolved GenPRM serves as a powerful process verifier at inference time, improving inference-time scaling for both StepORLM and other LLMs.
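For orientation, a weighted DPO objective extends the standard DPO loss with a per-pair weight; the exact form of the weight, shown here as a generic $w(x, y_w, y_l)$ combining solver outcomes and GenPRM scores, is an assumption for illustration rather than a detail from the paper.

```latex
% Standard DPO loss with a per-pair weight w(x, y_w, y_l).
% How w combines solver outcomes and GenPRM process scores is an assumption.
\mathcal{L}_{\text{W-DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[
     w(x, y_w, y_l)\,
     \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
     \Big)
  \Big]
```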
- Dual feedback: solver-verified outcomes + holistic process assessment.
- Co-evolution: policy and GenPRM improve each other iteratively.
- W-DPO alignment: balances correctness with structured reasoning quality.
- Generalizable verifier: GenPRM boosts inference-time reliability across models (see the sketch below).
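A minimal best-of-N sketch of GenPRM as an inference-time verifier, assuming the same hypothetical `policy.solve` and `gen_prm.assess` interfaces as above; any LLM that returns (steps, answer) pairs could be plugged in as the policy.

```python
def best_of_n(problem, policy, gen_prm, n=8):
    """Sample n candidate solutions and keep the one GenPRM scores highest."""
    candidates = [policy.solve(problem) for _ in range(n)]        # assumed API
    best_steps, best_answer = max(
        candidates, key=lambda sa: gen_prm.assess(problem, sa[0])  # assumed API
    )
    return best_steps, best_answer
```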
Takeaway: moving from outcome-only to process-aware supervision sidesteps credit-assignment pitfalls and better captures the interdependencies of OR modeling, yielding solutions that are not only correct but also well-structured and robust.
Paper: http://arxiv.org/abs/2509.22558v1
Register: https://www.AiFeta.com
#LLM #ReinforcementLearning #OperationsResearch #ProcessSupervision #DPO #Optimization #AI4OR