StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
Co-evolving policy and process reward with solver-grounded feedback for OR tasks.
LLMs are promising solvers for Operations Research, yet two pitfalls persist: outcome-only rewards misassign credit, and discriminative process supervision often misses dependencies across modeling steps. StepORLM addresses both with a self-evolving loop that couples a policy model with a generative process reward model (GenPRM).
The framework blends definitive verification from an external OR solver with nuanced, holistic process evaluation from the GenPRM. This dual feedback aligns the policy via Weighted Direct Preference Optimization (W-DPO), while simultaneously refining the GenPRM to better judge future solutions. In effect, StepORLM learns not just to get the right answer, but to produce coherent, verifiably sound reasoning steps along the way.
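To make the training signal concrete, here is a minimal sketch of what the W-DPO step might look like, assuming each preference pair is weighted by a confidence derived from solver verification and the GenPRM's process score. The function and variable names (wdpo_loss, pair_weight, etc.) are illustrative placeholders, not taken from the paper's code, and the exact weighting scheme may differ from the authors'.

import torch
import torch.nn.functional as F

def wdpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
              pair_weight, beta=0.1):
    # Standard DPO logits: margin of the policy-vs-reference log-ratios
    # between the chosen and rejected completions.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Weight each pair by its confidence, e.g. solver pass/fail times the
    # GenPRM process score (an assumption about how the weight is formed).
    return (pair_weight * (-F.logsigmoid(logits))).mean()

# Toy usage with random tensors standing in for summed completion log-probs.
b = 4
loss = wdpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b),
                 pair_weight=torch.rand(b))
print(float(loss))

The design intent, as described above, is that pairs the solver and GenPRM both endorse pull the policy harder than pairs with shaky process quality.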
The results are compelling: an 8B-parameter StepORLM sets a new state of the art across six benchmarks, outperforming far larger generalist models, agentic tool-calling methods, and specialized baselines. Moreover, the co-evolved GenPRM emerges as a reusable process verifier, significantly boosting inference-time scaling for both StepORLM and external LLMs.
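For the inference-time scaling use, a rough sketch of best-of-N selection with the GenPRM as verifier, assuming it exposes a scalar score per candidate solution; genprm_score below is a hypothetical stand-in for that interface.

from typing import Callable, List

def best_of_n(candidates: List[str], genprm_score: Callable[[str], float]) -> str:
    # Return the candidate formulation whose process score is highest.
    return max(candidates, key=genprm_score)

# Toy usage: a dummy scorer standing in for the real verifier.
print(best_of_n(["candidate model A", "candidate model B"], genprm_score=len))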
Why this matters: in logistics, planning, and scheduling, the path is as important as the solution. By unifying solver-grounded outcomes with generative process supervision, StepORLM offers a practical route to trustworthy, auditable OR reasoning that scales with data and model capacity.
Expect rapid adoption in decision-support pipelines where correctness, rationale quality, and reproducibility must coexist.
Paper: http://arxiv.org/abs/2509.22558v1
Register: https://www.AiFeta.com
#AI #OperationsResearch #LLM #ReinforcementLearning #Optimization #DPO