StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
Co-evolving policy and process reward with solver-grounded feedback for OR tasks.
LLMs are promising solvers for Operations Research, yet two pitfalls persist: outcome-only rewards misassign credit, and discriminative process supervision often misses dependencies across modeling steps. StepORLM addresses both with a self-evolving loop that couples a policy model with a generative process reward model (GenPRM).
The framework blends definitive verification from an external OR solver with nuanced, holistic process evaluation from the GenPRM. This dual feedback aligns the policy via Weighted Direct Preference Optimization (W-DPO), while simultaneously refining the GenPRM to better judge future solutions. In effect, StepORLM learns not just to get the right answer, but to produce coherent, verifiably sound reasoning steps along the way.
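To make the training signal concrete, here is a minimal sketch of what the W-DPO step might look like, assuming each preference pair is weighted by a confidence derived from solver verification and the GenPRM's process score. The function and variable names (wdpo_loss, pair_weight, etc.) are illustrative placeholders, not taken from the paper's code, and the exact weighting scheme may differ from the authors'.

import torch
import torch.nn.functional as F

def wdpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
              pair_weight, beta=0.1):
    # Standard DPO logits: margin of the policy-vs-reference log-ratios
    # between the chosen and rejected completions.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Weight each pair by its confidence, e.g. solver pass/fail times the
    # GenPRM process score (an assumption about how the weight is formed).
    return (pair_weight * (-F.logsigmoid(logits))).mean()

# Toy usage with random tensors standing in for summed completion log-probs.
b = 4
loss = wdpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b),
                 pair_weight=torch.rand(b))
print(float(loss))

The design intent, as described above, is that pairs the solver and GenPRM both endorse pull the policy harder than pairs with shaky process quality.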
The results are compelling: an 8B-parameter StepORLM sets a new state of the art across six benchmarks, outperforming far larger generalist models, agentic tool-calling methods, and specialized baselines. Moreover, the co-evolved GenPRM emerges as a reusable process verifier, significantly boosting inference-time scaling for both StepORLM and external LLMs.
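For the inference-time scaling use, a rough sketch of best-of-N selection with the GenPRM as verifier, assuming it exposes a scalar score per candidate solution; genprm_score below is a hypothetical stand-in for that interface.

from typing import Callable, List

def best_of_n(candidates: List[str], genprm_score: Callable[[str], float]) -> str:
    # Return the candidate formulation whose process score is highest.
    return max(candidates, key=genprm_score)

# Toy usage: a dummy scorer standing in for the real verifier.
print(best_of_n(["candidate model A", "candidate model B"], genprm_score=len))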
Why this matters: in logistics, planning, and scheduling, the path is as important as the solution. By unifying solver-grounded outcomes with generative process supervision, StepORLM offers a practical route to trustworthy, auditable OR reasoning that scales with data and model capacity.
Expect rapid adoption in decision-support pipelines where correctness, rationale quality, and reproducibility must coexist.
Paper: http://arxiv.org/abs/2509.22558v1
Register: https://www.AiFeta.com
#AI #OperationsResearch #LLM #ReinforcementLearning #Optimization #DPO