StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
Co-evolving a policy LLM and a generative process reward model for OR
Solving Operations Research problems with LLMs demands more than final-answer rewards. StepORLM introduces generative process supervision that evaluates the entire modeling and reasoning pipeline. At its core is a co-evolution loop: a policy model learns to solve OR tasks while a Generative Process Reward Model (GenPRM) learns to holistically assess solution steps. Two complementary signals drive progress—definitive outcome feedback from an external solver and nuanced, process-level scoring from GenPRM.
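To make the loop concrete, here is a minimal Python sketch of one round of co-evolution under the dual-feedback scheme described above. The `policy.solve`, `solver.verify`, `gen_prm.assess`, `policy.update_wdpo`, and `gen_prm.refine` interfaces are hypothetical stand-ins, not the authors' code; they only illustrate how solver outcomes and GenPRM process scores could combine into preference pairs.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    steps: list[str]       # modeling/reasoning steps emitted by the policy
    outcome_ok: bool       # did the external solver verify the final answer?
    process_score: float   # GenPRM's holistic score for the step sequence

def collect_preferences(problems, policy, gen_prm, solver, k=4):
    """Sample k solutions per problem and score each with both signals."""
    pairs = []
    for prob in problems:
        cands = []
        for _ in range(k):
            steps, answer = policy.solve(prob)              # assumed API
            cands.append(Candidate(
                steps=steps,
                outcome_ok=solver.verify(prob, answer),     # assumed API
                process_score=gen_prm.assess(prob, steps),  # assumed API
            ))
        # Rank by (solver correctness, process quality); pair best vs. worst.
        cands.sort(key=lambda c: (c.outcome_ok, c.process_score), reverse=True)
        if cands[0].outcome_ok and cands[0].process_score > cands[-1].process_score:
            pairs.append((prob, cands[0], cands[-1]))
    return pairs

def co_evolve(problems, policy, gen_prm, solver, rounds=3):
    """One reading of the loop: alternate policy and GenPRM updates."""
    for _ in range(rounds):
        pairs = collect_preferences(problems, policy, gen_prm, solver)
        policy.update_wdpo(pairs)   # W-DPO step on preference pairs (assumed API)
        gen_prm.refine(pairs)       # GenPRM refined on the same rollouts (assumed API)
```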
Training uses Weighted Direct Preference Optimization (W-DPO) to align the policy with both outcome and process preferences, while the GenPRM is refined in the same loop. The result is an 8B-parameter model that sets a new state of the art across six benchmarks, outperforming much larger generalist models, agentic pipelines, and specialized baselines. Beyond training, the co-evolved GenPRM serves as a powerful process verifier at inference time, improving inference-time scaling for both StepORLM and other LLMs.
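For orientation, a weighted DPO objective extends the standard DPO loss with a per-pair weight; the exact form of the weight, shown here as a generic $w(x, y_w, y_l)$ combining solver outcomes and GenPRM scores, is an assumption for illustration rather than a detail from the paper.

```latex
% Standard DPO loss with a per-pair weight w(x, y_w, y_l).
% How w combines solver outcomes and GenPRM process scores is an assumption.
\mathcal{L}_{\text{W-DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[
     w(x, y_w, y_l)\,
     \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
     \Big)
  \Big]
```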
- Dual feedback: solver-verified outcomes + holistic process assessment.
- Co-evolution: policy and GenPRM improve each other iteratively.
- W-DPO alignment: balances correctness with structured reasoning quality.
- Generalizable verifier: GenPRM boosts inference-time reliability across models (see the sketch below).
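A minimal best-of-N sketch of GenPRM as an inference-time verifier, assuming the same hypothetical `policy.solve` and `gen_prm.assess` interfaces as above; any LLM that returns (steps, answer) pairs could be plugged in as the policy.

```python
def best_of_n(problem, policy, gen_prm, n=8):
    """Sample n candidate solutions and keep the one GenPRM scores highest."""
    candidates = [policy.solve(problem) for _ in range(n)]        # assumed API
    best_steps, best_answer = max(
        candidates, key=lambda sa: gen_prm.assess(problem, sa[0])  # assumed API
    )
    return best_steps, best_answer
```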
Takeaway: moving from outcome-only to process-aware supervision sidesteps credit-assignment pitfalls and better captures the interdependencies of OR modeling, yielding solutions that are not only correct but also well-structured and robust.
Paper: http://arxiv.org/abs/2509.22558v1
Register: https://www.AiFeta.com
#LLM #ReinforcementLearning #OperationsResearch #ProcessSupervision #DPO #Optimization #AI4OR