Can AI Run Real Research? InnovatorBench Puts LLM Agents to the Test
AI agents promise to speed up discovery by doing the messy parts of research: forming hypotheses, writing code, running experiments, and analyzing results. But can they actually carry a project end to end?
InnovatorBench is a new benchmark and companion execution platform that tests agents on realistic Large Language Model (LLM) research workflows.
- 20 tasks across data construction/filtering/augmentation, loss and reward design, and scaffold construction
- Requires runnable artifacts; evaluates correctness, performance, output quality, and uncertainty
- Powered by ResearchGym: rich action space, distributed long-horizon runs, async monitoring, snapshotting
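For a concrete picture of what those platform capabilities look like from an agent's point of view, here is a minimal, purely illustrative sketch of an agent-facing research environment: blocking shell actions, background jobs for long-horizon runs, async polling, and workspace snapshots. The class and method names (ResearchEnv, run_command, submit_job, poll_job, snapshot) are assumptions for illustration only, not ResearchGym's actual API.

```python
# Illustrative sketch only: names and methods are placeholders,
# not the ResearchGym interface described in the paper.
import subprocess, shutil, time, uuid
from pathlib import Path

class ResearchEnv:
    def __init__(self, workspace: str):
        self.workspace = Path(workspace)
        self.workspace.mkdir(parents=True, exist_ok=True)
        self.jobs: dict[str, subprocess.Popen] = {}

    def run_command(self, cmd: str, timeout: int = 600) -> str:
        """Blocking shell action for short commands (inspect data, run tests)."""
        out = subprocess.run(cmd, shell=True, cwd=self.workspace,
                             capture_output=True, text=True, timeout=timeout)
        return out.stdout + out.stderr

    def submit_job(self, cmd: str) -> str:
        """Non-blocking action for long-horizon runs such as training."""
        job_id = uuid.uuid4().hex[:8]
        self.jobs[job_id] = subprocess.Popen(cmd, shell=True, cwd=self.workspace)
        return job_id

    def poll_job(self, job_id: str) -> str:
        """Async monitoring: check on a background run without blocking."""
        code = self.jobs[job_id].poll()
        return "running" if code is None else f"finished with exit code {code}"

    def snapshot(self, tag: str) -> Path:
        """Copy the workspace so a failed experiment can be rolled back."""
        dest = self.workspace.parent / f"snapshot-{tag}-{int(time.time())}"
        shutil.copytree(self.workspace, dest)
        return dest
```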
The team built a lightweight ReAct-style agent that plans and executes actions with frontier models (e.g., Claude-4, GPT-5, GLM-4.5, Kimi-K2).
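For context, ReAct-style agents interleave model "thoughts" with tool calls and feed each observation back into the prompt before the next step. The sketch below shows that loop in its simplest form; call_llm, TOOLS, run_python, and the message format are hypothetical placeholders, not the paper's agent scaffold.

```python
# Minimal ReAct-style loop: thought -> action -> observation, repeated.
# Everything here is an illustrative assumption, not the benchmark's agent.
import json

def run_python(code: str) -> str:
    """Toy tool: execute a snippet and return captured stdout or the error."""
    import io, contextlib
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        return f"error: {e}"
    return buf.getvalue() or "(no output)"

TOOLS = {"run_python": run_python}

def call_llm(messages: list[dict]) -> dict:
    """Placeholder for a frontier-model call that returns either
    {"thought": ..., "action": ..., "args": {...}} or {"final": ...}."""
    raise NotImplementedError("wire up your model provider here")

def react_loop(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(messages)
        if "final" in step:                 # the agent decides it is done
            return step["final"]
        tool = TOOLS[step["action"]]        # look up the requested tool
        observation = tool(**step["args"])  # execute it
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "stopped: step budget exhausted"
```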
Results: agents are promising on code-heavy tasks but brittle on algorithmic tasks and long-horizon decisions, showing impatience, weak resource management, and template-driven reasoning.
Agents often need 11+ hours to reach their best scores, underscoring the benchmark’s difficulty and why end-to-end evaluation matters.
Paper: http://arxiv.org/abs/2510.27598v2
Register: https://www.AiFeta.com
AI LLM Agents Research Benchmark MachineLearning NLP Evaluation arXiv