Can “Vibe Coding” Beat Grad Students? Not Yet.
LLMs can write tidy code that passes unit tests, but can they build money-making agents that plan, bid, and deliver under pressure? This study pitted 40 LLM-coded agents (prompted with methods including "vibe coding," i.e., high-level guidance) against 17 agents hand-coded by graduate CS students in a realistic challenge: win auctions and plan pickup-and-delivery routes under vehicle capacity limits.
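To give a flavor of the strategy-aware code the agents had to produce, here is a minimal, hypothetical sketch (not taken from the paper): one classic approach is to value an auctioned task by the marginal cost of inserting its pickup and delivery into the vehicle's current route, subject to the capacity limit, and bid around that value. All function names and numbers below are illustrative assumptions.

```python
from itertools import combinations

def route_cost(route, dist, depot=0):
    """Travel cost of visiting stops in order, starting from the depot."""
    cost, prev = 0.0, depot
    for stop in route:
        cost += dist[prev][stop]
        prev = stop
    return cost

def capacity_ok(route, demand, capacity):
    """Running load must stay within capacity (pickups add load, deliveries remove it)."""
    load = 0
    for stop in route:
        load += demand[stop]
        if load > capacity or load < 0:
            return False
    return True

def marginal_insertion_cost(route, pickup, delivery, dist, demand, capacity):
    """Cheapest feasible way to slot a (pickup, delivery) pair into an existing route.

    Returns the extra travel cost, or None if no feasible insertion exists;
    an agent could bid this marginal cost plus a margin in the auction.
    """
    base = route_cost(route, dist)
    best = None
    # Try every position for the pickup and every later position for the delivery.
    for i, j in combinations(range(len(route) + 2), 2):
        candidate = route[:i] + [pickup] + route[i:]
        candidate = candidate[:j] + [delivery] + candidate[j:]
        if capacity_ok(candidate, demand, capacity):
            extra = route_cost(candidate, dist) - base
            if best is None or extra < best:
                best = extra
    return best

# Toy example: depot is node 0; task = pick up 2 units at node 1, drop them at node 2.
dist = [[0, 4, 6, 3],
        [4, 0, 2, 5],
        [6, 2, 0, 4],
        [3, 5, 4, 0]]
demand = {1: 2, 2: -2}
print(marginal_insertion_cost([], pickup=1, delivery=2,
                              dist=dist, demand=demand, capacity=3))  # -> 6.0
```

Even this simple marginal-cost bidder already mixes combinatorial routing with economic reasoning, which is exactly the combination the benchmark stresses.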
- Humans dominated: the top 5 spots were consistently human-coded.
- 33 of 40 LLM agents lost to very simple baselines.
- Even when given the best human solution to “improve,” the best LLM made it worse.
Why it matters: Popular benchmarks focus on syntax and unit tests, but real-world coding often demands strategic planning, optimization, and multi-agent reasoning. This work shows today’s LLMs still struggle to synthesize competitive, strategy-aware code—and calls for new benchmarks that test reasoning-driven code generation.
Paper: https://arxiv.org/abs/2511.20613v1
Register: https://www.AiFeta.com
AI LLM Coding Benchmark MultiAgent Logistics Planning Auctions Research