On Evaluating LLM Alignment by Evaluating LLMs as Judges
How do we know whether an AI model is truly aligned with human preferences, that is, helpful, honest, safe, and instruction-following? This paper explores a surprisingly effective shortcut: judge the judges. Instead of grading a model's open-ended answers directly, which requires substantial human effort or very strong AI judges, the authors