A new, broader test shows what AI data assistants can and cannot do

Researchers have built a large, practical test for AI tools that help with data work. The results suggest these systems handle routine, well-structured tasks fairly well, but still stumble when the data are messy or when images and text must be understood together. This matters because many organisations are starting to use such tools in everyday analysis and reporting.

Background: why now

Companies are experimenting with AI "data assistants" that write code, run analyses and summarise findings. Testing them has been difficult: real projects rarely have one correct answer. A multi-institution team of researchers (the paper lists several universities and labs) has posted a study on arXiv introducing DSAEval, a test collection of 641 real data problems drawn from 285 datasets. The benchmark covers spreadsheets, text and images, and the study uses it to evaluate 11 modern systems built on large language models (AI programs trained on vast amounts of text).

What the authors see as the structural issue

According to the authors, most existing tests check simple, one-off questions. Real data science is different. It involves many steps, multiple data types, and partial answers that build on each other. Success is not just about producing code; it is about choosing a sensible approach, updating it after new findings, and delivering results that hold up. DSAEval tries to reflect this by: allowing the AI to read both text and images; supporting multi-step back-and-forth; and grading reasoning, code and final results together.
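
The paper's grading rubric is not reproduced here, but a small sketch helps make the idea concrete. The Python below is a hypothetical illustration of scoring reasoning, code and results together across several steps; the names (StepResult, score_task) and the equal weighting are assumptions made for this article, not DSAEval's actual implementation.

  from dataclasses import dataclass

  @dataclass
  class StepResult:
      reasoning_ok: bool    # did the assistant justify its approach sensibly?
      code_ran: bool        # did the generated code execute without errors?
      output_correct: bool  # did the step's output match the reference?

  def score_task(steps):
      """Average a per-step score weighting reasoning, code and output equally."""
      if not steps:
          return 0.0
      per_step = [(s.reasoning_ok + s.code_ran + s.output_correct) / 3 for s in steps]
      return sum(per_step) / len(per_step)

  # A three-step task in which the middle step's code failed to run.
  task = [StepResult(True, True, True),
          StepResult(True, False, False),
          StepResult(True, True, True)]
  print(round(score_task(task), 2))  # 0.78

Grading this way means an assistant that reasons well but ships broken code, or the reverse, scores lower than one that does both.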

A concrete example

Imagine a retailer asking why a product’s sales fell. A helpful assistant must load and clean tables, join data from different sources, read a screenshot of a chart, form a hypothesis, write code, notice an error and correct it. DSAEval simulates this kind of workflow and scores each stage, not just the last answer.
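
To make that workflow more tangible, here is a minimal Python sketch of the tabular steps such an assistant would have to produce. The file names, columns and the price-rise hypothesis are invented for illustration, and the image-reading step (the chart screenshot) is left out because a few lines of pandas cannot show it.

  import pandas as pd

  # Step 1: load and lightly clean two tables from different sources.
  sales = pd.read_csv("sales.csv", parse_dates=["date"])
  prices = pd.read_csv("prices.csv", parse_dates=["date"])
  sales = sales.dropna(subset=["units_sold"])

  # Step 2: join them so price changes line up with sales.
  merged = sales.merge(prices, on=["product_id", "date"], how="left")

  # Step 3: aggregate by month and take a first look at the hypothesis
  # that sales fell after a price rise.
  merged["month"] = merged["date"].dt.to_period("M")
  monthly = merged.groupby(["product_id", "month"])[["units_sold", "price"]].mean()
  print(monthly.corr())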

Key findings

The team reports that Claude-Sonnet-4.5 performed best overall. GPT-5.2 finished tasks fastest, and MiMo-V2-Flash delivered the lowest cost per result. Letting systems read images as well as text improved performance on vision tasks by 2.04% to 11.30%. Even so, current tools do much better on tidy tables than on unstructured material such as raw text or pictures.

Main risk: speed and scale

The authors warn that without realistic testing, organisations may over-trust these assistants. Errors that look minor at the code level can mislead decisions when scaled across many reports or dashboards, especially in unstructured domains.

What they propose

The study recommends broader, multi-step tests before deployment; enabling image-and-text understanding when tasks require it; reporting speed, cost and quality together; keeping a human in the loop for high-impact work; and prioritising research on unstructured data. The authors also call for shared evaluation practices across labs.
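
As a rough illustration of what reporting speed, cost and quality together could look like, the sketch below prints a small side-by-side summary. The system names and figures are made up for this article and do not come from the paper.

  # Made-up figures for two hypothetical systems; only the shape of the report matters.
  results = [
      {"system": "model-a", "accuracy": 0.71, "sec_per_task": 42, "usd_per_task": 0.19},
      {"system": "model-b", "accuracy": 0.66, "sec_per_task": 18, "usd_per_task": 0.05},
  ]

  print(f"{'system':<10}{'accuracy':>10}{'sec/task':>10}{'usd/task':>10}")
  for r in results:
      print(f"{r['system']:<10}{r['accuracy']:>10.2f}{r['sec_per_task']:>10}{r['usd_per_task']:>10.2f}")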

Bottom line

DSAEval shows steady progress but also clear limits. AI data assistants are becoming useful for routine analysis, yet they need better handling of complex, messy data before they can be trusted more widely.

In a nutshell: A new, real-world test finds AI data assistants are competent on tidy, routine tasks but still unreliable with messy, multimodal problems.

  • Real projects need multi-step reasoning, not just one-off answers.
  • Reading both images and text helps, but unstructured data remain hard.
  • Evaluate quality, speed and cost together, with human oversight for important decisions.

Paper: https://arxiv.org/abs/2601.13591v1

