A new, broader test shows what AI data assistants can and cannot do

Researchers have built a large, practical test for AI tools that help with data work. The results suggest these systems handle routine, well-structured tasks fairly well, but still stumble when the data are messy or when images and text must be understood together. This matters because many organisations are starting to use such tools in everyday analysis and reporting.

Background: why now

Companies are experimenting with AI "data assistants" that write code, run analyses and summarise findings. Testing them has been difficult, because real projects rarely have one correct answer. A team of researchers from several universities and labs has posted a study on arXiv introducing DSAEval, a test collection of 641 real data problems drawn from 285 datasets. It covers spreadsheets, text and images, and evaluates 11 modern systems built on large language models (AI programs trained on vast amounts of text).

What the authors see as the structural issue

According to the authors, most existing tests check simple, one-off questions. Real data science is different. It involves many steps, multiple data types, and partial answers that build on each other. Success is not just about producing code; it is about choosing a sensible approach, updating it after new findings, and delivering results that hold up. DSAEval tries to reflect this in three ways: it lets the AI read both text and images, it supports multi-step back-and-forth, and it grades reasoning, code and final results together.

A concrete example

Imagine a retailer asking why a product’s sales fell. A helpful assistant must load and clean tables, join data from different sources, read a screenshot of a chart, form a hypothesis, write code, notice an error and correct it. DSAEval simulates this kind of workflow and scores each stage, not just the last answer.
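
To see why scoring each stage matters, here is a minimal sketch in Python. It is not DSAEval's actual grading scheme, which the article does not describe in detail. The stage names, the 0-to-1 score scale and the equal weighting are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str     # hypothetical stage label
    score: float  # assumed scale: 0.0 (failed) to 1.0 (fully correct)


def final_only_score(stages):
    """Grade a run by its last stage alone, as a one-off test would."""
    return stages[-1].score


def staged_score(stages, weights=None):
    """Weighted mean of per-stage scores; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(stages)
    total = sum(weights)
    return sum(w * s.score for w, s in zip(weights, stages)) / total


# Invented scores for the retailer example: the final answer happens to be
# right, but the join was only partly correct and the chart was misread.
stages = [
    StageResult("load_and_clean", 1.0),
    StageResult("join_sources", 0.5),
    StageResult("read_chart", 0.0),
    StageResult("final_answer", 1.0),
]

print(final_only_score(stages))  # 1.0
print(staged_score(stages))      # 0.625
```

With these invented numbers, a final-answer-only test reports a perfect run, while the per-stage view shows that two intermediate steps went wrong.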

Key findings

The team reports that Claude-Sonnet-4.5 performed best overall. GPT-5.2 finished tasks fastest, and MiMo-V2-Flash delivered the lowest cost per result. Letting systems read images as well as text improved performance on vision tasks by 2.04% to 11.30%. Even so, current tools do much better on tidy tables than on unstructured material such as raw text or pictures.

Main risk: speed and scale

The authors warn that without realistic testing, organisations may over-trust these assistants. Errors that look minor at the code level can mislead decisions when scaled across many reports or dashboards, especially in unstructured domains.

What they propose

The study recommends broader, multi-step tests before deployment; enabling image-and-text understanding when tasks require it; reporting speed, cost and quality together; keeping a human in the loop for high-impact work; and prioritising research on unstructured data. The authors also call for shared evaluation practices across labs.
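
One way to report speed, cost and quality together is to list the systems that no other system beats on all three at once. The sketch below illustrates that idea; it is not a method taken from the paper. The system names and all figures are hypothetical placeholders, not results from the study.

```python
from typing import NamedTuple


class Report(NamedTuple):
    system: str
    quality: float  # higher is better (assumed 0-1 scale)
    seconds: float  # lower is better
    dollars: float  # lower is better


def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one."""
    no_worse = (a.quality >= b.quality and a.seconds <= b.seconds
                and a.dollars <= b.dollars)
    better = (a.quality > b.quality or a.seconds < b.seconds
              or a.dollars < b.dollars)
    return no_worse and better


def pareto_front(reports):
    """Keep only the reports that no other report dominates."""
    return [r for r in reports if not any(dominates(o, r) for o in reports)]


# Hypothetical figures, for illustration only.
reports = [
    Report("system_a", 0.80, 120.0, 0.50),  # best quality
    Report("system_b", 0.70, 40.0, 0.30),   # fastest
    Report("system_c", 0.65, 90.0, 0.40),   # worse than system_b on all axes
    Report("system_d", 0.60, 100.0, 0.05),  # cheapest
]

front = [r.system for r in pareto_front(reports)]
print(front)  # ['system_a', 'system_b', 'system_d']
```

In this made-up example three systems remain, each the best on a different axis. A single ranking would hide that trade-off.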

Bottom line

DSAEval shows steady progress but also clear limits. AI data assistants are becoming useful for routine analysis, yet they need better handling of complex, messy data before they can be trusted more widely.

In a nutshell: A new, real-world test finds AI data assistants are competent on tidy, routine tasks but still unreliable with messy, multimodal problems.

Real projects need multi-step reasoning, not just one-off answers.
Reading both images and text helps, but unstructured data remain hard.
Evaluate quality, speed and cost together, with human oversight for important decisions.

Paper: https://arxiv.org/abs/2601.13591v1

