TruthTensor tests AI in motion: real‑world choices, not quiz questions

A new study proposes a clearer way to test artificial intelligence when the world does not stand still. Instead of quiz-like benchmarks, the approach follows live events and asks models to assign probabilities, then tracks how accurate, stable and careful they are over time. This matters because many organisations now lean on AI for forecasts and decisions under uncertainty.

Why this is being discussed now

The paper, posted on arXiv by Shirin Shahabi, Spencer Graham and Haruna Isah, comes from researchers working across universities and industry. The authors argue that common tests miss the messiness of real life: questions change, new information arrives and models may have seen parts of old test sets during training. As AI enters newsrooms, boardrooms and public services, evaluation needs to reflect that reality.

What the authors see as the structural problem

Traditional scores show whether a model picked the right answer on a fixed list. They reveal little about how the model handles uncertainty, how its views shift as facts emerge, or how well its stated probabilities match outcomes. The authors name three missing pieces: calibration (do 60% predictions come true about 60% of the time?), drift (how much does the model swing as news breaks?) and risk sensitivity (does it overstate its confidence?).
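The calibration question can be made concrete with a small Python sketch. This is an illustration, not the paper's implementation: the function name `calibration_table`, the equal-width binning and the sample data are all assumptions made for this example.

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width probability bins and compare the
    mean stated probability with the observed frequency in each bin.

    Hypothetical helper for illustration; the binning scheme is an assumption.
    Returns a list of (bin_index, count, mean_probability, observed_frequency).
    """
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        table.append((idx, len(pairs), mean_p, freq))
    return table

# Made-up forecasts: five at 60% and two at 90%, with 1 = event happened.
probs = [0.6, 0.6, 0.6, 0.6, 0.6, 0.9, 0.9]
outcomes = [1, 1, 1, 0, 0, 1, 0]

for idx, count, mean_p, freq in calibration_table(probs, outcomes):
    print(f"bin {idx}: n={count}, stated={mean_p:.2f}, observed={freq:.2f}")
```

In this made-up data the 60% forecasts come true three times out of five, so that bin is well calibrated, while the 90% forecasts come true only once in two. With so few samples the second gap proves little, which is one reason calibration has to be tracked over many questions.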

A concrete example

Consider a live market that trades on the chance a candidate wins an election. Two models might end with similar overall accuracy. Yet one may jump from 95% to 40% week to week as headlines roll in, while another stays near 60–65% and slowly adjusts. For a newsroom planning coverage or a city office preparing for scenarios, the first model’s sharp swings can mislead, even if its final score looks fine.
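One simple way to put a number on the swings described above is the average absolute change between consecutive forecasts. The Python sketch below assumes that definition for illustration; the paper may define drift differently, and the function name `mean_abs_change` and both weekly series are hypothetical.

```python
def mean_abs_change(series):
    """Average absolute step between consecutive forecasts.

    Illustrative drift proxy only; not the paper's definition.
    """
    if len(series) < 2:
        return 0.0  # a single forecast cannot drift
    steps = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(steps) / len(steps)

# Hypothetical weekly win probabilities for the same election question.
volatile_model = [0.95, 0.40, 0.90, 0.45]
steady_model = [0.60, 0.62, 0.64, 0.65]

print(f"volatile: {mean_abs_change(volatile_model):.3f}")
print(f"steady:   {mean_abs_change(steady_model):.3f}")
```

On these invented numbers the volatile model moves by about 0.50 per week on average and the steady one by about 0.017. That is a thirtyfold difference that a single end-of-period accuracy figure would not show.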

Key risk: speed and scale

The authors warn that narrow, static tests can create misplaced trust. Deployed at scale, an overconfident model can steer choices with high costs, from misallocated budgets to poor public guidance. Equally, a model that drifts can push teams into constant course corrections.

What they propose as safeguards

Their framework, TruthTensor, ties evaluation to forward-looking questions anchored in live prediction markets, where outcomes are not yet known and thus not in training data. It scores probabilities properly, measures drift and narrative stability, and specifies how humans check results. The protocol emphasises clear hypotheses, reproducible methods, transparent compute and cost reporting, and open, versioned evaluation contracts. In short: multi-axis testing before wide deployment, plus logs and audits to keep systems accountable.
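The summary does not say which scoring rules the framework uses. As an assumption, the Python sketch below shows two widely used proper scoring rules for yes/no questions, the Brier score and the logarithmic loss, to illustrate what scoring a probability (rather than a right-or-wrong answer) can look like. The function names are chosen for this example.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared gap between stated probability and the 0/1 outcome.
    Lower is better; 0.0 is a perfect score."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of the outcomes. Lower is better.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

# Two hypothetical forecasts for an event that did NOT happen (outcome 0).
cautious = brier_score([0.60], [0])
overconfident = brier_score([0.99], [0])
print(f"cautious Brier: {cautious:.4f}, overconfident Brier: {overconfident:.4f}")
```

Both forecasts are on the wrong side of 50%, so a plain accuracy score would treat them the same. Under either rule the 99% forecast is penalised far more heavily than the 60% one, which is how overstated confidence shows up in the score.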

In summary

The study’s message is restrained but practical: do not judge AI by a single number. Test how right it is, how sure it is, how its views change and what it costs to run. That is the level of evidence needed when AI advice meets real decisions.

In a nutshell: TruthTensor tests AI forecasts against live events to check accuracy, confidence and stability, helping organisations rely on models without being misled by shiny but shallow scores.

Static tests miss how models handle uncertainty and new information.
Models with similar accuracy can differ in confidence, drift and risk.
Use multi-axis, reproducible evaluations before deployment, with human oversight.

Paper: https://arxiv.org/abs/2601.13545v1


