TruthTensor tests AI in motion: real‑world choices, not quiz questions
A new study proposes a clearer way to test artificial intelligence when the world does not stand still. Instead of quiz-like benchmarks, the approach follows live events and asks models to assign probabilities, then tracks how accurate, stable and careful they are over time. This matters because many organisations now lean on AI for forecasts and decisions under uncertainty.
Why this is being discussed now
The paper, posted on arXiv by Shirin Shahabi, Spencer Graham and Haruna Isah, comes from researchers working across universities and industry. The authors argue that common tests miss the messiness of real life: questions change, new information arrives and models may have seen parts of old test sets during training. As AI enters newsrooms, boardrooms and public services, evaluation needs to reflect that reality.
What the authors see as the structural problem
Traditional scores tell whether a model picked the right answer from a fixed list. They reveal little about how the model handles uncertainty, how its views shift as facts emerge, or how well its stated probabilities match outcomes. The authors name these missing pieces calibration (do 60% predictions come true about 60% of the time?), drift (how much the model swings as news breaks) and risk sensitivity (does it overstate its confidence?).
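To make the calibration idea concrete, here is a minimal Python sketch (an illustration, not the paper's code) that bins a model's stated probabilities and checks how often predictions in each bin actually came true:

```python
# Minimal calibration check: group predictions by stated probability and
# compare each group's average stated confidence with the observed frequency
# of the event. Illustrative only; not taken from the TruthTensor paper.

def calibration_table(probs, outcomes, n_bins=10):
    """probs: model-stated probabilities in [0, 1];
    outcomes: 1 if the event happened, else 0."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        hit_rate = sum(y for _, y in b) / len(b)
        rows.append((avg_p, hit_rate, len(b)))
    return rows  # a well-calibrated model has avg_p close to hit_rate

# Example: predictions near 0.6 should come true roughly 60% of the time.
probs = [0.62, 0.58, 0.61, 0.95, 0.90, 0.30]
outcomes = [1, 1, 0, 1, 1, 0]
for avg_p, hit_rate, n in calibration_table(probs, outcomes, n_bins=5):
    print(f"stated {avg_p:.2f} -> observed {hit_rate:.2f} (n={n})")
```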
A concrete example
Consider a live market that trades on the chance a candidate wins an election. Two models might end with similar overall accuracy. Yet one may jump from 95% to 40% week to week as headlines roll in, while another stays near 60–65% and slowly adjusts. For a newsroom planning coverage or a city office preparing for scenarios, the first model’s sharp swings can mislead, even if its final score looks fine.
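To see why similar end-of-run accuracy can hide very different behaviour, here is an illustrative sketch comparing the week-to-week swings of two probability trajectories. The numbers are invented to match the example above; they are not results from the paper.

```python
# Compare week-to-week drift of two hypothetical probability trajectories.
# Both may end up similarly "accurate", yet their swings differ sharply.

def mean_weekly_swing(trajectory):
    """Average absolute change between consecutive weekly probabilities."""
    diffs = [abs(b - a) for a, b in zip(trajectory, trajectory[1:])]
    return sum(diffs) / len(diffs)

model_a = [0.95, 0.40, 0.85, 0.45, 0.90]   # jumps with every headline
model_b = [0.60, 0.62, 0.63, 0.64, 0.65]   # adjusts slowly

print("model A mean swing:", round(mean_weekly_swing(model_a), 2))  # ~0.46
print("model B mean swing:", round(mean_weekly_swing(model_b), 2))  # ~0.01
```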
Key risk: speed and scale
The authors warn that narrow, static tests can create misplaced trust. Deployed at scale, an overconfident model can steer choices with high costs, from misallocated budgets to poor public guidance. Equally, a model that drifts can push teams into constant course corrections.
What they propose as safeguards
Their framework, TruthTensor, ties evaluation to forward-looking questions anchored in live prediction markets, where outcomes are not yet known and thus not in training data. It scores probabilities properly, measures drift and narrative stability, and specifies how humans check results. The protocol emphasises clear hypotheses, reproducible methods, transparent compute and cost reporting, and open, versioned evaluation contracts. In short: multi-axis testing before wide deployment, plus logs and audits to keep systems accountable.
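One standard way to score probabilities "properly" is a proper scoring rule such as the Brier score, which rewards honest probability statements and penalises overconfidence. The sketch below is a generic illustration of that idea, not necessarily the specific scoring choice made in the paper.

```python
# Brier score: a proper scoring rule over probability forecasts.
# Lower is better; 0.0 means perfectly confident, perfectly correct forecasts.
# Generic illustration; the paper may use a different proper scoring rule.

def brier_score(probs, outcomes):
    """probs: stated probabilities of the event; outcomes: 1 or 0."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# A resolved market question where the event happened (outcome = 1) or not (0).
# An overconfident wrong forecast is penalised far more than a cautious one.
print(brier_score([0.9], [1]))   # 0.01  (confident and right)
print(brier_score([0.9], [0]))   # 0.81  (confident and wrong)
print(brier_score([0.6], [0]))   # 0.36  (cautious and wrong)
```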
In summary
The study’s message is restrained but practical: do not judge AI by a single number. Test how right it is, how sure it is, how its views change and what it costs to run. That is the level of evidence needed when AI advice meets real decisions.
In a nutshell: TruthTensor tests AI forecasts against live events to check accuracy, confidence and stability, helping organisations rely on models without being misled by shiny but shallow scores.
- Static tests miss how models handle uncertainty and new information.
- Models with similar accuracy can differ in confidence, drift and risk.
- Use multi-axis, reproducible evaluations before deployment, with human oversight.
Paper: https://arxiv.org/abs/2601.13545v1
Tags: AI, research, evaluation, forecasting, prediction markets, governance