An AI that designs its own safety tests for other AI systems
A research team has built an AI system that designs and improves safety tests for other AI models on its own. In trials, it found ways to make models break their own rules more often than human-designed methods did. This matters because safety testing needs to keep pace with rapidly changing systems.
Why this matters now
Published as an open preprint on arXiv, the work comes from researchers in university and industry labs. They call the system AgenticRed. It responds to a common problem: most automated tests still follow testing plans that people wrote by hand, which reflect human assumptions and miss many possible attack paths.
The structural problem the authors describe
According to the authors, fixing the shape of an “attack” in advance means we search only a small corner of what is possible. Designing and maintaining those scripts is also slow and costly. The team instead treats safety testing as a system-design task. An AI “agent” (a program that plans and acts step by step) proposes whole testing setups, runs them, keeps the versions that expose more flaws, and refines them in rounds—a survival‑of‑the‑fittest loop.
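The paper's own code is not reproduced here, but the loop it describes is a familiar propose-evaluate-select cycle. A minimal sketch, assuming user-supplied `propose`, `run_against_target`, and `score` functions (all names and parameters are illustrative, not the authors'):

```python
import random

def evolve_test_suites(propose, run_against_target, score,
                       rounds=10, pop_size=8, keep=3):
    """Illustrative propose-evaluate-select loop, not the paper's implementation.

    propose(parent)        -> a new candidate testing setup (parent may be None)
    run_against_target(s)  -> transcripts from running setup s on the target model
    score(transcripts)     -> how many flaws or rule-breaks the setup exposed
    """
    population = [propose(None) for _ in range(pop_size)]
    survivors = population[:keep]
    for _ in range(rounds):
        # Run every candidate setup and measure how many flaws it exposes.
        scored = [(score(run_against_target(s)), s) for s in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        survivors = [s for _, s in scored[:keep]]
        # Refine in rounds: keep the best setups and propose variations of them.
        population = survivors + [propose(random.choice(survivors))
                                  for _ in range(pop_size - keep)]
    return survivors
```

The design choice worth noticing is that selection acts on whole testing setups, not on individual prompts, which is what lets the loop escape a fixed, hand-written attack shape.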
A concrete example: pressure and threats
Consider pressure and threats. A human tester might write several messages that gradually push a model to ignore its rules. AgenticRed can invent such multi‑step sequences on its own: it might pose as a user who applies increasing pressure or offers incentives, then switch tactics if the first approach fails. The aim is not to cause harm, but to observe whether the target model yields under pressure.
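To make that concrete, here is a hypothetical sketch of how such a multi-step probe could be driven in code. The `target_model` and `held_the_line` interfaces and the escalation script are placeholder assumptions, not prompts or code from the paper:

```python
def escalation_probe(target_model, messages, held_the_line):
    """Play a fixed escalation script against a target model (hypothetical sketch).

    target_model(history) -> the model's next reply given the conversation so far
    held_the_line(reply)  -> True if the reply still follows the model's rules
    messages              -> an opening request followed by increasingly forceful follow-ups
    Returns the 1-based turn at which the model yielded, or None if it never did.
    """
    history = []
    for turn, user_msg in enumerate(messages, start=1):
        history.append({"role": "user", "content": user_msg})
        reply = target_model(history)
        history.append({"role": "assistant", "content": reply})
        if not held_the_line(reply):
            return turn  # the model gave in under pressure on this turn
    return None
```

In the system the authors describe, sequences like `messages` would themselves be generated and revised by the agent rather than written by hand.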
Key risk: speed and scale
The main risk the authors highlight is speed and scale. Because the system can generate and test many strategies automatically, it can find weaknesses in a wide range of models—open and commercial—very quickly. The same ability could be misused to probe real systems for harmful outputs or to automate coercive prompting at scale.
What the authors suggest
The authors argue this kind of automation should be used to strengthen defenses, and only under strict controls. Suggested safeguards include running it solely in contained test environments, keeping detailed logs, limiting how fast and how much it can run, and subjecting results and code to independent review. They also call for policies that require automated red‑team testing (stress‑testing by trying to make a system fail) before release and for clear reporting of remaining risks.
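Most of these safeguards are organizational, but the rate limits and detailed logging are easy to picture in code. A hypothetical wrapper around a target model, with all names and thresholds invented for illustration:

```python
import json
import time
from datetime import datetime, timezone

def make_guarded_target(target_model, log_path, max_calls, min_interval_s=1.0):
    """Wrap a target model with a call budget, pacing, and an audit log (illustrative sketch)."""
    state = {"calls": 0, "last": 0.0}

    def guarded(history):
        if state["calls"] >= max_calls:
            raise RuntimeError("test budget exhausted; human review required to continue")
        # Pace requests so the harness cannot probe faster than the agreed rate.
        wait = min_interval_s - (time.monotonic() - state["last"])
        if wait > 0:
            time.sleep(wait)
        reply = target_model(history)
        state["calls"] += 1
        state["last"] = time.monotonic()
        # Append a reviewable record of every probe and response.
        with open(log_path, "a") as f:
            f.write(json.dumps({
                "time": datetime.now(timezone.utc).isoformat(),
                "call": state["calls"],
                "history": history,
                "reply": reply,
            }) + "\n")
        return reply

    return guarded
```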
Bottom line
The study reports very high success rates compared with prior methods and shows the approach transfers across models. The technical message is that letting an AI design its own tests can reveal issues people miss. The policy message is that faster tools demand stronger brakes and oversight.
In a nutshell: An AI that designs its own safety tests can surface hidden weaknesses faster than humans, which helps defense but raises oversight needs.
- Automated test design outperforms fixed, human‑written scripts.
- Speed and transfer across many models are strengths—and risks.
- Use only in controlled settings with logging, limits, and independent review.
Paper: https://arxiv.org/abs/2601.13518v1
Register: https://www.AiFeta.com
#AI #Safety #Research #RedTeaming #Governance