An AI that designs its own safety tests for other AI systems

A research team has built an AI system that designs and improves safety tests for other AI models on its own. In trials, it found ways to make models break their own rules more often than methods designed by people. This matters because safety testing needs to keep pace with rapidly changing systems.

Why this matters now

Published as an open preprint on arXiv, the work comes from researchers in university and industry labs. They call the system AgenticRed. It responds to a common problem: most automated safety tests still follow plans that people wrote by hand, so they reflect their authors' assumptions and miss many possible attack paths.

The structural problem the authors describe

According to the authors, fixing the shape of an “attack” in advance means testers search only a small corner of what is possible. Designing and maintaining those scripts is also slow and costly. The team instead treats safety testing as a system-design task. An AI “agent” (a program that plans and acts step by step) proposes whole testing setups, runs them, keeps the versions that expose more flaws, and refines them over repeated rounds, in a survival‑of‑the‑fittest loop.
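The propose, run, keep, and refine loop described above can be illustrated with a minimal, generic sketch. It is not the authors' implementation: the names `evolve`, `toy_score`, and `toy_mutate`, the parameters, and the toy objective are all assumptions made for illustration. A "design" here is just a list of numbers, and the score is a stand-in for "how many flaws this setup exposed".

```python
import random

def evolve(score, mutate, seed_designs, rounds=10, keep=3,
           children_per_parent=4, rng=None):
    """Generic propose-evaluate-select loop (illustrative only)."""
    rng = rng or random.Random(0)
    population = list(seed_designs)
    for _ in range(rounds):
        # Propose: every surviving design spawns a few mutated variants.
        children = [mutate(parent, rng)
                    for parent in population
                    for _ in range(children_per_parent)]
        # Evaluate and select: parents compete with children,
        # and only the top scorers survive into the next round.
        candidates = population + children
        candidates.sort(key=score, reverse=True)
        population = candidates[:keep]
    return population

# Hypothetical toy objective: higher score means closer to a hidden target.
TARGET = [3.0, -1.0, 2.0]

def toy_score(design):
    return -sum((d - t) ** 2 for d, t in zip(design, TARGET))

def toy_mutate(design, rng):
    # Small random change to every component of the design.
    return [d + rng.gauss(0, 0.5) for d in design]

survivors = evolve(toy_score, toy_mutate, [[0.0, 0.0, 0.0]], rounds=30)
best = survivors[0]
print(best, toy_score(best))
```

Because parents stay in the candidate pool, the best score can never get worse from one round to the next. In a real system the scoring step would be the expensive part, since it means actually running each testing setup against a target model.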

A concrete example: pressure and threats

Consider pressure and threats. A human tester might write several messages that gradually push a model to ignore its rules. AgenticRed can invent such multi‑step sequences on its own: it might pose as a user who applies increasing pressure or offers incentives, then switch tactics if the first approach fails. The aim is not to cause harm, but to observe whether the target model yields under pressure.

Key risk: speed and scale

The main risk the authors highlight is speed and scale. Because the system can generate and test many strategies automatically, it can find weaknesses in a wide range of models—open and commercial—very quickly. The same ability could be misused to probe real systems for harmful outputs or to automate coercive prompting at scale.

What the authors suggest

The authors argue this automation should strengthen defenses under strict controls. Suggested safeguards include using it only in contained test environments, keeping detailed logs, putting limits on how fast and how much it can run, and subjecting results and code to independent review. They also call for policies that require automated red‑team testing (stress‑testing by trying to make a system fail) before release and for clear reporting of remaining risks.
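Two of the safeguards listed above, detailed logs and limits on how fast and how much the tool can run, can be sketched as a thin wrapper around whatever function executes a test. This is a hedged illustration under stated assumptions, not something taken from the paper: `GuardedRunner`, its parameters, and the dummy test function are hypothetical names.

```python
import logging
import time

logger = logging.getLogger("redteam_audit")

class GuardedRunner:
    """Wraps a test function with a run budget, a minimum gap
    between runs, and an audit trail (illustrative only)."""

    def __init__(self, run_test, max_runs, min_interval_s=0.0):
        self.run_test = run_test
        self.max_runs = max_runs              # hard cap on total runs
        self.min_interval_s = min_interval_s  # minimum seconds between runs
        self.runs = 0
        self._last_start = None
        self.audit = []                       # in-memory copy of every record

    def run(self, test_id):
        # Limit on how much: refuse to run once the budget is spent.
        if self.runs >= self.max_runs:
            raise RuntimeError("run budget exhausted")
        # Limit on how fast: wait until the minimum gap has passed.
        if self._last_start is not None:
            wait = self.min_interval_s - (time.monotonic() - self._last_start)
            if wait > 0:
                time.sleep(wait)
        self._last_start = time.monotonic()
        self.runs += 1
        result = self.run_test(test_id)
        # Detailed log: record what was run and what came back.
        self.audit.append({"run": self.runs, "test_id": test_id, "result": result})
        logger.info("run=%d test_id=%s result=%r", self.runs, test_id, result)
        return result

# Hypothetical usage with a dummy test function that always returns "pass".
runner = GuardedRunner(lambda test_id: "pass", max_runs=2)
runner.run("t1")
runner.run("t2")
```

A third call to `runner.run` would raise `RuntimeError`, because the budget of two runs is spent. Keeping the limits in the wrapper rather than in the test code means the audit trail and the limits cannot be skipped by an individual test.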

Bottom line

The study reports higher success rates than prior methods and shows that the approach transfers across models. The technical message is that letting an AI design its own tests can reveal issues people miss. The policy message is that faster tools demand stronger brakes and oversight.

In a nutshell: An AI that designs its own safety tests can surface hidden weaknesses faster than humans, which helps defense but raises oversight needs.

Automated test design outperforms fixed, human‑written scripts.
Speed and transfer across many models are strengths—and risks.
Use only in controlled settings with logging, limits, and independent review.

Paper: https://arxiv.org/abs/2601.13518v1


#AI #Safety #Research #RedTeaming #Governance
