ACE-Safety: Co-Evolution of Attack and Defense for Safer LLMs

Large language models are powerful but can be tricked into producing harmful outputs (jailbreaks). Most research studies attacks or defenses in isolation. This paper introduces ACE-Safety, a "train both sides" approach in which an attacking AI and a defending AI grow stronger together—like sparring partners—to harden real-world systems.

  • GS-MCTS: a group-aware, strategy-guided tree search that rapidly explores many jailbreak tactics, uncovering blind spots and generating diverse, realistic attack prompts (a minimal sketch follows this list).
  • AC-TGPO: a curriculum-based reinforcement learning method that jointly trains attacker and defender on increasingly difficult cases, so each round raises the bar for the next (also sketched below).
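
To make the first bullet concrete, here is a minimal sketch of strategy-guided tree search using plain UCT over a flat set of attack strategies. This is not the paper's GS-MCTS (which adds group-aware statistics and strategy guidance on top of the basic loop); `STRATEGIES` and `attack_success_rate` are hypothetical placeholders.

```python
# Minimal UCT-style tree search over jailbreak strategies -- a sketch only.
import math
import random

# Hypothetical strategy set; GS-MCTS searches a richer, guided space.
STRATEGIES = ["role-play", "obfuscation", "payload-splitting", "refusal-suppression"]

class Node:
    def __init__(self, strategy=None, parent=None):
        self.strategy, self.parent = strategy, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb1(self, c=1.4):
        # Balance exploitation (mean reward) against exploration.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def attack_success_rate(path):
    # Placeholder: in practice, compose a prompt from the strategy path,
    # query the defender model, and score it with a safety judge.
    return random.random()

def search(iterations=200):
    root = Node()
    for _ in range(iterations):
        node = root
        # Selection: descend via UCB1 while nodes are fully expanded.
        while node.children and len(node.children) == len(STRATEGIES):
            node = max(node.children, key=Node.ucb1)
        # Expansion: add one untried strategy as a child.
        tried = {c.strategy for c in node.children}
        untried = [s for s in STRATEGIES if s not in tried]
        if untried:
            child = Node(random.choice(untried), parent=node)
            node.children.append(child)
            node = child
        # Simulation: score the strategy path against the defender.
        path, n = [], node
        while n.parent:
            path.append(n.strategy)
            n = n.parent
        reward = attack_success_rate(path)
        # Backpropagation: update statistics up to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda c: c.visits).strategy
```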
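
And a toy version of the adversarial curriculum loop behind the second bullet. `ToyPolicy` and `co_evolve` are illustrative stand-ins, not the paper's API: the real AC-TGPO performs policy-gradient updates on LLMs, but the alternating round structure is the idea being shown.

```python
# Toy co-evolution loop: attacker and defender take turns training on the
# cases that currently beat the other side, so the curriculum gets harder.
import random

class ToyPolicy:
    """Hypothetical stand-in for an LLM policy with an RL update hook."""
    def __init__(self, skill=0.5):
        self.skill = skill

    def train_step(self, examples):
        # Placeholder for a policy-gradient update; here each batch of
        # examples just nudges a scalar skill level upward.
        self.skill = min(1.0, self.skill + 0.02 * len(examples))

def co_evolve(attacker, defender, seed_prompts, rounds=5):
    curriculum = list(seed_prompts)
    for _ in range(rounds):
        # Attacker phase: find prompts that slip past the current defender
        # (modeled here as a coin flip weighted by the two skill levels).
        wins = [p for p in curriculum
                if random.random() < attacker.skill * (1.0 - defender.skill)]
        attacker.train_step(wins)
        # Defender phase: train on exactly those successful attacks.
        defender.train_step(wins)
        # Curriculum update: surviving attacks seed the next, harder round.
        curriculum.extend(wins)
    return attacker, defender

attacker, defender = co_evolve(ToyPolicy(), ToyPolicy(), ["p1", "p2", "p3"])
```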

Across multiple benchmarks, this co-evolution outperforms standalone attack or defense methods, pointing to a practical path for more robust, continuously improving LLM safety.

“Don’t just build bigger walls—train better sparring partners.”

Paper: https://arxiv.org/abs/2511.19218v1 — Authors: Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang

Register: https://www.AiFeta.com

#AI #LLM #AIsafety #AdversarialML #Cybersecurity #ReinforcementLearning #MCTS #ResponsibleAI #Jailbreaks
