What Matters for Safety Alignment?
What matters for AI safety alignment?
We ran a large-scale study of 32 popular language and reasoning models to see what truly improves—or harms—safety.
- Scope: six intrinsic model traits, three attack styles, five safety datasets, 56 jailbreaks plus four chain‑of‑thought (CoT) attacks; 4.6M API calls.
- Safest models: GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B—suggesting built-in reasoning and self-reflection help guardrails stick.
- Training risks: post‑training and knowledge distillation can systematically degrade safety unless safety is treated as a first‑class objective.
- Critical vulnerability: a simple CoT “response prefix” can boost attack success 3.34× on average; for Seed-OSS-36B-Instruct it jumps from 0.6% to 96.3%. Text‑completion UIs and features that let users prefill responses are especially risky.
- Most effective jailbreaks today: roleplay, prompt injection, and gradient‑based prompt search.
Takeaway: pair stronger model design with deployment safeguards, and optimize explicitly for safety—not just capability.
Paper: https://arxiv.org/abs/2601.03868v1
Paper: https://arxiv.org/abs/2601.03868v1
Register: https://www.AiFeta.com
#AI #AIsafety #SafetyAlignment #LLM #LRM #Jailbreak #ChainOfThought #ResponsibleAI #Security #ML