Turning On AI’s Built-in Safety Radar
Jailbreak prompts can still push chatbots into producing harmful content, even after safety training. Many defenses either miss sophisticated attacks or over-block harmless requests. This paper identifies a useful clue: during generation, models carry latent safety signals that spike when content turns risky, but those signals get overridden by the