Probing AI Isn’t So Simple: Synthetic Training Can Mislead
To keep AI models honest, researchers train tiny “probes” that look inside a model’s activations to flag behaviors like deception or sycophancy. But real examples of these behaviors are rare, so teams often use synthetic AI-generated data instead.
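For readers unfamiliar with activation probes, here is a minimal sketch of the idea: pull hidden-state activations from a language model on labeled examples and fit a small classifier on top. It assumes a HuggingFace causal LM and a tiny toy dataset; the model name, layer choice, and example prompts are illustrative, not taken from the paper.

```python
# Minimal activation-probe sketch (illustrative; not the paper's exact setup).
# Assumes `torch`, `transformers`, and `scikit-learn` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # stand-in model; the paper probes larger LLMs
LAYER = 6             # hidden layer to read activations from (hypothetical choice)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_activation(text: str) -> torch.Tensor:
    """Mean-pooled hidden state at LAYER for a single prompt."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Toy labels: 1 = behavior present (e.g. a deceptive response), 0 = absent.
texts = ["I never received that file.", "Here is the file you asked for."]
labels = [1, 0]

X = torch.stack([mean_activation(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# The probe's probability for the "behavior present" class is the monitoring signal.
print(probe.predict_proba(X)[:, 1])
```

In practice, the labeled rows come either from real (on-policy) model transcripts or from synthetic AI-generated ones, and that choice is exactly what the study stress-tests.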
This study tested how well such probes generalize across eight behaviors and multiple models. The takeaway: how you generate the training data can substantially change probe performance, and the effect depends on the behavior.
- Performance on off-policy (synthetic) data often predicts success on real, on-policy tests, but not always.
- Probes for Deception and Sandbagging are especially at risk of failing when moved to real monitoring.
- The biggest pitfall is domain shift: training and testing on different domains hurts far more than the synthetic-versus-natural distinction.
Bottom line: if you lack on-policy data, using same-domain synthetic data is safer than mixing domains. We need better methods to handle distribution shifts in AI monitoring.
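A hedged sketch of the kind of check this recommendation implies: fit the probe on synthetic (off-policy) activations, then compare AUROC on same-domain on-policy data against other-domain on-policy data. Function and array names are placeholders, not the paper's evaluation code.

```python
# Sketch of a distribution-shift check (placeholder names; not the paper's code).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def shift_check(X_synth, y_synth, X_same, y_same, X_other, y_other):
    """Train on synthetic data; compare AUROC on same-domain vs. other-domain on-policy data."""
    probe = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)
    auroc_same = roc_auc_score(y_same, probe.predict_proba(X_same)[:, 1])
    auroc_other = roc_auc_score(y_other, probe.predict_proba(X_other)[:, 1])
    return {"same_domain": auroc_same,
            "other_domain": auroc_other,
            "gap": auroc_same - auroc_other}
```

A large gap is the warning sign the paper highlights: the probe may look reliable in evaluation yet fail once deployed on a different domain.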
Paper: https://arxiv.org/abs/2511.17408v1 — Kirch, Dower, Skapars, Lubana, Krasheninnikov.
Register: https://www.AiFeta.com
#AI #LLM #Safety #ML #Probing #SyntheticData #Generalization #AIResearch