Why some fine-tuned LLMs miss phishing—and how to fix it
Not all fine-tuned LLMs spot phishing equally. A new study tests Llama 3.1 8B, Gemma 2 9B, and Mistral on high-stakes phishing detection—and uses SHAP and mechanistic interpretability to reveal why models do (or don’t) generalize.
- Architecture × data diversity matters: Gemma 2 9B hits state-of-the-art performance (F1 > 91%) but only when trained on a stylistically diverse, “generalist” dataset.
- Generalization is architecture-dependent: Llama 3.1 8B excels in a narrow domain, but its performance drops notably when it must integrate stylistically diverse training data.
- Some models are inherently steadier: Mistral is a consistent, resilient performer across multiple training setups.
Bottom line: Reliable AI isn’t just about fine-tuning—it’s about validating the interplay of model architecture, data diversity, and training strategy, and auditing the flawed heuristics models learn along the way.
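On that auditing point, since the study uses SHAP: here is a minimal sketch of what a token-attribution audit of a fine-tuned phishing classifier can look like. The checkpoint name, labels, and example emails are placeholders, not artifacts from the paper; the pattern is SHAP's standard wrapper around a Hugging Face text-classification pipeline.

```python
# Minimal SHAP audit sketch. "your-org/phishing-detector" is a hypothetical
# checkpoint, not the paper's model; swap in any fine-tuned
# sequence-classification model.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="your-org/phishing-detector",  # placeholder checkpoint
    top_k=None,  # return scores for every label, as SHAP expects
)

# SHAP wraps Hugging Face text pipelines directly with a text masker.
explainer = shap.Explainer(clf)

emails = [
    "Your account is suspended. Verify your password at http://x.co/a1",
    "Hi team, attached is the Q3 report we discussed on Monday.",
]
shap_values = explainer(emails)

# Visualize per-token contributions toward each label. Tokens like
# greetings or footer boilerplate that carry high phishing attribution
# are exactly the flawed heuristics worth flagging before deployment.
shap.plots.text(shap_values[0])
```

If the attributions lean on stylistic quirks of one training corpus rather than genuine phishing cues (urgency, credential requests, suspicious URLs), that is the generalization gap the study traces across architectures.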
Paper: https://arxiv.org/abs/2601.10524v1
Register: https://www.AiFeta.com
#AI #LLMs #Security #Phishing #Generalization #Interpretability #NLP #MachineLearning