Why some fine-tuned LLMs miss phishing—and how to fix it

Not all fine-tuned LLMs spot phishing equally. A new study tests Llama 3.1 8B, Gemma 2 9B, and Mistral on high-stakes phishing detection—and uses SHAP and mechanistic interpretability to reveal why models do (or don’t) generalize.

  • Architecture × data diversity matters: Gemma 2 9B hits state-of-the-art performance (F1 > 91%) but only when trained on a stylistically diverse, “generalist” dataset.
  • Generalization is architecture-dependent: Llama 3.1 8B excels in a narrow domain, yet its performance drops notably when it has to integrate more diverse training data.
  • Some models are inherently steadier: Mistral performs consistently and resiliently across the training setups tested.

Bottom line: Reliable AI isn’t just about fine-tuning—it’s about validating the interplay of model architecture, data diversity, and training strategy, and auditing the flawed heuristics models learn along the way.
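
To make that auditing step concrete, here is a minimal sketch of a SHAP token-attribution pass over a fine-tuned email classifier. This is not the paper's code: the checkpoint name (your-org/phishing-detector-ft) and the "phishing" label are hypothetical placeholders.

```python
# Minimal sketch (assumption: a Hugging Face text-classification checkpoint
# fine-tuned for phishing; the model name and label are hypothetical).
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="your-org/phishing-detector-ft",  # hypothetical checkpoint
    top_k=None,  # return scores for every label, as SHAP expects
)

explainer = shap.Explainer(clf)
emails = [
    "URGENT: your account is suspended. Verify your password at http://bank-login.example now.",
]
shap_values = explainer(emails)

# Token-level attributions show which cues drive the "phishing" score;
# urgency phrasing and raw URLs are common (and sometimes brittle) cues.
shap.plots.text(shap_values[:, :, "phishing"])
```

If the attributions lean on surface cues like exclamation marks or URL strings rather than the message's intent, that is exactly the kind of flawed learned heuristic the study flags.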

Paper: https://arxiv.org/abs/2601.10524v1

Register: https://www.AiFeta.com

#AI #LLMs #Security #Phishing #Generalization #Interpretability #NLP #MachineLearning