Hidden Winning Tickets in Transformer Attention
Ever heard of the “lottery ticket hypothesis” in AI? It says big neural nets hide small subnetworks that can perform just as well as the full model. This paper proves a strong version of that claim for the heart of Transformers: multi-head attention (MHA).
- The big claim: Inside a randomly initialized MHA, there exists a subnetwork—found by pruning alone, with no training—that can closely approximate any target MHA with the same input dimension, provided the hidden size is large enough (it grows with the input dimension and the number of heads).
- Beyond attention: Using this result, the authors extend the strong lottery ticket theory to entire Transformers without normalization layers.
- Evidence: Experiments show the approximation error shrinks exponentially as the hidden size increases.
Why it matters: if “winning tickets” are guaranteed to exist in attention, we can prune or sparsely train large models more confidently, aiming for smaller, faster, and cheaper Transformers without sacrificing accuracy.
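To make the “strong lottery ticket” setting concrete, here is a minimal sketch (not the paper's construction): a multi-head attention layer whose weights are randomly initialized and never trained, with binary pruning masks selecting a subnetwork. All names and sparsity values here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_mha(X, W, masks, n_heads):
    """Multi-head attention where every weight matrix is element-wise
    masked (pruned) but never trained -- the strong-lottery-ticket setting."""
    T, d = X.shape
    dh = d // n_heads
    Q = X @ (W["q"] * masks["q"])
    K = X @ (W["k"] * masks["k"])
    V = X @ (W["v"] * masks["v"])
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))  # (T, T) attention
        heads.append(A @ V[:, s])                        # (T, dh) per head
    return np.concatenate(heads, axis=-1) @ (W["o"] * masks["o"])

rng = np.random.default_rng(0)
T, d, n_heads = 4, 8, 2                 # toy sizes, chosen for illustration
X = rng.standard_normal((T, d))
W = {k: rng.standard_normal((d, d)) / np.sqrt(d) for k in "qkvo"}
# Keep ~50% of each random matrix; the theory says a good mask exists,
# a random one here just demonstrates the mechanics.
masks = {k: (rng.random((d, d)) < 0.5).astype(float) for k in "qkvo"}
Y = masked_mha(X, W, masks, n_heads)
print(Y.shape)  # (4, 8): same shape as the input sequence
```

The theorem's content is that for a wide enough random MHA, masks like these can be *chosen* so the pruned layer matches any target MHA; this snippet only shows what “pruning without training” means mechanically.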
Paper: http://arxiv.org/abs/2511.04217v1
Register: https://www.AiFeta.com
#AI #MachineLearning #Transformers #DeepLearning #NeuralNetworks #EfficientAI #ModelCompression #Research