Hidden Winning Tickets in Transformer Attention

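Ever heard of the “lottery ticket” idea in AI? It says big neural nets hide small subnetworks that can perform just as well as the full model. The strong version goes further: those subnetworks already exist at random initialization and need only pruning, not training. This paper proves that strong version for the heart of Transformers: multi-head attention (MHA).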

The big claim: A randomly initialized MHA contains a subnetwork, obtained by pruning alone, that closely approximates any target MHA with the same input dimension, provided the hidden size is large enough. The required hidden size grows with the input dimension and the number of heads.
Beyond attention: Using this result, the authors extend the strong lottery ticket theory to entire Transformers without normalization layers.
Evidence: Experiments show that the approximation error between the subnetwork and the target it imitates shrinks exponentially as the hidden size increases.
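
To make "a subnetwork inside a random MHA" concrete, here is a minimal sketch in plain Python. It is not the paper's construction: it uses a single head with no output projection, the masks are drawn at random rather than chosen to match a target, and every name and size (`attention_head`, `apply_mask`, `d`, `hidden`, `tokens`) is a hypothetical choice for illustration. It only shows the mechanics: the random weights stay frozen, and a 0/1 mask decides which ones survive.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_mask(w, m):
    """Keep w[i][j] where m[i][j] == 1 and zero it elsewhere; kept values are unchanged."""
    return [[wij * mij for wij, mij in zip(wr, mr)] for wr, mr in zip(w, m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, wq, wk, wv):
    """Single head: softmax(Q K^T / sqrt(d_k)) V with Q = x wq, K = x wk, V = x wv."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(wk[0])
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        for qi in q
    ]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Gaussian random weights, standing in for a random initialization."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def random_mask(rows, cols, keep_prob, rng):
    """Random 0/1 mask (assumption: a real winning ticket would be searched for, not sampled)."""
    return [[1 if rng.random() < keep_prob else 0 for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, hidden, tokens = 4, 16, 3  # illustrative sizes only

x = random_matrix(tokens, d, rng)
wq, wk, wv = (random_matrix(d, hidden, rng) for _ in range(3))
mq, mk, mv = (random_mask(d, hidden, 0.5, rng) for _ in range(3))

# The "subnetwork": same frozen random weights, with roughly half pruned away.
subnetwork_out = attention_head(
    x, apply_mask(wq, mq), apply_mask(wk, mk), apply_mask(wv, mv)
)
```

The paper's claim is about existence: for a wide enough random head, some choice of masks brings the masked output close to that of any target head. The sketch samples masks blindly, so its output is not expected to match any particular target.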

Why it matters: if “winning tickets” are guaranteed to exist in attention, we can prune or sparsely train large models more confidently, aiming for smaller, faster, and cheaper Transformers without sacrificing accuracy.

Paper: http://arxiv.org/abs/2511.04217v1


Register: https://www.AiFeta.com

AI MachineLearning Transformers DeepLearning NeuralNetworks EfficientAI ModelCompression Research

