Making LLMs Faster: Fix Memory and Interconnect, Not Just Compute
Why running LLMs is hard (and how hardware can help)

Large language models don't just need fast math; they need to fetch and share enormous amounts of data, one token at a time. In inference, the autoregressive "decode" phase dominates, making memory and interconnect, not compute, the primary bottlenecks.
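
To see why decode is memory-bound, consider a back-of-envelope roofline calculation. The sketch below uses illustrative, assumed numbers (a 70B-parameter model in fp16 and an H100-class GPU's rough peak compute and HBM bandwidth); the exact figures matter less than the ratio they produce.

```python
# Back-of-envelope roofline for single-batch decode.
# All hardware and model numbers below are illustrative assumptions.

PARAMS = 70e9            # model parameters (e.g., a 70B model)
BYTES_PER_PARAM = 2      # fp16/bf16 weights

PEAK_FLOPS = 1e15        # ~1 PFLOP/s fp16 (H100-class GPU, assumed)
PEAK_BW = 3.35e12        # ~3.35 TB/s HBM bandwidth (assumed)

# Generating one token streams every weight from memory once:
flops_per_token = 2 * PARAMS                # ~2 FLOPs per weight (multiply-add)
bytes_per_token = PARAMS * BYTES_PER_PARAM  # weights read from HBM

intensity = flops_per_token / bytes_per_token  # achieved FLOPs per byte
balance = PEAK_FLOPS / PEAK_BW                 # FLOPs/byte needed to saturate compute

print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")
print(f"machine balance:      {balance:.1f} FLOPs/byte")

# intensity (~1) is far below balance (~300), so decode is
# bandwidth-bound and throughput is capped by memory, not math:
print(f"bandwidth-bound ceiling: {PEAK_BW / bytes_per_token:.1f} tokens/s")
```

With these assumptions, the GPU performs about 1 FLOP per byte moved while it would need roughly 300 to keep its math units busy, so faster memory and faster interconnect raise the token rate directly, while extra compute sits idle.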