Making LLMs Faster: Fix Memory and Interconnect, Not Just Compute

Why running LLMs is hard (and how hardware can help)

Large language models don't just need fast math—they need to fetch and share enormous amounts of data, one token at a time. In inference, the autoregressive "decode" phase dominates, making memory and interconnect, not compute, the true bottlenecks.
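
To see why decode leans on memory rather than math, here is a minimal back-of-envelope sketch in Python (assumed numbers, not from the paper): a hypothetical 70B-parameter fp16 model on an H100-class accelerator with roughly 1 PFLOP/s of dense fp16 compute and 3.35 TB/s of HBM bandwidth. It compares the time to do one decode step's arithmetic with the time to stream the weights from memory.

PARAMS = 70e9            # model parameters (assumed)
BYTES_PER_PARAM = 2      # fp16 weights
PEAK_FLOPS = 1e15        # accelerator peak dense fp16 FLOP/s (assumed)
HBM_BW = 3.35e12         # accelerator memory bandwidth in bytes/s (assumed)

def decode_step_time(batch_size: int) -> tuple[float, float]:
    """Return (compute time, memory time) in seconds for one decode step,
    ignoring KV-cache traffic and attention FLOPs for simplicity."""
    flops = 2 * PARAMS * batch_size          # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM   # all weights are re-read every step
    return flops / PEAK_FLOPS, bytes_moved / HBM_BW

for b in (1, 8, 64, 512):
    t_compute, t_memory = decode_step_time(b)
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"batch={b:4d}  compute={t_compute*1e3:6.2f} ms  "
          f"memory={t_memory*1e3:6.2f} ms  -> {bound}-bound")

At small batches the weight traffic dominates by two orders of magnitude; in this toy model the step only becomes compute-bound around a batch of ~300, which is why limited batching keeps decode memory-bound in practice.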

  • Trend headwinds: Bigger models, longer context windows, and limited batching all amplify memory pressure and communication costs.
  • What could fix it:
      • High-Bandwidth Flash: ~10× more memory capacity with HBM-like bandwidth to keep models close to the chips.
      • Processing-Near-Memory and 3D memory–logic stacking: bring simple operations to the data and boost on-package bandwidth.
      • Low-latency interconnects: faster links across accelerators and servers to reduce waiting during decode (see the latency sketch after this list).
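
As a rough illustration of the interconnect point, the sketch below (assumed numbers, not from the paper) models one decode step of a tensor-parallel deployment: each of 8 accelerators streams its shard of the weights from local memory, and every decoder layer adds two all-reduces whose latency sits on the critical path.

PARAMS = 70e9               # model parameters (assumed)
BYTES_PER_PARAM = 2         # fp16 weights
NUM_LAYERS = 80             # decoder layers (assumed)
HBM_BW = 3.35e12            # per-accelerator memory bandwidth in bytes/s (assumed)
ALLREDUCES_PER_LAYER = 2    # typically one after attention, one after the MLP

def per_token_latency(num_gpus: int, allreduce_latency_s: float) -> float:
    """Weight-streaming time per accelerator plus serialized all-reduce latency."""
    weight_time = (PARAMS * BYTES_PER_PARAM / num_gpus) / HBM_BW
    comm_time = NUM_LAYERS * ALLREDUCES_PER_LAYER * allreduce_latency_s
    return weight_time + comm_time

for latency_us in (2, 10, 30):
    t = per_token_latency(num_gpus=8, allreduce_latency_s=latency_us * 1e-6)
    print(f"all-reduce latency {latency_us:3d} us -> {t*1e3:5.2f} ms per token")

Because decode emits one token at a time, these all-reduces serialize rather than overlap; in this toy model, cutting per-hop latency from 30 us to 2 us shaves roughly 4.5 ms off every generated token.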

The paper focuses on datacenters, with lessons that can extend to mobile devices as on‑device AI grows.

Bottom line: To make LLMs faster, cheaper, and greener, prioritize memory and networking innovations over more compute.

Paper by Xiaoyu Ma and David Patterson: https://arxiv.org/abs/2601.05047

Register: https://www.AiFeta.com

#AI #LLM #Hardware #Datacenter #Semiconductors #Memory #Interconnect #EdgeAI
