Making LLMs Faster: Fix Memory and Interconnect, Not Just Compute

Why running LLMs is hard (and how hardware can help)

Large language models don't just need fast math—they need to fetch and share enormous amounts of data, one token at a time. In inference, the autoregressive "decode" phase dominates, making memory and interconnect, not compute, the true bottlenecks.
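
To see why decode leans on memory rather than math, here is a minimal back-of-envelope sketch in Python (assumed numbers, not from the paper): a hypothetical 70B-parameter fp16 model on an H100-class accelerator with roughly 1 PFLOP/s of dense fp16 compute and 3.35 TB/s of HBM bandwidth. It compares the time to do one decode step's arithmetic with the time to stream the weights from memory.

PARAMS = 70e9            # model parameters (assumed)
BYTES_PER_PARAM = 2      # fp16 weights
PEAK_FLOPS = 1e15        # accelerator peak dense fp16 FLOP/s (assumed)
HBM_BW = 3.35e12         # accelerator memory bandwidth in bytes/s (assumed)

def decode_step_time(batch_size: int) -> tuple[float, float]:
    """Return (compute time, memory time) in seconds for one decode step,
    ignoring KV-cache traffic and attention FLOPs for simplicity."""
    flops = 2 * PARAMS * batch_size          # ~2 FLOPs per parameter per token
    bytes_moved = PARAMS * BYTES_PER_PARAM   # all weights are re-read every step
    return flops / PEAK_FLOPS, bytes_moved / HBM_BW

for b in (1, 8, 64, 512):
    t_compute, t_memory = decode_step_time(b)
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"batch={b:4d}  compute={t_compute*1e3:6.2f} ms  "
          f"memory={t_memory*1e3:6.2f} ms  -> {bound}-bound")

At small batches the weight traffic dominates by two orders of magnitude; in this toy model the step only becomes compute-bound around a batch of ~300, which is why limited batching keeps decode memory-bound in practice.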

  • Trend headwinds: Bigger models, longer context windows, and limited batching all amplify memory pressure and communication costs.
  • What could fix it:
      • High-Bandwidth Flash: ~10× more memory capacity with HBM-like bandwidth to keep models close to the chips.
      • Processing-Near-Memory and 3D memory–logic stacking: bring simple operations to the data and boost on-package bandwidth.
      • Low-latency interconnects: faster links across accelerators and servers to reduce waiting during decode (see the latency sketch after this list).
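
As a rough illustration of the interconnect point, the sketch below (assumed numbers, not from the paper) models one decode step of a tensor-parallel deployment: each of 8 accelerators streams its shard of the weights from local memory, and every decoder layer adds two all-reduces whose latency sits on the critical path.

PARAMS = 70e9               # model parameters (assumed)
BYTES_PER_PARAM = 2         # fp16 weights
NUM_LAYERS = 80             # decoder layers (assumed)
HBM_BW = 3.35e12            # per-accelerator memory bandwidth in bytes/s (assumed)
ALLREDUCES_PER_LAYER = 2    # typically one after attention, one after the MLP

def per_token_latency(num_gpus: int, allreduce_latency_s: float) -> float:
    """Weight-streaming time per accelerator plus serialized all-reduce latency."""
    weight_time = (PARAMS * BYTES_PER_PARAM / num_gpus) / HBM_BW
    comm_time = NUM_LAYERS * ALLREDUCES_PER_LAYER * allreduce_latency_s
    return weight_time + comm_time

for latency_us in (2, 10, 30):
    t = per_token_latency(num_gpus=8, allreduce_latency_s=latency_us * 1e-6)
    print(f"all-reduce latency {latency_us:3d} us -> {t*1e3:5.2f} ms per token")

Because decode emits one token at a time, these all-reduces serialize rather than overlap; in this toy model, cutting per-hop latency from 30 us to 2 us shaves roughly 4.5 ms off every generated token.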

The paper focuses on datacenters, with lessons that can extend to mobile devices as on‑device AI grows.

Bottom line: To make LLMs faster, cheaper, and greener, prioritize memory and networking innovations over more compute.

Paper by Xiaoyu Ma and David Patterson: https://arxiv.org/abs/2601.05047

Register: https://www.AiFeta.com

#AI #LLM #Hardware #Datacenter #Semiconductors #Memory #Interconnect #EdgeAI
