
When AI inference meets the hardware bottleneck: four ways to break through

Jan 22nd, 2026

David Patterson, a legend in the field of computer architecture, has published a major paper with Google colleagues on the hardware dilemma facing large language model (LLM) inference.

A thought-provoking statistic: at the 1976 computer architecture conference, about 40% of the papers came from industry; by 2025, that share had fallen below 4%. The gap between academia and industry is widening, and this paper attempts to rebuild the bridge.

The core judgment: LLM inference is in crisis. Training demonstrates what AI can do, but the cost of inference determines whether it is commercially viable. As user numbers surge, the cost of serving the most advanced models is overwhelming enterprises.

Six trends are making inference harder: MoE (mixture-of-experts) architectures have grown to hundreds of experts, which sharply increases the pressure on memory and communication; reasoning models generate a large number of "thinking" tokens before answering, which strains both latency and memory; multimodality extends models from text to images, audio and video; long context windows improve quality but devour resources; RAG (retrieval-augmented generation) adds the burden of consulting external knowledge bases; and diffusion models demand more compute.
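To make the long-context point concrete, here is a minimal sketch of how per-sequence memory grows with context length. It uses the standard key/value (KV) cache size estimate for a transformer decoder, which the article does not spell out. The model dimensions (32 layers, 8 KV heads, head size 128, 2-byte values) are hypothetical and do not describe any model from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only for illustration.
for seq_len in (4_096, 131_072):
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=seq_len)
    print(f"{seq_len:>7} tokens -> {size / 2**30:.1f} GiB of KV cache per sequence")
```

Under these assumptions the cache grows linearly with context length, and every "thinking" token a reasoning model emits adds to it as well.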

The root of the problem is a fundamental mismatch between the design philosophy of today's mainstream AI hardware and the decode phase of LLM inference.

The FLOPS of GPUs and TPUs have grown far faster than their memory bandwidth. From 2012 to 2022, the 64-bit floating-point throughput of NVIDIA GPUs increased 80-fold, while memory bandwidth increased only 17-fold, and the gap continues to widen. Worse, the cost of HBM per unit of capacity and per unit of bandwidth is rising, while the cost of traditional DDR memory is falling. DRAM density growth is also slowing: quadrupling from 8Gb to 32Gb will take more than 10 years, compared with 3 to 6 years for earlier quadruplings.
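The figures above can be turned into two quick back-of-the-envelope numbers. This is only an arithmetic sketch using the growth factors quoted in the article; the function names are made up for illustration.

```python
def compute_to_bandwidth_growth(flops_growth: float, bandwidth_growth: float) -> float:
    """How many times faster compute grew than memory bandwidth over the same period."""
    return flops_growth / bandwidth_growth


def annual_growth(total_factor: float, years: float) -> float:
    """Constant yearly growth factor that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


# 2012 to 2022: 80x FLOPS versus 17x bandwidth (figures quoted in the article).
gap = compute_to_bandwidth_growth(80.0, 17.0)
print(f"Compute per byte of bandwidth grew about {gap:.1f}x")

# DRAM density: time taken to quadruple (3, 6 and 10 years, as quoted in the article).
for years in (3, 6, 10):
    rate = annual_growth(4.0, years) - 1.0
    print(f"4x in {years:>2} years is about {rate:.0%} per year")
```

Read this way, a chip from the end of that decade has several times more arithmetic available per byte it can fetch than one from the start, which is exactly the wrong direction for a workload limited by memory traffic.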

Cerebras and Groq tried to sidestep these challenges with SRAM-only designs, but LLMs soon outgrew the capacity of on-chip SRAM, and both companies later had to add external DRAM.

The paper proposes four research directions:

First, high-bandwidth flash. Stacking flash dies the way HBM stacks DRAM can deliver a 10-fold increase in memory capacity. Flash has limited write endurance, but model weights are frozen during inference, so the workload is almost entirely reads and suits this characteristic well. This can greatly shrink the system and reduce power consumption, cost and carbon emissions.

Second, processing-near-memory (PNM). Placing the compute logic next to the memory rather than inside it gains the advantage of high bandwidth while avoiding the software fragmentation and power consumption difficulties of traditional processing-in-memory (PIM) designs. For data center LLM inference, PNM is more feasible than PIM.

Third, 3D memory-logic stacking. Through-silicon vias (TSVs) provide a wide, dense memory interface that delivers high bandwidth at low power. Heat dissipation is the main challenge, but because the arithmetic intensity of LLM decoding is low, it can be handled by lowering the clock frequency and voltage.
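Why decoding has low arithmetic intensity can be sketched with a simple roofline-style estimate. The code below counts only weight traffic for one matrix applied to a batch of tokens and ignores activations and any cached state, so it is a simplification. The accelerator figures (1e15 FLOP/s, 3e12 bytes/s) are hypothetical and are not taken from the paper.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for one decode step.

    Each weight is read once per step (bytes_per_weight bytes) and takes part
    in one multiply and one add for every token in the batch (2 * batch_size FLOPs).
    """
    return 2.0 * batch_size / bytes_per_weight


def is_memory_bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> bool:
    """A kernel is memory bound when its intensity is below the machine's
    FLOP-per-byte balance point (peak_flops / peak_bandwidth)."""
    return intensity < peak_flops / peak_bandwidth


# Hypothetical accelerator, for illustration only.
PEAK_FLOPS = 1e15        # FLOP/s
PEAK_BANDWIDTH = 3e12    # bytes/s

for batch in (1, 32, 512):
    ai = decode_arithmetic_intensity(batch, bytes_per_weight=2)
    bound = "memory" if is_memory_bound(ai, PEAK_FLOPS, PEAK_BANDWIDTH) else "compute"
    print(f"batch {batch:>3}: {ai:>5.0f} FLOP/byte -> {bound} bound")
```

At small batch sizes the arithmetic units sit mostly idle while waiting for weights, which is why slowing the clock to manage heat costs little decode performance.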

Fourth, low-latency interconnect. LLM inference is more sensitive to network latency than to bandwidth, because messages in the decode phase tend to be small but frequent. High-connectivity topologies, in-network processing and chip-level optimization are all directions worth exploring.
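The latency-versus-bandwidth argument follows from the usual first-order cost model for sending a message: a fixed latency plus the size divided by bandwidth. The sketch below uses hypothetical link parameters (2 microseconds of latency, 100 GB/s of bandwidth) and hypothetical message sizes, none of which come from the paper.

```python
def transfer_time_s(message_bytes: float, latency_s: float,
                    bandwidth_bytes_per_s: float) -> float:
    """First-order cost of one message: fixed latency plus serialization time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_share(message_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Fraction of the total transfer time spent on fixed latency."""
    total = transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical link, for illustration only.
LATENCY_S = 2e-6       # 2 microseconds per message
BANDWIDTH = 100e9      # 100 GB/s

for label, size in (("8 KiB (small, decode-style)", 8 * 1024),
                    ("1 GiB (large, bulk transfer)", 1024**3)):
    share = latency_share(size, LATENCY_S, BANDWIDTH)
    print(f"{label}: {share:.1%} of the time is latency")
```

For small messages almost all of the time is fixed latency, so adding bandwidth barely helps, while shaving latency helps directly.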

These four directions are not mutually exclusive and can be combined. They all point to one core insight: the design principles of AI inference hardware need rethinking, shifting from the pursuit of peak compute to optimizing for memory and latency.

At the end of the paper, Patterson calls on the academic community to seize this opportunity and speed up the path from AI research to real-world deployment. With the world in urgent need of affordable AI inference, the value of hardware innovation has never been clearer.
