A snippet from the HBM roadmap article (Image credit: Future)

DeepSeek has released a new technical paper detailing a method that would let AI models rely on a queryable database of information committed to system memory. Named “Engram”, the conditional memory-based technique achieves demonstrably higher performance in long-context queries by committing sequences of data to static memory. This eases an AI model’s reliance on runtime reasoning, freeing the GPU to handle only the more complex tasks, increasing performance and reducing the reliance on high-bandwidth memory (HBM).

According to its GitHub page, Engram aims to curb the reliance on more complex memory types by instead committing a knowledge library to a more common system memory standard, such as CXL.


The reliance on HBM in AI accelerators is something that even Chinese silicon, such as Huawei’s Ascend series, cannot escape. Each stack of HBM consumes multiple memory dies, and with demand skyrocketing, easing an AI model’s reliance on the GPU’s direct high-bandwidth memory would be significant, especially amid the ongoing memory supply squeeze.

Engram would hold static memory separately from an LLM’s compute, allowing the GPU’s fast HBM to be dedicated to reasoning and, in turn, making Engram-based AI models more performant than a standard Mixture of Experts (MoE) model.
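
To make that division of labor concrete, here is a minimal, hypothetical sketch (not DeepSeek’s code) of the idea: a precomputed lookup table kept in ordinary host memory, queried by hashing short token sequences, while the GPU’s HBM stays reserved for the model proper. The table size, hashing scheme, and embedding dimension are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch only: a static "engram" table living in plain host RAM
# (or CXL-attached memory), while the transformer itself sits in GPU HBM.
import numpy as np

EMBED_DIM = 256          # assumed embedding size for the example
TABLE_SLOTS = 2**16      # real deployments would hold far more entries

# The static knowledge store: built offline, never updated during inference.
engram_table = np.zeros((TABLE_SLOTS, EMBED_DIM), dtype=np.float16)

def slot_for(token_ids: tuple[int, ...]) -> int:
    """Hash a short token sequence (an n-gram) to a fixed slot in the table."""
    return hash(token_ids) % TABLE_SLOTS

def lookup(token_ids: tuple[int, ...]) -> np.ndarray:
    """O(1) fetch from host memory -- no GPU compute, no HBM traffic."""
    return engram_table[slot_for(token_ids)]

# During inference, the fetched vector would be copied to the GPU and injected
# into the model's hidden state, leaving HBM bandwidth for the attention and
# FFN layers that actually do the reasoning.
```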

As detailed in the paper, an Engram-based model scaled to nearly 27 billion parameters beats a standard MoE model in long-context training, and storing facts externally eliminates the computational waste of having to reason them out from scratch.

A standard MoE model has to reconstruct these pieces of data every time they are referenced in a query. The model calls on its expert parameters to assemble and reason out the data each time, even though it only activates certain parts, or experts, for a given query, a technique known as conditional, or sparse, computation.
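
For illustration, the sketch below shows that MoE pattern in simplified Python, with all sizes and weights invented: a router activates only a couple of experts per token, yet those experts still rebuild any factual answer through matrix math on every pass.

```python
# Toy MoE layer, illustrative only -- not DeepSeek's architecture.
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 64

router_w = rng.standard_normal((DIM, NUM_EXPERTS))
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Run a token through only the top-k experts chosen by the router."""
    scores = x @ router_w                     # routing logits, one per expert
    top = np.argsort(scores)[-TOP_K:]         # conditional: pick 2 of 8 experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    # Even for a purely factual query, the answer is rebuilt via matrix math
    # every single time -- the "runtime reconstruction" the paper criticises.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(DIM))
```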

How Engram embeds itself into training and inference workloads (Image credit: DeepSeek)

The Engram paper adds that this kind of conditional memory would allow the model to merely ask, “Do I already have this data?”, rather than having to engage the parts of the model that deal with reasoning.
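
A rough, hypothetical sketch of that check, using an ordinary Python dictionary as a stand-in for the static table; run_expensive_model is a placeholder, not a real Engram or DeepSeek API.

```python
# Look up first, reason only on a miss -- illustrative pseudo-pipeline.
from typing import Optional

static_memory: dict[tuple[int, ...], list[float]] = {}  # precomputed entries

def run_expensive_model(ngram: tuple[int, ...]) -> list[float]:
    # Placeholder for a full forward pass through the reasoning layers.
    return [float(t) for t in ngram]

def fetch_or_compute(ngram: tuple[int, ...]) -> list[float]:
    cached: Optional[list[float]] = static_memory.get(ngram)
    if cached is not None:
        return cached                  # cheap host-memory lookup, no reasoning spent
    return run_expensive_model(ngram)  # only novel content reaches the GPU
```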

“This process essentially amounts to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning,” the paper reads.

Nvidia’s KVCache, announced at CES 2026, offloads context data to NVMe storage via BlueField-4. It acts as more of a short-term solution: it lets the model remember things you have recently said or added within the current context, and that data is, for all intents and purposes, disposable once you move on to the next query or conversation.


KVCache, while it persists across the history of your conversations or queries, does not draw on an existing base of pre-calculated data, and it is not persistent in the same way that Engram-based LLMs could be, if the paper is to be believed. To put it simply, KVCache can be likened to storing your handwritten notes, whereas Engram is a record of the whole encyclopedia.
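
Expressed as a toy sketch with invented names, the difference looks like this: per-conversation KV state that can be parked on cheaper storage and then thrown away, versus a read-only table built once and reused by every future query.

```python
# Illustrative contrast only -- not Nvidia's or DeepSeek's actual APIs.
import os
import pickle
import tempfile

def offload_kv_cache(conversation_id: str, kv_cache: dict) -> str:
    """Park per-conversation context on disk (standing in for NVMe offload)."""
    path = os.path.join(tempfile.gettempdir(), f"kv_{conversation_id}.pkl")
    with open(path, "wb") as f:
        pickle.dump(kv_cache, f)
    return path

def discard_kv_cache(path: str) -> None:
    """Once the conversation ends, the cached context has no further value."""
    os.remove(path)

# An Engram-style store, by contrast, would be read-only and conversation-agnostic:
# ENGRAM_TABLE = load_precomputed_table("encyclopedia.bin")  # built offline, never discarded
```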

BlueField-4 (Image credit: Nvidia)