# How the von Neumann bottleneck is impeding AI computing

*September 27, 2025*

AI computing has a reputation for consuming epic quantities of energy. This is partly because of the sheer volume of data being handled: training often requires billions or trillions of pieces of information to create a model with billions of parameters. But that's not the whole reason. It also comes down to how most computer chips are built.

Modern computer processors are quite efficient at performing the discrete computations they're usually tasked with. Their efficiency nosedives when they must wait for data to move back and forth between memory and compute, but they're designed to quickly switch over to some unrelated task in the meantime. For AI computing, though, almost all the tasks are interrelated, so there often isn't much other work that can be done while the processor waits, said IBM Research scientist Geoffrey Burr.

In that scenario, processors hit what is called the von Neumann bottleneck: the lag that happens when data moves more slowly than computation. It's the result of von Neumann architecture, found in almost every processor of the last six decades, in which a processor's memory and computing units are separate, connected by a bus. This setup has advantages, including flexibility, adaptability to varying workloads, and the ability to easily scale systems and upgrade components.
That makes this architecture great for conventional computing, and it won't be going away any time soon.

But for AI computing, whose operations are simple, numerous, and highly predictable, a conventional processor ends up working below its full capacity while it waits for model weights to be shuttled back and forth from memory. Scientists and engineers at IBM Research are working on new processors, like [the AIU family](https://research.ibm.com/blog/aiu-chip-family-ibm-research), which use various strategies to break down the von Neumann bottleneck and supercharge AI computing.

## Why does the von Neumann bottleneck exist?

The von Neumann bottleneck is named for mathematician and physicist John von Neumann, who first circulated [a draft of his idea](https://archive.org/details/firstdraftofrepo00vonn/page/n1/mode/2up) for a stored-program computer in 1945. In that paper, he described a computer with a processing unit, a control unit, memory that stored data and instructions, external storage, and input/output mechanisms. His description didn't name any specific hardware, likely to avoid security clearance issues with the US Army, for whom he was consulting. Almost no scientific discovery is made by one individual, though, and von Neumann architecture is no exception. Von Neumann's work built on that of J. Presper Eckert and John Mauchly, who invented the Electronic Numerical Integrator and Computer (ENIAC), the first programmable, general-purpose electronic digital computer.
In the time since that paper was written, von Neumann architecture has become the norm.

"The von Neumann architecture is quite flexible; that's the main benefit," said IBM Research scientist Manuel Le Gallo-Bourdeau. "That's why it was first adopted, and that's why it's still the prominent architecture today."

Discrete memory and computing units mean you can design them separately and configure them more or less any way you want. Historically, this has made it easier to design computing systems, because the best components can be selected and paired based on the application.

Even the cache memory, which is integrated into a single chip with the processor, can still be individually upgraded. "I'm sure there are implications for the processor when you make a new cache memory design, but it's not as difficult as if they were coupled together," Le Gallo-Bourdeau said. "They're still separate. It allows some freedom in designing the cache separately from the processor."

## How the von Neumann bottleneck reduces efficiency

For AI computing, the von Neumann bottleneck creates a twofold efficiency problem: the number of model parameters (or weights) that have to move, and how far they have to move. More model weights mean larger storage, which usually means more distant storage, said IBM Research scientist Hsinyu (Sidney) Tsai. "Because the quantity of model weights is very large, you can't afford to hold them for very long, so you need to keep discarding and reloading," she said.

The main energy expenditure during AI runtime is data transfer: bringing model weights back and forth from memory to compute. By comparison, the energy spent doing computations is low.
In deep learning models, for example, the operations are almost all relatively simple matrix-vector multiplications. Compute still accounts for around 10% of the energy of modern AI workloads, so it isn't negligible, said Tsai. "It is just found to be no longer dominating energy consumption and latency, unlike in conventional workloads," she added.

About a decade ago, the von Neumann bottleneck wasn't a significant issue, because processors and memory weren't so efficient, at least compared to the energy spent transferring data, said Le Gallo-Bourdeau. But data transfer efficiency hasn't improved as much as processing and memory have over the years, so now processors complete their computations much more quickly and sit idle while data moves across the von Neumann bottleneck.

The farther the memory is from the processor, the more energy it costs to move data. On a basic physical level, an electrical copper wire is charged to propagate a 1, and it's discharged to propagate a 0. The energy spent charging and discharging the wires is proportional to their length, so the longer the wire, the more energy you spend. A longer wire also means greater latency, because the charge takes more time to propagate or dissipate.

Admittedly, the time and energy cost of each data transfer is low, but every time you want to propagate data through a large language model, you need to load up to billions of weights from memory. This can mean using the DRAM of one or more other GPUs, because one GPU doesn't have enough memory to store them all. After the weights are downloaded to the processor, it performs its computations and sends the result to another memory location for further processing.

Short of eliminating the von Neumann bottleneck, one solution is to close that distance.
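The imbalance between transfer energy and compute energy can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only: the layer width and the per-byte and per-operation energy figures are assumptions for the sake of the arithmetic, not measured values from the article.

```python
# Back-of-envelope comparison: energy to fetch weights from off-chip DRAM
# vs. energy to compute, for one d x d matrix-vector multiply (one layer).
# All constants below are rough, assumed figures for illustration.

D = 4096                  # layer width (assumed)
BYTES_PER_WEIGHT = 2      # 16-bit weights (assumed)
E_DRAM_PER_BYTE = 10e-12  # ~10 pJ per byte fetched from DRAM (assumed)
E_MAC = 0.5e-12           # ~0.5 pJ per multiply-accumulate (assumed)

weights_moved = D * D * BYTES_PER_WEIGHT        # bytes fetched per pass
transfer_energy = weights_moved * E_DRAM_PER_BYTE
compute_energy = D * D * E_MAC                  # one MAC per weight

print(f"transfer: {transfer_energy * 1e6:.1f} microjoules")
print(f"compute:  {compute_energy * 1e6:.1f} microjoules")
print(f"ratio:    {transfer_energy / compute_energy:.0f}x")
```

Under these assumed figures, moving the weights costs about 40 times more energy than the arithmetic performed on them, which is why data transfer, not computation, dominates.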
"The entire industry is working to try to improve data localization," Tsai said. IBM Research scientists recently announced one such approach: a [polymer optical waveguide for co-packaged optics](https://research.ibm.com/blog/co-packaged-optics-to-supercharge-generative-ai-computing). This module brings the speed and bandwidth density of fiber optics to the edge of chips, supercharging their connectivity and greatly reducing model training time and energy costs.

With currently available hardware, though, the result of all these data transfers is that training an LLM can easily take months, consuming more energy than a typical US home does in that time. And AI doesn't stop needing energy after model training: inferencing has similar computational requirements, so the von Neumann bottleneck slows it down in a similar fashion.

*Figure: An infographic comparing von Neumann architecture to in-memory computing.*

a. In a conventional computing system, when an operation f is performed on data D, D has to be moved into a processing unit, leading to significant costs in latency and energy. b. In the case of in-memory computing, f(D) is performed within a computational memory unit by exploiting the physical attributes of the memory devices, thus obviating the need to move D to the processing unit.
The computational tasks are performed within the confines of the memory array and its peripheral circuitry, albeit without deciphering the content of the individual memory elements. Both charge-based memory technologies, such as SRAM, DRAM, and flash memory, and resistance-based memory technologies, such as RRAM, PCM, and STT-MRAM, can serve as elements of such a computational memory unit. Source: Nature Nanotechnology

## Getting around the bottleneck

For the most part, model weights are stationary, and AI computing is memory-centric rather than compute-heavy, said Le Gallo-Bourdeau. "You have a fixed set of synaptic weights, and you just need to propagate data through them."

This quality has enabled him and his colleagues to pursue analog in-memory computing, which integrates memory with processing and uses the laws of physics to store weights. One of these approaches is phase-change memory (PCM), which stores model weights in the resistivity of a [chalcogenide glass](https://en.wikipedia.org/wiki/Chalcogenide_glass), changed by applying an electrical current.

"This way we can reduce the energy that is spent in data transfers and mitigate the von Neumann bottleneck," said Le Gallo-Bourdeau. In-memory computing isn't the only way to work around the von Neumann bottleneck, though.

The [AIU NorthPole](https://research.ibm.com/blog/northpole-ibm-ai-chip) is a processor that stores memory in digital SRAM, and while its memory isn't intertwined with compute in the same way as in analog chips, each of its many cores has access to local memory, making it an extreme example of near-memory computing.
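The analog in-memory approach described above can be sketched numerically. In a crossbar array, each weight is stored as a device conductance (in PCM, set by the resistivity of the chalcogenide glass); applying input voltages yields, by Ohm's and Kirchhoff's laws, output currents proportional to the matrix-vector product, so the multiply-accumulate happens inside the memory itself. The weight values and device scale factor below are hypothetical.

```python
# Idealized sketch of analog in-memory matrix-vector multiplication.
# Each weight is stored as a conductance G[i][j]; driving column voltages
# v[j] produces row currents i[i] = sum_j G[i][j] * v[j] (Ohm's law per
# device, Kirchhoff's current law summing along each row). No weights move.

G_SCALE = 1e-6  # siemens per unit weight (hypothetical device scale)
weights = [[0.5, -1.0],
           [2.0, 0.25]]  # trained weights, programmed once (hypothetical)

def analog_mvm(weights, voltages):
    """Row currents read out from the crossbar."""
    return [sum(G_SCALE * w * v for w, v in zip(row, voltages))
            for row in weights]

currents = analog_mvm(weights, [1.0, 2.0])
result = [i / G_SCALE for i in currents]  # divide out the device scale
print(result)  # approximately [-1.5, 2.5]
```

In a real device the result is read out directly as analog current; the division by `G_SCALE` here just recovers the numerical product for comparison. (Physical conductances are non-negative, so hardware typically encodes negative weights with pairs of devices, a detail omitted in this sketch.)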
Experiments have already demonstrated the power and promise of this architecture. In recent inference tests on a 3-billion-parameter LLM developed from IBM's Granite-8B-Code-Base model, NorthPole was [47 times faster](https://research.ibm.com/blog/northpole-llm-inference-results) than the next most energy-efficient GPU and 73 times more energy efficient than the next lowest-latency GPU.

It's also important to note that models trained on von Neumann hardware can be run on non-von Neumann devices. In fact, for [analog in-memory computing](https://research.ibm.com/blog/analog-in-memory-training-algorithms), it's essential: PCM devices aren't durable enough to have their weights changed over and over, so they're used to deploy models that have been trained on conventional GPUs. Durability is a comparative advantage of SRAM in near-memory or in-memory computing, as it can be rewritten indefinitely.

## Why von Neumann computing isn't going away

While von Neumann architecture creates a bottleneck for AI computing, it's perfectly suited to other applications. Sure, it causes issues in model training and inference, but von Neumann architecture is ideal for processing computer graphics and other compute-heavy workloads. And when 32- or 64-bit floating-point precision is called for, the [low precision](https://research.ibm.com/blog/low-precision-computing) of in-memory computing isn't up to the task.

"For general purpose computing, there's really nothing more powerful than the von Neumann architecture," said Burr.
Under these circumstances, bytes are either operations or operands moving on a bus from memory to a processor. "Just like an all-purpose deli where somebody might order some salami or pepperoni or this or that, but you're able to switch between them because you have the right ingredients on hand, and you can easily make six sandwiches in a row." Special-purpose computing, on the other hand, may involve 5,000 tuna sandwiches for one order, like AI computing as it shuttles static model weights.

Even when building their in-memory AIU chips, IBM researchers include some conventional hardware for the necessary high-precision operations.

And even as scientists and engineers work on new ways to eliminate the von Neumann bottleneck, the future will likely include both hardware architectures, said Le Gallo-Bourdeau. "What makes sense is some mix of von Neumann and non-von Neumann processors to each handle the operations they are best at."