{"id":255881,"date":"2026-01-21T09:16:09","date_gmt":"2026-01-21T09:16:09","guid":{"rendered":"https:\/\/www.newsbeep.com\/ie\/255881\/"},"modified":"2026-01-21T09:16:09","modified_gmt":"2026-01-21T09:16:09","slug":"addressing-critical-tradeoffs-in-npu-design","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ie\/255881\/","title":{"rendered":"Addressing Critical Tradeoffs In NPU Design"},"content":{"rendered":"<p>Experts At The Table: AI\/ML are driving a steep ramp in neural processing unit (NPU) design activity for everything from data centers to edge devices such as PCs and smartphones. Semiconductor Engineering sat down with Jason Lawley, director of product marketing, AI IP at <a href=\"https:\/\/semiengineering.com\/entities\/cadence-design-systems\/\" rel=\"nofollow noopener\" target=\"_blank\">Cadence<\/a>; Sharad Chole, chief scientist and co-founder at <a href=\"https:\/\/semiengineering.com\/entities\/expedera\/\" rel=\"nofollow noopener\" target=\"_blank\">Expedera<\/a>; Steve Roddy, chief marketing officer at <a href=\"https:\/\/semiengineering.com\/entities\/quadric\/\" rel=\"nofollow noopener\" target=\"_blank\">Quadric<\/a>; Steven Woo, fellow and distinguished inventor at <a href=\"https:\/\/semiengineering.com\/entities\/rambus-inc\/\" rel=\"nofollow noopener\" target=\"_blank\">Rambus<\/a>; Russell Klein, program director for the High-Level Synthesis Division at <a href=\"https:\/\/semiengineering.com\/entities\/mentor-a-siemens-business\/\" rel=\"nofollow noopener\" target=\"_blank\">Siemens EDA<\/a>; and Gordon Cooper, principal product manager at <a href=\"https:\/\/semiengineering.com\/entities\/synopsys-inc\/\" rel=\"nofollow noopener\" target=\"_blank\">Synopsys<\/a>. What follows are excerpts of that discussion. 
To read part one of this discussion, click here: https://semiengineering.com/how-and-why-to-optimize-npus/

L-R: Cadence's Lawley, Expedera's Chole, Quadric's Roddy, Rambus' Woo, Siemens EDA's Klein, and Synopsys' Cooper.

SE: What are some of the tradeoffs around NPUs?

Cooper: I'll offer one from the CNN days. You want the best efficiency in as small an area as possible, while also ensuring future-proofing and flexibility. Take activation functions, for example: ReLU, PReLU, Swish, whatever. You could design each one individually and hard-wire it in. If there are only two, three, four, or five of them, you can do that. But you don't know what's coming next. You don't know what the next paper will have. So now you have to make the tradeoff and say, 'Maybe I need to create a lookup table, which is more expensive in area than hard-wiring one or two functions, but in the long run it pays off because now I can do any activation function.' These are the tradeoffs you need to figure out. When do I have to build a flexible engine, and when can I hard-code it, because I'm betting it won't need to be more flexible? And unfortunately, it's rare that you can hard-code something and really optimize it like an ASIC, because you don't know what the next algorithm will require. You need some level of flexibility.

Roddy: Gordon really nails it on the hardwired-versus-programmable tradeoff.
With the design time for a complex SoC stretching 24 to 36 months, but state-of-the-art (SOTA) AI models changing monthly, designers face the daunting task of locking down a silicon architecture today that will likely need to run an unknown future workload in three years. Architects need to huddle closely with their product planning teams to decide, 'Can we really be certain the workload won't change much?' before committing to inflexible, non-programmable solutions. For some select subset of designs, closed boxes that won't need new algorithms, a fixed-function AI accelerator works. But the large majority of designs we've encountered clearly want a general-purpose programmable NPU.

Chole: In automotive deployments, when we look at actual workloads, multiple requests come in, and the cadence or the latency requirement of each request is very different. Some requests have extremely strict determinism requirements. Basically, if you get a request, you must respond within 10 milliseconds. Otherwise, the system cannot function, and you must give those ASIL guarantees. In such scenarios, there is a system-level tradeoff in how you design your runtime and how these multiple applications interact with each other. How can you context-switch an application completely, where you have to tell the NPU to stop running 'this' and start running 'that'? You have to save the state and load it again later. There is a question of priorities, and a question of getting determinism out of the system. We guarantee that the NPU is deterministic by itself, but for the system to give that guarantee, the bandwidth allocated to the NPU also needs to be deterministic, and the runtime needs to support swapping jobs on the fly.
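Chole's points about priorities, state save/restore, and hard deadlines can be sketched as a toy preemptive scheduler. Everything below (the job names, the 0.5 ms context-switch cost, the 1 ms time step) is an illustrative assumption, not any vendor's actual runtime:

```python
# Toy model of a preemptive NPU runtime: a high-priority request with a hard
# deadline preempts a running best-effort job, paying a fixed cost to save
# and restore NPU state. All numbers are made up for illustration.
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Job:
    priority: int                  # lower value = higher priority
    name: str = field(compare=False)
    remaining_ms: float = field(compare=False)
    deadline_ms: float = field(compare=False, default=float("inf"))

CTX_SWITCH_MS = 0.5                # assumed cost to save + restore NPU state

def run(initial_jobs, arrivals):
    """Simulate in 1 ms steps. arrivals maps arrival time (ms) -> Job;
    deadlines are absolute times. Returns (name, finish_time, met_deadline)."""
    clock, log = 0.0, []
    ready = list(initial_jobs)
    heapq.heapify(ready)
    current = None
    while ready or current is not None or arrivals:
        for t in sorted(a for a in arrivals if a <= clock):
            heapq.heappush(ready, arrivals.pop(t))   # admit arrived jobs
        if current is None and not ready:
            clock = min(arrivals)                    # idle: jump to next arrival
            continue
        if ready and (current is None or ready[0].priority < current.priority):
            if current is not None:                  # preempt: save state, requeue
                clock += CTX_SWITCH_MS
                heapq.heappush(ready, current)
            current = heapq.heappop(ready)
        current.remaining_ms -= 1.0
        clock += 1.0
        if current.remaining_ms <= 0:
            log.append((current.name, clock, clock <= current.deadline_ms))
            current = None
    return log
```

In this toy model, a high-priority 'braking' request arriving while a vision pipeline is running preempts it, pays the save/restore cost, and can still meet its deadline. Whether the deadline holds depends directly on the context-switch cost and the allocated bandwidth, which is the system-level determinism Chole describes.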
Chole: I can make that tradeoff and keep the entire SoC busy, not just the NPU. I can maybe keep the GPU busy, too. The pipeline is always going, so different application domains create different and interesting tradeoff questions. Automotive is going toward server class. Even features such as isolation guarantees, dividing the NPU into multiple parts with virtualization, are coming into automotive deployments.

Klein: One of the hard tradeoffs is how programmable to make this versus how much performance and efficiency we want to get. In the high-level synthesis world, I'm often working with customers who are deploying a very specific implementation in an ASIC or FPGA, and we can often very significantly outperform a GPU or an NPU. But we're putting so much into a hard-coded implementation that it's no longer future-proof. There are applications where it's fine to nail down parts of the architecture, where we know, for our typical embedded applications rather than a data center, that we're always going to be doing certain things. We know we're going to be pre-filtering some video data coming in, and it's always going to look like this. We can put that into hardware, into ASIC or FPGA logic, and dramatically improve the performance and power characteristics, but there's no going back. If we want to change it later on, we've got to build a new chip, or, with an FPGA, reprogram it.

Lawley: Programmability and flexibility can mean a lot of things. In some cases, that may mean future-proofing things like activation functions. In other cases, it may mean how much scalar and vector compute to add alongside the NPU's matrix compute. If you have too much scalar and vector compute, you have wasted area. If you don't have enough, that can lead to excessive fallback or bottlenecks in the architecture, degrading performance.
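The lookup-table approach that Cooper and Lawley both point to can be sketched in a few lines: sample any activation function into a table once, then evaluate with clamping and linear interpolation. The input range, table size, and interpolation scheme below are illustrative choices, not a description of any shipping NPU:

```python
# Instead of hard-wiring ReLU/PReLU/Swish individually, store sampled values
# of *any* function and interpolate. One table + one interpolator covers
# whatever the next paper proposes.
import math

def make_lut(fn, lo=-8.0, hi=8.0, entries=256):
    step = (hi - lo) / (entries - 1)
    table = [fn(lo + i * step) for i in range(entries)]   # sample once
    def apply(x):
        x = min(max(x, lo), hi)                # clamp to the table's range
        pos = (x - lo) / step
        i = min(int(pos), entries - 2)
        frac = pos - i
        return table[i] * (1 - frac) + table[i + 1] * frac  # linear interp
    return apply

silu = make_lut(lambda x: x / (1 + math.exp(-x)))   # Swish/SiLU
relu = make_lut(lambda x: max(x, 0.0))              # same hardware, new table
```

The same structure (one table plus an interpolator) now serves ReLU, Swish, or any future function, at the cost of the table's area and a small, range-dependent approximation error.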
Lawley: And then there is the flexibility of the number of MACs, the size of dedicated memory, and the interface bandwidth, to name a few. We try to give customers the flexibility they need to right-size the IP, and then we make the decisions that are more nuanced and difficult to make without understanding the architecture of the NPU.

SE: For that reprogramming, do you mean embedded FPGA, or a type of interposer, MCM, or something else?

Klein: All of the above. Embedded FPGA is particularly good because it eliminates the chip-to-chip problems, and you can get much more bandwidth and faster transfers using less power. So certainly, if we can put FPGA fabric next to our other compute units, that's an ideal situation. But if you've got a lot of data reuse, a discrete FPGA can be a viable solution.

Cooper: I agree. It's very use-case dependent. If you're deeply embedded and have a handful of graphs or models you want to implement, that's a great way to go. On the other hand, maybe you're going to have 100 different people programming this, and you don't know what models they will bring to it, so it has to be much more flexible. It really depends on the use case.

Woo: Here's an example of a tradeoff we sometimes see at the high end. With some of these processors, where they're spending 50% of their power budget on the PHYs and the things that just move data back and forth, either to memory or chip to chip, you really have to be careful and decide whether to take those watts and spend them on moving data in and out, or remove some of the external bandwidth capability and spend the watts on more compute. Then you've got to figure out how to keep this in balance, like Russ is mentioning, and that's what these architectures are about. It means you have to think about, 'Do I quantize?
Do I go to different algorithms?' Some of the tradeoffs we see come from what we do. We see them more on the physical side, and that may mean picking a different kind of memory, or changing the way you architect the rest of the SoC, where those interconnects go off-chip. You've got so many watts, and you know there's some fraction you want to spend on compute, because there's a performance level you need to reach. It's often the case that you just can't give customers enough bandwidth. They're either power-limited or there are not enough pins on the processor, for example. So you give them what fits within their budget, and even that, in many cases, is not enough or not what they really want. When they're faced with the decision of how many cores they can really put down, and how accurate the computation needs to be in each of those cores, they can go to things like more quantized representations, or fewer bits per parameter, and that extends the effective bandwidth. But that can be an accuracy tradeoff. So there's this challenging problem on the physical side, where you are limited by what you can do on the chip, and you have to think very carefully about how many watts you really want to spend moving data in and out. Then, what's the best way to represent the data to use those watts? And what's the best core implementation that uses the remaining watts, so everything stays in balance?

Cooper: The amount of internal SRAM you have on the chip is flexible. The more you have, the less external data movement you need, and therefore the less power consumed on those external pins. That helps you with power and performance, but it kills you in area. So power, performance, and area are all tradeoffs and degrees of freedom.
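Woo's and Cooper's points lend themselves to a back-of-envelope model: the watts budgeted for data movement cap how many parameters you can stream, and both quantization and on-chip SRAM stretch that budget. The energy numbers below are rough illustrative ballparks (off-chip access assumed to cost roughly 20x an on-chip access), not measured figures:

```python
# Back-of-envelope: parameters streamed per second under a fixed
# data-movement power budget. All energy figures are assumptions.
DRAM_PJ_PER_BIT = 20.0     # assumed off-chip access energy, pJ/bit
SRAM_PJ_PER_BIT = 1.0      # assumed on-chip access energy, pJ/bit

def params_per_second(io_watts, bits_per_param, hit_rate=0.0):
    """hit_rate = fraction of accesses served from on-chip SRAM."""
    pj_per_bit = hit_rate * SRAM_PJ_PER_BIT + (1 - hit_rate) * DRAM_PJ_PER_BIT
    joules_per_param = bits_per_param * pj_per_bit * 1e-12
    return io_watts / joules_per_param

fp16 = params_per_second(10.0, 16)                  # 10 W budget, FP16 weights
int4 = params_per_second(10.0, 4)                   # same budget, 4-bit weights
cached = params_per_second(10.0, 16, hit_rate=0.9)  # more SRAM, 90% reuse
```

Under these assumptions, dropping from 16-bit to 4-bit weights quadruples effective parameter bandwidth at the same power, which is exactly the accuracy-versus-bandwidth tradeoff Woo describes; a 90% SRAM hit rate buys a similar factor, but pays for it in area, per Cooper.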
Cooper: You don't have unlimited area, but there's flexibility there, and you have to decide which way your specific design needs to accommodate that tradeoff.

SE: These are all extremely nuanced tradeoffs. Looking at your customers and their activity, what level of expertise is needed to make them? Are we training lower-level engineers to come up to speed, given that these issues are only going to get more challenging?

Cooper: There's certainly a learning curve. The first time out, you're probably going to make some mistakes you didn't anticipate. As you iterate, there's some level of learning from your mistakes. The problem is that those mistakes become more expensive as you go from 5nm to 3nm and smaller design nodes, so you can't afford to make them. Therefore, you need other tools to offset and pre-test, like digital twins. There is that level of learning.

Chole: As we work with customers, it's very much a collaboration of 100 different people. It's a collaboration between the customer's data science team, the customer's deployment team, and the software team that actually has to ingest the models, and this collaboration doesn't end at tape-out. It's continuous and ongoing. Then, on the NPU design side, we have the architecture, the hardware instructions, the ISA of that architecture, the hardware team building it, and the compiler team building the compiler for it. It's a very complex problem. Given what we are doing, it's amazing that we are able to do this, so far away from the training. When somebody is training a model, they are not really thinking of edge deployments. The optimizations for edge deployment come later, like shrinking the model down, distillation, and the like.
Chole: But the basic training that is pushing the boundaries right now is focused almost entirely on the data center. The challenge is how to take that, shrink it down, and make it fit the accuracy as well as the power, area, and cost budget we have, then get to the final deployment and actually run it. It might be running live. It might be doing 60 frames per second. That's an amazing feat, and it is a huge collaboration. Is anybody coming up to that speed, that level? Even I'm not there. If you say, 'Hey, guide these 100 people together,' I don't have that expertise. I need somebody to do the data science. I need somebody to do the physical design. I can't do that by myself. It's a collaboration.

Klein: What we're doing is combining different domains of knowledge and engineering, and they really need to work together to achieve an optimal outcome. To give an example, we held a hackathon last summer where people used high-level synthesis to implement an AI accelerator, not really an NPU, but a bespoke accelerator. What was interesting was how the winner created the best implementation. We judged the entries by energy used per inference, and figured out which one used the least. The person who created the smallest, most efficient implementation beat my reference implementation. I've been using this tool, and he ended up using about half the energy my example did. The way he did it was to say, 'You're not training this thing well. You've got to take this network and change the training,' and that increased the accuracy. With that increased accuracy, he was able to use fewer multipliers and fewer channels within the layers while still meeting the minimum accuracy. He hit that with a smaller design by changing the training and changing the architecture of the network a little bit.
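Klein's result, where better training buys accuracy headroom that is then spent on fewer channels, can be sketched with a simple multiply-accumulate (MAC) count. The layer shapes and channel widths below are made up for illustration:

```python
# Trimming channels from a conv layer cuts MAC work roughly quadratically,
# because both the input and output channel counts of each layer shrink.
def conv_macs(h, w, c_in, c_out, k=3):
    """MACs for one k x k convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

def network_macs(channels, h=224, w=224):
    """Chain of 3x3 convs; channels = [c0, c1, ..., cn]."""
    return sum(conv_macs(h, w, a, b) for a, b in zip(channels, channels[1:]))

baseline = network_macs([3, 32, 64, 64])   # original channel widths
trimmed  = network_macs([3, 24, 48, 48])   # 25% fewer channels per layer

savings = 1 - trimmed / baseline           # fraction of MACs removed
```

Because a convolution's work scales with the product of input and output channels, a 25% trim to every internal width removes roughly 43% of the MACs in this toy network, assuming retraining recovers the lost accuracy, as it did for the hackathon winner.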
Klein: It wasn't just hardware design. It was neural network design as well, understanding how to train most efficiently, achieve the highest level of accuracy, and then implement that in hardware. It's a broad range of skills, and this one person had both the hardware design and the neural network design expertise and was able to bring them together. And he won the contest. That was kind of fun.

Woo: What I'm noticing is that there is the hardware side of things and the software side of things. The industry has done a great job of making compute cheaper and more available, and you see all these open-source models, lots of university classes, and chances to use some of these models. Because they're open source, you can tweak them and learn a lot of the basics about what's going on. The bigger challenge is on the hardware side and the physical implementation, because it's become more expensive to produce hardware. Take one of the most important areas in semiconductors, advanced packaging. Packaging is really helping us get the physical capabilities the silicon needs, which is to move data in and out quickly. Those things are tougher and more expensive, and the expertise is harder to acquire. What I see is a growing divide between what you're able to train people to do in college and what goes on in the more advanced designs these days. This is true on the software side, as well, but the hardware side is complicated by the fact that it's more expensive to produce the hardware. If we're to succeed as an industry, there needs to be a way to accelerate the training and broaden it to a larger set of people.

Chole: I come from a software background, with completely zero formal education in hardware.
Everything I learned, I learned [at previous companies], a previous startup, and this startup. It's phenomenally mind-boggling coming from the software world. I could never have learned this without access to tools and advanced methodologies. Companies have been building in-house technologies for so long that I don't even know how an Intel, which might be running at 3.6 gigahertz, does that. I can't comprehend the complexity they are dealing with, and the cost of getting there is so high. At a startup, we can't even think about that. The training here is pretty much hands-on. I had to learn from experts in the field to get here.
beddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/comments?post=255881"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/255881\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media\/255882"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media?parent=255881"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/categories?post=255881"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/tags?post=255881"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}