Experts At The Table: AI/ML is driving a steep ramp in neural processing unit (NPU) design activity for everything from data centers to edge devices such as PCs and smartphones. Semiconductor Engineering sat down to discuss this with Jason Lawley, director of product marketing, AI IP at Cadence; Sharad Chole, chief scientist and co-founder at Expedera; Steve Roddy, chief marketing officer at Quadric; Steven Woo, fellow and distinguished inventor at Rambus; Russell Klein, program director for the High-Level Synthesis Division at Siemens EDA; and Gordon Cooper, principal product manager at Synopsys. What follows are excerpts of that discussion. To view part one of this discussion, click here. Part two is here.

L-R: Cadence’s Lawley, Expedera’s Chole, Quadric’s Roddy, Rambus’ Woo, Siemens EDA’s Klein, and Synopsys’ Cooper.
SE: How do you ensure that algorithm training is accurate enough, and what are the key considerations for engineers?
Klein: How accurate a particular network is going to be will depend on how well it was trained. That means you have to ask how long to spend on that training. How many epochs are we going to run? How many data elements, or how many samples, are we going to run per epoch? Also, when do we stop training? How do we not overtrain? All of those are ultimately going to impact the accuracy that we get for a particular set of weights for that particular network. Now, when we go to implement in hardware, as we quantize things, we want to use smaller multipliers. We want to use fewer channels. And we want to use fewer layers. We want to reduce the overall computation that we need to do. If we’ve done a better job in training, we’ve got more latitude to shrink down the number of computations. If we do a poor job in training, we’re going to need to augment that with hardware. The training we do once, while the inferencing we do over and over again. In the case of our hackathon, we had this algorithm, and we had trained it rather generically. The person who won came back and said the batch size was too big, which meant we were overfitting. That equates to bad accuracy. He came in and changed a number of the hyperparameters used in the training, and was able to get a slightly higher accuracy, which let him throw out some hardware. That can be done more broadly, but it requires a lot of knowledge of neural network design, which is not hardware design. Again, it’s these two different domains that, when they work together, are going to give an optimal result. If the person doing the training doesn’t think at all about edge, you lose something there. I’ve talked to data scientists who, when I asked them if we could quantize something down by a few bits, scratched their heads and asked why I’d want to do that. I said we were going into hardware, and we wanted to save time and energy, as those are important at the edge. It’s understanding across that whole flow that’s going to make that work.
Cooper: In many cases, since we develop an NPU for any use case, whether transformer, CNN, or now GenAI, we’re often handed a model and told that any model has to work. They don’t go back and look at the training. There are use cases in deeply embedded applications in which it makes sense to optimize something, then design the hardware to that optimization. But if you start with hardware, we have had partners and customers ask us to help them understand our hardware so they can influence the training. That’s rarer, because usually it’s, ‘Hey, here’s the model. Make it work.’ But there are opportunities there, not exploited enough, to revisit the training. And because this is such a different skill set, the fact that Russ had that one guy who knew both is rare.
Chole: To add to that, most of our customers’ data science teams have been doing quantization-aware training. The frameworks are at a level where the quantization has become part of the training. It’s fine-tuning the model, but the model structure makes it feasible for low-bit weight quantization, like 8-bit quantization or 4-bit quantization. And the industry has matured enough in that sense. So, vision transformers, convolutional networks [are maturing], but then comes the LLM, and that maturity isn’t there yet for LLMs. There are 20 or 30 ways to quantize an LLM’s weights, and it’s not easy to measure accuracy, because you have to run a bunch of suites to get the output. This is the interesting part about evolving AI dynamics. You have a set of networks that are very well understood. We know how to quantize it, train it, optimize it, reduce the pressure, and we have set up networks that can even do something like LoRA (low-rank adaptation) or limited fine-tuning, which is quite expensive, and you don’t necessarily know whether it is really giving you a benefit or not. Most of the deployments in the data center don’t really fine-tune that way anymore, because prompt engineering alone provides a much better accuracy improvement than actual fine-tuning for a specific use case. We are at a stage of flux where, in the NPUs, we have to support different precision requirements and different quantization semantics, and even the compilers have to keep up. The frameworks that worked previously for supporting just CNNs don’t scale well to something like 600-gigabyte LLMs. So from training to inference to compilation, this challenge has been there. Now we have to support the latest generation of models, and these are more painful to work with.
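The quantization-aware training Chole describes can be illustrated with a minimal sketch. This is not any particular vendor's flow; it is a generic PyTorch fake-quantization layer in which weights are snapped to an INT8 grid in the forward pass while gradients flow through to the full-precision weights (a straight-through estimator), which is the basic mechanism QAT frameworks build on.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer that simulates symmetric INT8 weight quantization
    during training (quantization-aware training)."""

    def forward(self, x):
        # Per-tensor scale for symmetric INT8 quantization.
        scale = self.weight.abs().max() / 127.0
        # Round weights onto the INT8 grid, then dequantize.
        w_q = torch.clamp(torch.round(self.weight / scale), -128, 127) * scale
        # Straight-through estimator: use quantized weights in the forward
        # pass, but let gradients flow to the full-precision weights.
        w = self.weight + (w_q - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

# Drop-in replacement during fine-tuning, so the model learns weights
# that remain accurate after INT8 deployment.
layer = FakeQuantLinear(256, 128)
out = layer(torch.randn(4, 256))
```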
SE: Speaking of accuracy, how do we define it in the context we’ve been talking about?
Cooper: It’s probably more important than frames per second and performance levels. With CNNs, we all took it for granted because CNNs were INT8, so the assumption was that what mattered was the accuracy of the quantization, getting it from 32 bits to 8 bits. But now you get to transformers and, all of a sudden, maybe you need some layers in floating point 16 or floating point 8. And now, as Sharad said, you’ve got large language models coming with pre-quantization going on, and all sorts of creative data types. This means accuracy is really important and harder now, because you could have different vendors offering solutions that use INT4 or FP16. How do you measure the accuracy? It’s a little bit of a challenge. You have to have your tools working well. You may have to have hardware or an FPGA-based solution to measure accuracy. It’s really important, and something that is a little harder to measure than performance.
SE: Don’t we need an agreed-upon definition of accuracy?
Cooper: Yes. There’s this idea of perplexity now in large language models.
Chole: It’s crazy to measure accuracy. You just have to let it be like that. There are similarity measures, obviously, and they’re not enough. You have to have a human check to actually do it. And we kind of gave up on it. We let the customer take care of it, since there is no way we have the expertise to know that. We can do open-source benchmarks, but other than that, it’s our customer’s responsibility to make sure the model meets its accuracy. To add to Gordon’s point, it’s very important to have fast acceleration or a fast simulation environment, so having an FPGA platform becomes extremely critical for measuring accuracy. We cannot do it in software. It’s a 100X difference in terms of performance, no matter what model you build, because you need a bit-accurate representation to guarantee the accuracy of the application.
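The perplexity metric Cooper mentions is the most common proxy for LLM accuracy. A minimal sketch, assuming `model` is any causal language model that maps token IDs to next-token logits (a placeholder, not a specific library API): perplexity is the exponential of the average cross-entropy per predicted token, and comparing a quantized variant against its full-precision baseline on the same held-out text gives a quick, if incomplete, accuracy signal.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids):
    """Perplexity = exp(average cross-entropy per predicted token).
    Lower perplexity means the model assigns higher probability to the
    reference text."""
    logits = model(token_ids)                          # [batch, seq_len, vocab]
    # Predict token t+1 from positions 0..t.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = token_ids[:, 1:].reshape(-1)
    nll = F.cross_entropy(shift_logits, shift_labels)  # mean negative log-likelihood
    return torch.exp(nll).item()
```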
Woo: For those of us doing a lot of memory and interconnect technology, it’s whatever the customers need to buy to get the level of accuracy. But an interesting observation, if you look over the last 10 to 15 years, is that there’s been about a 1,000X improvement in the inference performance of these ASICs. And there’s a really nice chart out there that shows that most of that actually came from changing number formats and improving number representation, and that was all spurred by the fact that you just couldn’t get enough bandwidth to move the data in and out of the processors, so you do some of the computation on shorter data values. It saves you area and power, so you can get more performance there, as well. What we see on the interconnect side of it is that there’s a lot of effort to try to improve the number formats simply because it uses the available bandwidth better. If you halve the size of your numbers, you get twice as many of them through that same bandwidth. But that’s fundamentally a tradeoff with accuracy. We hear different answers from different people who are adopting our technology, and it’s a bit in the eye of the beholder as to what is acceptable. They’re all trying to figure out the balance between what is the right way to use the resources that are there, versus what they perceive to be the accuracy they need in their application.
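Woo's bandwidth arithmetic is easy to make concrete. A back-of-the-envelope sketch, with an illustrative bandwidth figure that is not from the discussion:

```python
# Operands per second that fit through a fixed memory or link bandwidth.
BANDWIDTH_GB_S = 100  # illustrative figure only

for fmt, bytes_per_elem in [("FP32", 4), ("FP16/BF16", 2), ("FP8/INT8", 1)]:
    elems_per_s = BANDWIDTH_GB_S * 1e9 / bytes_per_elem
    print(f"{fmt:10s}: {elems_per_s:.2e} elements/s")
# Halving the element size doubles the number of operands delivered per
# second, which is the bandwidth-driven case for smaller number formats.
```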
Klein: We seem to be talking about numeric representation as we talk about accuracy, and that is extremely important. The specific way that you represent those numbers, and the operators that you’re going to use on them, can have a huge impact on the performance, the area, and so forth. So, as Steve mentioned, if we can shrink down our representation, we have less data to move and less data to store, and that just makes everything better. But our operators can also shrink down. We had a customer who did an implementation with power-of-two representations. In other words, he’s not using traditional numbers. Each of the bits represents an exponent. The interesting thing is, you no longer need multipliers, because multiplying is just adding the exponents. What happens when you have to add two of these numbers together? Either the smaller one is a rounding error, or you double it, which is just a shift. So you’ve eliminated your multipliers, and your hardware gets massively simplified. Again, we’re using a tiny representation and really small operators, and that works great down on the edge.
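Klein's power-of-two example can be sketched in a few lines. Assuming weights constrained to ±2^k (illustrative integer code, not the customer's actual implementation), the multiply in a MAC reduces to a shift and an optional sign flip:

```python
def pow2_multiply(activation: int, exponent: int, negate: bool = False) -> int:
    """Multiply an integer activation by a weight constrained to +/-2^exponent.
    Because the weight is a power of two, the multiplier reduces to a shift
    (plus an optional negation) in hardware."""
    product = activation << exponent if exponent >= 0 else activation >> -exponent
    return -product if negate else product

# 13 * 8 == 104, computed with a shift instead of a multiplier.
assert pow2_multiply(13, 3) == 104
```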
SE: What tools are architects and design teams using to do these optimizations?
Klein: In the high-level synthesis space, our latest survey showed it’s used on almost 30% of designs, so it’s got some solid uptake in the industry. Within the high-level synthesis environment, we do have bit-accurate operators and bit-accurate data representation. So rather than going down to the Verilog level to look at the algorithms, we can work at the C level, at a very abstract level, whether we’re using power-of-two numeric representation or fixed-point of particular sizes, and test that algorithm with those specific bit accuracies, running it as a C program on a general-purpose computer. That runs much faster than doing the same thing down in Verilog. Doing bit-accurate calculations at an abstract level is extremely important. The other thing that’s extremely important is FPGA prototyping. Once we have that RTL implementation, we’re going to need to run thousands of inferences — tens of thousands to millions — to determine the accuracy given the actual architecture, and what the performance is going to be given this size cache, these size buses, and so forth. That has to be done on an FPGA prototype, where you can get the performance and run all of those. Those are two key elements that need to be brought into the mix to do development here.
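The bit-accurate, abstract-level evaluation Klein describes can be approximated in software long before any RTL exists. A minimal sketch, with plain Python standing in for the C-level model (HLS flows use bit-accurate C++ datatypes for the same purpose, and the bit widths below are arbitrary examples):

```python
import numpy as np

def to_fixed_point(x, int_bits, frac_bits):
    """Quantize an array to signed fixed-point with the given integer and
    fractional bit widths, mimicking the bit-accurate datatypes used in a
    C-level model of the hardware."""
    scale = 2 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits - 1))
    hi = 2 ** (int_bits + frac_bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale

# Sweep candidate widths and measure the error each one introduces,
# before committing the choice to Verilog.
weights = np.random.randn(1000)
for frac_bits in (4, 6, 8):
    err = np.abs(weights - to_fixed_point(weights, int_bits=2, frac_bits=frac_bits)).max()
    print(f"2 integer / {frac_bits} fractional bits: max quantization error = {err:.4f}")
```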
Chole: We typically separate performance and accuracy. As an architect, the first worry I have is performance, as in how the performance looks for the NPUs and at the system level. That’s the biggest worry for customers. We need a deterministic simulation environment. It doesn’t have to process the data at all. It’s a transaction-level model, but it needs to be deterministic, and every transaction is simulated as fast as we can. Basically, you’re trying to get an understanding of the pressure on the bandwidth. What’s the pressure on the system bus? What’s the pressure on the DDR? How much on-chip buffering is being used, and at what capacity? That helps us define, in the end, how the pipeline of models is going to work, and how the different pieces on the end SoC, the CPU, GPU, NPUs, DSP, and ISP, are going to collaborate and create a workload characteristic. At the architecture level, where we don’t have everything put together, this becomes very important because we cannot even simulate this. We could probably emulate it, but that would be a huge effort and a huge cost. Our customers go through that, but that’s at the end of the cycle, when they’re designing and committing to a certain design. Six months before that, you have to know in advance, as an architect, that this is how the system will look, these are the limitations of the system, these are the bottlenecks, and this is what we need to work on.
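A crude analytical version of the pressure analysis Chole describes can be written down directly. The function and all numbers below are illustrative placeholders; a real transaction-level model tracks bus and DDR traffic per transaction rather than per layer, but the same compute-versus-bandwidth question is being asked.

```python
def layer_pressure(macs, weight_bytes, act_bytes, sram_bytes, ddr_gb_s, npu_tops):
    """Rough per-layer estimate of whether a layer is compute-bound or
    DDR-bandwidth-bound, given how much of its traffic misses on-chip SRAM."""
    compute_s = macs * 2 / (npu_tops * 1e12)            # 2 ops per MAC
    ddr_traffic = max(0, weight_bytes + act_bytes - sram_bytes)
    memory_s = ddr_traffic / (ddr_gb_s * 1e9)
    bound = "memory-bound" if memory_s > compute_s else "compute-bound"
    return compute_s, memory_s, bound

# Illustrative layer: 2 GMACs, 8 MB of weights, 4 MB of activations,
# 2 MB of on-chip buffering, 50 GB/s of DDR bandwidth, a 10-TOPS NPU.
print(layer_pressure(2e9, 8e6, 4e6, 2e6, 50, 10))
```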
Cooper: I look at tools at two levels — processor/NPU-level tools and system-level/SoC-level tools. One critical tool is rapid prototyping. The algorithms, particularly if you’re targeting 100 TOPS, or, heaven forbid, 1,000 TOPS or beyond in an edge space like automotive, could take you weeks to simulate if you’re bit-accurate. You need some kind of rapid tool that’s not going to be 100% accurate. And by accurate, I mean knowing the exact frames per second your hardware will deliver. Then you can right-size how much memory you need, how much bandwidth, and how many MACs. You need that level to figure out how to size it properly. Then, of course, the next level of tools you need is the compiler, and then some kind of platform, whether it be an FPGA (although that’s becoming a problem as the NPUs get bigger) or an emulator. Synopsys has ZeBu, Cadence has Palladium, [Siemens has Veloce]. Those are the processor-level tools. Then there are system-level/SoC-level tools, virtual prototypes like VDKs, for example, which are really important for automotive, so they can start a digital twin, start prototyping, and the software can start before the silicon is available. Those are just a couple of the key tools.
Woo: Given the expense of developing hardware, you need to be right by the time you start to press something into silicon. But it’s also a sign of how complex everything has gotten, just trying to understand where the bottlenecks are from the hardware standpoint, and trying to understand, as you do your model development, the right way to map your model onto that hardware. It’s so complicated that, exactly like Sharad said, no one person can really do this. And so, in a sense, the emulation platforms or models give the team of developers a kind of sandbox in which they can work together to figure out, ‘If the hardware guys do this, then I can do my algorithm like this.’ It gives you the flexibility to try things, so that you have a lot of confidence by the time something gets into silicon.
SE: What are your crystal ball predictions as to how this area will evolve in the short term?
Woo: There is an appetite for investment around the world, not only in the U.S., but also in nation-states that are doing a lot of investment. What we’ll see is a lot of build-out in terms of the hardware capabilities, and in some sense there’s going to be a challenge just to keep up with the level of investment the systems companies want to make. But that’s always a good position to be in as a semiconductor supplier and company. It’s great because there’s going to be a lot of experimentation and activity. I’m not exactly sure how it will all play out, other than to say a lot of these predictions of more inference will happen. More of it will move to the edge. Things will become more complicated. Agentic AI will start to become more mainstream. Those things are all going to be enabled because the level of investment people want to commit is going to enable the development of the hardware to make compute cheaper and more available. Then, that tends to provide more opportunity for the software and application people to develop the applications that people want. I’m excited just because the commitment of resources is going in that direction.
Cooper: The technology is moving so fast that it’s really hard [to predict what’s going to happen]. Part of my job is to say where we’re going to be in five years, and what the target is. But even two years out, it’s gotten a bit murky. We anticipated transformers when they came, and we’re constantly being asked what’s next. That’s not clear. Anything could pop up. In the short term, we can learn from CNNs. We’re 10 years into CNNs. They kept throwing multiply-accumulators at the problem, and more layers, more layers, more layers. And then [the folks in research] figured out maybe now they can optimize it. We saw this trend toward moving to the edge, with fewer multiplies and more efficient algorithms. With large language models, we’re seeing the same trend of moving to the edge, despite the huge number of parameters. Once a large language model, or a VLM, or whatever it might be, reaches a certain level of performance and efficiency, it’s going to start being optimized. We’re going to see more algorithms targeted for edge applications. So that’s my short-term prediction. After that, I’ll say that AI is not going away. There’s always a question of how overhyped it is. We see a lot of engagement. Some companies will fall away, just because the industry can’t sustain so many companies in the business, but AI isn’t going to go away, and it will continue to evolve. I can guarantee that part, at least. It will continue to change.
Klein: The bad news is that the amount of computation these networks require is going to continue to go up. The computational loads are going to continue to get worse, and that trend is going to continue. The good news is that these algorithms are embarrassingly parallel, which means we can bring more and more hardware to bear on these problems to wrangle them into feasibility. Customizing that hardware, and customizing the way we process these algorithms, is one of the approaches that’s going to make things practical, and the more highly customized your implementation is, the faster and more efficient it is. But that also means it is less programmable and less future-proof, so those hard tradeoffs will continue to dog us.
Chole: We already have ongoing projects with 120-TOPS, single-core NPUs. As the number of TOPS increases, computation increases. We are looking at the edge going toward 1 PetaOP [1,000 TeraOPS], and it will get there in the next five years. The good thing about that is we will have the entire control system built on it, where the idea is that the models themselves can control the end-to-end flow. They can dynamically decide what sub-models to run and what kinds of structures to run, and be able to guide maybe robotics or maybe automotive systems pretty much end-to-end, and take care of 99.999% of the nuances. To be able to get there, there are a lot of innovations that need to happen. Chiplets need to happen. The bandwidth, and how that bandwidth can actually be fed directly into the NPUs, needs to be solved. The tools need to be solved. These models, like LLMs, which are hugely costly to train, need to be trainable especially for edge use cases, and that needs to be democratized. We have to get all these models out in the open. If only Tesla were training this model, it’s not going to happen. Other customers won’t get access to it. These are the bottlenecks of the system. I feel like the NPU core and NPU technology are no longer the bottleneck. We have reached a point where we can say we can get you this many TOPS per watt, and this is the best anyone can get, close to the theoretical maximum. But as an industry, we have to pull all these things together. There is a lot of plumbing that we need to do everywhere, and that’s the challenge for the next five years.
Lawley: Here are my predictions. First, the thing that is close to certain is that the need for AI inference at the edge will continue to increase. Second, the line between NPUs and GPUs will continue to blur. Inference-focused GPUs will look more like NPUs, and NPUs will add features like FP16/BF16 support, making them look more like GPUs. My last prediction is that quantization will start to become a function of the network training process rather than the quantization-aware training (QAT) that is happening now. It makes so much more sense to quantize the network with the full dataset rather than using a subset of data to quantize the network after it is trained.