The increasing computational demands of modern deep neural networks have spurred the development of specialised hardware accelerators, such as NVIDIA Tensor Cores and AMD Matrix Cores, designed to speed up matrix multiplication. However, subtle differences in how these accelerators handle floating-point calculations can introduce numerical inconsistencies, potentially compromising the reliability of both training and inference. Peichen Xie, Yang Wang, Fan Yang, and Mao Yang, all from Microsoft Research, address this challenge with MMA-Sim, a novel reference model that accurately simulates the arithmetic behaviour of matrix multiplication units across ten different GPU architectures. By meticulously dissecting these units through rigorous testing, the team has identified and modelled nine distinct algorithms governing their operation, achieving bitwise equivalence with real hardware and offering a crucial tool for understanding and mitigating potential sources of error in deep learning systems.
Numerical imprecision and inconsistency can compromise the stability and reproducibility of deep neural network training and inference. This research introduces MMA-Sim, a bit-accurate reference model that reveals the detailed arithmetic behaviours of matrix multiplication accelerators (MMAs) from ten modern GPU architectures, including those from NVIDIA and AMD. The work addresses a critical gap: the internal arithmetic of these accelerators is typically undocumented, which can lead to numerical inconsistencies in deep learning computations. The team systematically dissected the arithmetic behaviour of each MMA through a testing-based methodology, employing both carefully designed and randomised inputs to expose arithmetic characteristics in both typical and edge cases.
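As a flavour of what a carefully designed input can reveal, the sketch below probes accumulator precision: the exact FP16 dot product is chosen so that it cannot be represented in FP16, and the returned value therefore discloses whether the accumulation is performed in a wider format. The probe is illustrative only and emulates the two candidate behaviours in software rather than invoking a real MMA instruction.

```python
# Minimal sketch of a designed-input probe in the spirit described above.
# Illustrative only: it is not the paper's test suite, and the two software
# emulations below merely stand in for a real MMA instruction on a GPU.
import numpy as np

# The exact dot product of these FP16 vectors is 2049, which FP16's 11-bit
# significand cannot represent (it rounds to 2048), while an FP32 accumulator
# preserves it exactly, so the observed result reveals the accumulator width.
a = np.array([2048.0, 1.0], dtype=np.float16)
b = np.array([1.0, 1.0], dtype=np.float16)

def classify(result: float) -> str:
    return "wide accumulator" if result == 2049.0 else "FP16 accumulation"

wide = float(np.float32(a[0]) * np.float32(b[0]) + np.float32(a[1]) * np.float32(b[1]))
narrow = float(np.float16(np.float16(a[0] * b[0]) + np.float16(a[1] * b[1])))
print(classify(wide))    # wide accumulator
print(classify(narrow))  # FP16 accumulation
```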
Matrix Accelerator Numerical Discrepancies and Reproducibility
This research delves into the numerical behaviour of matrix accelerators, identifying discrepancies that can impact the reproducibility and accuracy of deep learning computations. The study highlights that these accelerators can yield inconsistent results across runs and across GPUs, because fused multiply-add (FMA) rounds differently from a separate multiply followed by an add, and because the order in which partial products are accumulated can vary. Subtle numerical differences can compound over the many layers of a deep neural network, potentially reducing accuracy or causing models to diverge. The research demonstrates that architectural differences between NVIDIA and AMD matrix accelerators affect numerical behaviour, and that the order and fusion of floating-point operations significantly influence the final result.
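The effect of accumulation order alone is easy to reproduce in ordinary software. The short example below, a generic illustration rather than anything taken from the paper, sums the same four FP32 numbers serially and in a pairwise (tree) order and obtains two different answers:

```python
# Generic illustration (not taken from the paper) of accumulation-order
# sensitivity: the same four FP32 values summed in two different orders.
import numpy as np

x = np.array([1e8, 1.0, -1e8, 1.0], dtype=np.float32)

serial = np.float32(0.0)
for v in x:                    # left to right: (((0 + 1e8) + 1) - 1e8) + 1
    serial = np.float32(serial + v)

pairwise = np.float32(x[0] + x[2]) + np.float32(x[1] + x[3])
                               # tree order: (1e8 - 1e8) + (1 + 1)

print(serial)    # 1.0 : the first 1.0 is absorbed by 1e8 and lost
print(pairwise)  # 2.0 : both 1.0 terms survive
```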
The study also examines how various numerical formats, including FP32, FP16, BF16, INT8, and emerging formats such as FP8, affect numerical stability and accuracy. It confirms measurable numerical differences between NVIDIA and AMD matrix accelerators even when the same numerical format is used for the same computation. While FP8 reduces memory usage and can improve performance, it introduces new challenges for numerical stability and accuracy. To mitigate these issues, the authors propose deterministic algorithms, software controls for specifying accumulation order, numerical stabilisation techniques, and standardisation of numerical formats and algorithms.
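A small, hedged illustration of these trade-offs, again not drawn from the paper: FP16's limited range and significand discard information that FP32 retains, which is one reason MMAs commonly pair narrow input formats with a wider accumulator.

```python
# Hedged illustration (not from the paper) of format trade-offs: FP16 has a
# narrow dynamic range and an 11-bit significand, both of which lose
# information that FP32 retains.
import numpy as np

print(np.float16(70000.0))   # inf    : overflows FP16's max finite value, 65504
print(np.float32(70000.0))   # 70000.0

print(np.float16(2049.0))    # 2048.0 : 2049 needs 12 significand bits
print(np.float32(2049.0))    # 2049.0

# Accumulating many small FP16 terms stalls once the running sum outgrows
# their ulp, one reason MMAs commonly keep a wider (e.g. FP32) accumulator.
s = np.float16(0.0)
for _ in range(4096):
    s = np.float16(s + np.float16(1.0))
print(s)                     # 2048.0, not 4096.0
```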
GPU Matrix Multiplication Accelerator Arithmetic Revealed
Scientists have developed MMA-Sim, a bit-accurate reference model that reveals the detailed arithmetic behaviours of matrix multiplication accelerators (MMAs) found in ten modern GPU architectures from NVIDIA and AMD. MMA-Sim, implemented in approximately 2800 lines of Python code, faithfully simulates the diverse arithmetic across all tested architectures, including NVIDIA's Volta and RTX Blackwell, and AMD's CDNA2 and CDNA3. Large-scale testing with over one million sets of randomly generated input matrices confirms bitwise equivalence between MMA-Sim and the corresponding hardware, validating the model's accuracy. Through MMA-Sim, the team confirmed previously reported issues, such as reduced accumulation precision in 8-bit floating-point (FP8) instructions on NVIDIA's Hopper architecture and reduced dynamic range in 16-bit floating-point instructions on AMD's CDNA2.
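Such a validation campaign amounts to feeding identical random inputs to the reference model and to the hardware and comparing the outputs bit for bit. The harness sketched below shows the general shape of that procedure; the callables it takes are hypothetical placeholders rather than MMA-Sim's actual API, and a plain FP32 matmul stands in for both sides so the snippet runs without a GPU.

```python
# Sketch of a random bitwise-equivalence harness. The callables passed in are
# hypothetical placeholders (not MMA-Sim's actual API); a plain FP32 matmul
# stands in for both the reference model and the hardware so the example runs
# without a GPU.
import numpy as np

def bitwise_equal(x: np.ndarray, y: np.ndarray) -> bool:
    # Compare raw bit patterns, so +0.0 vs -0.0 or differing NaN payloads
    # count as mismatches that a plain == comparison would hide.
    return np.array_equal(x.view(np.uint32), y.view(np.uint32))

def random_test(reference, hardware, trials=1000, m=16, n=16, k=16):
    rng = np.random.default_rng(0)
    for t in range(trials):
        a = rng.standard_normal((m, k), dtype=np.float32).astype(np.float16)
        b = rng.standard_normal((k, n), dtype=np.float32).astype(np.float16)
        c = rng.standard_normal((m, n), dtype=np.float32)
        if not bitwise_equal(reference(a, b, c), hardware(a, b, c)):
            return f"mismatch at trial {t}"
    return "bitwise equivalent on all trials"

# Demonstration only: both sides are the same software matmul, so the harness
# reports equivalence. In the real setting one side would be MMA-Sim and the
# other the GPU's MMA instruction computing D = A*B + C.
fp32_mma = lambda a, b, c: a.astype(np.float32) @ b.astype(np.float32) + c
print(random_test(fp32_mma, fp32_mma, trials=10))
```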
Furthermore, they discovered reduced accumulation precision in FP8 instructions on NVIDIA’s Ada Lovelace architecture, which is improved in newer Blackwell and RTX Blackwell designs. Notably, researchers identified asymmetric rounding in TF32, FP16, and BF16 instructions on AMD’s CDNA3, and a modified rounding method in FP8 instructions on CDNA3 that mitigates these issues. This work delivers a powerful analytical tool for both software developers and hardware designers, bridging the gap between hardware implementations and software expectations, and providing detailed functional specifications for future MMA design and verification. MMA-Sim will be released as an open-source resource, providing a unified framework for modelling, analysis, and optimisation in both software and hardware domains.
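How much the rounding scheme matters when a wide accumulator is narrowed to the output format can be seen in a generic sketch, which makes no claim about CDNA3's or any vendor's exact behaviour: round-to-nearest-even and truncation toward zero deliver different, and differently biased, FP16 results for the same exact value.

```python
# Generic sketch (not a claim about CDNA3's or any vendor's exact scheme) of
# how the rounding used when narrowing a wide accumulator value to FP16 can
# bias the delivered result.
import numpy as np

def round_nearest_fp16(x: float) -> np.float16:
    return np.float16(x)  # numpy's cast rounds to nearest, ties to even

def truncate_fp16(x: float) -> np.float16:
    # Chop toward zero: if the nearest-rounded value moved away from zero,
    # step its FP16 bit pattern back by one ulp.
    f = np.float16(x)
    if abs(float(f)) > abs(x):
        f = np.uint16(f.view(np.uint16) - 1).view(np.float16)
    return f

wide = 2049.5                      # exact value held in a wide accumulator
print(round_nearest_fp16(wide))    # 2050.0 (nearest representable FP16)
print(truncate_fp16(wide))         # 2048.0 (systematically biased toward zero)
```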
MMA-Sim Reveals GPU Arithmetic Behaviour
This research presents MMA-Sim, a novel bit-accurate reference model that comprehensively reveals the arithmetic behaviours of matrix multiplication accelerators (MMAs) found in modern graphics processing units from both NVIDIA and AMD. Through systematic testing and validation, the team successfully created a model that precisely replicates the calculations performed by these hardware components, achieving bitwise equivalence with actual devices. This allows for detailed analysis of how these accelerators handle floating-point calculations, uncovering previously undocumented behaviours that could impact the stability and reproducibility of deep learning tasks. The significance of this work lies in its ability to provide a transparent and reliable tool for understanding the numerical characteristics of MMAs.
By accurately simulating these complex calculations, researchers can investigate potential sources of error and optimise algorithms for improved performance and accuracy. The team identified variations in arithmetic behaviour across GPU architectures and precision formats, highlighting the need to account for these factors when developing and deploying deep neural networks. The authors intend to make MMA-Sim publicly available as an open-source tool to facilitate broader investigation and collaboration within the research community, enabling cross-vendor comparison, reproduction of arithmetic behaviours, and precision-aware co-design of future accelerators and artificial intelligence systems.