{"id":8697,"date":"2025-09-11T03:35:24","date_gmt":"2025-09-11T03:35:24","guid":{"rendered":"https:\/\/www.newsbeep.com\/il\/8697\/"},"modified":"2025-09-11T03:35:24","modified_gmt":"2025-09-11T03:35:24","slug":"analog-in-memory-computing-attention-mechanism-for-fast-and-energy-efficient-large-language-models","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/il\/8697\/","title":{"rendered":"Analog in-memory computing attention mechanism for fast and energy-efficient large language models"},"content":{"rendered":"<p>Hardware-based neural network simulations<\/p>\n<p>We implement the sliding window attention by masking the elements of S outside the sliding window (blank spaces in the example Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>). The HardSigmoid charge-to-pulse circuit is modeled by the equation<\/p>\n<p>$$\\phi (S)=\\left\\{\\begin{array}{ll}{T}_{{\\mathrm{max}}}\\quad &amp;{\\rm{if}}\\,S\\ge {S}_{{\\mathrm{sat}}}\\\\ \\frac{{T}_{{\\mathrm{max}}}}{{S}_{{\\mathrm{sat}}}}S\\quad &amp;{\\rm{if}}\\,0 &lt; S &lt; {S}_{{\\mathrm{sat}}}\\\\ 0\\quad &amp;{\\rm{if}}\\,S\\le 0\\end{array}\\right.,$$<\/p>\n<p>\n                    (3)\n                <\/p>\n<p>where Tmax\u2009=\u200915\u2009ns is the maximum pulse length for the input pulse generators. The input queries Q are quantized in 16 levels between 0 and 1, the stored K and V projections are quantized in 8 levels between 0 and 0.9, and the outputs of the second dot product are quantized in 32 levels between \u22121 and 1. The quantized models (linear intermediate hardware model and nonlinear hardware model) are trained with quantization aware training<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Jacob, B. et al. 
Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2704&#x2013;2713 (IEEE, 2018).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR48\" id=\"ref-link-section-d52126104e3572\" rel=\"nofollow noopener\" target=\"_blank\">48<\/a>: quantization is done only in the forward pass, and the backward pass is done in full precision.<\/p>\n<p>For the nonlinear model of the gain cell, the third-order polynomials<\/p>\n<p>$$\\begin{array}{r}S=\\mathop{\\sum }\\limits_{i}^{3}\\mathop{\\sum }\\limits_{j}^{3-i}Q\\cdot {\\left({K}^{T}-{K}_{{\\mathrm{offset}}}\\right)}^{i}{V}_{{\\mathrm{in}}}^{\\,j}{C}_{i,\\,j}\\\\ A=\\mathop{\\sum }\\limits_{i}^{3}\\mathop{\\sum }\\limits_{j}^{3-i}\\phi \\left(S\\right)\\cdot {\\left(V-{V}_{{\\mathrm{offset}}}\\right)}^{i}{V}_{{\\mathrm{in}}}^{\\,j}{C}_{i,\\,j}\\end{array}$$<\/p>\n<p>\n                    (4)\n                <\/p>\n<p>are used, with S and A as the outputs, Q and \u03d5(S) as the input pulse widths, K and V as the stored voltages, the constant Vin\u2009=\u20090.9\u2009V as the input voltage of the cell applied at the word line read (WLR) ports, the constants Koffset\u2009=\u2009Voffset\u2009=\u20090.45\u2009V corresponding to half the supply voltage (VDD\/2), and Ci,j as fit parameters obtained from the curves in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig1\" rel=\"nofollow noopener\" target=\"_blank\">1e<\/a>. To speed up computation during training, we compute all the tokens in parallel with \\(Q\\in {{\\mathbb{R}}}^{T,D}\\), \\({K}^{T}\\in {{\\mathbb{R}}}^{D,T}\\), \\(V\\in {{\\mathbb{R}}}^{T,D}\\) and \\(\\phi \\left(S\\right)\\in {{\\mathbb{R}}}^{T,T}\\) (the batch dimension and the head dimension are omitted for simplicity).<\/p>\n<p>The capacitor leakage leads to an exponential decay in the stored value. 
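As an illustrative sketch (assuming NumPy; the function names are ours, not from the paper), the HardSigmoid charge-to-pulse response of equation (3) and the uniform quantization applied to Q, K and V can be written as:

```python
import numpy as np

T_MAX = 15e-9  # maximum pulse length Tmax = 15 ns (from the text)

def hard_sigmoid_pulse(S, S_sat, T_max=T_MAX):
    # Equation (3): zero below S = 0, a linear ramp up to S_sat,
    # then saturation at the maximum pulse length T_max.
    return np.clip(S / S_sat, 0.0, 1.0) * T_max

def quantize(x, levels, lo, hi):
    # Uniform quantization, e.g. Q to 16 levels in [0, 1]
    # and the stored K, V projections to 8 levels in [0, 0.9].
    step = (hi - lo) / (levels - 1)
    return lo + np.round((np.clip(x, lo, hi) - lo) / step) * step
```

In quantization-aware training, `quantize` would be applied only in the forward pass, with gradients passed straight through in the backward pass.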
After discretization, the exponential decay is formulated as<\/p>\n<p>$${y}_{t}={y}_{t-1}{{\\mathrm{e}}}^{-\\frac{{\\Delta }_{t}}{\\tau }};\\quad {\\Delta }_{t}=L{\\delta }_{t},$$<\/p>\n<p>\n                    (5)\n                <\/p>\n<p>where \u03c4 is the time constant of the capacitors, \u0394t is the time elapsed between two inference steps, \u03b4t is the latency caused by each neural network layer, and L is the number of layers. To model the decay of all capacitors at all time steps in parallel, we introduce a decay mask \\(\\alpha \\in {{\\mathbb{R}}}^{T,T}\\) defined as<\/p>\n<p>$$\\alpha ={{\\mathrm{e}}}^{-\\frac{{\\Delta }_{t}}{\\tau }{m}_{t,{t}^{{\\prime} }}};\\quad {m}_{t,{t}^{{\\prime} }}=\\max \\left(0,t-{t}^{{\\prime} }\\right),$$<\/p>\n<p>\n                    (6)\n                <\/p>\n<p>where m is the relative position between tokens. To optimize computation, the decay mask is directly integrated into the dot-product computation as<\/p>\n<p>$$\\begin{array}{l}S=\\mathop{\\sum }\\limits_{i}^{3}\\mathop{\\sum }\\limits_{j}^{3-i}\\left(Q\\cdot {\\left({K}^{T}-{K}_{{\\mathrm{offset}}}\\right)}^{i}{V}_{{\\mathrm{in}}}^{\\,j}{C}_{i,\\,j}\\right){\\alpha }^{i}\\\\ A=\\mathop{\\sum }\\limits_{i}^{3}\\mathop{\\sum }\\limits_{j}^{3-i}\\left(\\phi \\left(S\\right){\\alpha }^{i}\\right)\\cdot {\\left(V-{V}_{{\\mathrm{offset}}}\\right)}^{i}{V}_{{\\mathrm{in}}}^{\\,j}{C}_{i,\\,j}\\end{array}$$<\/p>\n<p>\n                    (7)\n                <\/p>\n<p>In our simulation, we chose a time constant \u03c4\u2009=\u20095\u2009ms to be consistent with the data from Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig1\" rel=\"nofollow noopener\" target=\"_blank\">1h<\/a>. We chose \u03b4t\u2009=\u200965\u2009ns to be equal to the latency of our full hardware attention mechanism (Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig2\" rel=\"nofollow noopener\" target=\"_blank\">2c<\/a>). Our decay factor is therefore \\(\\frac{{\\Delta }_{t}}{\\tau }=\\frac{12\\times 65\\times 1{0}^{-9}}{5\\times 1{0}^{-3}}\\simeq 1.6\\times 1{0}^{-4}\\). In a full transformer implementation, the latency per layer \u03b4t =\u2009will be higher than 65\u2009ns as it will also include latency from other modules, such as feedforward neural networks. However, time constant \u03c4 of three orders of magnitude larger were reported in OSFET-based gain-cell memories<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Wang, Y. et al. An in-memory computing architecture based on two-dimensional semiconductors for multiply&#x2013;accumulate operations. Nat. Commun. &#010;                https:\/\/doi.org\/10.1038\/s41467-021-23719-3&#010;                &#010;               (2021).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR26\" id=\"ref-link-section-d52126104e4953\" rel=\"nofollow noopener\" target=\"_blank\">26<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 29\" title=\"Belmonte, A. et al. Lowest IOFF &lt;3&#xD7;10&#x2212;21 A\/&#x3BC;m in capacitorless DRAM achieved by reactive ion etch of IGZO-TFT. In Proc. 
2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 1&#x2013;2 (IEEE, 2023); &#010;                https:\/\/doi.org\/10.23919\/VLSITechnologyandCir57934.2023.10185398&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR29\" id=\"ref-link-section-d52126104e4956\" rel=\"nofollow noopener\" target=\"_blank\">29<\/a>, and we therefore conclude that the chosen decay factor of 1.6 \u00d7\u200910\u22124 is very conservative. In Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">6<\/a>, we empirically study the effect of the decay constant on language processing accuracy. It is noteworthy that the decay of stored keys and values may not necessarily hinder network performance: several approaches in deep learning leverage exponential decay masks to enhance memory structure<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 39\" title=\"Ma, X. et al. Mega: moving average equipped gated attention. In Proc. 11th International Conference on Learning Representations (2023); &#010;                https:\/\/openreview.net\/forum?id=qNLe3iq2El&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR39\" id=\"ref-link-section-d52126104e4966\" rel=\"nofollow noopener\" target=\"_blank\">39<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Press, O., Smith, N. A. &amp; Lewis, M. Train short, test long: attention with linear biases enables input length extrapolation. In Proc. 
International Conference on Learning Representations (2022); &#010;                https:\/\/openreview.net\/forum?id=R8sQPpGCv0&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR49\" id=\"ref-link-section-d52126104e4969\" rel=\"nofollow noopener\" target=\"_blank\">49<\/a>. In <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">Supplementary Information<\/a> section \u2018Effect of capacitor\u2019s leakage\u2019, we study the connection between the decay of the KV pairs and the relative positional embedding ALiBi<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Press, O., Smith, N. A. &amp; Lewis, M. Train short, test long: attention with linear biases enables input length extrapolation. In Proc. International Conference on Learning Representations (2022); &#010;                https:\/\/openreview.net\/forum?id=R8sQPpGCv0&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR49\" id=\"ref-link-section-d52126104e4976\" rel=\"nofollow noopener\" target=\"_blank\">49<\/a>.<\/p>\n<p>To speed up our training process, we used the library Triton<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Tillet, P., Kung, H. T. &amp; Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proc. 
3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019 10&#x2013;19 (Association for Computing Machinery, 2019); &#010;                https:\/\/doi.org\/10.1145\/3315508.3329973&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR50\" id=\"ref-link-section-d52126104e4984\" rel=\"nofollow noopener\" target=\"_blank\">50<\/a> to incorporate our simulations into an adapted version of the flash attention algorithm<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Dao, T. FlashAttention-2: faster attention with better parallelism and work partitioning. In Proc. 12th International Conference on Learning Representations (2024); &#010;                https:\/\/openreview.net\/forum?id=mZn2Xyh9Ec&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR51\" id=\"ref-link-section-d52126104e4988\" rel=\"nofollow noopener\" target=\"_blank\">51<\/a>, which optimizes GPU resource usage. This method led to a fivefold latency reduction during training.<\/p>\n<p>For the adaptation, the algorithm is repeated until the mean and standard deviation of the output of the scaling functions of the nonlinear model match those of the linear model within a tolerance: \\(\\left\\vert {\\sigma }_{{\\mathrm{NL}}}-{\\sigma }_{{\\mathrm{L}}}\\right\\vert &lt; 0.0001\\) and \\(\\left\\vert{\\mu}_{{\\mathrm{NL}}}-{\\mu}_{{\\mathrm{L}}}\\right\\vert\\)\\(&lt;0.0001\\).<\/p>\n<p>Nonlinear model adaptation algorithm<\/p>\n<p>Each scaling stage applies an affine function<\/p>\n<p>$$f\\left(x\\right)=ax+b,$$<\/p>\n<p>\n                    (8)\n                <\/p>\n<p>with distinct scalars a and b for each of the Q, K and V projections, as well as for the output of the attention, and with separate factors applied across different attention heads and layers.<\/p>\n<p>To choose the scaling parameters a and b, we develop an algorithm inspired by ref. 
<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Mishkin, D. &amp; Matas, J. All you need is a good init. Preprint at &#010;                https:\/\/arxiv.org\/abs\/1511.06422&#010;                &#010;               (2015).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR52\" id=\"ref-link-section-d52126104e5189\" rel=\"nofollow noopener\" target=\"_blank\">52<\/a>, detailed in Supplementary Algorithm <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>. Given a set of input samples, we use an iterative loop that updates the scaling parameters so that the output of the scaling function of the nonlinear model matches the statistics of the linear model (as sketched in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig4\" rel=\"nofollow noopener\" target=\"_blank\">4b<\/a>). First, we measure the standard deviation \u03c3L and the mean \u03bcL of the output of every scaling stage (see equation (<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"equation anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Equ8\" rel=\"nofollow noopener\" target=\"_blank\">8<\/a>)) of the linear model on a large set of samples. Then, at each iteration, we measure the standard deviation \u03c3NL and the mean \u03bcNL for the scaling stage of the nonlinear model. 
For each iteration, the scaling parameters are updated as<\/p>\n<p>$$\\begin{array}{l}a\\leftarrow a\\frac{{\\sigma}_{{\\mathrm{L}}}}{{\\sigma}_{{\\mathrm{NL}}}}\\\\ b\\leftarrow b+\\left(\\;{\\mu}_{{\\mathrm{L}}}-{\\mu}_{{\\mathrm{NL}}}\\right)\\end{array}.$$<\/p>\n<p>\n                    (9)\n                <\/p>\n<p>Analog sliding window attention timing and execution<\/p>\n<p>To support efficient sequential inference, our architecture implements sliding window attention using a pipelined read\u2013write mechanism across analog gain-cell arrays. At each inference step, new (K,\u2009V) pairs are written into the arrays while the current query (Q) is applied, ensuring that memory access and computation overlap.<\/p>\n<p>Each attention step begins with a 5\u2009ns discharge phase to reset the storage capacitors of the gain cells. New K and V vectors are written to a column of the respective arrays using 10\u2009ns multi-level voltage pulses generated by 3-bit DACs. In parallel, the input query Q is encoded as PWM voltage pulses with durations between 0\u2009ns and Tmax\u2009=\u200915\u2009ns, generated by 4-bit (16 levels) voltage pulse generators operating at 1\u2009GHz.<\/p>\n<p>This parallelization is possible because the V array is not required during the Q\u2009\u22c5\u2009KT computation phase and can therefore be updated while the first dot product is processed. Once the write is complete, the charge-to-pulse circuit for the V array is reset, and the resulting \u03d5(S) pulses from the K array\u2019s readout are applied to the V array to compute the second dot product \u03d5(S)\u2009\u22c5\u2009V.<\/p>\n<p>After M time steps, when all columns in the K and V arrays have been populated, the first column is overwritten, preserving a sliding attention window of fixed size M. The succession of write and read phases implements a sequential sliding window attention mechanism, with minimal idle time and continuous throughput. 
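As a minimal plain-Python sketch (the class and its names are ours, an abstraction of the write address behavior rather than the circuit itself), the circular overwriting that realizes the fixed-size sliding window can be illustrated as:

```python
class SlidingKVCache:
    """Illustrative circular buffer: after M steps, the oldest K/V column
    is overwritten, preserving a sliding attention window of fixed size M."""

    def __init__(self, M, d):
        self.M = M
        self.K = [[0.0] * d for _ in range(M)]  # one stored K column per window slot
        self.V = [[0.0] * d for _ in range(M)]  # one stored V column per window slot
        self.step = 0

    def write(self, k_col, v_col):
        idx = self.step % self.M  # write address wraps around after M steps
        self.K[idx] = list(k_col)
        self.V[idx] = list(v_col)
        self.step += 1
        return idx
```

In the hardware, this write overlaps with the first dot-product phase, as described above.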
This pipelined execution scheme is visualized in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig2\" rel=\"nofollow noopener\" target=\"_blank\">2c<\/a>, and forms the basis for the latency and energy analysis presented in later sections.<\/p>\n<p>Sub-tiling to scale attention dimensions<\/p>\n<p>IR drop, caused by resistive losses in interconnects, results in reduced accuracy in large-scale analog crossbar arrays<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Lepri, N., Glukhov, A., Mannocci, P., Porzani, M. &amp; Ielmini, D. Compact modeling and mitigation of parasitics in crosspoint accelerators of neural networks. IEEE Trans. Electron Devices 71, 1900&#x2013;1906 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR53\" id=\"ref-link-section-d52126104e5444\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a>. To mitigate IR drop issues, we limit the size of our gain-cell arrays to 64\u2009\u00d7\u200964. However, most NLP applications require either a larger window dimension M (columns) or a larger embedding dimension d (rows). To accommodate larger dimensions, we perform inference across multiple sub-tiles, as shown in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig3\" rel=\"nofollow noopener\" target=\"_blank\">3a<\/a>.<\/p>\n<p>In this paper, we implement a GPT-2 model with an embedding dimension d\u2009=\u200964 and a sliding window size M\u2009=\u20091,024. Therefore, the entire KV cache of window size M is divided into 16 sub-tiles, each having its own charge-to-pulse blocks and storing a fraction of the K and V in two 64\u2009\u00d7\u200964 arrays. 
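Because the HardSigmoid nonlinearity acts elementwise on the scores (there is no softmax normalization across the window), the per-tile partial outputs sum exactly to the full-window result. A NumPy sketch of this decomposition (dimensions from the text; the charge-to-pulse stage is simplified to a clipped identity for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, n_tiles = 64, 1024, 16   # head dimension, window size, sub-tiles (from the text)
cols = M // n_tiles            # 64 stored tokens per sub-tile

Q = rng.random(d)              # current query
K = rng.random((M, d))         # stored keys over the whole window
V = rng.random((M, d))         # stored values over the whole window
phi = lambda s: np.clip(s, 0.0, 1.0)   # stand-in for the charge-to-pulse stage

# Reference: scores over the whole window, then the weighted sum of V.
A_full = phi(K @ Q) @ V

# Sub-tiled: each tile produces a partial output from its 64 columns; the
# partials are summed, playing the role of the 64 digital adders with 16 inputs.
A_tiled = np.zeros(d)
for t in range(n_tiles):
    sl = slice(t * cols, (t + 1) * cols)
    A_tiled += phi(K[sl] @ Q) @ V[sl]

assert np.allclose(A_full, A_tiled)
```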
A write address controller keeps track of the current write index. All tiles receive, in parallel, the same input Q generated by the digital block; their outputs are measured by pulse counters and summed by 64 digital adders, each with 16 inputs (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig3\" rel=\"nofollow noopener\" target=\"_blank\">3b,c<\/a>). In sliding window attention, the maximum attention span is equal to L(M\u2009\u2212\u20091)\u2009+\u20091 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 43\" title=\"Fu, Z. et al. Sliding window attention training for efficient large language models. Preprint at &#010;                https:\/\/arxiv.org\/abs\/2502.18845&#010;                &#010;               (2025).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR43\" id=\"ref-link-section-d52126104e5488\" rel=\"nofollow noopener\" target=\"_blank\">43<\/a>). Therefore, in the presented architecture, the maximum attention span can be increased by increasing the number of sub-tiles. However, this leads to an additional area footprint that scales linearly with the sliding window dimension, and to additional latency, as each digital adder requires one clock cycle.<\/p>\n<p>Hardware-based neural network training<\/p>\n<p>To evaluate our training algorithm and the inference accuracy of our architecture, we implement the analog gain-cell-based attention mechanism on the GPT-2 architecture<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Radford, A. et al. Language models are unsupervised multitask learners. 
OpenAI &#010;                https:\/\/cdn.openai.com\/better-language-models\/language_models_are_unsupervised_multitask_learners.pdf&#010;                &#010;               (2019).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR54\" id=\"ref-link-section-d52126104e5501\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a>. GPT-2 is a transformer neural network with 124\u2009million parameters, 12 layers, an attention mechanism input dimension of 768, 12 heads per attention block and a head dimension of 64. We used the open-source text collection OpenWebText<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 44\" title=\"Gokaslan, A. &amp; Cohen, V. OpenWebText Corpus. GitHub &#010;                http:\/\/Skylion007.github.io\/OpenWebTextCorpus&#010;                &#010;               (2019).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR44\" id=\"ref-link-section-d52126104e5505\" rel=\"nofollow noopener\" target=\"_blank\">44<\/a>, split between training and testing samples, and the pre-trained GPT-2 tokenizer to encode the plain text into tokens drawn from a vocabulary of size 50,304. Each training iteration had a batch size of 1,920, with sequences of length 1,024 per sample. We selected a sliding window size of 1,024, which matches the total number of gain-cell columns in the memory. As the sequence length also equals 1,024, each gain cell is written only once per sequence, eliminating the need to overwrite cells during one sliding window iteration. For a larger sequence length, the gain cells would be overwritten, as described in the section \u2018Analog hardware sliding window attention data-flow\u2019. To train the network, the next token in the sequence is predicted for each input token. Thus, the target sequences are the input sequences shifted by one token. 
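A tiny illustration of the shifted-target construction (toy token IDs; the real sequences are 1,024 tokens long):

```python
# Toy token sequence standing in for a tokenized training sample.
tokens = [17, 42, 7, 99, 3]

# The target sequence is the input shifted by one token: for each input
# token, the model is trained to predict the next token in the sequence.
inputs, targets = tokens[:-1], tokens[1:]

assert inputs == [17, 42, 7, 99]
assert targets == [42, 7, 99, 3]
```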
The cost function used was cross-entropy, calculated between the predicted sequence and the target sequence. We used backpropagation with the AdamW optimizer<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"Loshchilov, I. &amp; Hutter, F. Decoupled weight decay regularization. In Proc. International Conference on Learning Representations (2019); &#010;                https:\/\/openreview.net\/forum?id=Bkg6RiCqY7&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR55\" id=\"ref-link-section-d52126104e5509\" rel=\"nofollow noopener\" target=\"_blank\">55<\/a>, with a learning rate of 6\u2009\u00d7 10\u22124 and a weight decay of 0.1. The results of each evaluation are averaged over 4,000 samples.<\/p>\n<p>Downstream tasks set-up<\/p>\n<p>The datasets cover various types of problem. Our benchmarking set-up is inspired by refs. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 11\" title=\"Gu, A. &amp; Dao, T. Mamba: linear-time sequence modeling with selective state spaces. In Proc. Conference on Language Modeling (2024); &#010;                https:\/\/openreview.net\/forum?id=tEYskw1VY2&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR11\" id=\"ref-link-section-d52126104e5523\" rel=\"nofollow noopener\" target=\"_blank\">11<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Beck, M. et al. xLSTM: extended long short-term memory. In Proc. 
38th Annual Conference on Neural Information Processing Systems (2024); &#010;                https:\/\/openreview.net\/forum?id=ARAxPPIAhq&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR56\" id=\"ref-link-section-d52126104e5526\" rel=\"nofollow noopener\" target=\"_blank\">56<\/a> in terms of evaluated tasks and metrics. ARC-Easy and ARC-Challenge<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Clark, P. et al. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. Preprint at &#010;                https:\/\/arxiv.org\/abs\/1803.05457&#010;                &#010;               (2018).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR57\" id=\"ref-link-section-d52126104e5530\" rel=\"nofollow noopener\" target=\"_blank\">57<\/a> focus on question answering, with ARC-Easy containing straightforward questions and ARC-Challenge featuring more difficult ones. WinoGrande<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Sakaguchi, K., Bras, R. L., Bhagavatula, C. &amp; Choi, Y. WinoGrande: an adversarial winograd schema challenge at scale. Commun. ACM 64, 99&#x2013;106 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR58\" id=\"ref-link-section-d52126104e5534\" rel=\"nofollow noopener\" target=\"_blank\">58<\/a> evaluates common-sense reasoning and co-reference resolution by presenting minimal pairs to resolve ambiguities. HellaSwag<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. &amp; Choi, Y. HellaSwag: can a machine really finish your sentence? In Proc. 
57th Annual Meeting of the Association for Computational Linguistics, 4791&#x2013;4800 (ACL, 2019).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR59\" id=\"ref-link-section-d52126104e5538\" rel=\"nofollow noopener\" target=\"_blank\">59<\/a> tests common-sense inference, requiring models to predict the most plausible continuation of a given context. LAMBADA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 60\" title=\"Paperno, D. et al. The LAMBADA dataset: word prediction requiring a broad discourse context. In Proc. 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Erk, K. &amp; Smith, N. A.) 1525&#x2013;1534 (Association for Computational Linguistics, 2016); &#010;                https:\/\/doi.org\/10.18653\/v1\/P16-1144&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR60\" id=\"ref-link-section-d52126104e5542\" rel=\"nofollow noopener\" target=\"_blank\">60<\/a> evaluates models\u2019 text understanding through a word prediction task that requires comprehension of broader discourse, not just local context. PIQA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 61\" title=\"Bisk, Y., Zellers, R., Bras, R. L., Gao, J. &amp; Choi, Y. PIQA: reasoning about physical commonsense in natural language. In Proc. 34th AAAI Conference on Artificial Intelligence, 7432&#x2013;7439 (AAAI, 2020).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR61\" id=\"ref-link-section-d52126104e5547\" rel=\"nofollow noopener\" target=\"_blank\">61<\/a> assesses physical common-sense reasoning, testing a model\u2019s understanding of physical scenarios. 
WikiText-2<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 62\" title=\"Merity, S., Xiong, C., Bradbury, J. &amp; Socher, R. Pointer sentinel mixture models. In Proc. International Conference on Learning Representations (2017); &#010;                https:\/\/openreview.net\/forum?id=Byj72udxe&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR62\" id=\"ref-link-section-d52126104e5551\" rel=\"nofollow noopener\" target=\"_blank\">62<\/a> is a general text corpus derived from Wikipedia articles that assesses the processing of long-range dependencies, text prediction and generation capabilities. For WikiText-2, we report perplexity scores normalized by the word count of the original text. For fair comparison, all models except the publicly available software GPT-2 were evaluated after the same number of training iterations. The linear hardware model was trained for 13,000 iterations; the nonlinear hardware model was mapped from the 13,000-iteration linear model using the adaptation algorithm, without fine-tuning; and the nonlinear hardware model with adaptation and fine-tuning was adapted from a linear model trained for 3,000 iterations and then fine-tuned for 10,000 iterations.<\/p>\n<p>Hardware SPICE simulations<\/p>\n<p>To assess circuit accuracy, energy consumption and speed, we conducted SPICE array simulations using the TSMC 28\u2009nm PDK within the Cadence Virtuoso environment. All simulations are based on a 64\u2009\u00d7\u200964 array, corresponding to the tile size in our architecture (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig3\" rel=\"nofollow noopener\" target=\"_blank\">3a<\/a>). 
To extrapolate the energy and latency for a full attention head with a window size of 1,024, we multiply the per-sub-tile measurements by 16, reflecting the total number of sub-tiles comprising one attention head in our architecture. In these simulations, a parasitic wire capacitance of 0.8\u2009fF and a series resistance of 2\u2009\u03a9 per array element are included. Both arrays, one performing \u03d5(Q \u22c5 KT) and the other performing \u03d5(S)\u2009\u22c5 V, are simulated separately, but always in combination with their specific charge-to-pulse readout circuitry.<\/p>\n<p>GPU attention latency and energy consumption measurements<\/p>\n<p>To measure the latency and energy on the Nvidia RTX 4090, Nvidia H100 and Nvidia Jetson Nano, which are a consumer GPU, a data-center GPU and an embedded-application GPU, respectively, we perform 10 runs of 1,024 steps of autoregressive token generation with 12 attention heads using the method FlashAttention-2<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Dao, T. FlashAttention-2: faster attention with better parallelism and work partitioning. In Proc. 12th International Conference on Learning Representations (2024); &#010;                https:\/\/openreview.net\/forum?id=mZn2Xyh9Ec&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR51\" id=\"ref-link-section-d52126104e5596\" rel=\"nofollow noopener\" target=\"_blank\">51<\/a>, which optimizes attention computation on GPUs. The energy and latency measurements focus solely on attention computation; for a fair comparison, the linear projections are not implemented in this experiment, as they are also not implemented by our hardware architecture, and the static power measured before inference is subtracted from the power measured during inference. 
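The static-power subtraction can be summarized with a small helper (the function and the numbers below are illustrative, not measured values):

```python
def net_attention_energy_j(p_inference_w, p_static_w, latency_s):
    # Energy attributed to attention computation: static power measured
    # before inference is subtracted from the power measured during
    # inference, and the difference is multiplied by the measured latency.
    return (p_inference_w - p_static_w) * latency_s

# Made-up example: 180 W during inference, 60 W static, 2 ms of
# attention computation -> 0.24 J attributed to attention.
energy = net_attention_energy_j(180.0, 60.0, 2e-3)
```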
For each run, we measure the latency and power using the Nvidia SMI Python API, and average them over the runs.<\/p>\n<p>Area estimation<\/p>\n<p>Our floorplan is based on ITO gain cells, an emerging OSFET technology that has enabled low-area gain-cell designs<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Liu, S. et al. Design guidelines for oxide semiconductor gain cell memory on a logic platform. IEEE Trans. Electron Devices 71, 3329&#x2013;3335 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR45\" id=\"ref-link-section-d52126104e5608\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>. A two-transistor ITO gain cell occupies an area of 0.14\u2009\u03bcm2 (approximately 370\u2009nm\u2009\u00d7\u2009370\u2009nm)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Liu, S. et al. Design guidelines for oxide semiconductor gain cell memory on a logic platform. IEEE Trans. Electron Devices 71, 3329&#x2013;3335 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR45\" id=\"ref-link-section-d52126104e5614\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>, allowing for denser memories than CMOS-based gain cells. On the basis of the area results presented in these studies<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Liu, S. et al. Design guidelines for oxide semiconductor gain cell memory on a logic platform. IEEE Trans. 
Electron Devices 71, 3329&#x2013;3335 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR45\" id=\"ref-link-section-d52126104e5618\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 46\" title=\"Subhechha, S. et al. Demonstration of multilevel multiply accumulate operations for AiMC using engineered a-IGZO transistors-based 2T1C gain cell arrays. In Proc. 2023 IEEE International Memory Workshop (IMW) 1&#x2013;4 (IEEE, 2023); &#010;                https:\/\/doi.org\/10.1109\/IMW56887.2023.10145946&#010;                &#010;              \" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#ref-CR46\" id=\"ref-link-section-d52126104e5621\" rel=\"nofollow noopener\" target=\"_blank\">46<\/a>, we estimate the worst-case area of the proposed 6-transistor cell to be 1\u2009\u03bcm2, leading to a 19\u00d7 area reduction compared with gain cells based on CMOS write transistors (our CMOS-based gain-cell layout is presented in Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>). The total area of 1 attention head is derived from this single-cell area estimation, as well as the charge-to-pulse circuit layout and the total floorplan incorporating the 16 sub-tiles and digital circuits, providing a precise representation of the space requirements. This structure is designed to be repetitive (vertical dimension in Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s43588-025-00854-1#Fig3\" rel=\"nofollow noopener\" target=\"_blank\">3c<\/a>), allowing multiple attention heads to be efficiently integrated on a single chip. 
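The cell-level part of this estimate reduces to simple arithmetic. The sketch below assumes one K array and one V array per head, the window size of 1,024 used in this work, and a head dimension of 64 (the head dimension is an assumption, not stated in this excerpt); the CMOS cell area is back-calculated from the reported 19× reduction.

```python
CELL_AREA_UM2 = 1.0        # worst-case six-transistor ITO cell (from the text)
CMOS_CELL_AREA_UM2 = 19.0  # implied by the reported 19x reduction
WINDOW = 1024              # sliding-window length
HEAD_DIM = 64              # assumed head dimension (not given in this excerpt)

# One K array and one V array of WINDOW x HEAD_DIM cells each.
cells_per_head = 2 * WINDOW * HEAD_DIM
ito_area_mm2 = cells_per_head * CELL_AREA_UM2 / 1e6
cmos_area_mm2 = cells_per_head * CMOS_CELL_AREA_UM2 / 1e6
print(f"ITO cells per head: {cells_per_head}")
print(f"ITO:  {ito_area_mm2:.3f} mm^2 of cell area")
print(f"CMOS: {cmos_area_mm2:.3f} mm^2 ({cmos_area_mm2 / ito_area_mm2:.0f}x larger)")
```

This covers only the gain-cell arrays; the full head area in the paper additionally includes the charge-to-pulse circuitry and the digital blocks of the floorplan.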
Each attention head receives its inputs from the lower digital block, while its outputs are processed by the upper digital block. To connect the bitline outputs of one array (vertical metal lines) to the wordline inputs of the next array (horizontal metal lines), we employ wire tapping, as highlighted in Fig. 3d.

When considering 3D-stacked gain cells, the effective cell area is reported in ref. 45 as 0.14/N μm², where N denotes the number of parallel oxide layers.
Consequently, a signed gain-cell implementation would occupy 0.28/N μm², consisting of two gain cells, one for the positive part and one for the negative part.
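This 3D-stacking scaling can be captured in a few lines; a sketch, where the 0.14/N μm² figure is the value reported in ref. 45 and the function name is illustrative:

```python
def effective_cell_area_um2(n_layers, signed=False):
    """Effective ITO gain-cell area with n_layers parallel oxide layers
    (0.14/N um^2 per ref. 45); a signed value needs a positive and a
    negative cell, doubling the area to 0.28/N um^2."""
    if n_layers < 1:
        raise ValueError("n_layers must be a positive integer")
    base = 0.14 / n_layers
    return 2.0 * base if signed else base
```

For example, with four stacked oxide layers a signed cell would occupy about 0.07 μm², half the footprint of the planar two-transistor cell.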