{"id":372593,"date":"2026-01-16T04:07:08","date_gmt":"2026-01-16T04:07:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/uk\/372593\/"},"modified":"2026-01-16T04:07:08","modified_gmt":"2026-01-16T04:07:08","slug":"openai-to-serve-chatgpt-on-cerebras-ai-dinner-plates-the-register","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/uk\/372593\/","title":{"rendered":"OpenAI to serve ChatGPT on Cerebras&#8217; AI dinner plates \u2022 The Register"},"content":{"rendered":"<p>OpenAI says it will deploy 750 megawatts worth of Nvidia competitor Cerebras&#8217; dinner-plate sized accelerators through 2028 to bolster its inference services.<\/p>\n<p>The deal, which will see Cerebras take on the risk of building and leasing datacenters to serve OpenAI, is valued at more than $10 billion, sources familiar with the matter tell El Reg.<\/p>\n<p>By integrating Cerebras&#8217; wafer-scale compute architecture into its inference pipeline, OpenAI can take advantage of the chip&#8217;s massive SRAM capacity to speed up inference. Each of the chip startup&#8217;s <a target=\"_blank\" href=\"https:\/\/www.theregister.com\/2024\/03\/13\/cerebras_claims_to_have_revived\/\" rel=\"nofollow noopener\">WSE-3 accelerators<\/a> measures in at 46,225 mm2 and is equipped with 44 GB of SRAM.<\/p>\n<p>Compared to the HBM found on modern GPUs, SRAM is several orders of magnitude faster. While a single <a target=\"_blank\" href=\"https:\/\/www.theregister.com\/2026\/01\/05\/ces_rubin_nvidia\/\" rel=\"nofollow noopener\">Nvidia Rubin GPU<\/a> can deliver around 22 TB\/s of memory bandwidth, Cerebras&#8217; chips achieve nearly 1,000x that at 21 Petabytes a second.<\/p>\n<p>All that bandwidth translates into extremely fast inference performance. Running models like OpenAI&#8217;s gpt-oss 120B, Cerebras&#8217; chips can <a target=\"_blank\" rel=\"nofollow noopener\" href=\"https:\/\/artificialanalysis.ai\/models\/gpt-oss-120b\/providers\">purportedly achieve<\/a> single user performance of 3,098 tokens a second as compared to 885 tok\/s for competitor Together AI, which uses Nvidia GPUs.<\/p>\n<p>In the age of reasoning models and AI agents, faster inference means models can &#8220;think&#8221; for longer without compromising on interactivity.<\/p>\n<p>&#8220;Integrating Cerebras into our mix of compute solutions is all about making our AI respond much faster. When you ask a hard question, generate code, create an image, or run an AI agent, there is a loop happening behind the scenes: you send a request, the model thinks, and it sends something back,&#8221; OpenAI explained in a recent <a target=\"_blank\" rel=\"nofollow noopener\" href=\"https:\/\/openai.com\/index\/cerebras-partnership\/\">blog post<\/a>. &#8220;When AI responds in real time, users do more with it, stay longer, and run higher-value workloads.&#8221;<\/p>\n<p>However, Cerebras&#8217; architecture has some limitations. SRAM isn&#8217;t particularly space efficient, which is why, despite the chip&#8217;s impressive size, they only pack about as much memory as a six-year-old Nvidia A100 PCIe card.<\/p>\n<p>Because of this, larger models need to be parallelized across multiple chips, each of which are rated for a prodigious 23 kW of power. Depending on the precision used, the number of chips required can be considerable. At 16-bit precision, which Cerebras has historically preferred for its higher-quality outputs, every billion parameters ate up 2 GB of SRAM capacity. 
It's been nearly two years since Cerebras unveiled a new wafer-scale accelerator (https://www.theregister.com/2024/03/13/cerebras_claims_to_have_revived/), and since then the company's priorities have shifted from training to inference. We suspect the biz's next chip may dedicate a larger area to SRAM and add support for modern block floating point data types like MXFP4 (https://www.theregister.com/2025/08/10/openai_mxfp4/), which should dramatically increase the size of the models that can be served on a single chip.

Having said that, the introduction of a model router (https://www.theregister.com/2025/08/13/gpt_5_cost_cutting/) with the launch of OpenAI's GPT-5 last summer should help mitigate Cerebras' memory constraints. The approach ensures that the vast majority of requests fielded by ChatGPT are fulfilled by smaller, cost-optimized models; only the most complex queries run on OpenAI's largest and most resource-intensive models.

It's also possible that OpenAI may choose to run only a portion of its inference pipeline on Cerebras' kit. Over the past year, the concept of disaggregated inference has taken off (https://www.theregister.com/2025/09/10/nvidia_rubin_cpx/).

In theory, OpenAI could run compute-heavy prompt processing on AMD or Nvidia GPUs and offload the workload's bandwidth-constrained token generation phase to Cerebras' SRAM-packed accelerators. Whether this is actually an option will depend on Cerebras.

"This is a Cloud service agreement. We build out datacenters with our equipment for OpenAI to power their models with the fastest inference," a company spokesperson told El Reg when asked about the possibility of using its CS-3s in a disaggregated compute architecture.

This doesn't mean it won't happen, but it would be on Cerebras to deploy the GPU systems required to support such a configuration in its datacenters alongside its wafer-scale accelerators.
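To make that split concrete, here is a minimal sketch of how a disaggregated serving layer might hand work between the two hardware pools. The class names, methods, and scheduling here are hypothetical illustrations of the general pattern; neither OpenAI nor Cerebras has described such an implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int

class GpuPrefillPool:
    """Stand-in for a GPU fleet running the compute-bound prompt-processing phase."""
    def prefill(self, req: Request) -> dict:
        # A real system would return the transformer KV cache built from the prompt.
        return {"kv_cache": f"<kv state for {len(req.prompt_tokens)} prompt tokens>"}

class WaferDecodePool:
    """Stand-in for SRAM-heavy accelerators running the bandwidth-bound decode phase."""
    def decode(self, kv_state: dict, max_new_tokens: int) -> list[int]:
        # Decode streams the model's weights once per output token, which is
        # why memory bandwidth dominates this phase.
        return [0] * max_new_tokens  # placeholder token IDs

def serve(req: Request, prefill: GpuPrefillPool, decode: WaferDecodePool) -> list[int]:
    kv_state = prefill.prefill(req)   # prompt processing on GPUs
    # Shipping the KV cache between pools isn't free; that transfer cost is one
    # reason the split pays off mainly for long generations.
    return decode.decode(kv_state, req.max_new_tokens)

tokens = serve(Request(prompt_tokens=list(range(1024)), max_new_tokens=256),
               GpuPrefillPool(), WaferDecodePool())
print(len(tokens))  # 256
```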
®