{"id":534392,"date":"2026-04-16T16:31:08","date_gmt":"2026-04-16T16:31:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/uk\/534392\/"},"modified":"2026-04-16T16:31:08","modified_gmt":"2026-04-16T16:31:08","slug":"an-inference-layer-designed-for-agents","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/uk\/534392\/","title":{"rendered":"an inference layer designed for agents"},"content":{"rendered":"<p>AI models are changing quickly: the best model to use for agentic coding today might in three months be a completely different model from a different provider. On top of this, real-world use cases often require calling more than one model. Your customer support agent might use a fast, cheap model to classify a user&#8217;s message; a large, reasoning model to plan its actions; and a lightweight model to execute individual tasks.<\/p>\n<p>This means you need access to all the models, without tying yourself financially and operationally to a single provider. You also need the right systems in place to monitor costs across providers, ensure reliability when one of them has an outage, and manage latency no matter where your users are.<\/p>\n<p>These challenges are present whenever you\u2019re building with AI, but they get even more pressing when you\u2019re building <a href=\"https:\/\/www.cloudflare.com\/learning\/ai\/what-is-agentic-ai\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">agents<\/a>. A simple chatbot might make one <a href=\"https:\/\/www.cloudflare.com\/learning\/ai\/inference-vs-training\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">inference<\/a> call per user prompt. An agent might chain ten calls together to complete a single task and suddenly, a single slow provider doesn&#8217;t add 50ms, it adds 500ms. 
One failed request isn&#8217;t just a retry; it&#8217;s a cascade of downstream failures.\u00a0<\/p>\n<p>Since launching AI Gateway and Workers AI, we\u2019ve seen incredible adoption from developers building AI-powered applications on Cloudflare, and we\u2019ve been shipping fast to keep up! In just the past few months, we&#8217;ve refreshed the dashboard, added zero-setup default gateways, automatic retries on upstream failures, and more granular logging controls. Today, we\u2019re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.\u00a0<\/p>\n<p>One catalog, one unified endpoint<\/p>\n<p>Starting today, you can call third-party models using the same AI.run() binding you already use for Workers AI. If you\u2019re using Workers, switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or any other provider is a one-line change.\u00a0<\/p>\n<p>const response = await env.AI.run('anthropic\/claude-opus-4-6', {<br \/>\n  input: 'What is Cloudflare?',<br \/>\n}, {<br \/>\n  gateway: { id: 'default' },<br \/>\n});<\/p>\n<p>For those who don\u2019t use Workers, we\u2019ll be releasing REST API support in the coming weeks, so you can access the full model catalog from any environment.<\/p>\n<p>We\u2019re also excited to share that you&#8217;ll now have access to 70+ models across 12+ providers \u2014 all through one API, one line of code to switch between them, and one set of credits to pay for them. 
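The customer-support pattern from the top of this post, where each agent step uses a different model, reduces to a small routing table once every model sits behind the same AI.run() call. A minimal sketch, using the model IDs that appear in this post plus one illustrative Workers AI classifier ID (the mapping itself is an assumption, not a recommendation):

```javascript
// Hypothetical step-to-model routing table; the assignments are illustrative.
const MODEL_FOR_STEP = {
  classify: '@cf/meta/llama-3.1-8b-instruct', // fast, cheap classifier
  plan: 'anthropic/claude-opus-4-6',          // large reasoning model
  execute: '@cf/moonshotai/kimi-k2.5',        // lightweight executor
};

// Pick the model ID for a given agent step. Because every model is reached
// through the same unified endpoint, swapping providers is just editing
// this table -- no other code changes.
function pickModel(step) {
  const model = MODEL_FOR_STEP[step];
  if (!model) throw new Error(`unknown agent step: ${step}`);
  return model;
}

// Inside a Worker you would then call (sketch, not executed here):
//   const response = await env.AI.run(pickModel('plan'),
//     { input: prompt }, { gateway: { id: 'default' } });
```

Centralizing the choice in one table is what makes the "best model changes every three months" problem a one-line edit instead of a refactor.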
And we\u2019re quickly expanding this list as we go.<\/p>\n<p>You can browse through our <a href=\"https:\/\/developers.cloudflare.com\/ai\/models\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">model catalog<\/a> to find the best model for your use case, from open-source models hosted on Cloudflare Workers AI to proprietary models from the major model providers. We\u2019re excited to be expanding access to models from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu \u2014 all of whom will provide their models through AI Gateway. Notably, we\u2019re expanding our model offerings to include image, video, and speech models so that you can build multimodal applications.<\/p>\n<p>Accessing all your models through one API also means you can manage all your AI spend in one place. Most companies today are calling <a href=\"https:\/\/aidbintel.com\/pulse-survey\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">an average of 3.5 models<\/a> across multiple providers, which means no one provider is able to give you a holistic view of your AI usage. With AI Gateway, you\u2019ll get one centralized place to monitor and manage AI spend.<\/p>\n<p>By including custom metadata with your requests, you can get a breakdown of your costs on the attributes you care about most, like spend by free vs. paid users, by individual customers, or by specific workflows in your app.<\/p>\n<p>const response = await env.AI.run('@cf\/moonshotai\/kimi-k2.5',<br \/>\n  {<br \/>\n    prompt: 'What is AI Gateway?'<br \/>\n  },<br \/>\n  {<br \/>\n    metadata: { teamId: 'AI', userId: 12345 }<br \/>\n  }<br \/>\n);<\/p>\n<p>AI Gateway gives you access to models from all the providers through one API. But sometimes you need to run a model you&#8217;ve fine-tuned on your own data or one optimized for your specific use case. 
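The metadata-based cost breakdown described above can be reproduced in miniature: given request logs tagged the same way, a few lines roll spend up by any attribute. The log-record shape and the integer credit costs below are illustrative assumptions, not AI Gateway's actual log schema:

```javascript
// Sketch: total spend grouped by one metadata attribute.
// Records missing the attribute fall into an 'unknown' bucket.
function spendBy(logs, key) {
  const totals = {};
  for (const { metadata = {}, cost = 0 } of logs) {
    const bucket = metadata[key] ?? 'unknown';
    totals[bucket] = (totals[bucket] ?? 0) + cost;
  }
  return totals;
}

// Three hypothetical requests, cost measured in whole credits.
const logs = [
  { metadata: { teamId: 'AI', userId: 12345 }, cost: 4 },
  { metadata: { teamId: 'AI', userId: 67890 }, cost: 2 },
  { metadata: { teamId: 'Support' }, cost: 1 },
];
// spendBy(logs, 'teamId') → { AI: 6, Support: 1 }
```

The same function answers "spend by free vs. paid users" or "spend per workflow" simply by changing the key, which is why consistent metadata tagging pays off.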
For that, we are working on letting users bring their own model to Workers AI.\u00a0<\/p>\n<p>The overwhelming majority of our traffic comes from dedicated instances for Enterprise customers who are running custom models on our platform, and we want to bring this to more customers. To do this, we leverage Replicate\u2019s <a href=\"https:\/\/cog.run\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Cog<\/a> technology to help you containerize machine learning models.<\/p>\n<p>Cog is designed to be quite simple: all you need to do is write down your dependencies in a cog.yaml file, and your inference code in a Python file. Cog abstracts away all the hard things about packaging ML models, such as CUDA dependencies, Python versions, weight loading, etc.<\/p>\n<p>Example of a cog.yaml file:<\/p>\n<p>build:<br \/>\n  python_version: \"3.13\"<br \/>\n  python_requirements: requirements.txt<br \/>\npredict: \"predict.py:Predictor\"<\/p>\n<p>Example of a predict.py file, which has a function to set up the model and a function that runs when you receive an inference request (a prediction):<\/p>\n<p>from cog import BasePredictor, Path, Input<br \/>\nimport torch<\/p>\n<p>class Predictor(BasePredictor):<br \/>\n    def setup(self):<br \/>\n        \"\"\"Load the model into memory to make running multiple predictions efficient\"\"\"<br \/>\n        self.net = torch.load(\"weights.pth\")<\/p>\n<p>    def predict(self,<br \/>\n            image: Path = Input(description=\"Image to enlarge\"),<br \/>\n            scale: float = Input(description=\"Factor to scale image by\", default=1.5)<br \/>\n    ) -&gt; Path:<br \/>\n        \"\"\"Run a single prediction on the model\"\"\"<br \/>\n        # &#8230; pre-processing &#8230;<br \/>\n        output = 
self.net(input)<br \/>\n        # &#8230; post-processing &#8230;<br \/>\n        return output<\/p>\n<p>Then, you can run cog build to build your container image, and push your Cog container to Workers AI. We will deploy and serve the model for you, and you then access it through your usual Workers AI APIs.\u00a0<\/p>\n<p>We\u2019re working on some big projects to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting. We\u2019ve been testing this internally with Cloudflare teams and some external customers who are guiding our vision. If you\u2019re interested in being a design partner with us, please reach out! Soon, anyone will be able to package their model and use it through Workers AI.<\/p>\n<p>The fast path to first token<\/p>\n<p>Using Workers AI models with AI Gateway is particularly powerful if you\u2019re building live agents \u2013 where a user&#8217;s perception of speed hinges on time to first token (how quickly the agent starts responding) rather than how long the full response takes. Even if total inference takes 3 seconds, getting that first token 50ms faster makes the difference between an agent that feels zippy and one that feels sluggish.<\/p>\n<p>Cloudflare&#8217;s network of data centers in 330 cities around the world means AI Gateway is positioned close to both users and inference endpoints, minimizing the network time before streaming begins.<\/p>\n<p>Workers AI also hosts open-source models on its public catalog, which now includes large models purpose-built for agents, such as <a href=\"https:\/\/developers.cloudflare.com\/workers-ai\/models\/kimi-k2.5\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Kimi K2.5<\/a> and real-time voice models. 
When you call these Cloudflare-hosted models through AI Gateway, there&#8217;s no extra hop over the public Internet, since your code and inference run on the same global network, giving your agents the lowest latency possible.<\/p>\n<p>Built for reliability with automatic failover<\/p>\n<p>When building agents, speed is not the only factor users care about \u2013 reliability matters too. Every step in an agent workflow depends on the steps before it, so a single failed call can stall the entire downstream chain.\u00a0<\/p>\n<p>Through AI Gateway, if you&#8217;re calling a model that&#8217;s available on multiple providers and one provider goes down, we&#8217;ll automatically route to another available provider without you having to write any failover logic of your own.\u00a0<\/p>\n<p>If you\u2019re building <a href=\"https:\/\/blog.cloudflare.com\/project-think\/\" rel=\"nofollow noopener\" target=\"_blank\">long-running agents with Agents SDK<\/a>, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they\u2019re generated, independently of your agent&#8217;s lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or pay twice for the same output tokens. Combined with the Agents SDK&#8217;s built-in checkpointing, the end user never notices.<\/p>\n<p>The Replicate team has officially <a href=\"https:\/\/blog.cloudflare.com\/replicate-joins-cloudflare\/\" rel=\"nofollow noopener\" target=\"_blank\">joined<\/a> our AI Platform team; in fact, we don\u2019t even consider ourselves separate teams anymore. 
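The automatic failover described in this section amounts to an ordered fallback loop; AI Gateway runs the equivalent server-side, so your agent code stays a single AI.run() call. A minimal sketch of the idea, where the provider objects are illustrative stand-ins rather than a real API:

```javascript
// Sketch: try each provider endpoint in order, return the first success.
// Each provider is assumed to expose an async run(request) method.
async function runWithFailover(providers, request) {
  let lastError;
  for (const provider of providers) {
    try {
      return await provider.run(request); // first healthy provider wins
    } catch (err) {
      lastError = err; // provider outage: fall through to the next one
    }
  }
  throw lastError; // every provider failed
}
```

Writing this yourself also means maintaining the provider list, credentials, and health signals yourself; routing through the gateway moves all of that off your critical path.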
We\u2019ve been hard at work on integrations between Replicate and Cloudflare, including bringing all the Replicate models onto AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. Soon, you\u2019ll be able to access the models you loved on Replicate through AI Gateway, and host the models you deployed on Replicate on Workers AI as well.<\/p>\n<p>To get started, check out our documentation for <a href=\"https:\/\/developers.cloudflare.com\/ai-gateway\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">AI Gateway<\/a> or <a href=\"https:\/\/developers.cloudflare.com\/workers-ai\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Workers AI<\/a>. Learn more about building agents on Cloudflare through the <a href=\"https:\/\/developers.cloudflare.com\/agents\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Agents SDK<\/a>.\u00a0<\/p>\n","protected":false},"excerpt":{"rendered":"AI models are changing quickly: the best model to use for agentic coding today might in three 
months&hellip;\n","protected":false},"author":2,"featured_media":534393,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[554,733,4308,86,56,54,55],"class_list":{"0":"post-534392","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-technology","12":"tag-uk","13":"tag-united-kingdom","14":"tag-unitedkingdom"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/534392","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/comments?post=534392"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/534392\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media\/534393"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media?parent=534392"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/categories?post=534392"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/tags?post=534392"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}