{"id":106290,"date":"2025-08-30T08:14:11","date_gmt":"2025-08-30T08:14:11","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/106290\/"},"modified":"2025-08-30T08:14:11","modified_gmt":"2025-08-30T08:14:11","slug":"cloudflare-is-the-best-place-to-build-realtime-voice-agents","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/106290\/","title":{"rendered":"Cloudflare is the best place to build realtime voice agents"},"content":{"rendered":"<p>The way we interact with AI is fundamentally changing. While text-based interfaces like ChatGPT have shown us what&#8217;s possible, text is only the beginning. Humans communicate not only by texting, but also by talking \u2014 we show things, we interrupt, and we clarify in real time. Voice AI brings these natural interaction patterns to our applications.<\/p>\n<p>Today, we&#8217;re excited to announce new capabilities that make it easier than ever to build real-time, voice-enabled AI applications on Cloudflare&#8217;s global network. 
These new features create a complete platform for developers building the next generation of conversational AI experiences, or can serve as building blocks for more advanced AI agents running across platforms.<\/p>\n<p>We&#8217;re launching:<\/p>\n<p>Cloudflare Realtime Agents &#8211; A runtime for orchestrating voice AI pipelines at the edge<\/p>\n<p>Pipe raw WebRTC audio as PCM in Workers &#8211; You can now connect WebRTC audio directly to your AI models or to complex media pipelines you&#8217;ve already built<\/p>\n<p>Workers AI WebSocket support &#8211; Realtime AI inference with models like PipeCat&#8217;s smart-turn-v2<\/p>\n<p>Deepgram on Workers AI &#8211; Speech-to-text and text-to-speech running in over 330 cities worldwide<\/p>\n<p>      Why realtime AI matters now<br \/>\n      <a href=\"#why-realtime-ai-matters-now\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>Today, building voice AI applications is hard. You need to coordinate multiple services such as speech-to-text, language models, and text-to-speech, while managing complex audio pipelines, handling interruptions, and keeping latency low enough for natural conversation.<\/p>\n<p>Building production voice AI requires orchestrating a complex symphony of technologies. You need low-latency speech recognition, intelligent language models that understand context and can handle interruptions, natural-sounding voice synthesis, and all of this needs to happen in under 800 milliseconds \u2014 the threshold where conversation feels natural rather than stilted. This latency budget is unforgiving. Every millisecond counts: 40ms for microphone input, 300ms for transcription, 400ms for LLM inference, 150ms for text-to-speech. 
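<\/p>\n<p>To see how tight that budget is, sum the stages (a toy sketch; the numbers are the illustrative figures above, not measurements):<\/p>

```python
# Toy latency-budget check for a conversational voice pipeline.
# Stage numbers are the illustrative figures from the text, not measurements.
BUDGET_MS = 800  # rough threshold for natural-feeling turn-taking

pipeline_ms = {
    "microphone input": 40,
    "transcription": 300,
    "LLM inference": 400,
    "text-to-speech": 150,
}

total = sum(pipeline_ms.values())
print(f"total: {total} ms, headroom: {BUDGET_MS - total} ms")  # → total: 890 ms, headroom: -90 ms
```

<p>These example figures already total 890 ms, so in practice each stage has to beat its line of the budget. 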
Any additional latency from poor infrastructure choices or distant servers transforms a delightful experience into a frustrating one.<\/p>\n<p>That&#8217;s why we&#8217;re building real-time AI tools: we want to make real-time voice AI as easy to deploy as a static website. We&#8217;re also witnessing a critical inflection point where conversational AI moves from experimental demos to production-ready systems that can scale globally. If you\u2019re already a developer in the real-time AI ecosystem, we want to build the best building blocks for you to get the lowest latency by leveraging the 330+ datacenters Cloudflare has built.<\/p>\n<p>      Introducing Cloudflare Realtime Agents<br \/>\n      <a href=\"#introducing-cloudflare-realtime-agents\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>Cloudflare Realtime Agents is a simple runtime for orchestrating voice AI pipelines that run on our global network, as close to your users as possible. Instead of managing complex infrastructure yourself, you can focus on building great conversational experiences.<\/p>\n<p>When a user connects to your voice AI application, here&#8217;s what happens:<\/p>\n<p>WebRTC connection &#8211; Audio from the user&#8217;s device is sent to the nearest Cloudflare location via WebRTC, using the Cloudflare RealtimeKit mobile or web SDKs<\/p>\n<p>AI pipeline orchestration &#8211; Your pre-configured pipeline runs: speech-to-text \u2192 LLM \u2192 text-to-speech, with support for interruption detection and turn-taking<\/p>\n<p>Custom logic &#8211; Your configured runtime options, callbacks, and tools run<\/p>\n<p>Response delivery &#8211; Generated audio streams back to the user with minimal latency<\/p>\n<p>The magic is in how we&#8217;ve designed this as composable building blocks. 
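<\/p>\n<p>In code, that connect \u2192 transcribe \u2192 generate \u2192 speak flow is just a chain of stages. Here is a minimal sketch, with plain functions standing in for real models; this is an illustration of the idea, not the Realtime Agents API:<\/p>

```python
# Hypothetical sketch of a composable voice pipeline: each stage is a plain
# function, and the pipeline is just their composition in order.
# (Illustration only -- not the Cloudflare Realtime Agents API.)

def speech_to_text(audio: bytes) -> str:
    # Stand-in for a real STT model such as Deepgram.
    return "what is the capital of france"

def generate_reply(text: str) -> str:
    # Stand-in for LLM inference.
    return f"You asked: {text!r}"

def text_to_speech(text: str) -> bytes:
    # Stand-in for a real TTS model such as ElevenLabs.
    return text.encode("utf-8")

def run_pipeline(audio: bytes, stages) -> bytes:
    """Pass the input through each stage in order."""
    value = audio
    for stage in stages:
        value = stage(value)
    return value

out = run_pipeline(b"\x00\x01", [speech_to_text, generate_reply, text_to_speech])
print(out)  # → b"You asked: 'what is the capital of france'"
```

<p>Each stage only agrees on its input and output types, which is what makes the chain rearrangeable. 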
You&#8217;re not locked into a rigid pipeline \u2014 you can configure data flows, add tee and join operations, and control exactly how your AI agent behaves.<\/p>\n<p>Take a look at the MyTextHandler function from the above diagram, for example. It\u2019s just a function that takes in text and returns text, inserted after speech-to-text and before text-to-speech:<\/p>\n<p>            class MyTextHandler extends TextComponent {<br \/>\n\tenv: Env;<\/p>\n<p>\tconstructor(env: Env) {<br \/>\n\t\tsuper();<br \/>\n\t\tthis.env = env;<br \/>\n\t}<\/p>\n<p>\tasync onTranscript(text: string) {<br \/>\n\t\tconst { response } = await this.env.AI.run(&#39;@cf\/meta\/llama-3.1-8b-instruct&#39;, {<br \/>\n\t\t\tprompt: &quot;You are a Wikipedia bot, answer the user query: &quot; + text,<br \/>\n\t\t});<br \/>\n\t\tthis.speak(response!);<br \/>\n\t}<br \/>\n}<\/p>\n<p>Your agent is a TypeScript class that extends RealtimeAgent, where you initialize a pipeline consisting of the various text-to-speech, speech-to-text, text-to-text, and even speech-to-speech transformations.<\/p>\n<p>            export class MyAgent extends RealtimeAgent {<br \/>\n\tconstructor(ctx: DurableObjectState, env: Env) {<br \/>\n\t\tsuper(ctx, env);<br \/>\n\t}<\/p>\n<p>\tasync init(agentId: string, meetingId: string, authToken: string, workerUrl: string, accountId: string, apiToken: string) {<br \/>\n\t\t\/\/ Construct your text processor for generating responses to text<br \/>\n\t\tconst textHandler = new MyTextHandler(this.env);<br \/>\n\t\t\/\/ Construct a Meeting object to join the RTK meeting<br \/>\n\t\tconst transport = new RealtimeKitTransport(meetingId, authToken, [<br \/>\n\t\t\t{<br \/>\n\t\t\t\tmedia_kind: &#39;audio&#39;,<br \/>\n\t\t\t\tstream_kind: &#39;microphone&#39;,<br \/>\n\t\t\t},<br \/>\n\t\t]);<br \/>\n\t\tconst { meeting } = transport;<\/p>\n<p>\t\t\/\/ Construct a pipeline to take in meeting audio, transcribe it using<br \/>\n\t\t\/\/ Deepgram, and pass our generated responses through ElevenLabs to<br \/>\n\t\t\/\/ be spoken in the meeting<br \/>\n\t\tawait this.initPipeline(<br \/>\n\t\t\t[transport, new DeepgramSTT(this.env.DEEPGRAM_API_KEY), textHandler, new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY), transport],<br \/>\n\t\t\tagentId,<br \/>\n\t\t\tworkerUrl,<br \/>\n\t\t\taccountId,<br \/>\n\t\t\tapiToken,<br \/>\n\t\t);<\/p>\n<p>\t\t\/\/ The RTK meeting object is accessible to us, so we can register handlers<br \/>\n\t\t\/\/ on various events like participant joins\/leaves, chat, etc.<br \/>\n\t\t\/\/ This is optional<br \/>\n\t\tmeeting.participants.joined.on(&#39;participantJoined&#39;, (participant) =&gt; {<br \/>\n\t\t\ttextHandler.speak(`Participant joined: ${participant.name}`);<br \/>\n\t\t});<br \/>\n\t\tmeeting.participants.joined.on(&#39;participantLeft&#39;, (participant) =&gt; {<br \/>\n\t\t\ttextHandler.speak(`Participant left: ${participant.name}`);<br \/>\n\t\t});<\/p>\n<p>\t\t\/\/ Make sure to actually join the meeting after registering all handlers<br \/>\n\t\tawait meeting.rtkMeeting.join();<br \/>\n\t}<\/p>\n<p>\tasync deinit() {<br \/>\n\t\t\/\/ Add any other cleanup logic required<br \/>\n\t\tawait this.deinitPipeline();<br \/>\n\t}<br \/>\n}<\/p>\n<p>View a full example in the <a href=\"https:\/\/developers.cloudflare.com\/realtime\/agents\/getting-started\/\" rel=\"nofollow noopener\" target=\"_blank\">developer docs<\/a> and get your own Realtime Agent running. 
View <a href=\"https:\/\/dash.cloudflare.com\/?to=\/:account\/realtime\/agents\" rel=\"nofollow noopener\" target=\"_blank\">Realtime Agents<\/a> on your dashboard.<\/p>\n<p>What makes Realtime Agents powerful is its flexibility:<\/p>\n<p>Many AI provider options &#8211; Use the models on Workers AI, OpenAI, Anthropic, or any provider through AI Gateway<\/p>\n<p>Multiple input\/output modes &#8211; Accept audio and\/or text and respond with audio and\/or text<\/p>\n<p>Stateful coordination &#8211; Maintain context across the conversation without managing complex state yourself<\/p>\n<p>Speed and flexibility &#8211; Use <a href=\"https:\/\/realtime.cloudflare.com\" rel=\"nofollow noopener\" target=\"_blank\">RealtimeKit<\/a> to manage WebRTC sessions and UI for faster development, or, for full control over your stack, connect directly using any standard WebRTC client or raw WebSockets<\/p>\n<p>Agents SDK integration &#8211; Integrate with the <a href=\"https:\/\/developers.cloudflare.com\/agents\/\" rel=\"nofollow noopener\" target=\"_blank\">Cloudflare Agents SDK<\/a><\/p>\n<p>During the open beta starting today, the Cloudflare Realtime Agents runtime is free to use and works with various AI models:<\/p>\n<p>Speech and Audio: Integration with platforms like ElevenLabs and Deepgram.<\/p>\n<p>LLM Inference: Flexible options to use large language models through Cloudflare Workers AI and AI Gateway, connect to third-party models like OpenAI, Gemini, Grok, and Claude, or bring your own custom models.<\/p>\n<p>      Pipe raw WebRTC audio as PCM in Workers<br \/>\n      <a href=\"#pipe-raw-webrtc-audio-as-pcm-in-workers\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>For developers who need more flexibility than Realtime Agents provides, we&#8217;re exposing the raw WebRTC audio pipeline directly to Workers.<\/p>\n<p>WebRTC audio in Workers works by leveraging Cloudflare\u2019s Realtime SFU, which converts WebRTC audio from the 
Opus codec to PCM and streams it to any WebSocket endpoint you specify. This means you can use Workers to implement:<\/p>\n<p>Live transcription &#8211; Stream audio from a video call directly to a transcription service<\/p>\n<p>Custom AI pipelines &#8211; Send audio to AI models without setting up complex infrastructure<\/p>\n<p>Recording and processing &#8211; Save, audit, or analyze audio streams in real-time<\/p>\n<p>      WebSockets vs WebRTC for voice AI<br \/>\n      <a href=\"#websockets-vs-webrtc-for-voice-ai\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>WebSockets and WebRTC can handle audio for AI services, but they work best in different situations. WebSockets are perfect for server-to-server communication and work fine when you don&#8217;t need super-fast responses, making them great for testing and experimenting. However, if you&#8217;re building an app where users need real-time conversations with low delay, WebRTC is the better choice.<\/p>\n<p>WebRTC has several advantages that make it superior for live audio streaming. It uses UDP instead of TCP, which prevents audio delays caused by lost packets holding up the entire stream (<a href=\"https:\/\/blog.cloudflare.com\/the-road-to-quic\/#head-of-line-blocking\" rel=\"nofollow noopener\" target=\"_blank\">head of line blocking<\/a> is a common topic discussed on this blog). The Opus audio codec in WebRTC automatically adjusts to network conditions and can handle packet loss gracefully. 
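<\/p>\n<p>Once the Realtime SFU has decoded Opus into PCM, the frames arriving over your WebSocket are just raw samples. Here is a minimal sketch of inspecting them, assuming 16-bit little-endian mono PCM (the exact byte format depends on your configuration):<\/p>

```python
import math
import struct

def pcm16_rms(frame: bytes) -> float:
    """RMS level of a 16-bit little-endian PCM frame, scaled to 0.0..1.0."""
    n = len(frame) // 2  # two bytes per sample
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n) / 32768.0

# Four samples at exactly half of full scale -> RMS of 0.5
frame = struct.pack("<4h", 16384, -16384, 16384, -16384)
print(pcm16_rms(frame))  # → 0.5
```

<p>A per-frame level like this is enough to gate out silence before forwarding audio to a paid transcription or inference service. 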
WebRTC also includes built-in features like echo cancellation and noise reduction that WebSockets would require you to build separately.<\/p>\n<p>With this feature, you can use WebRTC for client-to-server communication and leverage Cloudflare to convert it to familiar WebSockets for server-to-server communication and backend processing.<\/p>\n<p>      The power of Workers + WebRTC<br \/>\n      <a href=\"#the-power-of-workers-webrtc\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>When WebRTC audio is converted to WebSockets, you get PCM audio at the original sample rate, and from there, you can run any task in and out of the Cloudflare developer platform:<\/p>\n<p>Resample audio and send it to different AI providers<\/p>\n<p>Run WebAssembly-based audio processing<\/p>\n<p>Build complex applications with <a href=\"https:\/\/developers.cloudflare.com\/durable-objects\/\" rel=\"nofollow noopener\" target=\"_blank\">Durable Objects<\/a>, <a href=\"https:\/\/developers.cloudflare.com\/durable-objects\/api\/alarms\/\" rel=\"nofollow noopener\" target=\"_blank\">Alarms<\/a>, and other Workers primitives<\/p>\n<p>Deploy containerized processing pipelines with <a href=\"https:\/\/developers.cloudflare.com\/containers\/\" rel=\"nofollow noopener\" target=\"_blank\">Workers Containers<\/a><\/p>\n<p>The WebSocket works bidirectionally, so data sent back on the WebSocket becomes available as a WebRTC track on the Realtime SFU, ready to be consumed within WebRTC.<\/p>\n<p>To illustrate this setup, we\u2019ve made a simple <a href=\"https:\/\/github.com\/cloudflare\/realtime-examples\/tree\/main\/tts-ws\" rel=\"nofollow noopener\" target=\"_blank\">WebRTC application demo<\/a> that uses the ElevenLabs API for text-to-speech.<\/p>\n<p>Visit the <a href=\"https:\/\/developers.cloudflare.com\/realtime\/sfu\/\" rel=\"nofollow noopener\" target=\"_blank\">Realtime SFU developer docs<\/a> to learn how to get started.<\/p>\n<p>      
Realtime AI inference with WebSockets<br \/>\n      <a href=\"#realtime-ai-inference-with-websockets\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>WebSockets are the backbone of real-time AI pipelines because they are a low-latency, bidirectional primitive with ubiquitous support in developer tooling, especially for server-to-server communication. Although HTTP works great for many use cases like chat or batch inference, real-time voice AI needs persistent, low-latency connections when talking to AI inference servers. To support your real-time AI workloads, Workers AI now supports WebSocket connections for select models.<\/p>\n<p>      Launching with PipeCat SmartTurn V2<br \/>\n      <a href=\"#launching-with-pipecat-smartturn-v2\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>The first model with WebSocket support is PipeCat&#8217;s <a href=\"https:\/\/developers.cloudflare.com\/workers-ai\/models\/smart-turn-v2\/\" rel=\"nofollow noopener\" target=\"_blank\">smart-turn-v2<\/a> turn detection model \u2014 a critical component for natural conversation. Turn detection models determine when a speaker has finished talking and it&#8217;s appropriate for the AI to respond. 
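<\/p>\n<p>For contrast, the classic non-ML baseline is a trailing-silence heuristic: declare the turn over once audio energy stays below a threshold for a hold period. A toy sketch (frame energies and thresholds here are invented for illustration):<\/p>

```python
# Toy silence-based turn detector: the naive baseline that dedicated
# turn-detection models improve on. All numbers here are invented.

def turn_finished(frame_energies, threshold=0.02, hold_frames=25):
    """True once the last `hold_frames` frames are all below `threshold`.

    With 20 ms frames, hold_frames=25 waits for ~500 ms of trailing silence.
    """
    if len(frame_energies) < hold_frames:
        return False
    return all(e < threshold for e in frame_energies[-hold_frames:])

speech_then_silence = [0.3] * 50 + [0.01] * 25  # finished speaking
mid_sentence_pause = [0.3] * 50 + [0.01] * 10   # brief pause, still talking

print(turn_finished(speech_then_silence))  # → True
print(turn_finished(mid_sentence_pause))   # → False
```

<p>The failure mode is clear: any thoughtful pause longer than the hold period is misread as end of turn, while a shorter hold makes the agent interrupt. 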
Getting this right is the difference between an AI that constantly interrupts and one that feels natural to talk to.<\/p>\n<p>Below is an example of how to call smart-turn-v2 running on Workers AI.<\/p>\n<p>            &quot;&quot;&quot;<br \/>\nCloudflare AI WebSocket Inference &#8211; With PipeCat&#39;s smart-turn-v2<br \/>\n&quot;&quot;&quot;<\/p>\n<p>import asyncio<br \/>\nimport websockets<br \/>\nimport json<br \/>\nimport numpy as np<\/p>\n<p># Configuration<br \/>\nACCOUNT_ID = &quot;your-account-id&quot;<br \/>\nAPI_TOKEN = &quot;your-api-token&quot;<br \/>\nMODEL = &quot;@cf\/pipecat-ai\/smart-turn-v2&quot;<\/p>\n<p># WebSocket endpoint<br \/>\nWEBSOCKET_URL = f&quot;wss:\/\/api.cloudflare.com\/client\/v4\/accounts\/{ACCOUNT_ID}\/ai\/run\/{MODEL}?dtype=uint8&quot;<\/p>\n<p>async def run_inference(audio_data: bytes) -&gt; dict:<br \/>\n    async with websockets.connect(<br \/>\n        WEBSOCKET_URL,<br \/>\n        additional_headers={<br \/>\n            &quot;Authorization&quot;: f&quot;Bearer {API_TOKEN}&quot;<br \/>\n        }<br \/>\n    ) as websocket:<br \/>\n        await websocket.send(audio_data)<\/p>\n<p>        response = await websocket.recv()<br \/>\n        result = json.loads(response)<\/p>\n<p>        # Response format: {&#39;is_complete&#39;: True, &#39;probability&#39;: 0.87}<br \/>\n        return result<\/p>\n<p>def generate_test_audio():<br \/>\n    # Clip before casting so out-of-range samples don&#39;t wrap around<br \/>\n    noise = np.clip(np.random.normal(128, 20, 8192), 0, 255).astype(np.uint8)<\/p>\n<p>    return noise<\/p>\n<p>async def demonstrate_inference():<br \/>\n    # Generate test audio<br \/>\n    noise = generate_test_audio()<\/p>\n<p>    try:<br \/>\n        print(&quot;\\nTesting noise...&quot;)<br \/>\n        noise_result = await run_inference(noise.tobytes())<br \/>\n        print(f&quot;Noise result: {noise_result}&quot;)<\/p>\n<p>    except Exception as e:<br \/>\n        print(f&quot;Error: {e}&quot;)<\/p>\n<p>if __name__ 
== &quot;__main__&quot;:<br \/>\n    asyncio.run(demonstrate_inference())<\/p>\n<p>On Wednesday, we announced that Deepgram&#8217;s speech-to-text and text-to-speech models are available on Workers AI, running in Cloudflare locations worldwide. This means:<\/p>\n<p>Lower latency &#8211; Speech recognition happens at the edge, close to users, running in the same network as Workers<\/p>\n<p>Integrated processing &#8211; WebRTC audio processing without leaving the Cloudflare network<\/p>\n<p>State-of-the-art audio ML models &#8211; Powerful, capable, and fast audio models, available directly through Workers AI<\/p>\n<p>Global scale &#8211; Leverages Cloudflare\u2019s global network in 330+ cities automatically<\/p>\n<p>Deepgram is a popular choice for voice AI applications. By building your voice AI systems on the Cloudflare platform, you get access to powerful models and the lowest-latency infrastructure to give your application a natural, responsive experience.<\/p>\n<p>      Interested in other realtime AI models running on Cloudflare?<br \/>\n      <a href=\"#interested-in-other-realtime-ai-models-running-on-cloudflare\" aria-hidden=\"true\" class=\"relative sm:absolute sm:-left-5\"><\/p>\n<p>      <\/a><\/p>\n<p>If you&#8217;re developing AI models for real-time applications, we want to run them on Cloudflare&#8217;s network. Whether you have proprietary models or need ultra-low-latency inference at scale with open-source models, reach out to us.<\/p>\n<p>All of these features are available now.<\/p>\n<p>Want to pick the brains of the engineers who built this? Join them for technical deep dives, live demos, and Q&amp;A at Cloudflare Connect in Las Vegas. Explore the <a href=\"https:\/\/events.cloudflare.com\/connect\/2025\/\" rel=\"nofollow noopener\" target=\"_blank\">full schedule and register<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"The way we interact with AI is fundamentally changing. 
While text-based interfaces like ChatGPT have shown us what&#8217;s&hellip;\n","protected":false},"author":2,"featured_media":106291,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-106290","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/106290","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=106290"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/106290\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/106291"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=106290"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=106290"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=106290"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}