{"id":387868,"date":"2026-04-08T09:44:08","date_gmt":"2026-04-08T09:44:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/ie\/387868\/"},"modified":"2026-04-08T09:44:08","modified_gmt":"2026-04-08T09:44:08","slug":"stateful-continuation-for-ai-agents-why-transport-layers-now-matter","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ie\/387868\/","title":{"rendered":"Stateful Continuation for AI Agents: Why Transport Layers Now Matter"},"content":{"rendered":"<p>\t\t\t\t\t\t\t\t\tKey Takeaways<br \/>\n\t\t\t\t\t\t\t\t\t&#13;<br \/>\n\tAgent workflows make transport a first-order concern. Multi-turn, tool-heavy loops amplify overhead that is negligible in single-turn LLM use.&#13;<br \/>\n\tStateless APIs scale poorly with context. Re-sending the full history each turn drives linear payload growth and increases latency.&#13;<br \/>\n\tStateful continuation cuts overhead dramatically. Caching context server-side can reduce client-sent data by 80%+ and improve execution time by 15\u201329% .&#13;<br \/>\n\tThe benefit is architectural, not protocol-specific. Any approach that avoids retransmitting context can achieve similar gains.&#13;<br \/>\n\tPerformance comes with trade-offs. Stateful designs introduce challenges in reliability, observability, and portability that must be weighed carefully.&#13;<\/p>\n<p>\t\t\t\t\t\t\t\tThe Airplane Problem<\/p>\n<p>On a recent flight, I purchased the in-flight internet and tried to use Claude Code. The agent needed to read several files, understand the codebase structure, make edits, and run tests; a typical agentic workflow involving 10-15 tool calls. But the internet was so bad that by the third or fourth turn, the requests were timing out. Each turn was resending the entire conversation history \u2014 the original prompt, every file it had read, every edit it had proposed, every test output \u2014 and the payload had ballooned to hundreds of kilobytes. 
Over a bandwidth-constrained link, that growing payload was a bottleneck.<\/p>\n<p>This experience highlighted something that&#8217;s becoming increasingly relevant as AI coding agents mature: the transport layer matters more for agentic workflows than for simple chat. A single-turn chat completion sends a prompt and gets a response. An agentic coding session involves 10, 20, or sometimes 50+ sequential turns in which the model reads code, proposes changes, runs tests, reads error output, fixes issues, and iterates. With each turn, the conversation context grows, and over HTTP, that entire growing context must be retransmitted every time.\u00a0<\/p>\n<p>In February 2026, OpenAI introduced <a href=\"https:\/\/developers.openai.com\/api\/docs\/guides\/websocket-mode\" rel=\"nofollow noopener\" target=\"_blank\">WebSocket mode<\/a> for their responses API, which caches the conversation history in the server memory to solve this problem; I was excited to try it out and see how it performs compared to HTTP.<\/p>\n<p>The Agentic Coding Loop<\/p>\n<p>AI coding agents have moved from novelty to daily workflow for many organizations, especially since December 2025. Tools like Claude Code, OpenAI Codex, Cursor, and Cline now routinely perform multi-file edits, run test suites, and iterate on failing builds. 
<a href=\"https:\/\/fortune.com\/2026\/03\/04\/openai-codex-growth-enterprise-ai-agents\/\" rel=\"nofollow noopener\" target=\"_blank\">OpenAI reports over 1.6 million weekly active users<\/a> on Codex alone, with a typical engineer on the Codex team running 4-8 parallel agents.<\/p>\n<p>The core of these agents is the &#8220;agent loop&#8221;: a cycle of model inference and tool execution that repeats until the task is complete:<\/p>\n<p><img decoding=\"async\" alt=\"\" class=\"zoom-image\" src=\"https:\/\/www.infoq.com\/articles\/ai-agent-transport-layer\/articles\/ai-agent-transport-layer\/en\/resources\/199figure-1-1775031602139.jpg\" style=\"width: 692px; height: 734px;\" rel=\"share\"\/><\/p>\n<p style=\"text-align:center\">The coding agent loop: At every turn, the model either returns a response indicating task completion, or recommends tool calls, whose response is fed back to the model inference until the task is complete<\/p>\n<p>A single turn of the agent loop typically involves reading several files to understand the codebase, editing some files, and running tests, which involves 10-15 tool calls, often more for complex refactoring. The results of those tool calls are then sent to the LLM inference server. If the problem is solved, the LLM server returns a response with no more tool calls. Otherwise, the LLM server recommends additional tool calls, which starts the next turn of the agent loop, and this process continues until the problem is solved. Each turn requires the model to receive the full context of what&#8217;s happened so far.<\/p>\n<p>The HTTP Overhead Problem<\/p>\n<p>With HTTP-based APIs, including OpenAI&#8217;s Responses API over HTTP and the older Chat Completions API, each turn is a stateless request. 
The server doesn&#8217;t remember what happened on the previous turn, so the client must resend everything:<\/p>\n<p>&#13;<br \/>\n\tSystem instructions and tool definitions (~2 KB)&#13;<br \/>\n\tThe original user prompt&#13;<br \/>\n\tEvery prior model output (including full code blocks that the model wrote)&#13;<br \/>\n\tEvery tool call result (including file contents, command outputs)&#13;<\/p>\n<p>This means the request payload grows linearly with each turn. In our benchmarks, we measured the actual per-turn bytes sent by the client over HTTP versus WebSocket:<\/p>\n<p><img decoding=\"async\" alt=\"\" class=\"zoom-image\" src=\"https:\/\/www.infoq.com\/articles\/ai-agent-transport-layer\/articles\/ai-agent-transport-layer\/en\/resources\/148figure-2-1775031602139.jpg\" style=\"width: 1251px; height: 661px;\" rel=\"share\"\/><\/p>\n<p style=\"text-align:center\">Average bytes sent per turn across 10 task runs with gpt-4o-mini. HTTP grows linearly; WebSocket stays constant.<\/p>\n<p>By turn 9, HTTP is sending nearly 10x as much data per request as WebSocket. This is because OpenAI&#8217;s WebSocket mode for the Responses API keeps a persistent connection with server-side in-memory state. After the first turn, each subsequent turn sends only:<\/p>\n<p>&#13;<br \/>\n\tA previous_response_id referencing the cached state (~60 bytes)&#13;<br \/>\n\tThe new tool call outputs (typically 1-3 KB of file content or command output)&#13;<\/p>\n<p>The payload stays roughly constant regardless of how many turns deep you are.<\/p>\n<p>What Existing Benchmarks Show<\/p>\n<p>Before building our own test harness, we reviewed publicly available data.<\/p>\n<p>OpenAI&#8217;s claim: WebSocket mode for the Responses API is built for low-latency, long-running agents with heavy tool calls. 
For workflows with 20+ tool calls, it delivers up to 40% faster end-to-end execution by eliminating redundant context re-transmission and leveraging server-side in-memory state persistence across turns.<\/p>\n<p>Cline&#8217;s independent validation: The Cline team tested WebSocket mode with GPT-5.2-codex against their standard HTTP API integration and <a href=\"https:\/\/x.com\/cline\/status\/2026031848791630033\" rel=\"nofollow\">reported<\/a>:<\/p>\n<p>&#13;<br \/>\n\t~15% faster on simple tasks (few tool calls)&#13;<br \/>\n\t~39% faster on complex multi-file workflows (many tool calls)&#13;<br \/>\n\tBest cases hitting 50% faster&#13;<br \/>\n\tWebSocket handshake adds slight TTFT overhead on the first turn, but it amortizes fast&#13;<\/p>\n<p>The pattern: The speedup scales with workflow complexity. Simple tasks with 1-2 tool calls see minimal benefit (or even slight overhead from the WebSocket handshake). Complex tasks with 10+ tool calls see dramatic improvements because the cumulative savings from not retransmitting context compound with each turn.<\/p>\n<p>Our Benchmark: Validating the Claims<\/p>\n<p>To validate these claims with controlled measurements, we built a benchmark harness that simulates realistic agentic coding workflows against OpenAI&#8217;s Responses API. 
<a href=\"http:\/\/github.com\/anirudhmendiratta\/agentic-coding-websocket\" rel=\"nofollow noopener\" target=\"_blank\">The harness<\/a> is open source.<\/p>\n<p>Methodology<\/p>\n<p>We defined three coding tasks of varying complexity:<\/p>\n<p>&#13;<br \/>\n\tFix a failing test \u2014 Read the test file, read the component, fix the bug, run tests (~10-15 turns, 12-17 tool calls)&#13;<br \/>\n\tAdd a search feature \u2014 Read existing components, implement the feature, run tests (~5-15 turns, 4-21 tool calls)&#13;<br \/>\n\tRefactor the API layer \u2014 List the project, read files, search for callers, update multiple files, run tests (~6-11 turns, 10-20 tool calls)&#13;<\/p>\n<p>Each task uses simulated tool responses (realistic file contents, test outputs, command outputs) to isolate transport-layer differences. The model makes real API calls to OpenAI and decides which tools to call and when to stop \u2014 the non-determinism is in the model&#8217;s behavior, not the tool responses.<\/p>\n<p>Two test configurations:<\/p>\n<p>&#13;<br \/>\n\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tCell&#13;<br \/>\n\t\t\tApproach&#13;<br \/>\n\t\t\tPer-turn behavior&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\t1&#13;<br \/>\n\t\t\tHTTP Responses API&#13;<br \/>\n\t\t\tFull conversation context is re-sent every turn&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\t2&#13;<br \/>\n\t\t\tWebSocket Responses API&#13;<br \/>\n\t\t\tprevious_response_id + incremental input only&#13;<br \/>\n\t\t&#13;<br \/>\n\t&#13;<\/p>\n<p>We measured:<\/p>\n<p>&#13;<br \/>\n\tTTFT (Time to First Token): How quickly does the model start generating on each turn?&#13;<br \/>\n\tBytes sent: How much data does the client upload per task?&#13;<br \/>\n\tBytes received: How much streaming event data comes back?&#13;<br \/>\n\tTotal time: End-to-end wall-clock time for the full agentic workflow&#13;<\/p>\n<p>Each configuration was run 3 times and aggregated. 
We tested with two models \u2014 GPT-5.4 (a frontier coding model) and GPT-4o-mini (a smaller, faster model) \u2014 to see whether the transport-layer effects hold across model sizes.<\/p>\n<p>Results<\/p>\n<p>Across all runs, tasks averaged roughly 8-11 turns and 9-16 tool calls per task, varying by model and transport mode.<\/p>\n<p>Relative performance (WebSocket vs HTTP):<\/p>\n<p>&#13;<br \/>\n\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tMetric&#13;<br \/>\n\t\t\tGPT-5.4&#13;<br \/>\n\t\t\tGPT-4o-mini&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tTotal time&#13;<br \/>\n\t\t\t29% faster&#13;<br \/>\n\t\t\t15% faster&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tBytes sent&#13;<br \/>\n\t\t\t82% less&#13;<br \/>\n\t\t\t86% less&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tFirst-turn TTFT&#13;<br \/>\n\t\t\t14% lower&#13;<br \/>\n\t\t\t~same&#13;<br \/>\n\t\t&#13;<br \/>\n\t&#13;<\/p>\n<p>Detailed results for GPT-5.4:<\/p>\n<p>&#13;<br \/>\n\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tMetric&#13;<br \/>\n\t\t\tGPT-5.4 HTTP&#13;<br \/>\n\t\t\tGPT-5.4 WebSocket&#13;<br \/>\n\t\t\tDelta&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tAvg total time\/task&#13;<br \/>\n\t\t\t40.8 s&#13;<br \/>\n\t\t\t28.9 s&#13;<br \/>\n\t\t\t\u221229%&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tAvg TTFT (all turns)&#13;<br \/>\n\t\t\t1,253 ms&#13;<br \/>\n\t\t\t1,111 ms&#13;<br \/>\n\t\t\t\u221211%&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tAvg TTFT (first turn)&#13;<br \/>\n\t\t\t1,255 ms&#13;<br \/>\n\t\t\t1,075 ms&#13;<br \/>\n\t\t\t\u221214%&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tAvg bytes sent\/task&#13;<br \/>\n\t\t\t176 KB&#13;<br \/>\n\t\t\t32 KB&#13;<br \/>\n\t\t\t\u221282%&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tAvg bytes recv\/task&#13;<br \/>\n\t\t\t485 KB&#13;<br \/>\n\t\t\t343 KB&#13;<br \/>\n\t\t\t\u221229%&#13;<br \/>\n\t\t&#13;<br \/>\n\t&#13;<\/p>\n<p>Key Findings<\/p>\n<p>&#13;<br 
\/>\n\tWebSocket consistently reduces client-sent data by 80-86%. This is the most reliable finding, independent of model, API variance, or task complexity. HTTP sends 153-176 KB per task; WebSocket sends 21-32 KB. This is a direct consequence of not retransmitting the growing conversation history.&#13;<br \/>\n\tWebSocket delivers 15-29% faster end-to-end execution. With GPT-5.4, WebSocket was 29% faster \u2014 roughly consistent with Cline&#8217;s reported 39% on complex workflows. The speedup comes from a combination of less data to upload per turn and potentially faster server-side processing (no need to re-parse and tokenize the full context).&#13;<br \/>\n\tFirst-turn TTFT is similar across approaches. The WebSocket handshake doesn&#8217;t add meaningful overhead \u2014 first-turn TTFT was within noise of HTTP for both models. The advantage emerges in continuation turns, where WebSocket avoids the growing payload upload.&#13;<br \/>\n\tThe effect is model-independent. We ran the same benchmarks with GPT-4o-mini (detailed results in the <a href=\"https:\/\/github.com\/anirudhmendiratta\/agentic-coding-websocket\" rel=\"nofollow noopener\" target=\"_blank\">repo<\/a>) and saw consistent bytes-sent savings (86%) and 15% faster end-to-end execution. 
The time savings were larger for GPT-5.4 (29% vs 15%), likely because the frontier model generates longer responses that accumulate more context per turn.&#13;<\/p>\n<p>Why It&#8217;s Faster: The Architecture<\/p>\n<p>The performance difference is a direct consequence of eliminating redundant data transmission.<\/p>\n<p>HTTP: Stateless by Design<\/p>\n<p>&#13;<br \/>\nTurn 1: Client \u2192 [system + prompt + tools] \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0\u2192 Server&#13;<br \/>\nTurn 2: Client \u2192 [system + prompt + tools + turn1 + output1] \u00a0\u2192 Server&#13;<br \/>\nTurn 3: Client \u2192 [all of the above + turn2 + output2] \u00a0 \u00a0 \u00a0 \u00a0 \u2192 Server&#13;<br \/>\n&#8230;&#13;<br \/>\nTurn N: Client \u2192 [system + prompt + tools + ALL prior turns] \u00a0 \u2192 Server<\/p>\n<p>Each request is independent. The server processes it, returns a response, and forgets everything. The client must reconstruct the full context from scratch.<\/p>\n<p>WebSocket: Stateful Continuation<\/p>\n<p>&#13;<br \/>\nTurn 1: Client \u2192 [system + prompt + tools] \u00a0 \u00a0 \u00a0\u2192 Server \u00a0(server caches response)&#13;<br \/>\nTurn 2: Client \u2192 [prev_id + tool_output] \u00a0 \u00a0 \u00a0 \u00a0 \u2192 Server \u00a0(server loads from cache)&#13;<br \/>\nTurn 3: Client \u2192 [prev_id + tool_output] \u00a0 \u00a0 \u00a0 \u00a0 \u2192 Server \u00a0(server loads from cache)&#13;<br \/>\n&#8230;&#13;<br \/>\nTurn N: Client \u2192 [prev_id + tool_output] \u00a0 \u00a0 \u00a0 \u00a0 \u2192 Server \u00a0(constant-size payload)<\/p>\n<p>The server keeps the most recent response in connection-local memory. 
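In client terms, a continuation turn carries only the response reference plus the new tool outputs. A minimal sketch of frame construction (field names modeled on the Responses API; an illustration, not the official SDK):

```python
import json

def continuation_frame(previous_response_id, tool_outputs):
    # Only the reference + new tool results travel over the wire;
    # the server rehydrates everything else from its in-memory cache.
    return json.dumps({
        "previous_response_id": previous_response_id,
        "input": [
            {"type": "function_call_output", "call_id": call_id, "output": output}
            for call_id, output in tool_outputs
        ],
    })

frame = continuation_frame("resp_abc123", [("call_1", "tests passed")])
# len(frame) stays small and roughly constant, no matter how deep the turn.
```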
Continuations reference that cached state, so the client only sends what&#8217;s new.<\/p>\n<p>The Bandwidth Math: From Our Benchmarks<\/p>\n<p>Using our actual GPT-5.4 data for a typical 10-turn coding task:<\/p>\n<p>HTTP total bytes sent (client \u2192 server): 176 KB per task (measured average)<\/p>\n<p>&#13;<br \/>\n\tGrows from 2 KB on turn 0 to 38 KB on turn 9 as context accumulates&#13;<\/p>\n<p>WebSocket total bytes sent: 32 KB per task (measured average)<\/p>\n<p>&#13;<br \/>\n\tStays flat at 2-4 KB per turn throughout&#13;<\/p>\n<p>That&#8217;s an 82% reduction in client-sent bytes \u2014 144 KB saved per task, compounding across thousands of concurrent sessions.<\/p>\n<p>Architectural Lessons<\/p>\n<p>1. API Compatibility vs Performance: The Protocol Tax<\/p>\n<p>The OpenAI-compatible HTTP API (both the \/chat\/completions and Responses API) is the de facto standard. Every LLM tool, SDK, and orchestration framework speaks it. But this compatibility comes at a cost: the API is inherently stateless, requiring full context to be retransmitted on every request.<\/p>\n<p>WebSocket mode breaks this compatibility, causing fragmentation.<\/p>\n<p>Who supports WebSocket today?<\/p>\n<p>&#13;<br \/>\n\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tProvider \/ Gateway&#13;<br \/>\n\t\t\tWebSocket API&#13;<br \/>\n\t\t\tStreaming method&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tOpenAI Responses API&#13;<br \/>\n\t\t\t\u2705 (since Feb 2026)&#13;<br \/>\n\t\t\tWebSocket frames (JSON)&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tGoogle Gemini API&#13;<br \/>\n\t\t\t&#13;<\/p>\n<p>\u26d4 (text\/coding)<br \/>&#13;<br \/>\n\t\t\t\u2705 (audio\/video)<\/p>\n<p>&#13;<br \/>\n\t\t\t&#13;<br \/>\n\t\t\tWebSocket frames&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tAnthropic Claude API&#13;<br \/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tServer-Sent Events (SSE)&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tOpenRouter&#13;<br 
\/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tSSE (OpenAI-compatible)&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tCloudflare AI Gateway&#13;<br \/>\n\t\t\t\u2705 (gateway layer)&#13;<br \/>\n\t\t\tWebSocket frames&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tLocal models (Ollama, vLLM)&#13;<br \/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tSSE&#13;<br \/>\n\t\t&#13;<br \/>\n\t&#13;<\/p>\n<p>Who supports WebSocket among coding agents?<\/p>\n<p>&#13;<br \/>\n\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tCoding Agent&#13;<br \/>\n\t\t\tWebSocket support&#13;<br \/>\n\t\t\tNotes&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tOpenAI Codex&#13;<br \/>\n\t\t\t\u2705 (native)&#13;<br \/>\n\t\t\tBuilt on the Responses API&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tCline&#13;<br \/>\n\t\t\t\u2705 (OpenAI only)&#13;<br \/>\n\t\t\tFirst to integrate, reported 39% speedup&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tClaude Code&#13;<br \/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tUses Anthropic SSE API&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tCursor&#13;<br \/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tHTTP-based multi-provider&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tWindsurf&#13;<br \/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tHTTP-based multi-provider&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tRoo Code&#13;<br \/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tCline fork, may inherit support&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tOpenCode&#13;<br \/>\n\t\t\t\u26d4&#13;<br \/>\n\t\t\tMulti-provider, HTTP-based&#13;<br \/>\n\t\t&#13;<br \/>\n\t&#13;<\/p>\n<p>WebSocket is currently an OpenAI-only advantage. 
If your agent needs to switch between providers (say, using Claude for reasoning-heavy tasks and GPT for speed), you would lose the WebSocket performance benefit on every non-OpenAI call.<\/p>\n<p>Google&#8217;s Gemini Live API uses WebSocket, but it&#8217;s designed for real-time audio\/video streaming rather than text-based agentic workflows. Cloudflare&#8217;s AI Gateway offers a WebSocket endpoint that sits in front of multiple providers, but it proxies to HTTP under the hood and doesn&#8217;t provide the server-side state caching that makes OpenAI&#8217;s implementation fast.<\/p>\n<p>2. Protocol Overhead at Scale: When Bytes Per Turn Matter<\/p>\n<p>For a single conversation, the overhead of resending context is negligible. But from the server&#8217;s perspective, the scale of agentic coding in 2026 makes this significant.<\/p>\n<p>Estimating concurrent sessions for a single major provider: OpenAI Codex has 1.6 million weekly active users. GitHub Copilot has 4.7 million paid subscribers. Claude Code is generating $2.5 billion in annualized revenue, suggesting over 1 million active developers. Cline, Cursor, Windsurf, Roo Code, and OpenCode add millions more. Conservatively, 5-10 million developers are actively using AI coding agents weekly. For a single major provider like OpenAI, assuming 10-20% of its users are active during a peak hour with overlapping sessions, we estimate roughly 1 million concurrent agentic coding sessions at peak.<\/p>\n<p>At that scale, using our measured per-task data:<\/p>\n<p>HTTP: 1,000,000 sessions \u00d7 176 KB sent per task = 176 GB of client-to-server payload per 40-second task<\/p>\n<p>WebSocket: 1,000,000 sessions \u00d7 32 KB sent per task = 32 GB of client-to-server payload per 40-second task<\/p>\n<p>That&#8217;s a 144 GB reduction in ingress traffic over a 40-second task, i.e., a 29 Gbps reduction. 
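The arithmetic behind those figures, using the measured averages above (decimal KB and GB assumed):

```python
SESSIONS = 1_000_000        # estimated concurrent agentic sessions at peak
HTTP_KB, WS_KB = 176, 32    # measured client-sent bytes per task, in KB
TASK_SECONDS = 40           # approximate wall-clock time per task

saved_bytes = (HTTP_KB - WS_KB) * 1_000 * SESSIONS   # ingress bytes saved per task wave
gbps = saved_bytes * 8 / TASK_SECONDS / 1e9          # sustained ingress bandwidth saved

print(f"{saved_bytes / 1e9:.0f} GB saved, {gbps:.1f} Gbps")  # 144 GB saved, 28.8 Gbps
```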
For a provider processing millions of requests, this reduces load on API gateways, tokenizers (which must re-tokenize the full context on each HTTP request), and network infrastructure. The server-side savings are arguably more important than the client-side savings: less data to receive, parse, and tokenize means faster time-to-first-token for everyone.<\/p>\n<p>3. Server-Side State: The Real Innovation<\/p>\n<p>The key insight is that WebSocket isn&#8217;t faster because of the protocol \u2014 TCP-based WebSocket has similar framing overhead to HTTP\/2. The speed comes from server-side state management: the WebSocket server stores the most recent response in connection-local volatile memory, enabling near-instant continuation without re-tokenizing the full conversation.<\/p>\n<p>This has architectural implications:<\/p>\n<p>&#13;<br \/>\n\tState is ephemeral: It lives only in memory on the specific server handling your connection. If the connection drops, the state is lost (unless store=true).&#13;<br \/>\n\tNo multiplexing: Each WebSocket connection handles one response at a time. For parallel agent invocations, you need multiple connections.&#13;<br \/>\n\t60-minute limit: Connections auto-terminate after one hour, so sessions that run longer need reconnection logic.&#13;<\/p>\n<p>For architects designing similar systems, the pattern is clear: if your protocol involves many sequential requests that build on prior context, keeping that context server-side (even if only in volatile memory) can dramatically reduce per-request overhead.<\/p>\n<p>4. 
The Statefulness Spectrum<\/p>\n<p>Different approaches to the context accumulation problem offer different trade-offs:<\/p>\n<p>&#13;<br \/>\n\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tApproach&#13;<br \/>\n\t\t\tState Location&#13;<br \/>\n\t\t\tDurability&#13;<br \/>\n\t\t\tLatency&#13;<br \/>\n\t\t\tBandwidth&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tHTTP (stateless)&#13;<br \/>\n\t\t\tClient only&#13;<br \/>\n\t\t\tN\/A&#13;<br \/>\n\t\t\tHigh (grows with context)&#13;<br \/>\n\t\t\tHigh (grows with context)&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tHTTP + store=true&#13;<br \/>\n\t\t\tServer (persisted)&#13;<br \/>\n\t\t\tDurable&#13;<br \/>\n\t\t\tMedium (server rehydrates from persistent store)&#13;<br \/>\n\t\t\tLow (incremental input)&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tWebSocket + store=false&#13;<br \/>\n\t\t\tServer (in-memory)&#13;<br \/>\n\t\t\tVolatile&#13;<br \/>\n\t\t\tLow (no rehydration)&#13;<br \/>\n\t\t\tLow (incremental input)&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t&#13;<br \/>\n\t\t\tWebSocket + store=true&#13;<br \/>\n\t\t\tServer (in-memory + persisted)&#13;<br \/>\n\t\t\tDurable&#13;<br \/>\n\t\t\tLow (no rehydration in happy case)&#13;<br \/>\n\t\t\tLow (incremental input)&#13;<br \/>\n\t\t&#13;<br \/>\n\t&#13;<\/p>\n<p>The sweet spot for most agentic workflows is WebSocket + store=false: you get the fastest continuations, your data isn&#8217;t persisted on the provider&#8217;s servers (important for enterprise compliance with Zero Data Retention policies), and if the connection drops, you restart the task from scratch rather than trying to recover mid-stream.<\/p>\n<p>5. Parallel Execution: Multiple Connections, Not Multiplexing<\/p>\n<p>Each WebSocket connection handles one response at a time \u2014 there&#8217;s no multiplexing. For parallel tasks (e.g., running 4-8 agents simultaneously, as a typical Codex engineer does), you need separate WebSocket connections. 
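A simple way to schedule parallel agents over dedicated connections is a small pool. The sketch below uses placeholder strings where a real client would hold open WebSocket sessions; everything here is illustrative:

```python
import asyncio

class ConnectionPool:
    """One dedicated (simulated) connection per in-flight agent task.
    A real client would open and hold a WebSocket session per slot."""
    def __init__(self, size):
        self._free = asyncio.Queue()
        for i in range(size):
            self._free.put_nowait(f"conn-{i}")   # placeholder for a live socket

    async def run(self, task_fn):
        conn = await self._free.get()            # wait until a connection is free
        try:
            return await task_fn(conn)
        finally:
            self._free.put_nowait(conn)          # return the connection to the pool

async def agent_task(conn):
    await asyncio.sleep(0)                       # stand-in for a multi-turn agent loop
    return conn

async def main():
    pool = ConnectionPool(size=4)                # e.g. 4 parallel agents
    return await asyncio.gather(*(pool.run(agent_task) for _ in range(8)))

results = asyncio.run(main())
```

Because each connection serves one response at a time, the pool size, not the number of queued tasks, bounds true parallelism.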
The bandwidth savings from WebSocket still apply per-connection, but concurrent connections may hit API rate limits more aggressively than concurrent HTTP requests due to faster execution times.<\/p>\n<p>When HTTP Is Still the Right Choice<\/p>\n<p>WebSocket mode isn&#8217;t universally better. Use HTTP for:<\/p>\n<p>&#13;<br \/>\n\tSimple, few-turn interactions: For 1-2 turn interactions, the context retransmission overhead is negligible and doesn&#8217;t justify the added complexity.&#13;<br \/>\n\tMulti-provider support: If you need to switch among OpenAI, Anthropic, Google, and local models, the standard HTTP API is the common denominator. WebSocket mode is currently OpenAI-specific. Adopting it creates provider lock-in.&#13;<br \/>\n\tStateless infrastructure: If your backend runs on serverless functions (Lambda, Cloud Functions) that can&#8217;t maintain persistent connections, HTTP is your only option.&#13;<br \/>\n\tDebugging and observability: HTTP requests are easier to log, replay, and debug with standard tools. WebSocket streams require specialized tooling.&#13;<\/p>\n<p>Conclusion<\/p>\n<p>For agentic coding workflows, the move from stateless HTTP to stateful WebSocket connections delivers meaningful performance improvements: 29% faster end-to-end execution, 82% less client-side data sent, and 11% lower TTFT with GPT-5.4 as validated by our controlled benchmarks against the OpenAI Responses API.<\/p>\n<p>But the WebSocket advantage comes with a trade-off: it&#8217;s currently OpenAI-specific, creating provider lock-in in an ecosystem where developers increasingly want to switch between models. None of the major alternatives \u2014 Anthropic&#8217;s Claude API, Google Gemini, OpenRouter, or local model servers \u2014 offer equivalent WebSocket support for text-based agentic workflows.<\/p>\n<p>The takeaway for architects building agentic systems isn&#8217;t to blindly adopt WebSocket. 
It&#8217;s to recognize that as AI workflows shift from single-turn to multi-turn, the transport-layer decisions that were irrelevant for chatbots become material for agents. Any system that avoids retransmitting growing conversation context \u2014 whether through WebSocket, server-side session caching, or a custom stateful protocol \u2014 will see similar wins. The question is whether the industry converges on a standard for stateful LLM continuation, or whether this remains a provider-specific competitive advantage.<\/p>\n<p>The benchmarking harness and all results are <a href=\"http:\/\/github.com\/anirudhmendiratta\/agentic-coding-websocket\" rel=\"nofollow noopener\" target=\"_blank\">available here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"Key Takeaways &#13; Agent workflows make transport a first-order concern. Multi-turn, tool-heavy loops amplify overhead that is negligible&hellip;\n","protected":false},"author":2,"featured_media":387869,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[220,172323,20090,1287,20985,1122,11764,218,14744,8853,128809,61,60,9416,1682,20987,80,172324,172325],"class_list":{"0":"post-387868","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-technology","8":"tag-ai","9":"tag-ai-agent-transport-layer","10":"tag-ai-coding","11":"tag-ai-development","12":"tag-ai-assisted-coding","13":"tag-architecture","14":"tag-architecture-design","15":"tag-artificial-intelligence","16":"tag-code-generation","17":"tag-development","18":"tag-http","19":"tag-ie","20":"tag-ireland","21":"tag-ml-data-engineering","22":"tag-openai","23":"tag-software-development","24":"tag-technology","25":"tag-w3c","26":"tag-websocket"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/387868","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-
json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/comments?post=387868"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/387868\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media\/387869"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media?parent=387868"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/categories?post=387868"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/tags?post=387868"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}