{"id":161141,"date":"2025-11-26T18:55:08","date_gmt":"2025-11-26T18:55:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/ie\/161141\/"},"modified":"2025-11-26T18:55:08","modified_gmt":"2025-11-26T18:55:08","slug":"effective-harnesses-for-long-running-agents-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ie\/161141\/","title":{"rendered":"Effective harnesses for long-running agents \\ Anthropic"},"content":{"rendered":"<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">As AI agents become more capable, developers are increasingly asking them to take on complex tasks requiring work that spans hours, or even days. However, getting agents to make consistent progress across multiple context windows remains an open problem.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before. Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. Because context windows are limited, and because most complex projects cannot be completed within a single window, agents need a way to bridge the gap between coding sessions.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">We developed a two-fold solution to enable the <a href=\"https:\/\/platform.claude.com\/docs\/en\/agent-sdk\/overview\" rel=\"nofollow noopener\" target=\"_blank\">Claude Agent SDK<\/a> to work effectively across many context windows: an initializer agent that sets up the environment on the first run, and a coding agent that is tasked with making incremental progress in every session, while leaving clear artifacts for the next session. 
You can find code examples in the accompanying <a href=\"https:\/\/github.com\/anthropics\/claude-quickstarts\/tree\/main\/autonomous-coding\" rel=\"nofollow noopener\" target=\"_blank\">quickstart.<\/a><\/p>\n<p>The long-running agent problem<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">The Claude Agent SDK is a powerful, general-purpose agent harness adept at coding, as well as other tasks that require the model to use tools to gather context, plan, and execute. It has context management capabilities such as compaction, which enables an agent to work on a task without exhausting the context window. Theoretically, given this setup, it should be possible for an agent to continue to do useful work for an arbitrarily long time.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">However, compaction isn\u2019t sufficient. Out of the box, even a frontier coding model like Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows will fall short of building a production-quality web app if it\u2019s only given a high-level prompt, such as \u201cbuild a clone of <a href=\"http:\/\/claude.ai\/redirect\/website.v1.f83cb452-0a86-4398-842a-c90031ba5548\" rel=\"nofollow noopener\" target=\"_blank\">claude.ai<\/a>.\u201d<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Claude\u2019s failures manifested in two patterns. First, the agent tended to try to do too much at once\u2014essentially to attempt to one-shot the app. Often, this led to the model running out of context in the middle of its implementation, leaving the next session to start with a feature half-implemented and undocumented. The agent would then have to guess at what had happened, and spend substantial time trying to get the basic app working again. 
This happens even with compaction, which doesn\u2019t always pass perfectly clear instructions to the next agent.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">A second failure mode would often occur later in a project. After some features had already been built, a later agent instance would look around, see that progress had been made, and declare the job done.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">This decomposes the problem into two parts. First, we need to set up an initial environment that lays the foundation for all the features that a given prompt requires, which sets up the agent to work step-by-step and feature-by-feature. Second, we should prompt each agent to make incremental progress towards its goal while also leaving the environment in a clean state at the end of a session. By \u201cclean state\u201d we mean the kind of code that would be appropriate for merging to a main branch: there are no major bugs, the code is orderly and well-documented, and in general, a developer could easily begin work on a new feature without first having to clean up an unrelated mess.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">When experimenting internally, we addressed these problems using a two-part solution:<\/p>\n<ul>\n<li>Initializer agent: The very first agent session uses a specialized prompt that asks the model to set up the initial environment: an init.sh script, a claude-progress.txt file that keeps a log of what agents have done, and an initial git commit that shows what files were added.<\/li>\n<li>Coding agent: Every subsequent session asks the model to make incremental progress, then leave structured updates.<sup>1<\/sup><\/li>\n<\/ul>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">The key insight here was finding a way for agents to quickly understand the state of work when starting with a fresh context window, which is accomplished with the claude-progress.txt file alongside the git history. 
Inspiration for these practices came from knowing what effective software engineers do every day.<\/p>\n<p>Environment management <\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">In the updated <a href=\"https:\/\/docs.claude.com\/en\/docs\/build-with-claude\/prompt-engineering\/claude-4-best-practices#multi-context-window-workflows\" rel=\"nofollow noopener\" target=\"_blank\">Claude 4 prompting guide<\/a>, we shared some best practices for multi-context window workflows, including a harness structure that uses \u201ca different prompt for the very first context window.\u201d This \u201cdifferent prompt\u201d requests that the initializer agent set up the environment with all the necessary context that future coding agents will need to work effectively. Here, we provide a deeper dive on some of the key components of such an environment.<\/p>\n<p>Feature list<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">To address the problem of the agent one-shotting an app or prematurely considering the project complete, we prompted the initializer agent to write a comprehensive file of feature requirements expanding on the user\u2019s initial prompt. 
In the <a href=\"http:\/\/claude.ai\/redirect\/website.v1.f83cb452-0a86-4398-842a-c90031ba5548\" rel=\"nofollow noopener\" target=\"_blank\">claude.ai<\/a> clone example, this meant over 200 features, such as \u201ca user can open a new chat, type in a query, press enter, and see an AI response.\u201d These features were all initially marked as \u201cfailing\u201d so that later coding agents would have a clear outline of what full functionality looked like.<\/p>\n<p>{<br \/>\n    &quot;category&quot;: &quot;functional&quot;,<br \/>\n    &quot;description&quot;: &quot;New chat button creates a fresh conversation&quot;,<br \/>\n    &quot;steps&quot;: [<br \/>\n      &quot;Navigate to main interface&quot;,<br \/>\n      &quot;Click the &#39;New Chat&#39; button&quot;,<br \/>\n      &quot;Verify a new conversation is created&quot;,<br \/>\n      &quot;Check that chat area shows welcome state&quot;,<br \/>\n      &quot;Verify conversation appears in sidebar&quot;<br \/>\n    ],<br \/>\n    &quot;passes&quot;: false<br \/>\n  }<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">We prompt coding agents to edit this file only by changing the status of a passes field, and we use strongly-worded instructions like \u201cIt is unacceptable to remove or edit tests because this could lead to missing or buggy functionality.\u201d After some experimentation, we landed on using JSON for this, as the model is less likely to inappropriately change or overwrite JSON files compared to Markdown files.<\/p>\n<p>Incremental progress<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Given this initial environment scaffolding, the next iteration of the coding agent was then asked to work on only one feature at a time. 
This incremental approach turned out to be critical to addressing the agent\u2019s tendency to do too much at once.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Once working incrementally, it\u2019s still essential that the model leaves the environment in a clean state after making a code change. In our experiments, we found that the best way to elicit this behavior was to ask the model to commit its progress to git with descriptive commit messages and to write summaries of its progress in a progress file. This allowed the model to use git to revert bad code changes and recover working states of the code base.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">These approaches also increased efficiency, as they eliminated the need for an agent to have to guess at what had happened and spend its time trying to get the basic app working again.<\/p>\n<p>Testing<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">One final major failure mode that we observed was Claude\u2019s tendency to mark a feature as complete without proper testing. Absent explicit prompting, Claude tended to make code changes, and even do testing with unit tests or curl commands against a development server, but would fail to recognize that the feature didn\u2019t work end-to-end.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">In the case of building a web app, Claude mostly did well at verifying features end-to-end once explicitly prompted to use browser automation tools and do all testing as a human user would. <\/p>\n<p><img alt=\" Screenshots taken by Claude through the Puppeteer MCP server as it tested the claude.ai clone. 
\" loading=\"lazy\" width=\"1920\" height=\"1080\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2025\/11\/1764183308_940_image\"\/>Screenshots taken by Claude through the Puppeteer MCP server as it tested the claude.ai clone. <\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Providing Claude with these kinds of testing tools dramatically improved performance, as the agent was able to identify and fix bugs that weren\u2019t obvious from the code alone.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Some issues remain, like limitations to Claude\u2019s vision and to browser automation tools making it difficult to identify every kind of bug. For example, Claude can\u2019t see browser-native alert modals through the Puppeteer MCP, and features relying on these modals tended to be buggier as a result.<\/p>\n<p>Getting up to speed<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">With all of the above in place, every coding agent is prompted to run through a series of steps to get its bearings, some quite basic but still helpful:<\/p>\n<ol>\n<li>Run pwd to see the directory you\u2019re working in. You\u2019ll only be able to edit files in this directory.<\/li>\n<li>Read the git logs and progress files to get up to speed on what was recently worked on.<\/li>\n<li>Read the features list file and choose the highest-priority feature that\u2019s not yet done to work on.<\/li>\n<\/ol>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">This approach saves Claude some tokens in every session since it doesn\u2019t have to figure out how to test the code. 
It also helps to ask the initializer agent to write an init.sh script that can run the development server, and then run through a basic end-to-end test before implementing a new feature.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">In the case of the claude.ai clone, this meant that the agent always started the local development server and used the Puppeteer MCP to start a new chat, send a message, and receive a response. This ensured that Claude could quickly identify if the app had been left in a broken state, and immediately fix any existing bugs. If the agent had instead started implementing a new feature, it would likely make the problem worse.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Given all this, a typical session starts off with the following assistant messages:<\/p>\n<p>[Assistant] I&#8217;ll start by getting my bearings and understanding the current state of the project.<br \/>\n[Tool Use]<br \/>\n[Tool Use]<br \/>\n[Tool Use]<br \/>\n[Assistant] Let me check the git log to see recent work.<br \/>\n[Tool Use]<br \/>\n[Assistant] Now let me check if there&#8217;s an init.sh script to restart the servers.<\/p>\n<p>[Assistant] Excellent! Now let me navigate to the application and verify that some fundamental features are still working.<\/p>\n<p>[Assistant] Based on my verification testing, I can see that the fundamental functionality is working well. The core chat features, theme switching, conversation loading, and error handling are all functioning correctly. 
Now let me review the tests.json file more comprehensively to understand what needs to be implemented next.<\/p>\n<p class=\"headline-6\">Agent failure modes and solutions <\/p>\n<table>\n<thead>\n<tr><th>Problem<\/th><th>Initializer Agent Behavior<\/th><th>Coding Agent Behavior<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Claude declares victory on the entire project too early.<\/td><td>Set up a feature list file: based on the input spec, set up a structured JSON file with a list of end-to-end feature descriptions.<\/td><td>Read the feature list file at the beginning of a session. Choose a single feature to start working on.<\/td><\/tr>\n<tr><td>Claude leaves the environment in a state with bugs or undocumented progress.<\/td><td>An initial git repo and progress notes file is written.<\/td><td>Start the session by reading the progress notes file and git commit logs, and run a basic test on the development server to catch any undocumented bugs. End the session by writing a git commit and progress update.<\/td><\/tr>\n<tr><td>Claude marks features as done prematurely.<\/td><td>Set up a feature list file.<\/td><td>Self-verify all features. Only mark features as \u201cpassing\u201d after careful testing.<\/td><\/tr>\n<tr><td>Claude has to spend time figuring out how to run the app.<\/td><td>Write an init.sh script that can run the development server.<\/td><td>Start the session by reading init.sh.<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Summarizing four common failure modes and solutions in long-running AI agents.<\/p>\n<p>Future work<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">This research demonstrates one possible set of solutions in a long-running agent harness to enable the model to make incremental progress across many context windows. However, there remain open questions.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Most notably, it\u2019s still unclear whether a single, general-purpose coding agent performs best across contexts, or if better performance can be achieved through a multi-agent architecture. 
It seems reasonable that specialized agents, such as a testing agent, a quality assurance agent, or a code cleanup agent, could do an even better job at sub-tasks across the software development lifecycle.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Additionally, this demo is optimized for full-stack web app development. A future direction is to generalize these findings to other fields. It\u2019s likely that some or all of these lessons can be applied to the types of long-running agentic tasks required in, for example, scientific research or financial modeling.<\/p>\n<p>Acknowledgements<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">Written by Justin Young. Special thanks to David Hershey, Prithvi Rajasakeran, Jeremy Hadfield, Naia Bouscal, Michael Tingley, Jesse Mu, Jake Eaton, Marius Buleandara, Maggie Vo, Pedram Navid, Nadine Yasser, and Alex Notov for their contributions.<\/p>\n<p class=\"Body_reading-column__t7kGM body-2 serif post-text\">This work reflects the collective efforts of several teams across Anthropic who made it possible for Claude to safely do long-horizon autonomous software engineering, especially the code RL &amp; Claude Code teams. Interested candidates who would like to contribute are welcome to apply at <a href=\"http:\/\/anthropic.com\/careers\" rel=\"nofollow noopener\" target=\"_blank\">anthropic.com\/careers<\/a>.<\/p>\n<p>Footnotes<\/p>\n<p class=\"Body_reading-column__t7kGM body-3 serif post-footnote Body_footnote__8rtfr\">1. We refer to these as separate agents in this context only because they have different initial user prompts. 
The system prompt, set of tools, and overall agent harness were otherwise identical.<\/p>\n","protected":false},"excerpt":{"rendered":"As AI agents become more capable, developers are increasingly asking them to take on complex tasks requiring work&hellip;\n","protected":false},"author":2,"featured_media":161142,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[220,218,219,61,60,80],"class_list":{"0":"post-161141","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-ie","12":"tag-ireland","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/161141","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/comments?post=161141"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/161141\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media\/161142"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media?parent=161141"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/categories?post=161141"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/tags?post=161141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}