{"id":138633,"date":"2025-09-12T21:34:16","date_gmt":"2025-09-12T21:34:16","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/138633\/"},"modified":"2025-09-12T21:34:16","modified_gmt":"2025-09-12T21:34:16","slug":"writing-effective-tools-for-ai-agents-using-ai-agents-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/138633\/","title":{"rendered":"Writing effective tools for AI agents\u2014using AI agents \\ Anthropic"},"content":{"rendered":"<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">The <a href=\"https:\/\/modelcontextprotocol.io\/docs\/getting-started\/intro\" rel=\"nofollow noopener\" target=\"_blank\">Model Context Protocol (MCP)<\/a> can empower LLM agents with potentially hundreds of tools to solve real-world tasks. But how do we make those tools maximally effective?<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In this post, we describe our most effective techniques for improving performance in a variety of agentic AI systems1.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We begin by covering how you can:<\/p>\n<p>Build and test prototypes of your tools<br \/>Create and run comprehensive evaluations of your tools with agents<br \/>Collaborate with agents like Claude Code to automatically increase the performance of your tools<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We conclude with key principles for writing high-quality tools we\u2019ve identified along the way:<\/p>\n<p>Choosing the right tools to implement (and not to implement)<br \/>Namespacing tools to define clear boundaries in functionality<br \/>Returning meaningful context from tools back to agents<br \/>Optimizing tool responses for token efficiency<br \/>Prompt-engineering tool descriptions and specs<br \/><img alt=\"This is an image depicting how an engineer might use Claude Code to evaluate the efficacy of agentic tools.\" loading=\"lazy\" width=\"1920\" height=\"1080\" decoding=\"async\" data-nimg=\"1\" 
style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712853_62_image\"\/>Building an evaluation allows you to systematically measure the performance of your tools. You can use Claude Code to automatically optimize your tools against this evaluation.<\/p>\n<p>What is a tool?<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In computing, deterministic systems produce the same output every time given identical inputs, while non-deterministic systems\u2014like agents\u2014can generate varied responses even with the same starting conditions.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">When we traditionally write software, we\u2019re establishing a contract between deterministic systems. For instance, a function call like getWeather(\"NYC\") will always fetch the weather in New York City in the exact same manner every time it is called.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Tools are a new kind of software that reflects a contract between deterministic systems and non-deterministic agents. When a user asks \u201cShould I bring an umbrella today?\u201d an agent might call the weather tool, answer from general knowledge, or even ask a clarifying question about location first. 
Occasionally, an agent might hallucinate or even fail to grasp how to use a tool.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">This means fundamentally rethinking our approach when writing software for agents: instead of writing tools and <a href=\"https:\/\/modelcontextprotocol.io\/\" rel=\"nofollow noopener\" target=\"_blank\">MCP servers<\/a> the way we\u2019d write functions and APIs for other developers or systems, we need to design them for agents.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Our goal is to increase the surface area over which agents can be effective in solving a wide range of tasks by using tools to pursue a variety of successful strategies. Fortunately, in our experience, the tools that are most \u201cergonomic\u201d for agents also end up being surprisingly intuitive to grasp as humans.<\/p>\n<p>How to write tools<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In this section, we describe how you can collaborate with agents both to write and to improve the tools you give them. Start by standing up a quick prototype of your tools and testing them locally. Next, run a comprehensive evaluation to measure subsequent changes. Working alongside agents, you can repeat the process of evaluating and improving your tools until your agents achieve strong performance on real-world tasks.<\/p>\n<p>Building a prototype<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">It can be difficult to anticipate which tools agents will find ergonomic and which tools they won\u2019t without getting hands-on yourself. Start by standing up a quick prototype of your tools. 
If you\u2019re using <a href=\"https:\/\/www.anthropic.com\/claude-code\" rel=\"nofollow noopener\" target=\"_blank\">Claude Code<\/a> to write your tools (potentially in one-shot), it helps to give Claude documentation for any software libraries, APIs, or SDKs (including potentially the <a href=\"https:\/\/modelcontextprotocol.io\/docs\/sdk\" rel=\"nofollow noopener\" target=\"_blank\">MCP SDK<\/a>) your tools will rely on. LLM-friendly documentation can commonly be found in flat llms.txt files on official documentation sites (here\u2019s our <a href=\"https:\/\/docs.anthropic.com\/llms.txt\" rel=\"nofollow noopener\" target=\"_blank\">API\u2019s<\/a>).<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Wrapping your tools in a <a href=\"https:\/\/modelcontextprotocol.io\/docs\/develop\/connect-local-servers\" rel=\"nofollow noopener\" target=\"_blank\">local MCP server<\/a> or <a href=\"https:\/\/www.anthropic.com\/engineering\/desktop-extensions\" rel=\"nofollow noopener\" target=\"_blank\">Desktop extension<\/a> (DXT) will allow you to connect and test your tools in Claude Code or the Claude Desktop app.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">To connect your local MCP server to Claude Code, run claude mcp add   [args&#8230;].<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">To connect your local MCP server or DXT to the Claude Desktop app, navigate to Settings &gt; Developer or Settings &gt; Extensions, respectively.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Tools can also be passed directly into <a href=\"https:\/\/docs.anthropic.com\/en\/docs\/agents-and-tools\/tool-use\/overview\" rel=\"nofollow noopener\" target=\"_blank\">Anthropic API<\/a> calls for programmatic testing.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Test the tools yourself to identify any rough edges. 
Collect feedback from your users to build an intuition around the use-cases and prompts you expect your tools to enable.<\/p>\n<p>Running an evaluation<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Next, you need to measure how well Claude uses your tools by running an evaluation. Start by generating lots of evaluation tasks, grounded in real world uses. We recommend collaborating with an agent to help analyze your results and determine how to improve your tools. See this process end-to-end in our <a href=\"https:\/\/github.com\/anthropics\/anthropic-cookbook\/blob\/main\/tool_evaluation\/tool_evaluation.ipynb\" rel=\"nofollow noopener\" target=\"_blank\">tool evaluation cookbook<\/a>.<\/p>\n<p><img alt=\"This graph measures the test set accuracy of human-written vs. Claude-optimized Slack MCP servers.\" loading=\"lazy\" width=\"1920\" height=\"1080\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712854_103_image\"\/>Held-out test set performance of our internal Slack tools<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Generating evaluation tasks<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">With your early prototype, Claude Code can quickly explore your tools and create dozens of prompt and response pairs. Prompts should be inspired by real-world uses and be based on realistic data sources and services (for example, internal knowledge bases and microservices). We recommend you avoid overly simplistic or superficial \u201csandbox\u201d environments that don\u2019t stress-test your tools with sufficient complexity. Strong evaluation tasks might require multiple tool calls\u2014potentially dozens.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Here are some examples of strong tasks:<\/p>\n<p>Schedule a meeting with Jane next week to discuss our latest Acme Corp project. 
Attach the notes from our last project planning meeting and reserve a conference room.<br \/>Customer ID 9182 reported that they were charged three times for a single purchase attempt. Find all relevant log entries and determine if any other customers were affected by the same issue.<br \/>Customer Sarah Chen just submitted a cancellation request. Prepare a retention offer. Determine: (1) why they&#8217;re leaving, (2) what retention offer would be most compelling, and (3) any risk factors we should be aware of before making an offer.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">And here are some weaker tasks:<\/p>\n<p>Schedule a meeting with jane@acme.corp next week.<br \/>Search the payment logs for purchase_complete and customer_id=9182.<br \/>Find the cancellation request by Customer ID 45892.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Each evaluation prompt should be paired with a verifiable response or outcome. Your verifier can be as simple as an exact string comparison between ground truth and sampled responses, or as advanced as enlisting Claude to judge the response. Avoid overly strict verifiers that reject correct responses due to spurious differences like formatting, punctuation, or valid alternative phrasings.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">For each prompt-response pair, you can optionally also specify the tools you expect an agent to call in solving the task, to measure whether or not agents are successful in grasping each tool\u2019s purpose during evaluation. However, because there might be multiple valid paths to solving tasks correctly, try to avoid overspecifying or overfitting to strategies.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Running the evaluation<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We recommend running your evaluation programmatically with direct LLM API calls. 
Use simple agentic loops (while-loops wrapping alternating LLM API and tool calls): one loop for each evaluation task. Each evaluation agent should be given a single task prompt and your tools.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In your evaluation agents\u2019 system prompts, we recommend instructing agents to output not just structured response blocks (for verification), but also reasoning and feedback blocks. Instructing agents to output these before tool call and response blocks may increase LLMs\u2019 effective intelligence by triggering chain-of-thought (CoT) behaviors.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">If you\u2019re running your evaluation with Claude, you can turn on <a href=\"https:\/\/docs.anthropic.com\/en\/docs\/build-with-claude\/extended-thinking#interleaved-thinking\" rel=\"nofollow noopener\" target=\"_blank\">interleaved thinking<\/a> for similar functionality \u201coff-the-shelf\u201d. This will help you probe why agents do or don\u2019t call certain tools and highlight specific areas of improvement in tool descriptions and specs.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">As well as top-level accuracy, we recommend collecting other metrics like the total runtime of individual tool calls and tasks, the total number of tool calls, the total token consumption, and tool errors. Tracking tool calls can help reveal common workflows that agents pursue and offer some opportunities for tools to consolidate.<\/p>\n<p><img alt=\"This graph measures the test set accuracy of human-written vs. 
Claude-optimized Asana MCP servers.\" loading=\"lazy\" width=\"1920\" height=\"1080\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712854_968_image\"\/>Held-out test set performance of our internal Asana tools<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Analyzing results<br \/>Agents are your helpful partners in spotting issues and providing feedback on everything from contradictory tool descriptions to inefficient tool implementations and confusing tool schemas. However, keep in mind that what agents omit in their feedback and responses can often be more important than what they include. LLMs don\u2019t always <a href=\"https:\/\/www.anthropic.com\/research\/tracing-thoughts-language-model\" rel=\"nofollow noopener\" target=\"_blank\">say what they mean<\/a>.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Observe where your agents get stumped or confused. Read through your evaluation agents\u2019 reasoning and feedback (or CoT) to identify rough edges. Review the raw transcripts (including tool calls and tool responses) to catch any behavior not explicitly described in the agent\u2019s CoT. Read between the lines; remember that your evaluation agents don\u2019t necessarily know the correct answers and strategies.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Analyze your tool calling metrics. Lots of redundant tool calls might suggest some rightsizing of pagination or token limit parameters is warranted; lots of tool errors for invalid parameters might suggest tools could use clearer descriptions or better examples. 
When we launched Claude\u2019s <a href=\"https:\/\/www.anthropic.com\/news\/web-search\" rel=\"nofollow noopener\" target=\"_blank\">web search tool<\/a>, we identified that Claude was needlessly appending 2025 to the tool\u2019s query parameter, biasing search results and degrading performance (we steered Claude in the right direction by improving the tool description).<\/p>\n<p>Collaborating with agents<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">You can even let agents analyze your results and improve your tools for you. Simply concatenate the transcripts from your evaluation agents and paste them into Claude Code. Claude is an expert at analyzing transcripts and refactoring lots of tools all at once\u2014for example, to ensure tool implementations and descriptions remain self-consistent when new changes are made.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In fact, most of the advice in this post came from repeatedly optimizing our internal tool implementations with Claude Code. Our evaluations were created on top of our internal workspace, mirroring the complexity of our internal workflows, including real projects, documents, and messages.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We relied on held-out test sets to ensure we did not overfit to our \u201ctraining\u201d evaluations. 
These test sets revealed that we could extract additional performance improvements even beyond what we achieved with &#8220;expert&#8221; tool implementations\u2014whether those tools were manually written by our researchers or generated by Claude itself.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In the next section, we\u2019ll share some of what we learned from this process.<\/p>\n<p>Principles for writing effective tools<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In this section, we distill our learnings into a few guiding principles for writing effective tools.<\/p>\n<p>Choosing the right tools for agents<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">More tools don\u2019t always lead to better outcomes. A common error we\u2019ve observed is tools that merely wrap existing software functionality or API endpoints\u2014whether or not the tools are appropriate for agents. Such wrappers often fall short because agents have different \u201caffordances\u201d from traditional software\u2014that is, they have different ways of perceiving the potential actions they can take with those tools.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">LLM agents have limited &#8220;context&#8221; (that is, there are limits to how much information they can process at once), whereas computer memory is cheap and abundant. Consider the task of searching for a contact in an address book. Traditional software programs can efficiently store and process a list of contacts one at a time, checking each one before moving on.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">However, if an LLM agent uses a tool that returns ALL contacts and then has to read through each one token-by-token, it&#8217;s wasting its limited context space on irrelevant information (imagine searching for a contact in your address book by reading each page from top to bottom\u2014that is, via brute-force search). 
The better and more natural approach (for agents and humans alike) is to skip to the relevant page first (perhaps finding it alphabetically).<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We recommend building a few thoughtful tools targeting specific high-impact workflows that match your evaluation tasks, then scaling up from there. In the address book case, you might choose to implement a search_contacts or message_contact tool instead of a list_contacts tool.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Tools can consolidate functionality, handling potentially multiple discrete operations (or API calls) under the hood. For example, tools can enrich tool responses with related metadata or handle frequently chained, multi-step tasks in a single tool call.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Here are some examples:<\/p>\n<p>Instead of implementing list_users, list_events, and create_event tools, consider implementing a schedule_event tool which finds availability and schedules an event.<br \/>Instead of implementing a read_logs tool, consider implementing a search_logs tool which only returns relevant log lines and some surrounding context.<br \/>Instead of implementing get_customer_by_id, list_transactions, and list_notes tools, implement a get_customer_context tool which compiles all of a customer\u2019s recent and relevant information at once.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Make sure each tool you build has a clear, distinct purpose. Tools should enable agents to subdivide and solve tasks in much the same way that a human would, given access to the same underlying resources, and simultaneously reduce the context that would have otherwise been consumed by intermediate outputs.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Too many tools or overlapping tools can also distract agents from pursuing efficient strategies. 
Careful, selective planning of the tools you build (or don\u2019t build) can really pay off.<\/p>\n<p>Namespacing your tools<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Your AI agents will potentially gain access to dozens of MCP servers and hundreds of different tools\u2013including those by other developers. When tools overlap in function or have a vague purpose, agents can get confused about which ones to use.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Namespacing (grouping related tools under common prefixes) can help delineate boundaries between lots of tools; MCP clients sometimes do this by default. For example, namespacing tools by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search, asana_users_search), can help agents select the right tools at the right time.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We have found selecting between prefix- and suffix-based namespacing to have non-trivial effects on our tool-use evaluations. Effects vary by LLM and we encourage you to choose a naming scheme according to your own evaluations.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Agents might call the wrong tools, call the right tools with the wrong parameters, call too few tools, or process tool responses incorrectly. By selectively implementing tools whose names reflect natural subdivisions of tasks, you simultaneously reduce the number of tools and tool descriptions loaded into the agent\u2019s context and offload agentic computation from the agent\u2019s context back into the tool calls themselves. This reduces an agent\u2019s overall risk of making mistakes.<\/p>\n<p>Returning meaningful context from your tools<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In the same vein, tool implementations should take care to return only high signal information back to agents. 
They should prioritize contextual relevance over flexibility, and eschew low-level technical identifiers (for example: uuid, 256px_image_url, mime_type). Fields like name, image_url, and file_type are much more likely to directly inform agents\u2019 downstream actions and responses.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Agents also tend to handle natural language names, terms, or identifiers significantly more successfully than they do cryptic identifiers. We\u2019ve found that merely resolving arbitrary alphanumeric UUIDs to more semantically meaningful and interpretable language (or even a 0-indexed ID scheme) significantly improves Claude\u2019s precision in retrieval tasks by reducing hallucinations.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In some instances, agents may require the flexibility to interact with both natural language and technical identifier outputs, if only to trigger downstream tool calls (for example, search_user(name='jane') \u2192 send_message(id=12345)). You can enable both by exposing a simple response_format enum parameter in your tool, allowing your agent to control whether tools return \u201cconcise\u201d or \u201cdetailed\u201d responses (images below).<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">You can add more formats for even greater flexibility, similar to GraphQL where you can choose exactly which pieces of information you want to receive. 
Here is an example ResponseFormat enum to control tool response verbosity:<\/p>\n<p>enum ResponseFormat {<br \/>\n   DETAILED = &#8220;detailed&#8221;,<br \/>\n   CONCISE = &#8220;concise&#8221;<br \/>\n}<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Here\u2019s an example of a detailed tool response (206 tokens):<\/p>\n<p><img alt=\"This code snippet depicts an example of a detailed tool response.\" loading=\"lazy\" width=\"1920\" height=\"1006\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712854_526_image\"\/><\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Here\u2019s an example of a concise tool response (72 tokens):<\/p>\n<p><img alt=\"This code snippet depicts a concise tool response.\" loading=\"lazy\" width=\"1920\" height=\"725\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712854_409_image\"\/>Slack threads and thread replies are identified by unique thread_ts which are required to fetch thread replies. thread_ts and other IDs (channel_id, user_id) can be retrieved from a \u201cdetailed\u201d tool response to enable further tool calls that require these. \u201cconcise\u201d tool responses return only thread content and exclude IDs. In this example, we use ~\u2153 of the tokens with \u201cconcise\u201d tool responses.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Even your tool response structure\u2014for example XML, JSON, or Markdown\u2014can have an impact on evaluation performance: there is no one-size-fits-all solution. This is because LLMs are trained on next-token prediction and tend to perform better with formats that match their training data. The optimal response structure will vary widely by task and agent. 
We encourage you to select the best response structure based on your own evaluation.<\/p>\n<p>Optimizing tool responses for token efficiency<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Optimizing the quality of context is important. But so is optimizing the quantity of context returned back to agents in tool responses.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We suggest implementing some combination of pagination, range selection, filtering, and\/or truncation with sensible default parameter values for any tool responses that could use up lots of context. For Claude Code, we restrict tool responses to 25,000 tokens by default. We expect the effective context length of agents to grow over time, but the need for context-efficient tools to remain.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">If you choose to truncate responses, be sure to steer agents with helpful instructions. You can directly encourage agents to pursue more token-efficient strategies, like making many small and targeted searches instead of a single, broad search for a knowledge retrieval task. Similarly, if a tool call raises an error (for example, during input validation), you can prompt-engineer your error responses to clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Here\u2019s an example of a truncated tool response:<\/p>\n<p><img alt=\"This image depicts an example of a truncated tool response.\" loading=\"lazy\" width=\"1920\" height=\"1162\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712855_42_image\"\/><\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Here\u2019s an example of an unhelpful error response:<\/p>\n<p><img alt=\"This image depicts an example of an unhelpful tool response. 
\" loading=\"lazy\" width=\"1920\" height=\"733\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712855_370_image\"\/><\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Here\u2019s an example of a helpful error response:<\/p>\n<p><img alt=\"This image depicts an example of a helpful error response.\" loading=\"lazy\" width=\"1920\" height=\"850\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/09\/1757712855_858_image\"\/>Tool truncation and error responses can steer agents towards more token-efficient tool-use behaviors (using filters or pagination) or give examples of correctly formatted tool inputs.Prompt-engineering your tool descriptions<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">We now come to one of the most effective methods for improving tools: prompt-engineering your tool descriptions and specs. Because these are loaded into your agents\u2019 context, they can collectively steer agents toward effective tool-calling behaviors.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">When writing tool descriptions and specs, think of how you would describe your tool to a new hire on your team. Consider the context that you might implicitly bring\u2014specialized query formats, definitions of niche terminology, relationships between underlying resources\u2014and make it explicit. Avoid ambiguity by clearly describing (and enforcing with strict data models) expected inputs and outputs. In particular, input parameters should be unambiguously named: instead of a parameter named user, try a parameter named user_id.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">With your evaluation you can measure the impact of your prompt engineering with greater confidence. 
Even small refinements to tool descriptions can yield dramatic improvements. Claude Sonnet 3.5 achieved state-of-the-art performance on the<a href=\"https:\/\/www.anthropic.com\/engineering\/swe-bench-sonnet\" rel=\"nofollow noopener\" target=\"_blank\"> SWE-bench Verified<\/a> evaluation after we made precise refinements to tool descriptions, dramatically reducing error rates and improving task completion.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">You can find other best practices for tool definitions in our <a href=\"https:\/\/docs.anthropic.com\/en\/docs\/agents-and-tools\/tool-use\/implement-tool-use#best-practices-for-tool-definitions\" rel=\"nofollow noopener\" target=\"_blank\">Developer Guide<\/a>. If you\u2019re building tools for Claude, we also recommend reading about how tools are dynamically loaded into Claude\u2019s <a href=\"https:\/\/docs.anthropic.com\/en\/docs\/agents-and-tools\/tool-use\/implement-tool-use#tool-use-system-prompt\" rel=\"nofollow noopener\" target=\"_blank\">system prompt<\/a>. 
Lastly, if you\u2019re writing tools for an MCP server, <a href=\"https:\/\/modelcontextprotocol.io\/specification\/2025-06-18\/server\/tools\" rel=\"nofollow noopener\" target=\"_blank\">tool annotations<\/a> help disclose which tools require open-world access or make destructive changes.<\/p>\n<p>Looking ahead<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">To build effective tools for agents, we need to re-orient our software development practices from predictable, deterministic patterns to non-deterministic ones.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Through the iterative, evaluation-driven process we\u2019ve described in this post, we&#8217;ve identified consistent patterns in what makes tools successful: Effective tools are intentionally and clearly defined, use agent context judiciously, can be combined together in diverse workflows, and enable agents to intuitively solve real-world tasks.<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">In the future, we expect the specific mechanisms through which agents interact with the world to evolve\u2014from updates to the MCP protocol to upgrades to the underlying LLMs themselves. 
With a systematic, evaluation-driven approach to improving tools for agents, we can ensure that as agents become more capable, the tools they use will evolve alongside them.<\/p>\n<p>Acknowledgements<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">Written by Ken Aizawa with valuable contributions from colleagues across Research (Barry Zhang, Zachary Witten, Daniel Jiang, Sami Al-Sheikh, Matt Bell, Maggie Vo), MCP (Theodora Chu, John Welsh, David Soria Parra, Adam Jones), Product Engineering (Santiago Seira), Marketing (Molly Vorwerck), Design (Drew Roper), and Applied AI (Christian Ryan, Alexander Bricken).<\/p>\n<p class=\"Body_reading-column__t7kGM paragraph-m post-text\">1Beyond training the underlying LLMs themselves.<\/p>\n<p><img alt=\"Interlocking puzzle piece with complex geometric shape and detailed surface texture\" loading=\"lazy\" width=\"1000\" height=\"1000\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\" src=\"https:\/\/www-cdn.anthropic.com\/images\/4zrzovbb\/website\/43abe7e54b56a891e74a8542944dfbd33f07f49c-1000x1000.svg\"\/>Looking to learn more?<\/p>\n<p class=\"paragraph-m tight\">Master API development, Model Context Protocol, and Claude Code with courses on Anthropic Academy. 
Earn certificates upon completion.<\/p>\n<p><a href=\"https:\/\/anthropic.skilljar.com\/\" class=\"ButtonCta_button__miruF detail-m ToutCallout_button__p7buc\" rel=\"noopener nofollow\" target=\"_blank\" aria-label=\"Explore courses\">Explore courses<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools to solve real-world tasks.&hellip;\n","protected":false},"author":2,"featured_media":138634,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-138633","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/138633","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=138633"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/138633\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/138634"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=138633"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=138633"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/
wp\/v2\/tags?post=138633"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}