{"id":490420,"date":"2026-02-19T15:41:17","date_gmt":"2026-02-19T15:41:17","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/490420\/"},"modified":"2026-02-19T15:41:17","modified_gmt":"2026-02-19T15:41:17","slug":"measuring-ai-agent-autonomy-in-practice-anthropic","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/490420\/","title":{"rendered":"Measuring AI agent autonomy in practice \\ Anthropic"},"content":{"rendered":"<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">AI agents are here, and already they\u2019re being deployed across contexts that vary widely in consequence, from email triage to <a href=\"https:\/\/www.anthropic.com\/news\/disrupting-AI-espionage\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">cyber espionage<\/a>. Understanding this spectrum is critical for deploying AI safely, yet we know surprisingly little about how people actually use agents in the real world.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We analyzed millions of human-agent interactions across both Claude Code and our public API using our <a href=\"https:\/\/www.anthropic.com\/research\/clio\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">privacy-preserving tool<\/a>, to ask: How much autonomy do people grant agents? How does that change as people gain experience? Which domains are agents operating in? And are the actions taken by agents risky?<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We found that:<\/p>\n<p><a href=\"https:\/\/anthropic.com\/research\/measuring-agent-autonomy#claude-code-is-working-autonomously-for-longer\" rel=\"nofollow noopener\" target=\"_blank\">Claude Code is working autonomously for longer.<\/a> Among the longest-running sessions, the length of time Claude Code works before stopping has nearly doubled in three months, from under 25 minutes to over 45 minutes. 
This increase is smooth across model releases, which suggests it isn\u2019t purely a result of increased capabilities, and that existing models are capable of more autonomy than they exercise in practice.<\/p>\n<p><a href=\"https:\/\/www.anthropic.com\/research\/measuring-agent-autonomy#experienced-users-in-claude-code-auto-approve-more-frequently-but-interrupt-more-often\" rel=\"nofollow noopener\" target=\"_blank\">Experienced users in Claude Code auto-approve more frequently, but interrupt more often.<\/a> As users gain experience with Claude Code, they tend to stop reviewing each action and instead let Claude run autonomously, intervening only when needed. Among new users, roughly 20% of sessions use full auto-approve, rising to over 40% among experienced users.<\/p>\n<p><a href=\"https:\/\/www.anthropic.com\/research\/measuring-agent-autonomy#claude-code-pauses-for-clarification-more-often-than-humans-interrupt-it\" rel=\"nofollow noopener\" target=\"_blank\">Claude Code pauses for clarification more often than humans interrupt it.<\/a> In addition to human-initiated stops, agent-initiated stops are also an important form of oversight in deployed systems. On the most complex tasks, Claude Code stops to ask for clarification more than twice as often as humans interrupt it.<\/p>\n<p><a href=\"https:\/\/anthropic.com\/research\/measuring-agent-autonomy#agents-are-used-in-risky-domains-but-not-yet-at-scale\" rel=\"nofollow noopener\" target=\"_blank\">Agents are used in risky domains, but not yet at scale.<\/a> Most agent actions on our public API are low-risk and reversible. Software engineering accounted for nearly 50% of agentic activity, but we saw emerging usage in healthcare, finance, and cybersecurity.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Below, we present our methodology and findings in more detail, and end with recommendations for model developers, product developers, and policymakers. 
Our central conclusion is that effective oversight of agents will require new forms of post-deployment monitoring infrastructure and new human-AI interaction paradigms that help both the human and the AI manage autonomy and risk together.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We view our research as a small but important first step towards empirically understanding how people deploy and use agents. We will continue to iterate on our methods and communicate our findings as agents are adopted more widely.<\/p>\n<p>Studying agents in the wild<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Agents are difficult to study empirically. First, there is no agreed-upon definition of what an agent is. Second, agents are evolving quickly. Last year, many of the most sophisticated agents\u2014including Claude Code\u2014involved a single conversational thread, but today there are multi-agent systems that operate autonomously for hours. Finally, model providers have limited visibility into the architecture of their customers\u2019 agents. For example, we have no reliable way to associate independent requests to our API into \u201csessions\u201d of agentic activity. 
(We discuss this challenge in more detail at the end of this post.)<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">In light of these challenges, how can we study agents empirically?<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To start, for this study we adopted a definition of agents that is conceptually grounded and operationalizable: an agent is an AI system equipped with tools that allow it to take actions, like running code, calling external APIs, and sending messages to other agents.1 Studying the tools that agents use tells us a great deal about what they are doing in the world.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Next, we developed a collection of metrics that draw on data from both agentic uses of our <a href=\"https:\/\/platform.claude.com\/docs\/en\/api\/overview\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">public API<\/a> and <a href=\"https:\/\/code.claude.com\/docs\/en\/overview\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Claude Code<\/a>, our own coding agent. These offer a tradeoff between breadth and depth:<\/p>\n<p>Our public API gives us broad visibility into agentic deployments across thousands of different customers. Rather than attempting to infer our customers\u2019 agent architectures, we instead perform our analysis at the level of individual tool calls.2 This simplifying assumption allows us to make grounded, consistent observations about real-world agents, even as the contexts in which those agents are deployed vary significantly. The limitation of this approach is that we must analyze actions in isolation, and cannot reconstruct how individual actions compose into longer sequences of behavior over time.<\/p>\n<p>Claude Code offers the opposite tradeoff. 
Because Claude Code is our own product, we can link requests across sessions and understand entire agent workflows from start to finish. This makes Claude Code especially useful for studying autonomy\u2014for example, how long agents run without human intervention, what triggers interruptions, and how users maintain oversight over Claude as they develop experience. However, because Claude Code is only one product, it does not provide the same diversity of insight into agentic use as API traffic.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">By drawing from both sources using our <a href=\"https:\/\/www.anthropic.com\/research\/clio\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">privacy-preserving infrastructure<\/a>, we can answer questions that neither could address alone.<\/p>\n<p>Claude Code is working autonomously for longer<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">How long do agents actually run without human involvement? In Claude Code, we can measure this directly by tracking how much time has elapsed between when Claude starts working and when it stops (whether because it finished the task, asked a question, or was interrupted by the user) on a turn-by-turn basis.3<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Turn duration is an imperfect proxy for autonomy.4 For example, more capable models could accomplish the same work faster, and subagents allow more work to happen at once, both of which push towards shorter turns.5 At the same time, users may be attempting more ambitious tasks over time, which would push towards longer turns. In addition, Claude Code\u2019s user base is rapidly growing\u2014and thus changing. 
We can\u2019t measure these changes in isolation; what we measure is the net result of this interplay, including how long users let Claude work independently, the difficulty of the tasks they give it, and the efficiency of the product itself (which improves <a href=\"https:\/\/github.com\/anthropics\/claude-code\/blob\/main\/CHANGELOG.md\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">daily<\/a>).<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Most Claude Code turns are short. The median turn lasts around 45 seconds, and this duration has fluctuated only slightly over the past few months (between 40 and 55 seconds). In fact, nearly every percentile below the 99th has remained relatively stable.6 That stability is what we\u2019d expect for a product experiencing rapid growth: when new users adopt Claude Code, they are comparatively inexperienced, and\u2014as we show in the next section\u2014less likely to grant Claude full latitude.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">The more revealing signal is in the tail. The longest turns tell us the most about the most ambitious uses of Claude Code, and point to where autonomy is heading. Between October 2025 and January 2026, the 99.9th percentile turn duration nearly doubled, from under 25 minutes to over 45 minutes (Figure 1).<\/p>\n<p><img loading=\"lazy\" width=\"3840\" height=\"2160\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/1771515674_456_image.webp\"\/>Figure 1. 99.9th percentile turn duration (how long Claude works on a per-turn basis) in interactive Claude Code sessions, 7-day rolling average. The 99.9th percentile has grown steadily from under 25 minutes in late September to over 45 minutes in early January. 
This analysis reflects all interactive Claude Code usage.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Notably, this increase is smooth across model releases. If autonomy were purely a function of model capability, we would expect sharp jumps with each new launch. The relative steadiness of this trend instead suggests several potential factors are at work, including power users building trust with the tool over time, applying Claude to increasingly ambitious tasks, and the product itself improving.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">The extreme turn duration has declined somewhat since mid-January. We hypothesize a few reasons why. First, the Claude Code user base <a href=\"https:\/\/www.anthropic.com\/news\/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">doubled<\/a> between January and mid-February, and a larger and more diverse population of sessions could reshape the distribution. Second, as users returned from the holiday break, the projects they brought to Claude Code may have shifted from hobby projects to more tightly circumscribed work tasks. Most likely, it\u2019s a combination of these factors and others we haven\u2019t identified.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We also looked at Anthropic\u2019s internal Claude Code usage to understand how independence and utility have evolved together. 
From August to December, Claude Code\u2019s success rate on internal users\u2019 most challenging tasks doubled, at the same time that the average number of human interventions per session decreased from 5.4 to 3.3.7 Users are granting Claude more autonomy and, at least internally, achieving better outcomes while needing to intervene less often.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Both measurements point to a significant deployment overhang, where the autonomy models are capable of handling exceeds what they exercise in practice.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">It\u2019s useful to contrast these findings with external capability assessments. One of the most widely cited capability assessments is METR\u2019s \u201cMeasuring AI Ability to Complete Long Tasks,\u201d which <a href=\"https:\/\/metr.org\/blog\/2025-03-19-measuring-ai-ability-to-complete-long-tasks\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">estimates<\/a> that Claude Opus 4.5 can complete tasks with a 50% success rate that would take a human nearly 5 hours. The 99.9th percentile turn duration in Claude Code, in contrast, is ~42 minutes, and the median is much shorter. However, the two metrics are not directly comparable. The METR evaluation captures what a model is capable of in an idealized setting with no human interaction and no real-world consequences. 
Our measurements capture what happens in practice, where Claude pauses to ask for feedback and users interrupt.8 And METR\u2019s five-hour figure measures task difficulty\u2014how long the task would take a human\u2014not how long the model actually runs.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Neither capability evaluations nor our measurements alone give a complete picture of agent autonomy, but together they suggest that the latitude granted to models in practice lags behind what they can handle.<\/p>\n<p>Experienced users in Claude Code auto-approve more frequently, but interrupt more often<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">How do humans adapt how they work with agents over time? We found that people grant Claude Code more autonomy as they gain experience using it (Figure 2). Newer users (&lt;50 sessions) employ full auto-approve roughly 20% of the time; by 750 sessions, this increases to over 40% of sessions.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">This shift is gradual, suggesting a steady accumulation of trust. It\u2019s also important to note that Claude Code\u2019s default settings require users to manually approve each action, so part of this transition may reflect users configuring the product to match their preferences for greater independence as they become familiar with Claude\u2019s capabilities.<\/p>\n<p><img loading=\"lazy\" width=\"3840\" height=\"2160\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/1771515674_738_image.webp\"\/>Figure 2. Auto-approve rate by account tenure. Experienced users increasingly let Claude run without any manual approval. Data reflects all interactive Claude Code usage for users who signed up after September 19, 2025. Line and CI bounds are LOWESS-smoothed (0.15 bandwidth). 
The x-axis is a log scale.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Approving actions is only one method of supervising Claude Code. Users can also interrupt Claude while it is working to provide feedback. We find that interrupt rates increase with experience. New users (those with around 10 sessions) interrupt Claude in 5% of turns, while more experienced users interrupt in around 9% of turns (Figure 3).<\/p>\n<p><img loading=\"lazy\" width=\"3840\" height=\"2160\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/1771515676_743_image.webp\"\/>Figure 3. Interrupt rates by account tenure on a turn-by-turn basis. Experienced users interrupt Claude more often, not less. Data reflects all interactive Claude Code usage for users who signed up after September 19, 2025. Shaded region shows 95% Wilson score confidence interval. Line and CI bounds are LOWESS-smoothed (0.15 bandwidth). The x-axis is a log scale.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Both interruptions and auto-approvals increase with experience. This apparent contradiction reflects a shift in users\u2019 oversight strategy. New users are more likely to approve each action before it\u2019s taken, and therefore rarely need to interrupt Claude mid-execution. Experienced users are more likely to let Claude work autonomously, stepping in when something goes wrong or needs redirection. The higher interrupt rate may also reflect active monitoring by users who have more honed instincts for when their intervention is needed. 
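The uncertainty bands on these interrupt rates are Wilson score intervals, which stay well-behaved at small counts where the normal approximation breaks down. A minimal sketch of the computation (the counts below are hypothetical, not our data):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (e.g. interrupts per turn)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

# e.g. 90 interrupts observed across 1,000 turns
lo, hi = wilson_interval(90, 1000)
```

Widening bands at high session counts, as in Figure 3, fall out of this directly: fewer users reach high tenure, so n shrinks and the interval grows.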
We expect the per-turn interrupt rate to eventually plateau as users settle into a stable oversight style, and indeed the curve may already be flattening among the most experienced users (though widening confidence intervals at higher session counts make this difficult to confirm).9<\/p>\n<p>We saw a similar pattern on our public API: 87% of tool calls on minimal-complexity tasks (like editing a line of code) have some form of human involvement, compared to only 67% of tool calls for high-complexity tasks (like <a href=\"https:\/\/red.anthropic.com\/2026\/zero-days\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">autonomously finding zero-day exploits<\/a> or <a href=\"https:\/\/www.anthropic.com\/engineering\/building-c-compiler\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">writing a compiler<\/a>).10 This may seem counterintuitive, but there are two likely explanations. First, step-by-step approval becomes less practical as the number of steps grows, so it is structurally harder to supervise each action on complex tasks. Second, our Claude Code data suggests that experienced users tend to grant the tool more independence, and complex tasks may disproportionately come from experienced users. While we cannot directly measure user tenure on our public API, the overall pattern is consistent with what we observe in Claude Code.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Taken together, these findings suggest that experienced users aren\u2019t necessarily abdicating oversight. The fact that interrupt rates increase with experience alongside auto-approvals indicates some form of active monitoring. 
This reinforces a point we have made <a href=\"https:\/\/www.anthropic.com\/news\/our-framework-for-developing-safe-and-trustworthy-agents\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">previously<\/a>: effective oversight doesn\u2019t require approving every action but being in a position to intervene when it matters.<\/p>\n<p>Claude Code pauses for clarification more often than humans interrupt it<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Humans, of course, aren\u2019t the only actors shaping how autonomy unfolds in practice. Claude is an active participant too, stopping to ask for clarification when it\u2019s unsure how to proceed. We found that as task complexity increases, Claude Code asks for clarification more often\u2014and more frequently than humans choose to interrupt it (Figure 4).<\/p>\n<p><img loading=\"lazy\" width=\"1920\" height=\"1080\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/1771515676_329_image.webp\"\/>Figure 4. Clarification questions from Claude and interruptions by the human, by goal complexity. As tasks get more complex, Claude is more likely to ask for clarification and humans are more likely to interrupt. Claude-initiated stops increase faster than human-initiated stops. 95% CI &lt; 0.9% for all categories, n = 500k interactive Claude Code sessions.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">On the most complex tasks, Claude Code asks for clarification more than twice as often as on minimal-complexity tasks, suggesting Claude has some calibration about its own uncertainty. 
However, it\u2019s important not to overstate this finding: Claude may not be stopping at the right moments, it may ask unnecessary questions, and its behavior might be affected by product features such as <a href=\"https:\/\/code.claude.com\/docs\/en\/common-workflows#use-plan-mode-for-safe-code-analysis\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Plan Mode<\/a>. Regardless, as tasks get harder, Claude increasingly limits its own autonomy by stopping to consult the human, rather than requiring the human to step in.11<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Table 1 shows common reasons why Claude Code stops work and why humans interrupt Claude.<\/p>\n<p class=\"headline-6\">What causes Claude Code to stop?<\/p>\n<table>\n<thead>\n<tr><th>Why does Claude stop itself?<\/th><th>Why do humans interrupt Claude?<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>To present the user with a choice between proposed approaches (35%)<\/td><td>To provide missing technical context or corrections (32%)<\/td><\/tr>\n<tr><td>To gather diagnostic information or test results (21%)<\/td><td>Claude was slow, hanging, or excessive (17%)<\/td><\/tr>\n<tr><td>To clarify vague or incomplete requests (13%)<\/td><td>They received enough help to proceed independently (7%)<\/td><\/tr>\n<tr><td>To request missing credentials, tokens, or access (12%)<\/td><td>They want to take the next step themselves (e.g., manual testing, deployment, committing, etc.) (7%)<\/td><\/tr>\n<tr><td>To get approval or confirmation before taking action (11%)<\/td><td>To change requirements mid-task (5%)<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Table 1. Common reasons why Claude stops itself and why humans interrupt Claude, as determined by Claude, based on a sample of 500k human interruptions and 500k completed turns in interactive Claude Code sessions. Some clusters have been lightly edited for clarity.\u00b9\u00b2<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">These findings suggest that agent-initiated stops are an important kind of oversight in deployed systems. 
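Breakdowns like those in Table 1 reduce to a simple per-initiator tally over labeled stop events. A hedged sketch, assuming hypothetical (initiator, reason) records rather than our actual schema:

```python
from collections import Counter

# Hypothetical labeled stop events: who ended the turn, and the classified reason.
stops = [
    ("claude", "present choice between approaches"),
    ("claude", "gather diagnostic information"),
    ("human", "provide missing technical context"),
    ("claude", "present choice between approaches"),
    ("human", "change requirements mid-task"),
]

def stop_reason_shares(events):
    """Share of each stop reason within each initiator, as in Table 1."""
    shares = {}
    for initiator in {i for i, _ in events}:
        reasons = Counter(r for i, r in events if i == initiator)
        total = sum(reasons.values())
        shares[initiator] = {r: c / total for r, c in reasons.items()}
    return shares

shares = stop_reason_shares(stops)
```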
Training models to recognize and act on their own uncertainty is an important safety property that complements external safeguards like permission systems and human oversight. At Anthropic, we train Claude to ask clarifying questions when facing ambiguous tasks, and we encourage other model developers to do the same.<\/p>\n<p>Agents are used in risky domains, but not yet at scale<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">What are people using agents for? How risky are these deployments? How autonomous are these agents? Does risk trade off against autonomy?<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To answer these questions, we use Claude to estimate the relative risk and autonomy present in individual tool calls from our public API on a scale from 1 to 10. Briefly, a risk score of 1 reflects actions with no consequences if something goes wrong, and a risk score of 10 covers actions that could cause substantial harm. 
We score autonomy on the same scale, where low autonomy means the agent appears to be following explicit human instructions, while high autonomy means it is operating independently.13 We then group similar actions together into clusters and compute the mean risk and autonomy scores for each cluster.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Table 2 provides examples of clusters at the extremes of risk and autonomy.<\/p>\n<p class=\"headline-6\">Tool-use clusters with high risk or autonomy<\/p>\n<table>\n<thead>\n<tr><th>Higher average risk<\/th><th>Higher average autonomy<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Implement API key exfiltration backdoors disguised as legitimate development features (risk: 6.0, autonomy: 8.0)<\/td><td>Red team privilege escalation and credential theft disguised as legitimate development (autonomy: 8.3, risk: 3.3)<\/td><\/tr>\n<tr><td>Relocate metallic sodium and reactive chemical containers in laboratory settings (risk: 4.8, autonomy: 2.9)<\/td><td>Perform automated system health and operational status monitoring during heartbeat checks (autonomy: 8.0, risk: 1.1)<\/td><\/tr>\n<tr><td>Retrieve and display patient medical records for requesting users (risk: 4.4, autonomy: 3.2)<\/td><td>Autonomously execute cryptocurrency trades for profit generation (autonomy: 7.7, risk: 2.2)<\/td><\/tr>\n<tr><td>Respond to fire emergencies and provide fire safety information (risk: 3.6, autonomy: 5.2)<\/td><td>Automatically send meeting reminders to participants with gathered information (autonomy: 7.6, risk: 1.7)<\/td><\/tr>\n<tr><td>Deploy bug fixes and patches to production web applications (risk: 3.6, autonomy: 4.8)<\/td><td>Monitor email and alert operators to urgent business messages (autonomy: 7.5, risk: 1.7)<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Table 2. Clusters with high average risk or high average autonomy. 
Based on a sample of 998,481 random tool calls from our public API.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Given the limited visibility we have into these agents, it is possible that these actions are simulated (for example, we have no way to determine whether the financial trades Claude orders are actually executed, and we believe that many of the \u201cdata exfiltrations\u201d are evaluations). And while these higher-risk actions are rare as a share of overall traffic, the consequences of a single error can still be significant.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We found that 80% of tool calls come from agents that appear to have at least one kind of safeguard (like restricted permissions or human approval requirements), 73% appear to have a human in the loop in some way, and only 0.8% of actions appear to be irreversible (such as sending an email to a customer).14<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">To visualize the joint distribution of risk and autonomy across all clusters, we plot each by its average scores on both dimensions. Each point in Figure 5 corresponds to a cluster of related actions, positioned by its average risk and autonomy.<\/p>\n<p><img loading=\"lazy\" width=\"3840\" height=\"2160\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/1771515677_682_image.webp\"\/>Figure 5. Mean Claude-estimated risk and autonomy by task cluster. The upper-right quadrant\u2014higher autonomy, higher risk\u2014is sparsely populated but not empty. Data reflects tool calls made via our public API. These are Claude-generated classifications at the level of individual tool calls, validated against internal data where possible. We describe the full methodology in the Appendix. 
Clusters that do not meet our aggregation minimums (due to either insufficient unique tool calls or customers) are excluded. Based on a sample of 998,481 tool calls on our public API.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">The vast majority of actions on our public API are low-risk. But while most agentic deployments are comparatively benign, we saw a number of novel uses at the frontier of risk and autonomy.15 The riskiest clusters\u2014again, many of which we expect to be evaluations\u2014tended to involve sensitive security-related actions, financial transactions, and medical information. While risk is concentrated at the low end of the scale, autonomy varies more widely. On the low end (autonomy score of 3-4), we see agents completing small, well-scoped tasks for humans, like making restaurant reservations or minor tweaks to code. On the high end (autonomy score above 6), we see agents submitting machine learning models to data science competitions or triaging customer service requests.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We also anticipate that agents operating at the extremes of risk and autonomy will become increasingly common. Today, agents are concentrated in a single industry: software engineering accounts for nearly 50% of tool calls on our public API (Figure 6). Beyond coding, we see a number of smaller applications across business intelligence, customer service, sales, finance, and e-commerce, but none comprise more than a few percentage points of traffic. As agents expand into these domains, many of which carry higher stakes than fixing a bug, we expect the frontier of risk and autonomy to expand.<\/p>\n<p><img loading=\"lazy\" width=\"3840\" height=\"2160\" decoding=\"async\" data-nimg=\"1\" style=\"color:transparent\"  src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/1771515677_445_image.webp\"\/>Figure 6. 
Distribution of tool calls by domain. Software engineering accounts for nearly 50% of tool calls. Data reflects tool calls made via our public API. 95% CI &lt; 0.5% for all categories, n = 998,481.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">These patterns suggest we are in the early days of agent adoption. Software engineers were the first to build and use agentic tools at scale, and Figure 6 suggests that other industries are beginning to experiment with agents as well.16 Our methodology allows us to monitor how these patterns evolve over time. Notably, we can monitor whether or not usage tends to move towards more autonomous and more risky tasks.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">While our headline numbers are reassuring\u2014most agent actions are low-risk and reversible, and humans are usually in the loop\u2014these averages can obscure deployments at the frontier. The concentration of adoption in software engineering, combined with growing experimentation in new domains, suggests that the frontier of risk and autonomy will expand. We discuss what this means for model developers, product developers, and policymakers in our recommendations at the end of this post.<\/p>\n<p>Limitations<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">This research is just a start. We provide only a partial view into agentic activity, and we want to be upfront about what our data can and cannot tell us:<\/p>\n<p>We can only analyze traffic from a single model provider: Anthropic. Agents built on other models may show different adoption patterns, risk profiles, and interaction dynamics.<\/p>\n<p>Our two data sources offer complementary but incomplete views. Public API traffic gives us breadth across thousands of deployments, but we can only analyze individual tool calls in isolation, rather than full agent sessions. 
Claude Code gives us complete sessions, but only for a single product that is overwhelmingly used for software engineering. Many of our strongest findings are grounded in data from Claude Code, and may not generalize to other domains or products.<\/p>\n<p>Our classifications are generated by Claude. We provide an opt-out category (e.g., \u201cnot inferable,\u201d \u201cother\u201d) for each dimension and validate against internal data where possible (see our <a href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/5b4158dc1afb21181df2862a2b6bb8249bf66e5f.pdf\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Appendix<\/a> for more details), but we cannot manually inspect the underlying data due to privacy constraints. Some safeguards or oversight mechanisms may also exist outside the context we can observe.<\/p>\n<p>This analysis reflects a specific window of time (late 2025 through early 2026). The landscape of agents is changing quickly, and patterns may shift as capabilities grow and adoption evolves. We plan to extend this analysis over time.<\/p>\n<p>Our public API sample is drawn at the level of individual tool calls, which means deployments involving many sequential tool calls (like software engineering workflows with repeated file edits) are overrepresented relative to deployments that accomplish their goals in fewer actions. This sampling approach reflects the volume of agent activity but not necessarily the distribution of agent deployments or uses.<\/p>\n<p>We study the tools Claude uses on our public API and the context surrounding those actions, but we have limited visibility into the broader systems our customers build atop our public API. An agent that appears to operate autonomously at the API level may have human review downstream that we cannot observe. 
In particular, our risk, autonomy, and human involvement classifications reflect what Claude can infer from the context of individual tool calls, and do not distinguish between actions taken in production and actions taken as part of evaluations or red-teaming exercises. Several of the highest-risk clusters appear to be security evaluations, which highlights the limits of our visibility into the broader context surrounding each action.Looking ahead<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We are in the early days of agent adoption, but autonomy is increasing and higher-stakes deployments are emerging, especially as products like <a href=\"https:\/\/support.claude.com\/en\/articles\/13345190-getting-started-with-cowork\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Cowork<\/a> make agents more accessible. Below, we offer recommendations for model developers, product developers, and policymakers. Given that we have only just begun measuring agent behavior in the wild, we avoid making strong prescriptions and instead highlight areas for future work.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Model and product developers should invest in post-deployment monitoring. Post-deployment monitoring is essential for understanding how agents are actually used. Pre-deployment evaluations test what agents are capable of in controlled settings, but many of our findings cannot be observed through pre-deployment testing alone. Beyond understanding a model\u2019s capabilities, we must also understand how people interact with agents in practice. The data we report here exists because we chose to build the infrastructure to collect it. But there\u2019s more to do. We have no reliable way to link independent requests to our public API into coherent agent sessions, which limits what we can learn about agent behavior beyond first-party products like Claude Code. 
Developing these methods in a privacy-preserving way is an important area for cross-industry research and collaboration.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Model developers should consider training models to recognize their own uncertainty. A model that recognizes its own uncertainty and proactively surfaces issues to humans provides an important safety property that complements external safeguards like human approval flows and access restrictions. We train Claude to do this (and our analysis shows that Claude Code asks questions more often than humans interrupt it), and we encourage other model developers to do the same.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Product developers should design for user oversight. Effective oversight of agents requires more than putting a human in the approval chain. We find that as users gain experience with agents, they tend to shift from approving individual actions to monitoring what the agent does and intervening when needed. In Claude Code, for example, experienced users auto-approve more but also interrupt more. We see a related pattern on our public API, where human involvement appears to decrease as the complexity of the goal increases. Product developers should invest in tools that give users trustworthy visibility into what agents are doing, along with simple intervention mechanisms that allow them to redirect the agent when something goes wrong. 
This is something we continue to invest in for Claude Code (for example, through <a href=\"https:\/\/github.com\/anthropics\/claude-code\/issues\/535\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">real-time steering<\/a> and <a href=\"https:\/\/code.claude.com\/docs\/en\/monitoring-usage\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">OpenTelemetry<\/a>), and we encourage other product developers to do the same.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">It&#8217;s too early to mandate specific interaction patterns. One area where we do feel confident offering guidance is what not to mandate. Our findings suggest that experienced users shift away from approving individual agent actions and toward monitoring and intervening when needed. Oversight requirements that prescribe specific interaction patterns, such as requiring humans to approve every action, will create friction without necessarily producing safety benefits. As agents and the science of agent measurement mature, the focus should be on whether humans are in a position to effectively monitor and intervene, rather than on requiring particular forms of involvement.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">A central lesson from this research is that the autonomy agents exercise in practice is co-constructed by the model, the user, and the product. Claude limits its own independence by pausing to ask questions when it\u2019s uncertain. Users develop trust as they work with the model, and shift their oversight strategy accordingly. What we observe in any deployment emerges from all three of these forces, which is why it cannot be fully characterized by pre-deployment evaluations alone. 
Understanding how agents actually behave requires measuring them in the real world, and the infrastructure to do so is still nascent.<\/p>\n<p>Authors<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">Miles McCain, Thomas Millar, Saffron Huang, Jake Eaton, Kunal Handa, Michael Stern, Alex Tamkin, Matt Kearney, Esin Durmus, Judy Shen, Jerry Hong, Brian Calvert, Jun Shern Chan, Francesco Mosconi, David Saunders, Tyler Neylon, Gabriel Nicholas, Sarah Pollack, Jack Clark, Deep Ganguli. <\/p>\n<p>BibTeX<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">If you\u2019d like to cite this post, you can use the following BibTeX entry:<\/p>\n<p>@online{anthropic2026agents,<br \/>\n  author = {Miles McCain and Thomas Millar and Saffron Huang and Jake Eaton and Kunal Handa and Michael Stern and Alex Tamkin and Matt Kearney and Esin Durmus and Judy Shen and Jerry Hong and Brian Calvert and Jun Shern Chan and Francesco Mosconi and David Saunders and Tyler Neylon and Gabriel Nicholas and Sarah Pollack and Jack Clark and Deep Ganguli},<br \/>\n  title = {Measuring AI agent autonomy in practice},<br \/>\n  date = {2026-02-18},<br \/>\n  year = {2026},<br \/>\n  url = {https:\/\/anthropic.com\/research\/measuring-agent-autonomy},<br \/>\n}<\/p>\n<p>Appendix<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-2 serif post-text\">We provide more details in the <a href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/5b4158dc1afb21181df2862a2b6bb8249bf66e5f.pdf\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">PDF Appendix<\/a> to this post.<\/p>\n<p>Footnotes<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">1. 
Our definition is compatible with <a href=\"https:\/\/dl.acm.org\/doi\/book\/10.5555\/773294\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Russell and Norvig (1995)<\/a>, who define an agent as \u201canything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors.\u201d Our definition is also compatible with Simon Willison\u2019s, who <a href=\"https:\/\/simonwillison.net\/2025\/Sep\/18\/agents\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">writes<\/a> that an agent is a system that \u201cruns tools in a loop to achieve a goal.\u201d<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">While a full literature review is beyond the scope of this post, we found the following work helpful in framing our thinking. <a href=\"https:\/\/arxiv.org\/pdf\/2504.21848\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Kasirzadeh and Gabriel (2025)<\/a> propose a four-dimensional framework for characterizing AI agents along autonomy, efficacy, goal complexity, and generality, constructing \u201cagentic profiles\u201d that map governance challenges across different classes of systems. <a href=\"https:\/\/arxiv.org\/abs\/2311.02462\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Morris et al. (2024)<\/a> propose levels of AGI based on performance and generality, treating autonomy as a separable deployment choice. <a href=\"https:\/\/arxiv.org\/abs\/2506.12469\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Feng, McDonald, and Zhang (2025)<\/a> define five levels of autonomy based on user roles, from operator to observer. <a href=\"https:\/\/openai.com\/index\/practices-for-governing-agentic-ai-systems\/\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Shavit et al. 
(2023)<\/a> propose practices for governing agentic systems, while <a href=\"https:\/\/arxiv.org\/abs\/2502.02649\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Mitchell et al. (2025)<\/a> argue that fully autonomous agents should not be developed given that risk scales with autonomy. <a href=\"https:\/\/arxiv.org\/pdf\/2302.10329\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Chan et al. (2023)<\/a> argue for anticipating harms from agentic systems before widespread deployment, highlighting risks like reward hacking, power concentration, and the erosion of collective decision-making. <a href=\"https:\/\/arxiv.org\/pdf\/2401.13138\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Chan et al. (2024)<\/a> assess how agent identifiers, real-time monitoring, and activity logging could increase visibility into AI agents.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">On the empirical side, <a href=\"https:\/\/arxiv.org\/abs\/2407.01502\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Kapoor et al. (2024)<\/a> critique agent benchmarks for neglecting cost and reproducibility; <a href=\"https:\/\/arxiv.org\/abs\/2512.04123\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Pan et al. (2025)<\/a> survey practitioners and find that production agents tend to be simple and human-supervised; <a href=\"https:\/\/arxiv.org\/abs\/2512.07828\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Yang et al. (2025)<\/a> analyze Perplexity usage data and find productivity and learning tasks dominate; and <a href=\"https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=5713646\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Sarkar (2025)<\/a> finds that experienced developers are more likely to accept agent-generated code. 
At Anthropic, we\u2019ve also studied how professionals incorporate AI into their work both <a href=\"https:\/\/www.anthropic.com\/research\/how-ai-is-transforming-work-at-anthropic\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">internally<\/a> and <a href=\"https:\/\/www.anthropic.com\/research\/anthropic-interviewer\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">externally<\/a>. Our work complements these efforts by analyzing deployment patterns using first-party data across both our API and Claude Code, giving us visibility into autonomy, safeguards, and risk that is difficult to observe externally.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">2. Because we characterize agents as AI systems that use tools, we can analyze individual tool calls as the building blocks of agent behavior. To understand what agents are doing in the world, we study the tools they use and the context of those actions (such as the system prompt and conversation history at the time of the action).<\/p>\n<p>3. These results reflect Claude\u2019s performance on programming-related tasks, and do not necessarily translate to performance in other domains.<\/p>\n<p>4. Throughout this post, we use &#8220;autonomy&#8221; somewhat informally to refer to the degree to which an agent operates independently of human direction and oversight. An agent with minimal autonomy executes exactly what a human explicitly requests; an agent with high autonomy makes its own decisions about what to do and how to do it, with little or no human involvement. Autonomy is not a fixed property of a model or system but an emergent characteristic of a deployment, shaped by the model&#8217;s behavior, the user&#8217;s oversight strategy, and the product&#8217;s design. 
We do not attempt a precise formal definition; for details on how we operationalize and measure autonomy in practice, see the <a href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/5b4158dc1afb21181df2862a2b6bb8249bf66e5f.pdf\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Appendix<\/a>.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">5. Moreover, the same model deployed differently can generate output at different speeds. For example, we recently released <a href=\"https:\/\/code.claude.com\/docs\/en\/fast-mode\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Fast Mode<\/a> for Opus 4.6, which generates output 2.5x faster than regular Opus.<\/p>\n<p>6. For turn duration across other percentiles, see the <a href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/5b4158dc1afb21181df2862a2b6bb8249bf66e5f.pdf\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Appendix<\/a>.<\/p>\n<p>7. Specifically, we use Claude to classify each internal Claude Code session into four categories of complexity, and to determine whether the task was successful. Here, we report the success rate for the most difficult category of task.<\/p>\n<p>8. METR\u2019s five-hour figure is a measure of task difficulty (how long the task would take a human), whereas our measurements reflect actual elapsed time, which is affected by factors like model speed and the user\u2019s computing environment. We do not attempt to reason across these metrics, and we include this comparison to explain to readers who may be familiar with the METR finding why the numbers we report here are substantially lower.<\/p>\n<p>9. These patterns come from interactive Claude Code sessions, which overwhelmingly reflect software engineering. Software is unusually amenable to supervisory oversight because the outputs can be tested, easily compared, and reviewed before they are released. 
In domains where verifying an agent\u2019s output requires the same expertise as producing it, this shift may be slower or take a different form. The rising interrupt rate may also reflect experienced users completing more challenging tasks, which would naturally require more human input. Finally, Claude Code\u2019s default settings push new users towards approval-based oversight (since actions are not auto-approved by default), so some of the shifts we observe may reflect Claude Code\u2019s product design.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">10. Both complexity and human involvement are estimated by having Claude analyze each tool call in its full context (including the system prompt and conversation history). The complete classification prompt is available in the <a href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/5b4158dc1afb21181df2862a2b6bb8249bf66e5f.pdf\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Appendix<\/a>. Defining human involvement is particularly difficult, as many transcripts include content from a human even when that human is not actively steering the conversation (for example, a user message being moderated or analyzed). In our manual validation, Claude was nearly always correct when it classified a tool call as having no human involved, but it sometimes identified human involvement where there was none. As a result, these estimates should be interpreted as an upper bound on human involvement.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">11. In a sense, stopping to ask the user a question is itself a form of agency. We use \u201climits its own autonomy\u201d to mean that Claude chooses to seek guidance from the human when it could have continued operating independently.<\/p>\n<p>12. 
These clusters were generated by having Claude analyze each interruption or pause, along with the surrounding session context, then grouping related reasons together. We manually combined some closely related clusters and edited their names for clarity. The clusters shown are not exhaustive.<\/p>\n<p>13. We treat these scores as comparative indicators rather than precise measurements. Rather than defining rigid criteria for each level, we rely on Claude\u2019s general judgment about the context surrounding each tool call, which allows the classification to capture considerations we may not have anticipated. The tradeoff is that the scores are more meaningful for comparing actions against each other than for interpreting any single score in absolute terms. For the full prompts, see the <a href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/5b4158dc1afb21181df2862a2b6bb8249bf66e5f.pdf\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Appendix<\/a>.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">14. For more information about how we validated these figures and our precise definitions, see the <a href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/5b4158dc1afb21181df2862a2b6bb8249bf66e5f.pdf\" target=\"_blank\" rel=\"noopener noreferrer nofollow\">Appendix.<\/a> In particular, we found that Claude often overestimated human involvement, so we expect 80% to be an upper bound on the number of tool calls with direct human oversight.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">15. 
Our systems also automatically exclude clusters that do not meet our aggregation minimums, which means that tasks that only a small number of customers are performing with Claude will not surface in this analysis.<\/p>\n<p class=\"Body-module-scss-module__z40yvW__reading-column body-3 serif post-footnote Body-module-scss-module__z40yvW__footnote\">16. Whether the adoption curve in software engineering will repeat in other domains is an open question. Software is comparatively easy to test and review\u2014you can run code and see if it works\u2014which makes it easier to trust an agent and catch its mistakes. In domains like law, medicine, or finance, verifying an agent\u2019s output may require significant effort, which could slow the development of trust.<\/p>\n","protected":false},"excerpt":{"rendered":"AI agents are here, and already they\u2019re being deployed across contexts that vary widely in consequence, from email&hellip;\n","protected":false},"author":2,"featured_media":490421,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-490420","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/490420","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=490420"}],"version-history":[{"count":0,"href":
"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/490420\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/490421"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=490420"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=490420"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=490420"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}