An Anthropic researcher’s effort to get the company’s newly released Opus 4.6 model to build a C compiler left him “excited,” “concerned,” and “uneasy.”

It also left many observers on GitHub skeptical, to say the least.

Nicholas Carlini, a researcher on Anthropic’s Safeguards team, detailed the experiment with what he called “agent teams” in a blog post that coincided with the official release of Opus 4.6.

He said he “tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. After nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.”

With agent teams, he said, “multiple Claude instances work in parallel on a shared codebase without active human intervention.”

One key task was getting round the need for “an operator to be online and available to work jointly,” which we presume means removing the need for Claude Code to wait for a human to tell it what to do next.

“To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop… When it finishes one task, it immediately picks up the next.” Imagine if humans took that sort of approach.
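
The idea behind such a harness is simple enough to sketch. Below is a minimal illustration in Python (not Carlini’s actual code); the claude command-line invocation, the compiler-repo directory, and the TODO.md file the agents maintain are all assumptions made for the sake of illustration.

```python
# Minimal sketch of an "agent in a loop" harness. Illustrative only, not
# Carlini's harness: the `claude -p` invocation, repo path, and TODO.md
# convention are assumptions.
import subprocess
import time

PROMPT = (
    "Read TODO.md, pick the next most obvious unfinished task, "
    "implement it, run the test suite, then update TODO.md."
)

def run_session() -> int:
    """Run one non-interactive Claude Code session and return its exit code."""
    result = subprocess.run(
        ["claude", "-p", PROMPT],   # assumed non-interactive prompt flag
        cwd="compiler-repo",        # assumed checkout of the shared codebase
        timeout=60 * 60,            # hard cap so a stuck session can't run forever
    )
    return result.returncode

if __name__ == "__main__":
    while True:                     # the "simple loop": finish one task, start the next
        try:
            run_session()
        except subprocess.TimeoutExpired:
            pass                    # kill a runaway session and move on
        time.sleep(5)               # brief pause between sessions
```

Run 16 of those loops against the same repository and you have, in outline, an agent team.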

Carlini continued: “I leave it up to each Claude agent to decide how to act. In most cases, Claude picks up the ‘next most obvious’ problem.” This threw up a number of lessons, including the need to “write extremely high quality tests.”

Readers were also advised to “put yourself in Claude’s shoes.” That means the “test harness should not print thousands of useless bytes” to make it easier for Claude to find what it needs.
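
What that looks like in practice is easy to imagine: a wrapper that runs the test suite and hands the agent only a verdict and the tail of the output. The sketch below is our illustration, not the experiment’s harness, and it assumes a run_tests.sh script in the repo.

```python
# Illustrative sketch of trimming test output before an agent sees it.
# The run_tests.sh script is an assumption; this is not Carlini's harness.
import subprocess

MAX_LINES = 50  # surface only the tail, where the failing case usually is

def run_tests_quietly() -> str:
    """Run the test suite and return a short pass/fail summary for the agent."""
    proc = subprocess.run(
        ["./run_tests.sh"],
        capture_output=True,
        text=True,
    )
    tail = (proc.stdout + proc.stderr).splitlines()[-MAX_LINES:]
    status = "PASS" if proc.returncode == 0 else "FAIL"
    return status + "\n" + "\n".join(tail)

if __name__ == "__main__":
    print(run_tests_quietly())
```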

Also, “Claude can’t tell time and, left alone, will happily spend hours running tests instead of making progress.”

Which might make you feel that working with Claude is closer to working with a regular human than you’d thought. But what was the upshot of all of this?

“Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, a total cost just under $20,000.”

This made it “an extremely expensive project” compared to the priciest Claude Max plans, Carlini said. “But that total is a fraction of what it would cost me to produce this myself – let alone an entire team.”

Other lessons? “The compiler successfully builds many projects, but not all. It’s not yet a drop-in replacement for a real compiler.” Moreover, “the generated code is not very efficient.”

He added that the Rust code quality is “reasonable but… nowhere near the quality of what an expert Rust programmer might produce.”

Carlini concluded: “Agent teams show the possibility of implementing entire, complex projects autonomously.”

But as a former pen-tester, he said fully autonomous development posed real risks. “The thought of programmers deploying software they’ve never personally verified is a real concern.” Ultimately, the experiment “excites me, [but] also leaves me feeling uneasy.”

Comments on GitHub were less equivocal, not least because they felt the $20K price tag ignored a few other elements, such as the vast amount of other programmers’ code the model was trained on in the first place.

As mohswell put it: “If I went to the supermarket, stole a bit of every bread they had, and shoved it together, no one would say I made bread from scratch. They’d say I’m a thief. If this is ‘from scratch,’ then my cooking is farm-to-table.”

While Sambit003 opined: “The comment section and the issue itself is ‘absolute cinema’ moment everyone living through😂… the longer the AI generated codes I see… the safer I feel. 😂 Still we have the jobs (for long enough years)… just enjoy the overhyping bruh.”

Serkosal added plaintively: “okay, nice, could @claude find gf for me? No? I’m not interested.” ®