A new benchmark by Ruby committer Yusuke Endoh tested how efficiently Claude Code generates working implementations in 13 programming languages. Across 600 runs, the dynamic languages Ruby, Python, and JavaScript were consistently the fastest, cheapest, and most stable, while statically typed languages were 1.4 to 2.6 times slower and more expensive.
The experiment, published on DEV Community with all code and results available on GitHub, asked Claude Code (Opus 4.6) to implement a simplified version of Git in each language. The task was split into two phases: v1 implementing init, add, commit, and log from an empty directory, and v2 extending the project with status, diff, checkout, and reset. Each language was run 20 times. The author used a custom hash algorithm rather than SHA-256 to eliminate differences in library dependencies across languages.
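The article does not say which custom algorithm Endoh used, only that it replaced SHA-256 so no language would gain or lose from its standard library. As a purely illustrative stand-in, a dependency-free content hash of that kind can be written in a few lines of plain byte arithmetic; the FNV-1a sketch below (in Ruby, the benchmark's top performer) is an assumption, not the benchmark's actual function:

```ruby
# Illustrative only: a 64-bit FNV-1a content hash, standing in for the
# benchmark's unspecified custom algorithm. It needs no libraries, so every
# language can implement it identically.
def content_hash(data)
  hash = 0xcbf29ce484222325                 # FNV-1a 64-bit offset basis
  data.each_byte do |b|
    hash ^= b
    hash = (hash * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, truncated to 64 bits
  end
  format("%016x", hash)                     # hex digest, like Git's object IDs
end
```

Because the function is nothing but XORs and multiplies, no run can be helped or hurt by whether a language ships SHA-256 in its standard library, which is exactly the confound the author wanted to remove.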
Ruby averaged $0.36 per run at 73.1 seconds, Python came in at $0.38 and 74.6 seconds, and JavaScript at $0.39 and 81.1 seconds. All three showed low variance and passed all tests across all 40 runs. From fourth place onward, costs rose and variance increased sharply. Go averaged $0.50 at 101.6 seconds, but with a standard deviation of 37 seconds. Rust averaged $0.54 with the widest spread, a 54.8-second standard deviation, and was one of only two languages with test failures. C was the most expensive mainstream language at $0.74, weighed down by generating 517 lines of code compared to Ruby’s 219.
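The reported figures are per-language means and standard deviations over repeated runs. The sketch below (with invented timings, not the benchmark's data) shows how a stable profile and a high-variance one like Go's are summarized:

```ruby
# Illustrative statistics over made-up per-run timings; the real dataset is
# in the benchmark repository.
def mean(xs) = xs.sum.to_f / xs.size

def stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / (xs.size - 1))  # sample standard deviation
end

runs = {
  "ruby" => [70.2, 75.9, 73.5, 72.8],   # tight cluster -> low variance
  "go"   => [65.0, 140.3, 99.5, 101.6], # wide spread  -> high variance
}
runs.each do |lang, ts|
  puts format("%-5s mean=%6.1fs sd=%5.1fs", lang, mean(ts), stddev(ts))
end
```

A large standard deviation matters as much as the mean here: it means some runs took the model multiple attempts or long detours, which is what the fourth-place-onward languages exhibited.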
The type system findings may be the most practically useful result for teams evaluating AI coding workflows. Adding mypy strict checking to Python made it 1.6 to 1.7 times slower. Adding Steep type checking to Ruby imposed an even larger penalty, making it 2.0 to 3.2 times slower than plain Ruby. TypeScript was notably more expensive than JavaScript, averaging $0.62 versus $0.39, despite producing similar line counts. The author notes that the overhead is not just from generating type annotations but likely from higher thinking-token usage as the model reasons about type constraints.
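Part of that overhead is concrete extra output: a Steep-checked Ruby run must generate RBS signature files alongside the implementation. The example below is hypothetical, not taken from the benchmark's generated sources; it only illustrates the shape of the additional artifact:

```ruby
# lib/blob.rb -- plain Ruby, generated in typed and untyped runs alike.
class Blob
  attr_reader :data

  def initialize(data)
    @data = data
  end

  def size
    data.bytesize
  end
end

# A Steep-checked run must additionally emit an RBS signature file,
# e.g. sig/blob.rbs (shown here as a comment):
#
#   class Blob
#     attr_reader data: String
#     def initialize: (String data) -> void
#     def size: () -> Integer
#   end
#
# `steep check` then verifies the .rb source against the .rbs signatures.
```

TypeScript folds its annotations into the source itself rather than a separate file, which fits the observation that it produced similar line counts to JavaScript while still costing more, consistent with the author's thinking-token explanation.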
Endoh is transparent about the limitations. He is a Ruby committer and flags that bias. The generated programs are roughly 200 lines of code, firmly at prototyping scale, and he acknowledges that static typing may prove advantageous in larger codebases. The experiment was also supported by Anthropic’s Claude for Open Source Program, which provided six months of free Claude Max access. The benchmark only measures generation cost and speed, not code quality, maintainability, or runtime performance.
Discussion on Lobsters challenged whether prototyping-scale conclusions can be drawn from 200-line outputs, with one commenter noting that very few useful prototypes are that small. Others pointed out that the benchmark does not account for ecosystem effects: languages with strong package ecosystems would require far less generated code for real-world tasks. A commenter on the DEV Community post raised a qualitative concern: a 2x speedup is potentially offset if the generated code is harder to modify later, and the Rust and Haskell test failures should not simply be categorized as bugs, since stricter type systems are designed to catch errors early rather than letting them reach production.
Endoh addresses several of these points directly. On scale, he agrees that a larger benchmark would be valuable but notes the difficulty of designing one that is fair across 15 language configurations. On the 2x speed difference, he argues that in iterative AI-assisted development, the gap between waiting 30 seconds and 60 seconds matters for developer flow, though he concedes the difference becomes irrelevant if future models reduce generation times to sub-second levels. On ecosystem effects, he deliberately excluded library dependencies to isolate language-level differences, using a custom hash function for exactly this reason.
Out of 600 total runs, only three produced failures: two in Rust and one in Haskell. In one Rust failure log, the agent claimed the tests themselves were wrong, which the author identified as a hallucination, since all other Rust trials passed the same tests.
The full dataset, including per-run results, execution logs, and all generated source code, is available in the benchmark repository.