{"id":32560,"date":"2025-07-24T01:15:08","date_gmt":"2025-07-24T01:15:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/32560\/"},"modified":"2025-07-24T01:15:08","modified_gmt":"2025-07-24T01:15:08","slug":"a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/32560\/","title":{"rendered":"A new AI coding challenge just published its first results \u2013 and they aren\u2019t pretty"},"content":{"rendered":"<p id=\"speakable-summary\" class=\"wp-block-paragraph\">A new AI coding challenge has revealed its first winner \u2014 and set a new bar for AI-powered software engineers.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.<\/p>\n<p class=\"wp-block-paragraph\">\u201cWe\u2019re glad we built a benchmark that is actually hard,\u201d said Konwinski. \u201cBenchmarks should be hard if they\u2019re going to matter,\u201d he continued, adding: \u201cScores would be different if the big labs had entered with their biggest models. But that\u2019s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that. 
It levels the playing field.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.<\/p>\n<p class=\"wp-block-paragraph\">Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a test of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a \u201ccontamination-free version of SWE-Bench,\u201d using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.<\/p>\n<p class=\"wp-block-paragraph\">The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier \u2018Verified\u2019 test and 34% on its harder \u2018Full\u2019 test. 
Konwinski still isn\u2019t sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.<\/p>\n<p class=\"wp-block-paragraph\">\u201cAs we get more runs of the thing, we\u2019ll have a better sense,\u201d he told TechCrunch, \u201cbecause we expect people to adapt to the dynamics of competing on this every few months.\u201d<\/p>\n<p class=\"wp-block-paragraph\">It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available \u2013 but with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving <a rel=\"nofollow noopener\" href=\"https:\/\/www.technologyreview.com\/2025\/05\/08\/1116192\/how-to-build-a-better-ai-benchmark\/\" target=\"_blank\">AI\u2019s growing evaluation problem<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">\u201cI\u2019m quite bullish about building new tests for existing benchmarks,\u201d says Princeton researcher Sayash Kapoor, who put forward a similar idea <a rel=\"nofollow noopener\" href=\"https:\/\/arxiv.org\/abs\/2407.01502\" target=\"_blank\">in a recent paper<\/a>. \u201cWithout such experiments, we can\u2019t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.\u201d<\/p>\n<p class=\"wp-block-paragraph\">For Konwinski, it\u2019s not just a better benchmark, but an open challenge to the rest of the industry. \u201cIf you listen to the hype, it\u2019s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that\u2019s just not true,\u201d he says. 
\u201cIf we can\u2019t even get more than 10% on a contamination free SWE-Bench, that\u2019s the reality check for me.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"A new AI coding challenge has revealed its first winner \u2014 and set a new bar for AI-powered&hellip;\n","protected":false},"author":2,"featured_media":32561,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[182,27248,181,507,27249,27250,74],"class_list":{"0":"post-32560","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-andy-konwinski","10":"tag-artificial-intelligence","11":"tag-artificialintelligence","12":"tag-k-prize","13":"tag-laude-institute","14":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/32560","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/comments?post=32560"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/32560\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media\/32561"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media?parent=32560"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/categories?post=32560"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/tags?post=32560"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}