{"id":89131,"date":"2025-10-18T17:37:08","date_gmt":"2025-10-18T17:37:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/ie\/89131\/"},"modified":"2025-10-18T17:37:08","modified_gmt":"2025-10-18T17:37:08","slug":"its-qwens-world-and-we-get-to-live-in-it-on-caisis-report-gpt-oss-update","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ie\/89131\/","title":{"rendered":"It\u2019s Qwen&#8217;s world and we get to live in it, on CAISI&#8217;s report, &#038; GPT-OSS update"},"content":{"rendered":"<p>Before getting into the latest artifacts, there are a couple of crucial pieces of open-ecosystem news we have to cover.<\/p>\n<p>First, the <a href=\"https:\/\/www.nist.gov\/caisi\" rel=\"nofollow noopener\" target=\"_blank\">Center for AI Standards and Innovation (CAISI)<\/a> released a report that observed the ecosystem and <a href=\"https:\/\/www.nist.gov\/news-events\/news\/2025\/09\/caisi-evaluation-deepseek-ai-models-finds-shortcomings-and-risks\" rel=\"nofollow noopener\" target=\"_blank\">evaluated<\/a> DeepSeek 3.1 against leading closed models. The evaluation scores they highlighted show some discrepancy with accepted results in the community. While MMLU-Pro, GPQA and HLE are close to the self-reported scores from DeepSeek and within usual error bars, the SWE-bench Verified scores are off by a wide margin due to a weak harness for the benchmark. The harness is the software framework the model runs in for agentic benchmarks, and it has as great an impact as the model itself, as shown in <a href=\"https:\/\/epoch.ai\/blog\/what-skills-does-swe-bench-verified-evaluate#scaffolds-matter-as-much-as-models\" rel=\"nofollow noopener\" target=\"_blank\">this SWE-bench analysis by Epoch AI<\/a>. <\/p>\n<p>The CAISI report thus undersells the capabilities of DeepSeek\u2019s models on a core benchmark for recent models (e.g. 
it is one of the benchmarks that Anthropic most heavily relies on for <a href=\"https:\/\/www.interconnects.ai\/p\/claude-4-and-anthropics-bet-on-code\" rel=\"nofollow noopener\" target=\"_blank\">marketing of Claude<\/a>).<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/substackcdn.com\/image\/fetch\/$s_!yoYR!,f_auto,q_auto:good,fl_progressive:steep\/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb23fd663-ef06-4616-a3c4-f1491e1dd69e_1562x706.png\" data-component-name=\"Image2ToDOM\" rel=\"nofollow noopener\" class=\"image-link image2 is-viewable-img\"><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2025\/10\/https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/b23fd663-ef06-4616-a3c4-f1491e1dd69e_1562.png\" width=\"1456\" height=\"658\" data-attrs=\"{&quot;src&quot;:&quot;https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/b23fd663-ef06-4616-a3c4-f1491e1dd69e_1562x706.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" alt=\"Image\" title=\"Image\"   fetchpriority=\"high\" class=\"sizing-normal\"\/><\/a><\/p>\n<p>Later in the report, CAISI shows a graph with cumulative download numbers from HuggingFace (left), something we also show on <a href=\"https:\/\/atomproject.ai\/\" rel=\"nofollow noopener\" target=\"_blank\">atomproject.ai<\/a> (middle, right). 
However, our numbers differ greatly from CAISI\u2019s, and those differ even more from the ones reported by <a href=\"https:\/\/huggingface.co\/blog\/lbourdois\/huggingface-models-stats\" rel=\"nofollow noopener\" target=\"_blank\">HuggingFace itself<\/a>. So, what is going on? <\/p>\n<p><a target=\"_blank\" href=\"https:\/\/substackcdn.com\/image\/fetch\/$s_!EWYT!,f_auto,q_auto:good,fl_progressive:steep\/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe60a29-3da8-468c-b368-aad74cd1255e_2043x501.png\" data-component-name=\"Image2ToDOM\" rel=\"nofollow noopener\" class=\"image-link image2\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2025\/10\/https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/efe60a29-3da8-468c-b368-aad74cd1255e_2043.jpeg\" width=\"1456\" height=\"357\" data-attrs=\"{&quot;src&quot;:&quot;https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/efe60a29-3da8-468c-b368-aad74cd1255e_2043x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:357,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212129,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image\/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https:\/\/www.interconnects.ai\/i\/176399506?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe60a29-3da8-468c-b368-aad74cd1255e_2043x501.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" alt=\"\"   class=\"sizing-normal\"\/><\/a><\/p>\n<p>In short, it depends on which data you look at and how you clean it. For the ATOM Project, we only consider models that were released after ChatGPT and are LLMs (based on our assessment). 
This excludes models like GPT-2 (which is the reason why OpenAI trumps all in the CAISI numbers, left), BERT-like models and ViTs like SigLIP (which dominate the Google download numbers). <\/p>\n<p>On top of this, we performed basic outlier filtering on daily downloads per model. Many models, such as <a href=\"https:\/\/huggingface.co\/Qwen\/Qwen2.5-1.5B-Instruct\" rel=\"nofollow noopener\" target=\"_blank\">Qwen 2.5 1.5B<\/a>, which is one of the most downloaded models of all time, have extreme outliers on the order of 10M+ downloads that can heavily skew the overall numbers. These outliers affect every organization, but to varying degrees. Furthermore, we also exclude quantized versions (like FP8, MLX or GGUF), as those might skew the numbers. <\/p>\n<p>The second news item is an update on the utility of <a href=\"https:\/\/www.interconnects.ai\/p\/gpt-oss-openai-validates-the-open\" rel=\"nofollow noopener\" target=\"_blank\">GPT-OSS<\/a> \u2014 when the models first dropped they were plagued by implementation difficulties downstream of architecture choices (e.g. a new 4-bit precision format) and complex tool use (multiple tool options per category). OpenAI is actually ahead of the curve among open options on the complexity of tools they support with these models. 
Since release, the use of GPT-OSS\u2019s <a href=\"https:\/\/huggingface.co\/openai\/gpt-oss-20b\" rel=\"nofollow noopener\" target=\"_blank\">20B<\/a> and <a href=\"https:\/\/huggingface.co\/openai\/gpt-oss-120b\" rel=\"nofollow noopener\" target=\"_blank\">120B<\/a> models has been very strong, with 5.6M and 3.2M downloads in the last month, respectively. In downloads, they are outpacing some popular models, such as <a href=\"https:\/\/huggingface.co\/Qwen\/Qwen3-4B\" rel=\"nofollow noopener\" target=\"_blank\">Qwen 3 4B<\/a> or <a href=\"https:\/\/huggingface.co\/Qwen\/Qwen3-VL-30B-A3B-Instruct\" rel=\"nofollow noopener\" target=\"_blank\">Qwen3-VL-30B-A3B-Instruct<\/a>. Additionally, I got very strong feedback from the community when I did a <a href=\"https:\/\/x.com\/natolambert\/status\/1974121935408935337\" rel=\"nofollow\">basic pulse check<\/a> on the models. These are among the first models I\u2019d try on my <a href=\"https:\/\/x.com\/natolambert\/status\/1979291280833229246\" rel=\"nofollow\">new Nvidia DGX-Spark<\/a> to get a feel for things.<\/p>\n<p><a href=\"https:\/\/huggingface.co\/ibm-granite\/granite-4.0-h-small\" rel=\"nofollow noopener\" target=\"_blank\">granite-4.0-h-small<\/a> by <a href=\"https:\/\/huggingface.co\/ibm-granite\" rel=\"nofollow noopener\" target=\"_blank\">ibm-granite<\/a>: We\u2019ve been covering IBM and their Granite LLM series for a while. With this release, IBM finally scaled up the model size as well, bringing a lineup of hybrid (attention + Mamba) models, ranging from a 3B dense model to a 32B-A9B MoE. We used the models and were impressed, although not surprised, by the quality, given the persistence of IBM\u2019s team in releasing better and better models. 
<\/p>\n<p>Granite, at least for the 3B variant, is roughly in the <a href=\"https:\/\/huggingface.co\/blog\/smollm3\" rel=\"nofollow noopener\" target=\"_blank\">SmolLM3<\/a> quality range, surpassed only by Qwen3 4B in multilingual and instruction-following capabilities. The tone of Granite 4.0 is refreshingly non-exciting compared to recent sloptimized models (i.e. the industry-wide trend toward playful, emoji-filled, and often sycophantic models), making it feel like old Mistral models in a good way. Interestingly enough, they are also following Qwen\u2019s lead and will release a separate reasoning model later in the year. We\u2019ve heard many reports from people training models that hybrid reasoning \u2014 i.e. a toggle of thinking tokens on and off \u2014 adds a major complexity cost in training that lowers the peak performance of both modes. IBM <a href=\"https:\/\/huggingface.co\/ibm-granite\/granite-3.2-8b-instruct\" rel=\"nofollow noopener\" target=\"_blank\">debuted<\/a> the hybrid thinking approach (togglable via prompts) very early on for open models, and others adopted it later.<\/p>\n<p><a href=\"https:\/\/huggingface.co\/Qwen\/Qwen3-VL-235B-A22B-Instruct\" rel=\"nofollow noopener\" target=\"_blank\">Qwen3-VL-235B-A22B-Instruct<\/a> by <a href=\"https:\/\/huggingface.co\/Qwen\" rel=\"nofollow noopener\" target=\"_blank\">Qwen<\/a>: The Qwen VL series finally gets its long-awaited update, with small dense models (4B, 8B) and larger MoEs (30B-A3B, 235B-A22B) in both instruct and reasoning versions. We want to shine a special spotlight on the <a href=\"https:\/\/huggingface.co\/Qwen\/Qwen3-VL-8B-Instruct\" rel=\"nofollow noopener\" target=\"_blank\">8B variants<\/a>: Their text benchmarks have also improved across the board compared to the initial 8B release \u2014 reinforcing our point on the challenge of hybrid reasoning. 
As the 8B versions did not get a 2507 refresh, these models should be a no-brainer update and drop-in replacement if you were using Qwen3 8B (or are still using Llama 3.1 8B).<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/substackcdn.com\/image\/fetch\/$s_!kyNm!,f_auto,q_auto:good,fl_progressive:steep\/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb992e217-ba81-49a1-82ec-b394f9be7d03_3949x2349.jpeg\" data-component-name=\"Image2ToDOM\" rel=\"nofollow noopener\" class=\"image-link image2 is-viewable-img\"><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2025\/10\/https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/b992e217-ba81-49a1-82ec-b394f9be7d03_3949.jpeg\" width=\"1456\" height=\"866\" data-attrs=\"{&quot;src&quot;:&quot;https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/b992e217-ba81-49a1-82ec-b394f9be7d03_3949x2349.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" alt=\"\"   loading=\"lazy\" class=\"sizing-normal\"\/><\/a><\/p>\n<p><a href=\"https:\/\/huggingface.co\/zai-org\/GLM-4.6\" rel=\"nofollow noopener\" target=\"_blank\">GLM-4.6<\/a> by <a href=\"https:\/\/huggingface.co\/zai-org\" rel=\"nofollow noopener\" target=\"_blank\">zai-org<\/a>: Zhipu has released an update to their main series of models. 
This release is notable because <a href=\"https:\/\/x.com\/victormustar\/status\/1973735580283625618\" rel=\"nofollow\">many<\/a> <a href=\"https:\/\/x.com\/Tim_Dettmers\/status\/1974421423713386661\" rel=\"nofollow\">people<\/a> say that it\u2019s basically a Sonnet (or a Haiku) 4.5 at home, although it falls off harder at longer context than closed models do. Still, high praise, and a continuation of the theme that Chinese open models are improving at an astonishing rate, coming close to the best closed models.<\/p>\n<p><a href=\"https:\/\/huggingface.co\/inclusionAI\/Ling-1T\" rel=\"nofollow noopener\" target=\"_blank\">Ling-1T<\/a> by <a href=\"https:\/\/huggingface.co\/inclusionAI\" rel=\"nofollow noopener\" target=\"_blank\">inclusionAI<\/a>: Inclusion AI is waking up and starting to adopt the release cadence of its bigger brother by releasing models left, right and center. Similar to Qwen, they have also started to scale up model sizes considerably, hitting the 1T threshold. They also <a href=\"https:\/\/huggingface.co\/inclusionAI\/Ring-1T-preview\" rel=\"nofollow noopener\" target=\"_blank\">released a reasoning version<\/a> and are experimenting with different architectures and modalities. 
Keep an eye on them!<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/substackcdn.com\/image\/fetch\/$s_!5EW-!,f_auto,q_auto:good,fl_progressive:steep\/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2127eef5-ff98-4a00-8b54-a50feb1ca654_4770x2187.png\" data-component-name=\"Image2ToDOM\" rel=\"nofollow noopener\" class=\"image-link image2 is-viewable-img\"><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2025\/10\/https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/2127eef5-ff98-4a00-8b54-a50feb1ca654_4770.png\" width=\"1456\" height=\"668\" data-attrs=\"{&quot;src&quot;:&quot;https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/2127eef5-ff98-4a00-8b54-a50feb1ca654_4770x2187.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:668,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" alt=\"\"   loading=\"lazy\" class=\"sizing-normal\"\/><\/a><\/p>\n<p><a href=\"https:\/\/huggingface.co\/moondream\/moondream3-preview\" rel=\"nofollow noopener\" target=\"_blank\">moondream3-preview<\/a> by <a href=\"https:\/\/huggingface.co\/moondream\" rel=\"nofollow noopener\" target=\"_blank\">moondream<\/a>: Moondream is far from an unknown player these days and is known for punching well above its size. They too adopted the MoE architecture, with 9B total and 2B active parameters, and improved the already great benchmarks even further. An interesting aspect is its unique (in the AI world) license:<\/p>\n<p>TL;DR \u2014 You can use Moondream 3 (Preview) freely for personal, research, and most commercial uses. 
What\u2019s NOT allowed without a separate deal is offering a paid product that competes with M87 Labs\u2019 paid versions (e.g., selling hosted or embedded access to the model\u2019s capabilities to third parties).<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/substackcdn.com\/image\/fetch\/$s_!Hub9!,f_auto,q_auto:good,fl_progressive:steep\/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f4f4cde-4ecd-4272-b5bc-216ea67369ea_2322x1050.jpeg\" data-component-name=\"Image2ToDOM\" rel=\"nofollow noopener\" class=\"image-link image2 is-viewable-img\"><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2025\/10\/https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/9f4f4cde-4ecd-4272-b5bc-216ea67369ea_2322.jpeg\" width=\"1456\" height=\"658\" data-attrs=\"{&quot;src&quot;:&quot;https:\/\/substack-post-media.s3.amazonaws.com\/public\/images\/9f4f4cde-4ecd-4272-b5bc-216ea67369ea_2322x1050.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;benchmarks&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" alt=\"benchmarks\" title=\"benchmarks\"   loading=\"lazy\" class=\"sizing-normal\"\/><\/a><\/p>\n<p>In the rest of the issue we cover the long tail of models, which again highlights the sweeping approach we\u2019ve seen throughout the year from Qwen, along with continuing contributions from other rising Chinese labs. One of the sad things in this issue is that zero datasets cleared our bar of relevance. 
Open data continues to be in a very precarious position.<\/p>\n<p><a href=\"https:\/\/huggingface.co\/Qwen\/Qwen3-Next-80B-A3B-Instruct\" rel=\"nofollow noopener\" target=\"_blank\">Qwen3-Next-80B-A3B-Instruct<\/a> by <a href=\"https:\/\/huggingface.co\/Qwen\" rel=\"nofollow noopener\" target=\"_blank\">Qwen<\/a>: Of course, Qwen is also exploring different architectures, releasing an LLM with hybrid attention consisting of Gated DeltaNet and Gated Attention. This model is trained on over 15T tokens and could be the groundwork for the next generation of Qwen models. Junyang Lin writes in a tweet about this series:<\/p>\n<p>Qwen3-Next, or to say, a preview of our next generation (3.5?) is out!<\/p>\n<p>This time we try to be bold, but actually we have been doing experiments on hybrid models and linear attention for about a year. We believe that our solution should be at least a stable and solid solution to new model architecture for super long context!<\/p>\n","protected":false},"excerpt":{"rendered":"Before getting into the latest artifacts, there are a couple of crucial pieces of open-ecosystem news we 
have&hellip;\n","protected":false},"author":2,"featured_media":89132,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[220,218,219,61,60,80],"class_list":{"0":"post-89131","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-ie","12":"tag-ireland","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/89131","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/comments?post=89131"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/89131\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media\/89132"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media?parent=89131"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/categories?post=89131"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/tags?post=89131"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}