{"id":54778,"date":"2025-10-01T11:07:09","date_gmt":"2025-10-01T11:07:09","guid":{"rendered":"https:\/\/www.newsbeep.com\/ie\/54778\/"},"modified":"2025-10-01T11:07:09","modified_gmt":"2025-10-01T11:07:09","slug":"new-project-makes-wikipedia-data-more-accessible-to-ai","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ie\/54778\/","title":{"rendered":"New project makes Wikipedia data more accessible to AI"},"content":{"rendered":"<p id=\"speakable-summary\" class=\"wp-block-paragraph\">On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia\u2019s wealth of knowledge more accessible to AI models.<\/p>\n<p class=\"wp-block-paragraph\">Called the Wikidata Embedding Project, the system applies a vector-based semantic search \u2014 a technique that helps computers understand the meaning and relationships between words \u2014 to the existing data on Wikipedia and its sister platforms, consisting of nearly 120 million entries. <\/p>\n<p class=\"wp-block-paragraph\">Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes the data more accessible to natural language queries from LLMs.<\/p>\n<p class=\"wp-block-paragraph\">The project was undertaken by Wikimedia\u2019s German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time training-data company owned by IBM.<\/p>\n<p class=\"wp-block-paragraph\">Wikidata has offered machine-readable data from Wikimedia properties for years, but the pre-existing tools only allowed for keyword searches and SPARQL queries, a specialized query language. 
The new system will work better with retrieval-augmented generation (RAG) systems that allow AI models to pull in external information, giving developers a chance to ground their models in knowledge verified by Wikipedia editors.<\/p>\n<p class=\"wp-block-paragraph\">The data is also structured to provide crucial semantic context. Querying the database for <a rel=\"nofollow noopener\" href=\"https:\/\/www.wikidata.org\/wiki\/Q901\" target=\"_blank\">the word \u201cscientist,\u201d<\/a> for instance, will produce lists of prominent nuclear scientists as well as scientists who worked at Bell Labs. There are also translations of the word \u201cscientist\u201d into different languages, a Wikimedia-cleared image of scientists at work, and extrapolations to related concepts like \u201cresearcher\u201d and \u201cscholar.\u201d<\/p>\n<p class=\"wp-block-paragraph\">The database is <a rel=\"nofollow noopener\" href=\"https:\/\/wd-vectordb.toolforge.org\" target=\"_blank\">publicly accessible on Toolforge<\/a>. Wikidata is also hosting <a rel=\"nofollow noopener\" href=\"https:\/\/www.wikidata.org\/wiki\/Event:Embedding_Project_Webinar\" target=\"_blank\">a webinar for interested developers<\/a> on October 9th.<\/p>\n<p class=\"wp-block-paragraph\">The new project comes as AI developers are scrambling for high-quality data sources that can be used to fine-tune models. The training systems themselves have become more sophisticated \u2014 often assembled <a href=\"https:\/\/techcrunch.com\/2025\/09\/21\/silicon-valley-bets-big-on-environments-to-train-ai-agents\/\" rel=\"nofollow noopener\" target=\"_blank\">as complex training environments<\/a> rather than simple datasets \u2014 but they still require closely curated data to function well. 
For deployments that require high accuracy, the need for reliable data is particularly urgent, and while some might look down on Wikipedia, its data is significantly more fact-oriented than catchall datasets like <a rel=\"nofollow noopener\" href=\"https:\/\/commoncrawl.org\/\" target=\"_blank\">the Common Crawl<\/a>, which is a massive collection of web pages scraped from across the internet.<\/p>\n<p class=\"wp-block-paragraph\">In some cases, the push for high-quality data can have expensive consequences for AI labs. In August, Anthropic offered to settle a lawsuit with a group of authors whose works had been used as training material, by agreeing to <a href=\"https:\/\/techcrunch.com\/2025\/08\/26\/anthropic-settles-ai-book-training-lawsuit-with-authors\/\" rel=\"nofollow noopener\" target=\"_blank\">pay $1.5 billion<\/a> to end any claims of wrongdoing.<\/p>\n<p class=\"wp-block-paragraph\">In a statement to the press, Wikidata AI project manager Philippe Saad\u00e9 emphasized his project\u2019s independence from major AI labs or large tech companies. \u201cThis Embedding Project launch shows that powerful AI doesn\u2019t have to be controlled by a handful of companies,\u201d Saad\u00e9 told reporters. 
\u201cIt can be open, collaborative, and built to serve everyone.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia\u2019s wealth of knowledge more accessible to&hellip;\n","protected":false},"author":2,"featured_media":54779,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[220,218,219,61,60,80,15920,38970],"class_list":{"0":"post-54778","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-ie","12":"tag-ireland","13":"tag-technology","14":"tag-training-data","15":"tag-wikimedia"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/54778","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/comments?post=54778"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/54778\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media\/54779"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media?parent=54778"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/categories?post=54778"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/tags?post=54778"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}