{"id":326232,"date":"2026-03-12T16:34:07","date_gmt":"2026-03-12T16:34:07","guid":{"rendered":"https:\/\/www.newsbeep.com\/nz\/326232\/"},"modified":"2026-03-12T16:34:07","modified_gmt":"2026-03-12T16:34:07","slug":"journal-of-medical-internet-research-44","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/nz\/326232\/","title":{"rendered":"Journal of Medical Internet Research"},"content":{"rendered":"<p class=\"abstract-paragraph\">Key Takeaways<\/p>\n<p>An open access dataset has highlighted how bad data can propagate through the research ecosystem. When trained on unvalidated datasets, machine learning can amplify misinformation, erode trust in science, and harm vulnerable populations. Enforced data provenance systems could play a key role in preventing bad data from corrupting the scientific record.<\/p>\n<p class=\"abstract-paragraph\">When an unvalidated dataset recently made it into the medical literature, it exposed several weaknesses in data governance. The dataset was uploaded to Kaggle\u2014a large online platform where users can share publicly accessible data, code, and models\u2014and was fundamentally flawed []. 
Its developer had compiled unverified images of children from websites related to autism to train an artificial intelligence (AI) model to \u201cdetect the presence of autism or the absence thereof\u201d from the scraped images [].<\/p>\n<p class=\"abstract-paragraph\">A sharp-eyed reviewer exposed the problem only at the publication stage; by December 2025, it was estimated that over 90 published papers had incorporated the bad data, leading to investigations and double-digit retractions [,].<\/p>\n<p class=\"abstract-paragraph\">These kinds of data integrity and governance failures are particularly consequential because of how early they occur in the research life cycle. With open access datasets fueling large-scale machine learning and other AI research, analyses can be generated and published at unprecedented speed and scale, allowing data issues to propagate rapidly throughout the research ecosystem. Far from an isolated incident [-], this situation highlights the need for more robust and proactive data governance solutions.<\/p>\n<p>The Impact of Bad Data<\/p>\n<p class=\"abstract-paragraph\">Anne Borden is an autism advocate, journalist, and author of the upcoming book The Informed Parent\u2014a decision-making guide for parents of autistic children. For Borden, the priority here is to learn from this \u201cbizarre story\u201d and fix the system without delay. \u201cYou really have to stop misinformation being perpetuated under the banner of science,\u201d she says, \u201cbecause once it\u2019s out there, you\u2019re done. The Internet is forever.\u201d<\/p>\n<p>Bad Data, Bad Science: Who Should Fix This?<\/p>\n<p class=\"abstract-paragraph\">Who are the custodians of good data during their migration from a spreadsheet to the scientific record? What role should each stakeholder play in maintaining data integrity? 
While responsibility for data governance is distributed across many actors (including researchers and regulators), data-sharing platforms, research and funding institutions, and academic publishers help determine how data are shared, vetted, and ultimately incorporated into the scientific record.<\/p>\n<p>The Data-Sharing Platforms<\/p>\n<p class=\"abstract-paragraph\">Open access databases and data repositories, like Kaggle and GitHub, are popular resources that software developers and data scientists can use, free of charge, to train their machine learning algorithms. Software development benefits from these repositories, yet the datasets they host often lack the documentation, governance, and quality practices required for careful medical research or clinical algorithm development [].<\/p>\n<p class=\"abstract-paragraph\">Alan Katz, MBChB, MSc, CCFP, is a professor of family medicine and community health sciences and a senior scientist at the Manitoba Centre for Health Policy (MCHP). Katz found the dataset revelations \u201cboth shocking, but also not surprising\u201d due to the rapid expansion of open access databases and their widespread use in machine learning and AI research. Kaggle-style data-sharing platforms differ sharply from established medical databases, such as those maintained by the MCHP, which employs full-time staff tasked with validating all new data before uploading them. Katz says, \u201cWe take our ethical standards as seriously as clinical trials do.\u201d<\/p>\n<p class=\"abstract-paragraph\">Elizabeth Green, DPhil, is a lecturer in business and law at the University of the West of England, Bristol. Her research focuses on data integrity, and while she has seen cases like this before, she doesn\u2019t believe locking data away is necessarily the solution []. 
For example, DermAtlas\u2014an open-source medical database of skin conditions\u2014is a \u201cfantastic resource,\u201d she says, and \u201cextremely helpful, especially in [diagnosing] some extremely rare cases.\u201d To balance the risks and benefits of open data, the focus should instead be on building better governance systems.<\/p>\n<p>The Institutions<\/p>\n<p class=\"abstract-paragraph\">Other stakeholders in the data transformation journey are the institutions that conduct primary medical research and the public agencies that fund that research. Is it time to adopt and enforce international data integrity and ethics standards at all research institutions, or would this be an affront to academic freedom?<\/p>\n<p class=\"abstract-paragraph\">Funding bodies have traditionally taken a dim view of researchers who waste public funds on bogus science, which can jeopardize those researchers\u2019 future grants. Indeed, in many but not all regions of the world, funding is contingent on maintaining ethical research standards. In Canada, Katz says, \u201cour existence is 100% dependent on having those strict ethical guidelines.\u201d<\/p>\n<p>The Journals<\/p>\n<p class=\"abstract-paragraph\">The research integrity pipeline involves several stakeholders, each with a distinct role in maintaining the standards of academic research. Gatekeepers in the system\u2014one of the last lines of defense\u2014are the academic journals. Journals have a vested interest in maintaining high academic standards and may be well placed to dictate the terms of engagement.<\/p>\n<p class=\"abstract-paragraph\">Felix Ritchie, PhD\u2014a colleague of Elizabeth Green\u2014developed the Five Safes data integrity framework for just this purpose []. Ritchie describes it as \u201ca flexible structure for thinking about [data],\u201d which includes the provenance and ethics of data use. 
Numerous organizations worldwide have adopted the Five Safes framework to date, and Australia has recently enshrined it in legislation [].<\/p>\n<p class=\"abstract-paragraph\">Viewed through an ethical lens, the Five Safes could form the backbone of a data provenance system that requires compliance before a manuscript can be considered for publication.<\/p>\n<p>Data Provenance: The Five Safes in Action<\/p>\n<p class=\"abstract-paragraph\">Ritchie\u2019s Five Safes framework allows for effective data validation and, when combined with modern ethical standards, can restore trust by filtering data sources through five discrete tests:<\/p>\n<p>Safe Project: Data should be ethically collected and clinically validated by experts. Safe People: Researchers accessing the data must be qualified and specifically trained in using AI-based datasets. Safe Data: Data should be independently validated, and any accesses or modifications should be tracked. Safe Settings: Health data should be acquired in a clinical setting and stored securely. Safe Outputs: Results should be derived using valid methodologies and statistics.<\/p>\n<p>Restoring Data Integrity<\/p>\n<p class=\"abstract-paragraph\">How can one implement a data provenance system?<\/p>\n<p class=\"abstract-paragraph\">Ritchie feels that applying the Five Safes framework to an ethical dataset is the way forward. 
\u201cThere is a need for a register of validated, ethical datasets,\u201d he says. \u201cThat would really be a game changer.\u201d<\/p>\n<p class=\"abstract-paragraph\">A possible workflow could include the following:<\/p>\n<p>Data are collected by medical experts and validated by a third-party certification service. The data are stored in an accredited data registry and protected by blockchain cybersecurity\u2014the same technology that safeguards financial transactions. Researchers access these datasets and use them for approved research purposes. A submitted manuscript would need ethical approval and a data security certificate before verification by a journal\u2019s research integrity team.<\/p>\n<p class=\"abstract-paragraph\">Ritchie sums it up nicely: \u201cUnless you use a validated data set, you\u2019re not getting published, mate.\u201d That\u2019s a powerful incentive.<\/p>\n<p>Opportunity for Self-Reflection and Correction<\/p>\n<p class=\"abstract-paragraph\">Machine learning and other AI technologies have the capacity to transform medical research in ways we are only beginning to understand. However, this episode has shown how quickly human frailties, such as blind trust in open access data and a lack of institutional ethical oversight within our publish-or-perish culture, can allow such technologies to amplify misinformation.<\/p>\n<p class=\"abstract-paragraph\">While the impact of this situation was ultimately contained, it is nevertheless an important opportunity for self-reflection among all in the research ecosystem. It\u2019s a chance and, perhaps, a responsibility to fix the flaws and prevent history from repeating itself.<\/p>\n<p>None declared.<\/p>\n<p>\u00a9 JMIR Publications. Originally published in the Journal of Medical Internet Research (https:\/\/www.jmir.org), 12.Mar.2026. 
<\/p>\n","protected":false},"excerpt":{"rendered":"Key Takeaways An open access dataset has highlighted how bad data can propagate through the research ecosystem.When trained&hellip;\n","protected":false},"author":2,"featured_media":326233,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[175813,111,139,69,147],"class_list":{"0":"post-326232","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-science","8":"tag-artificial-intelligence-data-management-data-sharing-research-ethics-data-quality-research-integrity-scientific-misconduct-data-integrity-data-provenance-retraction-of-publication","9":"tag-new-zealand","10":"tag-newzealand","11":"tag-nz","12":"tag-science"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/posts\/326232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/comments?post=326232"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/posts\/326232\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/media\/326233"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/media?parent=326232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/categories?post=326232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/tags?post=326232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}