Our findings showed that GPT-4-generated materials for both NIPT and BRCA cases remained difficult to read according to established readability thresholds for patient-facing information [22, 23]. However, when evaluated with the same metrics, they were in some cases easier to read than the human-generated materials. The model did not hallucinate in any of its outputs.

Our analysis showed notable differences in how respondents evaluated the materials, particularly those generated by GPT-4. These inconsistencies raise key questions about the model's reliability and potential role in the informed consent (IC) process. Part of this variation could also reflect the inherent variability of both human- and GPT-4-generated IC materials. Human-generated text is shaped by the author's background, communication style, and institutional norms, which significantly limits its standardization as a benchmark. Similarly, GPT-4-generated outputs are variable because the model is probabilistic, meaning outputs can differ across sessions even with identical prompts. This dual variability should be considered when interpreting such findings, as it reflects the real-world variation that occurs in both clinical communication and LLM-generated text. Acknowledging it strengthens the transparency of our methodological approach and contextualizes the inconsistencies observed in both content coverage and evaluation.

GPT-4 struggled to generate NIPT and BRCA materials at the readability levels recommended by leading health organizations
[24, 25] when no specific instructions were provided. Readability varied widely, particularly in the GPT-4-generated material. These results align with previous research in general medical fields [7, 26–30] but contrast with studies in which ChatGPT-3.5 and GPT-4 were explicitly instructed to simplify consent forms [7, 10, 11]. Since GPT-4 was not given pre-written text to simplify, it may have generated material at a higher reading level than expected. Zero-shot GPT-4 produced slightly harder-to-read material than GPT-4 with RAG in both English and Italian, partially aligning with Lai et al. [31], who found that zero-shot GPT-4 underperforms across different languages. GPT-4 generated the most readable texts in Greek; however, this is possibly an artifact of existing readability assessment tools (e.g., SMOG) that are not optimized for Greek's morphology and semantic density. Overall, GPT-4's readability scores closely matched those of the human-generated IC material, suggesting that its default text generation mimics human writing unless explicitly instructed otherwise.
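The readability thresholds discussed above are typically operationalized with grade-level formulas such as SMOG, which maps polysyllable density to a school grade. A minimal sketch of the SMOG computation, assuming a crude vowel-group syllable heuristic (validated tools use calibrated syllable counting and, as noted, are poorly suited to languages such as Greek):

```python
import math
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text: str) -> float:
    # SMOG grade = 3.1291 + 1.043 * sqrt(polysyllables * 30 / sentences),
    # where a polysyllable is any word with three or more syllables.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 3.1291 + 1.043 * math.sqrt(polysyllables * 30 / len(sentences))
```

On this scale, a jargon-heavy consent sentence scores several grades above a plain-language rewrite of the same content, which is the gap the recommended thresholds are meant to close.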
GPT-4-generated materials included some IC components but omitted others, with variation across languages. In German, GPT-4-RAG-generated material outperformed human-written materials, likely owing to RAG's ability to retrieve structured knowledge from reliable sources [32, 33]. This is particularly relevant for BRCA testing, where rising demand [34, 35] may have expanded the available databases, improving RAG's accuracy. However, GPT-4-RAG struggled with non-primary results, yielding lower scores in this area. These findings align with research showing ChatGPT's difficulty in addressing complex genetics-related questions [36], further highlighting its limitations in capturing nuanced IC components regardless of the prompting technique used. GPT-4's difficulty with these components may stem from the distinction between an IC form and an IC process. As a model trained on large text corpora, GPT-4 may be better suited to generating static IC forms that follow standardized formats, such as those commonly used in medical settings where risks, benefits, and procedures are relatively consistent and widely documented. Moreover, as previously noted, most studies evaluating LLMs for IC have focused on surgical settings. These studies [7, 11] typically involve feeding the model existing IC forms, often sourced from large medical centers [11], and prompting it to simplify the content. This may contribute to GPT-4's stronger performance with such materials, as the task aligns closely with both its training data and the structure of the evaluations. In contrast, genomic testing often requires more individualized, context-sensitive information, raising concerns about GPT-4's ability to generate consent materials that go beyond the scope of standard form templates.
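The contrast between zero-shot prompting and RAG comes down to whether trusted source passages are retrieved and injected into the prompt before generation. A minimal sketch of that pattern, in which the corpus, the naive word-overlap ranking, and the prompt wording are illustrative assumptions rather than the study's pipeline (production RAG systems use embedding-based retrieval):

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank passages by naive word overlap with the query (illustrative only).
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    # Assemble a grounded prompt; a zero-shot prompt would omit the sources
    # and rely entirely on the model's parametric knowledge.
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus))
    return (
        "Using ONLY the sources below, draft patient-facing informed-consent "
        "text at an 8th-grade reading level, covering risks, benefits, and "
        "alternatives.\n"
        f"Sources:\n{context}\n"
        f"Task: {query}"
    )
```

Because the retrieved passages anchor the output, this design can improve coverage of components the model would otherwise omit, but it also shifts the burden to the user to supply reliable source documents.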
Additionally, GPT-4 underperformed in Greek, highlighting language biases in AI development. Similar disparities relative to English have been observed in Japanese [17] and Spanish [18]. This supports the assumption that AI systems perform worse in less commonly spoken languages across domains [17, 18, 37, 38]. Language-based disparities in healthcare have been linked to reduced primary care utilization and poorer health outcomes [37, 38], raising concerns about the potential for AI-driven inequities in access to medical information.

Participants' preference for human-generated materials for both NIPT and BRCA was modest. This trend was more pronounced for BRCA, where the GPT-4-RAG-generated material was frequently misidentified as human-written. This suggests that RAG may not only improve informational content but also enhance tone, structure, and style in ways that closely resemble human-generated materials. It also implies, however, that users must supply relevant source information, which can be more challenging for those new to the task or short on time.
Nevertheless, it indicates overall that prompting should be viewed not merely as a technical step to produce an output, but as a critical design decision that can significantly influence both content quality and audience reception.

While participants preferred the human-generated material, the relatively small differences in ratings and the difficulty of distinguishing GPT-4-generated from human text suggest that, in some contexts, GPT-4 may already produce content that meets user expectations, at least at the level of surface communication. This reinforces our approach of evaluating materials not only for readability but also for content completeness and clinical relevance. Finally, the variation in participants' ability to identify GPT-4-generated text for NIPT compared with BRCA further suggests that topic complexity or familiarity may influence how GPT-4-generated materials are interpreted. These findings challenge the assumption that human authorship is inherently superior across contexts [39]. They also align, however, with existing literature demonstrating that humans have difficulty distinguishing ChatGPT-generated medical manuscripts from those written by humans. This has important implications for the medical community, particularly regarding the circulation of inaccurate material and the risk of increased public distrust [40].

This study has several limitations. First, we deliberately sought evaluations from healthcare providers. While this ensured expert assessment, it excluded patients, the primary end-users of IC material. This is a critical limitation, as patient feedback is essential to evaluate whether GPT-4-generated content is clear, relevant, and accessible to its intended audience. Future studies should incorporate patient-centered evaluation; for example, small-scale cognitive interviews or think-aloud sessions could help assess how patients interpret and engage with GenAI-generated IC materials. We are currently conducting a separate study using a think-aloud protocol with patients, which builds directly on this limitation.

Second, our analysis focused on two genetic tests (NIPT and BRCA), limiting the generalizability of these findings to other genetic contexts. However, we observed key patterns, including the omission of certain IC components without explicit prompting, elevated readability levels, and variable expert confidence in the quality of patient-facing material.
These challenges are not unique to NIPT or BRCA and may similarly affect GPT-4-generated materials in other contexts, including carrier screening and whole-genome sequencing, suggesting broader relevance. Future research should assess whether these patterns persist across additional testing scenarios.

Third, our small sample size (N = 25) and uneven distribution across languages pose constraints, and the findings should be considered preliminary. The sample size limits our ability to capture the full range of variation in GPT-4-generated material and provider assessments, reducing the generalizability of the results. The uneven representation of languages likewise limits our ability to draw robust conclusions about language-specific patterns or to generalize across linguistic and cultural contexts; some findings may therefore reflect features unique to specific language groups rather than broader trends. In addition, convenience sampling may have attracted individuals with a stronger interest in IC processes or genetic education, introducing selection bias. This could have influenced how participants engaged with the materials, yielding evaluations that are not fully representative of the broader clinical community. The small sample also precluded exploring variation across professional roles, language groups, and test types. The findings should accordingly be interpreted with caution, and future studies should use larger and more diverse samples to capture broader perspectives.

Another limitation of this study is the exclusive use of the GPT-4 model. We did not include other LLMs, such as Gemini or Copilot, or medical models such as Med-PaLM and Med-Mistral, as they were outside the scope of the study. Such LLMs are likely to differ in clinical accuracy, terminology, and style.
For instance, the Mistral 8×22B LLM has shown promise in enhancing the readability, understandability, and actionability of IC forms without compromising accuracy or completeness [41]. While this highlights the potential of domain-specific models, our focus on a general-purpose model such as GPT-4 strengthens the relevance of our findings to broader, real-world clinical contexts where fine-tuned models may not be readily accessible. Finally, while we carefully designed our prompts, they did not account for patient-specific factors such as literacy level, clinical history, gender, or age. Although our approach enabled a controlled comparison between GPT-4- and human-generated IC material, it did not allow for personalized content generation. Future research should explore LLM-generated IC material tailored to individual patient needs through personalized prompts.

To conclude, GPT-4 struggled to produce comprehensive IC material, failing to address all IC components for NIPT and BRCA testing. Similar results were observed across both testing scenarios and all examined languages, including English. Despite these limitations, the model performed well on structured IC components, such as explaining the purpose of the test, its intended benefits, and the general aim of testing. These components often follow standardized formats and appear in publicly available patient-facing health materials. Considering this, GPT-4 may be most effective at generating standardized patient instructions, medical protocols, or discharge summaries rather than IC materials. GPT-4-RAG-generated materials were more often perceived as human-authored, showed better readability than human-written materials in German and than zero-shot outputs in English and Italian, and received more consistent evaluations from participants. Although these differences were not statistically significant, they suggest that RAG may offer practical advantages over zero-shot prompting in complex clinical communication tasks, such as IC for genetic testing, particularly in non-English languages. Integrating explicit instructions through RAG may improve model performance by ensuring more complete coverage of IC components. Even so, GPT-4's performance in German, Italian, and Greek was poorer than in English. If LLM-generated IC materials favor English-language content, non-English-speaking patients may receive lower-quality health information, further exacerbating existing inequities. Addressing these challenges requires a multifaceted strategy: improving dataset curation, applying multilingual fine-tuning with high-quality, domain-specific texts from underrepresented languages, and designing culturally adapted prompts that reflect local examples, idioms, and healthcare structures. These measures, along with post-generation validation techniques, should be prioritized as technical, methodological, and ethical imperatives.
For now, a hybrid approach, in which GPT-4 generates material and clinicians review and refine it, may be more effective for the IC process in genetic testing.

References

7. Decker, H. et al. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw. Open 6, e2336997 (2023).
10. Patel, I., Om, A., Cuzzone, D. & Garcia Nores, G. Comparing ChatGPT vs. surgeon-generated informed consent documentation for plastic surgery procedures. Aesthetic Surg. J. Open Forum, https://doi.org/10.1093/asjof/ojae092 (2024).
11. Ali, R. et al. Bridging the literacy gap for surgical consents: an AI-human expert collaborative approach. npj Digit. Med. 7, 1–6 (2024).
17. Ando, K. et al. A comparative study of English and Japanese ChatGPT responses to anaesthesia-related medical questions. BJA Open 10, 100296 (2024).
18. Gonzalez Fiol, A. et al. Accuracy of Spanish and English-generated ChatGPT responses to commonly asked patient questions about labor epidurals: a survey-based study among bilingual obstetric anesthesia experts. Int. J. Obstetric Anesthesia 61, 104290 (2024).
22. Borgeaud, S. et al. Improving language models by retrieving from trillions of tokens. arXiv:2112.04426 (2022).
23. Gao, Y. et al. Retrieval-augmented generation for large language models: a survey. arXiv, https://doi.org/10.48550/arXiv.2312.10997 (2023).
24. Wasir, A. S., Volgman, A. S. & Jolly, M. Assessing readability and comprehension of web-based patient education materials by American Heart Association (AHA) and CardioSmart online platform by American College of Cardiology (ACC): how useful are these websites for patient understanding? Am. Heart J. Cardiol. Res. Pract. 32, 100308 (2023).
25. CDC. Health Literacy. https://www.cdc.gov/health-literacy/?CDC_AAref_Val=https://www.cdc.gov/healthliteracy/pdf/simply_put.pdf (2024).
26. Cocci, A. et al. Quality of information and appropriateness of ChatGPT outputs for urology patients. Prostate Cancer Prostatic Dis. 23, 103–108 (2023).
27. Sahin, S. et al. Evaluating ChatGPT-4's performance as a digital health advisor for otosclerosis surgery. Front. Surg. 11, https://doi.org/10.3389/fsurg.2024.1373843 (2024).
28. McCarthy, C. J., Berkowitz, S. A., Ramalingam, V. & Ahmed, M. Evaluation of an artificial intelligence chatbot for delivery of interventional radiology patient education material: a comparison with societal website content. J. Vasc. Interventional Radiol. 34, 1760–1768 (2023).
29. Paran, M., Almog, A., Dreznik, Y., Nesher, N. & Kravarusic, D. A new era in medical information: ChatGPT outperforms medical information provided by online information sheets about congenital malformations. J. Pediatric Surg. 60, https://doi.org/10.1016/j.jpedsurg.2024.161894 (2024).
30. Walker, H. L. et al. Reliability of medical information provided by ChatGPT: assessment against clinical guidelines and patient information quality instrument. https://doi.org/10.2196/47479 (2023).
31. Lai, V. D. et al. ChatGPT beyond English: towards a comprehensive evaluation of large language models in multilingual learning. arXiv, https://doi.org/10.48550/arXiv.2304.05613 (2023).
32. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv:2005.11401 (2020).
33. Lakatos, R., Pollner, P., Hajdu, A. & Tamás, J. Investigating the performance of retrieval-augmented generation and domain-specific fine-tuning for the development of AI-driven knowledge-based systems. Mach. Learn. Knowl. Extract. 7, 15 (2025).
34. Desai, S. & Jena, A. B. Do celebrity endorsements matter? Observational study of BRCA gene testing and mastectomy rates after Angelina Jolie's New York Times editorial. BMJ 355, i6357 (2016).
35. Lippi, G. The risk of unjustified BRCA testing after the "Angelina Jolie effect": how can we save (laboratory) medicine from the Internet? Clin. Chem. Lab. Med. 56, e33–e35 (2018).
36. Khosravi, T., Sudani, A. & Morteza, O. To what extent does ChatGPT understand genetics? Innov. Educ. Teach. Int. 61, 1320–1329 (2023).
37. Jaeger, F. N., Pellaud, N., Laville, B. & Klauser, P. The migration-related language barrier and professional interpreter use in primary health care in Switzerland. BMC Health Serv. Res. 19, https://doi.org/10.1186/s12913-019-4164-4 (2019).
38. Eslier, M. et al. Association between language barrier and inadequate prenatal care utilization among migrant women in the PreCARE prospective cohort study. Eur. J. Public Health 33, 403–410 (2023).
39. Wang, S. & Huang, G. The impact of machine authorship on news audience perceptions: a meta-analysis of experimental studies. Commun. Res. 51, 815–842 (2024).
40. Helgeson, S. A. et al. Human reviewers' ability to differentiate human-authored or artificial intelligence-generated medical manuscripts. Mayo Clin. Proc. 100, 622–633 (2025).
41. Shi, Q. et al. Transforming informed consent generation using large language models: insights, best practices, and lessons learned for clinical trials. JMIR Med. Inform., https://doi.org/10.2196/68139 (2025).