A new study involved participants evaluating medical responses that were either written by a medical doctor or generated by a large language model. Results showed participants could not distinguish between doctors’ responses and AI-generated responses, but preferred the AI-generated ones. They rated high-accuracy AI responses the highest, while low-accuracy AI responses and doctors’ responses received similar ratings. The paper was published in NEJM AI.

The use of artificial intelligence (AI) systems in medicine and health care has increased dramatically in recent years, across domains ranging from radiological imaging and mental health chatbots to drug discovery.

One particular application of AI in medicine, especially of large language models (LLMs), is answering patients’ questions. One study showed that ChatGPT generated higher-quality and more empathetic responses to patient questions than medical doctors did. AI also appears to excel in diagnostics: one study found that AI alone outperformed physicians in making diagnoses, while a follow-up showed that physicians augmented with AI performed comparably to AI alone, and both outperformed physicians working without AI.

Study author Shruthi Shekar and her colleagues wanted to investigate how well people distinguish between responses to patients’ questions given by medical doctors and those generated by AI. Participants were also asked to rate the validity, trustworthiness, completeness, and other aspects of the answers.

The researchers retrieved 150 anonymous medical questions and doctors’ responses from the online forum HealthTap. The questions were distributed equally across six domains of medicine: preventative care and risk factors; conditions and symptoms; diagnostics and tests; procedures and surgeries; medication and treatments; and recovery and wellness.

The researchers then used GPT-3 to generate an AI response to each of these questions. Four physicians evaluated the AI-generated responses to establish their accuracy, and this assessment was used to classify them as high- or low-accuracy.

Next, in the first experiment, 100 online participants were each presented with 10 medical question-response pairs randomly selected from a pool of 30 high-accuracy AI responses, 30 low-accuracy AI responses, and 30 doctors’ responses, and were asked to judge whether each response had been written by a doctor or generated by AI.

In the second experiment, 100 online participants rated how well they understood each question and response, along with the response’s perceived validity, trustworthiness, and completeness, and their satisfaction with it. They also indicated whether they would search for additional information based on the response, whether they would follow the advice it contained, and whether it would prompt them to seek medical attention.

In the third experiment, 100 online participants provided the same ratings, but were randomly told that the responses came from a doctor, an AI, or a doctor assisted by an AI.

Results showed that participants were unable to effectively distinguish between AI-generated responses and doctors’ responses. However, they showed a preference for AI-generated responses, rating high-accuracy AI-generated responses as significantly more valid, trustworthy, and complete than the other two types of responses. Low-accuracy AI responses tended to receive ratings similar to those given to doctors’ responses.

Interestingly, participants not only found the low-accuracy AI responses to be as trustworthy as doctors’ responses, they also reported a high tendency to follow the potentially harmful medical advice contained in those responses and to seek unnecessary medical attention as a result. These problematic reactions were comparable to, and sometimes even stronger than, their reactions to doctors’ responses. The study authors note that both experts (raters) and nonexperts (participants) tended to find AI-generated responses more thorough and accurate than doctors’ responses, but still valued the involvement of a doctor in the delivery of medical advice.

“The increased trust placed in inaccurate or inappropriate AI-generated medical advice can lead to misdiagnosis and harmful consequences for individuals seeking help. Further, participants were more trusting of high-accuracy AI-generated responses when told they were given by a doctor, and experts rated AI-generated responses significantly higher when the source of the response was unknown,” the study authors concluded.

The study sheds light on how people perceive medical advice generated by AI systems. However, it should be noted that the questions and responses used in the study were taken from an online forum where medical doctors contribute voluntarily. The answers were likely written to be helpful rather than to be the most thorough responses those doctors could give. Studies comparing AI-generated content with answers from doctors explicitly trying to provide their best responses might yield different results.

The paper, “People Overtrust AI-Generated Medical Advice despite Low Accuracy,” was authored by Shruthi Shekar, Pat Pataranutaporn, Chethan Sarabu, Guillermo A. Cecchi, and Pattie Maes.