The AI-enabled CDSS was used in 36,670 (46.8%) of the 78,366 clinical consultations at the 16 Penda Health facilities over the 3-month study period. Uptake increased over the evaluation period, although with substantial variation across clinics. In July 2024, consultations using the tool ranged from 29.3% (Zimmerman) to 53.7% (Kawangware). By September 2024, usage had increased across nearly all clinics, with 7/16 exceeding 60% utilization. The highest proportions were observed in Kangemi (80.6%) and Kasarani (75.1%). Clinics such as Zimmerman and Umoja 2 showed a more modest increase (reaching 44.7% and 40.7%, respectively, by September) (Supplementary Fig. 1).

Study cohort characteristics

A total of 1,469 patient encounters were included in the evaluation, representing 1.9% of the clinical consultations over the evaluation period. Approximately a quarter of the patients were below 5 years of age (n = 350), 15% were aged 5–17 years (n = 225), half were aged 18–39 years (n = 730), 10% were aged 40–59 years (n = 150) and 1% were 60 years and above (n = 14). Based on the gender recorded in the electronic medical record (EMR), 612 (42%) were male and 857 (58%) were female. Key results stratified by patient age group are presented in Table 1.

Table 1 Selected evaluation metrics across all domains of quality and safety, stratified by patient age

Respiratory system presentations were the most common (n = 561; 38%), followed by gastrointestinal (n = 362; 25%) and then genitourinary/reproductive (n = 217; 15%) presentations. Other categories included dermatological (n = 149; 10%), musculoskeletal (n = 130; 9%), febrile/infectious (n = 109; 7%), neurological/psychiatric (n = 73; 5%), ear, nose and throat/dental/ophthalmological (n = 64; 4%), cardiovascular (n = 39; 3%) and unspecified/other (n = 101; 7%); percentages sum to more than 100% because an encounter could involve more than one presentation category.

Baseline assessment of clinicians’ documentation (the input to the AI Consult)

Initial clinical documentation was rated as acceptable for 878/1,469 records (60%, 95% confidence interval (CI) 57.2–62.9), while a further 250 (17%, 95% CI 15.1–19.0) were graded as high quality. Documentation requiring significant improvement accounted for 279 cases (19%, 95% CI 17.0–21.1), and 62 records (4%, 95% CI 3.3–5.4) were rated as totally inadequate. Interrater reliability statistics for all metrics are reported in Supplementary Table 1.
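The text does not state which interval method was used for the binomial proportions reported throughout this section. As an illustrative sketch only, the Wilson score interval (a common default for proportions at this sample size) reproduces bounds close to, though not exactly matching, those reported for the acceptable-documentation rating:

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

# Acceptable initial documentation: 878 of 1,469 records
lo, hi = wilson_ci(878, 1469)
print(f"{878/1469:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

Any discrepancy with the published bounds likely reflects a different interval method (for example, exact or cluster-adjusted intervals); the choice here is an assumption for illustration.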

Safety concerns were identified in the initial documentation of 547 records (37%, 95% CI 34.8–39.8). The most common safety concerns were inappropriate medication in 272 cases (49%, 95% CI 45.5–54.0), omission of critical differential diagnoses in 230 (42%, 95% CI 37.9–46.3) and incorrect diagnoses in 120 cases (22%, 95% CI 18.5–25.6). Less frequent issues included incorrect or unsafe dosing (61; 11.2%, 95% CI 8.6–14.1), inappropriate investigations or tests (46; 8.4%, 95% CI 6.2–11.1), culturally or contextually inappropriate guidance (5; 0.9%, 95% CI 0.3–2.1) and other concerns (66; 12.1%, 95% CI 9.5–15.1). The perceived likelihood of harm associated with these concerns varied considerably: low in 243 cases (44%, 95% CI 40.2–48.7), moderate in 211 cases (39%, 95% CI 34.5–42.8), guaranteed in 42 cases (7.7%, 95% CI 5.6–10.2) and none in 51 cases (9.3%, 95% CI 7.0–12.1). Similarly, the severity of potential harm was rated as moderate in 212 cases (39%, 95% CI 34.7–42.9), mild in 198 (36%, 95% CI 32.2–40.4), severe in 112 (20%, 95% CI 17.2–24.1) and life-threatening in 25 (4.6%, 95% CI 3.0–6.7). Safety concerns were less common when initial documentation was high quality and rose as quality declined. Specifically, 36/250 high-quality notes (14%, 95% CI 10.3–19.4) contained safety concerns, compared with 309/878 acceptable notes (35%, 95% CI 32.0–38.5), 163/279 records needing significant improvement (58%, 95% CI 52.4–64.3) and 39/62 totally inadequate records (63%, 95% CI 49.7–74.8). In lower-quality strata, concerns also shifted towards higher likelihood and severity categories. Summaries of the likelihood, severity and type of harm, stratified by each documentation quality rating, are presented in Supplementary Tables 2–6 and illustrated in Fig. 2.

Fig. 2: Sankey plot of documentation quality, associated risk, likelihood, severity and type of harm.

Illustration of flows from the quality of initial clinical documentation (high quality, acceptable, needs significant improvement or totally inadequate) to the presence of identified safety concerns, the estimated likelihood of harm (none, low, moderate or guaranteed) and the corresponding severity of harm (mild, moderate, severe or life-threatening). Flow widths are proportional to the number of encounters.

Appropriateness of the LLM-based CDSS outputs

Global assessment

Of the 1,469 AI-generated responses, 1,213 (83%, 95% CI 80.5–84.5) fully addressed the clinical issues identified in the initial documentation, with a further 206 (14%, 95% CI 12.3–15.9) mostly addressing them. Only 49 responses (3.3%, 95% CI 2.5–4.4) partially addressed the issues, and a single response (<0.1%, 95% CI 0.0–0.4) was judged to have no relevance to the clinical problem. Nearly all AI-generated responses, 1,456/1,469 (99%, 95% CI 98.5–99.5), prioritized the advice appropriately based on the initial documentation, and 1,459 responses (99%, 95% CI 98.8–99.7) were assessed as having effectively communicated their advice.

Hallucinated content, defined as fabricated or clinically inaccurate information, was identified in 50 responses (3.4%, 95% CI 2.5–4.5). Post-hoc qualitative review of evaluators’ comments grouped hallucinations into the following four themes: misidentified drugs or misexpanded acronyms (n = 14) (for example, ‘FGC’ read as ‘female genital circumcision’ rather than ‘fair general condition’); contradictions of the initial documentation (n = 8) (for example, stating there were no signs of dehydration when the clinical note did not document hydration status); clinical parameter misinterpretation (n = 8) (for example, flagging SpO2 95% at Nairobi’s altitude of ~1,800 m as abnormal); and cases in which the rationale for the hallucination was insufficiently specified (n = 20).

Diagnostic reasoning

The presence of any diagnostic reasoning by the LLM was noted in nearly all outputs (1,443; 98%, 95% CI 97.4–98.8). Regarding the quality of differential diagnosis, 1,220 (83%, 95% CI 81.0–84.9) of responses demonstrated strong and well-reasoned consideration of alternatives (or rightly affirmed that the clinician’s differential was appropriate), while 223 (15%, 95% CI 13.4–17.1) had partial or incomplete differential reasoning (for example, in a 5-year-old child with a month-long history of rhinorrhea and snoring, focusing on sinusitis management without considering adenoidal hypertrophy or other obstructive etiologies). A small number of responses (24; 1.6%, 95% CI 1.0–2.4) contained major reasoning gaps or misleading information, such as inappropriate reassurance about granulocytosis or omission of pregnancy testing in a 28-year-old woman with right lower quadrant pain, and only two responses (0.1%, 95% CI 0.0–0.5) were deemed irrelevant (that is, grossly misaligned with the input content). In terms of novelty (given that sometimes simply agreeing with the clinician’s differential was deemed a high-quality response), 1,384 AI-generated responses (94%, 95% CI 92.9–95.4) provided new diagnostic insights. For example, in a 25-year-old woman initially diagnosed only with acute rhinitis, the LLM identified unrecognized microcytic anaemia (based on abnormal hemogram indices) and suggested iron deficiency as a plausible comorbidity, which the clinician subsequently incorporated into the final documentation. In 85 encounters (5.8%, 95% CI 4.6–7.1), the LLM responses offered no novel insight or were misleading. For contextual relevance (for example, an awareness of the local epidemiology) in its diagnostic reasoning, 1,384 responses (94%, 95% CI 92.9–95.4) were fully aligned with the patient’s clinical and social context, with partial alignment observed in 78 cases (5.3%, 95% CI 4.2–6.6) and complete misalignment in just 7 cases (0.5%, 95% CI 0.2–1.0).

Clinical reasoning (investigations and treatment planning)

The vast majority of AI responses (1,460; 99%, 95% CI 98.8–99.7) provided clinical management advice (for example, examinations, tests, referrals and treatments). Almost all responses (1,455; 99%, 95% CI 98.4–99.5) aligned with local clinical guidelines, and 1,461 responses (99%, 95% CI 98.9–99.8) contributed additional management-related insights beyond the initial documentation. Appropriateness of the LLM response to the patient’s socioeconomic or cultural context and resource limitations was high: no conflict arose in 1,153 encounters (78%, 95% CI 76.3–80.6) and the advice was fully adapted in 309 (21%, 95% CI 19.0–23.2). Only small numbers of responses showed issues: 2 attempted but failed to adapt (0.1%, 95% CI 0.0–0.5), 2 did not attempt to adapt (0.1%, 95% CI 0.0–0.5) and 3 were not culturally/contextually appropriate (0.2%, 95% CI 0.0–0.6).

Safety concerns associated with LLM outputs

Overall, 115 responses (7.8%, 95% CI 6.5–9.3) included active recommendations that evaluators considered potentially harmful: 37 (2.5%, 95% CI 1.8–3.5) were regarded as posing major safety concerns and 78 (5.3%, 95% CI 4.2–6.5) minor concerns. Among these 115 responses, evaluators judged that, in 25 cases (22%, 95% CI 14.6–30.4), the clinician appeared to have fully adopted the harmful advice, and in 42 cases (37%, 95% CI 27.7–46.0) partially adopted it. In 48 cases (42%, 95% CI 32.6–51.3), the harmful recommendations were not acted upon. The types of issues identified in the subset of potentially harmful AI responses were broadly similar to those found in initial clinical documentation. Follow-up reviews confirmed no resulting adverse patient outcomes.

The most common potentially harmful LLM outputs were inappropriate medication recommendations noted in 54 cases (46.2%, 95% CI 36.9–55.7) and omission of critical differential diagnoses in 37 cases (31.6%, 95% CI 23.3–40.9). Incorrect diagnoses were noted in 17 responses (14.8%, 95% CI 9.0–22.3), while less frequent issues included incorrect or unsafe dosing in 6 cases (5.2%, 95% CI 1.9–11.0), inappropriate investigations in 3 (2.6%, 95% CI 0.5–7.4), culturally/contextually inappropriate guidance in 3 (2.6%, 95% CI 0.5–7.4) and other concerns in 24 cases (20.9%, 95% CI 13.9–29.2).

Clinician response to LLM recommendations and impact on patient safety

In most encounters (917; 62%, 95% CI 59.9–64.9), clinicians made no LLM-induced changes; 358 (24%, 95% CI 22.2–26.6) had minor LLM-induced changes. For example, a 2-year-old with a short history of cough, sneezing and runny nose was correctly diagnosed with a viral upper respiratory infection. The LLM did not alter the diagnosis or overall management but made small documentation refinements, including adjusting the cetirizine dose to an age-appropriate level. In 194 encounters (13%, 95% CI 11.5–15.0), the LLM induced major changes. For example, in a child initially assessed for cough and abdominal pain, the original documentation lacked any bowel-related diagnosis; after the LLM highlighted chronic faecal soiling and the possibility of functional constipation, the clinician added a diagnosis of encopresis, performed an anal examination and instituted a full management plan including laxatives, dietary advice and a toileting routine. Among the 552 encounters with any change, edits most often concerned the treatment plan (436; 79%, 95% CI 75.3–82.3), followed by the differential diagnosis (184; 33%, 95% CI 29.4–37.4), history or examination findings (131; 24%, 95% CI 20.2–27.5) and the follow-up plan (129; 23%, 95% CI 19.9–27.1); investigations were revised in 98 (18%, 95% CI 14.7–21.2), and other edits occurred in 18 (3%, 95% CI 1.9–5.1). Of these 552 encounters, the LLM fully mitigated a potential harm in 67 (12.1%, 95% CI 9.5–15.2) and partially mitigated risk in 51 (9.2%, 95% CI 7.0–12.0); the remaining 434 (78.6%, 95% CI 75.0–82.0) either had no initial risk identified or showed no mitigation.

In 803 encounters (55%, 95% CI 52.1–57.2), no risk was initially identified, and no new safety concerns were introduced. However, in 420 cases (29%, 95% CI 26.3–31.0), clinicians continued with the potentially harmful course of action identified in the initial documentation, despite the LLM explicitly providing helpful guidance in 362 of these cases (25%, 95% CI 22.5–26.9). Apart from the aforementioned 115 cases of potentially harmful LLM recommendations, there were another 13 cases (0.9%, 95% CI 0.5–1.5) of spontaneous (not LLM-induced) changes to the final documentation that introduced a risk to the patient not present in the initial documentation (Fig. 3). Figure 4 highlights three (de-identified) real-world examples of LLM-related harm introduction.

Fig. 3: Sankey plot of modifications to final clinical documentation, relationship to LLM outputs and implications for patient safety.

Linkage between LLM helpfulness ratings and clinician modifications and resulting safety outcomes. Flow widths are proportional to encounter counts and illustrate that many consultations showed no change, while beneficial and harmful recommendations were differentially adopted.

Fig. 4: Summaries of real-world examples of AI Consult impacting (or not) care quality (both positively and negatively).

Three real-world cases demonstrate different patterns of LLM influence on care. Top: a potentially unsafe LLM recommendation was appropriately overridden by the clinician, resulting in correct diagnosis and referral. Middle: a harmful recommendation was followed, leading to inappropriate antibiotic prescribing. Bottom: an example of prompt overfitting, where model outputs reflected prompt wording rather than clinical necessity. GI, gastrointestinal; OTC, over the counter; UTI, urinary tract infection.

Inference costs

The total cost of running the LLM-based CDSS for the 1,469 encounters in this study was US$7.81, which translates to an average of approximately 0.5 US cents per encounter, assuming one application programming interface (API) call per encounter with an average of 670 input tokens and 364 output tokens.
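A minimal sketch of the per-encounter cost arithmetic, using only the figures stated above (the total token counts are derived from the per-encounter averages and are therefore approximations, not reported values):

```python
encounters = 1469
total_cost_usd = 7.81
avg_input_tokens, avg_output_tokens = 670, 364

# Average cost per encounter, in US cents
cost_per_encounter_cents = total_cost_usd / encounters * 100

# Approximate corpus-wide token volumes implied by the stated averages
total_input_tokens = encounters * avg_input_tokens
total_output_tokens = encounters * avg_output_tokens

print(f"{cost_per_encounter_cents:.2f} cents per encounter")  # ~0.53
print(f"~{total_input_tokens:,} input / ~{total_output_tokens:,} output tokens")
```

The per-encounter figure rounds to the 0.5 US cents quoted in the text; actual API pricing per token is not stated and is not assumed here.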