Study design
This was a prospective, scenario-based, double-blind, observational study designed to evaluate the performance of ChatGPT-5 in obstetric and gynecologic emergency scenarios.
A total of 15 standardized clinical scenarios were developed for this study. All scenarios were created by a single obstetrician with over 10 years of clinical experience in obstetric and gynecologic emergency management. Each case was designed in accordance with current international guidelines (ACOG, RCOG, and WHO) and represented a range of high-risk conditions, including postpartum hemorrhage, eclampsia, HELLP syndrome, uterine rupture, amniotic fluid embolism, ruptured ectopic pregnancy, and major gynecologic hemorrhages.
Development of all scenarios by a single author ensured conceptual and methodological consistency across cases, while subsequent independent expert evaluations helped reduce potential subjectivity or bias. All scenarios were conducted in Turkish, the native language of the clinical setting and the standard language of physician–patient communication in obstetric and gynecologic emergencies. Although the selected cases covered a broad clinical spectrum, ranging from common, guideline-defined emergencies to complex, multidisciplinary situations, no formal difficulty grading was applied; the primary aim was comprehensive coverage of both obstetric and gynecologic emergencies rather than stratification by difficulty level. Case complexity was inherently reflected in the nature of each scenario and its management requirements.
The study was conducted between July and August 2025, and all scenarios were run on a single large language model (ChatGPT-5). All questions were asked within the same session, on the same device (a MacBook Air, M1, 2020, with 8 GB RAM and macOS Monterey), over a stable 100 Mbps fiber internet connection. All interactions with ChatGPT-5 used the default system parameters (temperature = 0.7, top-p = 1.0, and maximum token length = 4096). These settings were deliberately retained to emulate a realistic clinical user experience, reflecting how healthcare professionals would typically engage with the system in practice. A separate chat window was opened for each scenario, and only the first responses were analyzed, because the probabilistic nature of large language models can yield different responses to the same question across trials; the "regenerate response" feature was therefore not used. In total, 15 scenarios × 5 questions yielded 75 responses.
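For transparency, the sketch below shows how the reported decoding parameters would map onto an equivalent programmatic request. It is illustrative only: the study itself used the ChatGPT web interface with its default settings, and the model identifier and exact parameter names here are assumptions, not the configuration actually used.

```python
# Illustrative sketch only: the study queried ChatGPT-5 through the web interface
# with default settings. This shows how the reported decoding parameters could be
# expressed in an equivalent API request; the model name "gpt-5" and parameter
# names are assumptions rather than the study's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def ask_scenario(prompt: str) -> str:
    """Send one scenario question as a fresh, stateless request
    (analogous to opening a new chat window per scenario)."""
    response = client.chat.completions.create(
        model="gpt-5",           # assumed identifier for ChatGPT-5
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,         # default reported in the study
        top_p=1.0,               # default reported in the study
        max_tokens=4096,         # maximum response length reported in the study
    )
    # Only the first generated answer is used; no regeneration is requested.
    return response.choices[0].message.content
```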
Evaluation process
Responses generated by ChatGPT-5 were recorded and anonymized for analysis. In parallel, the same 15 clinical scenarios were independently evaluated by four clinicians: two obstetricians, one emergency medicine specialist, and one anesthesiologist.
The first obstetrician (OB1–POK) had 12 years and 4 months of clinical experience in obstetrics and gynecology, the second obstetrician (OB2–TA) had 10 years and 2 months, the emergency physician (EM–MG) had 9 years and 6 months, and the anesthesiologist (AN–KA) had 11 years and 1 month of professional experience in obstetric and emergency care. All assessors had active roles in tertiary-level hospitals and routinely managed high-acuity obstetric and gynecologic emergencies.
Each response was evaluated using a structured, guideline-based scoring system encompassing five key parameters:
1. Diagnostic accuracy: Whether the most likely diagnosis was identified accurately and comprehensively.
2. Investigations: Appropriateness and logical prioritization of the recommended diagnostic tests.
3. Treatment plan: Concordance of the proposed initial treatment approach with current clinical guidelines.
4. Clinical safety and applicability: Whether the recommendations were safe and applicable in obstetric emergency department settings.
5. Decision complexity: Whether the response involved clinical reasoning beyond superficial pattern recognition.
Each criterion was scored as 1 point (correct/complete) or 0 points (incorrect/incomplete), giving a maximum of 5 points per case. Total scores ranged from 0 to 5, and the results were categorized as follows:
5/5: High agreement
4/5: Moderate agreement
≤ 3/5: Low agreement
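As a minimal illustration of this scoring scheme (variable names are ours and not part of the study protocol), the snippet below totals the five binary criteria for one response and assigns the corresponding agreement category.

```python
# Minimal sketch of the five-criterion binary scoring applied to each response.
# Field names are illustrative; they mirror the criteria listed above.
CRITERIA = [
    "diagnostic_accuracy",
    "investigations",
    "treatment_plan",
    "clinical_safety_applicability",
    "decision_complexity",
]

def total_score(ratings: dict[str, int]) -> int:
    """Sum the five 0/1 criterion scores (maximum 5 points per case)."""
    return sum(ratings[c] for c in CRITERIA)

def agreement_category(score: int) -> str:
    """Map the total score to the agreement categories defined above."""
    if score == 5:
        return "high (5/5)"
    if score == 4:
        return "moderate (4/5)"
    return "low (<=3/5)"

# Example: a response judged correct on all criteria except decision complexity.
example = {c: 1 for c in CRITERIA}
example["decision_complexity"] = 0
print(total_score(example), agreement_category(total_score(example)))  # 4 moderate (4/5)
```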
Quality and readability analysis
In addition to clinical relevance, ChatGPT-5’s responses were evaluated for scientific quality and readability. The same four assessors (two obstetricians, one emergency medicine specialist, and one anesthesiologist) who conducted the primary clinical evaluations also performed the quality and readability assessments to ensure methodological consistency. None of the assessors were involved in the scenario development process. Prior to evaluation, a short calibration session was conducted in which five sample responses were jointly reviewed to establish scoring consistency. Each response was anonymized and labeled with coded identifiers (e.g., Response A, B, C) to maintain blinding. Scientific quality and clinical value were assessed using the modified DISCERN (mDISCERN) and Global Quality Scale (GQS) tools, following established criteria from the obstetrics and gynecology literature and international guidelines (ACOG, 2022; WHO, 2018).
Readability was analyzed using standardized indices—Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Coleman–Liau Index (CLI)—to determine the clarity and comprehensibility of the generated text. These measures allowed for an integrated evaluation of both the informational quality and accessibility of the AI-generated outputs.
mDISCERN scores
mDISCERN is a modified version of the original DISCERN scale and is one of the most frequently used methods for assessing electronic health information [10, 11]. Each “yes” answer is worth 1 point; the total score ranges from 0 to 5. Higher scores indicate more reliable and higher-quality information. The scale consists of five questions:
1. Are the purpose and scope clear?
2. Is there alignment between the source and the guidelines?
3. Is the information balanced and unbiased (benefits, risks, alternatives)?
4. Are patient safety and limitations addressed?
5. Are points of uncertainty and the need for expert opinion acknowledged?
Global Quality Scale (GQS) scores
GQS is a five-point Likert scale widely used to evaluate the overall quality and clinical application value of online health content [12]. GQS is graded from 1 to 5, with higher values indicating higher content quality.
1 point: Very low quality – clinically useless
2 points: Low quality – critical deficiencies present
3 points: Fair quality – basic information present but insufficient detail
4 points: Good quality – most necessary information present
5 points: Excellent quality – comprehensive, detailed, and clinically valuable
In this study, GQS scores were divided into three categories:
1–2: Low quality
3: Moderate quality
4–5: High quality
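The short sketch below (illustrative variable names and toy data, not study code) shows how per-response mDISCERN totals and GQS categories could be derived from the four assessors' ratings under the definitions above.

```python
# Illustrative sketch: aggregating the four assessors' quality ratings for one
# anonymized response. mDISCERN is the sum of five yes(1)/no(0) items (0-5);
# GQS is a single 1-5 Likert rating mapped to the three categories defined above.
from statistics import mean

def mdiscern_total(items: list[int]) -> int:
    """Sum of the five yes/no mDISCERN items (range 0-5)."""
    assert len(items) == 5 and all(i in (0, 1) for i in items)
    return sum(items)

def gqs_category(score: int) -> str:
    """Map a 1-5 GQS rating to the categories used in this study."""
    if score <= 2:
        return "low"
    if score == 3:
        return "moderate"
    return "high"

# Example: four assessors' ratings for a single response (toy values).
mdiscern_ratings = [[1, 1, 1, 0, 1], [1, 1, 1, 1, 1], [1, 1, 0, 1, 1], [1, 1, 1, 1, 0]]
gqs_ratings = [4, 5, 4, 4]

mean_mdiscern = mean(mdiscern_total(r) for r in mdiscern_ratings)  # 4.25
mean_gqs = mean(gqs_ratings)                                       # 4.25
print(mean_mdiscern, mean_gqs, gqs_category(round(mean_gqs)))      # 4.25 4.25 high
```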
Readability analysis
Response comprehensibility was assessed using several readability metrics: the Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), SMOG, and Coleman–Liau Index (CLI). In addition, the total numbers of words and sentences and the lexical density (ratio of meaningful words) were calculated. The underlying formulas and recommended threshold values were taken into account:
FRES is evaluated on a scale of 0–100; higher values indicate easier readability, while lower values indicate more technical and complex content.
FKGL and CLI estimate the level of education required to understand the text, while SMOG expresses this level in years of schooling.
For general readers, it is recommended that FRES be ≥ 60 and other indices be below 7. The validity of these analyses in medical contexts has also been supported by recent research [13].
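For reference, these indices are conventionally computed with the standard English-language formulas shown below (provided as background only; the study does not state whether any language-specific adaptation was applied to the Turkish-language responses):

\[ \text{FRES} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}} \]
\[ \text{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59 \]
\[ \text{SMOG} = 1.0430\,\sqrt{\text{polysyllabic words} \times \frac{30}{\text{total sentences}}} + 3.1291 \]
\[ \text{CLI} = 0.0588\,L - 0.296\,S - 15.8 \]

where \(L\) is the average number of letters per 100 words and \(S\) is the average number of sentences per 100 words.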
Ethical approval and participation
This study does not require ethics committee approval because it does not involve any invasive procedures on humans or animals. The study evaluated only scenario-based simulations and responses generated by artificial intelligence (AI), specifically ChatGPT-5. Furthermore, the ethical implications of using AI tools in medical education and clinical decision support processes were considered, and the principles of data confidentiality, impartiality, transparency, and scientific ethics were adhered to.
Statistical analysis
Inter-rater reliability among the four independent assessors (two obstetricians, one emergency medicine specialist, and one anesthesiologist) was evaluated using the Intraclass Correlation Coefficient (ICC, two-way random effects, absolute agreement model). ICC values were interpreted as follows: <0.40 = low, 0.40–0.75 = moderate, and > 0.75 = high agreement. Each scenario was categorized according to overall concordance with expert standards as high (5/5), moderate (4/5), or low (≤ 3/5) agreement. The proportions of these categories were calculated, and 95% Wilson confidence intervals were applied to account for the binomial distribution of proportions. Scenarios with a lower bound of < 60% were interpreted as showing potentially limited diagnostic agreement. Readability and quality metrics—including the modified DISCERN (mDISCERN), Global Quality Scale (GQS), Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Coleman–Liau Index (CLI), and lexical density—were computed for each scenario. Continuous variables were summarized as mean ± standard deviation (SD) or median (interquartile range), as appropriate. All statistical analyses were conducted using IBM SPSS Statistics version 29.0 (IBM Corp., Armonk, NY, USA). A two-tailed p-value < 0.05 was considered statistically significant.