Study design
We developed P&P Care, an LLM-powered primary care chatbot, using a dual-track role-play codesign framework. The framework delineates roles and responsibilities for community stakeholders (patients, caregivers and community health workers, including village doctors) and research stakeholders (public health experts, primary care physicians and LLM engineers) to ensure parity in influence. A core innovation is reciprocal role-play, wherein stakeholders simulate one another’s perspectives; for example, researchers simulated patient interactions under low-literacy constraints while community members critiqued prototypes from a designer’s viewpoint. This approach mitigates common limitations of traditional codesign, such as tokenism and power imbalances.
We evaluated the codesigned system of the P&P Care chatbot and its integrated literacy-bridging e-learning modules in a parallel RCT. The trial was conducted across 11 geographically and culturally diverse Chinese provinces, encompassing both rural and urban communities. Study sites were selected to represent China’s tiered healthcare development, spanning eastern (Beijing, Shandong), central (Hubei, Shanxi) and western (Chongqing, Gansu, Shaanxi, Sichuan, Guangxi, Inner Mongolia and Xinjiang; Fig. 1d) regions. For reporting participant dropouts due to network instability, Hubei was reclassified with the eastern group because of its comparable infrastructure, while Shanxi was grouped with the western provinces. The inclusion of three of China’s five autonomous regions (Guangxi, Inner Mongolia and Xinjiang) ensured representation of major ethnic minority populations with distinct linguistic and cultural backgrounds. The trial included community residents who were randomly assigned to receive a consultation with preparatory e-learning (e-learning plus) or without (consultation-only) in a 1:1 ratio. The primary outcome was the objective awareness of health needs. All primary care consultations with the chatbot were recorded, along with participant feedback to assess consultation quality. Trial reporting adheres to the CONSORT-AI guidelines41.
Ethical approval
The study protocol received approval from the Ethics Review Committee of the Chinese Academy of Medical Sciences and Peking Union Medical College and was prospectively registered with the Chinese Clinical Trial Registry (identifier ChiCTR2500098101). All participants provided informed consent in accordance with the Declaration of Helsinki, with explicit disclosure that this is an exploratory experiment and that the chatbot health advice given during consultations should not be used for disease management without clinician oversight. Stringent data protection protocols were implemented in this study, ensuring that all data were anonymized and encrypted for privacy protection.
Codesign process
Contextual understanding
To understand community needs and existing technological limitations, we initiated codesign workshops involving 36 community stakeholders (12 patients, 8 caregivers and 16 health workers from Beijing) and 8 role-playing researchers (from Peking Union Medical College; Extended Data Table 1). Three community health worker liaisons (with more than 6 years of community engagement experience) and four researchers, each with expertise in a specific area (digital healthcare, public health, primary care and LLM techniques), facilitated the codesign process. This phase involved iteratively interacting with and evaluating three prevalent AI health consultation platforms in China: XiaoHeJianKang (TikTok), XunFeiXiaoYi (Tencent) and LingYiZhiHui (Baidu), through periodic in-person meetings, virtual conferences (WeChat and Tencent Meeting) and collaborative online documentation (WPS 365).
In the community track, stakeholders articulated unmet needs and contextual challenges in utilizing existing health consultation chatbots (for example, difficulties with technical terminology and verbose responses) by methods including storytelling, focus groups and on-site demonstrations. They engaged in role-playing as researchers to define key challenges and desired improvements for a user-friendly interface (for example, a WeChat mini-program). In the research track, researchers defined LLM capabilities and limitations, and established compliance requirements and ethical guidelines (for example, aligning with WHO principles for ethical AI in primary care). Researchers role-played as users to interact with existing chatbots, directly experiencing limitations such as health literacy barriers and jargon-heavy outputs.
Qualitative data generated from this process underwent reflexive thematic analysis to pinpoint critical limitations in current AI tools. The initial codebook was created from preliminary responses and iteratively refined through ongoing analysis cycles. Two independent researchers coded transcripts, achieving high interrater reliability (Fleiss’ kappa κ = 0.80); all discrepancies were resolved through consensus. Community participants, particularly older adults, expressed a critical need for explicit guidance on initiating dialogues with chatbots, articulating concerns regarding uncertainty at the interaction onset and the potential for providing inaccurate information, as exemplified by the direct quote: “I am at a loss as to what to do at the start; I do not know what the chatbot is asking for, and I worry that I will fail to provide the correct information and ruin the consultation.” In addition, a majority (8/12) of patients valued longitudinal disease management above episodic symptom resolution, with one stating: “I want to know how I could live with the disease in my remaining life, not just address symptoms today.”
Cocreation
The cocreation phase produced P&P Care prototypes and e-learning modules through parallel, role-playing tracks. Community stakeholders (patients, caregivers and health workers) contributed their lived experiences to design interaction modules. Through role-play, they acted as designers to ensure usability and cultural appropriateness, shaping features such as bite-sized video content and culturally relevant health recommendations. Researchers (physicians, public health experts and LLM engineers) translated community needs into technical solutions. Adopting user personas, they role-played low-literacy interactions to evaluate social feasibility, focusing on language simplification and culturally sensitive delivery.
The chatbot was developed using systematic prompt engineering through an iterative two-step process: role-specific fine-tuning and ethical-operational refinement, ensuring alignment with Chinese primary care guidelines and WHO ethical AI principles.
For role-specific fine-tuning, we implemented a bidirectional dialogue architecture with two clinical reasoning stages. Bidirectional interaction meant that the chatbot not only responded to users’ questions but also proactively posed questions of its own. During the inquiry stage, the chatbot was trained through Reinforcement Learning from Human Feedback42 to engage patients in active, multiturn dialogues to elicit comprehensive health-related information aligned with primary care standards, including demographics, medical history, presenting symptoms, living environment, lifestyle behaviours and psychological status. In the conclusion stage, the chatbot was trained to generate one to three relevant differential diagnoses, each accompanied by supporting or refuting evidence (for example, symptom alignment and risk factors), and to suggest preventative care actions (for example, screenings and lifestyle changes). Furthermore, the chatbot was trained, through Chain of Thought prompting43, to suggest one to three actionable recommendations for physical/laboratory examinations, with rationales tied to clinical guidelines (for example, HbA1c testing for prediabetes screening in individuals with a high body mass index). Real-time pop-up alerts from a health risk detector ensured user awareness of potential health risks. In addition, chatbot response generation incorporated a robust verification pipeline: all outputs sequentially cleared a safety filter (Llama2-Guard, confidence >0.5)44 and a hallucination detector (SelfCheckGPT, mean contradiction probability <0.5)45. Failed checks triggered regeneration (up to three attempts) before a fallback response was issued.
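The verification pipeline can be sketched as follows. This is a minimal illustration: the scoring functions are hypothetical stubs standing in for the Llama2-Guard and SelfCheckGPT calls, and only the thresholds (safety confidence >0.5, mean contradiction probability <0.5) and the three-attempt regeneration budget come from the description above.

```python
MAX_ATTEMPTS = 3  # regeneration budget before the fallback message
FALLBACK = "I am unable to answer safely; please consult a clinician."

def generate(prompt):
    """Hypothetical stub for the chatbot's LLM generation call."""
    return f"Advice for: {prompt}"

def safety_score(text):
    """Hypothetical stub for the Llama2-Guard safety classifier."""
    return 0.2 if "harmful" in text else 0.9

def contradiction_prob(text):
    """Hypothetical stub for SelfCheckGPT's mean contradiction probability."""
    return 0.1

def verified_response(prompt):
    """Sequentially apply the safety filter and hallucination detector;
    regenerate up to MAX_ATTEMPTS times, then fall back."""
    for _ in range(MAX_ATTEMPTS):
        draft = generate(prompt)
        if safety_score(draft) > 0.5 and contradiction_prob(draft) < 0.5:
            return draft
    return FALLBACK
```

In the deployed system each stub would be replaced by the corresponding model call; the control flow (sequential checks, bounded regeneration, fallback) is the part the sketch aims to convey.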
For the ethical and operational refinement, we used prompt augmentation and agent techniques to refine the model46. A multidisciplinary feedback-and-refinement team (36 community stakeholders, 2 primary care physicians, 2 public health experts and 1 AI-ethics-trained graduate student) iteratively interacted with the model via adversarial testing, prioritizing (1) safety and hallucination—avoiding harmful, biased or non-compliant outputs (for example, rejecting inappropriate treatment suggestions); (2) patient-centredness—enhancing empathetic communication (for example, probing unmet needs); and (3) context adaptability—optimizing performance for low-resource settings (for example, simplifying language for low-literacy users).
To effectively address users’ needs following their e-learning experience, we instructed an LLM-driven agent to assist the counsellor chatbot in providing appropriate responses. This agent is designed to understand the intent behind patient inquiries through intention analysis, specifically following e-learning modules 2.a, 4.f and g in Supplementary Table 3. The prompts were tailored as responses to the specific e-learning modules, and the outputs from the agent were structured in JSON format for straightforward parsing and further processing by the chatbot.
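A JSON-structured agent output of the kind described above might look like the following sketch; the field names and values are illustrative assumptions, not the study's actual schema.

```python
import json

# Hypothetical example of the intent-analysis agent's JSON output for one
# patient inquiry; all keys and values here are illustrative.
raw = json.dumps({
    "intent": "symptom_description",
    "entities": {"symptom": "headache", "duration_days": 3},
    "suggested_followup": "Ask about accompanying fever or vision changes.",
})

# The chatbot parses the structured output and routes on the detected intent.
parsed = json.loads(raw)
if parsed["intent"] == "symptom_description":
    followup = parsed["suggested_followup"]
```

Structured JSON makes the agent's recommendation machine-parseable, so the counsellor chatbot can branch on the intent field rather than re-interpreting free text.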
Testing and refinement
During the testing and refinement phase, community stakeholders engaged directly with prototypes through role-play, adopting researcher perspectives to define evaluation metrics on consultation quality (Supplementary Table 1). Rubrics of attention (matching consultation duration with participant expectations) and conciseness (clarity and brevity in conversations) surfaced from community participants role-playing as researchers during the codesign workshops, in response to the question, “If you could design a chatbot for health conversation, what aspects would you value most?”.
Meanwhile, researchers conducted simulated patient interactions to identify technical limitations, such as integrating visual aids for literacy adaptation. Model refinement integrated two complementary approaches: human interaction via our codesign team to ensure contextual relevance, and simulated patient interactions leveraging the Chinese MedDialog dataset containing 3.4 million dialogues across 172 ICD-10 domains. Virtual patient behaviours were informed by community-derived interaction patterns through a process we term community-informed virtual patient interaction.
For chatbot training, we developed a community-informed patient agent that emulated realistic patient behaviours using synthetically generated profiles structured via a knowledge graph architecture47. The graph formalized patient attributes as interconnected nodes with schema definitions encoded via the Resource Description Framework48, populated using Simple Protocol and Resource Description Framework Query Language (SPARQL) queries49. We synthesized 1,200 unique virtual patient profiles (see Supplementary Fig. 2 for illustrations), with 50% requiring interdisciplinary consultation, all validated by five board-certified clinicians through consensus validation. The patient agent emulated behavioural patterns derived from community interactions, including dialogue hesitancy and ambiguous symptom articulation, with termination protocols ending interactions when virtual patients acknowledged query resolution.
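Conceptually, the profile graph stores patient attributes as subject-predicate-object triples that queries can filter. The sketch below uses plain Python tuples and a filter function as stand-ins for the RDF store and SPARQL; all profile values and predicate names are hypothetical.

```python
# Minimal triple-store sketch of the virtual-patient knowledge graph.
# Each tuple is (subject, predicate, object); values are illustrative.
triples = [
    ("patient:001", "hasAge", 67),
    ("patient:001", "hasCondition", "type2_diabetes"),
    ("patient:001", "needsInterdisciplinary", True),
    ("patient:002", "hasAge", 34),
    ("patient:002", "hasCondition", "migraine"),
    ("patient:002", "needsInterdisciplinary", False),
]

def query(triples, predicate, obj):
    """Analogue of a SPARQL SELECT: subjects matching (predicate, object)."""
    return [s for s, p, o in triples if p == predicate and o == obj]

# E.g. select the profiles flagged as requiring interdisciplinary consultation.
interdisciplinary = query(triples, "needsInterdisciplinary", True)
```

In the actual pipeline an RDF schema enforces which predicates a patient node may carry, and SPARQL performs the equivalent selection over the populated graph.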
A randomly selected subset of 600 profiles refined the chatbot’s prompt templates using a codesigned rubric evaluated by a blinded panel of five experts and two community laypersons. A score below 4 triggered refinement.
For e-learning module refinement, the remaining 600 patient profiles were allocated to e-learning plus or consultation-only groups. In the e-learning plus group, the patient agent internalized educational modules through a JSON-structured decision model, while the consultation-only group initiated dialogues with deactivated education. Initial dialogues from 300 profiles were evaluated by the expert panel to identify suboptimal elements, prompting iterative content revisions prioritizing clarity and WHO primary care communication guideline alignment. The remaining 300 profiles, unexposed to prior refinement, served for validation and sample estimation, assessed by the same panel using the codesigned rubric with anonymized, random case assignment.
Implementation and evolution
During the implementation and evolution phase, our dual-track approach facilitated continuous refinement through role-playing exercises and two localized pilot studies (20 patients, Beijing, November–December 2024; and 50 patients, the remaining 10 provinces, January–February 2025). Community stakeholders identified critical implementation barriers—including network limitations, impracticality of usual care controls and culturally specific recruitment needs—while analysing usage patterns to develop network attrition mitigation strategies and refine feedback questionnaires for capturing nuanced communication outcomes. Concurrently, researchers conducted computational audits to detect demographic performance disparities, expanding dialectal training corpora based on these findings. Through role-playing as users in resource-constrained scenarios, researchers stress-tested system resilience, validating the chatbot’s operational capacity in low-infrastructure environments.
In the pilot studies, the primary outcome and the secondary outcomes related to consultation quality were the same as those from the testing and refinement phase. Secondary outcomes on patient consultation experiences were structured on the basis of patients’ qualitative feedback via thematic analysis. Qualitative data on the consultation experience, derived from eight iterative role-play workshops, underwent reflexive thematic analysis to establish patient-centred evaluation metrics. The initial codebook was developed from preliminary responses and iteratively refined through successive analysis cycles. Two independent researchers coded transcripts, achieving high interrater reliability (Fleiss’ kappa κ = 0.86); all discrepancies were resolved through consensus. The emergent themes (perceived usefulness, interface user-friendliness, satisfaction, future use intention and facilitation of physician communication) were subsequently operationalized into a 5-point Likert scale survey. This questionnaire underwent three codesign iterations with community stakeholders to enhance its relevance and clarity. The questionnaire’s internal consistency was robust (Cronbach’s α > 0.70 for all dimensions), and face validity was established through iterative feedback from 20 community residents, ensuring relevance to community primary care contexts. The finalized survey instrument is presented in Supplementary Table 4.
Four engineering optimizations addressed network constraints: (1) implementing a tiered service backbone that dynamically adjusts context window size and output length on the basis of network latency; (2) using local caching to enable offline functionality; (3) designing adaptive latency handling through extended timeouts and decoupled processing; and (4) prioritizing text transmission to optimize bandwidth. We evaluated performance under severe network throttling simulating worst-case rural conditions: internet protocol instability (10% probability of switching every 10 s), connection drops (20% probability every 10 s), 60% packet loss, 1,000 ms latency and 100 kilobits per second bandwidth. An ablation study assessing consultation completion rates systematically removed each optimization: tiered service elimination, cached functionality disablement, latency handling deactivation, bandwidth optimization removal and combined strategy ablation.
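The first optimization, the tiered service backbone, can be sketched as a latency-to-parameter mapping; the tier thresholds and token budgets below are illustrative assumptions, as the text does not specify them.

```python
# Sketch of the tiered service backbone: context window and output length
# shrink as measured network latency grows. Thresholds (ms) and token
# budgets are hypothetical, chosen only to illustrate the mechanism.
TIERS = [
    (200,  {"context_tokens": 4096, "max_output_tokens": 512}),  # fast link
    (600,  {"context_tokens": 2048, "max_output_tokens": 256}),
    (1000, {"context_tokens": 1024, "max_output_tokens": 128}),
]
FALLBACK_TIER = {"context_tokens": 512, "max_output_tokens": 64}  # >=1000 ms

def select_tier(latency_ms):
    """Return generation parameters for the first tier whose latency
    ceiling exceeds the measured round-trip latency."""
    for ceiling_ms, params in TIERS:
        if latency_ms < ceiling_ms:
            return params
    return FALLBACK_TIER
```

Under the simulated worst-case conditions (1,000 ms latency), such a policy would pin the service to its smallest tier, trading response richness for completion reliability.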
Trial design
Participants
Participants in the trial were required to demonstrate a need for health consultation or to express a willingness to engage in LLM-based health consultations, and to be between 20 and 80 years of age. Individuals who presented with psychological disorders, drug abuse or any other conditions that might compromise communicative interactions or the integrity of assessments were excluded.
Intervention and comparator
This was a double-blind, parallel-group trial. Participants were randomly assigned to either the e-learning plus or the consultation-only group upon enrolment. All participants were informed that they would go through two phases: preparation for the consultation and then participation in health consultations. In the e-learning plus group, participants received training on AI health literacy through the e-learning modules during the preparation phase before engaging in health consultations. In the consultation-only group, participants underwent controlled preparation, where the e-learning modules were deactivated, before engaging in health consultations. During the preparation phase, participants were required to recall the symptoms and conditions of their recent and current illnesses. They then conducted health consultations with the P&P Care chatbot (see Supplementary Tables 5 and 6 for illustrations). After consultation, participants completed a post-consultation questionnaire to provide feedback on their experience.
Outcomes
The primary outcome measured objective awareness of health needs. Secondary outcomes included attention, integrity, listenability, conciseness and empathy (Supplementary Table 1). These outcomes serve as objective indicators of consultation dialogue quality and were evaluated by the same panel involved in the chatbot development phase, who were blinded to group assignment. All assessments were conducted post-recruitment, with one trained assessor assigned to each dialogue. The secondary outcomes also included self-reported awareness of health needs, satisfaction, usefulness, user-friendliness, acceptability and recommendation, which were retrieved from feedback questionnaires completed by the participants (Supplementary Table 4). Additional outcomes included the number of conversational turns and the average number of words per participant case.
Sample size
Sample sizes were estimated based on the difference in health awareness between the e-learning plus and consultation-only groups. Using data from preliminary virtual experiments, we calculated that a sample size of 1,200 participants would be required to achieve 80% power at a significance level of 0.05. Because literacy scores were inflated in virtual patients compared with real-world community residents, we conducted an interim analysis after enrolling 500 participants and recalculated the between-group difference, yielding a revised sample size of 2,000 participants to maintain statistical power.
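A standard normal-approximation sample size formula for a two-group comparison can be sketched as follows. This is illustrative only: the trial's actual calculation was driven by pilot and interim estimates, and the effect size in the usage line is an arbitrary example, not the study's value.

```python
import math

def n_per_group(effect_size, z_alpha=1.96, z_beta=0.8416):
    """Per-group sample size for a two-sample comparison of means via the
    normal approximation: n = 2 * ((z_alpha + z_beta) / d)^2.
    Defaults correspond to a two-sided alpha of 0.05 and 80% power."""
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Hypothetical usage: a standardized effect size of 0.2 (small effect)
# requires 393 participants per group, 786 in total.
n = n_per_group(0.2)
```

Smaller real-world effects than those seen in virtual experiments push this formula toward larger samples, which is the direction of the interim revision from 1,200 to 2,000 participants.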
Recruitment
Our local codesign teams, whose members lived in the area and were proficient in the local dialect, assisted with participant recruitment and the trial. These teams proactively contacted potential community participants who needed LLM-based health consultations or were willing to try them. For those who indicated interest, the teams provided comprehensive descriptions of the study, emphasizing that it was exploratory and that any advice rendered by P&P Care serves solely as a reference and should not be used as a definitive basis for disease therapy. Participants received an informed consent form before enrolment and had the opportunity to ask questions. After this process, potential participants who met the established inclusion and exclusion criteria were formally recruited. The community engagement track facilitated iterative collaboration with stakeholders across the localized pilot studies, identifying the need for culturally appropriate recruitment strategies. Recruitment took place from 4 to 30 March 2025.
Randomization and blinding
We used individual-level parallel randomization without stratification, utilizing a computer-generated random sequence for participant assignment to each experimental group. We implemented allocation concealment to maintain the confidentiality of the random allocation and minimize bias. Throughout the intervention and analysis phases, the group information and operational materials remained undisclosed to all participants and researchers.
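A computer-generated 1:1 allocation sequence of the kind described can be sketched as below; the seed, group labels and balanced-shuffle construction are illustrative assumptions rather than the trial's exact procedure.

```python
import random

def allocation_sequence(n, seed=2025):
    """Generate a 1:1 allocation sequence for n participants (n even).
    A balanced list of group labels is shuffled with a seeded RNG so the
    sequence is reproducible yet unpredictable to recruiters. The seed and
    labels here are illustrative."""
    rng = random.Random(seed)
    groups = ["e-learning plus", "consultation-only"] * (n // 2)
    rng.shuffle(groups)
    return groups

seq = allocation_sequence(10)
```

In practice the sequence would be held by a party independent of recruitment so that allocation concealment is preserved until each participant is enrolled.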
Statistical methods
To ensure objectivity, data collection and subsequent statistical analyses were conducted by independent researchers. We assessed the normality of scale value distributions across 11 dimensions and used two-sample t-tests with unequal variances for intergroup comparisons where appropriate. For dimensions that exhibited significant skewness, we used non-parametric Mann–Whitney U tests. All statistical tests were two-tailed, with a significance threshold set at P < 0.05. In addition, we applied the Benjamini–Hochberg adjustment for multiple testing corrections based on the total number of tests performed. R 4.3.0 was used to perform the statistical analyses and present the results.
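The Benjamini-Hochberg step-up adjustment is compact enough to show in full; the analyses themselves were run in R 4.3.0 (where `p.adjust(p, method = "BH")` is the equivalent), and the sketch below is a stdlib Python rendering of the same procedure.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (step-up procedure):
    rank the raw p-values, scale each by m/rank, then enforce monotonicity
    from the largest rank downwards."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from largest p to smallest
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```

Adjusted values at or below 0.05 then correspond to discoveries at a 5% false discovery rate across the family of tests.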
Subgroup analyses were performed on the basis of age, community setting, sex and ICD-10 domains. To match ICD-10 domains with the dialogue data, we collaborated with local primary care workers to create a comprehensive list of 23 ICD-10 domains (Supplementary Table 7). We then prompted a language model to match each dialogue with the identified ICD-10 domains. One dialogue could be assigned to more than one ICD-10 domain. The accuracy of the assignment was corroborated by a panel consisting of five medical professionals.
To establish a contextual benchmark for evaluating the P&P Care chatbot, we prospectively collected 110 dialogue samples (10 per province) from existing primary care encounters in townships and communities following informed consent protocols. In addition, we collected 110 primary care dialogue samples (10 per province) from the telemedicine platform (HaoDF). Because these contextual dialogues frequently incorporated informal exchanges characteristic of established patient–physician relationships, which differed substantively from the interactions facilitated by P&P Care and were readily discernible, they were maintained as a separate dataset and not integrated with P&P Care patient dialogues for blinded evaluation. Each dialogue underwent independent evaluation by two trained members of the same evaluation panel. Interrater reliability was rigorously assessed and confirmed using Fleiss’ kappa statistics (κ > 0.80).
Furthermore, we expanded our analysis to compare P&P Care with usual care and telemedicine across four domains central to primary care quality: (1) history-taking, (2) diagnostic precision, (3) test-ordering appropriateness and (4) long-term disease management. Blinded evaluations by independent expert clinicians (five board-certified primary care physicians) assessed conversation dialogues using validated 5-point Likert scales for completeness, appropriateness and clinical relevance, according to standardized evaluation guidelines (Supplementary Table 8). Interrater reliability for these assessments was high (κ > 0.80).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.