Objectives
This study aimed to evaluate the limitations of large language models (LLMs) in orthodontics by comparing their performance on licensing exam questions categorized by knowledge domains and error types.
Materials and methods
DeepSeek-R1 (DS) and ChatGPT-4 (GPT) were evaluated using 396 text-based questions from the Chinese National Orthodontic Specialist Licensing Examination. Questions were classified through dual taxonomies: (1) “knowledge domains”, comprising foundational biomechanical principles, cross-disciplinary medical integration, specialized orthodontic theory, and clinical decision-making skills; (2) “error types”, comprising factual inaccuracies, logical deficits, and semantic misinterpretations.
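To make the dual-taxonomy coding concrete, a minimal Python sketch of one possible per-question record is shown below; the class and field names (e.g., `QuestionRecord`, `ds_correct`) are hypothetical illustrations and are not taken from the study protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# First taxonomy: knowledge domains named in the study.
class Domain(Enum):
    FOUNDATIONAL_BIOMECHANICS = "foundational biomechanical principles"
    CROSS_DISCIPLINARY = "cross-disciplinary medical integration"
    SPECIALIZED_THEORY = "specialized orthodontic theory"
    CLINICAL_DECISION_MAKING = "clinical decision-making skills"

# Second taxonomy: error types assigned to incorrect responses.
class ErrorType(Enum):
    FACTUAL = "factual inaccuracy"
    LOGICAL = "logical deficit"
    SEMANTIC = "semantic misinterpretation"

@dataclass
class QuestionRecord:
    question_id: int                      # hypothetical identifier
    domain: Domain                        # assigned knowledge domain
    ds_correct: bool                      # DeepSeek-R1 answered correctly
    gpt_correct: bool                     # ChatGPT-4 answered correctly
    ds_error: Optional[ErrorType] = None  # error type if DS was wrong
    gpt_error: Optional[ErrorType] = None # error type if GPT was wrong
```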
Results
DS demonstrated significantly higher overall accuracy than GPT (80.3% vs 52.3%, p < 0.001), with statistically significant differences in foundational knowledge (79.8% vs 43.4%) and cross-disciplinary domains (81.0% vs 53.0%). Factual errors were predominant in both models (DS: 57.7%, GPT: 69.3%), though DS exhibited higher logical error rates (24.4% vs 16.4%).
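As an illustration of why an overall gap of 80.3% vs 52.3% across 396 questions yields p < 0.001, a minimal sketch of a chi-square test on the implied 2×2 contingency table is given below; the abstract does not state which statistical test was used, so this is an assumption for illustration only, and the counts are approximations back-calculated from the reported accuracies.

```python
from scipy.stats import chi2_contingency

TOTAL = 396  # text-based questions in the study

# Approximate correct-answer counts implied by the reported accuracies
# (80.3% for DS, 52.3% for GPT); the per-question data are not given here.
ds_correct = round(0.803 * TOTAL)   # ≈ 318
gpt_correct = round(0.523 * TOTAL)  # ≈ 207

table = [
    [ds_correct, TOTAL - ds_correct],    # DeepSeek-R1: correct, incorrect
    [gpt_correct, TOTAL - gpt_correct],  # ChatGPT-4: correct, incorrect
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")  # p falls far below 0.001
```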
Conclusions
While DS outperforms GPT in general orthodontic knowledge assessment, both models show limitations in specialized domains requiring clinical reasoning.
Clinical relevance
The superior performance of DS in standardized examinations suggests potential for AI-assisted decision support in orthodontic training and licensing evaluation. However, persistent factual errors and domain-specific limitations highlight the need for clinician verification in real-world applications. Integrating domain-specific knowledge refinement with logical reasoning modules could enhance LLMs’ clinical utility in orthodontic practice.