As artificial intelligence (AI) continues to integrate into healthcare, its potential in dental education is being put to the test—literally.
A new study published in the Journal of Endodontics evaluated the ability of 7 leading AI chatbots to correctly answer 100 board-style multiple-choice questions in endodontics. The findings show that while some models demonstrate strong potential as study aids, others fall short, and none are ready to replace traditional learning methods.
A Rigorous Test for Digital Tutors
The study aimed to assess the performance and reasoning quality of seven large language model (LLM)-based chatbots: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4.0, GPT-4o, and Claude 3.5 Sonnet. Under a strict testing protocol, each chatbot was queried three separate times per question to test response consistency.
Each of the 100 questions was written to match the content and format of the American Board of Endodontics (ABE) Written Examination and spanned both textbook-based and literature-based topics. The goal was not only to assess accuracy but also to evaluate the logical coherence and relevance of the AI’s reasoning behind each answer using a standardized rubric.
Performance Highlights: Who Got It Right?
The chatbots' accuracy ranged from 48% to 71%. The top performers, each reaching or exceeding the 70% threshold, were Gemini Advanced, GPT-3.5, GPT-4o, and Claude 3.5 Sonnet. At the other end of the spectrum, Microsoft Copilot lagged behind with just 48% accuracy, despite its real-time web access capabilities.
A notable trend emerged: All of the chatbots performed better on textbook-derived questions than on literature-based ones, and the gap was most pronounced for the GPT models and Claude. The authors speculated that the discrepancy may stem from these models' lack of access to proprietary academic literature in their training data.
Gemini Advanced distinguished itself further by delivering the highest-quality explanations, with 81% of its responses earning a top score for logical coherence and relevance. Microsoft Copilot again ranked lowest, with 42% of its responses scoring zero.
Why These Results Matter
For endodontic residents, educators, and practicing clinicians considering AI as a study aid, these results offer a nuanced picture. While some models show promise, particularly in reinforcing textbook knowledge and simulating exam-style reasoning, others fall short on both accuracy and explanatory depth.
While the chatbots can mimic understanding, their reasoning may not always be grounded in accurate clinical knowledge, the researchers noted. This highlights the need for close supervision, critical evaluation, and cross-referencing with trusted sources whenever AI is used in a learning context.
Microsoft Copilot’s disappointing performance raises questions about the value of real-time internet access in clinical AI tools. Despite theoretically having the most current data, Copilot struggled to contextualize and filter relevant information in endodontics, echoing findings from prior studies that showed similar shortcomings in medical question-answering.
Next Steps: Toward Smarter, Safer Integration
The findings support the idea that AI chatbots, particularly top-performing models like Gemini Advanced and GPT-4o, could play a role in board exam preparation, clinical decision support, and continuing education. However, their current limitations underscore the need for transparency about AI models' training data and for robust fact-checking.
As the authors note, these tools may eventually simulate oral board scenarios or generate high-quality practice questions—but for now, they remain best used as supplementary aids, not standalone solutions.