Determining the reliability of Artificial Intelligence programs in medical faculty board exams
Artificial intelligence in medical school
DOI:
https://doi.org/10.12669/pjms.41.12.12855
Keywords:
Artificial Intelligence, ChatGPT, Gemini, Copilot, Anatomy.
Abstract
Objective: In recent years, artificial intelligence (AI) applications have become widespread in many fields, including medical education. This study aims to examine the reliability of widely used generative AI programs (Microsoft Copilot, Google Gemini, and OpenAI ChatGPT) by comparing their accuracy on anatomy questions from first- and second-year medical mid-term board, final, and make-up exams with the accuracy of the students' own responses.
Methodology: A total of 286 anatomy questions from the 2023-2024 academic year were reviewed, of which 222 with analysis reports were included in the study. The questions were divided into four difficulty levels (very difficult, difficult, medium, easy) based on the students' correct-answer rates. The same questions were then posed to the three AI applications. The data were analyzed using SPSS version 27.
Results: According to the findings, Copilot, ChatGPT, and Gemini achieved significantly higher accuracy than the students, at 97.7%, 94.4%, and 86.5%, respectively. However, Gemini and ChatGPT performed similarly to the students on very difficult questions. Gemini also performed worse on questions requiring basic knowledge (first year) than on questions requiring clinical interpretation (second year).
Conclusion: Although the study found that AI applications achieve higher accuracy than students, systems that fall short of 100% accuracy are not suitable for unrestricted, unsupervised use in critical basic medical sciences such as anatomy. Because AI models lack clinical reasoning and human experience, they should be used only as supplementary educational tools and integrated in a controlled manner to enhance student success.




