Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency
Language: English. Country: Great Britain, England. Medium: electronic
Document type: journal article
Grant support
Cooperatio, Medical Diagnostics and Basic Medical Sciences
Charles University in Prague
MH CZ-DRO, Motol University Hospital, 00064203
Ministerstvo Zdravotnictví České Republiky
PubMed: 40830600
PubMed Central: PMC12364795
DOI: 10.1186/s41747-025-00591-0
PII: 10.1186/s41747-025-00591-0
- Keywords
- Artificial intelligence, Education (medical), Educational measurement, European Diploma in Radiology, Radiology
- MeSH
- Generative Artificial Intelligence MeSH
- Humans MeSH
- Radiology * education MeSH
- Educational Measurement * methods MeSH
- Check Tag
- Humans MeSH
- Publication Type
- Journal Article MeSH
- Geographical Names
- Europe MeSH
BACKGROUND: We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.
METHODS: ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested on 52 text-based multiple-response questions from two previous EDiR sessions, in two iterations. The chatbots were prompted to evaluate each answer as correct or incorrect and to grade their confidence level on a scale of 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0-1.0).
RESULTS: Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation), compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence in answering the questions was highest for Claude 3.5 Sonnet (9.0 ± 0.9), followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet demonstrated superior consistency, changing responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade on this part of the examination.
CONCLUSION: Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.
RELEVANCE STATEMENT: The variation in performance, consistency, and confidence among chatbots solving EDiR text-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.
KEY POINTS: Claude 3.5 Sonnet outperformed the other chatbots in accuracy and response consistency. ChatGPT-4o ranked second, showing strong but slightly less reliable performance. All chatbots surpassed EDiR candidates on text-based EDiR questions.
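The abstract reports per-question scores from a weighted formula over correct and incorrect answer judgements (range 0.0-1.0) and consistency as the share of responses changed between two iterations. The study's exact weighting is not reproduced in the abstract, so the sketch below assumes a simple symmetric scheme (equal credit for each option judged in agreement with the answer key, an equal penalty for each disagreement, floored at zero); the function names, answer key, and run data are hypothetical and purely illustrative.

```python
from typing import List

def question_score(verdicts: List[bool], key: List[bool]) -> float:
    """Hypothetical weighted score in [0.0, 1.0]: each option judged in
    agreement with the answer key adds credit, each disagreement subtracts
    it, and the result is floored at 0.0. The study's actual formula may
    weight options differently."""
    n = len(key)
    correct = sum(v == k for v, k in zip(verdicts, key))
    incorrect = n - correct
    return max(0.0, (correct - incorrect) / n)

def change_rate(first: List[List[bool]], second: List[List[bool]]) -> float:
    """Fraction of option-level verdicts that differ between two iterations,
    i.e. the consistency measure reported as a percentage in the results."""
    changed = total = 0
    for run1_q, run2_q in zip(first, second):
        for v1, v2 in zip(run1_q, run2_q):
            total += 1
            changed += v1 != v2
    return changed / total if total else 0.0

# Illustrative values only (not taken from the study):
key = [True, False, True, True, False]
run1 = [True, False, False, True, False]   # one option misjudged
run2 = [True, False, True, True, False]    # fully correct on re-query
print(round(question_score(run1, key), 2))   # 0.6
print(round(question_score(run2, key), 2))   # 1.0
print(round(change_rate([run1], [run2]), 2)) # 0.2
```

Under this assumed scheme, a response that agrees with the key on every option scores 1.0, while one that is wrong on half or more of the options scores 0.0, matching the 0.0-1.0 range described in the methods.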
Era Radiology Center, Izmir, Turkey
European Board of Radiology, Av. Diagonal 383, L'Eixample, 08008 Barcelona, Spain