Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency

Eur Radiol Exp. 2025 Aug 19;9(1):79. Epub 2025 Aug 19.

Language: English. Country: Great Britain, England. Medium: electronic

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid40830600

Grant support
Cooperatio, Medical Diagnostics and Basic Medical Sciences, Charles University in Prague
MH CZ-DRO, Motol University Hospital, 00064203, Ministerstvo Zdravotnictví Ceské Republiky

Links

PubMed 40830600
PubMed Central PMC12364795
DOI 10.1186/s41747-025-00591-0
PII: 10.1186/s41747-025-00591-0

BACKGROUND: We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.

METHODS: ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested on 52 text-based multiple-response questions from two previous EDiR sessions, in two iterations. The chatbots were prompted to evaluate each answer as correct or incorrect and to grade their confidence on a scale from 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0-1.0).

RESULTS: Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation), compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence in answering the questions was 9.0 ± 0.9 for Claude 3.5 Sonnet, followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet demonstrated superior consistency, changing its responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade in this part of the examination.

CONCLUSION: Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.

RELEVANCE STATEMENT: The variation in performance, consistency, and confidence among chatbots solving EDiR text-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.

KEY POINTS: Claude 3.5 Sonnet outperformed the other chatbots in accuracy and response consistency. ChatGPT-4o ranked second, showing strong but slightly less reliable performance. All chatbots surpassed EDiR candidates on text-based EDiR questions.
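
For readers interested in how such an evaluation could be reproduced, the Python sketch below illustrates one way a weighted per-question score and a between-iteration change rate might be computed. This is a minimal sketch under stated assumptions: the abstract does not disclose the study's weighted formula, so the scoring rule, the Response class, and the helper names (question_score, change_rate) are hypothetical illustrations, not the authors' actual method.

```python
from dataclasses import dataclass
from typing import Dict, List

# Illustrative sketch only: the abstract mentions a weighted per-question scoring
# formula (range 0.0-1.0) and a between-iteration consistency measure without
# specifying either, so the weighting below (share of answer options judged in
# agreement with the key) is an assumed stand-in, not the study's actual formula.

@dataclass
class Response:
    """One chatbot response to a multiple-response question."""
    judgements: Dict[str, bool]  # option label -> judged correct (True) / incorrect (False)
    confidence: int              # self-reported confidence, 0 (not at all) to 10 (most confident)

def question_score(response: Response, key: Dict[str, bool]) -> float:
    """Assumed weighting: fraction of options classified in line with the answer key."""
    hits = sum(response.judgements.get(opt) == truth for opt, truth in key.items())
    return hits / len(key)  # falls in the stated 0.0-1.0 range

def change_rate(first: List[Response], second: List[Response]) -> float:
    """Share of per-option judgements that differ between the two test iterations."""
    changed = total = 0
    for r1, r2 in zip(first, second):
        for opt, judged in r1.judgements.items():
            total += 1
            changed += judged != r2.judgements.get(opt)
    return changed / total if total else 0.0

# Example with a hypothetical four-option question (options and key are made up):
key = {"A": True, "B": False, "C": True, "D": False}
run1 = Response({"A": True, "B": False, "C": False, "D": False}, confidence=9)
run2 = Response({"A": True, "B": False, "C": True, "D": False}, confidence=9)
print(question_score(run1, key))    # 0.75 under the assumed weighting
print(change_rate([run1], [run2]))  # 0.25 (one of four judgements changed)
```

Under these assumptions, a change rate of 0.054 would correspond to the 5.4% of changed responses reported for Claude 3.5 Sonnet, although the paper's exact definition of a changed response may differ.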

