Generative pre-trained transformer 4o (GPT-4o) in solving text-based multiple response questions for European Diploma in Radiology (EDiR): a comparative study with radiologists
Status: PubMed-not-MEDLINE
Language: English
Country: Germany
Media: electronic
Document type: Journal Article
Grant support
MH CZ-DRO, Motol University Hospital, 00064203 and General University Hospital in Prague, 00064165
Ministerstvo Zdravotnictví České Republiky
Cooperatio, Medical Diagnostics and Basic Medical Sciences
Charles University in Prague
PubMed: 40120065
PubMed Central: PMC11929644
DOI: 10.1186/s13244-025-01941-7
PII: 10.1186/s13244-025-01941-7
Keywords: Artificial intelligence, Examination, Natural language processing, Radiology
Publication type: Journal Article (MeSH)
OBJECTIVES: This study aims to assess the accuracy of generative pre-trained transformer 4o (GPT-4o) in answering multiple-response questions from the European Diploma in Radiology (EDiR) examination, comparing its performance to that of human candidates.

MATERIALS AND METHODS: Results from 42 EDiR candidates across Europe were compared to those from 26 fourth-year medical students who answered exclusively using ChatGPT-4o in a prospective study (October 2024). The challenge consisted of 52 recall- or understanding-based EDiR multiple-response questions, all without visual inputs.

RESULTS: GPT-4o achieved a mean score of 82.1 ± 3.0%, significantly outperforming the EDiR candidates, who scored 49.4 ± 10.5% (p < 0.0001). In particular, ChatGPT-4o demonstrated higher true positive rates while maintaining lower false positive rates than EDiR candidates, with a higher accuracy rate in all radiology subspecialties (p < 0.0001) except informatics (p = 0.20). There was near-perfect agreement among GPT-4o responses (κ = 0.872) and moderate agreement among EDiR participants (κ = 0.334). Exit surveys revealed that all participants used the copy-and-paste feature, and 73% submitted additional questions to clarify responses.

CONCLUSIONS: GPT-4o significantly outperformed human candidates on low-order, text-based EDiR multiple-response questions, demonstrating higher accuracy and reliability. These results highlight GPT-4o's potential in answering text-based radiology questions. Further research is necessary to investigate its performance across different question formats and candidate populations to ensure broader applicability and reliability.

CRITICAL RELEVANCE STATEMENT: GPT-4o significantly outperforms human candidates on factual, text-based radiology questions in the EDiR, excelling especially at identifying correct responses, with a higher accuracy rate than radiologists.

KEY POINTS: On EDiR text-based questions, ChatGPT-4o scored higher (82%) than EDiR participants (49%). Compared to radiologists, GPT-4o excelled at identifying correct responses. GPT-4o responses demonstrated higher agreement (κ = 0.87) than EDiR candidates (κ = 0.33).
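The abstract reports response consistency as kappa values (κ = 0.872 among GPT-4o users versus κ = 0.334 among EDiR candidates) without describing the computation. The following is a minimal sketch of how such multi-rater agreement could be quantified with Fleiss' kappa, assuming each multiple-response option is treated as an item that every rater marks as selected or not; the data layout and the choice of Fleiss' kappa are illustrative assumptions, not the authors' published method.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (n_items x n_categories) count matrix.

    ratings[i, j] = number of raters who assigned item i to category j;
    every row must sum to the same number of raters.
    """
    n_items, _ = ratings.shape
    n_raters = ratings.sum(axis=1)[0]

    # Per-item observed agreement P_i and its mean P_bar.
    p_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement P_e from the marginal category proportions.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical toy example: 5 answer options, 10 raters,
# columns = [not selected, selected].
counts = np.array([
    [9, 1],
    [2, 8],
    [10, 0],
    [4, 6],
    [1, 9],
])
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```

In this layout, higher kappa indicates that raters (or repeated GPT-4o runs) tend to select and reject the same answer options beyond what chance agreement would predict.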