Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases

eBioMedicine. 2025 Oct 14;121:105957. [epub] 2025-10-14

Status: Publisher · Language: English · Country: Netherlands · Medium: print-electronic

Document type: journal article

Persistent link   https://www.medvik.cz/link/pmid41092581
Links

PubMed 41092581
PubMed Central PMC12552141
DOI 10.1016/j.ebiom.2025.105957
PII: S2352-3964(25)00401-3

BACKGROUND: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) predominantly consist of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models across a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases has been lacking.

METHODS: We created 4917 clinical vignettes from structured data captured as Human Phenotype Ontology (HPO) terms in the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span 360 distinct genetic diseases with 2525 associated phenotypic features. We used translations of the Human Phenotype Ontology together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o (version gpt-4o-2024-08-06) and the medically fine-tuned Meditron3-70B to the task of delivering a ranked differential diagnosis from a zero-shot prompt. An ontology-based approach with the Mondo disease ontology was used to map synonyms and to map disease subtypes to clinical diagnoses, automating the evaluation of LLM responses.

FINDINGS: For English, GPT-4o placed the correct diagnosis at the first rank in 19.9% of cases and within the top three ranks in 27.0%. In comparison, for the nine non-English languages tested here, the correct diagnosis was placed at rank 1 in between 16.9% and 20.6% of cases, and within the top three in between 25.4% and 28.6%. The Meditron3 model placed the correct diagnosis within the first three ranks in 20.9% of cases for English and in between 19.9% and 24.0% for the other nine languages.

INTERPRETATION: The differential diagnostic performance of LLMs across a comprehensive corpus of rare-disease cases was largely consistent across the ten languages tested. This suggests that the utility of LLMs in clinical settings may extend to non-English clinical settings.

FUNDING: NHGRI 5U24HG011449, 5RM1HG010860, R01HD103805 and R24OD011883. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER). C.M., J.R. and J.H.C. were supported in part by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy (Contract No. DE-AC0205CH11231).
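The evaluation pipeline described in METHODS can be illustrated with a minimal sketch. All names, templates, and identifiers below are hypothetical illustrations, not the authors' actual code: a zero-shot prompt is assembled from a case's translated HPO term labels via a language-specific template, and the resulting ranked differential is scored with top-k accuracy against the known diagnosis.

```python
# Hypothetical sketch of the study's evaluation idea (not the published code):
# build a zero-shot prompt from translated HPO term labels, then score ranked
# differentials with top-k accuracy.

# Illustrative language-specific templates (the study's actual templates differ).
TEMPLATES = {
    "en": "The patient presents with: {features}. Provide a ranked differential diagnosis.",
    "de": "Der Patient zeigt: {features}. Geben Sie eine gereihte Differentialdiagnose an.",
}

def build_prompt(hpo_labels, language="en"):
    """Join translated HPO term labels into a zero-shot prompt."""
    return TEMPLATES[language].format(features="; ".join(hpo_labels))

def top_k_accuracy(ranked_lists, correct_diagnoses, k=3):
    """Fraction of cases whose correct diagnosis appears within the first k ranks.

    The study used the Mondo disease ontology to map synonyms and disease
    subtypes before comparison; here we assume diagnoses are already
    normalised to comparable identifiers.
    """
    hits = sum(
        1
        for ranking, truth in zip(ranked_lists, correct_diagnoses)
        if truth in ranking[:k]
    )
    return hits / len(ranked_lists)

# Toy usage with made-up Mondo identifiers.
prompt = build_prompt(["Seizure", "Global developmental delay"], "en")
acc = top_k_accuracy([["MONDO:0010726", "MONDO:0019499"]], ["MONDO:0019499"], k=3)
```

Scoring against ontology-normalised identifiers rather than free-text disease names is what makes fully automated evaluation of 4917 cases in ten languages feasible.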

Berlin Institute of Health at Charité Universitätsmedizin Berlin Berlin Germany

Berlin Institute of Health at Charité Universitätsmedizin Berlin Berlin Germany; The Jackson Laboratory for Genomic Medicine Farmington CT USA

Berlin Institute of Health at Charité Universitätsmedizin Berlin Berlin Germany; Utrecht University Utrecht Netherlands

Chinese HPO Consortium Beijing China

Department of Biology and Medical Genetics 2nd Faculty of Medicine Charles University Prague and Motol University Hospital Prague Czech Republic

Department of Human Genetics Bioscientia Healthcare GmbH Ingelheim Germany

Department of Human Genetics Donders Institute for Brain Cognition and Behaviour Radboud University Medical Center Nijmegen the Netherlands

Department of Ophthalmology University Clinic Marburg Campus Fulda Fulda Germany

Department of Pathology and Laboratory Medicine University of Pennsylvania Philadelphia PA USA

Deutsches Herzzentrum der Charité Berlin Germany

INGEM ToMMo Tohoku University Miyagi Japan

INGEMM Idipaz Institute of Medical and Molecular Genetics Hospital Universitario La Paz Madrid Spain

INGEMM Idipaz Institute of Medical and Molecular Genetics Hospital Universitario La Paz Madrid Spain; CIBERER Centro de Investigación Biomédica en Red de Enfermedades Raras Instituto de Salud Carlos III Madrid Spain

Institute for Maternal and Child Health IRCCS Burlo Garofolo Trieste Trieste 34137 Italy

Lawrence Berkeley National Laboratory Berkeley CA USA

Lawrence Berkeley National Laboratory Berkeley CA USA; Trinity College Hartford CT USA

Medical University of Gdansk ul M Skłodowskiej Curie 3a 80 210 Gdańsk Poland

Semanticly Athens Greece

The Jackson Laboratory for Genomic Medicine Farmington CT USA

University of North Carolina at Chapel Hill Chapel Hill NC USA

William Harvey Research Institute Barts and the London School of Medicine and Dentistry Queen Mary University of London London UK

