Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases
Status: Publisher. Language: English. Country: Netherlands. Medium: print-electronic
Document type: journal article
PubMed: 41092581
PubMed Central: PMC12552141
DOI: 10.1016/j.ebiom.2025.105957
PII: S2352-3964(25)00401-3
- Keywords
- Artificial intelligence, Genomic diagnostics, Global Alliance for Genomics and Health, Human phenotype ontology, Large language model, Phenopacket schema
- Publication type
- Journal article (MeSH)
BACKGROUND: Large language models (LLMs) are increasingly used in medicine for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) consist predominantly of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models across a variety of European and non-European languages, on a comprehensive corpus of challenging rare-disease cases, is lacking.

METHODS: We created 4917 clinical vignettes from structured data captured with Human Phenotype Ontology (HPO) terms in the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span 360 distinct genetic diseases with 2525 associated phenotypic features. We used translations of the HPO together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, French, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o (version gpt-4o-2024-08-06) and the medically fine-tuned Meditron3-70B to the task of delivering a ranked differential diagnosis from a zero-shot prompt. An ontology-based approach using the Mondo disease ontology was used to map synonyms and to map disease subtypes to clinical diagnoses in order to automate evaluation of LLM responses.

FINDINGS: For English, GPT-4o placed the correct diagnosis at the first rank in 19.9% of cases and within the top three ranks in 27.0%. For the nine non-English languages tested here, the correct diagnosis was placed at rank 1 in between 16.9% and 20.6% of cases, and within the top three in between 25.4% and 28.6%. The Meditron3-70B model placed the correct diagnosis within the first three ranks in 20.9% of cases in English and in between 19.9% and 24.0% for the other nine languages.

INTERPRETATION: The differential diagnostic performance of LLMs across a comprehensive corpus of rare-disease cases was largely consistent across the ten languages tested, suggesting that the utility of LLMs for diagnostic support may extend to non-English clinical settings.

FUNDING: NHGRI 5U24HG011449, 5RM1HG010860, R01HD103805 and R24OD011883. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER). C.M., J.R. and J.H.C. were supported in part by the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy (Contract No. DE-AC02-05CH11231).
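The benchmarking loop described in METHODS can be pictured with a minimal sketch in Python. The prompt templates, the synonym/subtype map, the example case, and all function names below are invented for illustration and are not the study's actual prompts or code; the real pipeline built prompts from translated HPO labels in phenopackets and graded ranked answers with the Mondo disease ontology.

```python
"""Illustrative sketch (not the authors' code) of the evaluation described in METHODS:
fill a language-specific zero-shot template with translated HPO term labels, then grade
a ranked differential diagnosis against the expected disease via an ontology-derived
synonym/subtype map. All data below are invented examples."""

# Hypothetical language-specific prompt templates (assumed wording).
TEMPLATES = {
    "en": "A patient presents with the following findings: {features}. "
          "Provide a ranked differential diagnosis.",
    "es": "Un paciente presenta los siguientes hallazgos: {features}. "
          "Proporcione un diagnóstico diferencial ordenado por probabilidad.",
}

def build_prompt(language: str, hpo_labels: list[str]) -> str:
    """Fill the zero-shot template with (translated) HPO term labels."""
    return TEMPLATES[language].format(features="; ".join(hpo_labels))

# Toy stand-in for the Mondo-based mapping of synonyms and disease subtypes
# to a single clinical diagnosis (the study derived this from the Mondo ontology).
MONDO_EQUIVALENTS = {
    "marfan syndrome": "MONDO:0007947",
    "marfan's syndrome": "MONDO:0007947",
    "marfan syndrome type 1": "MONDO:0007947",
}

def rank_of_correct(ranked_answers: list[str], true_mondo_id: str) -> int | None:
    """Return the 1-based rank at which the differential first names the correct
    disease after synonym/subtype normalisation, or None if it is absent."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if MONDO_EQUIVALENTS.get(answer.strip().lower()) == true_mondo_id:
            return rank
    return None

if __name__ == "__main__":
    prompt = build_prompt("en", ["Ectopia lentis", "Aortic root aneurysm",
                                 "Arachnodactyly", "Tall stature"])
    print(prompt)
    # A ranked differential as an LLM might return it (invented example).
    model_output = ["Homocystinuria", "Marfan syndrome", "Loeys-Dietz syndrome"]
    print("Correct diagnosis at rank:",
          rank_of_correct(model_output, "MONDO:0007947"))
```

In this picture, the top-1 and top-3 percentages reported in FINDINGS correspond to the fraction of vignettes for which the returned rank is 1, or at most 3, respectively.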
Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Berlin, Germany
Chinese HPO Consortium, Beijing, China
Department of Human Genetics, Bioscientia Healthcare GmbH, Ingelheim, Germany
Department of Ophthalmology, University Clinic Marburg, Campus Fulda, Fulda, Germany
Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
Deutsches Herzzentrum der Charité, Berlin, Germany
INGEM, ToMMo, Tohoku University, Miyagi, Japan
INGEMM Idipaz, Institute of Medical and Molecular Genetics, Hospital Universitario La Paz, Madrid, Spain
Institute for Maternal and Child Health, IRCCS Burlo Garofolo Trieste, Trieste 34137, Italy
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Trinity College, Hartford, CT, USA
Medical University of Gdansk, ul. M. Skłodowskiej-Curie 3a, 80-210 Gdańsk, Poland
The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA