• This record comes from PubMed

Whole exome sequencing and machine learning germline analysis of individuals presenting with extreme phenotypes of high and low risk of developing tobacco-associated lung adenocarcinoma

2024 Apr; 102: 105048. [epub] 20240313

Language English Country Netherlands Media print-electronic

Document type Journal Article

Links

PubMed 38484556
PubMed Central PMC10955643
DOI 10.1016/j.ebiom.2024.105048
PII: S2352-3964(24)00083-5

BACKGROUND: Tobacco is the main risk factor for developing lung cancer. Yet, while some heavy smokers develop lung cancer at a young age, other heavy smokers never develop it, even at an advanced age, suggesting a remarkable variability in individual susceptibility to the carcinogenic effects of tobacco. We characterized the germline profile of subjects presenting these extreme phenotypes with Whole Exome Sequencing (WES) and Machine Learning (ML).

METHODS: We sequenced germline DNA from heavy smokers who either developed lung adenocarcinoma at an early age (extreme cases) or who did not develop lung cancer at an advanced age (extreme controls), selected from databases including over 6600 subjects. We selected individual coding genetic variants and variant-rich genes showing a significantly different distribution between extreme cases and controls. We validated the results from our discovery cohort in a validation cohort, in which we analysed by WES extreme cases and controls presenting similar phenotypes. We developed ML models using both cohorts.

FINDINGS: Mean age for extreme cases and controls was 50.7 and 79.1 years, respectively, and mean tobacco consumption was 34.6 and 62.3 pack-years. We validated 16 individual variants and 33 variant-rich genes. The gene harbouring the most validated variants was HLA-A in extreme controls (4 variants in the discovery cohort, p = 3.46E-07; and 4 in the validation cohort, p = 1.67E-06). We trained ML models using as input the 16 individual variants in the discovery cohort and tested them on the validation cohort, obtaining an accuracy of 76.5% and an AUC-ROC of 83.6%. Functions of validated genes included candidate oncogenes, tumour-suppressors, DNA repair, HLA-mediated antigen presentation and regulation of proliferation, apoptosis, inflammation and immune response.

INTERPRETATION: Individuals presenting extreme phenotypes of high and low risk of developing tobacco-associated lung adenocarcinoma show different germline profiles. Our strategy may allow the identification of high-risk subjects and the development of new therapeutic approaches.

FUNDING: See a detailed list of funding bodies in the Acknowledgements section at the end of the manuscript.
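For readers unfamiliar with the two performance measures reported in the findings, the following is a minimal sketch of how accuracy and AUC-ROC can be computed for a binary case/control classifier. It is not the authors' code: the function names, the 0.5 decision threshold, and the labels and scores are all made up for illustration, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation rather than any particular library.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of subjects whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random case outscores a random control.

    Ties between a case and a control count as one half (Mann-Whitney form).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = extreme case, 0 = extreme control.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical model scores (higher = more case-like).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.1]
# Assumed decision threshold of 0.5 to turn scores into labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"AUC-ROC  = {auc_roc(y_true, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is why the two figures in an abstract such as this one can differ.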

Bioinformatics Platform Cima and IdisNA University of Navarra Pamplona Spain

CIMA LAB Diagnostics and IdisNA University of Navarra Pamplona Spain

Computational Biology Program Cima Data Science and Artificial Intelligence Institute CCUN IdisNA and CIBERONC University of Navarra Pamplona Spain

Department of Biotechnology Universitat Politècnica de València Unidad Mixta TRIAL and CIBERONC Valencia Spain

Department of Medical Oncology Hospital General Universitario de Valencia Unidad Mixta TRIAL Valencia Spain

Department of Medical Oncology Hospital La Luz Quirón Madrid Spain

Department of Oncology CUN CCUN and IdisNA University of Navarra Pamplona Spain

Department of Oncology CUN CCUN IdisNA and CIBERONC University of Navarra Pamplona Spain

Department of Oncology CUN Division of Immunology Cima CCUN IdisNA and CIBERONC University of Navarra Pamplona Spain

Department of Pediatrics and Clinical Genetics Clínica Universidad de Navarra University of Navarra Pamplona Spain

Department of Radiology CUN CCUN and IdisNA Pamplona Spain

Division of Immunology Cima and Immunotherapy CUN CCUN IdisNA and CIBERONC University of Navarra Pamplona Spain

Electrical and Electronic Engineering Department Tecnun DATAI University of Navarra San Sebastian Spain

Electrical and Electronic Engineering Department Tecnun University of Navarra San Sebastian Spain

Institute for Clinical Chemistry and Laboratory Medicine Mildred Scheel Early Career Center National Center for Tumor Diseases Dresden University Hospital and Faculty of Medicine Medical Clinic 1 University Hospital Carl Gustav Carus Technische Universität Dresden Dresden Germany; Laboratory of Cancer Cell Biology Institute of Molecular Genetics of the Czech Academy of Sciences Prague Czech Republic

Program in Solid Tumors Cima CCUN Department of Biochemistry and Genetics School of Science IdisNA and CIBERONC University of Navarra Pamplona Spain

Program in Solid Tumors Cima Department of Pathology Anatomy and Physiology Schools of Medicine and Sciences CCUN IdisNA and CIBERONC University of Navarra Pamplona Spain

Pulmonary Critical Care and Sleep Division Mount Sinai Morningside Hospital New York USA

Pulmonary Department CUN CCUN and Centro de Investigación Biomédica en Red de Enfermedades Respiratorias University of Navarra Madrid Spain

Pulmonary Department CUN CCUN and IdisNA University of Navarra Pamplona Spain
