BACKGROUND: Interpretable machine learning (ML) for early detection of cancer has the potential to improve risk assessment and early intervention. METHODS: Levels of 261 proteins related to inflammation and/or tumor processes were analyzed in 123 blood samples collected from healthy persons, a subgroup of whom later developed squamous cell carcinoma of the oral tongue (SCCOT). Samples from people who developed SCCOT in less than 5 years were classified as tumor-to-be and all other samples as tumor-free. The optimal ML algorithm for feature selection was identified, and feature importance was computed by the SHapley Additive exPlanations (SHAP) method. Five popular ML algorithms (AdaBoost, Artificial Neural Networks [ANNs], Decision Tree [DT], eXtreme Gradient Boosting [XGBoost], and Support Vector Machine [SVM]) were applied to establish prediction models, and decisions of the optimal models were interpreted by SHAP. RESULTS: Using the 22 selected features, the SVM prediction model showed the best performance (sensitivity = 0.867, specificity = 0.859, balanced accuracy = 0.863, area under the receiver operating characteristic curve [ROC-AUC] = 0.924). SHAP analysis revealed that the 22 features had varying person-specific impacts on the model decision and that the top three contributors to prediction were Interleukin 10 (IL10), TNF Receptor Associated Factor 2 (TRAF2), and Kallikrein Related Peptidase 12 (KLK12). CONCLUSION: Using multidimensional plasma protein analysis and interpretable ML, we outline a systematic approach for early detection of SCCOT before the appearance of clinical signs.
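The balanced accuracy reported above follows from the stated sensitivity and specificity, since for a binary classifier it is their arithmetic mean. The short sketch below checks that relationship; the function name `balanced_accuracy` is a hypothetical helper chosen for illustration and is not taken from the study.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Balanced accuracy of a binary classifier: the mean of sensitivity and specificity.

    Hypothetical helper for illustration only; not code from the study.
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    return (sensitivity + specificity) / 2.0


# Sensitivity and specificity as reported in the abstract for the SVM model.
reported = balanced_accuracy(0.867, 0.859)
print(f"{reported:.3f}")  # prints 0.863
```

The result agrees with the balanced accuracy of 0.863 given in the abstract.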
- MeSH
- Tongue MeSH
- Blood Proteins MeSH
- Humans MeSH
- Tongue Neoplasms * diagnosis MeSH
- Carcinoma, Squamous Cell * diagnosis MeSH
- Machine Learning MeSH
- Ubiquitin-Protein Ligases MeSH
- Check Tag
- Humans MeSH
- Publication Type
- Journal Article MeSH
BACKGROUND: Patients with squamous cell carcinoma of the head and neck (SCCHN) have a high risk of recurrence. We aimed to develop machine learning methods to identify transcriptomic and proteomic features that provide accurate classification models for predicting risk of early recurrence in SCCHN patients. METHODS: Clinical, genomic, transcriptomic and proteomic features distinguishing recurrence risk were examined in SCCHN patients from The Cancer Genome Atlas (TCGA). Recurrence within one year after treatment was classified as high-risk and no recurrence as low-risk. RESULTS: No significant differences in individual clinicopathological characteristics, mutation profiles or mRNA expression patterns were seen between the groups using conventional statistical analysis. Using the machine learning algorithm extreme gradient boosting (XGBoost), ten proteins (RAD50, 4E-BP1, MYH11, MAP2K1, BECN1, NF2, RAB25, ERRFI1, KDR, SERPINE1) and five mRNAs (PLAUR, DKK1, AXIN2, ANG and VEGFA) were found to make the greatest contribution to classification. These features were used to build improved models in XGBoost, which achieved the best discrimination performance when transcriptomic and proteomic data were combined, with an accuracy of 0.939 and an Area Under the ROC Curve (AUC) of 0.951. CONCLUSIONS: This study highlights the use of machine learning to identify transcriptomic and proteomic factors that play important roles in predicting risk of recurrence in patients with SCCHN and to develop such models by iterative cycles to enhance their accuracy, thereby aiding the introduction of personalized treatment regimens.
- MeSH
- Squamous Cell Carcinoma of Head and Neck genetics MeSH
- Humans MeSH
- RNA, Messenger genetics MeSH
- Head and Neck Neoplasms * genetics MeSH
- Proteomics MeSH
- rab GTP-Binding Proteins genetics MeSH
- Carcinoma, Squamous Cell * genetics MeSH
- Transcriptome genetics MeSH
- Check Tag
- Humans MeSH
- Publication Type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
As early detection is crucial for improvement of cancer prognosis, we searched for biomarkers in plasma from individuals who later developed squamous cell carcinoma of the oral tongue (SCCOT) as well as in patients with an already established SCCOT. Levels of 261 proteins related to inflammation and/or tumor processes were measured using the proximity extension assay (PEA) in 179 plasma samples (42 collected before diagnosis of SCCOT with 81 matched controls; 28 collected at diagnosis of SCCOT with 28 matched controls). The statistical modeling tools principal component analysis (PCA) and orthogonal partial least squares discriminant analysis (OPLS-DA) were applied to provide insights into separations between groups. PCA models failed to separate SCCOT patients from controls based on protein levels in samples taken prior to diagnosis or at the time of diagnosis. For pre-diagnostic samples and their controls, no significant OPLS-DA model was identified. Four proteins showed potential for separating pre-diagnostic samples collected up to five years before diagnosis (n = 15) from matched controls (n = 28). For diagnostic samples and controls, the OPLS-DA model indicated that 21 proteins were important for group separation. TNF receptor associated factor 2 (TRAF2), which was decreased in pre-diagnostic plasma (< 5 years) but increased at diagnosis, was the only protein showing altered levels both before and at diagnosis of SCCOT (p-value < 0.05). Taken together, changes in plasma protein profiles were evident at diagnosis but not reliably detectable in pre-diagnostic samples taken before clinical signs of tumor development. Variation in protein levels during cancer development poses a challenge for the identification of biomarkers that could predict SCCOT development.
- Publication Type
- Journal Article MeSH