Ranking pre-trained speech embeddings in Parkinson's disease detection: Does Wav2Vec 2.0 outperform its 1.0 version across speech modes and languages?
Status PubMed-not-MEDLINE Language English Country Netherlands Media electronic-ecollection
Document type Journal Article
PubMed
40586101
PubMed Central
PMC12206144
DOI
10.1016/j.csbj.2025.06.022
PII: S2001-0370(25)00238-7
- Keywords
- Classification, Parkinson's disease, Speech modes, Wav2vec 1.0, Wav2vec 2.0
- Publication type
- Journal Article
Speech and language technologies are effective tools for identifying the distinct speech changes associated with Parkinson's disease (PD), enabling earlier and more accurate diagnosis. Models leveraging recent advances in self-supervised speech pretraining, such as Wav2Vec, have outperformed traditional feature extraction methods. While Wav2Vec 2.0 has been used successfully for PD detection, a rigorous quantitative comparison with Wav2Vec 1.0 is needed to evaluate its advantages, limitations, and applicability across different speech modes in PD. This study presents a systematic comparison of Wav2Vec 1.0 and Wav2Vec 2.0 embeddings across three multilingual datasets, using various classification approaches to distinguish the speech of healthy controls (HC) from PD-affected speech. Additionally, both Wav2Vec 1.0 and 2.0 were benchmarked against traditional baseline features across diverse linguistic contexts, including spontaneous speech, non-spontaneous speech, and isolated vowels. A multicriteria TOPSIS approach was employed to rank feature extraction methods, revealing that Wav2Vec 2.0 excelled across speech modes: its first transformer layer performed best for classifying read text and monologue, and its feature extractor performed best in vowel-based classification. Wav2Vec 1.0, while generally outperformed by Wav2Vec 2.0, still provided a more efficient alternative with competitive performance. Finally, we combined selected layers from both architectures and demonstrated improved diagnostic accuracy in vowel-based classification. This comparative analysis underscores the strengths of both Wav2Vec architectures and informs their optimal use in PD detection.
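The abstract names TOPSIS as the ranking method but does not give its criteria, weights, or normalisation. As a rough illustration only, the sketch below shows the textbook form of TOPSIS (vector normalisation, Euclidean distances to the ideal best and worst points) in Python with NumPy. The function name `topsis`, the three placeholder methods, the criteria, the equal weights, and all numeric values are invented for the example and are not taken from the study.

```python
import numpy as np


def topsis(scores, weights, benefit):
    """Return TOPSIS closeness coefficients (higher is better).

    scores  : alternatives in rows, criteria in columns
    weights : one non-negative weight per criterion
    benefit : True where a larger criterion value is better
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)

    # Vector-normalise each criterion column, then apply normalised weights.
    normalised = scores / np.linalg.norm(scores, axis=0)
    weighted = normalised * (weights / weights.sum())

    # The ideal point takes the column max for benefit criteria
    # and the column min for cost criteria; the anti-ideal is the reverse.
    col_max = weighted.max(axis=0)
    col_min = weighted.min(axis=0)
    ideal_best = np.where(benefit, col_max, col_min)
    ideal_worst = np.where(benefit, col_min, col_max)

    # Euclidean distance of every alternative to both reference points.
    d_best = np.linalg.norm(weighted - ideal_best, axis=1)
    d_worst = np.linalg.norm(weighted - ideal_worst, axis=1)

    # Closeness is 1 at the ideal point and 0 at the anti-ideal point.
    # (Undefined if all alternatives are identical; not handled here.)
    return d_worst / (d_best + d_worst)


# Hypothetical inputs: three placeholder methods scored on
# accuracy, AUC (both "higher is better") and extraction time ("lower is better").
methods = ["method_A", "method_B", "method_C"]
scores = [
    [0.80, 0.85, 1.0],
    [0.85, 0.90, 3.0],
    [0.70, 0.72, 0.5],
]
closeness = topsis(scores, weights=[1, 1, 1], benefit=[True, True, False])
ranking = np.argsort(-closeness)  # indices of methods, best first

for position, idx in enumerate(ranking, start=1):
    print(position, methods[idx], round(float(closeness[idx]), 3))
```

Treating extraction time as a cost criterion is one way to let a cheaper method rank competitively despite lower accuracy, which is the kind of trade-off the abstract describes for Wav2Vec 1.0; the actual criteria used in the study may differ.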