Comparison of various approaches to tagging for the inflectional Slovak language
Status PubMed-not-MEDLINE Language English Country United States Medium electronic-ecollection
Document type journal articles
PubMed
38855261
PubMed Central
PMC11157559
DOI
10.7717/peerj-cs.2026
PII: cs-2026
- Keywords
- Automatic taggers, Low-resource language, Morphological annotation, Part-of-speech tagging, Slovak language
- Publication type
- journal articles
Morphological tagging provides essential insights into the grammar, structure, and mutual relationships of words within a sentence. Tagging text in a highly inflectional language is challenging because of word ambiguity. This research compares six automatic taggers for the inflectional Slovak language to find the most accurate tagger for literary and non-literary texts. Our results indicate that it is useful to distinguish literary from non-literary texts and then choose a tagger based on the text style. For literary texts, UDPipe2 outperformed the others in seven of the nine examined tagset positions. Conversely, for non-literary texts, RNNTagger achieved the highest performance in eight of the nine examined tagset positions. RNNTagger is recommended for both text types because it best captures the inflection of the Slovak language, although UDPipe2 is more accurate for literary texts. Despite the limited dataset size, this study highlights the suitability of various taggers for inflectional languages such as Slovak.
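The position-by-position comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes fixed-length positional tags in which each character encodes one morphological category; the function name `per_position_accuracy`, the four-character toy tags, and the padding character are hypothetical and are not taken from the study, which examined nine positions with its own tagset.

```python
from typing import List, Sequence


def per_position_accuracy(
    gold: Sequence[str], predicted: Sequence[str], n_positions: int
) -> List[float]:
    """Return the share of tokens tagged correctly at each tag position.

    Assumption: tags are positional strings; shorter tags are padded
    with '-' so that every tag has exactly n_positions characters.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return [0.0] * n_positions

    correct = [0] * n_positions
    for g_tag, p_tag in zip(gold, predicted):
        # Normalise both tags to the same fixed width.
        g = g_tag.ljust(n_positions, "-")[:n_positions]
        p = p_tag.ljust(n_positions, "-")[:n_positions]
        for i in range(n_positions):
            if g[i] == p[i]:
                correct[i] += 1

    return [c / len(gold) for c in correct]


# Toy data, invented for illustration only (not real tagger output).
gold_tags = ["NFS1", "VB-3", "AFS1"]
tagger_output = ["NFS4", "VB-3", "AMS1"]
scores = per_position_accuracy(gold_tags, tagger_output, n_positions=4)
```

Running the same function on the output of each tagger and comparing the resulting lists position by position would show in how many positions one tagger beats another, which is the kind of count the abstract reports.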
Department of Computer Science Constantine the Philosopher University in Nitra Nitra Slovakia
Science and Research Centre University of Pardubice Pardubice Czech Republic