Comparison of various approaches to tagging for the inflectional Slovak language

2024; 10: e2026. [Epub 2024-05-24]

Status: PubMed-not-MEDLINE. Language: English. Country: United States. Medium: electronic-ecollection.

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid38855261

Morphological tagging provides essential insights into the grammar, structure, and mutual relationships of words within a sentence. Tagging text in a highly inflectional language is challenging because of word ambiguity. This research compares six automatic taggers for the inflectional Slovak language, seeking the most accurate tagger for literary and non-literary texts. Our results indicate that it is useful to classify texts as literary or non-literary and then to deploy a tagger based on the text style. For literary texts, UDPipe2 outperformed the others in seven out of nine examined tagset positions. Conversely, for non-literary texts, RNNTagger exhibited the highest performance in eight out of nine examined tagset positions. RNNTagger is recommended for both types of text because it best captures the inflection of the Slovak language, although UDPipe2 demonstrates higher accuracy for literary texts. Despite dataset size limitations, this study highlights the suitability of various taggers for inflectional languages like Slovak.
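The abstract reports results per tagset position, so a per-position accuracy computation may help make that comparison concrete. The following is a minimal sketch, not the authors' evaluation code. It assumes that each tag is a positional string in which the character at index i encodes one grammatical category, and that shorter tags can be padded with a placeholder. The function name `per_position_accuracy`, the padding character, and the sample tags are hypothetical illustrations.

```python
def per_position_accuracy(gold_tags, predicted_tags, n_positions=9, pad="-"):
    """Return, for each tag position, the share of tokens tagged correctly.

    Assumption: tags are positional strings; tags shorter than
    n_positions are right-padded with `pad` before comparison.
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag lists must have equal length")

    total = len(gold_tags)
    if total == 0:
        return [0.0] * n_positions

    correct = [0] * n_positions
    for gold, pred in zip(gold_tags, predicted_tags):
        # Normalise both tags to exactly n_positions characters.
        g = gold.ljust(n_positions, pad)[:n_positions]
        p = pred.ljust(n_positions, pad)[:n_positions]
        for i in range(n_positions):
            if g[i] == p[i]:
                correct[i] += 1

    return [c / total for c in correct]


if __name__ == "__main__":
    # Hypothetical tags for two tokens; only the fifth position differs.
    gold = ["SSfs1", "VKesc+"]
    pred = ["SSfs4", "VKesc+"]
    print(per_position_accuracy(gold, pred, n_positions=6))
```

Running the same function on the output of each tagger and comparing the resulting lists position by position would give the kind of "best in k out of nine positions" summary the abstract describes, under the stated assumptions about the tag format.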

