Text normalization for named entity recognition in Vietnamese tweets
Status: PubMed-not-MEDLINE
Language: English
Country: Germany
Medium: print-electronic
Document type: journal article
PubMed: 29355207
PubMed Central: PMC5749168
DOI: 10.1186/s40649-016-0032-0
PII: 32
- Keywords
- Named entity recognition, Spelling error detection and correction, Text normalization
- Publication type
- Journal articles
BACKGROUND: Named entity recognition (NER) is the task of detecting named entities in documents and categorizing them into predefined classes, such as person, location, and organization. This paper focuses on tweets posted on Twitter. Because tweets are noisy, irregular, and brief, and contain acronyms and spelling errors, NER on tweets is a challenging task. Many approaches have been proposed for tweets written in English, German, Chinese, and other languages, but none for Vietnamese tweets.

METHODS: We propose a method that normalizes a tweet before passing it as input to a learning model for NER in Vietnamese tweets. The normalization step detects spelling errors in a tweet and corrects them using an improved Dice's coefficient or n-grams. A Support Vector Machine learning algorithm is employed to learn a classifier using six different types of features.

RESULTS AND CONCLUSION: We train our method on a training set of more than 40,000 named entities and evaluate it on a testing set of 3,186 named entities. The experimental results show that our system achieves state-of-the-art performance, with an F1 score of 82.13%.
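To make the normalization idea concrete, the following is a minimal sketch of spelling correction based on the standard Dice coefficient over character bigrams. It is not the authors' improved variant, whose details the abstract does not give. The function names (`char_bigrams`, `dice`, `correct`), the toy lexicon of unaccented syllables, and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import Counter


def char_bigrams(word):
    """Return the multiset of character bigrams of a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def dice(a, b):
    """Standard Dice coefficient on bigram multisets: 2|X & Y| / (|X| + |Y|)."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are too short to have bigrams; fall back to equality.
        return 1.0 if a == b else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def correct(token, lexicon, threshold=0.5):
    """Return the most similar lexicon entry, or the token if none is close.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    if not lexicon or token in lexicon:
        return token
    best = max(lexicon, key=lambda w: dice(token, w))
    return best if dice(token, best) >= threshold else token


if __name__ == "__main__":
    toy_lexicon = ["nguoi", "truong", "thanh"]  # illustrative entries only
    print(correct("nguoii", toy_lexicon))
    print(correct("xyz", toy_lexicon))
```

Here a token already in the lexicon is left unchanged, and a token with no sufficiently similar entry is also left unchanged, so the sketch only rewrites tokens that look like near misses of a known syllable.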