A comprehensive social media data processing and analytics architecture by using big data platforms: a case study of twitter flood-risk messages

2021; 14(2): 913-929. [Epub] 2021-03-11

Status: PubMed-not-MEDLINE; Language: English; Country: Germany; Medium: print-electronic

Document type: journal articles

Persistent link   https://www.medvik.cz/link/pmid33727982

The main objective of this article is to propose an advanced architecture and workflow based on the Apache Hadoop and Apache Spark big data platforms. The architecture is designed for collecting, storing, processing, and analysing data-intensive social media streams. The paper shows how the proposed architecture and data workflow can be applied to analyse Tweets on a specific flood topic. A secondary objective is also demonstrated: describing the flood alert situation using only Tweet messages, and exploring the informative potential of such data. A predictive machine learning approach based on Bayes' theorem was used to classify messages as flood or non-flood. For this study, approximately 100,000 Twitter messages were processed and analysed. The messages were related to the flooding domain and were collected over a period of 5 days (14 May - 18 May 2018). A Spark application was developed to run the data processing commands automatically and to generate the appropriate output data. The results confirmed the advantages of many well-known features of Spark and Hadoop in social media data processing. These technologies are ready to deal with social media data streams, but there are still challenges that have to be taken into account. Based on the flood tweet analysis, Twitter messages are, with some caveats, informative enough to be used to estimate the general flood alert situation in particular regions. Text analysis techniques showed that Twitter messages contain valuable flood-related spatial information.
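
To make the classification step concrete, the following is a minimal sketch of a multinomial naive Bayes classifier built on Bayes' theorem, written in plain Python rather than Spark. It is not the authors' implementation: the abstract does not specify the exact variant, features, or smoothing, so the add-one (Laplace) smoothing, the tokenizer, the label names `flood` and `no_flood`, and the six training messages are all assumptions invented here for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a message and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        # Count documents per class (for the prior) and words per class
        # (for the likelihood).
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training messages, invented for this sketch (not from the study).
train = [
    ("Flash flood warning issued for the county, roads under water", "flood"),
    ("River is rising fast, flooding reported downtown", "flood"),
    ("Heavy rain caused flooding, basement under water", "flood"),
    ("My inbox is flooded with emails today", "no_flood"),
    ("Sunny day at the park, great weather", "no_flood"),
    ("New phone arrived today, great camera", "no_flood"),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])

print(model.predict("flood warning, river rising near downtown"))
print(model.predict("great weather at the park today"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied. At the scale described in the abstract (about 100,000 messages on Spark), a distributed library implementation would be the natural choice over a hand-written class like this one.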
