
Large scale implementations for Twitter sentiment classification. (English) Zbl 1461.62204

Summary: Sentiment Analysis on Twitter Data is a challenging problem due to the nature, diversity and volume of the data. People tend to express their feelings freely, which makes Twitter an ideal source for accumulating a vast amount of opinions on a wide spectrum of topics. This information offers huge potential and can be harnessed to gauge the sentiment tendency towards these topics. However, since no one can invest an unlimited amount of time in reading through these tweets, an automated decision-making approach is necessary. Most existing solutions are limited to centralized environments and can therefore process at most a few thousand tweets. Given the massive number of tweets published daily, such a sample is not representative enough to determine the sentiment polarity towards a topic. In this work, we develop two systems, the first in the MapReduce framework and the second in the Apache Spark framework, for programming with Big Data. The algorithm uses all hashtags and emoticons inside a tweet as sentiment labels and then classifies diverse sentiment types in a parallel and distributed manner. Moreover, the sentiment analysis tool is based on Machine Learning methodologies alongside Natural Language Processing techniques and utilizes Apache Spark's machine learning library, MLlib. In order to address the nature of Big Data, we introduce some pre-processing steps for achieving better results in Sentiment Analysis, as well as Bloom filters to compact the storage size of intermediate data and boost the performance of our algorithm. Finally, the proposed system was trained and validated with real data crawled from Twitter, and, through an extensive experimental evaluation, we show that our solution is efficient, robust and scalable while confirming the quality of our sentiment identification.
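
As a rough illustration of two ideas named in the summary, the following Python sketch extracts hashtags and emoticons from a tweet to serve as sentiment labels, and stores a tweet's tokens in a Bloom filter so that membership can be tested from a compact bit array. It is a minimal single-machine sketch under stated assumptions, not the authors' MapReduce or Spark implementation: the names `extract_labels` and `BloomFilter`, the emoticon subset, the filter size and the SHA-256-based hashing are all hypothetical choices made for the example.

```python
import hashlib
import re

HASHTAG = re.compile(r"#\w+")
# Assumption: a tiny illustrative emoticon subset, not the set used in the paper.
EMOTICONS = (":)", ":(", ":D", ";)")


def extract_labels(tweet):
    """Return the hashtags (lowercased) and emoticons found in a tweet."""
    labels = [tag.lower() for tag in HASHTAG.findall(tweet)]
    labels += [emo for emo in EMOTICONS if emo in tweet]
    return labels


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, false positives possible."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Assumption: sizes chosen arbitrarily for the example.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


tweet = "Loving the new phone :) #happy #NewPhone"
labels = extract_labels(tweet)

tokens = tweet.lower().split()
token_filter = BloomFilter()
for token in tokens:
    token_filter.add(token)
```

Here the filter occupies 128 bytes regardless of how many tokens are added, which is the storage trade-off a Bloom filter offers: a query for an added token always succeeds, while a query for an absent token may occasionally succeed as well.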


62P25 Applications of statistics to social sciences
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T10 Pattern recognition, speech recognition

