
Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data. (English) Zbl 07732379

Summary: Topic models are a useful and popular method for finding latent topics in documents. However, the short and sparse texts of social media micro-blogs such as Twitter are challenging for the most commonly used topic model, Latent Dirichlet Allocation (LDA). We compare the performance of the standard LDA topic model with that of the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma-Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that the coherence scores commonly used to evaluate topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
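Two of the summary's technical steps can be illustrated in code. First, a minimal sketch, assuming gensim and a toy tokenized-tweet corpus (all data and parameter choices below are illustrative assumptions, not the authors' pipeline), of fitting LDA and computing the kind of coherence score the study finds unreliable on short text:

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# fit LDA on tokenized tweets with gensim and compute a C_v coherence score.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy stand-in for preprocessed, tokenized tweets.
texts = [
    ["covid", "vaccine", "dose", "trial"],
    ["lockdown", "school", "closed", "covid"],
    ["vaccine", "trial", "results", "published"],
    ["mask", "mandate", "lockdown", "protest"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

# High C_v values are commonly read as "good topics"; the paper argues such
# scores are a poor evaluation metric on short, sparse documents.
coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(f"C_v coherence: {coherence:.3f}")
```

Second, a hedged sketch of pseudo-document simulation in the spirit the summary describes: short documents are drawn from known topic-word distributions so that a fitted model's output can be scored against ground truth. The one-topic-per-document sampling below matches the mixture assumption behind GSDMM and GPM but is one plausible design, not necessarily the paper's exact scheme:

```python
# Hedged sketch: simulate short pseudo-documents from known topics.
# All sizes and Dirichlet parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_topics, n_docs, doc_len = 50, 3, 200, 8  # short, sparse docs

# Ground-truth topic-word distributions from a sparse Dirichlet prior.
beta = rng.dirichlet(alpha=np.full(vocab_size, 0.1), size=n_topics)
topic_weights = np.full(n_topics, 1.0 / n_topics)

docs, true_topics = [], []
for _ in range(n_docs):
    z = rng.choice(n_topics, p=topic_weights)  # one topic per pseudo-document
    words = rng.choice(vocab_size, size=doc_len, p=beta[z])
    docs.append(words.tolist())
    true_topics.append(z)

# A fitted model's document-topic assignments can now be compared against
# `true_topics`, e.g. with a clustering metric such as the adjusted Rand index.
```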

MSC:

62-08 Computational methods for problems pertaining to statistics

References:

[1] Alvarez-Melis D, Saveski M (2016) Topic modeling in Twitter: aggregating tweets by conversations. In: Tenth international AAAI conference on web and social media, pp 519-522
[2] Bekkerman R, Allan J (2004) Using bigrams in text categorization. Technical Report IR-408, Center of Intelligent Information Retrieval, University of Massachusetts Amherst, pp 1-10
[3] Blei, D.; Kucukelbir, A.; McAuliffe, J., Variational inference: a review for statisticians, J Am Stat Assoc, 112, 859-877 (2017) · doi:10.1080/01621459.2017.1285773
[4] Blei, D.; Ng, A.; Jordan, M., Latent Dirichlet allocation, Adv Neural Inf Process Syst, 14, 601-608 (2001) · Zbl 1112.68379
[5] Chang J, Gerrish S, Wang C, Boyd-Graber J, Blei D (2009) Reading tea leaves: how humans interpret topic models. In: Advances in neural information processing systems, pp 288-296
[6] Févotte, C.; Idier, J., Algorithms for nonnegative matrix factorization with the beta-divergence, Neural Comput, 23, 9, 2421-2456 (2011) · Zbl 1231.65072 · doi:10.1162/NECO_a_00168
[7] Hoffman M, Bach F, Blei D (2010) Online learning for latent Dirichlet allocation. In: Advances in neural information processing systems, vol 23
[8] Hoyle A, Goel P, Peskov D, Hian-Cheong A, Boyd-Graber JL, Resnik P (2021) Is automated topic model evaluation broken? The incoherence of coherence. In: 35th conference on neural information processing systems, pp 1-16
[9] Kant, G.; Weisser, C.; Säfken, B., TTLocVis: a Twitter topic location visualization package, J Open Source Softw, 5, 54, 2507 (2020) · doi:10.21105/joss.02507
[10] Kant G, Wiebelt L, Weisser C, Kis-Katos K, Luber M, Säfken B (forthcoming) An iterative topic model filtering framework for short and noisy user-generated data: analyzing conspiracy theories on twitter. Int J Data Sci Anal
[11] Korenius T, Laurikkala J, Järvelin K, Juhola M (2004) Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the thirteenth ACM international conference on information and knowledge management, pp 625-633
[12] Lau JH, Newman D, Baldwin T (2014) Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics, pp 530-539
[13] Liu, JS, The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem, J Am Stat Assoc, 89, 427, 958-966 (1994) · Zbl 0804.62033 · doi:10.1080/01621459.1994.10476829
[14] Luber M, Thielmann A, Weisser C, Säfken B (2021) Community-detection via hashtag-graphs for semi-supervised NMF topic models. arXiv:2111.10401
[15] Luber M, Weisser C, Säfken B, Silbersdorff A, Kneib T, Kis-Katos K (2021) Identifying topical shifts in Twitter streams: an integration of non-negative matrix factorisation, sentiment analysis and structural break models for large scale data. In: MISDOOM 2021: disinformation in open online media. Springer International Publishing, pp 33-49
[16] Mazarura J, De Waal A (2016) A comparison of the performance of latent Dirichlet allocation and the Dirichlet multinomial mixture model on short text. In: Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), pp 1-6
[17] Mazarura J, De Waal A, de Villiers P (2020) A Gamma-Poisson mixture topic model for short text. Math Probl Eng 1-17
[18] Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, pp 889-892
[19] Nigam, K.; McCallum, AK; Thrun, S.; Mitchell, T., Text classification from labeled and unlabeled documents using EM, Mach Learn, 39, 103-134 (2000) · Zbl 0949.68162 · doi:10.1023/A:1007692713085
[20] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E., Scikit-learn: machine learning in python, J Mach Learn Res, 12, 2825-2830 (2011) · Zbl 1280.68189
[21] Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. ELRA, pp 45-50
[22] Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on web search and data mining, pp 399-408
[23] Roesslein J (2009) Tweepy documentation. http://tweepy.readthedocs.io/en/v3, 5
[24] Rosner F, Hinneburg A, Röder M, Nettling M, Both A (2014) Evaluating topic coherence measures. arXiv:1403.6397
[25] Stevens K, Kegelmeyer P, Andrzejewski D, Buttler D (2012) Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp 952-961
[26] Tan, CM; Wang, YF; Lee, CD, The use of bigrams to enhance text categorization, Inf Process Manage, 38, 4, 529-546 (2002) · Zbl 1052.68611 · doi:10.1016/S0306-4573(01)00045-0
[27] Wang SI, Manning CD (2012) Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the association for computational linguistics, vol 2, pp 90-94
[28] Yin J, Wang J (2014) A Dirichlet multinomial mixture model based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 233-242
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases, these data have been complemented or enhanced with data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.