
Sparse partially collapsed MCMC for parallel inference in topic models. (English) Zbl 07498961

Summary: Topic models, and more specifically the class of latent Dirichlet allocation (LDA) models, are widely used for probabilistic modeling of text. Markov chain Monte Carlo (MCMC) sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated for by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler. Supplementary materials for this article are available online.
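
The partially collapsed scheme described in the summary alternates between two blocks: the topic-word distributions Phi are drawn from their Dirichlet full conditional given the topic indicators z, and z is then resampled given Phi with the document-topic proportions theta integrated out, so that documents become conditionally independent and can be processed in parallel. The following is a minimal, dense, serial Python/NumPy sketch of that two-block structure under the standard LDA conditionals; all function and variable names are illustrative, and it omits the sparsity and parallelization machinery that the paper's algorithm relies on, so it is not the authors' implementation.

import numpy as np

def partially_collapsed_lda_gibbs(docs, V, K, alpha=0.1, beta=0.01,
                                  n_iter=100, seed=0):
    # docs: list of 1-D integer arrays of word ids in {0, ..., V-1}
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # topic indicators
    n_dk = np.zeros((D, K))                                 # document-topic counts
    n_kv = np.zeros((K, V))                                 # topic-word counts
    for d, doc in enumerate(docs):
        np.add.at(n_dk[d], z[d], 1)
        np.add.at(n_kv, (z[d], doc), 1)

    for _ in range(n_iter):
        # Block 1: Phi | z -- each topic's word distribution is
        # Dirichlet(beta + topic-word counts).
        phi = np.stack([rng.dirichlet(beta + n_kv[k]) for k in range(K)])

        # Block 2: z | Phi, with theta collapsed.  Given Phi, documents are
        # conditionally independent, so this loop over documents is the part
        # that can run in parallel.
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k_old = z[d][i]
                n_dk[d, k_old] -= 1
                n_kv[k_old, v] -= 1
                p = phi[:, v] * (n_dk[d] + alpha)           # unnormalized conditional
                k_new = rng.choice(K, p=p / p.sum())
                z[d][i] = k_new
                n_dk[d, k_new] += 1
                n_kv[k_new, v] += 1
    return z, phi

As a toy usage example, partially_collapsed_lda_gibbs([np.array([0, 2, 1]), np.array([3, 3, 0])], V=4, K=2) runs the sketch on two three-word documents over a four-word vocabulary.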

MSC:

62-XX Statistics

Software:

CODA; MALLET

References:

[1] Ahmed, A.; Aly, M.; Gonzalez, J.; Narayanamurthy, S.; Smola, A., Scalable Inference in Latent Variable Models, International Conference on Web Search and Data Mining, 123-132 (2012)
[2] Ahn, S.; Shahbaba, B.; Welling, M., Distributed Stochastic Gradient MCMC, International Conference on Machine Learning, 1044-1052 (2014)
[3] Araujo, M. D.; Navarro, G.; Ziviani, N., Large Text Searching Allowing Errors, 4th South American Workshop on String Processing, 2-20 (1997)
[4] Blei, D. M.; Ng, A. Y.; Jordan, M. I., Latent Dirichlet Allocation, Journal of Machine Learning Research, 3, 993-1022 (2003) · Zbl 1112.68379
[5] Chien, J.-T.; Chang, Y.-L., Bayesian Sparse Topic Model, Journal of Signal Processing Systems, 74, 375-389 (2014)
[6] Gao, J.; Johnson, M., A Comparison of Bayesian Estimators for Unsupervised Hidden Markov Model POS Taggers, Conference on Empirical Methods in Natural Language Processing, 344-352 (2008)
[7] Griffiths, T. L.; Steyvers, M., Finding Scientific Topics, Proceedings of the National Academy of Sciences, 101, 5228-5235 (2004)
[8] Heaps, H. S., Information Retrieval: Computational and Theoretical Aspects (1978), New York: Academic Press, New York · Zbl 0471.68075
[9] Hoffman, M. D.; Blei, D. M.; Wang, C.; Paisley, J. W., Stochastic Variational Inference, Journal of Machine Learning Research, 14, 1303-1347 (2013) · Zbl 1317.68163
[10] Ihler, A.; Newman, D., Understanding Errors in Approximate Distributed Latent Dirichlet Allocation, IEEE Transactions on Knowledge and Data Engineering, 24, 952-960 (2012)
[11] Ishwaran, H.; James, L. F., Gibbs Sampling Methods for Stick-Breaking Priors, Journal of the American Statistical Association, 96, 161-173 (2001) · Zbl 1014.62006
[12] Lea, D., A Java Fork/Join Framework, ACM 2000 Conference on Java Grande, 36-43 (2000)
[13] Li, A. Q.; Ahmed, A.; Ravi, S.; Smola, A. J., Reducing the Sampling Complexity of Topic Models, 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 891-900 (2014)
[14] Liu, J. S., The Collapsed Gibbs Sampler in Bayesian Computations With Applications to a Gene Regulation Problem, Journal of the American Statistical Association, 89, 958-966 (1994) · Zbl 0804.62033
[15] Liu, Z.; Zhang, Y.; Chang, E. Y.; Sun, M., PLDA+, ACM Transactions on Intelligent Systems and Technology, 2, 1-18 (2011)
[16] Maciuca, R.; Zhu, S.-C., First Hitting Time Analysis of the Independence Metropolis Sampler, Journal of Theoretical Probability, 19, 235-261 (2006) · Zbl 1104.60043
[17] Marsaglia, G.; Tsang, W. W., A Simple Method for Generating Gamma Variables, ACM Transactions on Mathematical Software, 26, 363-372 (2000) · Zbl 1365.65022
[18] McCallum, A. K., MALLET: A Machine Learning for Language Toolkit (2002)
[19] Newman, D.; Asuncion, A.; Smyth, P.; Welling, M., Distributed Algorithms for Topic Models, The Journal of Machine Learning Research, 10, 1801-1828 (2009) · Zbl 1235.68324
[20] Newman, D.; Bonilla, E. V.; Buntine, W., Improving Topic Coherence With Regularized Topic Models, Advances in Neural Information Processing Systems, 496-504 (2011)
[21] Ng, K. W.; Tian, G.-L.; Tang, M.-L., Dirichlet and Related Distributions: Theory, Methods and Applications, 888 (2011), New York: Wiley, New York · Zbl 1234.60006
[22] Plummer, M.; Best, N.; Cowles, K.; Vines, K., CODA: Convergence Diagnosis and Output Analysis for MCMC, R News, 6, 7-11 (2006)
[23] Porteous, I.; Newman, D.; Ihler, A.; Asuncion, A.; Smyth, P.; Welling, M., Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 569-577 (2008)
[24] Reese, S.; Boleda Torrent, G.; Cuadros Oller, M.; Padró, L.; Rigau Claramunt, G., Word-Sense Disambiguated Multilingual Wikipedia Corpus, 7th International Conference on Language Resources and Evaluation (2010)
[25] Rosen-Zvi, M.; Chemudugunta, C.; Griffiths, T.; Smyth, P.; Steyvers, M., Learning Author-Topic Models From Text Corpora, ACM Transactions on Information Systems, 28, 4 (2010)
[26] Sandhaus, E., The New York Times Annotated Corpus LDC2008T19. DVD (2008), Philadelphia, PA: Linguistic Data Consortium, Philadelphia, PA
[27] Smola, A.; Narayanamurthy, S., An Architecture for Parallel Topic Models, Proceedings of the VLDB Endowment, 3, 703-710 (2010)
[28] Teh, Y. W., Dirichlet Process, in Sammut, C.; Webb, G. I. (eds.), Encyclopedia of Machine Learning (2011), Boston, MA: Springer, Boston, MA
[29] Terenin, A.; Magnusson, M.; Jonsson, L.; Draper, D., Pólya Urn Latent Dirichlet Allocation: A Sparse Massively Parallel Sampler (2017)
[30] Tristan, J.-B.; Huang, D.; Tassarotti, J.; Pocock, A. C.; Green, S.; Steele, G. L., Augur: Data-Parallel Probabilistic Modeling, Advances in Neural Information Processing Systems, 2600-2608 (2014)
[31] Van Dyk, D. A.; Park, T., Partially Collapsed Gibbs Samplers: Theory and Methods, Journal of the American Statistical Association, 103, 790-796 (2008) · Zbl 1471.62198
[32] Villani, M.; Kohn, R.; Giordani, P., Regression Density Estimation Using Smooth Adaptive Gaussian Mixtures, Journal of Econometrics, 153, 155-173 (2009) · Zbl 1431.62093
[33] Walker, A. J., An Efficient Method for Generating Discrete Random Variables With General Distributions, ACM Transactions on Mathematical Software (TOMS), 3, 253-256 (1977) · Zbl 1148.65301
[34] Wallach, H. M.; Murray, I.; Salakhutdinov, R.; Mimno, D., Evaluation Methods for Topic Models, 26th Annual International Conference on Machine Learning, 1105-1112 (2009)
[35] Xiao, H.; Stibor, T., Efficient Collapsed Gibbs Sampling For Latent Dirichlet Allocation, Proceedings of 2nd Asian Conference on Machine Learning, 63-78 (2010)
[36] Yan, F.; Xu, N.; Qi, Y., Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units, Advances in Neural Information Processing Systems, 2134-2142 (2009)
[37] Yao, L.; Mimno, D.; McCallum, A., Efficient Methods for Topic Model Inference on Streaming Document Collections, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 937-946 (2009)
[38] Yuan, J.; Gao, F.; Ho, Q.; Dai, W.; Wei, J.; Zheng, X.; Xing, E. P.; Liu, T.-Y.; Ma, W.-Y., LightLDA: Big Topic Models on Modest Computer Clusters, Proceedings of the 24th International Conference on World Wide Web, 1351-1361 (2015)
[39] Zhu, J.; Chen, N.; Perkins, H.; Zhang, B., Gibbs Max-Margin Topic Models With Fast Sampling Algorithms, 30th International Conference on Machine Learning, 124-132 (2013)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases the data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.