×

Hybrid clustering based on content and connection structure using joint nonnegative matrix factorization. (English) Zbl 1434.15011

Summary: A hybrid method called JointNMF is presented which is applied to latent information discovery from data sets that contain both text content and connection structure information. The new method jointly optimizes an integrated objective function, which is a combination of two components: the Nonnegative Matrix Factorization (NMF) objective function for handling text content and the Symmetric NMF (SymNMF) objective function for handling network structure information. An effective algorithm for the joint NMF objective function is proposed so that the efficient method of block coordinate descent framework can be utilized. The proposed hybrid method simultaneously discovers content associations and related latent connections without any need for postprocessing of additional clustering. It is shown that the proposed method can also be applied when the text content is associated with hypergraph edges. An additional capability of the JointNMF is prediction of unknown network information which is illustrated using several real world problems such as citation recommendations of papers and leader detection in organizations. The proposed method can also be applied to general data expressed with both feature space vectors and pairwise similarities and can be extended to the case with multiple feature spaces or multiple similarity measures. Our experimental results illustrate multiple advantages of the proposed hybrid method when both content and connection structure information is available in the data for obtaining higher quality clustering results and discovery of new information such as unknown link prediction.

MSC:

15A23 Factorization of matrices
90C27 Combinatorial optimization

Software:

HierNMF2; SmallK; SymNMF; SNAP

References:

[1] Bertsekas, D.: Nonlinear Programming. Athena Scientific, Belmont (1999) · Zbl 1015.90077
[2] Chang, J., Blei, D.M.: Hierarchical relational models for document networks. Ann. Appl. Stat. 4(1), 124-150 (2010) · Zbl 1189.62191 · doi:10.1214/09-AOAS309
[3] Choo, J., Lee, C., Reddy, C.K., Park, H.: Utopian: user-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans. Vis. Comput. Graph. 19(12), 1992-2001 (2013). doi:10.1109/TVCG.2013.212 · doi:10.1109/TVCG.2013.212
[4] Cohn, DA; Hofmann, T.; Leen, TK (ed.); Dietterich, TG (ed.); Tresp, V. (ed.), The missing link – a probabilistic model of document content and hypertext connectivity, No. 13, 430-436 (2001), Cambridge
[5] Cruz, J., Bothorel, C., Poulet, F.: Entropy based community detection in augmented social networks. In: 2011 International Conference on Computational Aspects of Social Networks (CASoN), pp. 163-168 (2011). doi:10.1109/CASON.2011.6085937
[6] Drake, B., Kim, J., Mallick, M., Park, H.: Supervised Raman spectra estimation based on nonnegative rank deficient least squares. In: Proceedings 13th International Conference on Information Fusion, Edinburgh, UK (2010)
[7] Drake, B., Lee-Urban, S., Park, H.: Smallk is a C++/Python high-performance software library for nonnegative matrix factorization (nmf) and hierarchical and flat clustering using the nmf; current version 1.6.2. http://smallk.github.io/ (2017)
[8] Elhadi, H., Agam, G.: Structure and attributes community detection: comparative analysis of composite, ensemble and selection methods. In: Proceedings of the 7th Workshop on Social Network Mining and Analysis, SNAKDD ’13, pp. 10:1-10:7. ACM, New York, NY, USA (2013). doi:10.1145/2501025.2501034
[9] Erosheva, E., Fienberg, S., Lafferty, J.: Mixed-membership models of scientific publications. Proc. Natl. Acad. Sci. 101(suppl 1), 5220-5227 (2004). doi:10.1073/pnas.0307760101 · doi:10.1073/pnas.0307760101
[10] Gruber, A., Rosen-Zvi, M., Weiss, Y.: Latent topic models for hypertext. In: Proceedings of the Twenty-Fourth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pp. 230-239. AUAI Press, Corvallis, Oregon (2008)
[11] Jin, D., Gabrys, B., Dang, J.: Combined node and link partitions method for finding overlapping communities in complex networks. Scientific Reports 5 (2015). doi:10.1038/srep08600
[12] Kannan, R.; Ishteva, M.; Drake, B.; Park, H.; Naik, GR (ed.), Bounded matrix low rank approximation, 89-118 (2016), Springer · Zbl 1338.65113 · doi:10.1007/978-3-662-48331-2_4
[13] Kannan, R., Ishteva, M., Park, H.: Bounded matrix factorization for recommender system. Knowl. Inf. Syst. 39(3), 491-511 (2014) · doi:10.1007/s10115-013-0710-2
[14] Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Glob. Optim. 58(2), 285-319 (2014). doi:10.1007/s10898-013-0035-4 · Zbl 1321.90129 · doi:10.1007/s10898-013-0035-4
[15] Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. 33(6), 3261-3281 (2011) · Zbl 1232.65068 · doi:10.1137/110821172
[16] Kuang, D.; Choo, J.; Park, H.; Celebi, ME (ed.), Nonnegative matrix factorization for interactive topic modeling and document clustering, 215-243 (2015), Berlin · doi:10.1007/978-3-319-09259-1_7
[17] Kuang, D., Park, H.: Fast rank-2 nonnegative matrix factorization for hierarchical document clustering. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 739-747. ACM (2013)
[18] Kuang, D., Park, H., Ding, C.H.: Symmetric nonnegative matrix factorization for graph clustering. In: SDM, vol. 12, pp. 106-117. SIAM (2012)
[19] Kuang, D., Yun, S., Park, H.: SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering. J. Glob. Optim. 62(3), 545-574 (2015). doi:10.1007/s10898-014-0247-2 · Zbl 1326.90080 · doi:10.1007/s10898-014-0247-2
[20] Leskovec, J., Krevl, A.: SNAP datasets: stanford large network dataset collection. http://snap.stanford.edu/data (2014)
[21] Liu, J., Wang, C., Gao, J., Han, J.: Multi-view clustering via joint nonnegative matrix factorization. In: Proceedings of the 2013 SIAM International Conference on Data Mining, Proceedings, pp. 252-260. Society for Industrial and Applied Mathematics (2013)
[22] Liu, Y., Niculescu-Mizil, A., Gryc, W.: Topic-link LDA: joint models of topic and author community. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 665-672. ACM, New York, NY, USA (2009). doi:10.1145/1553374.1553460
[23] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008) · Zbl 1160.68008 · doi:10.1017/CBO9780511809071
[24] Mei, Q., Cai, D., Zhang, D., Zhai, C.: Topic modeling with network regularization. In: Proceedings of the 17th International Conference on World Wide Web, WWW ‘08, pp. 101-110. ACM, New York, NY, USA (2008). doi:10.1145/1367497.1367512
[25] Nallapati, R.M., Ahmed, A., Xing, E.P., Cohen, W.W.: Joint latent topic models for text and citations. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘08, pp. 542-550. ACM, New York, NY, USA (2008). doi:10.1145/1401890.1401957
[26] Ruan, Y., Fuhry, D., Parthasarathy, S.: Efficient community detection in large networks using content and links. In: Proceedings of the 22nd International Conference on World Wide Web, WWW ‘13, pp. 1089-1098. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2013)
[27] Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583-617 (2003). doi:10.1162/153244303321897735 · Zbl 1084.68759 · doi:10.1162/153244303321897735
[28] Sun, Y., Aggarwal, C.C., Han, J.: Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Proc. VLDB Endow. 5(5), 394-405 (2012). doi:10.14778/2140436.2140437 · doi:10.14778/2140436.2140437
[29] Tang, J., Wang, X., Liu, H.: Integrating social media data for community detection. In: Proceedings of the 2011 International Conference on Modeling and Mining Ubiquitous Social Media, MSM‘11, pp. 1-20. Springer, Berlin, Heidelberg (2012). doi:10.1007/978-3-642-33684-3
[30] Wang, X., Tang, L., Gao, H., Liu, H.: Discovering overlapping groups in social media. In: 2010 IEEE International Conference on Data Mining, pp. 569-578 (2010). doi:10.1109/ICDM.2010.48
[31] Wang, X., Tang, L., Liu, H., Wang, L.: Learning with multi-resolution overlapping communities. Knowl. Inf. Syst. 36(2), 517-535 (2013). doi:10.1007/s10115-012-0555-0 · doi:10.1007/s10115-012-0555-0
[32] Xu, Y., Yin, W., Wen, Z., Zhang, Y.: An alternating direction algorithm for matrix completion with nonnegative factors. Front. Math. China 7(2), 365-384 (2012). doi:10.1007/s11464-012-0194-5 · Zbl 1323.65044 · doi:10.1007/s11464-012-0194-5
[33] Yang, J., Leskovec, J.: Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 587-596. ACM (2013)
[34] Yang, T., Jin, R., Chi, Y., Zhu, S.: Combining link and content for community detection: a discriminative approach. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘09, pp. 927-936. ACM, New York, NY, USA (2009). doi:10.1145/1557019.1557120
[35] Zhou, D.; Huang, J.; Schölkopf, B.; Schölkopf, B. (ed.); Platt, JC (ed.); Hoffman, T. (ed.), Learning with hypergraphs: clustering, classification, and embedding, No. 19, 1601-1608 (2007), Cambridge
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.