×

Integer matrix approximation and data mining. (English) Zbl 1388.90092

Summary: Integer datasets frequently appear in many applications in science and engineering. To analyze these datasets, we consider an integer matrix approximation technique that can preserve the original dataset characteristics. Because integers are discrete in nature, to the best of our knowledge, no previously proposed technique developed for real numbers can be successfully applied. In this study, we first conduct a thorough review of current algorithms that can solve integer least squares problems, and then we develop an alternative least square method based on an integer least squares estimation to obtain the integer approximation of the integer matrices. We discuss numerical applications for the approximation of randomly generated integer matrices as well as studies of association rule mining, cluster analysis, and pattern extraction. Our computed results suggest that our proposed method can calculate a more accurate solution for discrete datasets than other existing methods.

MSC:

90C20 Quadratic programming
15B36 Matrices of integers
90C10 Integer programming

Software:

UCI-ml; PROXIMUS
Full Text: DOI

References:

[1] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min Knowl. Discov. 11(1), 5-33 (2005) · doi:10.1007/s10618-005-1396-1
[2] Agrell, E., Eriksson, T., Vardy, A., Zeger, K.: Closest point search in lattices. IEEE Trans. Inf. Theory 48(8), 2201-2214 (2002). doi:10.1109/TIT.2002.800499 · Zbl 1062.94035 · doi:10.1109/TIT.2002.800499
[3] Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/ mlearn/MLRepository.html
[4] Banerjee, A., Krumpelman, C., Ghosh, J., Basu, S., Mooney, R.J.: Model-based overlapping clustering. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pp. 532-537. ACM, New York, NY, USA (2005). doi:10.1145/1081870.1081932 · Zbl 1321.90129
[5] Breen, S., Chang, X.W.: Column reordering for box-constrained integer least squares problems. arXiv:1204.1407 (2012) · Zbl 1162.65354
[6] CBCL face database \[\sharp \]♯ 1, MIT center for biological and computation learning (2000). http://www.ai.mit.edu/projects/cbcl
[7] Chang, X.W., Golub, G.H.: Solving ellipsoid-constrained integer least squares problems. SIAM J. Matrix Anal. Appl. 31(3), 1071-1089 (2009). doi:10.1137/060660680 · Zbl 1195.65049 · doi:10.1137/060660680
[8] Chang, X.W., Han, Q.: Solving box-constrained integer least squares problems. IEEE Trans. Wirel. Commun. 7(1), 277-287 (2008). doi:10.1109/TWC.2008.060497 · doi:10.1109/TWC.2008.060497
[9] Chang, X.W., Yang, X.: An efficient regularization approach for underdetermined mimo system decoding (2007). http://www.cs.mcgill.ca/ chang/pub/ChaY07a.pdf · Zbl 1195.65049
[10] Chowdhury, G.G.: Introduction to Modern Information Retrieval, 2nd edn. Facet Publishing, Abingdon (2004)
[11] Chu, M.T., Lin, M.M.: Low-dimensional polytope approximation and its applications to nonnegative matrix factorization. SIAM J. Sci. Comput. 30(3), 1131-1155 (2008). doi:10.1137/070680436 · Zbl 1168.65019 · doi:10.1137/070680436
[12] Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley, Hoboken (2006) · Zbl 1140.94001
[13] Donoho, D., Stodden, V.: When does nonnegative matrix factorization give a correct decomposition into parts? In: Proceedings of 17th Annual Conference on Neural Information Processing Systems. NIPS, Stanford University, Stanford, CA, 2003 (2003) · Zbl 1346.15028
[14] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95(25), 14863-14868 (1998) · doi:10.1073/pnas.95.25.14863
[15] Elhamifar, E., Vidal, R.: Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2765-2781 (2013) · doi:10.1109/TPAMI.2013.57
[16] Filipovych, R., Resnick, S.M., Davatzikos, C.: Semi-supervised cluster analysis of imaging data. NeuroImage 54(3), 2185-2197 (2011) · doi:10.1016/j.neuroimage.2010.09.074
[17] Gillis, N., Glineur, F.: Using underapproximations for sparse nonnegative matrix factorization. Pattern Recogn. 43(4), 1676-1687 (2010). doi:10.1016/j.patcog.2009.11.013. http://www.sciencedirect.com/science/article/pii/S0031320309004324 · Zbl 1191.68783
[18] Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences, 4th edn. Johns Hopkins University Press, Baltimore (2013) · Zbl 1268.65037
[19] Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157-1182 (2003) · Zbl 1102.68556
[20] Hassibi, A., Boyd, S.: Integer parameter estimation in linear models with applications to gps. In: Proceedings of the 35th IEEE Conference on Decision and Control, 1996, vol. 3, pp. 3245-3251 vol.3 (1996). doi:10.1109/CDC.1996.573639 · Zbl 1242.90112
[21] Houseman, E.A., Accomando, W.P., Koestler, D.C., Christensen, B.C., Marsit, C.J., Nelson, H.H., Wiencke, J.K., Kelsey, K.T.: Dna methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinform. 13, 86 (2012). doi:10.1186/1471-2105-13-86
[22] Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457-1469 (2004) · Zbl 1222.68218
[23] Kabán, A., Bingham, E.: Factorisation and denoising of 0-1 data: a variational approach. Neurocomputing 71(10), 2291-2308 (2008) · doi:10.1016/j.neucom.2007.07.038
[24] Kannan, R., Ishteva, M., Park, H.: Bounded matrix low rank approximation. In: 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 319-328 (2012). doi:10.1109/ICDM.2012.131 · Zbl 1102.68556
[25] Kawamoto, T., Hotta, K., Mishima, T., Fujiki, J., Tanaka, M., Kurita, T.: Estimation of single tones from chord sounds using non-negative matrix factorization. Neural Netw. World 3, 429-436 (2000)
[26] Kim, H., Park, H.: Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30(2), 713-730 (2008) · Zbl 1162.65354 · doi:10.1137/07069239X
[27] Kim, H., Park, H., Drake, B.L.: Extracting unrecognized gene relationships from the biomedical literature via matrix factorizations. BMC Bioinform. 9, 211 (2008). doi:10.1186/1471-2105-9-211
[28] Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Global Optim. 58(2), 285-319 (2014). doi:10.1007/s10898-013-0035-4 · Zbl 1321.90129 · doi:10.1007/s10898-013-0035-4
[29] Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. 33(6), 3261-3281 (2011). doi:10.1137/110821172 · Zbl 1232.65068 · doi:10.1137/110821172
[30] Koyutürk, M., Grama, A.: Proximus: a framework for analyzing very high dimensional discrete-attributed datasets. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 147-156. ACM, New York, NY, USA (2003). doi:10.1145/956750.956770
[31] Koyuturk, M., Grama, A., Ramakrishnan, N.: Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans. Knowl. Data Eng. 17(4), 447-461 (2005). doi:10.1109/TKDE.2005.55 · doi:10.1109/TKDE.2005.55
[32] Koyutürk, M., Grama, A., Ramakrishnan, N.: Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Trans. Math. Softw. 32(1), 33-69 (2006). doi:10.1145/1132973.1132976 · Zbl 1346.15028 · doi:10.1145/1132973.1132976
[33] Liao, J.C., Boscolo, R., Yang, Y.L., Tran, L.M., Sabatti, C., Roychowdhury, V.P.: Network component analysis: reconstruction of regulatory signals in biological systems. In: Proceedings of the National Academy of Sciences 100(26), 15522-15527 (2003). doi:10.1073/pnas.2136632100. http://www.pnas.org/content/100/26/15522.abstract
[34] Lin, M.M.: Discrete eckart-young theorem for integer matrices. SIAM J. Matrix Anal. Appl. 32(4), 1367-1382 (2011). doi:10.1137/10081099X · Zbl 1242.90112 · doi:10.1137/10081099X
[35] Meeds, E., Ghahramani, Z., Neal, R.M., Roweis, S.T.: Modeling dyadic data with binary latent factors. Adv. Neural Inf. Process. Syst. 19, 977 (2007)
[36] Morgan, S.D.: Cluster analysis in electronic manufacturing. Ph.D. dissertation, North Carolina State University, Raleigh, NC 27695 (2001) · Zbl 1222.68218
[37] Mow, W.H.: Universal lattice decoding: a review and some recent results. In: 2004 IEEE International Conference on Communications, vol. 5, pp. 2842-2846 (2004). doi:10.1109/ICC.2004.1313048
[38] Segal, E., Battle, A., Koller, D.: Decomposing gene expression into cellular processes. In: In Proceedings of the 8th Pacific Symposium on Biocomputing (2003) · Zbl 1219.92027
[39] Slawski, M., Hein, M., Lutsik, P.: Matrix factorization with binary components. In: Advances in Neural Information Processing Systems, pp. 3210-3218 (2013)
[40] Su, K., Wassell, I.: A new ordering for efficient sphere decoding. In: 2005 IEEE International Conference on Communications, ICC 2005, vol. 3, pp. 1906-1910 (2005). doi:10.1109/ICC.2005.1494671
[41] Tu, S., Chen, R., Xu, L.: Transcription network analysis by a sparse binary factor analysis algorithm. J. Integr. Bioinform. 9(2), 68-79 (2012)
[42] van der Veen, A.J.: Analytical method for blind binary signal separation. IEEE Trans. Signal Process. 45(4), 1078-1082 (1997). doi:10.1109/78.564198 · doi:10.1109/78.564198
[43] van Emde-Boas, P.: Another NP-complete partition problem and the complexity of computing short vectors in a lattice. Report. Department of Mathematics. University of Amsterdam. Department, University (1981). http://books.google.com.tw/books?id=tCQiHQAACAAJ
[44] Zhang, Z., Li, T., Ding, C., Zhang, X.: Binary matrix factorization with applications. In: Seventh IEEE International Conference on Data Mining, 2007, ICDM 2007, pp. 391-400. IEEE (2007)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.