
Flexible low-rank statistical modeling with missing data and side information. (English) Zbl 1397.62180

Summary: We explore a general statistical framework for low-rank modeling of matrix-valued data, based on convex optimization with a generalized nuclear norm penalty. We study several related problems: the usual low-rank matrix completion problem with flexible loss functions arising from generalized linear models; reduced-rank regression and multi-task learning; and generalizations of both problems where side information about rows and columns is available, in the form of features or smoothing kernels. We show that our approach encompasses maximum a posteriori estimation arising from Bayesian hierarchical modeling with latent factors, and discuss ramifications of the missing-data mechanism in the context of matrix completion. While the above problems can be naturally posed as rank-constrained optimization problems, which are nonconvex and computationally difficult, we show how to relax them via generalized nuclear norm regularization to obtain convex optimization problems. We discuss algorithms, drawing inspiration from modern convex optimization methods, to address the resulting large-scale computational tasks. Finally, we illustrate our flexible approach on problems arising in functional data reconstruction and ecological species distribution modeling.
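To make the relaxation concrete: writing \(\Omega\) for the set of observed entries and \(L\) for the loss, the rank-constrained problem \(\min_{\operatorname{rank}(M)\le r}\sum_{(i,j)\in\Omega} L(X_{ij}, M_{ij})\) is replaced by the convex surrogate \(\min_{M}\sum_{(i,j)\in\Omega} L(X_{ij}, M_{ij}) + \lambda\|M\|_*\), where \(\|\cdot\|_*\) is the nuclear norm (sum of singular values). The sketch below is a minimal illustration of this idea for the special case of squared loss and the plain nuclear norm, following the SoftImpute iteration of Mazumder, Hastie and Tibshirani [56] rather than the paper's generalized algorithms; the names soft_impute and svt and the level lam=1.0 are our illustrative choices, not the authors'.

```python
import numpy as np

def svt(Z, lam):
    """Singular value soft-thresholding: the proximal operator of
    lam * ||.||_*, shrinking each singular value toward zero by lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(X, observed, lam, n_iter=200):
    """Minimize 0.5 * sum_{(i,j) in Omega} (X_ij - M_ij)^2 + lam * ||M||_*
    by repeatedly filling the missing entries with the current estimate
    and applying singular value thresholding (SoftImpute, [56])."""
    M = np.zeros_like(X)
    for _ in range(n_iter):
        Z = np.where(observed, X, M)  # observed data; current fit elsewhere
        M = svt(Z, lam)
    return M

# Toy example: recover a rank-2 matrix from roughly half of its entries.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
mask = rng.random(A.shape) < 0.5  # True = observed
M_hat = soft_impute(np.where(mask, A, 0.0), mask, lam=1.0)
print("RMSE on held-out entries:", np.sqrt(np.mean((M_hat - A)[~mask] ** 2)))
```

Each iteration solves the fully observed subproblem in closed form via one thresholded SVD; the paper's generalized nuclear norms and non-quadratic losses change this step and motivate the large-scale first-order methods mentioned in the summary.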

MSC:

62H05 Characterization and structure theory for multivariate probability distributions; copulas
62B10 Statistical aspects of information-theoretic topics

References:

[1] Abernethy, J., Bach, F., Evgeniou, T. and Vert, J.-P. (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. J. Mach. Learn. Res. 10 803–826. · Zbl 1235.68122
[2] Aggarwal, C. C. and Chen, B.-C. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 19–28. ACM, New York.
[3] Agarwal, D. K. and Chen, B.-C. (2015). Statistical Methods for Recommender Systems. Cambridge Univ. Press, Cambridge.
[4] Agarwal, D., Zhang, L. and Mazumder, R. (2011). Modeling item–item similarities for personalized recommendations on Yahoo! front page. Ann. Appl. Stat. 5 1839–1875. · Zbl 1231.62207 · doi:10.1214/11-AOAS475
[5] Anderson, T. W. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann. Math. Stat. 22 327–351. · Zbl 0043.13902 · doi:10.1214/aoms/1177729580
[6] Angst, R., Zach, C. and Pollefeys, M. (2011). The generalized trace-norm and its application to structure-from-motion problems. In 2011 IEEE International Conference on Computer Vision (ICCV) 2502–2509. IEEE, Los Alamitos, CA.
[7] Atchadé, Y. F., Mazumder, R. and Chen, J. (2015). Scalable computation of regularized precision matrices via stochastic optimization. Preprint. Available at arXiv:1509.00426.
[8] Audigier, V., Husson, F. and Josse, J. (2016). A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 10 5–26. · Zbl 1414.62206
[9] Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2 183–202. · Zbl 1175.94009 · doi:10.1137/080716542
[10] Bell, R. M. and Koren, Y. (2007). Lessons from the Netflix Prize Challenge. ACM SIGKDD Explor. Newsl. 9 75–79.
[11] Bennett, J. and Lanning, S. (2007). The Netflix Prize. In Proceedings of KDD Cup and Workshop 3–6. ACM, New York.
[12] Bertsekas, D. P. (1999). Nonlinear Programming, 2nd ed. Athena Scientific, Belmont, MA. · Zbl 1015.90077
[13] Bertsimas, D., Copenhaver, M. S. and Mazumder, R. (2017). Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18 Paper No. 29. · Zbl 1437.62216
[14] Bottou, L. and Bousquet, O. (2008). The trade-offs of large scale learning. In Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer and S. T. Roweis, eds.) 161–168. MIT Press, Cambridge, MA.
[15] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge. · Zbl 1058.90049
[16] Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 1–122. · Zbl 1229.90122 · doi:10.1561/2200000016
[17] Burer, S. and Monteiro, R. D. C. (2005). Local minima and convergence in low-rank semidefinite programming. Math. Program. 103 427–444. · Zbl 1099.90040 · doi:10.1007/s10107-004-0564-1
[18] Cai, T. and Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res. 14 3619–3647. · Zbl 1318.62172
[19] Candès, E. and Plan, Y. (2010). Matrix completion with noise. Proc. IEEE 98 925–936.
[20] Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772. · Zbl 1219.90124 · doi:10.1007/s10208-009-9045-5
[21] Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory 56 2053–2080. · Zbl 1366.15021 · doi:10.1109/TIT.2010.2044061
[22] Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? J. ACM 58 Art. ID 11. · Zbl 1327.62369
[23] Candès, E. J., Eldar, Y. C., Strohmer, T. and Voroninski, V. (2015). Phase retrieval via matrix completion [reprint of MR3032952]. SIAM Rev. 57 225–251. · Zbl 1344.49057
[24] Carpentier, A., Klopp, O., Löffler, M. and Nickl, R. (2016). Adaptive confidence sets for matrix completion. Preprint. Available at arXiv:1608.04861.
[25] Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. Preprint. Available at arXiv:1509.03025.
[26] Cottet, V. and Alquier, P. (2018). 1-Bit matrix completion: PAC-Bayesian analysis of a variational approximation. Mach. Learn. 107 579–603. · Zbl 1461.15032
[27] Davenport, M. A., Plan, Y., van den Berg, E. and Wootters, M. (2014). 1-bit matrix completion. Inf. Inference 3 189–223. · Zbl 1309.62124 · doi:10.1093/imaiai/iau006
[28] de Leeuw, J. and van der Heijden, P. G. M. (1988). Correspondence analysis of incomplete contingency tables. Psychometrika 53 223–233. · Zbl 0718.62116 · doi:10.1007/BF02294134
[29] Devolder, O., Glineur, F. and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146 37–75. · Zbl 1317.90196 · doi:10.1007/s10107-013-0677-5
[30] Fithian, W. and Mazumder, R. (2013). Scalable convex methods for flexible low-rank matrix modeling. Preprint. Available at arXiv:1308.4211.
[31] Fithian, W., Elith, J., Hastie, T. and Keith, D. A. (2015). Bias correction in species distribution models: Pooling survey and collection data for multiple species. Methods Ecol. Evol. 6 424–438.
[32] Foygel, R. and Srebro, N. (2011). Concentration-based guarantees for low-rank matrix reconstruction. In COLT 315–340.
[33] Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Nav. Res. Logist. Q. 3 95–110.
[34] Freund, R. M., Grigas, P. and Mazumder, R. (2017). An extended Frank–Wolfe method with “in-face” directions, and its application to low-rank matrix completion. SIAM J. Optim. 27 319–346. · Zbl 1357.90115 · doi:10.1137/15M104726X
[35] Gerrish, S. and Blei, D. M. (2011). Predicting legislative roll calls from text. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) 489–496.
[36] Golub, G. H. and Van Loan, C. F. (1983). Matrix Computations. Johns Hopkins Series in the Mathematical Sciences 3. Johns Hopkins Univ. Press, Baltimore, MD. · Zbl 0559.65011
[37] Hastie, T., Buja, A. and Tibshirani, R. (1995). Penalized discriminant analysis. Ann. Statist. 23 73–102. · Zbl 0821.62031 · doi:10.1214/aos/1176324456
[38] Hastie, T., Tibshirani, R. and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Monographs on Statistics and Applied Probability 143. CRC Press, Boca Raton, FL. · Zbl 1319.68003
[39] Hastie, T., Mazumder, R., Lee, J. D. and Zadeh, R. (2015). Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 16 3367–3402. · Zbl 1352.65117
[40] Huber, P. J. (2011). Robust Statistics. Springer, New York.
[41] Jaggi, M. and Sulovský, M. (2010). A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) 471–478.
[42] Jain, P., Netrapalli, P. and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization (extended abstract). In STOC’13—Proceedings of the 2013 ACM Symposium on Theory of Computing 665–674. ACM, New York. · Zbl 1293.65073
[43] Josse, J. and Husson, F. (2012). Handling missing values in exploratory multivariate data analysis methods. J. SFdS 153 79–99. · Zbl 1316.62006
[44] Josse, J., Wager, S. and Husson, F. (2016). Confidence areas for fixed-effects PCA. J. Comput. Graph. Statist. 25 28–48. · doi:10.1080/10618600.2014.950871
[45] Journée, M., Bach, F., Absil, P.-A. and Sepulchre, R. (2010). Low-rank optimization on the cone of positive semidefinite matrices. SIAM J. Optim. 20 2327–2351. · Zbl 1215.65108 · doi:10.1137/080731359
[46] Keshavan, R. H., Montanari, A. and Oh, S. (2010). Matrix completion from a few entries. IEEE Trans. Inform. Theory 56 2980–2998. · Zbl 1366.62111 · doi:10.1109/TIT.2010.2046205
[47] Klopp, O., Lafond, J., Moulines, É. and Salmon, J. (2015). Adaptive multinomial matrix completion. Electron. J. Stat. 9 2950–2975. · Zbl 1329.62304 · doi:10.1214/15-EJS1093
[48] Koren, Y. (2010). Collaborative filtering with temporal dynamics. Commun. ACM 53 89–97.
[49] Koren, Y., Bell, R. and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer 42 30–37.
[50] Lafond, J. (2015). Low rank matrix completion with exponential family noise. Preprint. Available at arXiv:1502.06919.
[51] Larsen, R. M. (2004). PROPACK—Software for large and sparse SVD calculations. Available at http://sun.stanford.edu/~rmunk/PROPACK.
[52] Lee, J., Recht, B., Srebro, N., Tropp, J. and Salakhutdinov, R. (2010). Practical large-scale optimization for max-norm regularization. In Advances in Neural Information Processing Systems 1297–1305.
[53] Lesieur, T., Krzakala, F. and Zdeborová, L. (2015). MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton) 680–687. IEEE, Los Alamitos, CA.
[54] Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London. · Zbl 0432.62029
[55] Martin, A. D. and Quinn, K. M. (2002). Dynamic ideal point estimation via Markov chain Monte Carlo for the US Supreme Court, 1953–1999. Polit. Anal. 10 134–153.
[56] Mazumder, R., Hastie, T. and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11 2287–2322. · Zbl 1242.68237
[57] Mazumder, R., Radchenko, P. and Dedieu, A. (2017). Subset selection with shrinkage: Sparse linear modeling when the SNR is low. Preprint. Available at arXiv:1708.03288.
[58] Menon, A. K. and Elkan, C. (2010). A log-linear model with latent features for dyadic prediction. In 2010 IEEE 10th International Conference on Data Mining (ICDM) 364–373. IEEE, Los Alamitos, CA.
[59] Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 13 1665–1697. · Zbl 1436.62204
[60] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87. Kluwer Academic, Boston, MA. · Zbl 1086.90045
[61] Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Math. Program. 103 127–152. · Zbl 1079.90102 · doi:10.1007/s10107-004-0552-5
[62] Parker, J. T., Schniter, P. and Cevher, V. (2014a). Bilinear generalized approximate message passing—Part I: Derivation. IEEE Trans. Signal Process. 62 5839–5853. · Zbl 1394.94447
[63] Parker, J. T., Schniter, P. and Cevher, V. (2014b). Bilinear generalized approximate message passing—Part II: Applications. IEEE Trans. Signal Process. 62 5854–5867. · Zbl 1394.94448 · doi:10.1109/TSP.2014.2357773
[64] Reinsel, G. C. and Velu, R. P. (1998). Multivariate Reduced-Rank Regression: Theory and Applications. Lecture Notes in Statistics 136. Springer, New York. · Zbl 0909.62066
[65] Rennie, J. and Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In ICML.
[66] Roweis, S. (1998). EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 10 626–632. MIT Press, Cambridge, MA.
[67] Rubin, D. B. (1976). Inference and missing data. Biometrika 63 581–592. With comments by R. J. A. Little and a reply by the author. · Zbl 0344.62034 · doi:10.1093/biomet/63.3.581
[68] Salakhutdinov, R. and Mnih, A. (2008a). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning 880–887. ACM, New York.
[69] Salakhutdinov, R. and Mnih, A. (2008b). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer and S. T. Roweis, eds.) 1257–1264. MIT Press, Cambridge, MA.
[70] Salakhutdinov, R. and Srebro, N. (2010). Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. Preprint. Available at arXiv:1002.2780.
[71] Srebro, N., Rennie, J. and Jaakkola, T. (2005). Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17 1329–1336. MIT Press, Cambridge, MA.
[72] Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Learning Theory. Lecture Notes in Computer Science 3559 545–560. Springer, Berlin. · Zbl 1137.68563
[73] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. · Zbl 0850.62538
[74] Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B. Stat. Methodol. 61 611–622. · Zbl 0924.62068 · doi:10.1111/1467-9868.00196
[75] Todeschini, A., Caron, F. and Chavent, M. (2013). Probabilistic low-rank matrix completion with adaptive spectral regularization algorithms. In Advances in Neural Information Processing Systems 845–853.
[76] Udell, M., Horn, C., Zadeh, R. and Boyd, S. (2016). Generalized low rank models. Found. Trends Mach. Learn. 9 1–118. · Zbl 1350.68221 · doi:10.1561/2200000055
[77] Yang, Y., Ma, J. and Osher, S. (2013). Seismic data reconstruction via matrix completion. Inverse Probl. Imaging 7 1379–1392. · Zbl 1292.15017 · doi:10.3934/ipi.2013.7.1379
[78] Yee, T. W. and Hastie, T. J. (2003). Reduced-rank vector generalized linear models. Stat. Model. 3 15–41. · Zbl 1195.62123 · doi:10.1191/1471082X03st045oa
[79] Yuan, M., Ekici, A., Lu, Z. and Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 69 329–346. · Zbl 07555355