
Integrated principal components analysis. (English) Zbl 07626713

Summary: Data integration, or the strategic analysis of multiple sources of data simultaneously, can often lead to discoveries that may be hidden in individualistic analyses of a single data source. We develop a new unsupervised data integration method named Integrated Principal Components Analysis (iPCA), which is a model-based generalization of PCA and serves as a practical tool to find and visualize common patterns that occur in multiple data sets. The key idea driving iPCA is the matrix-variate normal model, whose Kronecker product covariance structure captures both individual patterns within each data set and joint patterns shared by multiple data sets. Building upon this model, we develop several penalized (sparse and non-sparse) covariance estimators for iPCA, and using geodesic convexity, we prove that our non-sparse iPCA estimator converges to the global solution of a non-convex problem. We also demonstrate the practical advantages of iPCA through extensive simulations and a case study application to integrative genomics for Alzheimer’s disease. In particular, we show that the joint patterns extracted via iPCA are highly predictive of a patient’s cognition and Alzheimer’s diagnosis.
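
To make the summary's key idea concrete, here is a minimal sketch of the matrix-variate normal model with Kronecker product covariance (written under one common convention; the symbols M_k, Σ, and Δ_k are chosen here for exposition, and the paper's exact formulation may differ):

\[ X_k \sim \mathcal{N}_{n \times p_k}\bigl(M_k,\ \Sigma \otimes \Delta_k\bigr), \qquad k = 1, \dots, K, \]

where each X_k is an n × p_k data matrix observed on the same n samples, Σ (n × n) is a row covariance shared by all K data sets and hence carries the joint patterns, and Δ_k (p_k × p_k) is a column covariance specific to the k-th data set. In this reading, the joint patterns that iPCA extracts and visualizes correspond to leading eigenvectors of an estimate of Σ.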

MSC:

68T05 Learning and adaptive systems in artificial intelligence

References:

[1] H. Abdi, L. J. Williams, and D. Valentin. Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdisciplinary Reviews: Computational Statistics, 5(2):149-179, 2013. · Zbl 1540.62004
[2] E. Acar, E. E. Papalexakis, G. Gürdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro. Structure-revealing data fusion. BMC Bioinformatics, 15(1):239, Jul 2014. ISSN 1471-2105. doi: 10.1186/1471-2105-15-239. URL https://doi.org/10.1186/1471-2105-15-239.
[3] G. I. Allen and R. Tibshirani. Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics, 4(2):764-790, 2010. · Zbl 1194.62079
[4] O. Alter, P. O. Brown, and D. Botstein. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences, 100(6):3351-3356, 2003.
[5] K. Benidis, Y. Sun, P. Babu, and D. P. Palomar. Orthogonal sparse PCA and covariance estimation via Procrustes reformulation. IEEE Transactions on Signal Processing, 64(23):6211-6226, 2016. URL http://www.danielppalomar.com/publications.html. · Zbl 1414.94069
[6] O. Carrette, I. Demalte, A. Scherl, O. Yalkinoglu, G. Corthals, P. Burkhard, D. F. Hochstrasser, and J. C. Sanchez. A panel of cerebrospinal fluid potential biomarkers for the diagnosis of Alzheimer’s disease. Proteomics, 3(8):1486-1494, 2003.
[7] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265-274, 1981. · Zbl 0464.62039
[8] P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2):105-123, 1999. · Zbl 0960.62056
[9] B. Escofier and J. Pages. Multiple factor analysis. Computational Statistics & Data Analysis, 18(1):121-140, 1994. · Zbl 0825.62517
[10] I. Espuny-Camacho, A. M. Arranz, M. Fiers, A. Snellinx, K. Ando, S. Munck, J. Bonnefont, L. Lambot, N. Corthout, L. Omodho, et al. Hallmarks of Alzheimer’s disease in stem-cell-derived human neurons transplanted into mouse brain. Neuron, 93(5):1066-1081, 2017.
[11] J. Fan, D. Wang, K. Wang, and Z. Zhu. Distributed estimation of principal eigenspaces. ArXiv e-prints, February 2017. · Zbl 1450.62067
[12] M. Ghil and P. Malanotte-Rizzoli. Data assimilation in meteorology and oceanography. In Advances in Geophysics, volume 33, pages 141-266. Elsevier, 1991.
[13] K. Greenewald and A. O. Hero. Robust Kronecker product PCA for spatio-temporal covariance estimation. IEEE Transactions on Signal Processing, 63(23):6368-6378, 2015. · Zbl 1395.94113
[14] A. K. Gupta and D. K. Nagar. Matrix variate distributions. CRC Press, 1999. · Zbl 0935.62064
[15] P. Han, W. Liang, L. C. Baxter, J. Yin, Z. Tang, T. G. Beach, R. J. Caselli, E. M. Reiman, and J. Shi. Pituitary adenylate cyclase-activating polypeptide is reduced in Alzheimer disease. Neurology, 82(19):1724-1728, 2014.
[16] T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1):3367-3402, 2015. · Zbl 1352.65117
[17] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, 2012.
[18] C. J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330-2338, 2011.
[19] Sijia Huang, Kumardeep Chaudhary, and Lana X Garmire. More is better: recent progress in multi-omics data integration methods. Frontiers in Genetics, 8:84, 2017.
[20] Haoming Jiang, Xinyu Fei, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman, Xingguo Li, and Tuo Zhao. huge: High-Dimensional Undirected Graph Estimation, 2019. URL https://CRAN.R-project.org/package=huge. R package version 1.3.2.
[21] W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118-127, 2007. · Zbl 1170.62389
[22] Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365-411, 2004. · Zbl 1032.62050
[23] Y. Li, Z. Chen, Y. Gao, G. Pan, H. Zheng, Y. Zhang, H. Xu, G. Bu, and H. Zheng. Synaptic adhesion molecule Pcdh-γC5 mediates synaptic dysfunction in Alzheimer’s disease. Journal of Neuroscience, pages 1051-17, 2017.
[24] E. F. Lock, K. A. Hoadley, J. S. Marron, and A. B. Nobel. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics, 7(1):523, 2013. · Zbl 1454.62355
[25] X. L. Meng and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267-278, 1993. ISSN 00063444. URL http://www.jstor.org/stable/2337198. · Zbl 0778.62022
[26] S. Mostafavi, C. Gaiteri, S. E. Sullivan, C. C. White, S. Tasaki, J. Xu, M. Taga, H. U. Klein, E. Patrick, V. Komashko, et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer’s disease. Nature Neuroscience, 21(6):811-819, 2018. ISSN 1546-1726. URL https://doi.org/10.1038/s41593-018-0154-9.
[27] S. P. Ponnapalli, M. A. Saunders, C. F. Van Loan, and O. Alter. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS ONE, 6(12):e28072, 2011.
[28] T. Rapcsák. Geodesic convexity in nonlinear optimization. Journal of Optimization Theory and Applications, 69(1):169-183, Apr 1991. ISSN 1573-2878. doi: 10.1007/BF00940467. URL https://doi.org/10.1007/BF00940467. · Zbl 0702.90066
[29] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515, 2008. doi: 10.1214/08-EJS176. · Zbl 1320.62135
[30] S. T. Shivappa, M. M. Trivedi, and B. D. Rao. Audiovisual information fusion in human-computer interfaces and intelligent environments: A survey. Proceedings of the IEEE, 98(10):1692-1715, 2010.
[31] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 650-658. ACM, 2008.
[32] The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609, 2011.
[33] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475-494, 2001. · Zbl 1006.65062
[34] T. Tsiligkaridis, A. O. Hero, and S. Zhou. On convergence of Kronecker graphical lasso algorithms. IEEE Transactions on Signal Processing, 61:1743-1755, 2013. · Zbl 1394.62095
[35] N. K. Vishnoi. Geodesic Convex Optimization: Differentiation on Manifolds, Geodesics, and Convexity. ArXiv e-prints, June 2018.
[36] J. A. Westerhuis, T. Kourti, and J. F. MacGregor. Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics, 12(5):301-321, 1998.
[37] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182, 2012. · Zbl 1393.94489
[38] J. Yin and H. Li. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis, 107:119-140, 2012. · Zbl 1236.62058
[39] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315-323, 2015. doi: 10.1093/biomet/asv008. URL http://dx.doi.org/10.1093/biomet/asv008. · Zbl 1452.15010
[40] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617-1638, 2016.
[41] S. Zhou. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532-562, 2014a. · Zbl 1301.62054
[42] S. Zhou. Supplement to “Gemini: Graph estimation with matrix variate normal instances”. 2014b. · Zbl 1301.62054
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.