Direct covariance matrix estimation with compositional data. (English) Zbl 07855841

Summary: Compositional data arise in many areas of research in the natural and biomedical sciences. One prominent example is in the study of the human gut microbiome, where one can measure the relative abundance of many distinct microorganisms in a subject’s gut. Often, practitioners are interested in learning how the dependencies between microbes vary across distinct populations or experimental conditions. In statistical terms, the goal is to estimate a covariance matrix for the (latent) log-abundances of the microbes in each of the populations. However, the compositional nature of the data prevents the use of standard estimators for these covariance matrices. In this article, we propose an estimator of multiple covariance matrices which allows for information sharing across distinct populations of samples. Compared to some existing estimators, which estimate the covariance matrices of interest indirectly, our estimator is direct, ensures positive definiteness, and is the solution to a convex optimization problem. We compute our estimator using a proximal-proximal gradient descent algorithm. Asymptotic properties of our estimator reveal that it can perform well in high-dimensional settings. We show that our method provides more reliable estimates than competitors in an analysis of microbiome data from subjects with myalgic encephalomyelitis/chronic fatigue syndrome and through simulation studies.
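To illustrate why standard estimators do not apply directly, the following is a minimal sketch under stated assumptions, not the estimator proposed in the article. It simulates hypothetical latent log-abundances with an illustrative covariance matrix `omega`, normalizes them to compositions, and checks two well-known facts: the sample covariance of centered log-ratio (clr) data is singular, and the log-ratio variances computed from the compositions equal `s_jj + s_kk - 2 * s_jk`, where `s` is the sample covariance of the unobserved log-abundances. All names and numerical values here are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4

# Hypothetical covariance of the latent log-abundances (illustrative values only).
omega = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.3, 0.0],
    [0.0, 0.3, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
])

# Latent log-abundances and the compositions actually observed.
log_w = rng.multivariate_normal(np.zeros(p), omega, size=n)
w = np.exp(log_w)
x = w / w.sum(axis=1, keepdims=True)  # each row sums to one

# Centered log-ratio transform: subtract each sample's mean log-proportion.
log_x = np.log(x)
clr = log_x - log_x.mean(axis=1, keepdims=True)
clr_cov = np.cov(clr, rowvar=False)

# Every clr row sums to zero, so the clr covariance has zero row sums
# and therefore a zero eigenvalue (it is not positive definite).
row_sums = clr_cov.sum(axis=1)
smallest_eig = np.linalg.eigvalsh(clr_cov)[0]


def variation_matrix(log_data):
    """Sample variance of every pairwise log-ratio (columns j and k)."""
    q = log_data.shape[1]
    t = np.empty((q, q))
    for j in range(q):
        for k in range(q):
            t[j, k] = np.var(log_data[:, j] - log_data[:, k], ddof=1)
    return t


# Log-ratios are unchanged by normalization, so the variation matrix of the
# compositions matches the one implied by the latent sample covariance.
t_obs = variation_matrix(log_x)
s = np.cov(log_w, rowvar=False)
d = np.diag(s)
t_latent = d[:, None] + d[None, :] - 2.0 * s

print("row sums of clr covariance:", np.round(row_sums, 12))
print("smallest eigenvalue of clr covariance:", smallest_eig)
print("variation matrices agree:", np.allclose(t_obs, t_latent))
```

The compositions determine only the combinations `s_jj + s_kk - 2 * s_jk`, not the individual entries of the covariance matrix, which is one way to see why recovering the latent covariance generally requires additional structure or regularization.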

MSC:

62-XX Statistics

Software:

CCLasso; glasso; spcov; MEGAN
