Homogeneity structure learning in large-scale panel data with heavy-tailed errors. (English) Zbl 1540.62110

Summary: Large-scale panel data are ubiquitous in modern data science applications. Conventional panel data analysis methods fail to address new challenges, such as individual impacts of covariates, endogeneity, embedded low-dimensional structure, and heavy-tailed errors, that arise from innovations in the data collection platforms on which these applications operate. In response to these challenges, this paper studies large-scale panel data with an interactive effects model. This model takes into account the individual impacts of covariates on each spatial node and removes the exogeneity condition by allowing latent factors to affect both covariates and errors. In addition, we waive the sub-Gaussian assumption and allow the errors to be heavy-tailed. Further, we propose a data-driven procedure to learn a parsimonious yet flexible homogeneity structure embedded in the high-dimensional individual impacts of covariates. The homogeneity structure assumes that there exists a partition of the regression coefficients such that the coefficients are the same within each group but differ between groups. The homogeneity structure is flexible, as it contains many widely assumed low-dimensional structures (sparsity, global impact, etc.) as special cases. Non-asymptotic properties are established to justify the proposed learning procedure. Extensive numerical experiments demonstrate the advantage of the proposed learning procedure over conventional methods, especially when the data are generated from heavy-tailed distributions.
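
The homogeneity structure described in the summary can be written out as follows. This is only a sketch in assumed notation, not taken from the paper: $n$ nodes, $p$ covariates, $\beta_{ij}$ for the coefficient of covariate $j$ at node $i$, $K$ for the number of groups, $G_k$ for the groups, and $b_k$ for the common values are all illustrative symbols.

```latex
% Partition of the index set of all regression coefficients into K groups
\[
  \{(i,j) : 1 \le i \le n,\ 1 \le j \le p\} = \bigcup_{k=1}^{K} G_k,
  \qquad G_k \cap G_l = \emptyset \quad (k \ne l),
\]
% Coefficients agree within a group and differ between groups
\[
  \beta_{ij} = b_k \quad \text{for all } (i,j) \in G_k,
  \qquad b_k \ne b_l \quad (k \ne l).
\]
```

In this notation, sparsity corresponds to one group having $b_k = 0$, and a global impact of covariate $j$ corresponds to $\beta_{1j}, \dots, \beta_{nj}$ all falling in the same group. When $K$ is much smaller than $np$, the number of free coefficient values drops from $np$ to $K$.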

MSC:

62M10 Time series, auto-correlation, regression, etc. in statistics (GARCH)
62H12 Estimation in multivariate analysis
62H25 Factor analysis and principal components; correspondence analysis
68T05 Learning and adaptive systems in artificial intelligence
