On robustness of principal component regression. (English) Zbl 1506.68032

Summary: Principal component regression (PCR) is a simple, but powerful and ubiquitously utilized method. Its effectiveness is well established when the covariates exhibit low-rank structure. However, its ability to handle settings with noisy, missing, and mixed-valued, that is, discrete and continuous, covariates is not understood and remains an important open challenge. As the main contribution of this work, we establish the robustness of PCR, without any change, in this respect and provide meaningful finite-sample analysis. To do so, we establish that PCR is equivalent to performing linear regression after preprocessing the covariate matrix via hard singular value thresholding (HSVT). As a result, in the context of counterfactual analysis using observational data, we show PCR is equivalent to the recently proposed robust variant of the synthetic control method, known as robust synthetic control (RSC). As an immediate consequence, we obtain finite-sample analysis of the RSC estimator that was previously absent. As an important contribution to the synthetic controls literature, we establish that an (approximate) linear synthetic control exists in the setting of a generalized factor model, or latent variable model; traditionally in the literature, the existence of a synthetic control is assumed as an axiom. We further discuss a surprising implication of the robustness property of PCR with respect to noise, that is, PCR can learn a good predictive model even if the covariates are tactfully transformed to preserve differential privacy. Finally, this work advances the state-of-the-art analysis for HSVT by establishing stronger guarantees with respect to the \(l_{2,\infty}\)-norm rather than the Frobenius norm as is commonly done in the matrix estimation literature, which may be of interest in its own right.
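
Illustration (not part of the original summary): the equivalence between PCR and linear regression after HSVT can be checked numerically. The sketch below is a minimal NumPy example under stated assumptions: the function names `hsvt`, `pcr_coefficients`, and `ols_after_hsvt` are hypothetical, the covariates are not centered, the number of retained components `k` is supplied by hand, and the regression on the thresholded matrix is taken to be the minimum-norm least-squares solution. It does not reproduce the paper's treatment of missing entries or its finite-sample analysis.

```python
import numpy as np

def hsvt(X, k):
    """Hard singular value thresholding: keep the top-k singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def pcr_coefficients(X, y, k):
    """PCR: regress y on the top-k principal component scores, then map the
    coefficients back to the original covariate space."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k, :].T                      # top-k right singular vectors
    Z = X @ V_k                            # n x k matrix of component scores
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return V_k @ gamma

def ols_after_hsvt(X, y, k):
    """Minimum-norm least squares on the rank-k approximation of X.
    The explicit rcond discards the numerically zero singular values."""
    X_k = hsvt(X, k)
    beta, *_ = np.linalg.lstsq(X_k, y, rcond=1e-10)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, k = 50, 20, 3
    # Synthetic data assumed for illustration: low-rank covariates plus noise.
    A = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
    X = A + 0.1 * rng.normal(size=(n, p))
    y = A @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
    diff = np.max(np.abs(pcr_coefficients(X, y, k) - ols_after_hsvt(X, y, k)))
    print("max coefficient difference:", diff)
```

Both routines compute \(V_k S_k^{-1} U_k^{T} y\) from the truncated SVD of the covariate matrix, so the two coefficient vectors should agree up to floating-point error.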

MSC:

68Q25 Analysis of algorithms and problem complexity
62H25 Factor analysis and principal components; correspondence analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
