Information-based optimal subdata selection for big data linear regression. (English) Zbl 1478.62196

Summary: Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large datasets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators converge to 0 as the full data size increases even if the subdata size is fixed, that is, the convergence rate depends on the full data size; (iv) data analysis for IBOSS subdata is straightforward and the sampling distribution of an IBOSS estimator is easy to assess. Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude. The advantages of the new approach are also illustrated through analysis of real data.
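
The summary does not spell out the selection rule, but the paper's D-optimality-motivated algorithm keeps, for each covariate in turn, the rows with the most extreme covariate values among those not yet taken. The following is a minimal sketch in R (the software listed below) of that style of selection. The function name iboss_subset, the equal per-covariate budget r = k/(2p), and the use of a full sort via order() are illustrative choices, not the authors' reference implementation; the paper instead relies on partition-based partial sorting (cf. references [16] and [18]) to achieve roughly O(n) cost per covariate.

    # Sketch of IBOSS-style subdata selection (illustrative, not the authors' code):
    # for each covariate, keep the r rows with the smallest and the r rows with the
    # largest values among rows not already selected, with r = floor(k / (2p)).
    iboss_subset <- function(X, k) {
      n <- nrow(X); p <- ncol(X)
      r <- floor(k / (2 * p))        # per-covariate budget of extreme rows
      taken <- logical(n)            # marks rows already selected
      for (j in seq_len(p)) {
        avail <- which(!taken)       # rows still available
        ord <- avail[order(X[avail, j])]          # full sort; the paper uses
        pick <- c(head(ord, r), tail(ord, r))     # partial sorting for speed
        taken[pick] <- TRUE
      }
      which(taken)
    }

    # Usage: ordinary least squares is then fit on the subdata only.
    set.seed(1)
    n <- 1e5; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    y <- X %*% rep(1, p) + rnorm(n)
    idx <- iboss_subset(X, k = 1000)
    fit <- lm(y[idx] ~ X[idx, ])
    coef(fit)

Because the selected rows carry extreme covariate values, the information matrix of the subdata grows with the full sample size n, which is the mechanism behind advantage (iii) above: for fixed subdata size k, the slope estimators' variances still shrink as n increases.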

MSC:

62J05 Linear regression; mixed models
62F10 Point estimation
62J07 Ridge regression; shrinkage estimators (Lasso)
62K05 Optimal statistical designs

Software:

R

References:

[1] Candes, E.; Tao, T., The Dantzig Selector: Statistical Estimation When p is Much Larger Than n, The Annals of Statistics, 35, 2313-2351 (2007) · Zbl 1139.62019
[2] Drineas, P.; Magdon-Ismail, M.; Mahoney, M.; Woodruff, D., Faster Approximation of Matrix Coherence and Statistical Leverage, Journal of Machine Learning Research, 13, 3475-3506 (2012) · Zbl 1437.65030
[3] Drineas, P.; Mahoney, M.; Muthukrishnan, S.; Sarlos, T., Faster Least Squares Approximation, Numerische Mathematik, 117, 219-249 (2011) · Zbl 1218.65037
[4] Drineas, P.; Mahoney, M. W.; Muthukrishnan, S., Sampling Algorithms for \(\ell_2\) Regression and Applications, Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, 1127-1136 (2006) · Zbl 1194.62010
[5] Fan, J.; Lv, J., Sure Independence Screening for Ultrahigh Dimensional Feature Space, Journal of the Royal Statistical Society, 70, 849-911 (2008) · Zbl 1411.62187
[6] Fonollosa, J.; Sheik, S.; Huerta, R.; Marco, S., Reservoir Computing Compensates Slow Response of Chemosensor Arrays Exposed to Fast Varying Gas Concentrations in Continuous Monitoring, Sensors and Actuators B: Chemical, 215, 618-629 (2015)
[7] Galambos, J., The Asymptotic Theory of Extreme Order Statistics (1987), Florida: Robert E. Krieger, Florida · Zbl 0634.62044
[8] Goodson, D. Z., Mathematical Methods for Physical and Analytical Chemistry (2011), New York: Wiley, New York · Zbl 1225.00005
[9] Hadamard, J., Résolution d’une Question Relative aux Déterminants, Bulletin des Sciences Mathématiques, 17, 240-246 (1893) · JFM 25.0221.02
[10] Hall, P., On the Relative Stability of Large Order Statistics, Mathematical Proceedings of the Cambridge Philosophical Society, 86, 467-475 (1979), Cambridge: Cambridge University Press, Cambridge · Zbl 0421.62031
[11] Kiefer, J., Optimum Experimental Designs, Journal of the Royal Statistical Society, 21, 272-319 (1959) · Zbl 0108.15303
[12] Lin, N.; Xie, R., Aggregated Estimating Equation Estimation, Statistics and Its Interface, 4, 73-83 (2011) · Zbl 1245.62026
[13] Ma, P.; Mahoney, M.; Yu, B., A Statistical Perspective on Algorithmic Leveraging, Proceedings of the 31st International Conference on Machine Learning (ICML-14), 91-99 (2014)
[14] ———, A Statistical Perspective on Algorithmic Leveraging, Journal of Machine Learning Research, 16, 861-911 (2015) · Zbl 1337.62164
[15] Ma, P.; Sun, X., Leveraging for Big Data Regression, Wiley Interdisciplinary Reviews: Computational Statistics, 7, 70-76 (2015)
[16] Martínez, C., Partial Quicksort, Proc. 6th ACM-SIAM Workshop on Algorithm Engineering and Experiments and 1st ACM-SIAM Workshop on Analytic Algorithmics and Combinatorics, 224-228 (2004) · Zbl 1076.68607
[17] Meinshausen, N.; Yu, B., Lasso-Type Recovery of Sparse Representations for High-Dimensional Data, The Annals of Statistics, 37, 246-270 (2009) · Zbl 1155.62050
[18] Musser, D. R., Introspective Sorting and Selection Algorithms, Software: Practice and Experience, 27, 983-993 (1997)
[19] R Core Team, R: A Language and Environment for Statistical Computing (2015), Vienna, Austria: R Foundation for Statistical Computing, Vienna, Austria
[20] Schifano, E. D.; Wu, J.; Wang, C.; Yan, J.; Chen, M.-H., Online Updating of Statistical Inference in the Big Data Setting, Technometrics, 58, 393-403 (2016)
[21] Stroustrup, B., The C++ Programming Language (1986), India: Pearson Education, India · Zbl 0609.68011
[22] Thompson, E. E.; Sowers, M.; Frongillo, E.; Parpia, B., Sources of Fiber and Fat in Diets of U.S. Women Aged 19 to 50: Implications for Nutrition Education and Policy, American Journal of Public Health, 82, 695-702 (1992)
[23] Tibshirani, R., Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, 58, 267-288 (1996) · Zbl 0850.62538
[24] Wang, H.; Zhu, R.; Ma, P., Optimal Subsampling for Large Sample Logistic Regression, Journal of the American Statistical Association (2017)