
A selective review on statistical methods for massive data computation: distributed computing, subsampling, and minibatch techniques. (English) Zbl 07927934

Summary: This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been developed rapidly over the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature concerns distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by a single computer, so that a distributed computing system with multiple machines has to be used. The second class of literature concerns subsampling methods and addresses the situation where the sample size of the dataset is small enough to be stored on a single computer but too large to be easily processed by its memory as a whole. The last class of literature studies minibatch gradient-related optimization techniques, which have been used extensively for training various deep learning models.
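To make the three categories concrete, the following is a minimal, self-contained sketch (not taken from the paper under review) that contrasts them on a simulated ordinary least squares problem. All names and tuning choices here (the ols helper, K = 10 workers, subsample size n = 2000, learning rate 0.01, batch size 64, 5 epochs) are illustrative assumptions rather than prescriptions from the reviewed literature.

```python
# Illustrative sketch of the three strategies named in the summary,
# applied to ordinary least squares on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100_000, 5
X = rng.standard_normal((N, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.standard_normal(N)

def ols(Xs, ys):
    """Closed-form OLS estimate computed on one block of data."""
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# (1) Distributed computing: split the data across K "workers",
#     estimate locally, then combine by one-shot averaging.
K = 10
blocks = np.array_split(np.arange(N), K)
beta_dist = np.mean([ols(X[idx], y[idx]) for idx in blocks], axis=0)

# (2) Subsampling: draw a uniform subsample of size n << N
#     and compute the estimator on the subsample only.
n = 2_000
idx = rng.choice(N, size=n, replace=False)
beta_sub = ols(X[idx], y[idx])

# (3) Minibatch SGD: update the estimate with gradients of the
#     squared-error loss computed on small random minibatches.
beta_sgd = np.zeros(p)
lr, batch, epochs = 0.01, 64, 5
for _ in range(epochs):
    perm = rng.permutation(N)
    for start in range(0, N, batch):
        b = perm[start:start + batch]
        grad = X[b].T @ (X[b] @ beta_sgd - y[b]) / len(b)
        beta_sgd -= lr * grad

for name, est in [("distributed", beta_dist),
                  ("subsample", beta_sub),
                  ("minibatch SGD", beta_sgd)]:
    print(f"{name:>14s} estimation error: {np.linalg.norm(est - beta_true):.4f}")
```

Under this toy setup, the distributed average uses all N observations but only local computation, the subsample estimator trades statistical efficiency for memory, and minibatch SGD never forms the full design matrix at all, mirroring the three situations described in the summary.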

MSC:

62-XX Statistics

References:

[1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., …Zheng, X. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv: 1603.04467.
[2] Agarwal, N., Bullins, B., & Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(1), 4148-4187. · Zbl 1441.90115
[3] Ai, M., Yu, J., Zhang, H., & Wang, H. (2021). Optimal subsampling algorithms for big data regressions. Statistica Sinica, 31(2), 749-772. · Zbl 1469.62422
[4] Alhamzawi, R., & Ali, H. T. M. (2018). Bayesian quantile regression for ordinal longitudinal data. Journal of Applied Statistics, 45(5), 815-828. · Zbl 1516.62112
[5] Assran, M., & Rabbat, M. (2020). On the convergence of Nesterov’s accelerated gradient method in stochastic settings. In International Conference on Machine Learning. PMLR.
[6] Bach, F., & Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate \(O(1/n) \) . In Advances in neural information processing systems. Curran Associates, Inc.
[7] Battey, H., Fan, J., Liu, H., Lu, J., & Zhu, Z. (2018). Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3), 1352-1382. · Zbl 1392.62060
[8] Bauer, M., Cook, H., & Khailany, B. (2011). CudaDMA: Optimizing GPU memory bandwidth via warp specialization. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery.
[9] Baydin, A. G., Cornish, R., Rubio, D. M., Schmidt, M., & Wood, F. (2017). Online learning rate adaptation with hypergradient descent. arXiv: 1703.04782.
[10] Beck, A. (2017). First-order methods in optimization. Society for Industrial and Applied Mathematics, SIAM. · Zbl 1384.65033
[11] Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202. · Zbl 1175.94009
[12] Bellet, A., Guerraoui, R., Taziki, M., & Tommasi, M. (2018). Personalized and private peer-to-peer machine learning. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics. PMLR.
[13] Bergou, E. H., Diouane, Y., Kunc, V., Kungurtsev, V., & Royer, C. W. (2022). A subsampling line-search method with second-order results. INFORMS Journal on Optimization, 4(4), 403-425.
[14] Bickel, P. J., & Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6), 1196-1217. · Zbl 0449.62034
[15] Bickel, P. J., Götze, F., & van Zwet, W. R. (1997). Resampling fewer than n observations: Gains, losses, and remedies for losses. Statistica Sinica, 7(1), 1-31. · Zbl 0927.62043
[16] Blot, M., Picard, D., Cord, M., & Thome, N. (2016). Gossip training for deep learning. arXiv: 1611.09726.
[17] Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223-311. · Zbl 1397.65085
[18] Broyden, C. G., Dennis Jr, J. E., & Moré, J. J. (1973). On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3), 223-245. · Zbl 0282.65041
[19] Casella, G., & Berger, R. L. (2002). Statistical inference. Duxbury, Pacific Grove, CA.
[20] Chang, X., Lin, S., & Wang, Y. (2017). Divide and conquer local average regression. Electronic Journal of Statistics, 11(1), 1326-1350. · Zbl 1362.62085
[21] Chen, C. W., Dunson, D. B., Reed, C., & Yu, K. (2013). Bayesian variable selection in quantile regression. Statistics and Its Interface, 6(2), 261-274. · Zbl 1327.62135
[22] Chen, S., Yu, D., Zou, Y., Yu, J., & Cheng, X. (2022). Decentralized wireless federated learning with differential privacy. IEEE Transactions on Industrial Informatics, 18(9), 6273-6282.
[23] Chen, S. X., & Peng, L. (2021). Distributed statistical inference for massive data. The Annals of Statistics, 49(5), 2851-2869. · Zbl 1486.62123
[24] Chen, W., Wang, Z., & Zhou, J. (2014). Large-scale L-BFGS using mapreduce. In Advances in neural information processing systems. Curran Associates, Inc.
[25] Chen, X., Lee, J. D., Tong, X. T., & Zhang, Y. (2020). Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1), 251-273. · Zbl 1440.62287
[26] Chen, X., Liu, W., Mao, X., & Yang, Z. (2020). Distributed high-dimensional regression under a quantile loss function. Journal of Machine Learning Research, 21(1), 7432-7474. · Zbl 1542.62099
[27] Chen, X., Liu, W., & Zhang, Y. (2019). Quantile regression under memory constraint. The Annals of Statistics, 47(6), 3244-3273. · Zbl 1436.62134
[28] Chen, X., Liu, W., & Zhang, Y. (2022). First-order newton-type estimator for distributed estimation and inference. Journal of the American Statistical Association, 117(540), 1858-1874. · Zbl 1515.62051
[29] Chen, X., & Xie, M. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24(4), 1655-1684. · Zbl 1480.62258
[30] Chen, Z., Mou, S., & Maguluri, S. T. (2022). Stationary behavior of constant stepsize SGD type algorithms: An asymptotic characterization. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(1), 1-24.
[31] Chien, S. W. D., Markidis, S., Sishtla, C. P., Santos, L., Herman, P., Narasimhamurthy, S., & Laure, E. (2018). Characterizing deep-learning I/O workloads in TensorFlow. In International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems. IEEE.
[32] Choi, D., Passos, A., Shallue, C. J., & Dahl, G. E. (2019). Faster neural network training with data echoing. arXiv: 1907.05550.
[33] Crane, R., & Roosta, F. (2019). DINGO: Distributed Newton-type method for gradient-norm optimization. In Advances in neural information processing systems. Curran Associates, Inc.
[34] Cyrus, S., Hu, B., Van Scoy, B., & Lessard, L. (2018). A robust accelerated optimization algorithm for strongly convex functions. In 2018 Annual American Control Conference (pp. 1376-1381). IEEE.
[35] Davidon, W. C. (1991). Variable metric method for minimization. SIAM Journal on Optimization, 1(1), 1-17. · Zbl 0752.90062
[36] Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems. Curran Associates, Inc.
[37] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 248-255). IEEE.
[38] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.
[39] Dieuleveut, A., Durmus, A., & Bach, F. (2020). Bridging the gap between constant step size stochastic gradient descent and Markov chains. The Annals of Statistics, 48(3), 1348-1382. · Zbl 1454.62242
[40] Drineas, P., Mahoney, M. W., & Muthukrishnan, S. (2006). Sampling algorithms for \(\ell_2\) regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithm. Society for Industrial and Applied Mathematics.
[41] Drineas, P., Mahoney, M. W., Muthukrishnan, S., & Sarlós, T. (2011). Faster least squares approximation. Numerische Mathematik, 117(2), 219-249. · Zbl 1218.65037
[42] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121-2159.
[43] Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1-26. · Zbl 0406.62024
[44] Efron, B., & Stein, C. (1981). The jackknife estimate of variance. The Annals of Statistics, 9(3), 586-596. · Zbl 0481.62035
[45] Eisen, M., Mokhtari, A., & Ribeiro, A. (2017). Decentralized quasi-Newton methods. IEEE Transactions on Signal Processing, 65(10), 2613-2628. · Zbl 1414.94180
[46] Fan, J., Guo, Y., & Wang, K. (2023). Communication-efficient accurate statistical estimation. Journal of the American Statistical Association, 118(542), 1000-1010. · Zbl 07707218
[47] Fan, J., Li, R., Zhang, C., & Zou, H. (2020). Statistical foundations of data science. Chapman and Hall/CRC. · Zbl 1467.62001
[48] Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849-911. · Zbl 1411.62187
[49] Fan, J., Wang, D., Wang, K., & Zhu, Z. (2019). Distributed estimation of principal eigenspaces. The Annals of Statistics, 47(6), 3009-3031. · Zbl 1450.62067
[50] Gao, D., Ju, C., Wei, X., Liu, Y., Chen, T., & Yang, Q. (2019). HHHFL: Hierarchical heterogeneous horizontal federated learning for electroencephalography. arXiv: 1909.05784.
[51] Gao, Y., Li, J., Zhou, Y., Xiao, F., & Liu, H. (2021). Optimization methods for large-scale machine learning. In 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing. IEEE.
[52] Gao, Y., Zhu, X., Qi, H., Li, G., Zhang, R., & Wang, H. (2023). An asymptotic analysis of random partition based minibatch momentum methods for linear regression models. Journal of Computational and Graphical Statistics, 32(3), 1083-1096.
[53] Gargiani, M., Zanelli, A., Diehl, M., & Hutter, F. (2020). On the promise of the stochastic generalized Gauss-Newton method for training DNNs. arXiv: 2006.02409.
[54] Ge, R., Kakade, S. M., Kidambi, R., & Netrapalli, P. (2019). The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. In Advances in neural information processing systems. Curran Associates, Inc.
[55] Gitman, I., Lang, H., Zhang, P., & Xiao, L. (2019). Understanding the role of momentum in stochastic gradient methods. In Advances in neural information processing systems. Curran Associates, Inc.
[56] Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109), 23-26. · Zbl 0196.18002
[57] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. · Zbl 1373.68009
[58] Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., & Richtárik, P. (2019). SGD: General analysis and improved rates. In International Conference on Machine Learning. PMLR.
[59] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv: 1706.02677.
[60] Gu, J., & Chen, S. (2023). Statistical inference for decentralized federated learning. Working Paper.
[61] Gürbüzbalaban, M., Ozdaglar, A., & Parrilo, P. A. (2021). Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186, 49-84. · Zbl 1459.90199
[62] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
[63] Hector, E. C., & Song, P. X. (2020). Doubly distributed supervised learning and inference with high-dimensional correlated outcomes. Journal of Machine Learning Research, 21(1), 6983-7017. · Zbl 1536.68012
[64] Hector, E. C., & Song, P. X. (2021). A distributed and integrated method of moments for high-dimensional correlated data analysis. Journal of the American Statistical Association, 116(534), 805-818. · Zbl 1464.62437
[65] Hoffer, E., Nun, T. B., Hubara, I., Giladi, N., Hoefler, T., & Soudry, D. (2019). Augment your batch: Better training with larger batches. arXiv: 1901.09335.
[66] Hu, A., Jiao, Y., Liu, Y., Shi, Y., & Wu, Y. (2021). Distributed quantile regression for massive heterogeneous data. Neurocomputing, 448, 249-262.
[67] Huang, C., & Huo, X. (2019). A distributed one-step estimator. Mathematical Programming, 174(1-2), 41-76. · Zbl 1416.62151
[68] Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems. Curran Associates, Inc.
[69] Jordan, M. I., Lee, J. D., & Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526), 668-681. · Zbl 1420.62097
[70] Kiefer, J., & Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23, 462-466. · Zbl 0049.36601
[71] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv: 1412.6980.
[72] Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. I. (2014). A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4), 795-816. · Zbl 07555464
[73] Koenker, R. (2005). Quantile regression. Cambridge University Press. · Zbl 1111.62037
[74] Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33-50. · Zbl 0373.62038
[75] Korkmaz, S. (2020). Deep learning-based imbalanced data classification for drug discovery. Journal of Chemical Information and Modeling, 60(9), 4180-4190.
[76] Kostov, P., & Davidova, S. (2013). A quantile regression analysis of the effect of farmers’ attitudes and perceptions on market participation. Journal of Agricultural Economics, 64(1), 112-132.
[77] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems. Curran Associates, Inc.
[78] Lalitha, A., Shekhar, S., Javidi, T., & Koushanfar, F. (2018). Fully decentralized federated learning. In 3rd Workshop on Bayesian Deep Learning (NeurIPS). Curran Associates Inc.
[79] Lan, G. (2020). First-order and stochastic optimization methods for machine learning. Springer. · Zbl 1442.68003
[80] Lee, C., Lim, C. H., & Wright, S. J. (2018). A distributed quasi-Newton algorithm for empirical risk minimization with nonsmooth regularization. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery.
[81] Lee, J. D., Liu, Q., Sun, Y., & Taylor, J. E. (2017). Communication-efficient sparse regression. Journal of Machine Learning Research, 18(1), 115-144. · Zbl 1434.62157
[82] Li, K. H. (1994). Reservoir-sampling algorithms of time complexity \(O(n(1 + \log (N/n))) \). ACM Transactions on Mathematical Software, 20(4), 481-493. · Zbl 0889.65147
[83] Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499), 1129-1139. · Zbl 1443.62184
[84] Li, X., Li, R., Xia, Z., & Xu, C. (2020). Distributed feature screening via componentwise debiasing. Journal of Machine Learning Research, 21(24), 1-32. · Zbl 1498.68286
[85] Li, X., Liang, J., Chang, X., & Zhang, Z. (2022). Statistical estimation and online inference via local SGD. In Conference on Learning Theory. PMLR.
[86] Li, X., Zhu, X., & Wang, H. (2023). Distributed logistic regression for massive data with rare events. arXiv: 2304.02269.
[87] Li, Y., Chen, C., Liu, N., Huang, H., Zheng, Z., & Yan, Q. (2021). A blockchain-based decentralized federated learning framework with committee consensus. IEEE Network, 35(1), 234-241.
[88] Lian, H., & Fan, Z. (2018). Divide-and-conquer for debiased \(\ell_1 \) -norm support vector machine in ultra-high dimensions. Journal of Machine Learning Research, 18(182), 1-26. · Zbl 1468.68158
[89] Lian, X., Zhang, C., Zhang, H., Hsieh, C. J., Zhang, W., & Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in neural information processing systems. Curran Associates, Inc.
[90] Lin, S., & Zhou, D. (2018). Distributed kernel-based gradient descent algorithms. Constructive Approximation, 47(2), 249-276. · Zbl 1390.68542
[91] Liu, W., Chen, L., & Wang, W. (2022). General decentralized federated learning for communication-computation tradeoff. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications Workshops. IEEE.
[92] Liu, W., Mao, X., & Zhang, X. (2022). Fast and robust sparsity learning over networks: A decentralized surrogate median regression approach. IEEE Transactions on Signal Processing, 70, 797-809.
[93] Liu, Y., Gao, Y., & Yin, W. (2020). An improved analysis of stochastic gradient descent with momentum. In Advances in neural information processing systems. Curran Associates, Inc.
[94] Loizou, N., & Richtárik, P. (2020). Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77(3), 653-710. · Zbl 1466.90065
[95] Luo, L., & Song, P. X. (2020). Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1), 69-97. · Zbl 1440.62288
[96] Ma, J., & Yarats, D. (2018). Quasi-hyperbolic momentum and Adam for deep learning. arXiv: 1810.06801.
[97] Ma, P., Mahoney, M., & Yu, B. (2014). A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning. PMLR.
[98] Ma, X., Winslett, M., Lee, J., & Yu, S. (2003). Improving MPI-IO output performance with active buffering plus threads. In International Parallel and Distributed Processing Symposium. IEEE.
[99] Ma, Y., Leng, C., & Wang, H. (2024). Optimal subsampling bootstrap for massive data. Journal of Business and Economic Statistics, 42(1), 174-186.
[100] Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2), 123-224. · Zbl 1232.68173
[101] Mcdonald, R., Mohri, M., Silberman, N., Walker, D., & Mann, G. S. (2009). Efficient large-scale distributed training of conditional maximum entropy models. In Advances in neural information processing Systems. Curran Associates, Inc.
[102] Mishchenko, K., Khaled, A., & Richtárik, P. (2020). Random reshuffling: Simple analysis with vast improvements. In Advances in neural information processing systems. Curran Associates, Inc.
[103] Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., & Jordan, M. I. (2020). On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Conference on Learning Theory. PMLR.
[104] Moulines, E., & Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in neural information processing systems. Curran Associates, Inc.
[105] Mukkamala, M. C., & Hein, M. (2017). Variants of RMSProp and adagrad with logarithmic regret bounds. In International Conference on Machine Learning. PMLR.
[106] Nadiradze, G., Sabour, A., Davies, P., Li, S., & Alistarh, D. (2021). Asynchronous decentralized SGD with quantized and local updates. In Advances in neural information processing systems. Curran Associates, Inc.
[107] Nakamura, K., Derbel, B., Won, K. J., & Hong, B. W. (2021). Learning-rate annealing methods for deep neural networks. Electronics, 10(16), 2029.
[108] Nedic, A., Olshevsky, A., & Shi, W. (2017). Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4), 2597-2633. · Zbl 1387.90189
[109] Needell, D., & Ward, R. (2017). Batched stochastic gradient descent with weighted sampling. In Approximation theory XV: San Antonio 2016. Springer. · Zbl 1385.65041
[110] Nelder, J. A., & Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3), 370-384.
[111] Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate \(O(1/k^2) \) . In Doklady Akademii Nauk. Russian Academy of Sciences.
[112] Nitzberg, B., & Lo, V. (1997). Collective buffering: Improving parallel I/O performance. In IEEE International Symposium on High Performance Distributed Computing. IEEE.
[113] Ofeidis, I., Kiedanski, D., & Tassiulas, L. (2022). An overview of the data-loader landscape: Comparative performance analysis. arXiv: 2209.13705.
[114] Pan, R., Ren, T., Guo, B., Li, F., Li, G., & Wang, H. (2022). A note on distributed quantile regression by pilot sampling and one-step updating. Journal of Business and Economic Statistics, 40(4), 1691-1700.
[115] Pan, R., Zhu, Y., Guo, B., Zhu, X., & Wang, H. (2023). A sequential addressing subsampling method for massive data analysis under memory constraint. IEEE Transactions on Knowledge and Data Engineering, 35(9), 9502-9513.
[116] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., …Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems. Curran Associates, Inc.
[117] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17. · Zbl 0147.35301
[118] Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838-855. · Zbl 0762.62022
[119] Pumma, S., Si, M., Feng, W., & Balaji, P. (2017). Parallel I/O optimizations for scalable deep learning. In IEEE International Conference on Parallel and Distributed Systems. IEEE.
[120] Qi, H., Huang, D., Zhu, Y., Huang, D., & Wang, H. (2023). Mini-batch gradient descent with buffer. arXiv: 2312.08728.
[121] Qi, H., Wang, F., & Wang, H. (2023). Statistical analysis of fixed mini-batch gradient descent estimator. Journal of Computational and Graphical Statistics, 32(4), 1348-1360.
[122] Qu, G., & Li, N. (2019). Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 65(6), 2566-2581. · Zbl 07256369
[123] Reich, B. J., Fuentes, M., & Dunson, D. B. (2012). Bayesian spatial quantile regression. Journal of the American Statistical Association, 106(493), 6-20. · Zbl 1396.62263
[124] Richards, D., Rebeschini, P., & Rosasco, L. (2020). Decentralised learning with random features and distributed gradient descent. In International Conference on Machine Learning. PMLR.
[125] Richardson, A. M., & Lidbury, B. A. (2013). Infection status outcome, machine learning method and virus type interact to affect the optimised prediction of hepatitis virus immunoassay results from routine pathology laboratory assays in unbalanced data. BMC Bioinformatics, 14, 206.
[126] Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400-407. · Zbl 0054.05901
[127] Rosenblatt, J. D., & Nadler, B. (2016). On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4), 379-404. · Zbl 1426.68241
[128] Roux, N., Schmidt, M., & Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in neural information processing systems. Curran Associates, Inc.
[129] Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv: 1609.04747.
[130] Savazzi, S., Nicoli, M., & Rampa, V. (2020). Federated learning with cooperating devices: A consensus approach for massive IoT networks. IEEE Internet of Things Journal, 7(5), 4641-4654.
[131] Sengupta, S., Volgushev, S., & Shao, X. (2016). A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111(515), 1222-1232.
[132] Shamir, O., Srebro, N., & Zhang, T. (2014). Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning. PMLR.
[133] Shao, J. (2003). Mathematical statistics. Springer. · Zbl 1018.62001
[134] Shao, J., & Tu, D. (1995). The jackknife and bootstrap. Springer. · Zbl 0947.62501
[135] Shu, J., Zhu, Y., Zhao, Q., Meng, D., & Xu, Z. (2022). MLR-SNet: Transferable LR schedules for heterogeneous tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3505-3521.
[136] Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations. OpenReview.net.
[137] Soori, S., Mishchenko, K., Mokhtari, A., Dehnavi, M. M., & Gurbuzbalaban, M. (2020). DAve-QN: A distributed averaged quasi-Newton method with local superlinear convergence rate. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics. PMLR.
[138] Stich, S. U. (2019). Local SGD converges fast and communicates little. In 2019 International Conference on Learning Representations.
[139] Su, L., & Xu, J. (2019). Securing distributed gradient descent in high dimensional statistical learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1), 1-41.
[140] Sutskever, I. (2013). Training recurrent neural networks. PhD thesis, University of Toronto.
[141] Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning. PMLR.
[142] Suwandarathna, S., & Koggalage, R. (2007). Increasing hard drive performance – from a thermal perspective. In International Conference on Industrial and Information Systems.
[143] Tan, C., Ma, S., Dai, Y., & Qian, Y. (2016). Barzilai-Borwein step size for stochastic gradient descent. In Advances in neural information processing systems. Curran Associates, Inc.
[144] Tan, K. M., Battey, H., & Zhou, W. (2022). Communication-constrained distributed quantile regression with optimal statistical guarantees. Journal of Machine Learning Research, 23(1), 12456-12516.
[145] Tang, H., Lian, X., Yan, M., Zhang, C., & Liu, J. (2018). Decentralized training over decentralized data. In International Conference on Machine Learning. PMLR.
[146] Tang, K., Liu, W., & Zhang, Y. (2023). Acceleration of stochastic gradient descent with momentum by averaging: Finite-sample rates and asymptotic normality. arXiv: 2305.17665.
[147] Tang, L., Zhou, L., & Song, P. X. K. (2020). Distributed simultaneous inference in generalized linear models via confidence distribution. Journal of Multivariate Analysis, 176, Article 104567. · Zbl 1436.62357
[148] Toulis, P., & Airoldi, E. M. (2017). Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4), 1694-1727. · Zbl 1378.62046
[149] Tu, J., Liu, W., Mao, X., & Xu, M. (2023). Distributed semi-supervised sparse statistical inference. arXiv: 2306.10395.
[150] Van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge University Press. · Zbl 0943.62002
[151] Vanhaesebrouck, P., Bellet, A., & Tommasi, M. (2017). Decentralized collaborative learning of personalized models over networks. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. PMLR.
[152] Van Scoy, B., Freeman, R. A., & Lynch, K. M. (2017). The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1), 49-54.
[153] Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 37-57. · Zbl 0562.68028
[154] Volgushev, S., Chao, S., & Cheng, G. (2019). Distributed inference for quantile regression processes. The Annals of Statistics, 47(3), 1634-1662. · Zbl 1418.62174
[155] Wang, F., Huang, D., Gao, T., Wu, S., & Wang, H. (2022). Sequential one-step estimator by sub-sampling for customer churn analysis with massive data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 71(5), 1753-1786.
[156] Wang, F., Zhu, Y., Huang, D., Qi, H., & Wang, H. (2021). Distributed one-step upgraded estimation for non-uniformly and non-randomly distributed data. Computational Statistics and Data Analysis, 162, Article 107265. · Zbl 1543.62287
[157] Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512-1524. · Zbl 1205.62103
[158] Wang, H. (2019a). Divide-and-conquer information-based optimal subdata selection algorithm. Journal of Statistical Theory and Practice, 13(3), 46. · Zbl 1425.62087
[159] Wang, H. (2019b). More efficient estimation for logistic regression with optimal subsamples. Journal of Machine Learning Research, 20(132), 1-59. · Zbl 1441.62194
[160] Wang, H. (2020). Logistic regression for massive data with rare events. In International Conference on Machine Learning. PMLR.
[161] Wang, H., & Ma, Y. (2021). Optimal subsampling for quantile regression in big data. Biometrika, 108(1), 99-112. · Zbl 1462.62248
[162] Wang, H., Yang, M., & Stufken, J. (2019b). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114(525), 393-405. · Zbl 1478.62196
[163] Wang, H., Zhu, R., & Ma, P. (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113(522), 829-844. · Zbl 1398.62196
[164] Wang, J., Kolar, M., Srebro, N., & Zhang, T. (2017). Efficient distributed learning with sparsity. In International Conference on Machine Learning. PMLR.
[165] Wang, S., Roosta, F., Xu, P., & Mahoney, M. W. (2018). Giant: Globally improved approximate Newton method for distributed optimization. In Advances in neural information processing systems. Curran Associates, Inc.
[166] Wang, X., Yang, Z., Chen, X., & Liu, W. (2019a). Distributed inference for linear support vector machine. Journal of Machine Learning Research, 20(113), 1-41. · Zbl 1434.68468
[167] Woodworth, B., Patel, K. K., Stich, S., Dai, Z., Bullins, B., Mcmahan, B., Shamir, O., & Srebro, N. (2020). Is local SGD better than minibatch SGD? In International Conference on Machine Learning. PMLR.
[168] Wu, S., Huang, D., & Wang, H. (2023a). Network gradient descent algorithm for decentralized federated learning. Journal of Business and Economic Statistics, 41(3), 806-818.
[169] Wu, S., Huang, D., & Wang, H. (2023b). Quasi-Newton updating for large-scale distributed learning. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 85(4), 1326-1354.
[170] Wu, S., Zhu, X., & Wang, H. (2023c). Subsampling and jackknifing: A practically convenient solution for large data analysis with limited computational resources. Statistica Sinica, 33(3), 2041-2064.
[171] Xu, G., Sit, T., Wang, L., & Huang, C. Y. (2017). Estimation and inference of quantile regression for survival data under biased sampling. Journal of the American Statistical Association, 112(520), 1571-1586.
[172] Xu, Q., Cai, C., Jiang, C., Sun, F., & Huang, X. (2020). Block average quantile regression for massive dataset. Statistical Papers, 61, 141-165. · Zbl 1437.62157
[173] Yang, J., Meng, X., & Mahoney, M. (2013). Quantile regression for large-scale applications. In International Conference on Machine Learning. PMLR.
[174] Yao, Y., & Wang, H. (2019). Optimal subsampling for softmax regression. Statistical Papers, 60(2), 585-599. · Zbl 1421.62013
[175] Yu, J., Wang, H., Ai, M., & Zhang, H. (2022). Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 117(537), 265-276. · Zbl 1506.62235
[176] Yu, L., Balasubramanian, K., Volgushev, S., & Erdogdu, M. (2021). An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. In Advanced in neural information processing systems. Curran Associates, Inc.
[177] Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3), 1835-1854. · Zbl 1345.90068
[178] Zeiler, M. D. (2012). AdaDelta: An adaptive learning rate method. arXiv: 1212.5701.
[179] Zhang, J., & Ré, C. (2016). Parallel SGD: When does averaging help? In ICML Workshop on Optimization in Machine Learning.
[180] Zhang, Y., & Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning. PMLR.
[181] Zhang, Y., Wainwright, M. J., & Duchi, J. C. (2012). Communication-efficient algorithms for statistical optimization. In Advances in neural information processing systems. Curran Associates, Inc.
[182] Zhao, Y., Wong, Z. S. Y., & Tsui, K. L. (2018). A framework of rebalancing imbalanced healthcare data for rare events’ classification: A case of look-alike sound-alike mix-up incident detection. Journal of Healthcare Engineering, 2018, Article 6275435.
[183] Zhong, W., Wan, C., & Zhang, W. (2022). Estimation and inference for multi-kink quantile regression. Journal of Business and Economic Statistics, 40(3), 1123-1139.
[184] Zhou, L., She, X., & Song, P. X. (2023). Distributed empirical likelihood approach to integrating unbalanced datasets. Statistica Sinica, 33(3), 2209-2231.
[185] Zhu, M., Su, W., & Chipman, H. A. (2006). LAGO: A computationally efficient approach for statistical detection. Technometrics, 48(2), 193-205.
[186] Zhu, W., Chen, X., & Wu, W. B. (2023). Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association, 118(541), 393-404. · Zbl 07705999
[187] Zhu, X., Li, F., & Wang, H. (2021). Least-square approximation for a distributed system. Journal of Computational and Graphical Statistics, 30(4), 1004-1018. · Zbl 07499933
[188] Zhu, X., Pan, R., Wu, S., & Wang, H. (2022). Feature screening for massive data analysis by subsampling. Journal of Business and Economic Statistics, 40(4), 1892-1903.
[189] Zhu, Y., Huang, D., Gao, Y., Wu, R., Chen, Y., Zhang, B., & Wang, H. (2021). Automatic, dynamic, and nearly optimal learning rate specification via local quadratic approximation. Neural Networks, 141, 11-29. · Zbl 1521.68210
[190] Zhu, Y., Yu, W., Jiao, B., Mohror, K., Moody, A., & Chowdhury, F. (2019). Efficient user-level storage disaggregation for deep learning. In 2019 IEEE International Conference on Cluster Computing.
[191] Zhuang, J., Cai, J., Wang, R., Zhang, J., & Zheng, W. (2019). CARE: Class attention to regions of lesion for classification on imbalanced data. In International Conference on Medical Imaging with Deep Learning. PMLR.
[192] Zinkevich, M., Weimer, M., Li, L., & Smola, A. (2010). Parallelized stochastic gradient descent. In Advances in neural information processing systems. Curran Associates, Inc.
[193] Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4), 1509-1533. · Zbl 1142.62027