Parallel-and-stream accelerator for computationally fast supervised learning. (English) Zbl 1543.62085

Summary: Two dominant distributed computing strategies have emerged to overcome the computational bottleneck of supervised learning with big data: parallel data processing in the MapReduce paradigm and serial data processing in the online streaming paradigm. Despite the two strategies’ common divide-and-combine approach, they differ in how they aggregate information, leading to different trade-offs between statistical and computational performance. The authors propose a new hybrid paradigm, termed the Parallel-and-Stream Accelerator (PASA), that combines the strengths of both strategies for computationally fast and statistically efficient supervised learning. PASA’s architecture nests online streaming processing within each parallelized data partition of a MapReduce framework. It thereby leverages the advantages and mitigates the disadvantages of both the MapReduce and online streaming approaches, yielding a more flexible paradigm that satisfies practical computing needs. The authors study the analytic properties and computational complexity of PASA, and detail its implementation for two key statistical learning tasks. PASA’s performance is illustrated through simulations and a large-scale data example building a prediction model for online purchases from advertising data.
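The nested architecture described above can be illustrated with a minimal sketch. This is not the authors' exact PASA estimator; it is an illustrative toy for linear regression in which data are split across parallel partitions (the "map" step), each partition is processed as a stream of mini-batches that incrementally update sufficient statistics, and the per-partition summaries are combined in a single "reduce" step. The partition count, batch size, and simulated model below are all assumptions chosen for the example.

```python
# Hypothetical sketch of the parallel-and-stream idea for least squares:
# each partition streams through its data in mini-batches, accumulating
# the information matrix X'X and score vector X'y; the reduce step sums
# these summaries across partitions and solves once.
import numpy as np

rng = np.random.default_rng(0)

# Simulated linear model y = X @ beta_true + noise (assumed for illustration)
n, p = 20_000, 5
beta_true = np.arange(1.0, p + 1.0)
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

def stream_partition(Xk, yk, batch_size=500):
    """Process one partition as a stream of mini-batches, updating
    sufficient statistics incrementally instead of in one pass."""
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    for start in range(0, len(yk), batch_size):
        xb = Xk[start:start + batch_size]
        yb = yk[start:start + batch_size]
        XtX += xb.T @ xb   # streaming update of the information matrix
        Xty += xb.T @ yb   # streaming update of the score vector
    return XtX, Xty

# "Map": split the data into K parallel partitions; in a real deployment
# each call would run on a separate worker.
K = 4
summaries = [stream_partition(Xk, yk)
             for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))]

# "Reduce": aggregate the partition summaries and solve for the estimate.
XtX_total = sum(s[0] for s in summaries)
Xty_total = sum(s[1] for s in summaries)
beta_hat = np.linalg.solve(XtX_total, Xty_total)
```

For least squares the sufficient statistics combine exactly, so the hybrid scheme recovers the full-data estimator; the paper's contribution concerns settings (e.g. general supervised learning tasks) where the aggregation step requires more careful statistical treatment.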

MSC:

62-08 Computational methods for problems pertaining to statistics

References:

[1] Efron, B., Bayes and likelihood calculations from confidence intervals, Biometrika, 80, 3-26 (1993) · Zbl 0773.62021
[2] Glass, G. V., Primary, secondary, and meta-analysis of research, Educ. Res., 5, 10, 3-8 (1976)
[3] Hansen, L. P., Large sample properties of generalized method of moments estimators, Econometrica, 50, 4, 1029-1054 (1982) · Zbl 0502.62098
[4] Hector, E. C.; Song, P. X.-K., Doubly distributed supervised learning and inference with high-dimensional correlated outcomes, J. Mach. Learn. Res., 21, 1-35 (2020) · Zbl 1536.68012
[5] Hector, E. C.; Song, P. X.-K., A distributed and integrated method of moments for high-dimensional correlated data analysis, J. Am. Stat. Assoc., 116, 534, 805-818 (2021) · Zbl 1464.62437
[6] Jordan, M. I., On statistics, computation and scalability, Bernoulli, 19, 4, 1378-1390 (2013) · Zbl 1273.62030
[7] Jørgensen, B., The Theory of Dispersion Models (1997), Chapman and Hall: Chapman and Hall London · Zbl 0928.62052
[8] Lemaréchal, C., Cauchy and the gradient method, Doc. Math. Extra, 251-254 (2012) · Zbl 1264.01011
[9] Li, K.; Yang, J., Score-matching representative approach for big data analysis with generalized linear models, Electron. J. Stat., 16, 1, 592-635 (2022) · Zbl 1493.62449
[10] Luo, L.; Song, P. X.-K., Renewable estimation and incremental inference in generalized linear models with streaming datasets, J. R. Stat. Soc. B, 82, 69-97 (2020) · Zbl 1440.62288
[11] Robbins, H.; Monro, S., A stochastic approximation method, Ann. Math. Stat., 22, 3, 400-407 (1951) · Zbl 0054.05901
[12] Sakrison, D. J., Efficient recursive estimation: application to estimating the parameter of a covariance function, Int. J. Eng. Sci., 3, 4, 461-483 (1965) · Zbl 0137.37202
[13] Singh, K.; Xie, M.; Strawderman, W. E., Combining information from independent sources through confidence distributions, Ann. Stat., 33, 1, 159-183 (2005) · Zbl 1064.62003
[14] Song, P. X.-K., Correlated Data Analysis: Modeling, Analytics, and Applications, Springer Series in Statistics (2007) · Zbl 1132.62002
[15] Tallis, M.; Yadav, P., Reacting to variations in product demand: an application for conversion rate (CR) prediction in sponsored search, arXiv preprint
[16] Tang, L.; Song, P. X.-K., Fused lasso approach in regression coefficients clustering – learning parameter heterogeneity in data integration, J. Mach. Learn. Res., 17, 1-23 (2016) · Zbl 1368.62209
[17] Toulis, P.; Airoldi, E. M., Scalable estimation strategies based on stochastic approximations: classical results and new insights, Stat. Comput., 25, 4, 781-795 (2015) · Zbl 1332.62291
[18] Wang, F.; Wang, L.; Song, P. X.-K., Quadratic inference function approach to merging longitudinal studies: validation and joint estimation, Biometrika, 99, 3, 755-762 (2012) · Zbl 1437.62647
[19] Wang, H.; Zhu, R.; Ma, P., Optimal subsampling for large sample logistic regression, J. Am. Stat. Assoc., 113, 522, 829-844 (2018) · Zbl 1398.62196
[20] Wang, H.; Yang, M.; Stufken, J., Information-based optimal subdata selection for big data linear regression, J. Am. Stat. Assoc., 114, 525, 393-405 (2019) · Zbl 1478.62196
[21] Xie, M.; Singh, K., Confidence distribution, the frequentist distribution estimator of a parameter: a review, Int. Stat. Rev., 81, 1, 3-39 (2013) · Zbl 1416.62170
[22] Xie, M.; Singh, K.; Strawderman, W. E., Confidence distributions and a unifying framework for meta-analysis, J. Am. Stat. Assoc., 106, 493, 320-333 (2011) · Zbl 1396.62051
[23] Zellner, A., An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias, J. Am. Stat. Assoc., 57, 298, 348-368 (1962) · Zbl 0113.34902