
A framework for parallel and distributed training of neural networks. (English) Zbl 1434.68523

Summary: The aim of this paper is to develop a general framework for training neural networks (NNs) in a distributed environment, where training data are partitioned over a set of agents that communicate with each other through a sparse, possibly time-varying, connectivity pattern. In such a distributed scenario, the training problem can be formulated as the (regularized) optimization of a non-convex social cost function, given by the sum of local (non-convex) costs, where each agent contributes a single error term defined with respect to its local dataset. To devise a flexible and efficient solution, we customize a recently proposed framework for non-convex optimization over networks, which hinges on a (primal) convexification-decomposition technique to handle non-convexity, and on a dynamic consensus procedure to diffuse information among the agents. Several typical choices for the training criterion (e.g., squared loss, cross-entropy) and regularization (e.g., \(\ell_2\) norm, sparsity-inducing penalties) are included in the framework and explored throughout the paper. Convergence to a stationary solution of the social non-convex problem is guaranteed under mild assumptions. Additionally, we show a principled way of allowing each agent to exploit a possible multi-core architecture (e.g., a local cloud) in order to parallelize its local optimization step, resulting in strategies that are both distributed (across the agents) and parallel (inside each agent) in nature. A comprehensive set of experimental results validates the proposed approach.
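To make the scheme described above concrete, the following is a minimal sketch (not the authors' implementation) of one round of such in-network training, for the problem \(\min_{\mathbf{w}} \sum_{i=1}^{N} f_i(\mathbf{w}) + \lambda r(\mathbf{w})\), where \(f_i\) is the local (non-convex) training error of agent \(i\) and \(r\) a regularizer. The sketch assumes a fixed connected network with a doubly stochastic mixing matrix `W`, a simple linearization-plus-proximal surrogate as the convexification step, an \(\ell_2\) regularizer, and a user-supplied `local_grad` routine; all names and parameters are illustrative.

```python
import numpy as np

def sca_consensus_round(w, y_track, grad_prev, W, data, local_grad,
                        tau=1.0, alpha=0.5, lam=1e-3):
    """One synchronous round over all N agents (illustrative sketch only).

    w         : (N, d) array, current weight estimate held by each agent
    y_track   : (N, d) array, dynamic-consensus trackers of the average gradient
    grad_prev : (N, d) array, local gradients computed at the previous round
    W         : (N, N) doubly stochastic mixing matrix of the network
    data      : list of N tuples (X_i, y_i) with the local datasets
    local_grad: callable (w_i, X_i, y_i) -> (d,) gradient of the local cost
    """
    N, d = w.shape
    grad_new = np.stack([local_grad(w[i], *data[i]) for i in range(N)])

    # Dynamic consensus step: each tracker mixes its neighbours' trackers
    # and adds the innovation of its own local gradient.
    y_next = W @ y_track + (grad_new - grad_prev)

    w_half = np.empty_like(w)
    for i in range(N):
        # Strongly convex surrogate around w[i]:
        #   y_next[i]^T (v - w[i]) + (tau/2)||v - w[i]||^2 + (lam/2)||v||^2,
        # whose minimizer has the closed form below.
        v = (tau * w[i] - y_next[i]) / (tau + lam)
        # Relaxed update toward the surrogate solution (step size alpha).
        w_half[i] = w[i] + alpha * (v - w[i])

    # Consensus (averaging) step on the weight estimates.
    w_next = W @ w_half
    return w_next, y_next, grad_new
```

The per-agent surrogate minimization inside the loop is the step that a multi-core agent could further parallelize across blocks of weights, in line with the distributed-and-parallel strategies discussed in the paper.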

MSC:

68T07 Artificial neural networks and deep learning
68W15 Distributed algorithms
90C26 Nonconvex programming, global optimization

References:

[2] Beck, A.; Teboulle, M., A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2, 1, 183-202 (2009) · Zbl 1175.94009
[3] Bengio, Y., Practical recommendations for gradient-based training of deep architectures, (Neural networks: Tricks of the trade (2012), Springer), 437-478
[4] Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; Bengio, Y., Theano: A CPU and GPU math compiler in Python, (Proceedings of the 9th Python in science conference (2010)), 1-7
[5] Bertin-Mahieux, T.; Ellis, D. P.; Whitman, B.; Lamere, P., The million song dataset, (12th international society for music information retrieval conference (2011)), 1-6
[6] Bianchi, P.; Jakubowicz, J., Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization, IEEE Transactions on Automatic Control, 58, 2, 391-405 (2013) · Zbl 1369.90131
[7] Bishop, C. M., Pattern recognition and machine learning (2006), Springer · Zbl 1107.68072
[8] Blackwell, W. J., Neural network Jacobian analysis for high-resolution profiling of the atmosphere, EURASIP Journal on Advances in Signal Processing, 2012, 1, 1 (2012)
[9] Boric-Lubeke, O.; Lubecke, V. M., Wireless house calls: using communications technology for health care and monitoring, IEEE Microwave Magazine, 3, 3, 43-48 (2002)
[10] Boyd, S.; Vandenberghe, L., Convex optimization (2004), Cambridge university press · Zbl 1058.90049
[11] Byrd, R. H.; Lu, P.; Nocedal, J.; Zhu, C., A limited memory algorithm for bound constrained optimization, SIAM Journal on Scientific Computing, 16, 5, 1190-1208 (1995) · Zbl 0836.65080
[12] Cevher, V.; Becker, S.; Schmidt, M., Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics, IEEE Signal Processing Magazine, 31, 5, 32-43 (2014)
[13] Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J., Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems, 47, 4, 547-553 (2009)
[14] Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M., Large scale distributed deep networks, (Advances in neural information processing systems (2012)), 1223-1231
[15] Demšar, J., Statistical Comparisons of classifiers over multiple data sets, Journal of Machine Learning Research (JMLR), 7, 1-30 (2006) · Zbl 1222.68184
[16] Di Lorenzo, P.; Sayed, A. H., Sparse distributed learning based on diffusion adaptation, IEEE Transactions on Signal Processing, 61, 6, 1419-1433 (2013) · Zbl 1393.94026
[17] Di Lorenzo, P.; Scardapane, S., Parallel and distributed training of neural networks via successive convex approximation, (2016 IEEE international workshop on machine learning for signal processing (2016), IEEE), 1-6
[18] Di Lorenzo, P.; Scutari, G., NEXT: In-network nonconvex optimization, IEEE Transactions on Signal and Information Processing over Networks, 2, 2, 120-136 (2016)
[19] Duchi, J.; Hazan, E.; Singer, Y., Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research (JMLR), 12, Jul, 2121-2159 (2011) · Zbl 1280.68164
[20] Facchinei, F.; Scutari, G.; Sagratella, S., Parallel selective algorithms for nonconvex big data optimization, IEEE Transactions on Signal Processing, 63, 7, 1874-1889 (2015) · Zbl 1394.94174
[21] Forero, P. A.; Cano, A.; Giannakis, G. B., Consensus-based distributed support vector machines, Journal of Machine Learning Research (JMLR), 11, May, 1663-1707 (2010) · Zbl 1242.68222
[22] Gao, W.; Chen, J.; Richard, C.; Huang, J., Diffusion adaptation over networks with kernel least-mean-square, (2015 IEEE 6th international workshop on computational advances in multi-sensor adaptive processing (2015), IEEE), 217-220
[23] Georgopoulos, L.; Hasler, M., Distributed machine learning in networks by consensus, Neurocomputing, 124, 2-12 (2014)
[24] Glorot, X.; Bengio, Y., Understanding the difficulty of training deep feedforward neural networks, (AISTATS. Vol. 9 (2010)), 249-256
[25] Glorot, X.; Bordes, A.; Bengio, Y., Deep sparse rectifier neural networks, (Proc. 14th International conference on artificial intelligence and statistics (2011)), 315-323
[26] Goodfellow, I. J.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y., Maxout networks, (Proc. 30th International conference on machine learning (2013)), 1319-1327
[27] Haykin, S., Neural networks and learning machines (2009), Pearson
[28] Ho, C.-H.; Lin, C.-J., Large-scale linear support vector regression, Journal of Machine Learning Research (JMLR), 13, Nov, 3323-3348 (2012) · Zbl 1433.68349
[29] Huang, S.; Li, C., Distributed extreme learning machine for nonlinear learning over network, Entropy, 17, 2, 818-840 (2015)
[30] Lazarevic, A.; Obradovic, Z., Boosting algorithms for parallel and distributed learning, Distributed and Parallel Databases, 11, 2, 203-229 (2002) · Zbl 1057.68742
[31] LeCun, Y.; Bengio, Y.; Hinton, G., Deep learning, Nature, 521, 7553, 436-444 (2015)
[32] Lopes, C. G.; Sayed, A. H., Diffusion least-mean squares over adaptive networks: Formulation and performance analysis, IEEE Transactions on Signal Processing, 56, 7, 3122-3136 (2008) · Zbl 1390.94283
[33] Lu, Y.; Roychowdhury, V.; Vandenberghe, L., Distributed parallel support vector machines in strongly connected networks, IEEE Transactions on Neural Networks, 19, 7, 1167-1178 (2008)
[34] Mateos, G.; Bazerque, J. A.; Giannakis, G. B., Distributed sparse linear regression, IEEE Transactions on Signal Processing, 58, 10, 5262-5276 (2010) · Zbl 1391.62133
[36] Modi, P. J.; Shen, W.-M.; Tambe, M.; Yokoo, M., Adopt: Asynchronous distributed constraint optimization with quality guarantees, Artificial Intelligence, 161, 1-2, 149-180 (2005) · Zbl 1132.68706
[37] Moody, J.; Hanson, S.; Krogh, A.; Hertz, J. A., A simple weight decay can improve generalization, Advances in Neural Information Processing Systems, 4, 950-957 (1995)
[38] Navia-Vázquez, A.; Gutierrez-Gonzalez, D.; Parrado-Hernández, E.; Navarro-Abellan, J., Distributed support vector machines, IEEE Transactions on Neural Networks, 17, 4, 1091-1097 (2006)
[39] Nocedal, J.; Wright, S., Numerical optimization (2006), Springer Science & Business Media · Zbl 1104.65059
[40] Ochs, P.; Dosovitskiy, A.; Brox, T.; Pock, T., On iteratively reweighted algorithms for nonsmooth nonconvex optimization in computer vision, SIAM Journal on Imaging Sciences, 8, 1, 331-372 (2015) · Zbl 1326.65078
[41] Perez-Cruz, F.; Kulkarni, S. R., Robust and low complexity distributed kernel least squares learning in sensor networks, IEEE Signal Processing Letters, 17, 4, 355-358 (2010)
[42] Pottie, G. J.; Kaiser, W. J., Wireless integrated network sensors, Communications of the ACM, 43, 5, 51-58 (2000)
[43] Predd, J.; Kulkarni, S.; Poor, H., Distributed learning in wireless sensor networks, IEEE Signal Processing Magazine, 23, 4, 56-69 (2006)
[44] Predd, J. B.; Kulkarni, S. R.; Poor, H. V., A collaborative training algorithm for distributed learning, IEEE Transactions on Information Theory, 55, 4, 1856-1871 (2009) · Zbl 1368.68285
[45] Quinlan, J. R., Combining instance-based and model-based learning, (Proceedings of the tenth international conference on machine learning (1993)), 236-243
[46] Rogers, A.; Farinelli, A.; Stranders, R.; Jennings, N. R., Bounded approximate decentralised coordination via the max-sum algorithm, Artificial Intelligence, 175, 2, 730-759 (2011) · Zbl 1216.68305
[47] Sak, H.; Vinyals, O.; Heigold, G.; Senior, A.; McDermott, E.; Monga, R.; Mao, M., Sequence discriminative distributed training of long short-term memory recurrent neural networks, (Interspeech 2014 (2014))
[48] Samet, S.; Miri, A., Privacy-preserving back-propagation and extreme learning machine algorithms, Data & Knowledge Engineering, 79, 40-61 (2012)
[49] Sayed, A. H., Adaptive networks, Proceedings of the IEEE, 102, 4, 460-497 (2014)
[50] Sayed, A. H., Adaptation, learning, and optimization over networks, Foundations and Trends in Machine Learning, 7, 4-5, 311-801 (2014) · Zbl 1315.68212
[51] Scardapane, S.; Comminiello, D.; Hussain, A.; Uncini, A., Group sparse regularization for deep neural networks, Neurocomputing, 241, 81-89 (2017)
[52] Scardapane, S.; Fierimonte, R.; Di Lorenzo, P.; Panella, M.; Uncini, A., Distributed semi-supervised support vector machines, Neural Networks, 80, 43-52 (2016) · Zbl 1414.68073
[53] Scardapane, S.; Wang, D.; Panella, M., A decentralized training algorithm for echo state networks in distributed big data applications, Neural Networks, 78, 65-74 (2016) · Zbl 1414.68074
[54] Scardapane, S.; Wang, D.; Panella, M.; Uncini, A., Distributed learning for random vector functional-link networks, Information Sciences, 301, 271-284 (2015) · Zbl 1360.68711
[55] Schmidhuber, J., Deep learning in neural networks: An overview, Neural Networks, 61, 85-117 (2015)
[56] Schmidt, M., Graphical model structure learning with \(\ell_1\)-regularization (2010), The University of British Columbia (Vancouver), (Ph.D. thesis)
[57] Sun, Y.; Scutari, G.; Palomar, D., Distributed nonconvex multiagent optimization over time-varying networks, (Proceedings of the 50th annual Asilomar conference on signals, systems, and computers (2016))
[58] Tibshirani, R., Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B. Statistical Methodology, 58, 1, 267-288 (1996) · Zbl 0850.62538
[59] Vieira-Marques, P. M.; Robles, S.; Cucurull, J.; Navarro, G., Secure integration of distributed medical data using mobile agents, IEEE Intelligent Systems, 6, 47-54 (2006)
[60] Xiao, L.; Boyd, S., Fast linear iterations for distributed averaging, Systems & Control Letters, 53, 1, 65-78 (2004) · Zbl 1157.90347
[61] Xiao, L.; Boyd, S.; Kim, S.-J., Distributed average consensus with least-mean-square deviation, Journal of Parallel and Distributed Computing, 67, 1, 33-46 (2007) · Zbl 1109.68019
[63] Zhang, Y.; Zhong, S., A privacy-preserving algorithm for distributed training of neural network ensembles, Neural Computing and Applications, 22, 1, 269-282 (2013)
[64] Zhu, M.; Martínez, S., Discrete-time dynamic average consensus, Automatica, 46, 2, 322-329 (2010) · Zbl 1205.93014