Abstract
The unrestrainable growth of data in many domains where machine learning could be applied has given rise to a new field, large-scale learning, which aims to develop algorithms that are efficient and scalable with respect to computation, memory, time, and communication requirements. A promising line of research in large-scale learning is distributed learning, which involves learning from data stored at different locations and, eventually, selecting and combining the “local” classifiers into a single global answer using one of three main approaches. This paper addresses a significant issue that arises when distributed data comes from several sources, each with a different distribution. The class-probability distribution of data (CPDD) is defined and its impact on the performance of the three combination approaches is analyzed. The results show the necessity of taking the CPDD into account and lead to the conclusion that combining only related knowledge is the most appropriate way to learn in a distributed setting.
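To make the setting concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of the scenario the abstract describes: data is partitioned across several nodes, each node may have a different class-probability distribution (CPDD), a "local" classifier is trained per node, and the local answers are combined into a single global prediction, here by simple majority voting as one common combination strategy. The toy data, the threshold classifier, and all function names are illustrative assumptions.

```python
import random
from collections import Counter

random.seed(0)

def make_node_data(n, p_positive):
    """Generate a toy 1-D two-class sample whose class prior is p_positive."""
    data = []
    for _ in range(n):
        label = 1 if random.random() < p_positive else 0
        x = random.gauss(1.0 if label == 1 else -1.0, 1.0)
        data.append((x, label))
    return data

def class_probability_distribution(data):
    """Empirical CPDD of a node: proportion of examples per class."""
    counts = Counter(label for _, label in data)
    total = len(data)
    return {c: counts[c] / total for c in sorted(counts)}

def train_threshold_classifier(data):
    """Trivial local learner: threshold at the midpoint of the class means."""
    means = {}
    for c in (0, 1):
        xs = [x for x, label in data if label == c]
        means[c] = sum(xs) / len(xs) if xs else 0.0
    threshold = (means[0] + means[1]) / 2.0
    return lambda x: 1 if x > threshold else 0

def majority_vote(classifiers, x):
    """Combine the local answers into one global prediction."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three nodes with deliberately different class priors, i.e. different CPDDs.
nodes = [make_node_data(200, p) for p in (0.5, 0.9, 0.1)]
for i, node in enumerate(nodes):
    print(f"node {i} CPDD: {class_probability_distribution(node)}")

local_classifiers = [train_threshold_classifier(node) for node in nodes]
for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f} -> global prediction {majority_vote(local_classifiers, x)}")
```

Comparing the printed per-node CPDDs with the combined predictions illustrates the question the paper studies: whether combining classifiers trained under very different class-probability distributions is preferable to combining only those trained on related data.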
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peteiro-Barral, D., Guijarro-Berdiñas, B., Pérez-Sánchez, B. (2011). On the Effectiveness of Distributed Learning on Different Class-Probability Distributions of Data. In: Lozano, J.A., Gámez, J.A., Moreno, J.A. (eds) Advances in Artificial Intelligence. CAEPIA 2011. Lecture Notes in Computer Science, vol. 7023. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25274-7_12
DOI: https://doi.org/10.1007/978-3-642-25274-7_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25273-0
Online ISBN: 978-3-642-25274-7