Abstract
In an optimization problem, a coreset can be defined as a subset of the input points, such that a good approximation to the optimization problem can be obtained by solving it directly on the coreset, instead of using the whole original input. In machine learning, coresets are exploited for applications ranging from speeding up training time, to helping humans understand the fundamental properties of a class, by considering only a few meaningful samples. The problem of discovering coresets, starting from a dataset and an application, can be defined as identifying the minimal amount of samples that do not significantly lower performance with respect to the performance on the whole dataset. Specialized literature offers several approaches to finding coresets, but such algorithms often disregard the application, or explicitly ask the user for the desired number of points. Starting from the consideration that finding coresets is an intuitively multi-objective problem, as minimizing the number of points goes against maintaining the original performance, in this paper we propose a multi-objective evolutionary approach to identifying coresets for classification. The proposed approach is tested on classical machine learning classification benchmarks, using 6 state-of-the-art classifiers, comparing against 7 algorithms for coreset discovery. Results show that not only the proposed approach is able to find coresets representing different compromises between compactness and performance, but that different coresets are identified for different classifiers, reinforcing the assumption that coresets might be closely linked to the specific application.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
scikit-learn: Machine Learning in Python, http://scikit-learn.org/stable/.
- 2.
inspyred: Bio-inspired Algorithms in Python, https://pythonhosted.org/inspyred/.
References
Bachem, O., Lucic, M., Krause, A.: Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476 (2017)
Huggins, J.H., Campbell, T., Broderick, T.: Coresets for scalable bayesian logistic regression. In: 30th Annual Conference on Neural Information Processing Systems (NIPS) (2016). https://arxiv.org/pdf/1605.06423.pdf
Campbell, T., Broderick, T.: Bayesian coreset construction via greedy iterative geodesic ascent. In: International Conference on Machine Learning (ICML) (2018). https://arxiv.org/pdf/1802.01737.pdf
Clarkson, K.L.: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms 6(4), 63 (2010). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.9299&rep=rep1&type=pdf
Efroymson, M.A.: Multiple regression analysis. In: Ralston, A., Wilf, H.S. (eds.) Mathematical Methods for Digital Computers. Wiley, New York (1960)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–451 (2004). https://arxiv.org/pdf/math/0406456.pdf
Boutsidis, C., Drineas, P., Magdon-Ismail, M.: Near-optimal coresets for least-squares regression. Technical report (2013). https://arxiv.org/pdf/1202.3505.pdf
Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 42(12), 3397–3415 (1993)
Pati, Y., Rezaiifar, R., Krishnaprasad, P.: Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In: Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pp. 40–44 (1993). http://ieeexplore.ieee.org/document/342465/
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-ii. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3(3), 210–229 (1959)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
Cox, D.R.: The regression analysis of binary sequences. J. Roy. Stat. Soc.: Ser. B (Methodol.) 20(2), 215–242 (1958)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT press, Massachusetts (2016)
Tsang, I.W., Kwok, J.T., Cheung, P.M.: Core vector machines: fast SVM training on very large data sets. J. Mach. Learn. Res. 6, 363–392 (2005)
Campbell, T., Broderick, T.: Automated Scalable Bayesian Inference via Hilbert Coresets (2017). http://arxiv.org/abs/1710.05053
Breiman, L.: Pasting small votes for classification in large databases and on-line. Mach. Learn. 36(1–2), 85–103 (1999)
Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Tikhonov, A.N.: On the stability of inverse problems. Dokl. Akad. Nauk SSSR. 39, 195–198 (1943)
Hearst, M.A., Dumais, S.T., Osman, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intell. Syst. Appl. 13(4), 18–28 (1998)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Garrett, A.: inspyred (version 1.0.1) inspired intelligence (2012). https://github.com/aarongarrett/inspyred
Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics 7(2), 179–188 (1936)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Barbiero, P., Tonda, A. (2019). Fundamental Flowers: Evolutionary Discovery of Coresets for Classification. In: Kaufmann, P., Castillo, P. (eds) Applications of Evolutionary Computation. EvoApplications 2019. Lecture Notes in Computer Science(), vol 11454. Springer, Cham. https://doi.org/10.1007/978-3-030-16692-2_37
Download citation
DOI: https://doi.org/10.1007/978-3-030-16692-2_37
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16691-5
Online ISBN: 978-3-030-16692-2
eBook Packages: Computer ScienceComputer Science (R0)