
Turning big data into tiny data: constant-size coresets for \(k\)-means, PCA, and projective clustering. (English) Zbl 1451.68244

Summary: We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space \(\mathbb{R}^d\) to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set is approximately the same as that for the original point set. For example, computing the first \(k\) principal components of the reduced set returns approximately the first \(k\) principal components of the original set, and computing the centers of a \(k\)-means clustering on the reduced set returns an approximation for the original set. Such a reduced set is also known as a coreset. The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension \(d\) of the input space and that the sets are mergeable [P. K. Agarwal et al., ACM Trans. Database Syst. 38, No. 4, Article No. 26, 28 p. (2013; Zbl 1321.68238)]. The latter property means that the union of two reduced sets is a reduced set for the union of the two original sets; it allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as \(k\)-means and subspace approximation, the coreset sizes are also independent of the number of input points. Our method is based on projecting the points onto a data-dependent low-dimensional subspace and reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques, including \(k\)-means clustering, principal component analysis, and subspace clustering. The main conceptual contribution is a new coreset definition that allows charging costs that appear for every solution to an additive constant.
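As a toy illustration of the mergeability property and the streaming scheme it enables, the following Python sketch maintains a bounded-size weighted summary over a stream of chunks. The names `reduce_set` and `merge` are hypothetical, and the reduction step here is a plain weighted subsample used only as a placeholder; the paper's actual construction projects onto a data-dependent low-dimensional subspace and carries approximation guarantees that this naive stand-in does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def reduce_set(points, weights, m):
    """Placeholder reduction: weighted uniform subsample of at most m points.
    (Not the paper's construction; illustration of the interface only.)"""
    n = len(points)
    if n <= m:
        return points, weights
    idx = rng.choice(n, size=m, replace=False)
    # Rescale so the total weight (the "mass" of the summary) is preserved.
    scale = weights.sum() / weights[idx].sum()
    return points[idx], weights[idx] * scale

def merge(a, b):
    """Mergeability: the union of two reduced sets summarizes the union
    of the two original sets, so summaries can simply be concatenated."""
    (pa, wa), (pb, wb) = a, b
    return np.vstack([pa, pb]), np.concatenate([wa, wb])

# Streaming via repeated merge-and-reduce over chunks of the input.
stream = [rng.normal(size=(1000, 5)) for _ in range(8)]
m = 200
summary = None
for chunk in stream:
    cs = reduce_set(chunk, np.ones(len(chunk)), m)
    summary = cs if summary is None else reduce_set(*merge(summary, cs), m)

print(summary[0].shape)  # (200, 5): summary size stays bounded by m
```

A full merge-and-reduce scheme would keep a logarithmic tower of buckets rather than repeatedly re-reducing a single summary; the flat loop above only keeps the example short.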

MSC:

68T09 Computational aspects of data analysis and big data
62H25 Factor analysis and principal components; correspondence analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62R07 Statistical aspects of big data and data science
68W40 Analysis of algorithms

Citations:

Zbl 1321.68238

References:

[1] P. K. Agarwal, G. Cormode, Z. Huang, J. Phillips, Z. Wei, and K. Yi, Mergeable summaries, in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, New York, ACM, 2012, pp. 23-34. · Zbl 1321.68238
[2] P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop, The hardness of approximation of Euclidean k-means, in Proceedings of the 31st SoCG, 2015, pp. 754-767. · Zbl 1378.68048
[3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn., 75 (2009), pp. 245-248. · Zbl 1378.68047
[4] A. Aggarwal, A. Deshpande, and R. Kannan, Adaptive sampling for \(k\)-means clustering, in Proceedings of the 12th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, 2009, pp. 15-28. · Zbl 1254.68351
[5] P. Agarwal, S. Har-Peled, and K. Varadarajan, Approximating extent measures of points, J. ACM, 51 (2004), pp. 606-635. · Zbl 1204.68240
[6] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan, Approximating extent measures of points, J. ACM, 51 (2004), pp. 606-635. · Zbl 1204.68240
[7] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler, StreamKM++: A clustering algorithm for data streams, ACM J. Exp. Algorithmics, 17 (2012). · Zbl 1284.68234
[8] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward, Better guarantees for k-means and Euclidean k-median by primal-dual algorithms, in Proceedings of the 58th FOCS, 2017, pp. 61-72.
[9] D. Arthur and S. Vassilvitskii, k-means++: The advantages of careful seeding, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, Philadelphia, 2007, pp. 1027-1035. · Zbl 1302.68273
[10] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, J. ACM, 36 (1989), pp. 929-965. · Zbl 0697.68079
[11] M. Beyer, Gartner Says Solving Big Data Challenge Involves More than Just Managing Volumes of Data, http://www.gartner.com/it/page.jsp?id=1731916 (2011).
[12] V. Braverman, D. Feldman, and H. Lang, New Frameworks for Offline and Streaming Coreset Constructions, preprint, arXiv:1612.00889, 2016.
[13] V. Braverman, G. Frahling, H. Lang, C. Sohler, and L. F. Yang, Clustering high dimensional dynamic data streams, in Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 2017, pp. 576-585.
[14] C. Boutsidis, M. W. Mahoney, and P. Drineas, Unsupervised feature selection for the \(k\)-means clustering problem, in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, 2009, pp. 153-161. · Zbl 1420.68235
[15] J. L. Bentley and J. B. Saxe, Decomposable searching problems I: Static-to-dynamic transformation, J. Algorithms, 1 (1980), pp. 301-358. · Zbl 0461.68065
[16] J. Batson, D. A. Spielman, and N. Srivastava, Twice-Ramanujan sparsifiers, SIAM J. Comput., 41 (2012), pp. 1704-1721. · Zbl 1260.05092
[17] C. Boutsidis, A. Zouzias, and P. Drineas, Random projections for \(k\)-means clustering, in Proceedings of the 24th Annual Conference on Neural Information Processing Systems, 2010, pp. 298-306.
[18] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, Randomized dimensionality reduction for k-means clustering, IEEE Trans. Inform. Theory, 61 (2015), pp. 1045-1062. · Zbl 1359.62232
[19] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, Dimensionality reduction for k-means clustering and low rank approximation, in Proceedings of the 47th Annual ACM on Symposium on Theory of Computing, 2015, pp. 163-172. · Zbl 1321.68398
[20] K. Chen, On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications, SIAM J. Comput., 39 (2009), pp. 923-947. · Zbl 1192.68880
[21] V. Cohen-Addad, P. N. Klein, and C. Mathieu, Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics, in Proceedings of the IEEE 57th Annual Symposium on Foundations of Computer Science, 2016, pp. 353-364. · Zbl 1421.68205
[22] M. B. Cohen, J. Nelson, and D. P. Woodruff, Optimal approximate matrix product in terms of stable rank, in Proceedings of the 43rd International Colloquium on Automata, Languages, and Programming, LIPIcs Leibniz Int. Proc. Inform. 55, I. Chatzigiannakis, M. Mitzenmacher, Y. Rabani, and D. Sangiorgi, eds., Schloss Dagstuhl, Leibniz-Zentrum fuer Informatik, 2016, pp. 11:1-11:14. · Zbl 1404.65032
[23] K. L. Clarkson and D. P. Woodruff, Numerical linear algebra in the streaming model, in Proceedings of the 41st STOC, 2009, pp. 205-214. · Zbl 1304.65138
[24] K. L. Clarkson and D. P. Woodruff, Low rank approximation and regression in input sparsity time, in Proceedings of the 45th STOC, 2013, pp. 81-90. · Zbl 1293.65069
[25] P. Drineas, A. M. Frieze, R. Kannan, S. Vempala, and V. Vinay, Clustering large graphs via the singular value decomposition, Mach. Learn., 56 (2004), pp. 9-33. · Zbl 1089.68090
[26] A. Deshpande and L. Rademacher, Efficient volume sampling for row/column subset selection, in Proceedings of the 51st FOCS, 2010, pp. 329-338.
[27] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, Matrix approximation and projective clustering via volume sampling, Theory Comput., 2 (2006), pp. 225-247. · Zbl 1213.68702
[28] A. Deshpande, M. Tulsiani, and N. K. Vishnoi, Algorithms and hardness for subspace approximation, in Proceedings of the 22nd ACM-SIAM Symposium on Discrete Algorithms, 2011, pp. 482-496. · Zbl 1374.68657
[29] A. Deshpande and S. Vempala, Adaptive sampling and fast low-rank matrix approximation, in Proceedings of the 10th RANDOM, 2006, pp. 292-303. · Zbl 1155.68575
[30] D. Eisenstat and D. Angluin, The VC dimension of k-fold union, Inform. Process. Lett., 101 (2007), pp. 181-184. · Zbl 1185.68373
[31] M. Edwards and K. R. Varadarajan, No coreset, no cry: II, in Proceedings of the 25th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, 2005, pp. 107-115. · Zbl 1172.68638
[32] D. Feldman, A. Fiat, and M. Sharir, Coresets for weighted facilities and their applications, in Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006, pp. 315-324.
[33] H. Fichtenberger, M. Gillé, M. Schmidt, C. Schwiegelshohn, and C. Sohler, BICO: BIRCH meets coresets for k-means clustering, in Proceedings of the 21st Annual European Symposium on Algorithms, 2013, pp. 481-492. · Zbl 1395.68360
[34] D. Feldman and M. Langberg, A unified framework for approximating and clustering data, in Proceedings of the 43rd ACM Symposium on the Theory of Computing, 2011. · Zbl 1288.90046
[35] D. Feldman and M. Langberg, A Unified Framework for Approximating and Clustering Data, http://arxiv.org/abs/1106.1379, 2011. · Zbl 1288.90046
[36] D. Feldman, M. Monemizadeh, and C. Sohler, A PTAS for k-means clustering based on weak coresets, in Proceedings of the 23rd ACM Symposium on Computational Geometry, 2007, pp. 11-18. · Zbl 1209.68639
[37] D. Feldman, M. Monemizadeh, C. Sohler, and D. P. Woodruff, Coresets and sketches for high dimensional subspace approximation problems, in Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, Philadelphia, 2010, pp. 630-649. · Zbl 1288.68225
[38] Z. Friggstad, M. Rezapour, and M. R. Salavatipour, Local search yields a PTAS for k-means in doubling metrics, in Proceedings of the 57th FOCS, 2016, pp. 365-374. · Zbl 1422.68296
[39] G. Frahling and C. Sohler, Coresets in dynamic geometric data streams, in Proceedings of the 37th ACM Symposium on the Theory of Computing, 2005, pp. 209-217. · Zbl 1192.68360
[40] D. Feldman and L. J. Schulman, Data reduction for weighted and outlier-resistant clustering, in Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, Philadelphia, 2012, pp. 1343-1354. · Zbl 1426.62184
[41] D. Feldman, M. Schmidt, and C. Sohler, Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering, in Proceedings of the 24th ACM-SIAM Symposium on Discrete Algorithms, 2013, pp. 1434-1453. · Zbl 1421.68219
[42] D. Feldman, M. Volkov, and D. Rus, Dimensionality reduction of massive sparse datasets using coresets, CoRR abs/1503.01663, 2015.
[43] D. Feldman, M. Volkov, and D. Rus, Dimensionality reduction of massive sparse datasets using coresets, in Proceedings of the 29th NIPS, 2016, pp. 2766-2774.
[44] G. H. Golub and W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, J. SIAM Ser. B Numer. Anal., 2 (1965), pp. 205-224. · Zbl 0194.18201
[45] P. Gritzmann, V. Klee, and D. G. Larman, Largest j-simplices in n-polytopes, Discrete Comput. Geom., 13 (1995), pp. 477-515. · Zbl 0826.52014
[46] M. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff, Frequent directions: Simple and deterministic matrix sketching, SIAM J. Comput., 45 (2016), pp. 1762-1792. · Zbl 1348.65075
[47] G. H. Golub and C. Reinsch, Singular value decomposition and least squares solutions, Numer. Math., 14 (1970), pp. 403-420. · Zbl 0181.17602
[48] S. Har-Peled, No coreset, no cry, in Proceedings of the 24th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, 2004, pp. 324-335. · Zbl 1117.68525
[49] S. Har-Peled, Coresets for discrete integration and clustering, in Proceedings of the 26th FSTTCS, 2006, pp. 33-44. · Zbl 1177.68238
[50] J. Hellerstein, Parallel Programming in the Age of Big Data, Gigaom Blog, November 9, 2008, https://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/.
[51] M. Hilbert and P. Lopez, The world’s technological capacity to store, communicate, and compute information, Science, 332 (2011), pp. 60-65.
[52] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), pp. 217-288. · Zbl 1269.65043
[53] S. Har-Peled and A. Kushal, Smaller coresets for k-median and k-means clustering, Discrete Comput. Geom., 37 (2007), pp. 3-19. · Zbl 1106.68112
[54] S. Har-Peled and S. Mazumdar, Coresets for \(k\)-means and \(k\)-median clustering and their applications, in Proceedings of the 36th ACM Symposium on the Theory of Computing, 2004, pp. 291-300. · Zbl 1192.68904
[55] S. Har-Peled and M. Sharir, Relative (p, \( \epsilon )\)-approximations in geometry, Discrete Comput. Geom., 45 (2011), pp. 462-496. · Zbl 1220.68106
[56] IBM, What Is Big Data? Bringing Big Data to the Enterprise, 2011, https://www.ibm.com/software/data/bigdata/.
[57] IceCube Neutrino Observatory, http://icecube.wisc.edu/.
[58] A. Kumar, Y. Sabharwal, and S. Sen, Linear-time approximation schemes for clustering problems in any dimensions, J. ACM, 57 (2010), 5:1-5:32. · Zbl 1327.68334
[59] Large Hadron Collider Beauty Experiment, https://www.lhcb-public.web.cern.ch/lhcb-public/.
[60] E. Liberty, Simple and deterministic matrix sketching, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013, pp. 581-588.
[61] Y. Li, P. M. Long, and A. Srinivasan, Improved bounds on the sample complexity of learning, J. Comput. System Sci., 62 (2001), pp. 516-527. · Zbl 0990.68081
[62] M. Langberg and L. J. Schulman, Universal epsilon-approximators for integrals, in Proceedings of the 21st ACM-SIAM Symposium on Discrete Algorithms, 2010, pp. 598-607. · Zbl 1288.68142
[63] E. Lee, M. Schmidt, and J. Wright, Improved and simplified inapproximability for k-means, Inform. Process. Lett., 120 (2017), pp. 40-43. · Zbl 1400.68250
[64] M. W. Mahoney, Randomized algorithms for matrices and data, Found. Trends Mach. Learn., 3 (2011), pp. 123-224. · Zbl 1232.68173
[65] M. Mahajan, P. Nimbhorkar, and K. R. Varadarajan, The planar \(k\)-means problem is NP-hard, in Proceedings of the 3rd Workshop on Algorithms and Computation, 2009, pp. 274-285. · Zbl 1211.68212
[66] N. Megiddo and A. Tamir, On the complexity of locating linear facilities in the plane, Oper. Res. Lett., 1 (1982), pp. 194-197. · Zbl 0507.90025
[67] S. Muthukrishnan, Data streams: Algorithms and applications, Found. Trends Theor. Comput. Sci., 1 (2005), pp. 117-236. · Zbl 1128.68025
[68] N. H. Nguyen, T. T. Do, and T. D. Tran, A fast and efficient algorithm for low-rank approximation of a matrix, in Proceedings of the 41st STOC, 2009, pp. 215-224. · Zbl 1304.65140
[69] K. Pearson, On lines and planes of closest fit to systems of points in space, London Edinburgh Dublin Philos. Mag. J. Sci., 2 (1901), 11, pp. 559-572. · JFM 32.0710.04
[70] J. M. Phillips and D. Feldman, private communication, 2015.
[71] A. Quarteroni, R. Sacco, and F. Saleri, Numerical Mathematics, Texts in Appl. Math. 37, Springer, New York, 2000. · Zbl 0957.65001
[72] T. Sarlós, Improved approximation algorithms for large matrices via random projections, in Proceedings of the 47th FOCS, 2006, pp. 143-152.
[73] M. Schmidt, Coresets and Streaming Algorithms for the k-Means Problem and Related Clustering Objectives, Ph.D. thesis, Universität Dortmund, 2014.
[74] T. Segaran and J. Hammerbacher, Beautiful Data: The Stories Behind Elegant Data Solutions, O’Reilly Media, Newton, MA, 2009.
[75] G. W. Stewart, On the early history of the singular value decomposition, SIAM Rev., 35 (1993), pp. 551-566. · Zbl 0799.01016
[76] N. D. Shyamalkumar and K. R. Varadarajan, Efficient subspace approximation algorithms, Discrete Comput. Geom., 47 (2012), pp. 44-63. · Zbl 1232.68167
[77] M. D. Vose, A linear algorithm for generating random numbers with a given distribution, IEEE Trans. Software Engrg., 17 (1991), pp. 972-975.
[78] K. Varadarajan and X. Xiao, A near-linear algorithm for projective clustering integer points, in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2012. · Zbl 1422.68259
[79] K. Varadarajan and X. Xiao, On the sensitivity of shape fitting problems, in Proceedings of the 32nd IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, 2012, pp. 486-497. · Zbl 1359.62266
[80] H. E. Warren, Lower bounds for approximation by nonlinear manifolds, Trans. Amer. Math. Soc., 133 (1968), pp. 167-178. · Zbl 0174.35403
[81] T. White, Hadoop: The Definitive Guide, O’Reilly Media, Newton, MA, 2012.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.