Abstract
We study learning from unlabeled data distributed across clients in a federated fashion where raw data do not leave the corresponding devices. We develop a K-means clustering algorithm within this federated setting where the local datasets are clustered at the clients, and a server generates the global clusters after aggregating the local ones. Given the importance of initialization on the federated \(\underline{K}\)-means algorithm (FKM), our objective is to find better initial centroids by utilizing the local data stored on each client. To this end, we start the centroid initialization at the clients, rather than at the server, since the server initially lacks any preliminary insight into the clients’ data. The clients first select their local initial clusters and subsequently share their clustering information (including cluster centroids and sizes) with the server. The server then employs a greedy algorithm to determine the global initial centroids based on the information received from the clients. We refer to this idea as G-FKM. Numerical results obtained from both synthetic and public datasets demonstrate that our proposed algorithm demonstrates accelerated convergence, exhibiting reduced within-cluster sum of squares (\(\overline{\text{ WCSS }}\)) and higher adjusted Rand Index compared to three distinct federated K-means variants. This improvement comes at a relatively low cost of sending limited additional information from the clients to the server, rather than conducting the initialization entirely at the server. Furthermore, we have also observed that the proposed algorithm performs better than the centralized algorithm for cases where the data distribution across the clients is highly skewed.
Similar content being viewed by others
References
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666
Agarwal P, Alam MA, Biswas R (2011) Issues, challenges and tools of clustering algorithms. Int J Comput Sci Issues (IJCSI) 8(3):523
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2(2):165–193
Min E, Guo X, Liu Q, Zhang G, Cui J, Long J (2018) A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access 6:39501–39514
Mothukuri V, Parizi RM, Pouriyeh S, Huang Y, Dehghantanha A, Srivastava G (2021) A survey on security and privacy of federated learning. Future Gener Comput Syst 115:619–640
Zhang J, Chen B, Zhao Y, Cheng X, Hu F (2018) Data security and privacy-preserving in edge computing paradigm: survey and open issues. IEEE Access 6:18209–18237
Liu J, Huang J, Zhou Y, Li X, Ji S, Xiong H, Dou D (2022) From distributed machine learning to federated learning: a survey. Knowl Inf Syst 64(4):885–917
Lloyd S (1982) Least squares quantization in pcm. IEEE Trans Inf Theory 28(2):129–137
Fränti P, Sieranoja S (2019) How much can k-means be improved by using better initialization and repeats? Pattern Recognit 93:95–112
Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River
Kant K (2009) Data center evolution: a tutorial on state of the art, issues, and challenges. Comput Netw 53(17):2939–2965
Triebe O, Rajagopal R (2021) Federated K-means clustering algorithm. https://github.com/ourownstory/federated_kmeans. Accessed 01 Dec 2021
Wang Y, Ma J, Gao N, Wen Q, Sun L, Guo H (2023) Federated fuzzy k-means for privacy-preserving behavior analysis in smart grids. Appl Energy 331:120396
Li S, Hou S, Buyukates B, Avestimehr S (2022) Secure federated clustering. arXiv preprint arXiv:2205.15564
Brandão A, Mendes R, Vilela JP (2021) Efficient privacy preserving distributed k-means for non-iid data. In: International symposium on intelligent data analysis. Springer, pp 439–451
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Chung J, Lee K, Ramchandran K (2022) Federated unsupervised clustering with generative models. In: AAAI 2022 International Workshop on Trustable, Verifiable and Auditable Federated Learning
Ghosh A, Chung J, Yin D, Ramchandran K (2020) An efficient framework for clustered federated learning. Adv Neural Inf Process Syst 33:19586–19597
Dennis DK, Li T, Smith V (2021) Heterogeneity for the win: one-shot federated clustering. In: International conference on machine learning. PMLR, pp 2611–2620
Vassilvitskii S, Arthur D (2006) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035
Arthur D, Vassilvitskii S (2006) k-means++: the advantages of careful seeding. Technical report, Stanford
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Wikipedia: Adjusted Rand Index. https://en.wikipedia.org/wiki/Rand_index. Accessed 01 Nov 2021 (2021)
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Van Rossum G, Drake FL (2009) Python 3 reference manual. CreateSpace, Scotts Valley
Harris CR, Millman KJ, Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, Kerkwijk MH, Brett M, Haldane A, Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
Meidan Y, Bohadana M, Mathov Y, Mirsky Y, Shabtai A, Breitenbacher D, Elovici Y (2018) N-baiot-network-based detection of iot botnet attacks using deep autoencoders. IEEE Pervasive Comput 17(3):12–22
LeCun Y, Cortes C, Burges C (1994) The mnist database of handwritten digits. yann.lecun.com/exdb/mnis 1998
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Author information
Authors and Affiliations
Contributions
All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, K., Mohammadi Amiri, M. & Kulkarni, S.R. Greedy centroid initialization for federated K-means. Knowl Inf Syst 66, 3393–3425 (2024). https://doi.org/10.1007/s10115-024-02066-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-024-02066-x