Greedy centroid initialization for federated K-means

291 Accesses
Explore all metrics

Abstract

We study learning from unlabeled data distributed across clients in a federated fashion where raw data do not leave the corresponding devices. We develop a K-means clustering algorithm within this federated setting where the local datasets are clustered at the clients, and a server generates the global clusters after aggregating the local ones. Given the importance of initialization on the federated $\underline{K}$-means algorithm (FKM), our objective is to find better initial centroids by utilizing the local data stored on each client. To this end, we start the centroid initialization at the clients, rather than at the server, since the server initially lacks any preliminary insight into the clients’ data. The clients first select their local initial clusters and subsequently share their clustering information (including cluster centroids and sizes) with the server. The server then employs a greedy algorithm to determine the global initial centroids based on the information received from the clients. We refer to this idea as G-FKM. Numerical results obtained from both synthetic and public datasets demonstrate that our proposed algorithm demonstrates accelerated convergence, exhibiting reduced within-cluster sum of squares ($\overline{\text{ WCSS }}$) and higher adjusted Rand Index compared to three distinct federated K-means variants. This improvement comes at a relatively low cost of sending limited additional information from the clients to the server, rather than conducting the initialization entirely at the server. Furthermore, we have also observed that the proposed algorithm performs better than the centralized algorithm for cases where the data distribution across the clients is highly skewed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Big data: an optimized approach for cluster initialization

Article Open access 20 July 2023

DISCERN: diversity-based selection of centroids for k-estimation and rapid non-stochastic clustering

Article 21 September 2020

An Efficient Approach for Selection of Initial Cluster Centroids for k-means

References

Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666
Article Google Scholar
Agarwal P, Alam MA, Biswas R (2011) Issues, challenges and tools of clustering algorithms. Int J Comput Sci Issues (IJCSI) 8(3):523
Google Scholar
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2(2):165–193
Article Google Scholar
Min E, Guo X, Liu Q, Zhang G, Cui J, Long J (2018) A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access 6:39501–39514
Article Google Scholar
Mothukuri V, Parizi RM, Pouriyeh S, Huang Y, Dehghantanha A, Srivastava G (2021) A survey on security and privacy of federated learning. Future Gener Comput Syst 115:619–640
Article Google Scholar
Zhang J, Chen B, Zhao Y, Cheng X, Hu F (2018) Data security and privacy-preserving in edge computing paradigm: survey and open issues. IEEE Access 6:18209–18237
Article Google Scholar
Liu J, Huang J, Zhou Y, Li X, Ji S, Xiong H, Dou D (2022) From distributed machine learning to federated learning: a survey. Knowl Inf Syst 64(4):885–917
Lloyd S (1982) Least squares quantization in pcm. IEEE Trans Inf Theory 28(2):129–137
Article MathSciNet Google Scholar
Fränti P, Sieranoja S (2019) How much can k-means be improved by using better initialization and repeats? Pattern Recognit 93:95–112
Article Google Scholar
Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210
Article Google Scholar
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River
Google Scholar
Kant K (2009) Data center evolution: a tutorial on state of the art, issues, and challenges. Comput Netw 53(17):2939–2965
Article Google Scholar
Triebe O, Rajagopal R (2021) Federated K-means clustering algorithm. https://github.com/ourownstory/federated_kmeans. Accessed 01 Dec 2021
Wang Y, Ma J, Gao N, Wen Q, Sun L, Guo H (2023) Federated fuzzy k-means for privacy-preserving behavior analysis in smart grids. Appl Energy 331:120396
Article Google Scholar
Li S, Hou S, Buyukates B, Avestimehr S (2022) Secure federated clustering. arXiv preprint arXiv:2205.15564
Brandão A, Mendes R, Vilela JP (2021) Efficient privacy preserving distributed k-means for non-iid data. In: International symposium on intelligent data analysis. Springer, pp 439–451
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Article Google Scholar
Chung J, Lee K, Ramchandran K (2022) Federated unsupervised clustering with generative models. In: AAAI 2022 International Workshop on Trustable, Verifiable and Auditable Federated Learning
Ghosh A, Chung J, Yin D, Ramchandran K (2020) An efficient framework for clustered federated learning. Adv Neural Inf Process Syst 33:19586–19597
Google Scholar
Dennis DK, Li T, Smith V (2021) Heterogeneity for the win: one-shot federated clustering. In: International conference on machine learning. PMLR, pp 2611–2620
Vassilvitskii S, Arthur D (2006) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035
Arthur D, Vassilvitskii S (2006) k-means++: the advantages of careful seeding. Technical report, Stanford
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Article Google Scholar
Wikipedia: Adjusted Rand Index. https://en.wikipedia.org/wiki/Rand_index. Accessed 01 Nov 2021 (2021)
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227
Article Google Scholar
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
MathSciNet Google Scholar
Van Rossum G, Drake FL (2009) Python 3 reference manual. CreateSpace, Scotts Valley
Google Scholar
Harris CR, Millman KJ, Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, Kerkwijk MH, Brett M, Haldane A, Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
Article Google Scholar
Meidan Y, Bohadana M, Mathov Y, Mirsky Y, Shabtai A, Breitenbacher D, Elovici Y (2018) N-baiot-network-based detection of iot botnet attacks using deep autoencoders. IEEE Pervasive Comput 17(3):12–22
Article Google Scholar
LeCun Y, Cortes C, Burges C (1994) The mnist database of handwritten digits. yann.lecun.com/exdb/mnis 1998
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

Download references

Author information

Kun Yang and Mohammad Mohammadi Amiri have contributed equally to this work.

Authors and Affiliations

Princeton University, 98 Charlton Street, Princeton, NJ, 08540, USA
Kun Yang & Sanjeev R. Kulkarni
Rensselaer Polytechnic Institute, 110 8th St, Troy, NY, 12180, USA
Mohammad Mohammadi Amiri

Authors

Kun Yang
View author publications
You can also search for this author in PubMed Google Scholar
Mohammad Mohammadi Amiri
View author publications
You can also search for this author in PubMed Google Scholar
Sanjeev R. Kulkarni
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors reviewed the manuscript.

Corresponding author

Correspondence to Kun Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Yang, K., Mohammadi Amiri, M. & Kulkarni, S.R. Greedy centroid initialization for federated K-means. Knowl Inf Syst 66, 3393–3425 (2024). https://doi.org/10.1007/s10115-024-02066-x

Download citation

Received: 17 October 2023
Revised: 01 January 2024
Accepted: 12 January 2024
Published: 24 February 2024
Issue Date: June 2024
DOI: https://doi.org/10.1007/s10115-024-02066-x

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Big data: an optimized approach for cluster initialization

DISCERN: diversity-based selection of centroids for k-estimation and rapid non-stochastic clustering

An Efficient Approach for Selection of Initial Cluster Centroids for k-means

References

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Greedy centroid initialization for federated K-means

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Big data: an optimized approach for cluster initialization

DISCERN: diversity-based selection of centroids for k-estimation and rapid non-stochastic clustering

An Efficient Approach for Selection of Initial Cluster Centroids for k-means

References

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation