
Greedy centroid initialization for federated K-means

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

We study learning from unlabeled data distributed across clients in a federated fashion, where raw data do not leave the corresponding devices. We develop a K-means clustering algorithm within this federated setting, where the local datasets are clustered at the clients and a server generates the global clusters after aggregating the local ones. Given the importance of initialization for the federated K-means algorithm (FKM), our objective is to find better initial centroids by utilizing the local data stored on each client. To this end, we start the centroid initialization at the clients, rather than at the server, since the server initially lacks any preliminary insight into the clients’ data. The clients first select their local initial clusters and subsequently share their clustering information (including cluster centroids and sizes) with the server. The server then employs a greedy algorithm to determine the global initial centroids based on the information received from the clients. We refer to this approach as G-FKM. Numerical results on both synthetic and public datasets show that the proposed algorithm converges faster, achieving a lower within-cluster sum of squares (WCSS) and a higher adjusted Rand index than three distinct federated K-means variants. This improvement comes at the relatively low cost of sending limited additional information from the clients to the server, rather than conducting the initialization entirely at the server. Furthermore, we observe that the proposed algorithm outperforms the centralized algorithm when the data distribution across the clients is highly skewed.
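The abstract outlines the G-FKM flow but does not state the server's exact greedy rule. The sketch below is a minimal, hypothetical illustration of that flow in Python: each client clusters its own data and reports centroids and cluster sizes, and the server greedily selects k global initial centroids from the pooled client centroids. The function names (client_local_init, server_greedy_init) and the size-weighted, k-means++-style selection rule are assumptions for illustration, not the authors' exact procedure.

```python
# Hypothetical sketch of the client/server initialization flow described in the
# abstract. The size-weighted, k-means++-style greedy rule is an assumption;
# the paper's exact greedy criterion is not reproduced here.
import numpy as np
from sklearn.cluster import KMeans


def client_local_init(local_data, k, seed=0):
    """Client side: cluster the local data and report centroids and cluster sizes."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(local_data)
    sizes = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, sizes


def server_greedy_init(client_centroids, client_sizes, k):
    """Server side: greedily pick k global initial centroids from pooled client centroids.

    Assumed rule: start with the centroid of the largest reported cluster, then
    repeatedly add the centroid maximizing (cluster size) x (squared distance to
    the closest already-chosen centroid).
    """
    C = np.vstack(client_centroids)                   # all local centroids, stacked
    w = np.concatenate(client_sizes).astype(float)    # corresponding cluster sizes
    chosen = [int(np.argmax(w))]
    for _ in range(1, k):
        # squared distance from every candidate to its nearest chosen centroid
        d2 = np.min(((C[:, None, :] - C[chosen][None, :, :]) ** 2).sum(-1), axis=1)
        scores = w * d2
        scores[chosen] = -np.inf                      # never re-pick a chosen centroid
        chosen.append(int(np.argmax(scores)))
    return C[chosen]


# Example: two clients with their own data, k = 3 global clusters.
rng = np.random.default_rng(0)
local_results = [client_local_init(rng.normal(size=(200, 2)) + shift, k=3)
                 for shift in (0.0, 5.0)]
centroids, sizes = zip(*local_results)
initial_centroids = server_greedy_init(centroids, sizes, k=3)
print(initial_centroids)
```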



Author information

Contributions

All authors reviewed the manuscript.

Corresponding author

Correspondence to Kun Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Yang, K., Mohammadi Amiri, M. & Kulkarni, S.R. Greedy centroid initialization for federated K-means. Knowl Inf Syst 66, 3393–3425 (2024). https://doi.org/10.1007/s10115-024-02066-x


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-024-02066-x

Keywords
