Abstract
Clustering mixed data presents numerous challenges inherent in the heterogeneous nature of the variables. Despite this heterogeneity, a clustering algorithm should be able to extract the discriminative information carried by the variables in order to form groups. In this work we introduce a multilayer model-based clustering method, the Mixed Deep Gaussian Mixture Model, which can be viewed as an automatic way to merge the clusterings performed separately on continuous and non-continuous data. This architecture is flexible and can be adapted to mixed as well as to purely continuous or purely non-continuous data. In this sense, we generalize Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data-driven method that selects the best specification of the model and the optimal number of clusters for a given dataset. In addition, our model provides continuous low-dimensional representations of the data, which are a useful tool for visualizing mixed datasets. Finally, we validate the performance of our approach by comparing its results with state-of-the-art mixed data clustering models on several commonly used datasets.
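To make the idea concrete, the following is a minimal, purely illustrative Python sketch of the intuition described above: embed continuous and non-continuous variables into low-dimensional latent scores with two separate "heads", merge the two representations, and cluster the merged representation with a Gaussian mixture. This is not the authors' MDGMM (which couples the heads through common deep latent layers and is fitted with an MCEM algorithm); the scikit-learn components, column names, and dimensions below are assumptions chosen for brevity.

```python
# Illustrative sketch only -- NOT the authors' MDGMM implementation.
# It mimics the high-level idea of the abstract: embed continuous and
# non-continuous variables into low-dimensional latent scores with two
# separate "heads", merge the representations, and cluster them with a
# Gaussian mixture. Column names, dimensions and estimators are assumptions.
import numpy as np
from sklearn.decomposition import FactorAnalysis, TruncatedSVD
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def cluster_mixed(df, cont_cols, cat_cols, n_clusters=3, latent_dim=2, seed=0):
    """Cluster a mixed pandas DataFrame via a merged low-dimensional embedding."""
    # Continuous head: Gaussian latent scores from a factor-analysis model.
    z_cont = FactorAnalysis(n_components=latent_dim, random_state=seed).fit_transform(
        StandardScaler().fit_transform(df[cont_cols])
    )
    # Non-continuous head: one-hot encode, then a crude low-rank embedding
    # (a simple stand-in for the GLLVM layer used in the paper).
    onehot = OneHotEncoder().fit_transform(df[cat_cols])  # sparse indicator matrix
    z_cat = TruncatedSVD(n_components=latent_dim, random_state=seed).fit_transform(onehot)
    # Merge the two heads and fit a Gaussian mixture on the common latent space.
    z = np.hstack([z_cont, z_cat])
    labels = GaussianMixture(n_components=n_clusters, random_state=seed).fit_predict(z)
    return labels, z  # cluster labels and a low-dimensional representation for plotting


# Hypothetical usage (the column names are made up):
# labels, z = cluster_mixed(df, ["age", "income"], ["colour", "region"], n_clusters=4)
```

The low-dimensional array `z` returned by this sketch plays the role of the continuous representation mentioned in the abstract: plotting its first two coordinates, coloured by cluster label, gives a quick visualization of a mixed dataset.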
Acknowledgements
The authors thank the reviewers for their helpful comments, which improved the manuscript. This work benefited from the support of the Research Chair DIALog under the aegis of the Risk Foundation, a joint initiative by CNP Assurances and ISFA, Université Claude Bernard Lyon 1 (UCBL). This work also benefited from funds of the LIA LYSM (agreement between AMU, CNRS, ECM and INdAM).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fuchs, R., Pommeret, D. & Viroli, C. Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets. Adv Data Anal Classif 16, 31–53 (2022). https://doi.org/10.1007/s11634-021-00466-3
Keywords
- Binary and count data
- Deep Gaussian Mixture Model
- Generalized Linear Latent Variable Model
- MCEM algorithm
- Ordinal and categorical data
- Two-heads architecture