
Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets

  • Regular Article
  • Published in: Advances in Data Analysis and Classification

Abstract

Clustering mixed data presents numerous challenges inherent to the highly heterogeneous nature of the variables. Despite this heterogeneity, a clustering algorithm should be able to extract discriminative information from the variables in order to form groups. In this work we introduce a model-based clustering method with a multilayer architecture, called the Mixed Deep Gaussian Mixture Model, which can be viewed as an automatic way to merge the clusterings performed separately on the continuous and non-continuous data. The architecture is flexible and can be adapted to mixed as well as to purely continuous or non-continuous data; in this sense, it generalizes Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data-driven method that selects the best model specification and the optimal number of clusters for a given dataset. In addition, our model provides continuous low-dimensional representations of the data, which can be a useful tool for visualizing mixed datasets. Finally, we validate the performance of our approach by comparing its results with state-of-the-art mixed data clustering models on several commonly used datasets.
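The abstract's central idea, merging clusterings performed separately on the continuous and non-continuous parts of a dataset, can be illustrated with a deliberately simplified sketch. This is not the authors' MDGMM: all function and variable names below are hypothetical, a plain k-means stands in for the Gaussian mixture layers, and the categorical "view" is partitioned naively by level pattern. The merge step clusters the concatenated one-hot memberships of the two views.

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's k-means on equal-length numeric sequences."""
    # Deterministic init: the first k distinct points (repeat the first if fewer).
    centers = []
    for p in points:
        if list(p) not in [list(c) for c in centers]:
            centers.append(list(p))
        if len(centers) == k:
            break
    while len(centers) < k:
        centers.append(list(points[0]))

    def nearest(p):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        for j, g in enumerate(groups):
            if g:  # leave empty clusters' centers untouched
                centers[j] = [sum(vals) / len(g) for vals in zip(*g)]
    return [nearest(p) for p in points]


def mixed_cluster(cont, cat, k):
    """Cluster the continuous and categorical views separately, then merge."""
    # View 1: k-means on the continuous features.
    cont_labels = kmeans(cont, k)
    # View 2: naive categorical partition, one group per distinct level pattern.
    patterns = {pat: i for i, pat in enumerate(sorted(set(cat)))}
    cat_labels = [patterns[c] for c in cat]
    # Merge: one-hot encode both label vectors and re-cluster the concatenation.
    merged = []
    for cl, ca in zip(cont_labels, cat_labels):
        row = [0.0] * (k + len(patterns))
        row[cl] = 1.0
        row[k + ca] = 1.0
        merged.append(row)
    return kmeans(merged, k)
```

For instance, four rows whose continuous values split into {0.0, 0.2} vs {5.0, 5.2} and whose categorical patterns split the same way are recovered as two clusters by the merged partition. The MDGMM replaces each of these three ad hoc steps with fitted latent Gaussian layers estimated jointly.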




Acknowledgements

The authors thank the reviewers for their helpful comments, which helped improve the manuscript. This work benefited from the support of the Research Chair DIALog under the aegis of the Risk Foundation, a joint initiative by CNP Assurances and ISFA, Université Claude Bernard Lyon 1 (UCBL). This work also benefited from funds of the LIA LYSM (agreement between AMU, CNRS, ECM and INdAM).

Author information

Corresponding author

Correspondence to Robin Fuchs.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 687 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Fuchs, R., Pommeret, D. & Viroli, C. Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets. Adv Data Anal Classif 16, 31–53 (2022). https://doi.org/10.1007/s11634-021-00466-3


