
A semi-parametric density estimation with application in clustering. (English) Zbl 07695285

Summary: The idea behind density-based clustering is to associate groups with the connected components of the level sets of the density of the data, which is estimated by a nonparametric method. This approach claims some advantages over both distance-based and model-based clustering. Some researchers developed this technique by proposing a graph theory-based method for identifying the local modes of the underlying density, estimated by the well-known kernel density estimation (KDE) with normal and \(t\) kernels. The present work proposes a semi-parametric KDE with a more flexible family of kernels, including the skew-normal (SN) and skew-\(t\) (ST). We show that the proposed estimator not only reduces boundary bias but is also closer to the actual density than the usual estimator employing the Gaussian kernel. Finding the optimal bandwidth for the one-dimensional and multidimensional cases under these asymmetric kernels is another main result of the paper; the resulting bandwidth is shrunk relative to the one obtained under the normal assumption. Finally, through a comprehensive numerical study, we illustrate the application of the proposed semi-parametric KDE to density-based clustering using simulated and real data sets.
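
As a rough illustration of a kernel density estimate with an asymmetric kernel, the sketch below evaluates a one-dimensional KDE with a skew-normal kernel. It is not the estimator of the paper: the function names `skew_normal_pdf` and `sn_kde`, the centring of the kernel at its mean, and the user-supplied bandwidth `h` and shape parameter `alpha` are all assumptions made for the example. In particular, no optimal-bandwidth rule is implemented.

```python
import math


def skew_normal_pdf(z, alpha):
    """Standard skew-normal density 2 * phi(z) * Phi(alpha * z)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 * phi * Phi


def sn_kde(x, data, h, alpha=0.0):
    """Kernel density estimate at x with a skew-normal kernel.

    Assumption (not taken from the paper): the kernel is shifted by the
    mean of the standard skew-normal so that it has mean zero.
    With alpha = 0 this reduces to the usual Gaussian KDE.
    """
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    mean = delta * math.sqrt(2.0 / math.pi)  # mean of the standard skew-normal
    total = sum(skew_normal_pdf((x - xi) / h + mean, alpha) for xi in data)
    return total / (len(data) * h)


# Hypothetical toy sample; h and alpha are chosen by hand for illustration.
sample = [0.1, 0.4, 0.5, 1.2, 2.0]
estimate_gaussian = sn_kde(0.7, sample, h=0.5, alpha=0.0)
estimate_skewed = sn_kde(0.7, sample, h=0.5, alpha=3.0)
```

Setting `alpha = 0` gives a Gaussian-kernel baseline against which a skewed kernel can be compared on the same sample and bandwidth.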


62H30 Classification and discrimination; cluster analysis (statistical aspects)


[1] Abadir, KM; Lawford, S., Optimal asymmetric kernels, Economics Letters, 83, 61-68 (2004) · Zbl 1255.62101 · doi:10.1016/j.econlet.2003.07.017
[2] Azzalini, A.; Arellano-Valle, RB, Maximum penalized likelihood estimation for skew-normal and skew-t distributions, Journal of Statistical Planning and Inference, 143, 419-433 (2013) · Zbl 1254.62020 · doi:10.1016/j.jspi.2012.06.022
[3] Azzalini, A.; Torelli, N., Clustering via nonparametric density estimation, Statistics and Computing, 17, 71-80 (2007) · doi:10.1007/s11222-006-9010-y
[4] Azzalini, A.; Menardi, G., Clustering via nonparametric density estimation: The R package pdfCluster, Journal of Statistical Software, 57, 1-26 (2014) · doi:10.18637/jss.v057.i11
[5] Azzalini, A., & Salehi, M. (2020). Some computational aspects of maximum likelihood estimation of the skew-t distribution. In A. Bekker, G. Chen, & J. Ferreira (Eds.) Computational and methodological statistics and biostatistics. Emerging topics in statistics and biostatistics. Springer, Cham. doi:10.1007/978-3-030-42196-0_1 · Zbl 07616796
[6] Bagnato, L.; Punzo, A.; Zoia, MG, The multivariate leptokurtic-normal distribution and its application in model-based clustering, Canadian Journal of Statistics, 45, 95-119 (2017) · Zbl 1462.62308 · doi:10.1002/cjs.11308
[7] Bowman, A.W., & Azzalini, A. (2018). R package ‘sm’: Nonparametric smoothing methods (version 2.2-5.6), http://www.stats.gla.ac.uk/adrian/sm.
[8] Bouveyron, C., Celeux, G., Murphy, T. B., & Raftery, A. E. (2019). Model-based clustering and classification for data science: With applications in R (vol. 50). Cambridge University Press. · Zbl 1436.62006
[9] Bouezmarni, T.; Scaillet, O., Consistency of asymmetric kernel density estimators and smoothed histograms with application to income data, Econometric Theory, 21, 390-412 (2005) · Zbl 1062.62058 · doi:10.1017/S0266466605050218
[10] Bowman, AW; Azzalini, A., Applied smoothing techniques for data analysis (1997), Oxford: Clarendon Press · Zbl 0889.62027
[11] Chacon, JE, A population background for nonparametric density-based clustering, Statistical Science, 30, 518-532 (2015) · Zbl 1426.62181 · doi:10.1214/15-STS526
[12] Chen, S., Beta kernel estimators for density functions, Computational Statistics & Data Analysis, 31, 131-145 (1999) · Zbl 0935.62042 · doi:10.1016/S0167-9473(99)00010-9
[13] Chen, SX, Probability density function estimation using gamma kernels, Annals of the Institute of Statistical Mathematics, 52, 471-480 (2000) · Zbl 0960.62038 · doi:10.1023/A:1004165218295
[14] Fernandez, M.; Monteiro, PK, Central limit theorem for asymmetric kernel functionals, Annals of the Institute of Statistical Mathematics, 57, 425-442 (2005) · Zbl 1095.62041 · doi:10.1007/BF02509233
[15] Forina, M., Armanino, C., Lanteri, S., & Tiscornia, E. (1983). Classification of olive oils from their fatty acid composition. In M. Martens & H.J. Russwurm (Eds.) Food research and data analysis, pp. 189-214. Appl. Sci, London.
[16] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, 97, 458, 611-631 (2002) · Zbl 1073.62545 · doi:10.1198/016214502760047131
[17] Hjort, NL; Glad, IK, Nonparametric density estimation with a parametric start, The Annals of Statistics, 23, 882-904 (1995) · Zbl 0838.62027 · doi:10.1214/aos/1176324627
[18] Hubert, L.; Arabie, P., Comparing partitions, Journal of Classification, 2, 193-218 (1985) · Zbl 0587.62128 · doi:10.1007/BF01908075
[19] Hubert, M.; Vandervieren, E., An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis, 52, 5186-5201 (2008) · Zbl 1452.62074 · doi:10.1016/j.csda.2007.11.008
[20] Ingrassia, S.; Punzo, A., Cluster validation for mixtures of regressions via the total sum of squares decomposition, Journal of Classification, 37, 526-547 (2020) · Zbl 07223614 · doi:10.1007/s00357-019-09326-4
[21] Kuruwita, CN; Kulasekera, KB; Padgett, WJ, Density estimation using asymmetric kernels and Bayes bandwidths with censored data, Journal of Statistical Planning and Inference, 140, 1765-1774 (2010) · Zbl 1184.62059 · doi:10.1016/j.jspi.2010.01.001
[22] Malsiner-Walli, G.; Frühwirth-Schnatter, S.; Grün, B., Identifying mixtures of mixtures using Bayesian estimation, Journal of Computational and Graphical Statistics, 26, 285-295 (2017)
[23] Marron, JS; Ruppert, D., Transformations to reduce boundary bias in kernel density estimation, Journal of the Royal Statistical Society: Series B (Methodological), 56, 653-671 (1994) · Zbl 0805.62046
[24] Mazza, A., & Punzo, A. (2011). Discrete beta kernel graduation of age-specific demographic indicators. In S. Ingrassia, R. Rocci, & M. Vichi (Eds.) New Perspectives in Statistical Modeling and Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization (pp. 127-134). Springer, Berlin.
[25] Mazza, A., & Punzo, A. (2013). Using the variation coefficient for adaptive discrete beta kernel graduation. In P. Giudici, S. Ingrassia, & M. Vichi (Eds.) Statistical models for data analysis, studies in classification, data analysis, and knowledge organization (pp. 225-232). Springer International Publishing, Switzerland.
[26] Mazza, A., & Punzo, A. (2013). Graduation by adaptive discrete beta kernels. In A. Giusti, G. Ritter, & M. Vichi (Eds.) Classification and data mining, studies in classification, data analysis, and knowledge organization (pp. 243-250). Springer, Berlin. · Zbl 1451.68022
[27] Mazza, A.; Punzo, A., DBKGrad: An R package for mortality rates graduation by fixed and adaptive discrete beta kernel techniques, Journal of Statistical Software, 57, 1-18 (2014) · doi:10.18637/jss.v057.c02
[28] Mazza, A.; Punzo, A., Bivariate discrete beta kernel graduation of mortality data, Lifetime Data Analysis, 21, 419-433 (2015) · Zbl 1322.62128 · doi:10.1007/s10985-014-9300-1
[29] McNicholas, P.D. (2016). Mixture model-based classification. CRC press.
[30] Menardi, G., Density based silhouette diagnostics for clustering methods, Statistics and Computing, 21, 295-308 (2011) · Zbl 1255.62179 · doi:10.1007/s11222-010-9169-0
[31] Menardi, G.; Azzalini, A., An advancement in clustering via nonparametric density estimation, Statistics and Computing, 24, 753-767 (2014) · Zbl 1322.62175 · doi:10.1007/s11222-013-9400-x
[32] Millard, S. (2019). Contributions to mixture regression modelling with applications in industry. PhD thesis, University of Pretoria.
[33] Moss, J., & Tveten, M. (2019). kdensity: An R package for kernel density estimation with parametric starts and asymmetric kernels. Journal of Open Source Software, 4(42), 1566.
[34] Lee, S.; McLachlan, GJ, Finite mixtures of multivariate skew t-distributions: Some recent and new results, Statistics and Computing, 24, 181-202 (2014) · Zbl 1325.62107 · doi:10.1007/s11222-012-9362-4
[35] Lin, TI; Lee, JC; Hsieh, WJ, Robust mixture modelling using the skew-t distribution, Statistics and Computing, 17, 81-92 (2007) · doi:10.1007/s11222-006-9005-8
[36] Loperfido, N., Finite mixtures, projection pursuit and tensor rank: A triangulation, Advances in Data Analysis and Classification, 13, 145-173 (2019) · Zbl 1466.62355 · doi:10.1007/s11634-018-0336-z
[37] Punzo, A. (2010). Discrete beta-type models. In H. Locarek-Junge & C. Weihs (Eds.) Classification as a tool for research, studies in classification, data analysis, and knowledge organization (pp. 253-261). Springer, Berlin Heidelberg.
[38] Rattihalli, RN; Patil, SB, Data dependent asymmetric kernels for estimating the density function, Sankhya A., 83, 155-186 (2021) · Zbl 1475.62139 · doi:10.1007/s13171-019-00171-6
[39] Salehi, M.; Azzalini, A., On application of the univariate Kotz distribution and some of its extensions, METRON, 76, 177-201 (2018) · Zbl 1404.62018 · doi:10.1007/s40300-018-0137-3
[40] Salehi, M.; Doostparast, M., Expressions for moments of order statistics and records from the skew-normal distribution in terms of multivariate normal orthant probabilities, Statistical Methods and Applications, 24, 547-568 (2015) · Zbl 1416.62273 · doi:10.1007/s10260-015-0306-y
[41] Saulo, H.; Leiva, V.; Ziegelmann, FA, A nonparametric method for estimating asymmetric densities based on skewed Birnbaum-Saunders distributions applied to environmental data, Stochastic Environmental Research and Risk Assessment, 27, 1479-1491 (2013) · doi:10.1007/s00477-012-0684-8
[42] Silverman, BW, Density estimation for statistics and data analysis (1986), London: Chapman and Hall · Zbl 0617.62042
[43] Tomarchio, SD; Punzo, A., Modelling the loss given default distribution via a family of zero-and-one inflated mixture models, Journal of the Royal Statistical Society: Series A, 182, 1247-1266 (2019) · doi:10.1111/rssa.12466
[44] Wand, MP; Jones, MC, Kernel smoothing (1995), London: Chapman & Hall · Zbl 0854.62043 · doi:10.1007/978-1-4899-4493-1