Cluster analysis of longitudinal profiles with subgroups. (English) Zbl 1393.62032

The authors propose a new approach to clustering longitudinal data based on nonparametric B-splines. Longitudinal profiles are grouped according to the pairwise distances between their B-spline coefficients, and the number of clusters need not be specified in advance. The method can be applied to unbalanced data sets, meaning that the longitudinal profiles of the subjects may consist of different numbers of points. The splines are fitted by penalized regression, and the pairwise distances between the spline coefficients are penalized to encourage subjects to fall into the same group. The authors establish the rate of convergence of these estimators and the recovery rate of the subgroups (depending on the distance between the groups). The results are illustrated in a simulation study and in a data example consisting of the weekly sales of different products over a total period of 11 years.
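The idea of grouping profiles by the distances between their spline coefficients can be illustrated with a minimal sketch. It is not the authors' estimator: the fusion penalty on pairwise coefficient differences is replaced here by a much cruder two-step stand-in (a separate penalized B-spline fit per subject, followed by linking subjects whose coefficient vectors lie closer than a threshold). All function names, the simulated data, the smoothing parameter `lam`, and the `threshold` value are assumptions made for illustration only.

```python
import numpy as np

def clamped_knots(n_interior, degree=3):
    """Knot vector on [0, 1] with repeated boundary knots (hypothetical helper)."""
    inner = np.linspace(0.0, 1.0, n_interior + 2)
    return np.concatenate([np.zeros(degree), inner, np.ones(degree)])

def bspline_basis(t, knots, degree=3):
    """Evaluate all B-spline basis functions at t via the Cox-de Boor recursion."""
    t = np.asarray(t, dtype=float)
    B = np.zeros((t.size, len(knots) - 1))
    for j in range(len(knots) - 1):
        B[:, j] = (knots[j] <= t) & (t < knots[j + 1])
    # Assign the right endpoint to the last interval of positive length.
    B[t == knots[-1], len(knots) - degree - 2] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((t.size, len(knots) - d - 1))
        for j in range(len(knots) - d - 1):
            left = knots[j + d] - knots[j]
            right = knots[j + d + 1] - knots[j + 1]
            if left > 0:
                B_new[:, j] += (t - knots[j]) / left * B[:, j]
            if right > 0:
                B_new[:, j] += (knots[j + d + 1] - t) / right * B[:, j + 1]
        B = B_new
    return B

def fit_coefficients(t, y, knots, degree=3, lam=0.1):
    """Penalized least squares with a second-order difference penalty."""
    B = bspline_basis(t, knots, degree)
    D = np.diff(np.eye(B.shape[1]), n=2, axis=0)
    return np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)

def group_by_distance(coefs, threshold):
    """Link subjects whose coefficient vectors are closer than threshold."""
    n = len(coefs)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(coefs[i] - coefs[j]) < threshold:
                parent[find(j)] = find(i)
    roots = [find(i) for i in range(n)]
    return np.unique(roots, return_inverse=True)[1]

# Simulated unbalanced data: two groups of five subjects, each subject
# observed at its own random time points (30 to 50 per subject).
rng = np.random.default_rng(0)
knots = clamped_knots(n_interior=3)
coefs = []
for subject in range(10):
    sign = 1.0 if subject < 5 else -1.0
    m = int(rng.integers(30, 51))
    t = np.sort(rng.uniform(0.0, 1.0, size=m))
    y = sign * np.sin(2 * np.pi * t) + rng.normal(0.0, 0.1, size=m)
    coefs.append(fit_coefficients(t, y, knots))

labels = group_by_distance(coefs, threshold=1.5)
print(labels)
```

Because every subject is projected onto the same basis, profiles with different numbers of observations become comparable through coefficient vectors of equal length. In this sketch the number of groups is determined by the threshold rather than fixed beforehand; in the reviewed paper that role is played by the penalty on the pairwise coefficient distances.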

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G08 Nonparametric regression and quantile regression
62P20 Applications of statistics to economics

Software:

AS 136

References:

[1] Abraham, C., Cornillon, P.-A., Matzner-Løber, E., and Molinari, N. (2003). Unsupervised curve clustering using B-splines., Scandinavian Journal of Statistics 30, 3, 581-595. · Zbl 1039.91067 · doi:10.1111/1467-9469.00350
[2] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers., Foundations and Trends in Machine Learning 3, 1, 1-122. · Zbl 1229.90122 · doi:10.1561/2200000016
[3] Bronnenberg, B. J., Kruger, M. W., and Mela, C. F. (2008). Database paper - the IRI marketing data set., Marketing Science 27, 4, 745-748.
[4] Burren, O. S., Rubio García, A., Javierre, B.-M., Rainbow, D. B., Cairns, J., Cooper, N. J., Lambourne, J. J., Schofield, E., Castro Dopico, X., Ferreira, R. C., Coulson, R., Burden, F., Rowlston, S. P., Downes, K., Wingett, S. W., Frontini, M., Ouwehand, W. H., Fraser, P., Spivakov, M., Todd, J. A., Wicker, L. S., Cutler, A. J., and Wallace, C. (2017). Chromosome contacts in activated T cells identify autoimmune disease candidate genes., Genome Biology 18, 1 (Sep), 165.
[5] Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering., Journal of Computational and Graphical Statistics 24, 4, 994-1013.
[6] Claeskens, G., Krivobokova, T., and Opsomer, J. D. (2009). Asymptotic properties of penalized spline estimators., Biometrika 96, 3, 529-544. · Zbl 1170.62031 · doi:10.1093/biomet/asp035
[7] Coffey, N., Hinde, J., and Holian, E. (2014). Clustering longitudinal profiles using P-splines and mixed effects models applied to time-course gene expression data., Computational Statistics & Data Analysis 71, 14-29. · Zbl 1471.62045
[8] De Boor, C. (2001)., A practical guide to splines (revised ed.). New York, Springer. · Zbl 0987.65015
[9] Eilers, P. H. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties., Statistical Science 11, 2, 89-102. · Zbl 0955.62562 · doi:10.1214/ss/1038425655
[10] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties., Journal of the American Statistical Association 96, 456, 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[11] Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation., Journal of the American Statistical Association 97, 458, 611-631. · Zbl 1073.62545 · doi:10.1198/016214502760047131
[12] Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm., Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1, 100-108. · Zbl 0447.62062
[13] Hastie, T., Tibshirani, R., and Walther, G. (2001). Estimating the number of data clusters via the gap statistic., Journal of the Royal Statistical Society. Series B 63, 411-423. · Zbl 0979.62046 · doi:10.1111/1467-9868.00293
[14] Hsu, Y.-H., Zillikens, M. C., Wilson, S. G., Farber, C. R., Demissie, S., Soranzo, N., Bianchi, E. N., Grundberg, E., Liang, L., Richards, J. B., and others. (2010). An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility loci for osteoporosis-related traits., PLoS Genetics 6, 6.
[15] Hubert, L. and Arabie, P. (1985). Comparing partitions., Journal of Classification 2, 1, 193-218. · Zbl 0587.62128
[16] Jaccard, P. (1912). The distribution of the flora in the alpine zone., New Phytologist 11, 2, 37-50.
[17] Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines., Bioinformatics 19, 4, 474-482.
[18] Ma, P., Castillo-Davis, C. I., Zhong, W., and Liu, J. S. (2006). A data-driven clustering method for time course gene expression data., Nucleic Acids Research 34, 4, 1261-1269.
[19] Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis., Journal of the American Statistical Association 112, 517, 410-423.
[20] Pan, W., Shen, X., and Liu, B. (2013). Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty., The Journal of Machine Learning Research 14, 1, 1865-1889. · Zbl 1317.68179
[21] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods., Journal of the American Statistical Association 66, 336, 846-850.
[22] Ruppert, D. (2002). Selecting the number of knots for penalized splines., Journal of Computational and Graphical Statistics 11, 4, 735-757.
[23] Shen, X., Pan, W., and Zhu, Y. (2012). Likelihood-based selection and sharp parameter estimation., Journal of the American Statistical Association 107, 497, 223-232. · Zbl 1261.62020 · doi:10.1080/01621459.2011.645783
[24] Wu, J., Zhu, J., Wang, L., and Wang, S. (2017). Genome-wide association study identifies NBS-LRR-encoding genes related with anthracnose and common bacterial blight in the common bean., Frontiers in Plant Science 8, 1398.
[25] Xu, R. and Wunsch, D. (2005). Survey of clustering algorithms., IEEE Transactions on Neural Networks 16, 3, 645-678.
[26] Xue, L., Qu, A., and Zhou, J. (2010). Consistent model selection for marginal generalized additive model for correlated data., Journal of the American Statistical Association 105, 492, 1518-1530. · Zbl 1388.62223 · doi:10.1198/jasa.2010.tm10128
[27] Xue, L. and Yang, L. (2006). Additive coefficient modeling via polynomial spline., Statistica Sinica 16, 4, 1423-1446. · Zbl 1109.62030
[28] Zeger, S. L. and Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous outcomes., Biometrics 42, 1, 121-130.
[29] Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty., The Annals of Statistics 38, 2, 894-942. · Zbl 1183.62120 · doi:10.1214/09-AOS729
[30] Zhou, S., Shen, X., and Wolfe, D. (1998). Local asymptotics for regression splines and confidence regions., The Annals of Statistics 26, 5, 1760-1782. · Zbl 0929.62052 · doi:10.1214/aos/1024691356