Clustering mixed-type player behavior data for churn prediction in mobile games. (English) Zbl 07700325

Summary: Marketers have long understood the importance of customer segmentation and customer churn prediction modelling. However, linking these processes remains a challenge. Customer segmentation is often performed by applying a clustering algorithm to customer behavioral data, which is itself a challenging task since datasets on customer behavior typically comprise mixed data types. This research focuses on clustering player behavior data for churn prediction modelling in the mobile games market and on constructing a dissimilarity measure capable of simultaneously handling categorical and quantitative data. The problem of finding an appropriate dissimilarity measure for mixed-type data with unbalanced categorical features and highly skewed numerical features is handled by establishing a hybrid dissimilarity measure constructed as a normalized linear combination of distances. Distances are calculated conditional on feature type, following the principles of Gower's coefficient: for numerical features, distances are calculated by applying a modified winsorized Huber loss, while for categorical features, we incorporate a distance measure based on variable entropy. In conjunction with the PAM clustering algorithm, the established dissimilarity measure is applied to real-world datasets and its performance is compared to several state-of-the-art clustering algorithms. In addition, this research investigates the potential of customer segmentation as an integral part of churn prediction modelling in online games, which is operationalized by applying the proposed clustering method to a real dataset comprising mixed-type data originating from a casual mobile game. The benefits of customer segmentation are supported by the data, since churn prediction models exhibit higher performance when clustering is performed prior to churn classification.
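
The summary does not give the formulas of the hybrid measure, so the following is only a minimal sketch of the general idea: a Gower-style normalized combination of per-feature distances. Everything specific here is an assumption made for illustration and not the authors' definition: the function names, the percentile-based winsorization, the standard (unmodified) Huber loss, and the use of normalized Shannon entropy as a weight on categorical mismatches.

```python
import numpy as np


def huber(r, delta=1.0):
    """Standard Huber loss of a residual r (not the paper's modified variant)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))


def entropy_weights(cat):
    """Normalized Shannon entropy of each categorical column (assumed weighting)."""
    weights = []
    for col in cat.T:
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        k = len(counts)
        h = -(p * np.log(p)).sum()
        # A constant column carries no information and gets weight 0.
        weights.append(h / np.log(k) if k > 1 else 0.0)
    return np.array(weights)


def mixed_dissimilarity(num, cat, delta=1.0, lo=5, hi=95):
    """Pairwise dissimilarity matrix in [0, 1] for mixed-type data.

    num: (n, p) float array of numerical features
    cat: (n, q) array of categorical features
    """
    # Winsorize each numerical column at assumed percentiles, then rescale to [0, 1].
    lo_v, hi_v = np.percentile(num, [lo, hi], axis=0)
    span = hi_v - lo_v
    span[span == 0] = 1.0
    z = (np.clip(num, lo_v, hi_v) - lo_v) / span

    # Numerical part: Huber loss of pairwise differences, scaled so the maximum is 1.
    diff = z[:, None, :] - z[None, :, :]
    d_num = huber(diff, delta) / huber(1.0, delta)

    # Categorical part: simple mismatch, weighted by the column's normalized entropy.
    w = entropy_weights(cat)
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float) * w

    # Gower-style normalization: weighted sum divided by the total weight.
    total = d_num.sum(axis=2) + d_cat.sum(axis=2)
    return total / (num.shape[1] + w.sum())
```

The resulting matrix is symmetric with a zero diagonal, so it could be passed to any k-medoids (PAM) implementation that accepts a precomputed dissimilarity matrix.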

MSC:

90Bxx Operations research and management science
Full Text: DOI

References:

[1] Ahmad, A.; Dey, L., A k-mean clustering algorithm for mixed numeric and categorical data, Data Knowl Eng (2007) · doi:10.1016/j.datak.2007.03.016
[2] Ahmad, A.; Khan, SS, Survey of state-of-the-art mixed data clustering algorithms, IEEE Access (2019) · doi:10.1109/ACCESS.2019.2903568
[3] Bauckhage, C.; Drachen, A.; Sifa, R., Clustering game behavior data, IEEE Trans Comput Intell AI Games (2015) · doi:10.1109/TCIAIG.2014.2376982
[4] Behzadi, S.; Müller, NS; Plant, C.; Böhm, C., Clustering of mixed-type data considering concept hierarchies: problem specification and algorithm, Int J Data Sci Analyt (2020) · doi:10.1007/s41060-020-00216-2
[5] Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: A comparative evaluation. Society for industrial and applied mathematics - 8th SIAM International conference on data mining 2008, proceedings in applied mathematics 130. doi:10.1137/1.9781611972788.22
[6] Budiaji, W.; Leisch, F., Simple k-medoids partitioning algorithm for mixed variable data, Algorithms (2019) · doi:10.3390/a12090177
[7] Castro, EG; Tsuzuki, MSG, Churn prediction in online games using players’ login records: a frequency analysis approach, IEEE Trans Comput Intell AI Games (2015) · doi:10.1109/TCIAIG.2015.2401979
[8] D’Urso, P.; Massari, R., Fuzzy clustering of mixed data, Inf Sci (2019) · Zbl 1456.62120 · doi:10.1016/j.ins.2019.07.100
[9] de Amorim, RC; Makarenkov, V., Applying subclustering and Lp distance in Weighted K-Means with distributed centroids, Neurocomputing (2016) · doi:10.1016/j.neucom.2015.08.018
[10] Dinh, DT; Huynh, VN; Sriboonchitta, S., Clustering mixed numerical and categorical data with missing values, Inf Sci, 571, 418-442 (2021) · Zbl 07769018 · doi:10.1016/J.INS.2021.04.076
[11] Dos Santos, TRL; Zárate, LE, Categorical data clustering: what similarity measure to recommend?, Expert Syst Appl (2015) · doi:10.1016/j.eswa.2014.09.012
[12] Drachen, A.; Connor, S., Game analytics for games user research, Games User Res (2018) · doi:10.1093/oso/9780198794844.003.0019
[13] Drachen, A.; Mirza-Babaei, P.; Nacke, LE, Frontlines in games user research, Games User Res (2018) · doi:10.1093/oso/9780198794844.003.0031
[14] Dua D, Graff C (2019) UCI Machine Learning repository: data sets. University of California, School of Information and Computer Science, Irvine, CA
[15] Foss, A.; Markatou, M.; Ray, B.; Heching, A., A semiparametric method for clustering mixed data, Machine Learn (2016) · Zbl 1432.62182 · doi:10.1007/s10994-016-5575-7
[16] Foss, AH; Markatou, M.; Ray, B., Distance metrics and clustering methods for mixed-type data, Int Stat Rev (2019) · Zbl 07763631 · doi:10.1111/insr.12274
[17] Fu, X.; Chen, X.; Shi, YT; Bose, I.; Cai, S., User segmentation for retention management in online social games, Decis Support Syst (2017) · doi:10.1016/j.dss.2017.05.015
[18] Gagné AR, El-Nasr MS, Shaw CD (2011) A deeper look at the use of telemetry for analysis of player behavior in RTS Games. Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). doi:10.1007/978-3-642-24500-8_26
[19] Gower, JC, A general coefficient of similarity and some of its properties, Biometrics (1971) · doi:10.2307/2528823
[20] Hadiji F, Sifa R, Drachen A, Thurau C, Kersting K, Bauckhage C (2014) Predicting player churn in the wild. IEEE Conference on computational intelligence and games, CIG. doi:10.1109/CIG.2014.6932876
[21] Hamka, F.; Bouwman, H.; De Reuver, M.; Kroesen, M., Mobile customer segmentation based on smartphone measurement, Telemat Inform (2014) · doi:10.1016/j.tele.2013.08.006
[22] Hashmi N, Butt NA, Iqbal M (2013) Customer churn prediction in telecommunication: a decade review and classification. IJCSI International journal of computer science issues
[23] Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd ed. doi:10.1007/978-0-387-84858-7 · Zbl 1273.62005
[24] Hsu, CC; Lin, SH; Tai, WS, Apply extended self-organizing map to cluster and classify mixed-type data, Neurocomputing, 74, 18, 3832-3842 (2011) · doi:10.1016/J.NEUCOM.2011.07.014
[25] Hu, YH; Huang, TCK; Kao, YH, Knowledge discovery of weighted RFM sequential patterns from customer sequence databases, J Syst Softw (2013) · doi:10.1016/j.jss.2012.11.016
[26] Huang Z (1997) Clustering large data sets with mixed numeric and categorical values. Proceedings of the 1st pacific-asia conference on knowledge discovery and data mining (PAKDD).
[27] Huber, PJ, Robust estimation of a location parameter, Ann Math Stat (1964) · Zbl 0136.39805 · doi:10.1214/aoms/1177703732
[28] Hubert, L.; Arabie, P., Comparing partitions, J Classif (1985) · Zbl 0587.62128 · doi:10.1007/BF01908075
[29] Jia, Z.; Song, L., Weighted k-prototypes clustering algorithm based on the hybrid dissimilarity coefficient, Math Probl Eng (2020) · Zbl 1459.62112 · doi:10.1155/2020/5143797
[30] Karnstedt, M.; Hennessy, T.; Chan, J.; Basuchowdhuri, P.; Hayes, C.; Strufe, T., Churn in social networks, Handb Social Netw Tech Appl (2010) · doi:10.1007/978-1-4419-7142-5_9
[31] Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley Series in Probability and Statistics · Zbl 1345.62009
[32] Kim, S.; Choi, D.; Lee, E.; Rhee, W., Churn prediction of mobile and online casual games using play log data, PLoS ONE (2017) · doi:10.1371/journal.pone.0180735
[33] Kumar, V.; Chhabra, JK; Kumar, D., Performance evaluation of distance metrics in the clustering algorithms, INFOCOMP J Comput Sci, 13, 1, 38-52 (2014)
[34] Loria, E.; Marconi, A., Exploiting limited players’ behavioral data to predict churn in gamification, Electron Commer Res Appl (2021) · doi:10.1016/j.elerap.2021.101057
[35] Maechler M, Struyf A, Hubert M, Hornik K, Studer M, Roudier P (2019) Cluster: cluster analysis basics and extensions. R package version 2.1.0
[36] Mori, U.; Mendiburu, A.; Lozano, JA, Similarity measure selection for clustering time series databases, IEEE Trans Knowl Data Eng (2016) · doi:10.1109/TKDE.2015.2462369
[37] Mutazinda, H.; Sowjanya, M.; Mrudula, O., Cluster ensemble approach for clustering mixed data, Int J Comput Tech, 2, 5, 43-51 (2015)
[38] Perišić, A.; Pahor, M., RFM-LIR feature framework for churn prediction in the mobile games market, IEEE Trans Games (2021) · doi:10.1109/TG.2021.3067114
[39] Perišić, A.; Šišak Jung, D.; Pahor, M., Churn in the mobile gaming field: establishing churn definitions and measuring classification similarities, Expert Syst Appl, 191 (2022) · doi:10.1016/j.eswa.2021.116277
[40] Perišić A, Pahor M (2020) Extended RFM logit model for churn prediction in the mobile gaming market. Croat Operat Res Rev. doi:10.17535/crorr.2020.0020
[41] Reddy, J.; Kavitha, B., Clustering the mixed numerical and categorical datasets using similarity weight and filter method, Int J Datab Theory Appl, 5, 1, 121-134 (2012)
[42] Rothmeier, K.; Pflanzl, N.; Hullmann, JA; Preuss, M., Prediction of player churn and disengagement based on user activity data of a freemium online strategy game, IEEE Trans Games (2021) · doi:10.1109/TG.2020.2992282
[43] Rousseeuw, PJ, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J Comput Appl Math (1987) · Zbl 0636.62059 · doi:10.1016/0377-0427(87)90125-7
[44] Runge J, Gao P, Garcin F, Faltings B (2014) Churn prediction for high-value players in casual social games. IEEE Conference on Computational Intelligence and Games, CIG. doi:10.1109/CIG.2014.6932875
[45] Sangam RS, Om H (2018) An equi-biased k-prototypes algorithm for clustering mixed-type data. Sadhana - Academy proceedings in engineering sciences. doi:10.1007/s12046-018-0823-0
[46] Schubert E, Rousseeuw PJ (2019) Faster k-Medoids Clustering: improving the PAM, CLARA, and CLARANS Algorithms. Lecture notes in computer science (Including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). doi:10.1007/978-3-030-32047-8_16
[47] Seal, A.; Karlekar, A.; Krejcar, O.; Gonzalo-Martin, C., Fuzzy c-means clustering using Jeffreys-divergence based similarity measure, Appl Soft Comput J (2020) · doi:10.1016/j.asoc.2019.106016
[48] Sharma, KK; Seal, A., Multi-view spectral clustering for uncertain objects, Inf Sci (2021) · Zbl 1479.62042 · doi:10.1016/j.ins.2020.08.080
[49] Sharma, KK; Seal, A., Outlier-robust multi-view clustering for uncertain data, Knowl-Based Syst (2021) · doi:10.1016/j.knosys.2020.106567
[50] Sifa, R.; Drachen, A.; Bauckhage, C., Profiling in games: understanding behavior from telemetry, Soci Interact Virt World (2018) · doi:10.1017/9781316422823.014
[51] Smith, WR, Product differentiation and market segmentation as alternative marketing strategies, J Mark (1956) · doi:10.2307/1247695
[52] Strobl, C.; Boulesteix, AL; Zeileis, A.; Hothorn, T., Bias in random forest variable importance measures: illustrations, sources and a solution, BMC Bioinfor (2007) · doi:10.1186/1471-2105-8-25
[53] Strobl, C.; Boulesteix, AL; Kneib, T.; Augustin, T.; Zeileis, A., Conditional variable importance for random forests, BMC Bioinfor (2008) · doi:10.1186/1471-2105-9-307
[54] Strobl, C.; Hothorn, T.; Zeileis, A., Party on!, R J (2009) · doi:10.32614/rj-2009-013
[55] Šulc Z, Procházka J, Matějka M (2016) Modifications of the Gower similarity coefficient. The 19th Conference of Applications of Mathematics and Statistics in Economics.
[56] Šulc Z, Matějka M, Procházka J, Řezanková H (2017) Evaluation of the Gower coefficient modifications in hierarchical clustering. Metodoloski Zvezki.
[57] Šulc, Z.; Řezanková, H., Comparison of similarity measures for categorical data in hierarchical clustering, J Classif (2019) · Zbl 1433.62169 · doi:10.1007/s00357-019-09317-5
[58] Ullah, I.; Raza, B.; Malik, AK; Imran, M.; Islam, SU; Kim, SW, A churn prediction model using random forest: analysis of machine learning techniques for churn prediction and factor identification in telecom sector, IEEE Access (2019) · doi:10.1109/ACCESS.2019.2914999
[59] van de Velden M, Iodice D’Enza A, Markos A (2019) Distance-based clustering of mixed data. Wiley Interdisciplinary Reviews: Computational Statistics. doi:10.1002/wics.1456 · Zbl 07909155
[60] Verbeke, W.; Martens, D.; Mues, C.; Baesens, B., Building comprehensible customer churn prediction models with advanced rule induction techniques, Expert Syst Appl (2011) · doi:10.1016/j.eswa.2010.08.023
[61] Zhang, P.; Wang, X.; Song, PXK, Clustering categorical data based on distance vectors, J Am Stat Assoc (2006) · Zbl 1118.62341 · doi:10.1198/016214505000000312
[62] Zhou, J.; Zhai, L.; Pantelous, AA, Market segmentation using high-dimensional sparse consumers data, Expert Syst Appl (2020) · doi:10.1016/j.eswa.2019.113136
[63] Zhu, X.; Li, Y.; Wang, J.; Zheng, T.; Fu, J., Automatic Recommendation of a Distance Measure for Clustering Algorithms, ACM Trans Knowl Discov Data (2021) · doi:10.1145/3418228