
On classifier behavior in the presence of mislabeling noise. (English) Zbl 1411.68116

Summary: Machine learning algorithms perform differently in settings with varying levels of training set mislabeling noise, so choosing the right algorithm for a particular learning problem is crucial. The contribution of this paper addresses two dual problems: first, comparing algorithm behavior; and second, choosing learning algorithms for noisy settings. We present the “sigmoid rule” framework, which can be used to select the most appropriate learning algorithm depending on the properties of noise in a classification problem. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. We study the characteristics of this sigmoid function using five representative non-sequential classifiers, namely Naïve Bayes, kNN, SVM, a decision tree classifier, and a rule-based classifier, and three widely used sequential classifiers based on hidden Markov models, conditional random fields, and recurrent neural networks. Based on the sigmoid parameters we define a set of intuitive criteria for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that these parameters are connected to the characteristics of the underlying dataset, which allows estimating expected performance on a dataset regardless of the underlying algorithm. The framework applies to concept drift scenarios, including modeling user behavior over time and mining noisy, evolving time series.
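The sigmoid model underlying the framework can be made concrete with a small fitting example. As a minimal sketch (the paper's exact parameterization may differ), expected performance P can be modeled as P(s) = m + (M - m) / (1 + exp(-λ(s - c))), where s is the signal-to-noise ratio, m and M are the lower and upper performance asymptotes, λ controls the slope, and c is the transition center; parameters of this kind are what the comparison criteria are built on. The Python sketch below fits such a curve to hypothetical (SNR, accuracy) measurements with SciPy; the function name, parameterization, and data points are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: fit a sigmoid performance curve to (SNR, accuracy) pairs.
# The parameterization and the data are illustrative assumptions, not the
# paper's exact formulation.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_performance(s, m, M, lam, c):
    """Expected performance as a sigmoid of the signal-to-noise ratio s."""
    return m + (M - m) / (1.0 + np.exp(-lam * (s - c)))

# Hypothetical measurements: accuracy of one classifier at increasing SNR
# (i.e., decreasing mislabeling noise). Purely made-up numbers.
snr = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
acc = np.array([0.52, 0.58, 0.71, 0.84, 0.90, 0.92])

# Fit the four sigmoid parameters; bounds keep the asymptotes in [0, 1].
params, _ = curve_fit(
    sigmoid_performance, snr, acc,
    p0=[0.5, 0.9, 1.0, 1.0],
    bounds=([0.0, 0.0, 0.0, -np.inf], [1.0, 1.0, np.inf, np.inf]),
)
m, M, lam, c = params
print(f"m={m:.3f}  M={M:.3f}  lambda={lam:.3f}  c={c:.3f}")
```

Under such a model, comparing two classifiers reduces to comparing their fitted parameters, e.g. M - m (how much performance degrades as noise grows) or λ (how abruptly performance transitions between the two asymptotes).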

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI

References:

[1] Abdulrahman SM, Brazdil P, van Rijn JN, Vanschoren J (2015) Algorithm selection via meta-learning and sample-based active testing. In: Proceedings of the 2015 international workshop on meta-learning and algorithm selection (MetaSel) co-located with European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 55-66
[2] Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716-723 · Zbl 0314.62039 · doi:10.1109/TAC.1974.1100705
[3] Ali S, Smith K (2006) On learning algorithm selection for classification. Appl Soft Comput 6(2):119-138 · doi:10.1016/j.asoc.2004.12.002
[4] Bodén M (2002) A guide to recurrent neural networks and backpropagation. The Dallas project, SICS technical report (2), pp 1-10. http://130.102.79.1/~mikael/papers/rn_dallas
[5] Box GE, Jenkins GM, Reinsel GC, Ljung GM (2015) Time series analysis: forecasting and control. Wiley, New York · Zbl 1317.62001
[6] Bradley A (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145-1159 · doi:10.1016/S0031-3203(96)00142-2
[7] Brazdil P, Giraud-Carrier C, Soares C, Vilalta R (2009) Development of metalearning systems for algorithm recommendation. In: Metalearning: applications to data mining. Springer, Berlin, pp 31-59 · Zbl 1173.68625
[8] Brazdil PB, Soares C, Pinto Da Costa J (2003) Ranking learning algorithms: using IBL and meta-learning on accuracy and time results. Mach Learn 50(3):251-277 · Zbl 1033.68082 · doi:10.1023/A:1021713901879
[9] Camastra F, Vinciarelli A (2002) Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans Pattern Anal Mach Intell 24(10):1404-1407 · doi:10.1109/TPAMI.2002.1039212
[10] Chevaleyre Y, Zucker JD (2000) Noise-tolerant rule induction from multi-instance data. In: Proceedings of the workshop on attribute-value and relational learning: crossing the boundaries, co-located with international conference on machine learning (ICML), pp 47-52
[11] Cohen WW (1995) Fast effective rule induction. In: Proceedings of the twelfth international conference on machine learning (ICML), pp 115-123 · doi:10.1016/B978-1-55860-377-6.50023-2
[12] Corder GW, Foreman DI (2009) Nonparametric statistics for non-statisticians: a step-by-step approach. Wiley, Hoboken · Zbl 1167.62051 · doi:10.1002/9781118165881
[13] Cruz RM, Sabourin R, Cavalcanti GD, Ren TI (2015) META-DES: a dynamic ensemble selection framework using meta-learning. Pattern Recognit 48(5):1925-1935 · doi:10.1016/j.patcog.2014.12.003
[14] de Sousa E, Traina A, Traina Jr. C, Faloutsos C (2006) Evaluating the intrinsic dimension of evolving data streams. In: Proceedings of the 2006 ACM symposium on applied computing, pp 643-648
[15] Dupont P (2006) Noisy sequence classification with smoothed Markov chains. In: Proceedings of the 8th French conference on machine learning (CAP 2006), pp 187-201
[16] Elman J (1990) Finding structure in time. Cognit Sci 14(2):179-211 · doi:10.1207/s15516709cog1402_1
[17] Eom SB, Ketcherside MA, Lee HH, Rodgers ML, Starrett D (2004) The determinants of web-based instructional systems’ outcome and satisfaction: an empirical investigation. In: Instructional technologies: cognitive aspects of online programs, pp 96-139
[18] François JM (2013) Jahmm-hidden Markov model (HMM): an implementation in Java. https://code.google.com/p/jahmm/
[19] Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
[20] Garcia LPF, de Carvalho ACPLF, Lorena AC (2016) Noise detection in the meta-learning level. Neurocomputing 176:14-25 · doi:10.1016/j.neucom.2014.12.100
[21] Giannakopoulos G, Palpanas T (2010) The effect of history on modeling systems’ performance: the problem of the demanding lord. In: IEEE 10th international conference on data mining (ICDM). doi:10.1109/ICDM.2010.90
[22] Giannakopoulos G, Palpanas T (2013) Revisiting the effect of history on learning performance: the problem of the demanding lord. Knowl Inf Syst 36(3):653-691 · doi:10.1007/s10115-012-0568-8
[23] Giraud-Carrier C, Vilalta R, Brazdil P (2004) Introduction to the special issue on meta-learning. Mach Learn 54(3):187-193 · doi:10.1023/B:MACH.0000015878.60765.42
[24] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10-18 · doi:10.1145/1656274.1656278
[25] Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco · Zbl 1445.68004
[26] Haussler D (1990) Probably approximately correct learning. University of California, Santa Cruz, Computer Research Laboratory · Zbl 0858.68070
[27] Heywood MI (2015) Evolutionary model building under streaming data for classification tasks: opportunities and challenges. Genet Program Evolvable Mach 16(3):283-326 · doi:10.1007/s10710-014-9236-y
[28] Kalapanidas E, Avouris N, Craciun M, Neagu D (2003) Machine learning algorithms: a study on noise sensitivity. In: Proceedings of the 1st Balkan conference in informatics, pp 356-365
[29] Keerthi S, Shevade S, Bhattacharyya C, Murthy K (2001) Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput 13(3):637-649 · Zbl 1085.68629 · doi:10.1162/089976601300014493
[30] Klinkenberg R (2005) Meta-learning, model selection, and example selection in machine learning domains with concept drift. In: Lernen, Wissensentdeckung und Adaptivität (LWA) 2005, GI Workshops, Saarbrücken, October 10th-12th, pp 164-171
[31] Kuh A, Petsche T, Rivest RL (1990) Learning time-varying concepts. In: Conference on neural information processing systems (NIPS), pp 183-189
[32] Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York · Zbl 1306.62014 · doi:10.1007/978-1-4614-6849-3
[33] Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning (ICML), pp 282-289
[34] Li Q, Li T, Zhu S, Kambhamettu C (2002) Improving medical/biological data classification performance by wavelet preprocessing. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 657-660
[35] Marsaglia G, Tsang WW, Wang J (2003) Evaluating Kolmogorov’s distribution. J Stat Softw 8(18):1-4 · doi:10.18637/jss.v008.i18
[36] Massey FJ Jr (1951) The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68-78 · Zbl 0042.14403 · doi:10.1080/01621459.1951.10500769
[37] McCallum AK (2002) Mallet: a machine learning for language toolkit. http://mallet.cs.umass.edu
[38] Mirylenka K, Cormode G, Palpanas T, Srivastava D (2015) Conditional heavy hitters: detecting interesting correlations in data streams. VLDB J 24(3):395-414 · doi:10.1007/s00778-015-0382-5
[39] Mirylenka K, Giannakopoulos G, Palpanas T (2012) SRF: a framework for the study of classifier behavior under training set mislabeling noise. In: Advances in knowledge discovery and data mining (PAKDD 2012). Springer, Berlin, Heidelberg, pp 109-121 · doi:10.1007/978-3-642-30217-6_10
[40] Mirylenka K, Palpanas T, Cormode G, Srivastava D (2013) Finding interesting correlations with conditional heavy hitters. In: IEEE 29th international conference on data engineering (ICDE), pp 1069-1080
[41] Mantovani RG, Rossi ALD, Vanschoren J, Carvalho ACPLF (2015) Meta-learning recommendation of default hyper-parameter values for SVMs in classification tasks. In: Proceedings of the 2015 international workshop on meta-learning and algorithm selection (MetaSel), European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 80-92
[42] Nettleton DF, Orriols-Puig A, Fornells A (2010) A study of the effect of different types of noise on the precision of supervised learning techniques. Artif Intell Rev 33(4):275-306 · doi:10.1007/s10462-010-9156-z
[43] Pechenizkiy M (2015) Predictive analytics on evolving data streams: anticipating and adapting to changes in known and unknown contexts. In: IEEE international conference on high performance computing & simulation (HPCS), pp 658-659
[44] Pendrith M, Sammut C (1994) On reinforcement learning of control actions in noisy and non-Markovian domains. Technical report, School of Computer Science and Engineering, The University of New South Wales, Sydney
[45] Rabiner L, Juang B (1986) An introduction to hidden Markov models. IEEE ASSP Mag 3(1):4-16 · doi:10.1109/MASSP.1986.1165342
[46] Rossi ALD, Ponce de Leon Ferreira de Carvalho AC, Soares C, Feres de Souza B (2014) MetaStream: a meta-learning based method for periodic algorithm selection in time-changing data. Neurocomputing 127:52-64 · doi:10.1016/j.neucom.2013.05.048
[47] Smith MR, Mitchell L, Giraud-Carrier C, Martinez T (2014) Recommending learning algorithms and their associated hyperparameters. arXiv:1407.1890
[48] Sutton C, McCallum A (2010) An introduction to conditional random fields. arXiv:1011.4088 · Zbl 1253.68001
[49] Taylor R (1990) Interpretation of the correlation coefficient: a basic review. J Diagn Med Sonogr 6(1):35-39 · doi:10.1177/875647939000600106
[50] Teytaud O (2001) Learning with noise. Extension to regression. In: Proceedings of the IEEE international joint conference on neural networks (IJCNN’01) vol 3, pp 1787-1792
[51] Theodoridis S, Koutroumbas K (2003) Pattern recognition. Academic Press, San Diego · Zbl 0954.68131
[52] Valiant L (1984) A theory of the learnable. Commun ACM 27(11):1134-1142 · Zbl 0587.68077 · doi:10.1145/1968.1972
[53] Vapnik VN (1998) Statistical learning theory, vol 1. Wiley, New York · Zbl 0935.62007
[54] Waluyan L, Sasipan S, Noguera S, Asai T (2009) Analysis of potential problems in people management concerning information security in cross-cultural environment -in the case of Malaysia. In: Proceedings of the third international symposium on human aspects of information security & assurance (HAISA), pp 13-24
[55] Widmer G (1997) Tracking context changes through meta-learning. Mach Learn 27(3):259-286 · doi:10.1023/A:1007365809034
[56] Wolpert D (1996) The existence of a priori distinctions between learning algorithms. Neural Comput 8:1391-1421 · doi:10.1162/neco.1996.8.7.1391
[57] Wolpert DH (2002) The supervised learning no-free-lunch theorems. In: Soft computing and industry: recent applications. Springer, London, pp 25-42 · doi:10.1007/978-1-4471-0123-9_3
[58] Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8:1341-1390 · doi:10.1162/neco.1996.8.7.1341
[59] Lee JW, Giraud-Carrier C (2008) New insights into learning algorithms and datasets. In: IEEE seventh international conference on machine learning and applications (ICMLA’08), pp 135-140
[60] Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explor Newsl 12(1):40-48 · doi:10.1145/1882471.1882478