
Extreme value correction: a method for correcting optimistic estimations in rule learning. (English) Zbl 1480.62131

Summary: Machine learning algorithms rely on their ability to evaluate the constructed hypotheses, both for choosing the best hypothesis during learning and for assessing the quality of the model afterwards. Since these estimates, in particular the former, are computed on the same training data from which the hypotheses were constructed, they are usually optimistic. The paper presents three solutions: two for the artificial boundary cases with the smallest and the largest optimism, and a general correction procedure called extreme value correction (EVC), based on the extreme value distribution. We demonstrate the application of the technique to rule learning, specifically to estimating the classification accuracy of a single rule, and evaluate it on an artificial data set and on a number of UCI data sets, observing that the correction consistently improves the accuracy estimates. We also describe an approach for combining rules into a linear global classifier and show that using EVC estimates leads to more accurate classifiers.
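
The core idea behind EVC can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes that rule quality is scored by a likelihood-ratio statistic (LRS) that is approximately chi-squared with one degree of freedom under the null hypothesis, that the effective number of candidate rules examined during search is known, and that the maximum of the null scores is well modelled by a Gumbel (Fisher-Tippett) distribution. The function names, the Monte Carlo fitting of the Gumbel parameters, and the final back-mapping to the chi-squared scale are our assumptions for exposition, not the paper's exact procedure.

    # Illustrative sketch of extreme value correction (EVC); not the
    # authors' reference implementation.
    import numpy as np
    from scipy.stats import chi2, gumbel_r

    def fit_null_maximum(n_candidates, n_samples=2000, seed=0):
        # Monte Carlo: sample the maximum of `n_candidates` null LRS
        # values (each chi^2 with 1 degree of freedom) and fit a Gumbel
        # distribution to those maxima, as the Fisher-Tippett theorem
        # suggests for maxima of many draws.
        rng = np.random.default_rng(seed)
        maxima = rng.chisquare(1, size=(n_samples, n_candidates)).max(axis=1)
        loc, scale = gumbel_r.fit(maxima)
        return loc, scale

    def evc_corrected_lrs(observed_lrs, loc, scale):
        # Probability that a search over many null rules yields a
        # maximum at least this large...
        p = gumbel_r.sf(observed_lrs, loc=loc, scale=scale)
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        # ...mapped back to the chi^2(1) scale: the corrected statistic
        # is the single-evaluation LRS with the same tail probability.
        return chi2.isf(p, df=1)

    # Example: a raw LRS of 12.0 looks highly significant for a single
    # rule, but is down-weighted once we account for having searched
    # 500 candidate rules.
    loc, scale = fit_null_maximum(n_candidates=500)
    print(evc_corrected_lrs(12.0, loc, scale))

In the paper's setting, the corrected statistic would then be translated into a corrected classification accuracy for the rule; the sketch stops at the statistic itself.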

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G32 Statistics of extreme values; tail inference
68T10 Pattern recognition, speech recognition
