Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. (English) Zbl 1101.68759

Summary: Bayesian Belief Nets (BNs) are often used for classification tasks, typically to return the most likely class label for each specified instance. Many BN-learners, however, attempt to find the BN that maximizes a different objective function (likelihood rather than classification accuracy), typically by first learning an appropriate graphical structure and then finding the parameters for that structure that maximize the likelihood of the data. As these parameters may not maximize the classification accuracy, "discriminative parameter learners" follow the alternative approach of seeking the parameters that maximize conditional likelihood over the distribution of instances the BN will have to classify. This paper first formally specifies this task, shows how it extends standard logistic regression, and analyzes its inherent sample and computational complexity. We then present a general algorithm for this task, ELR, that applies to arbitrary BN structures and works effectively even when given incomplete training data. Unfortunately, ELR is not guaranteed to find the parameters that optimize conditional likelihood; moreover, even the optimal-CL parameters need not have minimal classification error. This paper therefore presents empirical evidence that ELR produces effective classifiers, often superior to the ones produced by the standard "generative" algorithms, especially in the common situation where the given BN structure is incorrect.
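The objective described here is the conditional likelihood of the class labels given the evidence, CL(Θ) = Σ_i log P_Θ(c_i | e_i), rather than the joint likelihood Σ_i log P_Θ(c_i, e_i) that generative learners maximize. As a concrete illustration only (not the paper's ELR implementation), the following Python sketch performs this kind of discriminative parameter learning for a fixed naive Bayes structure with complete binary data; the unconstrained-logit parameterization of the CPTs and the off-the-shelf BFGS optimizer are assumptions made for brevity, whereas ELR applies to arbitrary BN structures, uses its own gradient scheme, and handles incomplete data.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import logsumexp

    def neg_cond_log_lik(params, X, y, n_classes):
        # Negative conditional log-likelihood -sum_i log P(y_i | x_i) for a naive
        # Bayes net over binary features; CPTs are parameterized by unconstrained
        # logits so every table stays a proper probability distribution.
        n, d = X.shape
        class_logits = params[:n_classes]
        feat_logits = params[n_classes:].reshape(n_classes, d)
        log_prior = class_logits - logsumexp(class_logits)       # log P(c)
        log_p1 = -np.logaddexp(0.0, -feat_logits)                 # log P(x_j = 1 | c)
        log_p0 = -np.logaddexp(0.0, feat_logits)                  # log P(x_j = 0 | c)
        log_joint = log_prior + X @ log_p1.T + (1 - X) @ log_p0.T # log P(c, x_i)
        log_cond = log_joint[np.arange(n), y] - logsumexp(log_joint, axis=1)
        return -log_cond.sum()

    def fit_discriminative_nb(X, y, n_classes):
        # Search for the CPT logits that maximize conditional likelihood;
        # BFGS with numerical gradients is used here purely for brevity.
        d = X.shape[1]
        x0 = np.zeros(n_classes + n_classes * d)
        return minimize(neg_cond_log_lik, x0, args=(X, y, n_classes)).x

    # Toy usage: two classes, three binary features, noisy labels.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 3))
    y = np.where(rng.random(200) < 0.9, X[:, 0], 1 - X[:, 0])
    theta = fit_discriminative_nb(X, y, n_classes=2)

When the naive Bayes independence assumptions actually hold, the generative (maximum-likelihood) CPT estimates and the conditional-likelihood estimates tend to differ little; the summary's point is that the advantage of the discriminative fit shows up chiefly when the assumed structure is incorrect.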

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

HandTill2001; UCI-ml