
Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes. (English) Zbl 1473.68189

Summary: This paper introduces a novel parameter estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs). The main result is that improved parameter estimation allows BNCs to outperform leading learning methods such as random forest on both 0-1 loss and RMSE, albeit only on categorical datasets. As data assets become larger, entering the hyped world of “big”, efficient and accurate classification requires three main elements: (1) classifiers with low bias that can capture the fine detail of large datasets; (2) out-of-core learners that can learn from data without having to hold it all in main memory; and (3) models that can classify new data very efficiently. The latest BNCs satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is lowered to model classification tasks more accurately, the accuracy of the parameter estimates degrades, because each parameter is then estimated from ever-decreasing quantities of data. In this paper, we introduce the use of HDPs for accurate BNC parameter estimation even at low bias. We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with random forest in terms of prediction, while keeping the out-of-core capability and superior classification time.
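
The difficulty the summary describes, and the remedy, can be sketched in a few lines. The following Python fragment is an illustration only, not the authors' HDP estimator: it shows the hierarchical back-off idea in its simplest form, where a conditional probability table entry estimated from few matching records is shrunk toward the distribution one level up the parent hierarchy. The function, the toy counts, and the fixed concentration value are invented for the example; in the paper, the amount of shrinkage at each node is inferred from the data via the HDP rather than fixed by hand.

```python
# Minimal sketch of hierarchical back-off smoothing for a BNC conditional
# probability table (illustrative only; NOT the paper's HDP estimator).
from collections import Counter

def backoff_estimate(counts, backoff_probs, concentration=1.0):
    """Shrink an empirical conditional toward its back-off distribution.

    counts        : Counter of class values observed for ONE parent
                    configuration (often tiny once nodes have many parents).
    backoff_probs : class value -> probability estimated one level up the
                    hierarchy, i.e. conditioning on fewer parents.
    concentration : pseudo-count weight of the back-off level; it plays the
                    role of the HDP concentration parameter, which the paper
                    infers from data rather than fixing by hand.
    """
    n = sum(counts.values())
    return {c: (counts.get(c, 0) + concentration * p) / (n + concentration)
            for c, p in backoff_probs.items()}

# Toy example with class y in {0, 1}: estimate P(y | x1, x2) by backing
# off to P(y | x1), which in turn backs off to a uniform root.
root = {0: 0.5, 1: 0.5}
p_y_x1 = backoff_estimate(Counter({0: 30, 1: 10}), root, concentration=2.0)
p_y_x1x2 = backoff_estimate(Counter({1: 1}), p_y_x1, concentration=2.0)

print(p_y_x1)    # ~{0: 0.74, 1: 0.26}: close to the empirical 0.75/0.25
print(p_y_x1x2)  # ~{0: 0.49, 1: 0.51}: one record alone would give {1: 1.0},
                 # but the estimate is pulled strongly toward P(y | x1)
```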

MSC:

68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
62F10 Point estimation
62F15 Bayesian inference
62G05 Nonparametric estimation
62H22 Probabilistic graphical models
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

COFFIN; UCI-ml; XGBoost

References:

[1] Bielza, C; Larrañaga, P, Discrete Bayesian network classifiers: A survey, ACM Computing Surveys, 47, 5, (2014) · Zbl 1322.68147 · doi:10.1145/2576868
[2] Boström, H, Forests of probability estimation trees, International Journal of Pattern Recognition and Artificial Intelligence, 26, 1251001, (2012) · doi:10.1142/S0218001412510019
[3] Breiman, L, Random forests, Machine Learning, 45, 5-32, (2001) · Zbl 1007.68152 · doi:10.1023/A:1010933404324
[4] Buntine, W, A guide to the literature on learning probabilistic networks from data, IEEE Transactions on Knowledge and Data Engineering, 8, 195-210, (1996) · doi:10.1109/69.494161
[5] Buntine, W., & Mishra, S. (2014). Experiments with non-parametric topic models. In 20th ACM SIGKDD international conference on knowledge discovery and data Mining, ACM, New York, NY, USA, KDD ’14 (pp. 881-890).
[6] Carvalho, AM; Roos, T; Oliveira, AL; Myllymäki, P, Discriminative learning of Bayesian networks via factorized conditional log-likelihood, Journal of Machine Learning Research, 12, 2181-2210, (2011) · Zbl 1280.68158
[7] Chen, S., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In 34th Annual meeting on association for computational linguistics, ACL ’96 (pp. 310-318).
[8] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’16 (pp. 785-794). https://doi.org/10.1145/2939672.2939785.
[9] Chow, C; Liu, C, Approximating discrete probability distributions with dependence trees, IEEE Transactions on Information Theory, 14, 462-467, (1968) · Zbl 0165.22305 · doi:10.1109/TIT.1968.1054142
[10] Du, L; Buntine, W; Jin, H, A segmented topic model based on the two-parameter Poisson-Dirichlet process, Machine Learning, 81, 5-19, (2010) · Zbl 1470.68100 · doi:10.1007/s10994-010-5197-4
[11] Fayyad, U; Irani, K, On the handling of continuous-valued attributes in decision tree generation, Machine Learning, 8, 87-102, (1992) · Zbl 0767.68084
[12] Ferguson, T, A Bayesian analysis of some nonparametric problems, The Annals of Statistics, 1, 209-230, (1973) · Zbl 0255.62037 · doi:10.1214/aos/1176342360
[13] Friedman, N; Geiger, D; Goldszmidt, M, Bayesian network classifiers, Machine Learning, 29, 131-163, (1997) · Zbl 0892.68077 · doi:10.1023/A:1007465528199
[14] Friedman, N; Koller, D, Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks, Machine Learning, 50, 95-125, (2003) · Zbl 1033.68104 · doi:10.1023/A:1020249912095
[15] Gasthaus, J., & Teh, Y. (2010). Improvements to the sequence memoizer. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Proceedings of the 23rd international conference on neural information processing systems, NIPS’10, Vancouver, British Columbia, Canada, 6-9 December 2010 (Vol. 1, pp. 685-693). Curran Associates Inc.
[16] Hardy, G. H. [(1889) 1920]. Letter. Transactions of the Faculty of Actuaries, 8, 180-181 (originally published in Insurance Record, 1889, p. 457).
[17] Huynh, V., Phung, D. Q., Venkatesh, S., Nguyen, X., Hoffman, M. D., & Bui, H. H. (2016). Scalable nonparametric Bayesian multilevel clustering. In UAI.
[18] Hwang, HK, Asymptotic expansions for the Stirling numbers of the first kind, Journal of Combinatorial Theory, Series A, 71, 343-351, (1995) · Zbl 0833.05005 · doi:10.1016/0097-3165(95)90010-1
[19] Koller, D., & Friedman, N. (2009). Probabilistic graphical models—Principles and techniques. Adaptive computation and machine learning. Cambridge, MA: The MIT Press. · Zbl 1183.68483
[20] Lewis, D. (1998). Naive Bayes at forty: The independence assumption in information retrieval. In 10th European conference on machine learning, Springer, London, UK, ECML ’98 (pp. 4-15).
[21] Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 23 Jan 2015.
[22] Lidstone, G, Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities, Transactions of the Faculty of Actuaries, 8, 182-192, (1920)
[23] Lim, K; Buntine, W; Chen, C; Du, L, Nonparametric Bayesian topic modelling with the hierarchical Pitman-Yor processes, International Journal of Approximate Reasoning, 78, 172-191, (2016) · Zbl 06639853 · doi:10.1016/j.ijar.2016.07.007
[24] Lyubimov, D., & Palumbo, A. (2016). Apache Mahout: Beyond MapReduce (1st ed.). North Charleston: CreateSpace Independent Publishing Platform.
[25] Martínez, A; Webb, G; Chen, S; Zaidi, N, Scalable learning of Bayesian network classifiers, Journal of Machine Learning Research, 17, 1-35, (2016) · Zbl 1360.68694
[26] Mitchell, T. (1997). Machine learning. New York: McGraw-Hill. · Zbl 0913.68167
[27] Nguyen, V., Phung, D. Q., Venkatesh, S., & Bui, H. H. (2015). A Bayesian nonparametric approach to multilevel regression. In PAKDD (Vol. 1, pp. 330-342).
[28] Rennie, J., Shih, L., Teevan, J., & Karger, D. (2003). Tackling the poor assumptions of naive Bayes text classifiers. In 20th International conference on machine learning, AAAI Press, ICML’03 (pp. 616-623).
[29] Roos, T; Wettig, H; Grünwald, P; Myllymäki, P; Tirri, H, On discriminative Bayesian network classifiers and logistic regression, Machine Learning, 59, 267-296, (2005) · Zbl 1101.68785
[30] Sahami, M. (1996). Learning limited dependence Bayesian classifiers. In Second international conference on knowledge discovery and data mining (pp. 334-338). AAAI Press, Menlo Park, CA.
[31] Shareghi, E., Cohn, T., & Haffari, G. (2016). Richer interpolative smoothing based on modified Kneser-Ney language modeling. In Empirical methods in natural language processing (pp. 944-949).
[32] Shareghi, E., Haffari, G., & Cohn, T. (2017a). Compressed nonparametric language modelling. In IJCAI. Accepted 23/04/2017.
[33] Shareghi, E., Haffari, G., & Cohn, T. (2017b). Compressed nonparametric language modelling. In Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17 (pp. 2701-2707). https://doi.org/10.24963/ijcai.2017/376.
[34] Sonnenburg, S., & Franc, V. (2010). COFFIN: A computational framework for linear SVMs. In J. Fürnkranz & T. Joachims (Eds.), ICML (pp. 999-1006).
[35] Teh, Y. (2006). A Bayesian interpretation of interpolated Kneser-Ney. Tech. Rep. TRA2/06, School of Computing, National University of Singapore.
[36] Teh, Y; Jordan, M; Beal, M; Blei, D, Hierarchical Dirichlet processes, Journal of the American Statistical Association, 101, 1566-1581, (2006) · Zbl 1171.62349 · doi:10.1198/016214506000000302
[37] Webb, GI; Boughton, J; Wang, Z, Not so naive Bayes: aggregating one-dependence estimators, Machine Learning, 58, 5-24, (2005) · Zbl 1075.68078 · doi:10.1007/s10994-005-4258-6
[38] Webb, G; Boughton, J; Zheng, F; Ting, K; Salem, H, Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification, Machine Learning, 86, 233-272, (2012) · Zbl 1238.68136 · doi:10.1007/s10994-011-5263-6
[39] Wermuth, N; Lauritzen, S, Graphical and recursive models for contingency tables, Biometrika, 70, 537-552, (1983) · Zbl 0541.62038 · doi:10.2307/2336490
[40] Wood, F; Gasthaus, J; Archambeau, C; James, L; Teh, Y, The sequence memoizer, Communications of the ACM, 54, 91-98, (2011) · doi:10.1145/1897816.1897842
[41] Zaidi, NA; Webb, GI; Carman, MJ; Petitjean, F; Buntine, W; Hynes, M; Sterck, H, Efficient parameter learning of Bayesian network classifiers, Machine Learning, 106, 1289-1329, (2017) · Zbl 1454.62183 · doi:10.1007/s10994-016-5619-z