Abstract
This paper proposes a method to improve attribute weighting for naïve Bayes text classifiers using an improved distance correlation coefficient. The resulting model is called improved distance correlation coefficient attribute weighted multinomial naïve Bayes, denoted IDCWMNB. Unlike traditional statistical correlation measures, which consider the cumulative distribution functions of random vectors, the improved distance correlation coefficient tests the joint correlation of random vectors through the distance between their joint characteristic function and the product of their marginal characteristic functions. It is more effective than traditional statistical measures when the relationship between two random vectors is nonlinear. Specifically, an inverse document frequency measure is proposed that accounts for distribution information on the concentration and scattering of documents. This measure is then combined with the distance correlation coefficient between attributes and categories to quantify the importance of attributes to categories, so that different terms receive different weights. The learned attribute weights are also incorporated into the posterior probability estimates of the multinomial naïve Bayes model, an approach known as deep attribute weighting. Experimental results on benchmark and real-world data indicate that the new attribute weighting method achieves an effective balance between classification accuracy and execution time.
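To make the central quantity concrete, the following is a minimal sketch of the standard sample distance correlation of Székely et al. (2007), computed from double-centered pairwise distance matrices. It is not the improved coefficient or the IDCWMNB weighting scheme, whose details the abstract does not give. The function names `_centered_distances` and `distance_correlation`, and the toy data, are assumptions made for illustration only.

```python
import numpy as np


def _centered_distances(x):
    """Return the double-centered pairwise Euclidean distance matrix of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Subtract row means and column means, then add back the grand mean.
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1] (standard form, not the paper's variant)."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()    # squared sample distance covariance
    dvar2_x = (a * a).mean()  # squared sample distance variance of x
    dvar2_y = (b * b).mean()  # squared sample distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0            # a constant input carries no dependence information
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# Toy illustration (hypothetical data): y depends on x only through x**2,
# so the linear (Pearson) correlation vanishes while the dependence is total.
x = np.linspace(-1.0, 1.0, 201)
y_quad = x ** 2
pearson_quad = float(np.corrcoef(x, y_quad)[0, 1])
dcor_quad = distance_correlation(x, y_quad)
dcor_linear = distance_correlation(x, 2.0 * x + 1.0)
```

In this toy case the Pearson coefficient is essentially zero while the distance correlation is clearly positive, which is the nonlinear situation the abstract refers to. In a text setting, `x` would plausibly be a term's frequency across documents and `y` an encoding of the category labels, but that mapping is an assumption here rather than something the abstract specifies.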
Acknowledgements
The authors thank the anonymous reviewers and the editors for their valuable comments and suggestions, and Chaoguo Tang, chief engineer at China Railway Erju Group, for providing the real engineering geological survey text data.
Funding
This work is supported by the National Key R&D Program of China (2018YFC1503705) and the Science and Technology Research Project of Hubei Provincial Department of Education (B2017597).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Ruan, S., Chen, B., Song, K. et al. Weighted naïve Bayes text classification algorithm based on improved distance correlation coefficient. Neural Comput & Applic 34, 2729–2738 (2022). https://doi.org/10.1007/s00521-021-05989-6
DOI: https://doi.org/10.1007/s00521-021-05989-6