
Large scale anomaly detection in mixed numerical and categorical input spaces. (English) Zbl 1454.62550

Summary: This work presents the ADMNC method, designed for anomaly detection in large-scale problems whose input variables are a mixture of categorical and numerical types. A flexible parametric probability measure is fitted to the input data, so that observations with low likelihood can be flagged as anomalies. The main contribution is that, to cope with the mixed nature of the variables, the joint probability measure is factorized into two parts: the marginal density of the continuous variables and the conditional probability of the categorical variables given the continuous part of the feature vector. The resulting model is trained by optimizing a maximum likelihood objective with stochastic gradient descent, which yields an effective and scalable algorithm. Compared with other well-known anomaly detection algorithms on several datasets, ADMNC both offers top-level accuracy on datasets that are out of reach for the most effective existing methods and scales well to very large datasets. This makes it a powerful tool for an increasingly common problem that currently lacks suitable scalable algorithms.
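The factorization described in the summary can be written as \(p(x, y) = p(x)\, p(y \mid x)\), where \(x\) denotes the continuous part and \(y\) the categorical part of a record (these symbols are introduced here for illustration and are not taken from the paper). The following is a minimal sketch of that idea, not the ADMNC implementation: it assumes a single one-dimensional Gaussian for \(p(x)\) and a logistic model for one binary categorical variable, both chosen only to keep the example short. All function names are hypothetical.

```python
import math
import random


def fit_gaussian(xs):
    """Maximum-likelihood mean and variance of a 1-D continuous feature (assumed model for p(x))."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, max(var, 1e-9)


def log_gaussian(x, mu, var):
    """Log-density of N(mu, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def sigmoid(z):
    z = max(-50.0, min(50.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_sgd(xs, ys, lr=0.1, epochs=50, seed=0):
    """SGD on the negative log-likelihood of p(y=1 | x) = sigmoid(w*x + b) (assumed model for p(y|x))."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = sigmoid(w * xs[i] + b) - ys[i]  # gradient of the NLL w.r.t. the logit
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


def log_score(x, y, gauss, logit):
    """log p(x) + log p(y | x); low values indicate candidate anomalies."""
    mu, var = gauss
    w, b = logit
    p1 = sigmoid(w * x + b)
    p = p1 if y == 1 else 1.0 - p1
    return log_gaussian(x, mu, var) + math.log(max(p, 1e-12))


# Synthetic "normal" data: x ~ N(0, 1), and y = 1 exactly when x > 0.
data_rng = random.Random(0)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
ys = [1 if x > 0 else 0 for x in xs]

gauss = fit_gaussian(xs)
logit = fit_logistic_sgd(xs, ys)

typical = log_score(1.0, 1, gauss, logit)        # consistent record
bad_category = log_score(1.0, 0, gauss, logit)   # category contradicts the continuous value
bad_continuous = log_score(8.0, 1, gauss, logit)  # continuous value far from the training data

print(typical, bad_category, bad_continuous)
```

In this sketch a record can receive a low score for two distinct reasons: its continuous value is unlikely under \(p(x)\), or its categorical value is unlikely given the continuous value under \(p(y \mid x)\). That separation is what the factorization is meant to provide.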

MSC:

62R07 Statistical aspects of big data and data science
62G32 Statistics of extreme values; tail inference

Software:

LOF; ELKI; PRMLT
