×

Anomaly detection in the probability simplex under different geometries. (English) Zbl 07761156

Summary: An open problem in data science is that of anomaly detection. Anomalies are instances that do not maintain a certain property that is present in the remaining observations in a dataset. Several anomaly detection algorithms exist, since the process itself is ill-posed mainly because the criteria that separates common or expected vectors from anomalies are not unique. In the most extreme case, data is not labelled and the algorithm has to identify the vectors that are anomalous, or assign a degree of anomaly to each vector. The majority of anomaly detection algorithms do not make any assumptions about the properties of the feature space in which observations are embedded, which may affect the results when those spaces present certain properties. For instance, compositional data such as normalized histograms, that can be embedded in a probability simplex, constitute a particularly relevant case. In this contribution, we address the problem of detecting anomalies in the probability simplex, relying on concepts from Information Geometry, mainly by focusing our efforts in the distance functions commonly applied in that context. We report the results of a series of experiments and conclude that when a specific distance-based anomaly detection algorithm relies on Information Geometry-related distance functions instead of the Euclidean distance, the performance is significantly improved.

MSC:

62-XX Statistics
94Axx Communication, information
62Hxx Multivariate analysis

References:

[1] Desai, J.; Watson, D.; Wang, V.; Tadeo, M.; Floridi, L., The epistemological foundations of data science: a critical review, Synthese, 200, 469 (2022) · doi:10.1007/s11229-022-03933-2
[2] Carmichael, I.; Marron, JS, Data science vs. statistics: two cultures?, Jpn. J. Stat. Data Sci., 1, 117-138 (2018) · Zbl 1430.62017 · doi:10.1007/s42081-018-0009-3
[3] Daoud, A.; Dubhashi, D., Statistical, modeling: the three cultures, Harvard Data Sci. Rev. (2023) · doi:10.1162/99608f92.89f6fe66
[4] Liberti, L., Distance geometry and data science, TOP, 28, 2, 271-339 (2020) · Zbl 1511.51006 · doi:10.1007/s11750-020-00563-0
[5] Tukey, J., Exploratory Data Analysis (1977), London: Pearson, London · Zbl 0409.62003
[6] Steinbach, M., Ertöz, L., Kumar, V.: The challenges of clustering high-dimensional data. In: New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition · Zbl 1078.62066
[7] Epstein, C.; Carlsson, G.; Edelsbrunner, H., Topological data analysis, Inverse Probl., 27, 12 (2011) · doi:10.1088/0266-5611/27/12/120201
[8] Goldstein, M.; Uchida, S., A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLoS ONE, 11, 4 (2016) · doi:10.1371/journal.pone.0152173
[9] Tenenbaum, JB; Silva, V.; Langford, C., A global geometric framework for nonlinear dimensionality reduction, Science, 290, 5500, 2319-2323 (2000) · doi:10.1126/science.290.5500.2319
[10] Lee, J.; Verleysen, M., Nonlinear Dimensionality Reduction (2007), New York: Springer, New York · Zbl 1128.68024 · doi:10.1007/978-0-387-39351-3
[11] Aguayo, L.; Barreto, G., Novelty detection in time series using self-organizing neural networks: a comprehensive evaluation, Neural Proc. Lett., 1, 1 (2017) · doi:10.1007/s11063-017-9679
[12] Zimek, A., Schubert, E., Kriegel, P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. (2012) · Zbl 07260336
[13] Grubbs, FE, Sample criteria for testing outlying observations, Ann. Math. Stat., 21, 1, 27-58 (1950) · Zbl 0036.21003 · doi:10.1214/aoms/1177729885
[14] Barnett, V.; Lewis, T., Outliers in Statistical Data (1978), New York: Wiley, New York · Zbl 0377.62001
[15] Markou, M.; Singh, M., Novelty detection: a review-Part 1, statistical approaches, Signal Process., 83, 12, 2481-2497 (2003) · Zbl 1145.94402 · doi:10.1016/j.sigpro.2003.07.0
[16] Ester, M., Kriegel, H.P., Sander, J., Xu, X., Xiaowei, E.S., Evangelos, H., Jiawei, F., Usama M. (eds.).: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231. AAAI Press, Washington (1996)
[17] Brendan, JF; Delbert, D., Clustering by passing messages between data points, Science, 315, 5814, 972-976 (2007) · Zbl 1226.94027 · doi:10.1126/science.1136800
[18] Breunig, M., Kriegel, H.P., Ng, R., Sander, J., LOF: Identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93-104. SIGMOD. doi:10.1145/335191.335388. ISBN 1-58113-217-4 (2000)
[19] Pimentel, M.; Clifton, D.; Clifton, L.; Tarassenko, L., A review on novelty detection, Signal Process., 99, 215-249 (2014) · doi:10.1016/j.sigpro.2013.12.026
[20] Markou, M.; Singh, M., Novelty detection: a review-Part 2, neural network based approaches, Signal Process., 83, 12, 2499-2521 (2003) · Zbl 1145.94403 · doi:10.1016/j.sigpro.2003.07.019
[21] Selicato, L.; Esposito, F.; Gargano, G.; Vegliante, MC; Opinto, G.; Zaccaria, GM; Ciavarella, S.; Guarini, A.; Del Buono, N., A new ensemble method for detecting anomalies in gene expression matrices, Mathematic, 9, 882 (2021) · doi:10.3390/math9080882
[22] Li, HZ; Boulanger, P., A survey of heart anomaly detection using ambulatory electrocardiogram (ECG), Sensors (Basel), 20, 5, 1461 (2020) · doi:10.3390/s20051461
[23] Basora, L.; Olive, X.; Dubot, T., Recent advances in anomaly detection methods applied to aviation, Aerospace, 6, 11, 117 (2019) · doi:10.3390/aerospace6110117
[24] Schwabacher, M.; Oza, N.; Matthews, B., Unsupervised anomaly detection for liquid-fueled rocket propulsion health monitoring, J. Aerosp. Comput. Inf. Commun., 6, 7 (2009) · doi:10.2514/1.42783
[25] Yepmo, G.; Smits, G.; Pivert, O., Anomaly explanation: a review, Data Knowl. Eng., 137 (2022) · doi:10.1016/j.datak.2021.101946
[26] Greenacre, M., Compositional Data Analysis in Practice (2018), London: CRC Press, London · Zbl 1416.62022 · doi:10.1201/9780429455537
[27] Aitchison, J., The statistical analysis of compositional data, J. R. Stat. Soc. B, 44, 2, 139-177 (1982) · Zbl 0491.62017
[28] Nielsen, F., An elementary introduction to information geometry, Entropy, 22, 10, 1100 (2020) · doi:10.3390/e22101100
[29] Nielsen, F., The many faces of information geometry, Notices AMS, 69, 36-45 (2022) · Zbl 1493.53021
[30] Rao, CR, Information and accuracy attainable in the estimation of statistical parameters, Bull. Calcutta Math. Soc., 37, 81-91 (1945) · Zbl 0063.06420
[31] Deza, M.; Deza, E., Encyclopedia of Distances (2018), New York: Springer, New York · Zbl 1167.51001
[32] Aitchison, J., Principal component analysis of compositional data, Biometrika, 70, 1, 57-65 (1983) · Zbl 0515.62057 · doi:10.1093/biomet/70.1.57
[33] Nielsen, F., Sun, K.: Clustering in Hilbert simplex geometry. Clustering in Hilbert’s projective geometry: the case studies of the probability simplex and the elliptope of correlation matrices. In: Nielsen, F. (eds) Geometric Structures of Information. Signals and Communication Technology. Springer, Cham. doi:10.1007/978-3-030-02520-5_11 (2019) · Zbl 1418.53022
[34] Avalos-Fernandez, M.; Nock, R.; Ong, CS; Rouar, J.; Sun, K., Representation learning of compositional data, NIPS, 18, 6680-6690 (2018) · doi:10.5555/3327757.3327774
[35] Bulmer, M., Principles of Statistics (1979), New York: Dover Publications, New York · Zbl 0139.35804
[36] Li, Q., McKenzie, D., Yin, W.: From the simplex to the sphere: faster constrained optimization using the Hadamard parametrization. arXiv:2112.05273. doi:10.48550/arXiv.2112.05273 (2022) · Zbl 1531.90108
[37] Mehrotra, K.; Mihan, C.; Huang, H., Anomaly Detection, Principles and Algorithms (2017), New York: Springer, New York · doi:10.1007/978-3-319-67526-8
[38] Schubert, E.; Zimek, A.; Kriegel, HP, Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection, Data Min. Knowl. Discov., 28, 190-237 (2014) · Zbl 1281.68192 · doi:10.1007/s10618-012-0300-z
[39] Liu, F.T., Ting, K.M., ZHou, Z.H.: Isolation forest. In: Eighth IEEE International Conference on Data Mining, pp. 413-422. doi:10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449 (2008)
[40] Knorr, E.; Ng, R.; Tucakov, V., Distance-based outliers: algorithms and applications, VLDB J., 8, 237-253 (2000) · doi:10.1007/s007780050006
[41] Iglewicz, B.; Hoaglin, D., How to Detect and Handle Outliers (1993), New York: American Society for Quality Control, New York
[42] Aguayo, L.; Barreto, G., Novelty detection in time series using self-organizing neural networks: a comprehensive evaluation, Neural Process. Lett., 47, 1 (2017) · doi:10.1007/s11063-017-9679
[43] Neme, A.; Lugo, B.; Cervera, A., Authorship attribution as a case of anomaly detection: a neural network model, Int. J. Hybrid Intell. Syst., 8, 4, 225-235 (2011)
[44] Neme, A.; Gutierrez-Pulido, J.; Muñoz, A.; Hernández, S.; Dey, T., Stylistics analysis and authorship attribution algorithms based on self-organizing maps, Neurocomputing, 147, 147-159 (2015) · doi:10.1016/j.neucom.2014.03.064
[45] Forrest, S., Perelson, A.S., Allen, L., Cherukuri, R.: Self-nonself discrimination in a computer. In: Proceedings of the 1994 IEEE Symposium on Research in Security and Privacy, Los Alamitos, pp. 202-212 (1994)
[46] Wang, K.; Langevin, S.; Shattuck, M.; Ogle, S.; Kirby, M., Anomaly detection in host signaling pathways for the early prognosis of acute infection, PLOS (2016) · doi:10.1371/journal.pone.0160919
[47] Wang, G.; Yang, J.; Li, R., Imbalanced SVM-based anomaly detection algorithm for imbalanced training datasets, Electron. Telecommun. Res. Inst., 39-5, 621-631 (2017) · doi:10.4218/etrij.17.0116.0879
[48] Zhao, W.; Li, L.; Alam, S.; Wang, Y., An incremental clustering method for anomaly detection in flight data, Transport. Res. Part C Emerg. Technol., 132 (2021) · doi:10.1016/j.trc.2021.103406
[49] Evangelou, M.; Adams, N., An anomaly detection framework for cyber-security data, Comput. Secur., 97 (2021) · doi:10.1016/j.cose.2020.101941
[50] Novikova, E., Kotenko, I.: Visual analytics for detecting anomalous activity in mobile money transfer services. In: International Cross-Domain Conference and Workshop on Availability, Reliability,and Security (CD-ARES), Fribourg pp. 63-78. doi:10.1007/978-3-319-10975-65 (2014)
[51] Garrard, P.; Maloney, L.; Hodges, J.; Patterson, K., The effects of very early Alzheimer’s disease on the characteristics of writing by a renowned author, Brain, 128, 2, 250-260 (2005) · doi:10.1093/brain/awh341
[52] Close, L.; Kashef, R., Combining artificial immune system and clustering analysis: a stock market anomaly detection model, J. Intell. Learn. Syst. Appl. (2020) · doi:10.4236/jilsa.2020.124005
[53] Colignatus, T.: Comparing the Aitchison Distance and the Angular Distance for Use as Inequality or Disproportionality Measures for Votes and Seats (2018)
[54] Villani, C.: Optimal Transport, Old and New. Springer, New York. ISBN 978-3-540-71050-9 (2008) · Zbl 1156.53003
[55] Bigot, J., Statistical data analysis in the Wasserstein space, J. 2018 MAS Sampling Process., 68, 1-19 (2020) · Zbl 1444.62171 · doi:10.1051/proc/202068001
[56] Peyre, G., Cuturi, M.: Computational Optimal Transport. arXiv:1803.00567 (2018) · Zbl 1475.68011
[57] Aler, R.; Valss, J.; Bostrom, H., Study of Hellinger distance as a splitting metric for random forests in balanced and imbalanced classification datasets, Expert Syst. Appl., 1 (2020) · doi:10.1016/j.eswa.2020.113264
[58] Lavigne, C.; Ricci, B.; Franck, P.; Senoussi, R., Spatial analyses of ecological count data: a density map comparison approach, Basic Appl. Ecol., 11, 734-742 (2010) · doi:10.1016/j.baae.2010.08.011
[59] Menendez, ML; Pardo, JA; Pardo, M., The Jensen-Shannon divergence, J. Franklin Inst., 334, 2, 307-318 (1997) · Zbl 0928.62009 · doi:10.1016/S0016-0032(96)00063-4
[60] Coles, P.; Cerezo, M.; Cincio, L., Strong bound between trace distance and Hilbert-Schmidt distance for low-rank states, Phys. Rev. A., 100, 2 (2019) · doi:10.1103/PhysRevA.100.022103
[61] Gattone, S.; Sanctis, A.; Russo, T.; Pulcini, D., A shape distance based on the Fisher-Rao metric and its application for shapes clustering, Phys. A Stat. Mech. Appl. (2017) · Zbl 1499.53072 · doi:10.1016/j.physa.2017.06.014
[62] Hawkins, D., Identification of Outliers (1980), New York: Springer, New York · Zbl 0438.62022 · doi:10.1007/978-94-015-3994-4
[63] Nakamura, Y.; Gojobori, T.; Ikemura, T., Codon usage tabulated from the international DNA sequence databases: status for the year 2000, Nucl. Acids Res., 28, 292 (2000) · doi:10.1093/nar/28.1.292
[64] Khomtchouk, BB, Codon usage bias levels predict taxonomic identity and genetic composition, bioRxiv (2020) · doi:10.1101/2020.10.26.356295
[65] Nelson, D.L., Cox, M.M.: Principles of Biochemistry, 4th edn. W. H. Freeman, New York. ISBN 0-7167-4339-6 (2005)
[66] Parvathy, ST; Udayasuriyan, V.; Bhadana, V., Codon usage bias, Mol. Biol. Rep., 49, 539-565 (2022) · doi:10.1007/s11033-021-06749-4
[67] Prat, Y.; Fromer, M.; Linial, N., Codon usage is associated with the evolutionary age of genes in metazoan genomes, BMC Evol. Biol., 9, 285 (2009) · doi:10.1186/1471-2148-9-285
[68] Pearson, K., A First Study of the Statistics of Pulmonary Tuberculosis (1907), London: Dalau, London
[69] Poincare, H.: Analysis Situs. Translated version from French (1895)
[70] Lloyd, SP, Least squares quantization in PCM, IEEE Trans. Inf. Theory, 28, 2, 129-137 (1982) · Zbl 0504.94015 · doi:10.1109/TIT.1982.1056489
[71] Shannon, C.E.A.: Mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379-423, 623-656 (2020) doi:10.1002/j.1538-7305.1948.tb01338.x · Zbl 1154.94303
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.