Article Contents

2024, Volume 6, Issue 4: 492-513. Doi: 10.3934/fods.2024024

This issue Previous Article On Wasserstein distances for affine transformations of random vectors Next Article Sparse additive function decompositions facing basis transforms

A topological approach for capturing high-order interactions in graph data with applications to anomaly detection in time-varying cryptocurrency transaction graphs

1.
Department of Mathematics and Statistics, Bowling Green State University, USA
2.
Department of Mathematics, University of Evansville, USA
3.
Department of Finance, University of Central Florida, USA
4.
Department of Biostatistics, Virginia Commonwealth University, USA

^*Corresponding author: Umar Islambekov
^*Corresponding author: Umar Islambekov

Received: November 2023

Revised: April 2024

Early access: May 2024

Published: December 2024

Abstract / Introduction Full Text(HTML) Figure(8) / Table(5) Related Papers Cited by

Abstract

Time-varying graphs are increasingly common in financial, social and biological data analysis applications. Feature extraction that efficiently encodes the complex structure of sparse, multi-layered, dynamic graphs presents computational and methodological challenges. In the past decade, topological data analysis has become a popular method of studying the shape of data. This is achieved by building an increasing sequence of simplicial complexes (called filtration) indexed by a scale parameter on top of the data to keep track of topological changes along with the filtration. This multi-scale summary, called persistence diagram (PD), is often vectorized to be used in machine learning algorithms. This paper introduces a topological approach to extract information on higher-order interactions encoded in persistence diagrams from graph data. Our framework has two main steps: first, we convert the graph into a higher-dimensional simplicial complex by adding structures such as triangles, tetrahedrons etc., and compute a PD using the so-called lower-star filtration which utilizes quantitative node attributes. Then, we vectorize the PD by averaging the associated Betti function over successive scale values of a one-dimensional grid using integration. A notable aspect of our procedure is that it avoids embedding a graph into a metric space. We show that the proposed vectorization summary is robust against input noise with respect to the $ L_1 $ 1-Wasserstein distance. In simulation studies, the proposed approach leads to improved change point detection rates and outperforms one of the state-of-the-art methods for anomaly detection in time-varying graphs. In real data application, our approach leads to up to a 20% gain in anomalous price prediction in the Ethereum cryptocurrency transaction network.

Keywords:

Mathematics Subject Classification: Primary: 55N31, 62R40, 68T09.

Citation:

Full Text(HTML)

Figure 1. Geometric representation of simplices². A collection of simplices makes up a simplicial complex which is used to approximate the shape structure underlying a discrete set of data points

Download: Full-size image PowerPoint slide

Figure 2. A simple toy graph $ \mathcal G $ with six nodes and eight edges. We can view $ \mathcal G $ as a one-dimensional simplicial complex with six 0-simplices (nodes) and eight 1-simplices (edges)

Download: Full-size image PowerPoint slide

Figure 3. The Vietoris-Rips (VR) and lower-star (LS) filtrations of the toy graph $ \mathcal{G} $ in Figure 2. To construct the VR filtration, the graph nodes are first embedded into a metric space using the geodesic distance. For the LS filtration, the dimension of $ \mathcal G $, viewed as a one-dimensional simplicial complex, is first increased from one to two by adding the triangles formed by the graph edges. The resulting filtration is induced by function $ g $ defined in Example 2.1 on the set of nodes of $ \mathcal{G} $. Notice that the VR filtration involves a lot more 1-simplices (edges) and 2-simplices (shaded triangles) than the LS filtration

Download: Full-size image PowerPoint slide

Figure 4. Plots of persistence diagrams $ D_1 $ (blue) and $ D_2 $ (red) defined in Example 2.6. The points matched by a optimal bijection of the $ L^1 $ 1-Wasserstein distance are joined by dashed lines. The blue point in the top-right corner is matched to a point on the diagonal

Download: Full-size image PowerPoint slide

Figure 5. Graphs of Betti functions $ \beta_1(t) $ (blue) and $ \beta_2(t) $ (red) associated with the persistence diagrams $ D_1 $ and $ D_2 $ from Example 2.6 respectively. The weight function $ w $ used in the construction of the Betti functions is identically equal to one

Download: Full-size image PowerPoint slide

Figure 6. Comparison plots between non-topological (A) and topological (B) vector summaries for the experiment of Section 3.1. The vertical lines indicate the locations of the true change points

Download: Full-size image PowerPoint slide

Figure 7. Performance plots for the experiment of Section 3.2, where the VAB summary is contrasted with Laplacian Anomaly Detection (LAD). The vertical lines in Figure 7b go though the true change points

Download: Full-size image PowerPoint slide

Figure 8. Gains in AUC of the models $ M_2 $, $ M_3 $ and $ M_4 $ containing topological features (based on either VABs, persistence landscapes (PL) or persistence images (PI)) with respect to the baseline model $ M_1 $ for prediction horizons $ h\in \{1, 2, \ldots, 7\} $ days. The baseline model $ M_1 $ is free of topological features. Models $ M_2 $ and $ M_3 $ involve topological features of homological dimension zero and one, whereas $ M_4 $ is the full model including all the available features

Download: Full-size image PowerPoint slide

Table 1. Persistence diagrams of the Vietoris-Rips (VR) and lower-star (LS) filtrations of the toy graph $ \mathcal G $ in Figure 2. Topological features of dimension 0 and 1 are connected components and loops respectively. See the depictions of the VR and LS filtrations in Figures 3a and 3b

Filtration	Dim	Birth	Death
VR	0	0	$ \infty $
	0	0	1
	0	0	1
	0	0	1
	0	0	1
	0	0	1
	1	1	2
LS	0	1	$ \infty $
	0	1	2
	1	3	$ \infty $

| Show Table

DownLoad: CSV

Table 2. Mean absolute errors (computed over 100 simulation runs) of the estimated change points (CP) in time-varying graphs indexed by $ t\in \{1, 2, \ldots, 200\} $. The graphs are generated according to the random dot product graph model. The true CPs occur at the time points $ t = 51 $, $ t = 101 $ and $ t = 151 $ when the graph generating process undergoes a distributional change. The CPs are estimated using a permutation-based algorithm called E-divisive [33]

Input	1st CP	2nd CP	3rd CP
Common Betti (dim=0)	7.41	30.94	24.03
Common Betti (dim=1)	3.20	5.15	32.76
Common Betti (dim=2)	0.72	0.63	7.46
VAB(2,0)	1.25	17.57	11.59
VAB(2,1)	1.06	1.39	28.54
VAB(2,2)	0.71	1.29	2.07
Graph summaries	49.12	0.85	33.93
Motifs	48.59	0.81	34.81

| Show Table

DownLoad: CSV

Table 3. The SBM parameter settings for the experiment of Section 3.2. The time-varying graphs, each with 500 nodes, are indexed by $ t\in \{0, 1, 2, \ldots, 150\} $. The change points are located at seven time points: 16, 31, 61, 76, 91,106 and 136. $ N_c $ is the number of (equal sized) communities, $ p_{in} $ and $ p_{ex} $ are the interval and external connectivity probabilities respectively

Interval	SBM parameter settings
$ [0, 15] $	$ N_c=4 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $
$ [16, 30] $	$ N_c=10 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $
$ [31, 60] $	$ N_c=2 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $
$ [61, 75] $	$ N_c=4 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $
$ [76, 90] $	$ N_c=10 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $
$ [91,105] $	$ N_c=2 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $
$ [106,135] $	$ N_c=4 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $
$ [136,150] $	$ N_c=10 $, $ p_{in}=0.07 $, $ p_{ex}=0.05 $

| Show Table

DownLoad: CSV

Table 4. Estimated change points (CP) for the experiment of Section 3.2, where the performance of the VAB summary (used within the change point-detection algorithm E-divisive) is compared against one of the state-of-the-art methods – Laplacian Anomaly Detection (LAD). The time-varying graphs $ \{\mathcal{G}_t\}_{t = 0}^{150} $, each with 500 nodes, are generated from the Stochastic Block Model. The true CPs are located at time points 16, 31, 61, 76, 91,106 and 136, where at least one of the model parameters abruptly changes

	1st CP	2nd CP	3rd CP	4th CP	5th CP	6th CP	7th CP
True CPs	16	31	61	76	91	106	136
VAB(2,2)	16	31	61	76	91	106	136
LAD	31	91	94	102	106	112	133
VAB(1,1)	-	31	63	-	89	-	136

| Show Table

DownLoad: CSV

Table 5. Three sets of features used as input variables for modelling. $ PN $ - normalized price, $ |E| $ - number of edges, $ |V| $ - number of nodes, $ RD_7(\boldsymbol\beta_{0}) $, $ RD_7(\boldsymbol\beta_{1}) $ and $ RD_7(\boldsymbol\beta_{2}) $ - rolling depth values of $ \boldsymbol\beta_{0} $, $ \boldsymbol\beta_{1} $ and $ \boldsymbol\beta_{2} $ with respect to past 7 days

Model	Features
$ M_1 $	$ PN $, $ {\|E\|} $, $ \|V\| $, $ C $
$ M_2 $	$ PN $, $ {\|E\|} $, $ \|V\| $, $ C $, $ RD_7(\boldsymbol\beta_{0}) $
$ M_3 $	$ PN $, $ {\|E\|} $, $ \|V\| $, $ C $, $ RD_7(\boldsymbol\beta_{0}) $, $ RD_7(\boldsymbol\beta_{1}) $
$ M_4 $	$ PN $, $ {\|E\|} $, $ \|V\| $, $ C $, $ RD_7(\boldsymbol\beta_{0}) $, $ RD_7(\boldsymbol\beta_{1}) $, $ RD_7(\boldsymbol\beta_{2}) $

| Show Table

DownLoad: CSV

Related Papers

Cited by

References

[1]	N. C. Abay, C. G. Akcora, Y. R. Gel, M. Kantarcioglu, U. D. Islambekov, Y. Tian and B. Thuraisingham, Chainnet: Learning on blockchain graphs with topological features, in 2019 IEEE international conference on data mining (ICDM), IEEE, 2019, 946-951. doi: 10.1109/ICDM.2019.00105.
[2]	H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta and L. Ziegelmeier, Persistence images: A stable vector representation of persistent homology, The Journal of Machine Learning Research, 18 (2017), Paper No. 8, 35 pp.
[3]	P. Bubenik, Statistical topological data analysis using persistence landscapes, The Journal of Machine Learning Research, 16 (2015), 77-102.
[4]	P. Bubenik and T. Vergili, Topological spaces of persistence modules and their properties, Journal of Applied and Computational Topology, 2 (2018), 233-269. doi: 10.1007/s41468-018-0022-4.
[5]	G. Carlsson, Topology and data, Bulletin of the American Mathematical Society, 46 (2009), 255-308. doi: 10.1090/S0273-0979-09-01249-X.
[6]	M. Carrière, F. Chazal, Y. Ike, T. Lacombe, M. Royer and Y. Umeda, Perslay: A neural network layer for persistence diagrams and new graph topological signatures, in International Conference on Artificial Intelligence and Statistics, PMLR, 2020, 2786-2796.
[7]	M. Carrière and R. Rabadán, Topological data analysis of single-cell Hi-C contact maps, in Topological Data Analysis-the Abel Symposium 2018, Springer, Cham, Abel Symp., 15 (2020), 147-162. doi: 10.1007/978-3-030-43408-3_6.
[8]	F. Chazal, V. De Silva and S. Oudot, Persistence stability for geometric complexes, Geometriae Dedicata, 173 (2014), 193-214. doi: 10.1007/s10711-013-9937-z.
[9]	F. Chazal and B. Michel, An introduction to topological data analysis: Fundamental and practical aspects for data scientists, Frontiers in Artificial Intelligence, 4 (2021). doi: 10.3389/frai.2021.667963.
[10]	Y.-M. Chung and A. Lawson, Persistence curves: A canonical framework for summarizing persistence diagrams, Advances in Computational Mathematics, 48 (2022), Paper No. 6, 42 pp. doi: 10.1007/s10444-021-09893-4.
[11]	D. Cohen-Steiner, H. Edelsbrunner and J. Harer, Stability of persistence diagrams, Discrete & Computational Geometry, 37 (2007), 103-120. doi: 10.1007/s00454-006-1276-5.
[12]	D. Cohen-Steiner, H. Edelsbrunner, J. Harer and Y. Mileyko, Lipschitz functions have l p-stable persistence, Foundations of Computational Mathematics, 10 (2010), 127-139. doi: 10.1007/s10208-010-9060-6.
[13]	G. Csardi and T. Nepusz, The igraph software package for complex network research, InterJournal, Complex Systems, 1695 (2006), 1-9.
[14]	H. Edelsbrunner and J. Harer, Computational Topology: An Introduction, American Mathematical Soc., Providence, RI, 2010. doi: 10.1090/mbk/069.
[15]	H. Edelsbrunner, J. Harer et al., Persistent homology-a survey, Surveys on Discrete and Computational Geometry, Contemporary Mathematics, American Mathematical Society, Providence, RI, 453 (2008), 257-282. doi: 10.1090/conm/453/08802.
[16]	H. Edelsbrunner and J. L. Harer, Computational Topology: An Introduction, American Mathematical Society, Providence, RI, 2010. doi: 10.1090/mbk/069.
[17]	B. T. Fasy, J. Kim, F. Lecci and C. Maria, Introduction to the R package TDA, arXiv preprint, arXiv: 1411.1830.
[18]	B. T. Fasy, J. Kim, F. Lecci, C. Maria, D. L. Millman and V. Rouvreau., TDA: Statistical Tools for Topological Data Analysis, 2021, URL https://CRAN.R-project.org/package=TDA, R package version 1.7.7.
[19]	R. Ghrist, Barcodes: the persistent topology of data, Bulletin of the American Mathematical Society, 45 (2008), 61-75. doi: 10.1090/S0273-0979-07-01191-3.
[20]	M. Gidea, Topology data analysis of critical transitions in financial networks, arXiv preprint arXiv: 1701.06081. doi: 10.2139/ssrn.2903278.
[21]	M. Gidea, D. Goldsmith, Y. Katz, P. Roldan and Y. Shmalo, Topological recognition of critical transitions in time series of cryptocurrencies, Physica A: Statistical Mechanics and Its Applications, 548 (2020), 123843. doi: 10.1016/j.physa.2019.123843.
[22]	M. Hajij, B. Wang, C. Scheidegger and P. Rosen, Persistent homology guided exploration of time-varying graphs, arXiv preprint, arXiv: 1707.06683.
[23]	P. W. Holland, K. B. Laskey and S. Leinhardt, Stochastic blockmodels: First steps, Social Networks, 5 (1983), 109-137. doi: 10.1016/0378-8733(83)90021-7.
[24]	S. Huang, Y. Hitti, G. Rabusseau and R. Rabbany, Laplacian change point detection for dynamic graphs, in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, 349-358. doi: 10.1145/3394486.3403077.
[25]	U. Islambekov and A. Luchinsky, TDAvec: Vector Summaries of Persistence Diagrams, 2022, URL https://CRAN.R-project.org/package=TDAvec, R package version 0.1.2.
[26]	U. Islambekov and H. Pathirana, Vector summaries of persistence diagrams for permutation-based hypothesis testing, Foundations of Data Science, 6 (2024), 41-61. doi: 10.3934/fods.2024002.
[27]	M. Kerber, D. Morozov and A. Nigmetov, Geometry helps to compare persistence diagrams, ACM J. Exp. Algorithmics, 22 (2017), Art. 1.4, 20 pp. doi: 10.1145/3064175.
[28]	M. Kraetzl, C. Nickel and E. R. Scheinerman, Random dot product graphs: A model for social networks, Preliminary manuscript.
[29]	H. W. Kuhn, The hungarian method for the assignment problem, Naval Research Logistics Quarterly, 2 (1955), 83-97. doi: 10.1002/nav.3800020109.
[30]	M. Z. Li, M. S. Ryerson and H. Balakrishnan, Topological data analysis for aviation applications, Transportation Research Part E: Logistics and Transportation Review, 128 (2019), 149-174. doi: 10.1016/j.tre.2019.05.017.
[31]	Y. Li, U. Islambekov, C. Akcora, E. Smirnova, Y. R. Gel and M. Kantarcioglu, Dissecting ethereum blockchain analytics: What we learn from topology and geometry of the ethereum graph?, in Proceedings of the 2020 SIAM International Conference on Data Mining, SIAM, 2020, 523-531. doi: 10.1137/1.9781611976236.59.
[32]	S. López-Pintado and J. Romo, On the concept of depth for functional data, Journal of the American statistical Association, 104 (2009), 718-734. doi: 10.1198/jasa.2009.0108.
[33]	D. S. Matteson and N. A. James, A nonparametric approach for multiple change point analysis of multivariate data, Journal of the American Statistical Association, 109 (2014), 334-345. doi: 10.1080/01621459.2013.849605.
[34]	Y. Mileyko, S. Mukherjee and J. Harer, Probability measures on the space of persistence diagrams, Inverse Problems, 27 (2011), 124007. doi: 10.1088/0266-5611/27/12/124007.
[35]	V. Nanda and R. Sazdanović, Simplicial models and topological inference in biological systems, in Discrete and Topological Models in Molecular Biology, Springer, 2014, 109-141. doi: 10.1007/978-3-642-40193-0_6.
[36]	S. W. Park, Y. Y. Choi, D. Joe, U. J. Choi and Y. Woo, The pwlr graph representation: A persistent weisfeiler-lehman scheme with random walks for graph classification, in Topological, Algebraic and Geometric Learning Workshops 2022, PMLR, 2022, 287-297.
[37]	J. A. Pike, A. O. Khan, C. Pallini, S. G. Thomas, M. Mund, J. Ries, N. S. Poulter and I. B. Styles, Topological data analysis quantifies biological nano-structure from single molecule localization microscopy, Bioinformatics, 36 (2020), 1614-1621. doi: 10.1093/bioinformatics/btz788.
[38]	B. B. F. Pontiveros, M. Steichen and R. State, Mint centrality: A centrality measure for the bitcoin transaction graph, in 2019 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), IEEE, 2019, 159-162. doi: 10.1109/BLOC.2019.8751401.
[39]	T. Qaiser, Y.-W. Tsang, D. Taniyama, N. Sakamoto, K. Nakane, D. Epstein and N. Rajpoot, Fast and accurate tumor segmentation of histology images using persistent homology and deep convolutional features, Medical Image Analysis, 55 (2019), 1-14. doi: 10.1016/j.media.2019.03.014.
[40]	R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2021, URL https://www.R-project.org/.
[41]	B. Rieck, C. Bock and K. Borgwardt, A persistent weisfeiler-lehman procedure for graph classification, in International Conference on Machine Learning, PMLR, 2019, 5448-5458.
[42]	B. Rieck, U. Fugacci, J. Lukasczyk and H. Leitte, Clique community persistence: A topological visual analysis approach for complex networks, IEEE Transactions on Visualization and Computer Graphics, 24 (2018), 822-831. doi: 10.1109/TVCG.2017.2744321.
[43]	S. Rudkin, W. Rudkin and P. Dłotko, On the topology of cryptocurrency markets, International Review of Financial Analysis, 89 (2023), 102759. doi: 10.1016/j.irfa.2023.102759.
[44]	K. Shamsi, F. Victor, M. Kantarcioglu, Y. Gel and C. G. Akcora, Chartalist: Labeled graph datasets for utxo and account-based blockchains, Advances in Neural Information Processing Systems, 35 (2022), 34926-34939.
[45]	A. D. Smith, P. Dłotko and V. M. Zavala, Topological data analysis: concepts, computation, and applications in chemical engineering, Computers & Chemical Engineering, 146 (2021), 107202. doi: 10.1016/j.compchemeng.2020.107202.
[46]	Y. Umeda, Time series classification via topological data analysis, Information and Media Technologies, 32 (2017), 1-12. doi: 10.1527/tjsai.D-G72.
[47]	F. Victor, Address clustering heuristics for ethereum, in International Conference on Financial Cryptography and Data Security, Springer, 12059 (2020), 617-633. doi: 10.1007/978-3-030-51280-4_33.
[48]	F. Victor, C. G. Akcora, Y. R. Gel and M. Kantarcioglu, Alphacore: Data depth based core decomposition, in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, 1625-1633. doi: 10.1145/3447548.3467322.
[49]	G. Wood, Ethereum: A secure decentralised generalised transaction ledger, Ethereum Project Yellow Paper, 151 (2014), 1-32.
[50]	Q. Zhao and Y. Wang, Learning metrics for persistence-based summaries and applications for graph classification, Advances in Neural Information Processing Systems, 32.
[51]	A. Zomorodian and G. Carlsson, Computing persistent homology, Discrete & Computational Geometry, 33 (2005), 249-274. doi: 10.1007/s00454-004-1146-y.