1 Introduction

Anti-money laundering (AML) is a crucial task for the financial security departments of all countries. Data from the International Monetary Fund (IMF) and related statistics show that the scale of money laundering globally is about two trillion dollars per year, accounting for about 2% of global GDP, and is still increasing at a rate of 100 billion dollars per year [1, 2]. Facing massive and rapid transactions every minute, a timely and scalable money laundering identification (MLI) method is urgently required to distinguish suspicious transactions, thereby preventing the further use of illicit money for purposes such as drug dealing, smuggling, and terrorism [3].

Fig. 1: The overview of SEGE. Structural entropy measures the higher-order structure of graphs and is guided by unsupervised account features on transaction graphs, thereby establishing money laundering groups

Previous MLI studies have made remarkable progress from heuristic methods to data-driven ones, including rule-based, graph-division-based [4], decision-tree-based [5, 6], statistical-inference-based [7] and dynamic-graph-based methods [8]. However, these methods mostly focus on the individual-level behavior of accounts, which raises several insurmountable obstacles. First, MLI methods based on individual-level features suffer from serious label-shifting issues: many accounts with very similar behaviors have totally different labels. Second, the label time bias impedes the application of supervised or semi-supervised approaches; money laundering behaviors always precede their identification considerably, and by the time such a behavior is confirmed in court, the pattern has already changed. Furthermore, the important fact that money laundering is usually a well-organized crime is still not fully exploited by current methods, as the collaborative group behaviors of money laundering cannot be easily captured by shallow models in a straightforward way (Fig. 1).

To address these issues, the approach of structural entropy minimization (SEM) offers significant theoretical guidance for MLI. SEM is a principle of community detection based on a random walk heuristic [9]. Consider a money transaction network that takes accounts as vertices, transaction relations as edges, and amounts of money as edge weights. A random walk in this network tends to step along edges of high weight and, consequently, to be trapped with high frequency in communities whose vertices are connected by heavy edges. Tight connection is supposed to be a typical characteristic of money laundering communities, which can be detected by structural entropy, since it is a natural measure of the complexity of random walks that reveals clustering structure. Considering the rapid advances of unsupervised deep graph representation learning, SEM is an ideal tool to enhance the utilization of node embeddings, thereby improving the effectiveness of automatic money laundering identification.

To this end, in this paper we propose an SEM-based unsupervised algorithm that also utilizes graph embedding information, named the Structural Entropy with Graph Embedding (SEGE) algorithm, to identify money laundering on real-world transaction data in which the real money laundering community has been confirmed by the anti-money laundering agency. For public verification, we also generate synthetic datasets from a public money laundering model. The experimental results indicate that the proposed SEGE algorithm and its non-parametric variant SE achieve the best performance compared to all baselines. The contributions of this paper are summarized as follows:

  • We study the community-based characteristics of money laundering and investigate its collaborative group behaviors. To the best of our knowledge, this is the first work to investigate the comprehensive community structure of real money laundering criminal gangs from the view of local explorations. We also verify the real-world phenomenon that money laundering accounts disperse notably but are clustered locally.

  • We propose an unsupervised local-graph-structure-enhanced algorithm based on Structural Entropy Minimization to discover suspicious money laundering accounts. The proposed SE algorithm avoids a global clustering of the entire graph, which spends too much time on irrelevant clusters. In addition, we develop the SEGE algorithm, which uses node features from graph embedding to achieve better performance; we demonstrate that graph embedding is a good complement to the similarity measure during the local clustering process.

  • Our experiments are performed on a real-world dataset including real money laundering activities confirmed by an anti-money laundering agency. The results show that our proposed SEGE algorithm achieves the best F1-score of 0.505 on average, indicating its prospects for wide application in actual MLI scenarios. For public verification, we also conduct our experiments on a synthetic dataset generated from a public money laundering model, which also verifies our conclusions.

  • We investigate the structure of the real ML dataset. Our local community detection procedure on the real-world benchmark reveals the sub-community structure of the ground truth of real money laundering. To the best of our knowledge, this is the first time a real ML group has been disclosed and its sub-community structure established. Clarifying the structural characteristics of a real ML group has guiding significance for AML. Our experiments show that ML accounts disperse notably but are clustered locally in the sense of trading volume; this remarkable clustering feature accounts for the difficulty of AML tasks to some extent. Based on this, we develop a practical combat strategy against an ML group: it takes the union of the sub-communities that our local algorithm yields starting from a very small fraction of the real ML accounts, and achieves extremely high F1-scores.

2 Related work

In this section, we review existing money laundering identification methods and community detection via graph representation learning, respectively.

2.1 Money laundering identification

Early money laundering detection methods are based on executable heuristic rules [10, 11], static record-based designs [12, 13] and statistical correlation analysis [14]. Later, to make AML more intelligent, machine-learning-driven models, such as SVMs, fuzzy logic, and decision trees, were studied to detect money laundering in different scenarios [15]. Other techniques within the artificial intelligence domain also help the identification of money laundering. For example, NextGen AML [16] takes a different approach by applying NLP models to analyze text contents such as account names, locations, and account details of the parties, and maps them to a domain knowledge graph to help human experts make judgments. However, these methods are usually individual-oriented and neglect the characteristics of group behavior such as smurfing. Furthermore, these machine learning models require labels, which increases the expense of label acquisition and also delays model updates. Therefore, a modern AML model requires both learning with fewer labels and a group-oriented detection strategy [17,18,19].

For graph-based AML approaches, studies have revealed many meaningful phenomena and behaviors. Social network analysis (SNA)-based algorithms [20, 21] utilize both profile and transaction information of customers to identify social relations, compute graph attributes such as centrality, hubness, PageRank and dense subgraphs [22,23,24,25], and find different attribute patterns to identify different roles of criminals. However, this kind of social information is not easy to obtain.

Spotlight [26] is a randomized sketching-based streaming anomaly detection algorithm that detects anomalous snapshots of activity in a sequence of time windows; it can spot the sudden appearance of a dense graph that is possibly related to money laundering. FlowScope [8] proposes a novel metric, Anomalousness, to assess the tendency of money laundering in a dense, multi-step, streaming-based transaction graph; it aims to detect the complete flow of money from source to destination. Starnini et al. [27] focus on smurfing behavior and propose an AML method based on smurf-like subgraph detection, where two typical patterns of smurf motifs are studied. FRAUDAR [28] proposes a subgraph-density-based suspicion metric for improving the accuracy of detecting both disguised and non-disguised frauds and finding fraudsters in the case of disguised or hijacked accounts. EigenPulse [29] proposes a row-augmented matrix with a sliding window to model the flow graph and finds density-surge subgraphs in the singular-spectrum-based flow graph. MonLAD [30] proposes a streaming graph-based model that computes residual statistical features and tries to explain such streaming-based behaviors. SpecGreedy [31] proposes a dense graph detection model that leverages graph spectral properties and a greedy peeling strategy and can be used for AML group mining. These methods are based on spotting specific or dense subgraphs that can be identified as money laundering behaviors.

Very recently, WinGNN [32] modeled dynamics in graph neural networks (GNNs) to understand evolution in dynamic graphs, introducing random sliding windows to obtain window-aware gradients on successive snapshots, which helps optimize spatio-temporal representations of real-world dynamic networks. On the other hand, open-source platforms have emerged, such as CogDL, a comprehensive library for graph deep learning [33]. CogDL optimizes GNN training loops using multiple training techniques (e.g., mixed-precision training) and develops efficient sparse operators, making it the most competitive graph library in terms of efficiency.

Although these graph-based methods consider money laundering behavior from a group view, they struggle to detect organized money laundering such as illegal private banks. The community-based characteristics and their relation to local and global network structures remain unexplored.

2.2 Community detection and representation learning in graphs

In social networks, users spontaneously form subsets according to their interactions and interests. Users in the same subset are more densely connected with each other than users from different subsets, and each such subset is regarded as a community. Generalizing to network analysis, the task of community detection is to automatically find clusters of nodes in complex networks that are tightly linked within a cluster and sparsely linked between clusters. Early community detection methods are mainly based on graph theory, such as maximum flow [34], split-based approaches [35], and modularity optimization [36]. Compared with graph-theory-based methods, deep learning models have the advantage of making full use of high-dimensional nonlinear structure and the features of nodes, neighborhoods, edges, subgraphs, etc., and such models are more resilient to sparse networks [37]. The emergence of deep graph representation learning enables the aggregation of neighborhood information of nodes in the graph, capturing complex features for community detection from a global view. For example, GAE [38] encodes hierarchical community structure into network embeddings by optimizing an objective with spherical constraints. CommunityGAN [39] combines a community membership graph model with the Generative Adversarial Network (GAN) framework to develop a novel framework for community detection. SEComm [40] combines the principle of self-expressiveness with a self-supervised graph neural network framework for unsupervised community detection; it uses self-expressiveness to generate a set of soft must-link or no-link constraints on subsets of nodes divided into batches.

It should be noted that the essence of GNN-based community detection is unsupervised clustering learning on graph embeddings. Therefore, graph embedding models that preserve community and subgraph structure information are equally effective, and several graph embedding methods may benefit community detection. Random-walk-based methods such as DeepWalk [41], node2vec [42] and LINE [43] generate sequence information by walking between nodes, so as to derive vector representations according to the context relationships in which sequences occur. Specifically, NetMF [44] shows the theoretical connections between skip-gram-based network embedding algorithms and the theory of graph Laplacians, unifying the DeepWalk, LINE, PTE, and node2vec models with negative sampling into a closed-form matrix factorization framework. Thus, these methods also share a similar theoretical basis with matrix-based methods such as NetMF, NetSMF [45] and ProNE [46]. GCN-based methods such as DGI [47] utilize graph convolutions to propagate one-hop or higher-order feature information from nodes, thereby generating deep semantic embeddings of nodes. AMAP [48] designs an adaptive subnetwork proposer guided by a supervised contrastive loss, based on neural network discovery of subnetwork structures. Therefore, AML has a clear motivation to benefit from the view of community analysis and graph representation learning.

3 The proposed method

In this section, we introduce the unsupervised clustering-based SEGE algorithm for money laundering identification from the view of community structure. We first introduce the basic idea of structural entropy minimization. Then, we propose the SEGE algorithm, focusing on the optimization of local graph clustering, and interpret its combinatorial essence.

3.1 Structural entropy minimization

Structural entropy is a measure of the hierarchical clustering structure of graphs [9], and has a series of applications in graph neural networks [49,50,51,52] and many practical fields [53,54,55,56]. Essentially, it measures the uncertainty of a random walk under a specific scheme, which indicates whether a cluster tree is “good”. A cluster tree T for graph G is a recursive partition of the vertices of G into smaller clusters as the tree nodes get deeper. A “good” cluster tree should yield a large number of low-conductance clusters at deep levels far from the root, where the conductance of a vertex set S is the ratio of the total weight of edges with exactly one endpoint in S to the volume of S. Consider a random walk on G that takes one step at each time to a neighbor chosen with probability proportional to the corresponding edge weight. Intuitively, a good cluster tree should trap the random walk within deep clusters frequently. So, when a transaction network G contains a money laundering group, the heavy edges that connect money laundering accounts tend to absorb the random walk with high probability into the group. This is why we adopt structural entropy for the MLI problem.

From another view, the cluster tree T is an encoding scheme for all vertices of G. The codeword for each node u of T consists of the labels of children along the unique path from the root to u. When a random walk in G takes one step from node x to node y, this step is within the cluster represented by the least common ancestor of the corresponding leaves x and y on T, denoted by \(x\vee y\). Thus, to measure the uncertainty of the present position, we only need to calculate the uncertainty within \(x\vee y\). Consider an encoding scheme for a sufficiently long random walk on G that has converged to the stationary distribution, which is proportional to node degrees. Instead of writing the plain codeword of each node (corresponding to a leaf on T) the walk visits, when it moves from x to y, we omit from the codeword of y the prefix corresponding to \(x\vee y\) and retain only the labels along the path from \(x\vee y\) to y.

Formally, given a weighted graph \(G=(V,E)\) with vertex set V and edge set E, let \(d_v\) be the weighted degree of a vertex \(v \in V\), \(vol(S) = \sum _{v \in S} d_v\) be the volume of a subset \(S\subseteq V\), and T be a cluster tree of G. For each cluster or leaf \(\alpha \) on T, let \(g_\alpha \) be the sum of weights of edges with exactly one endpoint in \(\alpha \). The uncertainty of the random walk is then measured by the average codeword length of each step:

$$\begin{aligned} \mathcal {H}^T (G) = -\sum _{\alpha \in T} \frac{g_{\alpha }}{vol(V)} \log \frac{vol(\alpha )}{vol(\alpha ^-)}, \end{aligned}$$
(1)

where \(\alpha ^-\) is the parent of \(\alpha \). For notational convenience, for the root \(\lambda \) of T, we set \(\lambda ^-=\lambda \). We call \(\mathcal {H}^T(G)\) the structural entropy of G on T. Moreover, we define the structural entropy of G to be the minimum over all cluster trees, denoted by

$$\begin{aligned} \mathcal {H}(G)=\min _{T}\{\mathcal {H}^T (G)\}. \end{aligned}$$
(2)

For hierarchical clustering problems, \(\mathcal {H}^T(G)\) is an effective objective function. In this paper, we regard money laundering community detection as a variant of the clustering problem and utilize \(\mathcal {H}^T(G)\) with the constraint that T has a fixed height \(d=2\). This constraint provides a robust objective function for flat clustering problems. That is,

$$\begin{aligned} \min _{T:\text {height(T)=2}}\{\mathcal {H}^T (G)\}. \end{aligned}$$
(3)

In this case, \(\mathcal {H}^T(G)\) is called the two-dimensional structural entropy, according to the two label dimensions of the codewords of leaves.

A clustering algorithm based on two-dimensional structural entropy minimization follows a standard agglomerative procedure [9]. Starting from the state in which each single vertex is treated as a cluster, we repeatedly merge the two clusters whose union decreases the two-dimensional structural entropy most, until a cut-off criterion is met. The agglomerative fashion is widely used for clustering and can incorporate any reasonable objective function. In our experiments, we use the two-dimensional structural entropy as the objective function for our algorithm SEGE in a local agglomerative process.
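To make the objective concrete, the following is a minimal Python sketch (ours, not the released implementation) that evaluates the two-dimensional structural entropy of a weighted graph under a flat partition, directly instantiating Eq. (1) for a height-2 cluster tree; it assumes networkx, positive 'weight' attributes, and no isolated vertices.

```python
import math
import networkx as nx

def two_dim_structural_entropy(G: nx.Graph, partition):
    """Two-dimensional structural entropy of Eq. (1) for a height-2
    cluster tree, i.e., a flat partition given as a list of vertex sets
    (sets, for O(1) membership tests). A minimal sketch."""
    deg = dict(G.degree(weight="weight"))          # weighted degrees d_v
    vol_V = sum(deg.values())                      # vol(V)
    H = 0.0
    for Vj in partition:
        vol_j = sum(deg[v] for v in Vj)            # vol(V_j)
        # g_j: total weight of edges with exactly one endpoint in V_j
        g_j = sum(d.get("weight", 1.0)
                  for u, v, d in G.edges(Vj, data=True)
                  if (u in Vj) != (v in Vj))
        # leaf terms: uncertainty of the walk inside V_j
        H -= sum(deg[v] / vol_V * math.log2(deg[v] / vol_j) for v in Vj)
        # cluster term: cost of the codeword prefix shared by V_j
        H -= g_j / vol_V * math.log2(vol_j / vol_V)
    return H
```

The agglomerative procedure then repeatedly evaluates this quantity on candidate merges and keeps the merge that lowers it most.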

3.2 SEGE: structural entropy-based local clustering algorithm

In practice, it is often unnecessary, and even infeasible, to compute a partition of the whole graph when the graph is too large. For the problem of money laundering detection, we take a common scenario as the basic assumption: a single suspicious account appears first, and then the MLI algorithm needs to dig out its group of accounts that have participated in money laundering activities. To this end, we propose a structural entropy-based local clustering algorithm to establish the money laundering detection, which is shown in Algorithm 1.

Algorithm 1: The SEGE Algorithm

Fig. 2: An illustration of the SE and SEGE algorithms. Let S be the predetermined set of suspicious nodes. The algorithm begins at state (a). In step (b), all neighboring nodes of S are identified. In step (c), the node with the maximum structural entropy variation within the neighbor set is selected and added to S, and the set of neighboring vertices is updated. Similarly, in step (d), the node with the maximum structural entropy variation within the neighbor set is added to S, and the list of neighboring vertices is updated. The final vertex subset of suspicious accounts is explored gradually in this recursive fashion

When given a graph embedding, the algorithm is essentially a localized version of the agglomerative process. Initially, SEGE computes a graph embedding \(\{em(a):a\in V\}\), and each single vertex is treated as a cluster. The average-linkage distance from a vertex u to a vertex set S, formulated as

$$\begin{aligned} d(u,S)=\Vert em(u)-\frac{1}{|S|}\sum _{a\in S} em(a) \Vert _2, \end{aligned}$$

is used together with structural entropy as the measure for community exploration. In each iteration, the algorithm considers the cluster S that includes the starting vertex v, together with its neighboring vertices, and calculates the difference between \(\mathcal {H}^{\mathcal {P}_S}(G)\) and \(\mathcal {H}^{\mathcal {P}_{S \cup \{u\}}}(G)\) for each neighboring vertex u. At the end of each iteration, the neighboring vertex whose reduction of the entire two-dimensional structural entropy, \(\Delta S(u)\), divided by d(u, S) is the largest is added to S. After \(k-1\) iterations, the proposed algorithm yields a cluster of size k that contains the starting vertex v. Any cut-off criterion can be specified. Figure 2 illustrates this local exploration process on a small example.
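The exploration loop can be sketched as follows (illustrative code of ours; `entropy_drop` is a hypothetical helper standing for the entropy reduction \(\Delta S(u)\), whose closed form is given in Theorem 1 below, and the SE variant simply skips the division by d(u, S)):

```python
import numpy as np

def sege_local_cluster(G, emb, start, k, entropy_drop):
    """Local exploration loop of Algorithm 1 (a sketch, not the released
    implementation). `emb` maps vertex -> embedding vector;
    `entropy_drop(G, S, u)` returns the two-dimensional structural
    entropy reduction caused by absorbing u into S."""
    S = {start}
    frontier = set(G.neighbors(start))
    while len(S) < k and frontier:
        centroid = np.mean([emb[a] for a in S], axis=0)
        # score = entropy reduction normalized by the average-linkage
        # embedding distance d(u, S); the SE variant drops the division
        def score(u):
            d_uS = np.linalg.norm(emb[u] - centroid)
            return entropy_drop(G, S, u) / max(d_uS, 1e-12)
        u_star = max(frontier, key=score)
        S.add(u_star)
        frontier |= set(G.neighbors(u_star))
        frontier -= S
    return S
```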

For a more intuitive interpretation of SEGE, we investigate the criterion of neighbor selection in each iteration. Let G[S] denote the subgraph induced by vertex set S in G, and \(w(S,u)\) denote the total weight of edges between S and vertex u. The following theorem indicates the combinatorial essence of this local process.

Theorem 1

In each iteration of Algorithm 1, the neighbor u of S with the greatest \(\delta '(S)\) will be picked, where

$$\begin{aligned} \delta '(S)=\frac{2w(S,u) \log vol(V)+ vol(G[S]) \log vol(S) -vol(G[S\cup \{u\}]) \log vol(S\cup \{u\})}{d(u,S)}. \end{aligned}$$

Theorem 1 implies that in the early steps, when vol(S) is much less than vol(V), or in the case of two neighbors with identical degree, the vertex u that has heavier connections to S and a smaller embedding distance from S will be picked. If a decreasing tendency of structural entropy is required, that is, \(\delta '(S)>0\), then vertices that are not neighbors of S need not be considered, since \(w(S,u)=0\) makes the numerator negative. This is why we only choose \(u \in N\) as candidates, which naturally guarantees the connectedness of S. Moreover, for two vertices with similar weights of connection to S and similar embedding distances from S, the one with the smaller degree, which has fewer connections to the complement of S, will naturally be picked. All the above properties coincide with our intuition of a “good”-quality cluster.

Proof

Consider the two-dimensional structural entropy, which is determined by a partition \(\mathcal {P}=\{V_1,V_2,\ldots ,V_\ell \}\); the corresponding cluster tree T has height \(d=2\). Then

$$\begin{aligned} \mathcal {H}^T(G) =&-\sum _{j=1}^\ell \sum _{v\in V_j} \frac{d_v}{vol(V)}\log \frac{d_v}{vol(V_j)} \\ &-\sum _{j=1}^\ell \frac{w(V_j,V\setminus V_j)}{vol(V)}\log \frac{vol(V_j)}{vol(V)}. \end{aligned}$$

We omit the calculation details and get

$$\begin{aligned} \mathcal {H}^{\mathcal {P}_S}(G) - \mathcal {H}^{\mathcal {P}_{S \cup \{u\}}}(G) =&\frac{1}{vol(V)} \big [ 2w(S,u) \log vol(V)+ vol(G[S]) \log vol(S) \\ &- vol(G[S\cup \{u\}]) \log vol(S\cup \{u\}) \big ]. \end{aligned}$$

The theorem follows from the observation that vol(V) is irrelevant to u. \(\square \)

According to Theorem 1, we know that the information required by SEGE in each iteration for updating S is locally computable. It does not require the information of nodes that are far away from S and u.
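As an illustration, the Theorem 1 score can be evaluated from a handful of locally maintained quantities (a sketch of ours; argument names are hypothetical):

```python
import math

def delta_prime(w_Su, vol_in_S, vol_S, deg_u, d_uS, vol_V):
    """Theorem 1 score from purely local quantities. w_Su = w(S, u);
    vol_in_S = vol(G[S]), i.e., twice the total internal edge weight of
    S; vol_S = vol(S); deg_u = weighted degree of u; d_uS = embedding
    distance d(u, S)."""
    vol_in_Su = vol_in_S + 2.0 * w_Su     # vol(G[S ∪ {u}])
    vol_Su = vol_S + deg_u                # vol(S ∪ {u})
    num = (2.0 * w_Su * math.log2(vol_V)
           + vol_in_S * math.log2(vol_S)
           - vol_in_Su * math.log2(vol_Su))
    return num / d_uS  # the neighbor maximizing this score is absorbed
```

Only w(S, u), the degree of u, and the running totals vol(G[S]) and vol(S) are needed, so each update touches nothing outside S and its frontier.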

To clarify the impact of graph embedding features, and also to reduce computation time and better adapt to practical scenarios, we derive a non-parametric variant of SEGE, namely SE. In SE, we no longer use graph embedding features of nodes; we rely only on structural entropy for local clustering, without normalizing \(\Delta S(u)\) by d(u, S). Thus, the time complexity of our proposed algorithm depends only on the output cluster and is irrelevant to the total numbers of vertices and edges. Suppose that the algorithm outputs S of size k, and denote by \(E_S\) the set of edges with at least one endpoint in S. Note that for any edge \(e=(u,v)\in E_S\), the variation \(\Delta S(u)\) (or \(\Delta S(v)\)), raised by absorbing u (or v) into S when v (or u) has been included, is updated at most k times. Thus, the total running time is \(O(k|E_S|)\). In practice, the community size \(k\ll |V|\) and \(|E_S|\ll |E|\), so our local algorithm is practically much more efficient than any global clustering algorithm.

4 Experiments

In this section, we evaluate the effectiveness of the proposed algorithm on two datasets constructed from two ground-truth groups of money laundering. We compare against state-of-the-art community detection and graph representation methods and give detailed analyses of the experimental results of both global and local algorithms. After that, we investigate the structure of the real money laundering dataset and find an obvious sub-community structure within it. To the best of our knowledge, this is the first time a real money laundering group has been disclosed and its sub-community structure established.

All models are run on a server with 2 Intel(R) Xeon(R) Gold 6240 CPUs @ 2.60 GHz and 378 GB memory, with 8 Tesla V100-PCIE GPUs (32 GB memory per GPU). The operating system is Ubuntu 18.04 LTS. The source code and public datasets are available at https://github.com/BlueP0int/SEGE.

4.1 Datasets

The experiments utilize a general framework for dataset generation given ground-truth money laundering transaction data. This framework is designed as follows.

The transaction data involves a list of accounts, associated with their clients, that have participated in money laundering activities. A weighted graph, denoted by \(G_0\), can be constructed naturally: accounts are linked by transactions, and edge weights between the same pair of accounts are summed. To simulate a reasonable testing environment, we generate a random graph \(G_\text {SBM}\) using the Stochastic Block Model (SBM) [57], with additional clusters as disturbing negative samples. The probabilities for the presence of edges are set appropriately for both intra-links and inter-links of all presumed clusters, such that the sparsity of each cluster is similar to (or even slightly higher than) that of the ground truth. For edge weights, we observed a power-law distribution for the weights of the real transactions, so we assign each edge in \(G_\text {SBM}\) a random positive weight drawn from a power-law distribution whose mean is \(\alpha =10\%\) of that in \(G_0\). We set this ratio because the amounts of cash flow in money laundering cases are usually much higher than those of daily civil transactions. The results are not sensitive to the ratio \(\alpha \), since any mild choice of \(\alpha \) yields similar results in our experiments. We then combine \(G_\text {SBM}\) with \(G_0\) as the final dataset for experiment and evaluation, denoted by G. The combination differs slightly for the two ground-truth datasets according to their characteristics, as introduced in detail below. Whenever we refer to precision and recall rates, we always mean these rates with respect to \(G_0\).
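A sketch of this environment generation under the above assumptions (networkx's stochastic_block_model plus heavy-tailed weights; parameter names are ours):

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def make_background(n_clusters, cluster_size, p_intra, p_inter, mean_w):
    """Build the disturbing environment G_SBM: a stochastic block model
    whose edge weights follow a power law with mean `mean_w` (set to
    alpha = 10% of the mean edge weight of G_0)."""
    sizes = [cluster_size] * n_clusters
    probs = [[p_intra if i == j else p_inter for j in range(n_clusters)]
             for i in range(n_clusters)]
    G = nx.stochastic_block_model(sizes, probs, seed=0)
    for u, v in G.edges():
        # numpy's pareto(2.0) is a Lomax variable with mean 1, so
        # rescaling by mean_w gives heavy-tailed weights of that mean
        G[u][v]["weight"] = rng.pareto(2.0) * mean_w
    return G

# cnBank environment (Sect. 4.1.1): 20 clusters of 2690 vertices
# G_sbm = make_background(20, 2690, p_intra=1e-3, p_inter=1e-6, mean_w=...)
```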

4.1.1 Real money laundering dataset, cnBank

The transaction data of the real-world money laundering case is provided by a department of the Chinese government (Footnote 1), and we name it cnBank. Among the positively labeled data, there are 2690 accounts and all transactions among them with concrete amounts; after summing, this amounts to 3435 weighted edges in \(G_0\). Moreover, 947 accounts, denoted by set \(V_1\), have no incoming edges, while 1316, denoted by set \(V_2\), have no outgoing edges. The remaining 427 accounts, denoted by \(V_3\), have both non-zero in-degree and non-zero out-degree and complete transaction histories, while the accounts in \(V_1 \cup V_2\) do not. For the environment, we generate an SBM graph with 20 clusters, each of which has 2690 vertices, the same as the size of \(G_0\). The probabilities for the presence of edges are set to \(10^{-3}\) for intra-links and \(10^{-6}\) for inter-links for all presumed clusters.

\(G_\text {SBM}\) and \(G_0\) are combined as follows. Since the transactions of \(V_1 \cup V_2\) are incomplete, we embed these accounts into \(G_\text {SBM}\) at random: we pick \(|V_1 \cup V_2|=2263\) vertices of \(G_\text {SBM}\) uniformly at random to represent \(V_1 \cup V_2\). \(V_3\) joins in as a new vertex set. All weighted edges of \(G_\text {SBM}\) and \(G_0\) are preserved. The edges of \(G_\text {SBM}\) associated with \(V_1 \cup V_2\) simulate the missing transactions.
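The combination step can be sketched as follows (our illustrative code; it assumes \(G_0\) uses account ids that do not collide with the integer vertex ids of \(G_\text {SBM}\)):

```python
def combine(G_sbm, G0, V12, rng):
    """Map each incomplete account in V1 ∪ V2 to a distinct random
    vertex of G_SBM, let V3 join as new vertices, and keep every
    weighted edge of both graphs (parallel weights are summed)."""
    targets = rng.choice(G_sbm.number_of_nodes(), size=len(V12), replace=False)
    mapping = dict(zip(V12, targets))
    G = G_sbm.copy()
    for u, v, d in G0.edges(data=True):
        uu, vv = mapping.get(u, u), mapping.get(v, v)  # V3 keeps its own ids
        w = d.get("weight", 1.0) + G.get_edge_data(uu, vv, {}).get("weight", 0.0)
        G.add_edge(uu, vv, weight=w)
    return G
```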

4.1.2 Synthetic money laundering dataset, MahData

The synthetic dataset, called MahData, is generated from a public tool for ML data generation (Footnote 2). The simulation strategy is based on the three processes of money laundering in financial transactions: money placement, money layering, and money integration. In our experiments, the ground truth \(G_0\) has 806 accounts and all transactions among them with concrete amounts. 255 accounts, denoted by set \(V_1\), have no incoming edges, while 350, denoted by set \(V_2\), have no outgoing edges. The remaining 201 accounts, denoted by \(V_3\), have non-zero in-degrees and out-degrees, while the accounts in \(V_1 \cup V_2\) do not. For the environment, each cluster of the SBM has 806 vertices, the same as the size of \(G_0\). The probabilities for the presence of edges are set to \(6\times 10^{-3}\) for intra-links and \(10^{-4}\) for inter-links for all presumed clusters.

\(G_\text {SBM}\) and \(G_0\) are combined similarly to cnBank. We embed \(V_1 \cup V_2\) into \(G_\text {SBM}\) at random, picking \(|V_1 \cup V_2|=605\) vertices of \(G_\text {SBM}\) uniformly at random to represent \(V_1 \cup V_2\). \(V_3\) joins in as a new vertex set. All weighted edges of \(G_\text {SBM}\) and \(G_0\) are preserved. The edges of \(G_\text {SBM}\) associated with \(V_1 \cup V_2\) simulate the missing transactions.

4.1.3 Privacy protection and ethical requirements

The desensitized banking data of the cnBank case in this paper correspond to crimes that have been recognized by the judiciary. The suspects and behaviors have been convicted through public security, procuratorate, and court trials, and their use as experimental samples is in line with the algorithm requirements. Meanwhile, the case data are desensitized to protect personal privacy and meet ethical requirements. The other case, MahData, comes from a public Kaggle dataset, a well-recognized model-generated case dataset in the industry. The Kaggle dataset meets industry ethical requirements.

We fully recognize the importance of ensuring that the data used in our study adheres to appropriate licenses and maintains robust privacy protection measures, especially when dealing with real-world money laundering cases. To address this, we implemented a comprehensive approach to guarantee the ethical use of the data throughout the research process.

In sourcing the data for our study, we collaborated closely with financial institutions, regulatory bodies, and law enforcement agencies. All datasets used in our research were obtained through legal channels and with explicit permissions. We made sure to adhere to any licensing agreements or restrictions imposed by the data providers, ensuring that our research complies with legal and ethical standards.

We took significant steps to anonymize and de-identify the data to safeguard the privacy of individuals involved in the real-world money laundering cases. Any personally identifiable information (PII) was carefully removed or encrypted to prevent the identification of individuals while still preserving the integrity of the dataset for analysis. Our commitment to privacy extends to every phase of data handling, from acquisition to processing and analysis. Furthermore, we elaborate on the steps taken to ensure that our research complies with relevant data protection regulations, including GDPR and regional laws issued by the Chinese government.

4.2 Baselines and evaluation criteria

Nine unsupervised learning models and two AML models serve as the baseline methods. The unsupervised learning models are listed as follows.

  • DeepWalk [41]: DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences.

  • LINE [43]: LINE utilizes an edge-sampling algorithm, and optimizes a carefully designed objective function that preserves both the local and global network structures.

  • Node2vec [42]: Node2vec defines the neighborhood of nodes and factorizes a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk.

  • HOPE [58]: HOPE is a scalable model to capture structures of networks and recover from partially observed networks. It preserves high-order proximities of large-scale networks and is capable of capturing asymmetric transitivity.

  • NetMF [44]: NetMF shows the theoretical connections between skip-gram-based network embedding algorithms and the theory of graph Laplacian, which unifies the DeepWalk, LINE, PTE, and node2vec models with negative sampling into the matrix factorization framework with closed forms.

  • NetSMF [45]: NetSMF makes the matrix that is directly constructed and factorized sparse rather than dense, leveraging theories from spectral sparsification to efficiently sparsify the dense matrix.

  • ProNE [46]: ProNE is an embedding pre-training model for very large-scale networks, which first initializes network embeddings by formulating the task as sparse matrix factorization, and then enhances the embeddings by propagating them in the spectrally modulated space.

  • DGI [47]: DGI is an unsupervised node representation framework that is based on mutual information maximization between local representations and a high-level summary of the structure of the entire graph.

  • MVGRL [59]: MVGRL introduces a self-supervised method for learning node-level and graph-level representations by applying contrastive learning on first-order neighbors and diffused graph nodes.

The two AML models include:

  • SpecGreedy [31]: SpecGreedy is an efficient and scalable model for dense graph mining. It leverages the graph spectral properties and greedy peeling strategy to solve application problems which can be formulated as dense subgraph detection, such as AML.

  • MonLAD [30]: MonLAD is an unsupervised and streaming model for AML by recognizing the patterns with key statistics of money laundering activities, especially the balance state of fan-ins and fan-outs for individual accounts.

For the starting (suspicious) vertex in cnBank, we pick uniformly at random a \(10\%\) fraction of the vertices of \(G_0\) and take each of them as the starting vertex, respectively. In MahData, we take each of the 806 vertices as the starting vertex.

For the cut-off condition of our algorithms SEGE and SE, we give a rough estimate of the ground-truth size, i.e., an empirically loose window around \(|G_0|\), and output the size in this window that achieves the minimum conductance of S. In Table 1, for cnBank, we take a window of length 1000 around \(|G_0|=2690\), that is, [2200, 3200]. For MahData, we take a window of length 400 around \(|G_0|=806\), that is, [600, 1000]. Further exploration of the choice of window length is presented in Sect. 4.3.2.
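This window rule can be sketched as follows (our code, not the released implementation; it tracks the conductance of the explored prefix incrementally):

```python
import math

def window_cutoff(G, order, lo, hi):
    """Among prefix sizes k in [lo, hi], return the one whose explored
    set S_k has minimum conductance. `order` is the sequence of vertices
    in the order the local algorithm absorbed them."""
    deg = dict(G.degree(weight="weight"))
    best_k, best_phi = None, math.inf
    S, cut, vol = set(), 0.0, 0.0
    for k, u in enumerate(order, start=1):
        for v in G.neighbors(u):
            w = G[u][v].get("weight", 1.0)
            cut += -w if v in S else w   # (u,v) flips between cut and internal
        S.add(u)
        vol += deg[u]
        phi = cut / vol                  # Eq. (7): w(S, complement) / vol(S)
        if lo <= k <= hi and phi < best_phi:
            best_k, best_phi = k, phi
    return best_k
```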

Based on the ground truth, we compare the precision rate, recall rate, and F1-score to evaluate the performance of money laundering identification. Suppose that \(G_0\) and S are the ground-truth and output communities, respectively. The precision rate is defined as

$$\begin{aligned} \text {Precision}=\frac{|G_0 \cap S|}{|S|}, \end{aligned}$$
(4)

while recall rate is defined as

$$\begin{aligned} \text {Recall}=\frac{|G_0 \cap S|}{|G_0|}. \end{aligned}$$
(5)

F1-score is the harmonic mean of Precision and Recall, which is formulated as

$$\begin{aligned} \text {F1}=\frac{2\cdot \text {Precision}\cdot \text {Recall}}{\text {Precision}+\text {Recall}}. \end{aligned}$$
(6)
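In code, on vertex sets, these three scores are a trivial sketch of Eqs. (4)-(6):

```python
def prf1(ground_truth, detected):
    """Precision, recall and F1 of Eqs. (4)-(6) on vertex sets."""
    tp = len(set(ground_truth) & set(detected))
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(ground_truth)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```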

For each learning-based baseline, we first generate node embeddings from the dataset. During the embedding process, we report the optimal performance of each model by testing different combinations of two hyper-parameters, the feature number and the dimension. The feature number decides the number of features per node, i.e., the length of the one-hot vector of transaction features. The dimension parameter defines the length of the embedding vector and ranges over [8, 16, 32, 64, 128, 256, 512]. After obtaining the network embedding, we use k-means to cluster the node embeddings generated by these models and output the cluster that contains the given starting vertex. For non-learning methods, we directly evaluate the results output by the model. Each score is averaged over all simulations conducted with different starting vertices.
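The resulting baseline pipeline is, in sketch form (ours; `emb` is any of the learned embeddings above, and hyper-parameters are chosen as described):

```python
import numpy as np
from sklearn.cluster import KMeans

def baseline_predict(emb, nodes, start, n_clusters):
    """Evaluation pipeline sketch for a learning-based baseline: cluster
    the learned node embeddings with k-means and output the cluster that
    contains the starting vertex. `emb` maps node -> vector; `nodes` is
    the node list (names are ours)."""
    X = np.stack([emb[v] for v in nodes])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    target = labels[nodes.index(start)]
    return {v for v, lab in zip(nodes, labels) if lab == target}
```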

4.3 Experiment results

Table 1 Result Comparison of the money-laundering account identification (the optimal and sub-optimal F1 scores are bold and underlined respectively)

We present the main results in Table 1. For each baseline model, we report the best F1-score among all dimension trials, and for SEGE the F1-score obtained with the graph embedding generated by the node2vec model. The influence of the embedding dimension is relatively stable; further exploration is presented in Fig. 4. Overall, many baseline methods do not perform well at identifying the embedded ground truth \(G_0\), while our proposed SEGE algorithm shows promising performance. In particular, SEGE achieves the optimal performance with an F1-score of 0.505, an improvement of \(9.50\%\) over the node2vec model, which achieves the best baseline score of 0.457. In addition, the non-parametric variant SE also outperforms all the baseline models with an F1-score of 0.484. As analyzed in the previous section, SE is highly efficient and has no learnable parameters, so it is well suited to detecting abnormal money flows in large-scale, high-volume application environments. For MahData, a similar result is observed: SEGE and SE outperform all baseline models.

Focusing on the recall and precision rates, we observe that some algorithms output communities with very one-sided rates, suggesting over-fitting or under-fitting. For example, MVGRL has an extremely high recall rate, but its precision rate is as low as 0.050, which means that it has merged almost all vertices into its largest cluster. On the other hand, some methods have high precision rates but low recall rates, indicating that they precisely find only a small part of the vertices of \(G_0\).

Fig. 3: Visualization of MLI results on cnBank using SE, SEGE (a single trial) and three representative baselines: node2vec [42], DGI [47] and NetSMF [45]. Compared with the ground truth, blue nodes denote correctly identified (true positive) accounts, while red nodes denote false negative and false positive accounts

To further show the effectiveness of the proposed SEGE and SE algorithms, Fig. 3 visualizes the prediction results of a single trial on the cnBank dataset, starting from the same vertex, alongside three representative baselines. In this figure, we show the nodes of the entire graph with labeled positive samples (i.e., accounts participating in money laundering) and omit the edges for better visualization. For each method, we mark the correctly identified nodes in blue and the wrongly predicted nodes in red, covering both false negatives and false positives. The difference between the prediction results of SEGE and those of node2vec and DGI is clearly visible, especially in the recall rates (blue nodes). The statistical results also show that our method performs better, indicating that our proposed algorithm makes full use of the advantages of SEM for money laundering identification. That is, the random walk in the transaction network tends to step along edges of high weight and, consequently, to be trapped with high frequency in the money laundering community. Tight connection can therefore be regarded as a typical characteristic of money laundering communities, which is detectable by structural entropy, a natural measure of the complexity of random walks that reveals clustering structures of low sparsity connected by heavy edges.

Fig. 4: Scores and community sizes achieved by SEGE with different embedding dimensions. a, b: SEGE on cnBank; c, d: SEGE on MahData. The ground-truth size of cnBank is 2690, and that of MahData is 806

Fig. 5: Scores achieved by SE and SEGE with sliding windows on the two datasets. a: SE on cnBank; b: SEGE on cnBank; c: SE on MahData; d: SEGE on MahData

4.3.1 Sensitivity of hyper-parameters

We investigate the sensitivity of the hyper-parameters. For the SEGE algorithm, we vary the embedding size. Figure 4 presents the MLI results from a single starting vertex for cnBank and from \(10\%\) randomly picked starting vertices for MahData, with embedding sizes ranging over [8, 16, 32, 64, 128, 256, 512]. The scores for MahData are averaged. For both datasets, the three scores are relatively stable under variation of the embedding dimension, and the detected community sizes approximate the ground truth. In particular for cnBank, the general trend is that the recall rate grows while the precision rate falls as the embedding size increases. When the embedding size is 32, SEGE reaches a relatively prominent performance point and derives the optimal F1-score with a proper detected community size.

4.3.2 Reasonableness of window sizes

We compare the scores in Fig. 5 when sliding the window over a wide range. In each subfigure, the x-axis indicates the center of the window and the y-axis the three scores. For both datasets, the F1-scores around the ground-truth sizes are relatively stable, where the precision and recall rates achieve a good tradeoff. This means that a rough estimate of the community size suffices for good performance.

4.4 Sub-community structure of the real ML dataset

Recall that some algorithms output communities with very one-sided rates in the above experiments. For those algorithms with a high precision rate but a low recall rate, it is likely that only a small sub-community of \(G_0\) has been dug out. This indicates that sub-communities may be pervasive within \(G_0\). Clarifying the structural characteristics of a real money laundering group has guiding significance for AML. To this end, we investigate the structure of the real ML dataset.

First, we apply the structural entropy-based global algorithm proposed in [9] to the subgraph induced by the ground truth \(G_0\), which yields 14 sub-communities. However, this cannot be used practically in a real AML scenario, since the ground-truth community is unknown and the sub-communities are too small to be mined by global algorithms. From now on, we treat the 14 sub-communities identified by the global algorithm as reference (“found-truth”) sub-communities, and use our local clustering algorithm SE to find them in the cnBank network.

Fig. 6: The variation of conductance in a community exploration process. Starting from a specific vertex, several conductance peaks appear during the exploration of the main community. The scores are evaluated according to the sub-community in which the starting vertex settles

Starting from a specific vertex, the sub-community structure can be observed to some extent from the variation of conductance during the procedure in which vertices are added one by one. Given a subset of vertices S, the conductance of S is defined as

$$\begin{aligned} \Phi (S)=w(S,\overline{S})/vol(S). \end{aligned}$$
(7)

Intuitively, if a series of sub-communities is explored one by one, the conductance has a sudden peak when the early vertices of the main community are included, and then decreases gradually as more vertices are included. Figure 6 demonstrates this phenomenon: a series of peaks can be observed during the exploration of the main community, which indicates the potential sub-community structure within it.

Next, we plot the P-R dotted diagram. The precision and recall rates are evaluated according to the sub-community in which the starting vertex settles. The k-th dot represents the recall and precision rates when the k-th vertex is added. Since the recall rate is non-decreasing, these dots are plotted from left to right. The cut-off criterion here is set according to a very interesting phenomenon of conductance variation. After the k-th vertex is added, we calculate the conductance \(\Phi (S_k)\) of the current set \(S_k\). The canonical relationship between \(\Phi (S_k)\) and k is shown in Fig. 7. All four subfigures, which correspond to four starting vertices, demonstrate a power-law feature: at the end of the long tail, the curve becomes flat at an apparent turning point. Therefore, our local algorithm explores up to and ends at this point. Our termination criterion is activated when two consecutive points keep the variation of conductance under logarithmic coordinates within a small range \(\varepsilon >0\). Formally, the local algorithm ends at \(S_k\) when both \(|\Phi (S_{k+1})-\Phi (S_k)|\) and \(|\Phi (S_{k+2})-\Phi (S_{k+1})|\) are at most \(\varepsilon \). In our experiments, we pick \(\varepsilon =5\times 10^{-3}\); however, the results are not very sensitive to the value of \(\varepsilon \), and any \(\varepsilon \) in the range \(10^{-3}\sim 10^{-2}\) yields similar results.
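This termination rule is a few lines over the recorded conductance sequence (a sketch of ours):

```python
def flatten_cutoff(phi, eps=5e-3):
    """`phi[k]` is the conductance of S_k (index 0 unused). Stop at the
    first k with |phi[k+1]-phi[k]| <= eps and |phi[k+2]-phi[k+1]| <= eps,
    i.e., at the flat turning point of the long tail."""
    for k in range(1, len(phi) - 2):
        if (abs(phi[k + 1] - phi[k]) <= eps
                and abs(phi[k + 2] - phi[k + 1]) <= eps):
            return k
    return len(phi) - 1
```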

Fig. 7: The correspondence between conductance and F1-score. Their turning points coincide under logarithmic scaling

Figure 7 shows both the conductance variations and the corresponding F1-scores of \(S_k\) starting from four distinct vertices. Figure 7a starts from the vertex of largest degree in \(G_0\), and Fig. 7b–d start from three randomly chosen vertices. For each starting vertex, the logarithm of \(\Phi (S_k)\) decreases linearly as the logarithm of k grows, while the logarithm of the corresponding F1-score grows linearly. Then, at almost the same turning point, both curves become flat. The linear monotonic growth of the logarithmic F1-score demonstrates that all the vertices our local algorithm finds before the turning point form a sub-community, and the turning point in conductance verifies this local sub-community's boundary.

Fig. 8: Zoomed-in precision and recall rates at the cut-off points for the four distinct starting vertices in \(G_0\)

Therefore, when we apply the aforementioned cut-off criterion, we can always find a sub-community with high precision. This conclusion is illustrated in Fig. 8, where Fig. 8a–d zoom in on the precision and recall curves as \(S_k\) grows for the four starting vertices; the algorithm terminates at the red star points. Although the recall rates are low, the precision rates are always very high.

Table 2 Comparison of cut-off and Max F1-score points

Table 2 lists the concrete positions and precision rates. It can be observed that the cut-off points \(k_c\) are always close to the corresponding F1-score peak positions \(k_{F1}\), in the sense that the difference ratios \(|k_{F1}-k_c|/k_{F1}\) are less than 0.09. Moreover, we also apply our local algorithm starting from every vertex in \(G_0\) and obtain 2690 local clusters with an average precision rate of 0.953, an average recall rate of 0.085, and an average F1-score of 0.154. The average recall rate is low due to the sub-community structure of \(G_0\); however, our local algorithm digs out sub-communities with extremely high precision. The total running time for all 2690 local clusters is 4.16 hours, averaging 5.56 seconds per cluster.

Based on the results of the previous experiments, we conduct the following simulation, which is close to a real combat strategy against a money laundering group. Suppose that we have k suspicious accounts. We run our local algorithm starting from each of these k accounts and take the union U of the k output clusters. How large a part of \(G_0\) can we retrieve from U? Given the extremely high average precision rate (0.953 over all 2690 starting points), if the recall rate of U is high, then \(G_0\) is expected to be retrieved with high quality.
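The strategy itself is a thin wrapper around the local algorithm (a sketch; `explore` is a hypothetical callable wrapping the SE exploration with the conductance-flattening cut-off):

```python
def union_strategy(G, seeds, explore):
    """Run the local exploration from each of the k suspicious seed
    accounts and pool the resulting sub-communities."""
    U = set()
    for v in seeds:
        U |= explore(G, v)
    return U
```

With around 30 random seeds out of the 2690 cnBank accounts, the union already recalls about 80% of \(G_0\) (Fig. 9).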

Fig. 9: Recall rates and F1-scores when multiple starting points are chosen randomly

The effectiveness of this strategy is verified in Fig. 9. We pick k starting points at random from the vertices of \(G_0\) and run our local algorithm with the aforementioned cut-off criterion. For each k, 20 independent random trials are made, and the curves of recall and F1-score versus k are plotted in Fig. 9a, b, respectively. The recall-rate curve rises quickly to 0.8 when the number of starting points reaches only about 30. This means that from a very small fraction (30/2690) of true initial accounts scattered (i.e., randomly picked) in the ground truth, we can dig out a large part (almost \(80\%\)) of the whole money laundering group. Correspondingly, the highest F1-score approaches 0.9 when k is around 50, which is the best number of suspicious accounts from which to retrieve the money laundering group with both high precision and recall rates.

5 Conclusion and future works

In this paper, we studied the anti-money laundering problem from the viewpoint of communities in transaction networks. We analyzed a real money laundering dataset and a synthetic one by embedding them into SBM random graphs, and modeled the AML problem as a clustering task. We proposed a new local clustering algorithm, SEGE, that starts from a given vertex and outputs a cluster containing it. This algorithm proceeds under the guidance of structural entropy minimization and node features from graph embedding. We gave a comprehensive comparison of our methods with modern graph learning methods. Equipped with an easy cut-off criterion, our proposed local algorithm is capable of mining money laundering communities with F1-scores much better than both the learning methods and the pure structural-entropy-based algorithm SE. Moreover, with a certain cut-off criterion, our local algorithm mines sub-communities with extremely high precision. When we take the union of the sub-communities that our local algorithm yields from a small fraction of ground-truth accounts, we can retrieve the whole ground-truth group with both high precision and recall rates.

For future directions, we emphasize that local clustering algorithms are very promising in practical use. Local algorithms focus only on the community that contains a designated vertex and utilize the valuable data at much lower cost. More local clustering algorithms and their potential applications are worth studying.