1 Introduction

An attributed network couples a network topology with attributes associated with nodes. Representative types of attributed networks include attributed graphs, attributed hypergraphs, and attributed multiplex graphs. Given an attributed network \(\mathcal {N}\), node clustering is an important task in graph mining, which aims to divide the n nodes of \(\mathcal {N}\) into k disjoint clusters, such that nodes within the same cluster are close to each other in the network topology and similar to each other in terms of attribute values. Clustering on attributed networks finds important applications in biological analysis [1], online marketing [2], social networks [3, 4], Web analysis [5], etc.

In this work, we present ANCKA, an effective and efficient attributed network clustering method that is versatile to support attributed hypergraph clustering (AHC), attributed graph clustering (AGC), and attributed multiplex graph clustering (AMGC). ANCKA subsumes our previous work AHCKA [6] that is dedicated to AHC. In what follows, we first elaborate on AHC and then generalize to AGC and AMGC.

In a hypergraph, each edge, referred to as a hyperedge, can join an arbitrary number of nodes. Hyperedges allow a precise description of multilateral relationships between nodes, such as the collaboration of multiple authors on a paper, interactions among proteins [7], products purchased together in one shopping cart, and transactions involving multiple accounts [8]. In practice, nodes in hypergraphs are often associated with many attributes, e.g., the academic profiles of authors and the descriptive data of products. The AHC problem is to divide the n nodes of such an attributed hypergraph into k disjoint clusters such that nodes within the same cluster are densely connected in the hypergraph structure and homogeneous in attribute characteristics. AHC finds numerous real-life applications in community discovery [9], organization structure detection [1], Web query analysis [5], biological analysis [10], etc. For example, AHC can cluster together highly relevant academic publications by considering co-authorship hyperedges and keyword attributes in academic hypergraphs [11].

Effective AHC computation is a highly challenging task, especially for large attributed hypergraphs with millions of nodes. First, nodes, hyperedge connections, and attributes are heterogeneous objects with inherently different traits, whose information cannot be seamlessly integrated in a simple and straightforward way. Second, as observed in previous works on simple graphs [12, 13], higher-order relationships between nodes and node-attribute associations are crucial for clustering. However, computing such multi-hop relationships and associations in attributed hypergraphs, whose hyperedges usually contain more than two nodes, is rather difficult due to the complex hypergraph structures and prohibitive computational overheads (up to \(O(n^2)\) in the worst case).

In the literature, a plethora of clustering solutions [14,15,16] have been developed for plain hypergraphs. These methods overlook attribute information, leading to severely compromised AHC result quality. Besides, a large body of research on attributed graph clustering has produced a cornucopia of efficacious techniques [2, 13]. However, most of these works cannot be directly applied to large attributed hypergraphs with more complex and unique structures. Inspired by the technical advances in the above fields, a number of efforts have been made towards AHC computation in the past years. The majority of AHC methods rely on non-negative matrix factorization [1, 5], which requires numerous iterations of expensive matrix operations and even the colossal space cost of materializing \(n\times n\) dense matrices. Moreover, none of them take into account the higher-order relationships between nodes, thereby limiting their result utility. The state-of-the-art approach GRAC [11] extends graph convolution [17] to hypergraphs, indirectly incorporating higher-order relationships of nodes and attributes for clustering. Notwithstanding its enhanced clustering quality, GRAC runs in \(O(n^2)\) time as a consequence of costly graph convolution and SVD operations, which is prohibitive for large hypergraphs. To recapitulate, existing AHC approaches either yield sub-optimal clustering results or incur tremendous computational costs, rendering them impractical for large attributed hypergraphs with millions of nodes.

Given the above, can we combine and orchestrate hypergraph topology and attribute information in an optimized way for improved clustering quality while achieving high scalability over large attributed hypergraphs? We offer a positive answer by presenting AHCKA (Attributed Hypergraph Clustering via K-nearest neighbor Augmentation), a novel AHC approach that significantly advances the state of the art in AHC computation. AHCKA surpasses existing solutions through several key techniques. The first is a K-nearest neighbor (KNN) augmentation scheme, which augments the original hypergraph structure with a KNN graph that connects each node to the K nodes with the highest attribute similarity to it. This design is motivated by a case study on a real dataset showing that incorporating either all pairwise attribute-based node connections or none of them jeopardizes the empirical clustering quality. Second, AHCKA formulates the AHC task as a novel optimization problem based on a joint random walk model that allows for the seamless combination of high-order relationships from both the hypergraph and the KNN graph. Further, AHCKA converts the original NP-hard problem into an approximate matrix trace optimization and harnesses efficient matrix operations to iteratively and greedily search for high-quality solutions. Lastly, AHCKA includes an effective initialization method that considerably facilitates the convergence of the optimization process within merely a handful of iterations. We conduct extensive experiments on attributed hypergraph data in different domains. Compared with baselines, AHCKA exhibits superior performance in both clustering quality and efficiency. For instance, on the Amazon dataset with 2.27 million nodes, AHCKA gains over 10-fold speedup and a significant improvement of \(4.8\%\) in clustering accuracy compared to the state of the art. Our work AHCKA has been published in [6].

In addition to attributed hypergraphs, attributed graphs and attributed multiplex graphs are prevalent in real-world scenarios, such as social networks [18] and citation networks [19]. Different from hypergraphs that allow more than two nodes to form an edge, in a graph, an edge connects exactly two nodes. A multiplex graph consists of multiple graph layers over a shared set of nodes, where different layers represent node connections from different perspectives or domains, e.g., different types of relationships or relations formed in different time frames or spaces [18, 19]. Attributed graph clustering (AGC) is one of the most significant graph mining problems, extensively studied in the literature [2, 13], with many applications, e.g., community detection in social networks [20] and functional cartography of metabolic networks [21]. Furthermore, there exists a rich collection of studies on attributed multiplex graph clustering (AMGC) [22,23,24,25], supporting important applications such as biological analysis [19], community detection [18], and social analysis [26]. A previous general framework [27] relies on expensive graph convolutions to support various clustering tasks.

In this work, we extend AHCKA for AHC to a versatile framework ANCKA that can efficiently handle all three attributed network clustering tasks (AHC, AGC, and AMGC) and produce high-quality clusters on large data. ANCKA inherits the powerful KNN augmentation scheme and the formulation of the clustering objective in AHCKA. We further develop a generalized joint random walk model in ANCKA with proper transition matrices to support random walks on KNN-augmented hypergraphs, graphs, and multiplex graphs alike. Efficient optimization techniques are applied in ANCKA to retain its advantage of high efficiency for clustering. Despite this superior efficiency, clustering million-scale datasets with ANCKA can still take dozens of minutes. Moreover, after observing the limited speedup obtained by increasing the number of CPU threads, we pinpoint the efficiency bottlenecks and design the GPU-accelerated ANCKA-GPU to boost efficiency to another level, especially on large-scale datasets. ANCKA-GPU comprises GPU-based optimization techniques and KNN construction procedures that accelerate these bottlenecks. We have conducted extensive experiments to compare ANCKA with 19 competitors on attributed hypergraphs, 16 competitors on attributed graphs, and 16 competitors on attributed multiplex graphs. In all three tasks, ANCKA obtains superior performance regarding both clustering quality and efficiency. The GPU implementation ANCKA-GPU further reduces time costs significantly, often by an order of magnitude on large datasets.

We summarize the contributions of this work as follows:

  • We devise a KNN augmentation scheme that exploits attributes to augment the original hypergraph structure in a cost-effective manner.

  • We formulate the AHC task as an optimization problem whose objective is a quality measure based on a joint random walk model over the KNN-augmented hypergraph.

  • We propose a number of techniques for efficient optimization of the objective, including a theoretically-grounded problem transformation, a greedy iterative framework, and an effective initialization approach that drastically reduces the number of iterations till convergence.

  • We justify the application of KNN augmentation to various types of networks, generalize the techniques, and design a versatile method ANCKA to efficiently perform AHC, AGC, and AMGC and produce high-quality clusters.

  • We develop ANCKA-GPU with customized GPU kernels to improve the efficiency further with a series of GPU-based optimizations while maintaining clustering quality.

  • The excellent performance of ANCKA is validated by comprehensive experiments against 19 AHC competitors, 16 AGC competitors, and 16 AMGC competitors, over real-world datasets.

The remainder of this paper is structured as follows: Sect. 2 introduces the preliminaries of AHC, AGC, and AMGC. Section 3 outlines the KNN augmentation strategy and random walk scheme for AHC, along with the AHC clustering objective. Section 4 offers a theoretical analysis of the proposed AHC method AHCKA, while Sect. 5 details the algorithmic procedures of AHCKA. Section 6 presents the versatile ANCKA framework for AHC, AGC, and AMGC. Section 7 discusses GPU-based techniques for enhancing clustering efficiency in ANCKA-GPU. Section 8 provides a comprehensive experimental evaluation. Section 9 reviews relevant literature, and Sect. 10 concludes the paper.

2 Preliminaries

Attributed Network. Let \(\mathcal {N}=(\mathcal {V}, \mathcal {E}, \textbf{X})\) be an attributed network, where \(\mathcal {V}\) is the node set with cardinality \(|\mathcal {V}|=n\), \(\mathcal {E}\) is the edge (or hyperedge) set with cardinality \(|\mathcal {E}|=m\), and \(\textbf{X} \in \mathbb {R}^{n\times d}\) is the node attribute matrix. A node \(v_j\in \mathcal {V}\) has degree \(\delta (v_j)\), which is the number of edges (or hyperedges) incident to \(v_j\). Each node \(v_j\) in \(\mathcal {V}\) is associated with a d-dimensional attribute vector, denoted as \(\textbf{X} [j]\), i.e., the j-th row of the node attribute matrix \(\textbf{X} \). We consider three types of attributed networks \(\mathcal {N}\), namely attributed hypergraphs \(\mathcal {H}\), attributed graphs \(\mathcal {G}\), and attributed multiplex graphs \(\mathcal {G}_M\), characterized by the different nature of \(\mathcal {E}\).

Attributed Hypergraph is denoted by \(\mathcal {H}=(\mathcal {V}, \mathcal {E}, \textbf{X})\). \(\mathcal {E}\) is the set of m hyperedges where each \(e_i\in \mathcal {E}\) is a subset of \(\mathcal {V}\) containing at least two nodes. A hyperedge \(e_i\) is said to be incident with a node \(v_j\) if \(v_j\in e_i\). We denote by \(\textbf{H} \in \mathbb {R}^{m\times n}\) the incidence matrix of hypergraph \(\mathcal {H}\), where each entry \(\textbf{H} [i,j]=1\) if \(v_j\in e_i\), otherwise \(\textbf{H} [i,j]=0\). Let diagonal matrices \(\textbf{D} _V\in \mathbb {R}^{n\times n}\) and \(\textbf{D} _E\in \mathbb {R}^{m\times m}\) represent the degree matrix and hyperedge-size matrix of \(\mathcal {H}\), where the diagonal entry \(\textbf{D} _V[j,j]=\delta (v_j)\) for \(v_j\in \mathcal {V}\) and \(\textbf{D} _E[i,i]=|e_i|\) for \(e_i\in \mathcal {E}\), respectively. Figure 1 shows an attributed hypergraph \(\mathcal {H}\) with 8 nodes and 5 hyperedges, where each node has an attribute vector and hyperedges \(e_1,e_2\) contain 4 and 3 nodes, i.e., \(\{v_1,v_2,v_4,v_5\}\) and \(\{v_1,v_3,v_4\}\), respectively.

Attributed Graph is denoted by \(\mathcal {G}=(\mathcal {V}, \mathcal {E}, \textbf{X})\), where every edge in \(\mathcal {E}\) connects exactly two nodes. A graph \(\mathcal {G}\) can be undirected or directed; an undirected edge can be viewed as two directed edges over the same node pair in reversed directions. Different from the hypergraph incidence matrix between nodes and hyperedges, the graph adjacency matrix \({\textbf{A}} \in \mathbb {R}^{n\times n}\) encodes the structure of \(\mathcal {G}\), where entry \({\textbf{A}}[i,j]\) is 1 if there is an edge from node \(v_i\) to node \(v_j\), i.e., \((v_{i},v_{j}) \in \mathcal {E}_G\), and 0 otherwise. Let \({\textbf{D}}\in \mathbb {R}^{n\times n}\) be the diagonal node degree matrix of \(\mathcal {G}\).

Attributed Multiplex Graph is \(\mathcal {G}_M=(\mathcal {V}, \mathcal {E}_1,..., \mathcal {E}_L,\textbf{X})\), consisting of L graph layers. Each layer l has its own edge set \(\mathcal {E}_l\) and can be viewed as an attributed graph \(\mathcal {G}_l\) with adjacency matrix \(\textbf{A} _l\) and diagonal node degree matrix \({\textbf{D}}_l\).
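
To make these definitions concrete, the following minimal sketch (ours, not from the paper) stores \(\textbf{H} \), \(\textbf{D} _V\), \(\textbf{D} _E\), and \(\textbf{X} \) as SciPy sparse matrices for a hypergraph like that of Fig. 1; hyperedges \(e_3\)–\(e_5\) and the attribute values are arbitrary placeholders since the figure is not reproduced here.

```python
import numpy as np
import scipy.sparse as sp

# Incidence matrix H (m x n) for 8 nodes and 5 hyperedges; e_1 = {v1,v2,v4,v5}
# and e_2 = {v1,v3,v4} follow the text, the remaining hyperedges are placeholders.
n, m = 8, 5
rows = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4]
cols = [0, 1, 3, 4, 0, 2, 3, 5, 6, 6, 7, 4, 5, 7]
H = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(m, n))

# D_V[j,j] = delta(v_j), the number of hyperedges incident to v_j (column sums);
# D_E[i,i] = |e_i|, the size of hyperedge e_i (row sums).
D_V = sp.diags(np.asarray(H.sum(axis=0)).ravel())
D_E = sp.diags(np.asarray(H.sum(axis=1)).ravel())

# Node attribute matrix X (n x d); a random placeholder standing in for
# binary keyword indicators.
X = sp.random(n, 4, density=0.5, format="csr")
```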

The Clustering Problem. Given an attributed network \(\mathcal {N}\) that can be \(\mathcal {H}\), \(\mathcal {G}\), or \(\mathcal {G}_M\), we study the clustering problem that encompasses attributed hypergraph clustering (AHC), attributed graph clustering (AGC), and attributed multiplex graph clustering (AMGC). Given a specified number k of clusters and an attributed network \(\mathcal {N}\), the clustering task is to divide the node set \(\mathcal {V}\) into k disjoint subsets \(\{\mathcal {C}_1, \dots , \mathcal {C}_k\}\) such that \(\bigcup _{i=1}^k \mathcal {C}_i=\mathcal {V}\) and the following properties are satisfied:

  1. Nodes within the same cluster are closely connected to each other in the network structure, while nodes in different clusters are far apart (structure closeness);

  2. Nodes in the same cluster have similar attribute values, while nodes in different clusters vary significantly in attribute values (attribute homogeneity).

For instance, when the input network \(\mathcal {N}\) is the attributed hypergraph \(\mathcal {H}\) in Fig. 1, \(\mathcal {H}\) is partitioned into two clusters \(\mathcal {C} _1\) and \(\mathcal {C} _2\). We can observe that nodes \(v_1\)-\(v_5\) in \(\mathcal {C} _1\) share similar attributes and are closely connected to each other, whereas nodes \(v_6,v_7\) and \(v_8\) form a cluster \(\mathcal {C} _2\) that is separated from \(\mathcal {C} _1\) with a paucity of connections and distinct attributes.

Fig. 1 An Example Attributed Hypergraph

3 Attributed hypergraph clustering

As mentioned, we first focus on attributed hypergraph clustering (AHC) and present our method AHCKA [6] in Sects. 3, 4, and 5. Specifically, we will devise a random walk scheme on a K-nearest neighbor augmented hypergraph and present the AHC objective in Sect. 3, conduct theoretical analysis to support the design of AHCKA in Sect. 4, and develop the algorithmic details of AHCKA in Sect. 5.

For the problem of AHC, a central challenge is how to simultaneously exploit both hypergraph structure and attribute information for improved clustering quality. In the literature, it is a natural and effective approach to augment the network structure with attribute similarity strengths [13, 28]. However, since a hypergraph yields different topological characteristics, as illustrated in Fig. 1, we argue that attribute augmentation should be conducted in a controllable way; otherwise, attributes may hamper, rather than improve, clustering quality, as shown in our experiments (Sect. 8.3).

Therefore, in this section, we first develop a carefully-crafted augmentation strategy that augments the hypergraph topology with node attributes, which benefits the clustering quality, as shown later. As this augmentation strategy is orthogonal to the topological nature of hypergraphs, its application to other types of networks, such as attributed graphs and attributed multiplex graphs, will be explained shortly in Sect. 6. Then we formulate Attributed Hypergraph Clustering as Augmented Hypergraph Clustering, with the same abbreviation AHC. The augmented hypergraph involves both hypergraph connections and augmented attribute connections, and it is challenging to define a unified way to preserve the high-order information of both sides. To tackle this, we design the \((\alpha ,\beta ,\gamma )\)-random walk to uniformly model the node relationships (in terms of both structural closeness and attribute similarity) in the augmented hypergraph. Based thereon, we define a multi-hop conductance (MHC) and formulate the objective of AHC as minimizing the MHC.

3.1 KNN augmentation

Although the vanilla augmentation strategy improves the clustering quality in attributed graphs [13, 28], to our knowledge, its effectiveness over attributed hypergraphs is as of yet under-explored. Moreover, it requires constructing a densely connected graph, causing severe efficiency issues on large graphs. To this end, we first demystify the attribute homogeneity of nodes within the same cluster through an empirical study on a real-world attributed hypergraph, the Cora-CA dataset containing 2.7k academic papers in 7 research fields (i.e., 7 clusters). Every node has an attribute vector indicating the presence of words in the corresponding publication. First, we use \(f(v_i,v_j)=cosine(\textbf{X} [i],\textbf{X} [j])\) to denote the attribute similarity of nodes \(v_i\), \(v_j\). We refer to \(v_j\) as the K-th nearest neighbor of \(v_i\) if \(f(v_i,v_j)\) is the K-th largest among all \(v_j\in \mathcal {V}\setminus \{v_i\}\). Figure 2 plots the average attribute similarity (AAS for short) \(f(v_i,v_j)\) of a randomly picked node \(v_i\) and its K-th nearest neighbor \(v_j\), as well as their ratio of co-occurring in the same cluster (RCC for short), when varying K from 1 to 1000. The AAS and RCC results from this real-world example demonstrate that two nodes with higher attribute similarity are also more likely to appear in the same cluster. Intuitively, applying the attribute-based augmentation strategy to hypergraphs can enhance the clustering results.

Fig. 2 AAS and RCC on Cora-CA (best viewed in color)

However, excessively augmenting the hypergraph with attribute information, namely, building too many connections between nodes according to attributes, will introduce distortion and adversely impact the clustering performance. To illustrate this, consider the example in Fig. 1, where nodes \(v_2\), \(v_3\) are in the same cluster as they share multiple common neighbors, while \(v_2\), \(v_7\) are not. If we were to assign a cluster to node \(v_2\) as per the additional connections created by attribute similarities, \(v_2\) and \(v_7\) would more likely end up in the same cluster than \(v_2\) and \(v_3\), given \(f(v_2,v_7)=0.41> f(v_2,v_3)=0\), which is counter-intuitive.

Therefore, unlike the vanilla augmentation strategy employed in prior works, we propose a KNN augmentation strategy. That is, given the input attributed hypergraph \(\mathcal {H}=(\mathcal {V}, \mathcal {E}, \textbf{X})\) and an integer K, we augment \(\mathcal {H}\) with an undirected KNN graph \(\mathcal {G}_K=(\mathcal {V},\mathcal {E}_K)\). More specifically, for each node \(v_i\in \mathcal {V}\), we identify K nodes in \(\mathcal {V}\) (excluding \(v_i\) itself) that are most similar to \(v_i\) in terms of attribute similarity computed based on a similarity function \(f(\cdot ,\cdot )\) as \(v_i\)’s neighbors in \(\mathcal {G}_K\), denoted by \(N_K(v_i)\). In other words, for every two nodes \(v_i\), \(v_j\) (\(v_j\in N_K(v_i)\)), we construct an edge \((v_i,v_j)\) with weight \(f(v_i,v_j)\) in \(\mathcal {E}_K\). Accordingly, the adjacency matrix \(\textbf{A} _K\) of \(\mathcal {G}_K\) is defined as follows:

$$\begin{aligned} \textbf{A} _K[i,j] = {\left\{ \begin{array}{ll} f(v_i,v_j), & \quad \text {if } v_j\in N_K(v_i) \text { or } v_i\in N_K(v_j)\\ 0, & \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

Thus, we obtain an augmented hypergraph \(\mathcal {H}_A\) containing the hypergraph \(\mathcal {H}_O=(\mathcal {V},\mathcal {E})\) and the KNN graph \(\mathcal {G}_K=(\mathcal {V},\mathcal {E}_K)\). The reasons that we only consider K nearest neighbors for augmented hypergraph construction are three-fold. In the first place, the case study in Fig. 2 suggests that there is no significant difference between the RCC of two random nodes (depicted by the gray dashed line) and that of two nodes \(v_i,v_j\) such that \(v_j\in N_K(v_i)\), when K is beyond a number (roughly 500 in Fig. 2). Therefore, such connections can be overlooked without impeding the clustering quality. Secondly, if we revisit the example in Fig. 1 and apply the KNN strategy (\(K=3\)) here, we can exclude the connection between \(v_2\) and \(v_7\) from \(\mathcal {G}_K\) since \(f(v_2,v_1)=f(v_2,v_4)=f(v_2,v_5)=0.5> f(v_2,v_7)=0.41\). The distortion issue mentioned previously is therefore resolved. In comparison with the densely connected graph that encodes all attribute similarities (with up to \(O(n^2)\) edges in the worst case), \(\mathcal {G}_K\) can be efficiently constructed by utilizing well-established approximate nearest neighbor techniques with \(O(n\log n)\) complexity [29, 30].
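
As an illustration, a KNN augmentation graph can be built as in the sketch below (our code, using scikit-learn's exact neighbor search; a production implementation would substitute the approximate NN techniques cited above [29, 30]):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X, K):
    """Adjacency matrix A_K of the KNN graph G_K per Eq. (1): each node is
    linked to its K most attribute-similar nodes, with cosine similarity
    f(v_i, v_j) as the edge weight."""
    nn = NearestNeighbors(n_neighbors=K + 1, metric="cosine").fit(X)
    dist, idx = nn.kneighbors(X)              # first neighbor is typically the node itself
    n = X.shape[0]
    rows = np.repeat(np.arange(n), K)
    cols = idx[:, 1:].ravel()                 # drop the self-neighbor column
    sims = 1.0 - dist[:, 1:].ravel()          # cosine distance -> similarity
    A_K = sp.csr_matrix((sims, (rows, cols)), shape=(n, n))
    return A_K.maximum(A_K.T)                 # symmetrize: G_K is undirected
```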

The range of the KNN neighborhood is determined by parameter K. While a larger K allows the KNN graph to include more attribute similarity relations, this also leads to a higher proportion of unwanted inter-cluster edges in the KNN graph as evidenced by the lower RCC in Fig. 2. Meanwhile, K cannot be too small (e.g., 5), or it will fail to utilize highly similar nodes that usually have high RCC. The trade-off of choosing K is evaluated in Sect. 8.3.

Now, the question lies in how to model the relationships of nodes in \(\mathcal {V}\) of the augmented graph \(\mathcal {H}_A\), which is a linchpin to AHC. In the following section, we present a joint random walk model that enables us to capture the multi-hop proximities of nodes over \(\mathcal {H}_O\) and \(\mathcal {G}_K\) jointly.

3.2 \((\alpha ,\beta ,\gamma )\)-random walk

Random walk with restart [31] (RWR) is one of the most common and effective models for capturing the multi-hop relationships between nodes in a graph [32], and is widely used in many tasks such as ranking [31, 33], recommendation [34], and clustering [35]. Given a graph \(\mathcal {G}\), a source node u, and a stopping probability \(\alpha \) (typically \(\alpha =0.2\)), at each step, an RWR originating from u either stops at the current node w with probability \(\alpha \), or, with the remaining \(1-\alpha \) probability, randomly picks an out-neighbor v of w according to the weight of edge (w, v) and navigates to v. It follows that the RWR score (a.k.a. personalized PageRank [36]) of any node pair (u, v) is the probability that an RWR from u ends at node v. Intuitively, two nodes with dense (one-hop or multi-hop) connections should have a high RWR score.

Nevertheless, RWR is designed for general graphs, and thus cannot be directly applied to our augmented hypergraph \(\mathcal {H}_A\) as it consists of a hypergraph \(\mathcal {H}_O\) and a general graph \(\mathcal {G}_K\). We devise a joint random walk scheme, named \((\alpha ,\beta ,\gamma )\)-random walk, which conducts the RWR process over \(\mathcal {H}_O\) and \(\mathcal {G}_K\) jointly to seamlessly integrate topological proximity over both networks. Definition 1 states the formal definition of the \((\alpha ,\beta ,\gamma )\)-random walk process.

Definition 1

Given an augmented hypergraph \(\mathcal {H}_A=(\mathcal {H}_O,\mathcal {G}_K)\) and a source node u, an \((\alpha ,\beta ,\gamma )\)-random walk W starting from u conducts at most \(\gamma \) steps and at each step proceeds as follows.

  • With probability \(\alpha \), W terminates at the current node \(v_i\);

  • with the other \(1-\alpha \) probability, W navigates to a node \(v_j\) picked by the following rules:

    • with probability \(\beta _i\), W draws an out-neighbor \(v_j\) of the current node \(v_i\) in \(\mathcal {G}_K\) according to probability \(\frac{\textbf{A} _K[i,j]}{\sum _{v_l\in N_K(v_i)}{\textbf{A} _K[i,l]}}\);

    • or with probability \(1-\beta _i\), W first draws a hyperedge e incident to \(v_i\) in \(\mathcal {H}_O\), and then draws node \(v_j\) from e uniformly at random.

Each node \(v_i\) is associated with a parameter \(\beta _i\) (see Eq. (2)) used to control the joint navigation between the hypergraph \(\mathcal {H}_O\) and the KNN graph \(\mathcal {G}_K\). The larger \(\beta _i\) is, the more likely the random walk jumps to the neighbors of \(v_i\) in \(\mathcal {G}_K\).

$$\begin{aligned} \beta _i = {\left\{ \begin{array}{ll} 0, & \quad \text {if } \textbf{X} [i] \text { is a zero vector};\\ 1, & \quad \text {else if } \delta (v_i)=0;\\ \beta , & \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(2)

In general, we set \(\beta _i\) to \(\beta \in [0,1]\), which is a user-specified parameter. In particular cases, when node \(v_i\)’s attribute vector \(\textbf{X} [i]\) is a zero vector, i.e., \(v_i\) provides no useful information to the KNN graph \(\mathcal {G}_K\), we set \(\beta _i\) to 0. Conversely, \(\beta _i\) is set to 1 if \(v_i\) is connected to none of the hyperedges, i.e., \(\delta (v_i)=0\). Let \(s(v_i,v_j)\) denote the probability that an \((\alpha ,\beta ,\gamma )\)-random walk from \(v_i\) stops at \(v_j\) in the end. Based on Definition 1, we can derive the following formula for \(s(v_i,v_j)\):

$$\begin{aligned} s(v_i,v_j)=\textbf{S} [i,j] = \alpha \sum _{\ell =0}^{\gamma } (1-\alpha )^{\ell }\textbf{P} ^{\ell }[i,j], \end{aligned}$$
(3)

where \(\textbf{P} \) is a transition matrix defined by

$$\begin{aligned} \textbf{P} = (\textbf{I}-\textbf{B})\cdot \textbf{D} _V^{-1} \textbf{H} ^\textsf{T}\textbf{D} _E^{-1} \textbf{H} + \textbf{B} \textbf{D} _K^{-1} \textbf{A} _K, \end{aligned}$$
(4)

\(\textbf{B} =diag(\beta _1, \dots , \beta _n)\) is a diagonal matrix containing the \(\beta _i\) parameters, and \(\textbf{D} _K\) is the diagonal degree matrix of \(\mathcal {G}_K\). \(\textbf{P} ^{\ell }[i,j]\) is the probability that an \(\ell \)-hop walk from \(v_i\) terminates at \(v_j\).
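
For illustration, the stopping probabilities \(s(v_i,\cdot )\) of Eq. (3) can be computed for one source node by applying the factored form of \(\textbf{P} \) in Eq. (4) to a vector, already avoiding the materialization of \(\textbf{P} \); a sketch under our own naming:

```python
import numpy as np
import scipy.sparse as sp

def rw_stop_probs(H, A_K, beta_vec, alpha=0.2, gamma=3, source=0):
    """s(source, .) of Eq. (3): the distribution of where an
    (alpha, beta, gamma)-random walk from `source` stops. The transition
    matrix P of Eq. (4) is applied factor by factor, never materialized."""
    d_v = np.maximum(np.asarray(H.sum(axis=0)).ravel(), 1)    # guard isolated nodes
    P_V = sp.diags(1.0 / d_v) @ H.T                           # n x m
    P_E = sp.diags(1.0 / np.asarray(H.sum(axis=1)).ravel()) @ H  # m x n
    d_k = np.maximum(np.asarray(A_K.sum(axis=1)).ravel(), 1e-12)
    P_K = sp.diags(1.0 / d_k) @ A_K                           # n x n
    x = np.zeros(H.shape[1]); x[source] = 1.0                 # e_i as a row vector
    s = alpha * x.copy()                                      # ell = 0 term of the sum
    coef = 1.0
    for _ in range(gamma):
        # x <- x P, with P = (I - B) P_V P_E + B P_K and B = diag(beta_vec)
        x = P_E.T @ (P_V.T @ ((1.0 - beta_vec) * x)) + P_K.T @ (beta_vec * x)
        coef *= 1.0 - alpha
        s += alpha * coef * x
    return s
```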

3.3 Objective function

In what follows, we formally define the objective function of AHC. Intuitively, a high-quality cluster \(\mathcal {C} \) in the augmented hypergraph \(\mathcal {H}_A\) should be both internally cohesive and well disconnected from the remainder of the graph with the consideration of multi-hop connections. Hence, if we simulate an \((\alpha ,\beta ,\gamma )\)-random walk W from any node in \(\mathcal {C} \), W should have a low probability of escaping from \(\mathcal {C} \), i.e., ending at any node outside \(\mathcal {C} \). We refer to this escaping probability \(\phi (\mathcal {C})\) as the multi-hop conductance (MHC) of \(\mathcal {C} \), defined in Eq. (5).

$$\begin{aligned} \textstyle \phi (\mathcal {C}) = \frac{1}{|\mathcal {C} |} \sum _{v_i \in \mathcal {C}} \sum _{v_j\notin \mathcal {C}} s(v_i, v_j) \end{aligned}$$
(5)

Since a low MHC \(\phi (\mathcal {C})\) reflects a high coherence of cluster \(\mathcal {C} \), we then formulate AHC as an optimization problem of finding k clusters \(\{\mathcal {C} _1,\dots ,\mathcal {C} _k\}\) such that their MHC \(\varPhi (\{\mathcal {C} _1, \dots , \mathcal {C} _k\})\) (Eq. (6)) is minimized.

$$\begin{aligned} \varPhi (\{\mathcal {C} _1, \dots , \mathcal {C} _k\}) = \frac{1}{k} \sum _{\mathcal {C} \in \{\mathcal {C} _1,\dots ,\mathcal {C} _k\}} \frac{1}{|\mathcal {C} |} \sum _{v_i \in \mathcal {C}} \sum _{v_j\notin \mathcal {C}} s(v_i, v_j) \end{aligned}$$
(6)

Directly minimizing Eq. (6) requires computing \(s(v_i,v_j)\) (Eq. (3)) of every two nodes \(v_i\in \mathcal {C} \), \(v_j\in \mathcal {V}\setminus \mathcal {C} \), \(\forall {\mathcal {C}}\in \{\mathcal {C} _1,\mathcal {C} _2,\) \(\cdots ,\mathcal {C} _k\}\), which is prohibitively expensive due to intractable computation time (i.e., \(O(n^3)\)) and storage space (i.e., \(O(n^2)\)). In addition, the minimization of \(\varPhi (\{\mathcal {C} _1, \dots , \mathcal {C} _k\})\) is an NP-complete combinatorial optimization problem [37], rendering the exact solution unattainable on large graphs.
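
For small networks where the dense matrix \(\textbf{S} \) can be afforded, Eqs. (5)–(6) can be evaluated verbatim; the reference sketch below (ours) is handy for sanity-checking the scalable computation developed in Sect. 5:

```python
import numpy as np

def mhc_naive(S, clusters):
    """Phi of Eq. (6): the average, over all k clusters, of the escaping
    probability phi(C) of Eq. (5). `clusters` is a list of node-index arrays;
    S[i, j] = s(v_i, v_j) is the dense stopping-probability matrix."""
    n = S.shape[0]
    phi = 0.0
    for C in clusters:
        inside = np.zeros(n, dtype=bool)
        inside[C] = True
        phi += S[np.ix_(inside, ~inside)].sum() / len(C)
    return phi / len(clusters)
```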

4 Theoretical analysis for AHCKA

This section presents the top-level idea of our proposed solution to AHC computation, AHCKA, and explains the intuitions behind it. At a high level, AHCKA first transforms the AHC objective in Eq. (6) into a matrix trace maximization problem and then derives an approximate solution via a top-k eigendecomposition. Note that any k non-overlapping clusters \(\{\mathcal {C} _1,\mathcal {C} _2,\cdots ,\mathcal {C} _k\}\) on \(\mathcal {H}\) satisfying \(\bigcup _{i=1}^k \mathcal {C}_i=\mathcal {V}\) can be represented by a binary matrix \({\textbf{Y}}\in \{0,1\}^{n\times k}\), where for each node \(v_i\) and cluster \(\mathcal {C} _j\)

$$\begin{aligned} {\textbf{Y}}[i, j] = {\left\{ \begin{array}{ll} 1, & v_i\in \mathcal {C} _j\\ 0, & v_i\in \mathcal {V}\setminus \mathcal {C} _j. \end{array}\right. } \end{aligned}$$
(7)

We refer to \({\textbf{Y}}\) as a binary cluster membership (BCM) matrix of \(\mathcal {H}\) and we use

$$\begin{aligned} h(\textbf{Y}) = {\textbf{Y}}({\textbf{Y}}^\textsf{T}{\textbf{Y}})^{-1/2} = \widehat{\textbf{Y}} \end{aligned}$$
(8)

to stand for the \(L_2\) normalization of \(\textbf{Y} \). Particularly, \(\widehat{\textbf{Y}}\) has orthonormal columns, i.e., \(\widehat{\textbf{Y}}^\textsf{T}\widehat{\textbf{Y}}=\textbf{I} _k\) where \(\textbf{I} _k\) is a \(k\times k\) identity matrix. Given k non-overlapping clusters \(\{\mathcal {C} _1,\mathcal {C} _2,\cdots ,\mathcal {C} _k\}\) and their corresponding BCM matrix \(\textbf{Y} \), it is trivial to show

$$\begin{aligned} \varPhi (\{\mathcal {C} _1,\dots ,\mathcal {C} _k\})= 1-\varPsi (\textbf{Y}), \end{aligned}$$
(9)

where \(\varPsi (\textbf{Y})\) is defined as follows:

$$\begin{aligned} \varPsi (\textbf{Y})=\frac{1}{k}trace(\widehat{\textbf{Y}}^\textsf{T}\textbf{S} \widehat{\textbf{Y}}). \end{aligned}$$
(10)

Equation (9) suggests that the minimization of MHC \(\varPhi (\{\mathcal {C} _1,\dots ,\mathcal {C} _k\})\) is equivalent to finding a BCM matrix \(\textbf{Y} \) such that the trace of matrix \(\widehat{\textbf{Y}}^\textsf{T}\textbf{S} \widehat{\textbf{Y}}\) is maximized. Due to its NP-completeness, instead of computing the exact solution, we utilize a two-phase strategy to derive an approximate solution as follows.

If we relax the binary constraint on \(\textbf{Y} \), the following lemma establishes an upper bound \(\psi _\sigma \) for \(\varPsi (\textbf{Y})\).

Lemma 1

Let \(\sigma _1\ge \sigma _2\ge \dots \ge \sigma _k\) be the k largest singular values of matrix \(\textbf{S} \) in Eq. (3). Given any matrix \(\textbf{W} \in \mathbb {R}^{n\times k}\) such that \(h(\textbf{W})\) satisfies \({h(\textbf{W})}^\textsf{T} h(\textbf{W})=\textbf{I} _k\), then \(\varPsi (\textbf{W}) \le \frac{1}{k}\sum _{i=1}^{k}\sigma _i = \psi _\sigma \).

Lemma 1 implies that if we can first find a fractional matrix \(\textbf{W} \) such that \(\varPsi (\textbf{W})\) is close to \(\psi _\sigma \), a high-quality BCM matrix \(\textbf{Y} \) can be converted from \(\textbf{W} \) by leveraging algorithms such as \(k\text {-}\texttt{Means}\) [38]. Although we can obtain such a fractional matrix \(\textbf{W} \) by applying trace maximization techniques [39] to Eq. (10), it still remains tenaciously challenging to compute \(\textbf{S} \). (All proofs are in the technical report [40].)

Lemma 2

Let the columns of \(\textbf{Q} \in \mathbb {R}^{n\times k}\) be the second to \((k+1)\)-th leading eigenvectors of \(\textbf{P} \) (Eq. (4)). Then, we have \(\varPsi (\textbf{Q}) = \frac{1}{k}\sum _{i=2}^{k+1}\lambda _i = \psi _\lambda \), where \(\lambda _2\ge \dots \ge \lambda _k\ge \lambda _{k+1}\) are the second to \((k+1)\)-th leading eigenvalues of \(\textbf{S} \), sorted by algebraic value in descending order.

We exclude the first eigenvector \(\frac{1}{\sqrt{n}}\cdot \textbf{1}\) of \(\textbf{P} \) as it is useless for clustering. By virtue of our analysis in Lemma 2, the second to \((k+1)\)-th leading eigenvectors \(\textbf{Q} \) of \(\textbf{P} \) (see Eq. (4)) can be regarded as a rough \(\textbf{W} \) since \(\varPsi (\textbf{Q})=\psi _\lambda \le \psi _\sigma \) and the gap between \(\psi _\lambda \) and \(\psi _\sigma \) is insignificant in practice. For instance, on the Cora-CA dataset, we can obtain \(\psi _\sigma = 0.668\) and \(\psi _\lambda = 0.596\) (i.e., \(\varPhi _\sigma =1-\psi _\sigma =0.332\), \(\varPhi _\lambda =1-\psi _\lambda =0.404\)), both of which are better than \(\varPsi (\textbf{Y} ^{*})=0.533\) (i.e., \(\varPhi ^{*}=1-\varPsi (\textbf{Y} ^{*})=0.467\)) of the ground-truth BCM matrix \(\textbf{Y} ^{*}\). Consequently, using the second to \((k+1)\)-th leading eigenvectors \(\textbf{Q} \) of \(\textbf{P} \) as the fractional solution \(\textbf{W} \) is sufficient to derive a favorable BCM matrix. Moreover, in doing so, we can avoid the tremendous overhead incurred by the materialization of \(\textbf{S} \).

To summarize, AHCKA adopts a two-phase strategy to obtain an approximate solution to the AHC problem. First, AHCKA computes the second to \((k+1)\)-th leading eigenvectors \(\textbf{Q} \) of \(\textbf{P} \). After that, AHCKA transforms \(\textbf{Q} \) into a BCM matrix \(\textbf{Y} \) through a discretization approach [41] that minimizes the difference between \(\textbf{Q} \) and \(\textbf{Y} \). The rationale is that \(\varPsi (\textbf{Q})=\varPsi (\textbf{Q} \textbf{R})\) if \(\textbf{R} \) is a \(k \times k\) orthogonal matrix, ensuring \(\textbf{R} ^\textsf{T}\textbf{R} =\textbf{I} _k\). Accordingly, we can derive a BCM matrix \(\textbf{Y} =\textbf{Q} \textbf{R} \) by minimizing the Frobenius norm \(||\textbf{Q}- \textbf{Q} \textbf{R} ||_F\) with a binary constraint exerted on \(\textbf{Q} \textbf{R} \). Note that we do not adopt \(k\text {-}\texttt{Means}\) over \(\textbf{Q} \) to get the BCM matrix \(\textbf{Y} \) as it deviates from the objective in Eq. (10), and thus, produces sub-par result quality, as revealed by experiments (Table 14).

Nevertheless, to realize the above idea, there still remain two crucial technical issues to be addressed:

  1. The brute-force computation of \(\textbf{Q} \) is time-consuming as it requires numerous iterations and the construction of \(\textbf{P} \).

  2. In practice, directly utilizing the exact or near-exact \(\textbf{Q} \) might incur overfitting towards the objective instead of ground-truth clusters, and hence, lead to sub-optimal clustering quality. It is challenging to derive a practically effective and robust BCM matrix \(\textbf{Y} \) from \(\textbf{Q} \).

5 The AHCKA algorithm

To circumvent the above challenges, AHCKA integrates the aforementioned two-phase scheme into an iterative framework, which enables us to approximate the second to \((k+1)\)-th leading eigenvectors \(\textbf{Q} \) without constructing \(\textbf{P} \) explicitly, and to greedily search for the BCM matrix \(\textbf{Y} \) with the best MHC. Figure 3 sketches the main ingredients and algorithmic procedure of AHCKA. More specifically, AHCKA employs orthogonal iterations [42] to approximate the second to \((k+1)\)-th leading eigenvectors \(\textbf{Q} \) of \(\textbf{P} \). AHCKA starts with an initial BCM matrix; each orthogonal iteration computes an approximate \(\textbf{Q} \) and an updated BCM matrix \(\textbf{Y} \) from \(\textbf{Q} \) through the \(\texttt{Discretize}\) algorithm [41]. Afterward, AHCKA inspects whether \(\textbf{Q} \) has converged and computes the MHC of the current BCM matrix \(\textbf{Y} \) via the CalMHC algorithm (Algorithm 2). If \(\textbf{Q} \) converges (i.e., the BCM remains nearly stationary) or the early termination condition is satisfied (i.e., the MHC of the current \(\textbf{Y} \) is satisfactory), AHCKA terminates. Otherwise, AHCKA enters the next orthogonal iteration with the updated \(\textbf{Q} \) and \(\textbf{Y} \).

Fig. 3 Overview of AHCKA

In what follows, a detailed description of AHCKA is given in Sect. 5.1. Section 5.2 introduces an effective approach InitBCM for initializing the BCM matrix \(\textbf{Y} \), which drastically curtails the number of iterations needed and significantly boosts the computation efficiency of AHCKA. The complexity of the complete algorithm is analyzed in Sect. 5.3.

5.1 Main algorithm

Algorithm 1: AHCKA

The pseudo-code of AHCKA is presented in Algorithm 1, which takes as input an attributed hypergraph \(\mathcal {H}\), transition matrix of attribute KNN graph \(\textbf{P} _K\), the number k of clusters, a diagonal matrix \(\textbf{B} \) containing n parameters defined in Eq. (2), the random walk stopping probability \(\alpha \), an error threshold \(\epsilon _Q\), the numbers \(\gamma , T_a\) of iterations, an integer \(\tau \), and an initial BCM matrix \({\textbf{Y}}^{(0)}\). AHCKA starts by computing the normalized BCM matrix \(\widehat{\textbf{Y}}^{(0)}= h(\textbf{Y} ^{(0)})\) (Eq. (8)) and setting the initial \(k+1\) leading eigenvectors \(\textbf{Q} ^{(0)}\) as \(\frac{1}{\sqrt{n}}\cdot \textbf{1} | \widehat{\textbf{Y}}^{(0)}\) (Lines 1-2), where | represents the horizontal concatenation and \(\frac{1}{\sqrt{n}}\cdot \textbf{1}\) is the first leading eigenvector of \(\textbf{P} \) since it is a stochastic matrix. After that, AHCKA enters into at most \(T_a\) orthogonal iterations for computing the \(k+1\) leading eigenvectors \(\textbf{Q} \) and the BCM matrix \(\textbf{Y} \) (Lines 3–10). At step t, orthogonal iteration updates the approximate \(k+1\) leading eigenvectors of \(\textbf{P} \) as \(\textbf{Q} ^{(t)}\) by the formula below (Lines 4–5):

$$\begin{aligned} \textbf{Q} ^{(t)}\textbf{R} ^{(t)}=\textbf{Z} ^{(t)}=\textbf{P} \textbf{Q} ^{(t-1)}, \end{aligned}$$
(11)

where \(\textbf{Q} ^{(t)}\) is obtained by a QR decomposition over \(\textbf{Z} ^{(t)}\). If t is sufficiently large, \(\textbf{Q} ^{(t)}\) will converge to the exact \(k+1\) leading eigenvectors of \(\textbf{P} \) [42]. Note that the direct computation of \(\textbf{Z} ^{(t)}=\textbf{P} \textbf{Q} ^{(t-1)}\) requires constructing \(\textbf{P} \) explicitly as per Eq. (4), which incurs an exorbitant amount of time and space (up to \(O(n^2)\) in the worst case). To mitigate this, we decouple and reorder the matrix multiplication as in Eq. (12).

$$\begin{aligned}&\textbf{Z} ^{(t)} = (\textbf{I}-\textbf{B})\cdot \textbf{P} _V \cdot \left( \textbf{P} _E \textbf{Q} ^{(t-1)}\right) + \textbf{B} \textbf{P} _K\cdot \textbf{Q} ^{(t-1)}, \end{aligned}$$
(12)
$$\begin{aligned}&\text {where } \textbf{P} _V=\textbf{D} ^{-1}_V\textbf{H} ^\textsf{T},\ \textbf{P} _E=\textbf{D} ^{-1}_E\textbf{H} \end{aligned}$$
(13)

\(\textbf{P} _V\) and \(\textbf{P} _E\) are two sparse matrices of \(\mathcal {H}\), and \(\textbf{P} _K=\textbf{D} _K^{-1} \textbf{A} _K\) is the sparse transition matrix of the KNN graph \(\mathcal {G}_K\) defined in Sect. 3.1. Note that all of them can be efficiently constructed in the preprocessing stage. As such, we eliminate the need to materialize \(\textbf{P} \) and reduce the time complexity of computing \(\textbf{Z} ^{(t)}\) to \(O(nk\cdot (\overline{\delta } +K))\), where \(\overline{\delta }\) denotes the average node degree.
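
A sketch (ours) of one orthogonal-iteration step, implementing Eqs. (11)–(12) with sparse factors and a dense QR decomposition:

```python
import numpy as np

def orthogonal_iteration_step(P_V, P_E, P_K, beta_vec, Q_prev):
    """One step of Eq. (11): Z = P Q^(t-1) computed via the decoupled
    form of Eq. (12) without materializing P, followed by QR to obtain
    the re-orthonormalized Q^(t)."""
    hyper = P_V @ (P_E @ Q_prev)            # hypergraph side, right-to-left
    knn = P_K @ Q_prev                      # KNN side
    Z = (1.0 - beta_vec)[:, None] * hyper + beta_vec[:, None] * knn
    Q, _ = np.linalg.qr(Z)                  # Z = Q R with orthonormal columns
    return Q
```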

Algorithm 2: CalMHC

After obtaining \(\textbf{Q} ^{(t)}\), AHCKA converts \(\textbf{Q} ^{(t)}\) into a new BCM matrix \(\textbf{Y} ^{(t)}\) (Lines 6–7) using the Discretize algorithm [41]. Notice that we conduct this conversion only every \(\tau \) iterations in order to avert unnecessary operations, as the difference between \(\textbf{Y} ^{(t)}\) and \(\textbf{Y} ^{(t-1)}\) is often insignificant.

Next, at Line 8, AHCKA invokes CalMHC (i.e., Algorithm 2) with a BCM matrix \({\textbf{Y}}^{(t)}\), other parameters including \(\textbf{P} _V\), \(\textbf{P} _E\), \(\textbf{P} _K\), \(\textbf{B} \), \(\alpha \), and the number of iterations \(\gamma \) as input to calculate the MHC \(\phi _t\) of the current BCM matrix \(\textbf{Y} ^{(t)}\). To avoid the materialization of \(\textbf{S} \) required in Eqs. (9) and (10), Algorithm 2 computes \(\phi _t\) in an iterative manner by reordering the matrix multiplications (Lines 2–3 in Algorithm 2). More precisely, at the \(\ell \)-th iteration, it obtains the intermediate result \(\textbf{F} ^{(\ell )}\) via the following equation:

$$\begin{aligned} \textbf{F} ^{(\ell )} = (1-\alpha ) \left( (\textbf{I}-\textbf{B})\, \textbf{P} _V \left( \textbf{P} _E \textbf{F} ^{(\ell -1)}\right) + \textbf{B} \textbf{P} _K \textbf{F} ^{(\ell -1)}\right) + \textbf{F} ^{(0)} \end{aligned}$$
(14)

\(\textbf{F} ^{(0)}\) is initialized as \(\alpha \widehat{\textbf{Y}}^{(t)}\) (Line 1 in Algorithm 2). It can be verified that \(\phi _t=1-\frac{1}{k}trace({\widehat{\textbf{Y}}^{(t)\top }} \textbf{F} ^{(\gamma )})\) (Line 4 in Algorithm 2).
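
The following sketch (ours; illustrative defaults) mirrors CalMHC: it normalizes the BCM matrix per Eq. (8) and iterates Eq. (14) so that \(\textbf{F} ^{(\gamma )}\) approximates \(\textbf{S} \widehat{\textbf{Y}}^{(t)}\):

```python
import numpy as np

def cal_mhc(P_V, P_E, P_K, beta_vec, Y, alpha=0.2, gamma=3):
    """MHC phi of a binary BCM matrix Y (Algorithm 2). After gamma
    iterations of Eq. (14), F approximates S @ Y_hat, so that
    phi = 1 - trace(Y_hat^T F) / k matches Eqs. (9)-(10)."""
    k = Y.shape[1]
    Y_hat = Y / np.maximum(np.sqrt(Y.sum(axis=0)), 1e-12)  # h(Y) of Eq. (8)
    F0 = alpha * Y_hat                                     # Line 1 of Algorithm 2
    F = F0.copy()
    for _ in range(gamma):
        hyper = P_V @ (P_E @ F)
        knn = P_K @ F
        F = (1 - alpha) * ((1.0 - beta_vec)[:, None] * hyper
                           + beta_vec[:, None] * knn) + F0
    return 1.0 - np.trace(Y_hat.T @ F) / k
```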

Once the convergence criterion of \(\textbf{Q} ^{(t)}\) (Eq. (15)) is satisfied, or the early termination condition (Eq. (16)) holds, AHCKA ceases the iterative process and returns the BCM matrix \(\textbf{Y} \) with the lowest MHC (Lines 9–11 in Algorithm 1).

$$\begin{aligned}&||\textbf{Q} ^{(t)}-\textbf{Q} ^{(t-1)}||<\epsilon _Q \end{aligned}$$
(15)
$$\begin{aligned}&\phi _{t-2\tau }<\phi _{t-\tau }<\phi _{t} \end{aligned}$$
(16)

Otherwise, AHCKA proceeds to the next orthogonal iteration. The rationale for the early termination condition in Eq. (16) is that, in practice, successive increases in \(\phi _t\) indicate that clusters with desirable MHC objective have been attained.

5.2 Greedy initialization of BCM

Akin to many optimization problems, AHCKA requires many iterations to achieve convergence when \(\textbf{Y} ^{(0)}\) is randomly initialized. To tackle this issue, we propose a greedy initialization technique, InitBCM, whereby we can immediately obtain a passable BCM matrix \(\textbf{Y} ^{(0)}\) and expedite convergence, as demonstrated by our experiments in Sect. 8.4.

The rationale of InitBCM is that most nodes tend to cluster together around a number of center nodes [43]. Therefore, we can first pick a set \(\mathcal {V}_c\) of top influential nodes w.r.t. the whole hypergraph, and calculate the multi-hop proximities (i.e., RWR scores) of each node to the influential nodes \(\mathcal {V}_c\) (i.e., centers). Then, the cluster center of each node can be determined by its proximity to nodes in \(\mathcal {V}_c\) accordingly.

Algorithm 3: InitBCM

Algorithm 3 displays the pseudo-code of InitBCM. Given a hypergraph \(\mathcal {H}\), transition matrices \(\textbf{P} _V, \textbf{P} _E\) defined in Eq. (13), the number k of clusters, random walk stopping probability \(\alpha \), and the number of iterations \(T_i\) as input, InitBCM begins by initializing an ordered set \(\mathcal {V}_c\) consisting of the k nodes with the k largest degrees in \(\mathcal {H}\) (sorted by their indices), which later serve as the cluster centers (Line 1). Then, a \(k\times n\) matrix \(\textbf{Z} _0\) is created, where for each integer \(j\in [1,k]\), \(\textbf{Z} _0[j,\mathcal {V}_c[j]]\) is set to 1 and all other entries to 0, with \(\mathcal {V}_c[j]\) denoting the node index of the j-th node in \(\mathcal {V}_c\) (Lines 2–3). Next, InitBCM launches \(T_i\) iterations to calculate the RWR scores of all nodes w.r.t. the k nodes in \(\mathcal {V}_c\) (Lines 5–6). Specifically, at the t-th iteration, we compute approximate RWR \(\varvec{\varPi }_c^{(t)}\) (Line 6):

$$\begin{aligned} \varvec{\varPi }_c^{(t)} = (1-\alpha ) \left( \varvec{\varPi }_c^{(t-1)} \textbf{P} _V\right) \cdot \textbf{P} _E + \varvec{\varPi }_0, \end{aligned}$$
(17)

where \(\varvec{\varPi }_0=\alpha \textbf{Z} _0\) (Line 4). Note that we reorder the matrix multiplications as in Eq. (17) so as to bypass the materialization of the \(n\times n\) matrix \(\textbf{P} _V\textbf{P} _E\). After obtaining \(\varvec{\varPi }_c^{(T_i)}\), InitBCM assigns the node \(\mathcal {V}_c[g(v_j)]\) as the cluster center to each node \(v_j\) in \(\mathcal {H}\) as per Eq. (18) (Lines 7–9).

$$\begin{aligned} g(v_j) = \arg \max _{1\le l\le k} \varvec{\varPi }_c^{(T_i)}[l, j], \end{aligned}$$
(18)

meaning that we pick a cluster center from \(\mathcal {V}_c\) such that its RWR score \(\varvec{\varPi }_c^{(T_i)}[l, j]\) w.r.t \(v_j\) is the highest. Finally, an \(n\times k\) binary matrix \(\textbf{Y} ^{(0)}\) is constructed by setting \(\textbf{Y} ^{(0)}[j,g(v_j)]\) to 1 for \(v_j\in \mathcal {V}\) and returned as the initial BCM matrix.
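
A compact sketch of InitBCM (ours; \(T_i=25\) is an illustrative default), operating on the transpose of \(\varvec{\varPi }_c\) so that all sparse-dense products keep the sparse matrix on the left:

```python
import numpy as np

def init_bcm(P_V, P_E, degrees, k, alpha=0.2, T_i=25):
    """Greedy BCM initialization (Algorithm 3): take the k highest-degree
    nodes as tentative centers, iterate the RWR recurrence of Eq. (17),
    and assign each node to its best center via Eq. (18)."""
    n = len(degrees)
    centers = np.sort(np.argsort(-np.asarray(degrees))[:k])  # sorted by index
    PiT0 = np.zeros((n, k))                 # transpose of Pi_0 = alpha * Z_0
    PiT0[centers, np.arange(k)] = alpha
    PiT = PiT0.copy()
    for _ in range(T_i):
        # transpose of Eq. (17): Pi <- (1 - alpha) (Pi P_V) P_E + Pi_0
        PiT = (1 - alpha) * (P_E.T @ (P_V.T @ PiT)) + PiT0
    g = np.argmax(PiT, axis=1)              # Eq. (18): best center per node
    Y0 = np.zeros((n, k))
    Y0[np.arange(n), g] = 1.0
    return Y0
```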

5.3 Complexity

One of the main computational costs of AHCKA stems from the sparse matrix multiplications, i.e., Line 4 in Algorithm 1, Line 3 in Algorithm 2, and Line 6 in Algorithm 3. We first consider Line 4 in Algorithm 1, i.e., Eq. (12). Since \(\textbf{Q} ^{(t-1)}\) is an \(n \times (k+1)\) matrix and the numbers of non-zero entries in sparse matrices \(\textbf{P} _V\), \(\textbf{P} _E\), and \(\textbf{P} _K\) are \(n\overline{\delta }\), \(n\overline{\delta }\), and nK, respectively, its complexity is \(O((n\overline{\delta }+nK)\cdot k)\) [44]. Analogously, according to Eqs. (14) and (17), both the time costs of Line 3 in Algorithm 2 and Line 6 in Algorithm 3 are bounded by \(O(n\overline{\delta }k)\). Recall that these three operations are conducted up to \(T_a\), \(\gamma \), and \(T_i\) times in Algorithms 1, 2, and 3, respectively. Therefore, the total time cost of sparse matrix multiplications is \(O(kn\overline{\delta }\cdot (T_a+T_i+\gamma ) + knKT_a)\). Moreover, in Algorithm 1, the QR decomposition at Line 5 takes \(O(k^2 n)\) time and Discretize [41] runs in \(O(k^2 n+k^3)\) time. Overall, the time complexity of AHCKA is \(O(kn\overline{\delta }\cdot (T_a+T_i+\gamma ) + knKT_a + k^2 n)\), which equals \(O(n\overline{\delta })\) when \(T_a, T_i, \gamma , k\), and K are regarded as constants. The space complexity of AHCKA is \(O(n\cdot (\overline{\delta }+K+k))\) as all matrices are in sparse form.

6 The ANCKA framework

In this section, we generalize AHCKA, which is dedicated to AHC, into a versatile framework ANCKA that handles all of AHC, AGC, and AMGC, as formulated in Sect. 2. ANCKA aims to efficiently find high-quality clusters on various types of networks \(\mathcal {N}\).

As mentioned, the proposed KNN augmentation in Sect. 3.1 is orthogonal to the high-order nature of hypergraphs, and therefore, we can apply the KNN augmentation to an input attributed network \(\mathcal {N}\) that can be an attributed hypergraph \(\mathcal {H}\), graph \(\mathcal {G}\), or multiplex graph \(\mathcal {G}_M\).

Recall that, in Fig. 2, we have empirically shown that nodes with higher attribute similarity are more likely to appear in the same cluster of a hypergraph \(\mathcal {H}\). This also holds for attributed graphs and attributed multiplex graphs. Figures 4 and 5 illustrate the AAS and RCC on the attributed graph Citeseer-DG and the attributed multiplex graph ACM, with binary keyword vectors as node attributes. On both datasets, nodes with higher attribute similarity (i.e., higher AAS with smaller K) are more likely to be in the same cluster (i.e., higher RCC). Moreover, above a certain K value, there is no significant difference between the RCC of two random nodes and that of two nodes \(v_i\) and \(v_j\) such that \(v_j\) is the K-nearest neighbor of \(v_i\). Based on these observations, it is viable to extend KNN augmentation in Sect. 3.1 to an attributed network \(\mathcal {N}\) with n nodes and attribute matrix \(\textbf{X} \in \mathbb {R}^{n\times d}\), by building a KNN augmentation graph \(\mathcal {G}_K\) via Eq. (1).

Then we obtain an augmented network \(\mathcal {N}_A\) with topology \(\mathcal {N}_O\) and KNN graph \(\mathcal {G}_K\), where \(\mathcal {N}_O\) is \((\mathcal {V}, \mathcal {E})\) when \(\mathcal {N}\) is an attributed hypergraph \(\mathcal {H}\) or \((\mathcal {V}, \mathcal {E}_G)\) for graph \(\mathcal {G}\), and \(\mathcal {N}_O\) is \((\mathcal {V}, \mathcal {E}_1,\dots ,\mathcal {E}_L)\) when \(\mathcal {N}\) is an attributed multiplex graph \(\mathcal {G}_M\).

Fig. 4 AAS and RCC on Citeseer-DG

Fig. 5 AAS and RCC on ACM

6.1 Generalized \((\alpha ,\beta ,\gamma )\)-random walk

For the augmented network \(\mathcal {N}_A = (\mathcal {N}_O, \mathcal {G}_K)\), define \(\textbf{P} _N\) and \(\textbf{P} _K\) as the random walk transition matrices of \(\mathcal {N}_O\) and \(\mathcal {G}_K\) respectively. The generalized \((\alpha ,\beta ,\gamma )\)-random walk on \(\mathcal {N}_A\) is an RWR process over the augmented network \(\mathcal {N}_A\), similar to the case of attributed hypergraphs in AHCKA. The difference from Definition 1 is that when the random walk navigates to another node, with probability \(1-\beta _i\), an out-neighbor is drawn from the distribution of \(\textbf{P} _N\) instead of incident hyperedges. This generalized random walk can also be characterized by the probability in Eq. (3), with transition matrix \(\textbf{P} \) given as follows.

$$\begin{aligned} \textbf{P} = (\textbf{I}-\textbf{B})\cdot \textbf{P} _N + \textbf{B} \cdot \textbf{P} _K. \end{aligned}$$
(19)

We now formulate \(\textbf{P} _N\) for different types of networks, including attributed hypergraphs as one special case.

Attributed Hypergraph \(\mathcal {H}\). When \(\mathcal {N}_O\) is a hypergraph with hyperedge incidence matrix \(\textbf{H} \), based on Eq. (4), \(\textbf{P} _N\) is given below. \(\textbf{P} _N\) combines the transition probability \(\textbf{P} _V\) from a node to its incident hyperedges and the transition probability \(\textbf{P} _E\) from each hyperedge to the nodes it connects.

$$\begin{aligned} \textbf{P} _N = \textbf{P} _V \textbf{P} _E, \text { where } \textbf{P} _V=\textbf{D} ^{-1}_V\textbf{H} ^\textsf{T}\text { and } \textbf{P} _E=\textbf{D} ^{-1}_E\textbf{H}. \end{aligned}$$
(20)

Attributed Graph \(\mathcal {G}\). When \(\mathcal {N}_O\) is an undirected graph, we can acquire the transition matrix \(\textbf{P} _N\) in Eq. (21). If \(\mathcal {N}_O\) is directed, we introduce a reversed edge for each edge and consider bidirectional connections between nodes to get \(\textbf{A} \), \(\textbf{D} \), and subsequently \(\textbf{P} _N\).

$$\begin{aligned} \textbf{P} _N = \textbf{D} ^{-1}\textbf{A}, \end{aligned}$$
(21)

where \(\textbf{A} \) is the adjacency matrix and \(\textbf{D} \) is the degree matrix.

Attributed Multiplex Graph \(\mathcal {G}_M\). When \(\mathcal {N}_O\) is a multiplex graph comprising L layers with the same node set \(\mathcal {V}\), the l-th layer has its own edge set \(\mathcal {E}_l\) representing a unique type of connections. The overall goal of the clustering task is to make cluster assignments that capture the collective structure of the multiplex graph, transcending the differences across layers. To achieve this, intuitively, we treat every layer equally and compute \(\textbf{P} _N\) as in Eq. (22), while layer weighting is left as future work [27]. Given the degree matrix \(\textbf{D} _l\) and adjacency matrix \(\textbf{A} _l\) of every l-th layer, we get the layer’s random walk transition matrix \(\textbf{D} _l^{-1}\textbf{A} _l\), and then compute \(\textbf{P} _N\) of the multiplex graph by averaging the layer-specific transition matrices. Consequently, from the current node v, a random walk has probability 1/L of selecting each layer \(\mathcal {G}_l\); within the chosen layer, the next node to visit is picked uniformly at random from the out-neighbors of v in \(\mathcal {G}_l\).

$$\begin{aligned} \textbf{P} _N = \frac{1}{L} \sum _{l=1}^L \textbf{D} _l^{-1}\textbf{A} _l, \end{aligned}$$
(22)

where \(\textbf{D} _l\) and \(\textbf{A} _l\) are the degree matrix and adjacency matrix of the l-th layer.
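
A sketch (ours) assembling \(\textbf{P} _N\) for the three network types per Eqs. (20)–(22). Note that for hypergraphs, ANCKA keeps \(\textbf{P} _V\) and \(\textbf{P} _E\) factored as in Eq. (12) rather than materializing their product; the sketch materializes it only for brevity.

```python
import numpy as np
import scipy.sparse as sp

def transition_matrix(kind, data):
    """P_N for the network topology N_O. `kind`/`data` is a hypothetical
    encoding: ('hypergraph', H), ('graph', A), or ('multiplex', [A_1..A_L])."""
    def row_norm(M):
        deg = np.asarray(M.sum(axis=1)).ravel()
        return sp.diags(1.0 / np.maximum(deg, 1e-12)) @ M
    if kind == "hypergraph":                 # Eq. (20): P_N = P_V P_E
        H = data
        P_V = sp.diags(1.0 / np.maximum(
            np.asarray(H.sum(axis=0)).ravel(), 1e-12)) @ H.T
        P_E = row_norm(H)
        return P_V @ P_E                     # kept factored in ANCKA itself
    if kind == "graph":                      # Eq. (21): P_N = D^-1 A
        return row_norm(data)
    # Eq. (22): average of the per-layer transition matrices
    return sum(row_norm(A) for A in data) / len(data)
```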

6.2 ANCKA algorithm

With the random walk transition matrix \(\textbf{P} \) formulated above for various types of attributed networks \(\mathcal {N}\), Eq. (3) can be reused to calculate \(\textbf{S} [i,j]\), the probability of a generalized \((\alpha ,\beta ,\gamma )\)-random walk from \(v_i\) stopping at \(v_j\) in the end. The objective function in Sect. 3.3 is naturally extended to ANCKA. Consequently, our theoretical analysis in Sect. 4 remains valid for ANCKA over attributed networks that can be hypergraphs, graphs, and multiplex graphs.

The pseudo-code of ANCKA is outlined in Algorithm 4. At Line 1, it obtains the transition matrix \(\textbf{P} _K\) for the attribute KNN augmentation. Then, as a framework supporting various attributed networks, ANCKA generalizes Algorithms 1–3 with the transition matrix \(\textbf{P}_{N} \) computed depending on the network type at Line 2. \(\textbf{P}_{N} \) is then used throughout the algorithm as a part of the generalized \((\alpha , \beta , \gamma )\)-random walk. The greedy initialization of clusters in Lines 3–11 resembles the procedure of InitBCM with the corresponding \(\textbf{P}_{N} \) for RWR simulation. At Line 3, ANCKA picks the k nodes in \(\mathcal {N}\) with the largest degrees as tentative cluster centers; when \(\mathcal {N}\) is an attributed multiplex graph, we rank the nodes by their summed degrees across all layers.

Lines 12–24 describe the main clustering process of ANCKA, which extends the hypergraph-specific Algorithms 1 and 2 with modifications to support attributed graphs and multiplex graphs. First, in orthogonal iterations, calculating \(\textbf{Z} ^{(t)}\) is dependent on the type of \(\mathcal {N}\). Second, the MHC objective for general networks stems from the analysis in Sect. 4, while the formulation with \(\textbf{P}_{N} \) is slightly different. In particular, to get MHC \(\phi _t\) without materializing the dense matrix \(\textbf{S} \) in Eq. (10) that is expensive to compute, we iteratively obtain \(\phi _t\) via the intermediate matrix \(\textbf{F} ^{(\ell )}\) in Eq. (23) at Line 21.

$$\begin{aligned} \textbf{F} ^{(\ell )} = (1-\alpha ) \left( (\textbf{I}-\textbf{B}) \textbf{P}_{N} \textbf{F} ^{(\ell -1)} + \textbf{B} \textbf{P} _K \textbf{F} ^{(\ell -1)}\right) + \textbf{F} ^{(0)}, \end{aligned}$$
(23)

where \(\textbf{P}_{N} \) is Eqs. (20), (21), or (22), depending on the type of \(\mathcal {N}\). Finally, ANCKA adopts the early stopping criteria in Line 24 and returns the clusters with the lowest MHC obtained.

Algorithm 4: ANCKA

Complexity. When \(\mathcal {N}\) is an attributed graph, constructing transition matrix \(\textbf{P}_{N} \) takes \(O(n\overline{\delta })\) time, where \(\overline{\delta }\) is the average node degree. For a multiplex network \(\mathcal {N}\) with L layers, the previous results are still valid when L is regarded as constant, as \(\textbf{P}_{N} \) is aggregated from the transition matrices of all simple graph layers. Given that the number of nonzero entries in \(\textbf{P}_{N} \) is subject to \(O(n\overline{\delta })\), ANCKA (Algorithm 4) has the same complexity as Algorithm 1. According to our analysis in Sect. 5.3, the time complexity of ANCKA is \(O(kn(\overline{\delta }+K+k))\) while its space complexity is \(O(n(\overline{\delta }+K+k))\). Since k and K can be viewed as constants, ANCKA has space and time complexity of \(O(n\overline{\delta })\).

7 GPU-accelerated ANCKA-GPU

On large attributed networks, e.g., the Amazon and MAG-PM hypergraphs, each with more than 2 million nodes, as reported in Table 7, AHCKA with 16 CPU threads still needs 1286 s and 1372 s, respectively, for clustering, despite its superior efficiency compared with baselines. Moreover, AHCKA does not exhibit acceleration proportional to the number of CPU threads used. As shown in Fig. 6, when the number of CPU threads is raised from 1 to 32, the time drops from around 3000 to 1200 s, a speedup of merely 2.5 (Amazon) or 2.7 (MAG-PM). In particular, increasing the number of threads from 16 to 32 provides rather limited acceleration (less than 10%).

To overcome the limitation of CPU parallelization, we resort to the massive parallel processing power of graphics processing units (GPUs) and develop ANCKA-GPU to boost efficiency, achieving about one order of magnitude speedup on large networks with millions of nodes in our experiments. For example, ANCKA-GPU needs only 120 s on the MAG-PM dataset, over 10 times faster than the 1372 s of ANCKA. Compared to CPUs, the design of GPUs enables them to leverage numerous threads to process data simultaneously, which benefits vector and matrix operations at scale. Please see [45] for details on GPU computing.

As shown in Fig. 15 of Sect. 8.5 for runtime analysis, the major time-consuming components of ANCKA include invoking Discretize (Line 18 in Algorithm 4), the construction of the KNN graph \(\mathcal {G}_K\), and expensive matrix operations in orthogonal iterations, greedy initialization, and MHC evaluation. With the CuPy library, matrix operations throughout Algorithm 4 can be performed on GPUs more efficiently, as illustrated below. In the following, we elaborate on the GPU-based discretization and \(\mathcal {G}_K\) construction techniques adopted in ANCKA-GPU.
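As an illustration (not the authors' code), the dominant sparse-dense products of the orthogonal iterations can be offloaded to the GPU with CuPy as follows; the matrix sizes here are arbitrary.

```python
# Illustrative only: offloading a sparse-dense multiplication
# (the dominant cost of orthogonal iterations) to the GPU via CuPy.
import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cpsp

P = sp.random(10_000, 10_000, density=1e-3, format="csr", dtype=np.float32)
Z = np.random.rand(10_000, 16).astype(np.float32)

P_gpu = cpsp.csr_matrix(P)       # transition matrix copied to GPU memory
Z_gpu = cp.asarray(Z)            # dense factor copied to GPU memory
out = cp.asnumpy(P_gpu @ Z_gpu)  # multiply on GPU, copy result back
```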

GPU-based Discretization Discretize-GPU. ANCKA uses the off-the-shelf Discretize approach [41] to compute discrete cluster labels \(\textbf{Y} \) from real-valued eigenvectors \(\textbf{Q} \), which can cost substantial time on large datasets. Here, we develop a CUDA kernel Discretize-GPU for efficiency. In what follows, we first explain how the discretization algorithm optimizes the objective in Definition 2 and then present the design of Discretize-GPU in Algorithm 5.

Given an eigenvector matrix \(\textbf{Q} \) with its row-normalized matrix \({\tilde{\textbf{Q}}}\), discretization aims to find a discrete solution \(\textbf{Y} _{opt}\) that minimizes the objective in Definition 2.

Definition 2

(Discretization [41]) The solution to the following optimization problem is the optimal discrete \(\textbf{Y} _{opt}\).

$$\begin{aligned} \textbf{Y} _{opt} =&\ \mathop {\textrm{argmin}}\limits _{\textbf{Y}}\ ||\textbf{Y}-{\tilde{\textbf{Q}}}\textbf{R} ||_F^{2} \\ &\ \text {s.t. } \textbf{Y} \in \{0,1\}^{n\times k},\ \textbf{Y} \textbf{1}_k =\textbf{1}_n,\ \textbf{R} \in \mathbb {R}^{k\times k},\ \textbf{R} ^{\textsf{T}} \textbf{R} =\textbf{I} _k, \end{aligned}$$

where \({\tilde{\textbf{Q}}}\) is the row-normalized matrix of an eigenvector matrix \(\textbf{Q} \), \(\textbf{R} \) is a rotation matrix, and \(||\textbf{M}||_F\) denotes the Frobenius norm of matrix \(\textbf{M}\).

The Discretize approach finds a nearly globally optimal solution by alternately updating one of \(\textbf{Y} \) and \(\textbf{R} \) while keeping the other fixed. With \(\textbf{R} \) fixed, each entry \(\textbf{Y} [i,g]\) is updated to

$$\begin{aligned} \textbf{Y} [i,g]={\left\{ \begin{array}{ll} 1, & \quad \text { if } g = \arg \max _{1\le j \le k} (\tilde{\textbf{Q}} \textbf{R})[i, j] \\ 0, & \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(24)

With \(\textbf{Y} \) fixed, \(\tilde{\textbf{Y}}\) is the column-normalized matrix of \(\textbf{Y} \), and \(\textbf{R} \) can be updated as follows via singular value decomposition (SVD).

$$\begin{aligned} \textbf{R} =\textbf{V} \textbf{U} ^\textsf{T}\text {, where } \textbf{U} \varvec{\varOmega } \textbf{V} ^\textsf{T}\text { is an SVD of } \tilde{\textbf{Y}}^\textsf{T}\tilde{\textbf{Q}}. \end{aligned}$$
(25)

The iterative process can terminate early when an objective value obj based on \(\varvec{\varOmega } \) converges, i.e., when its change over the last iteration is within machine precision. This objective is calculated as \(obj=n-2\times \textrm{trace}(\varvec{\varOmega })\) [41].

Fig. 6 Runtime of AHCKA with CPU parallelization

Algorithm 5 Discretize-GPU

We implement the CUDA kernel Discretize-GPU in Algorithm 5 to perform the process above and obtain \(\textbf{Y} \). In detail, Discretize-GPU leverages the grid-block-thread hierarchy of the GPU to assign threads to \(n\times k\) matrices such as \(\textbf{Q} \) and \(\textbf{Y} \). Each row in such a matrix is processed by a block of threads, identified by a block id bid; each of the k elements in the row is handled by a thread tid in the block. Consequently, given a matrix \(\textbf{Q} \), we write \(\textbf{Q} [bid,tid]\) for the element handled by the tid-th thread in block bid on the GPU. Parallel row normalization is performed at Lines 1–2 to get \(\tilde{\textbf{Q}}\). After initializing \(\textbf{R} \) as a \(k\times k\) identity matrix (Line 3), we alternately update \(\tilde{\textbf{Y}}\) and \(\textbf{R} \) for at most \(max\_iter\) iterations (Lines 4–12), terminating early when the objective value obj does not change over the current iteration at Line 12. Within an iteration, we first update \(\textbf{Y} \) at Line 5, then perform column normalization to get \(\tilde{\textbf{Y}}\) (Lines 6–9), and then perform SVD on GPU over \(\tilde{\textbf{Y}}^\textsf{T}\tilde{\textbf{Q}}\) to get \(\textbf{U} \) and \(\textbf{V} \) at Line 10, which are used to update \(\textbf{R} \) at Line 11. Finally, \(\textbf{Y} \) is returned at Line 13. The sketch below captures the serial logic of these alternating updates.
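For reference, the alternating updates of Eqs. (24) and (25) can be written compactly as follows. This is a minimal NumPy sketch of the serial logic that Discretize-GPU parallelizes, not the CUDA kernel itself.

```python
# A minimal NumPy sketch of the alternating updates in Eqs. (24)-(25);
# Discretize-GPU assigns one thread block per row, which we express
# here as vectorized operations instead.
import numpy as np

def discretize(Q, max_iter=100, tol=1e-12):
    n, k = Q.shape
    Q_tilde = Q / np.maximum(np.linalg.norm(Q, axis=1, keepdims=True), tol)
    R = np.eye(k)                     # Line 3: R initialized to identity
    prev_obj = None
    for _ in range(max_iter):
        # Eq. (24): each row picks the column maximizing (Q_tilde R)[i, :]
        QR = Q_tilde @ R
        Y = np.zeros((n, k))
        Y[np.arange(n), QR.argmax(axis=1)] = 1.0
        # Column-normalize Y, then update R via SVD (Eq. 25)
        Y_tilde = Y / np.maximum(np.sqrt(Y.sum(axis=0)), tol)
        U, S, Vt = np.linalg.svd(Y_tilde.T @ Q_tilde)
        R = Vt.T @ U.T                # R = V U^T
        obj = n - 2.0 * S.sum()       # obj = n - 2 * trace(Omega)
        if prev_obj is not None and abs(obj - prev_obj) < tol:
            break                     # early termination on convergence
        prev_obj = obj
    return Y
```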

KNN construction. The \(n\times d\) attribute matrix \(\textbf{X} \) requires a KNN search over its rows to construct the augmented graph \(\mathcal {G}_K\) and thus the transition matrix \(\textbf{P} _K\). For this purpose, we adopt Faiss [30], a GPU-compatible similarity search library. In Algorithm 6 for \(\mathcal {G}_K\) construction, we first normalize all rows in \(\textbf{X} \) at Lines 1–2 to facilitate the computation of cosine similarity between row vectors. Faiss supports various indexes for KNN computation, and the index type suitable for ANCKA is determined by the input data volume. For small or medium datasets where the number of nodes \(|\mathcal {V}|\) is below 100,000, the time cost of exact similarity search is affordable, so we choose the flat index with a plain encoding of each row vector in \(\textbf{X} \) to achieve exact KNN computation (Lines 3–4). Otherwise, we turn to approximate nearest neighbor search on large datasets with the IVFPQ index, which combines the inverted file index (IVF) with the product quantization (PQ) technique, at Line 6. In particular, the IVF index narrows the search down to closely relevant partitions that contain the nearest neighbors with high probability, while PQ produces memory-efficient encodings of attribute vectors. Faiss on GPU is invoked to get the KNN of each row in \(\textbf{X} \), and \(\textbf{A} _K\) is obtained by Eq. (1) at Lines 7–8. Then, the degree matrix \(\textbf{D} _K\) and transition matrix \(\textbf{P} _K\) are computed on GPU (Lines 9–10) and returned at Line 11. A sketch of this procedure follows Algorithm 6.

Algorithm 6 GPU-based \(\mathcal {G}_K\) construction
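The following is a hedged sketch of this procedure with Faiss on GPU. The binary KNN adjacency below is a simplification (Eq. (1) gives the actual construction), and build_knn_transition together with the IVFPQ parameters (nlist, m) are our illustrative choices, assuming d is divisible by m.

```python
# Hedged sketch of Algorithm 6 with Faiss on GPU: exact flat index for
# small inputs, IVFPQ for large ones, then P_K = D_K^{-1} A_K.
import faiss
import numpy as np
import scipy.sparse as sp

def build_knn_transition(X, K, exact_threshold=100_000):
    X = np.ascontiguousarray(X, dtype=np.float32)
    faiss.normalize_L2(X)                 # unit rows: inner product = cosine
    n, d = X.shape
    res = faiss.StandardGpuResources()
    if n < exact_threshold:
        index = faiss.GpuIndexFlatIP(res, d)             # exact search
    else:
        nlist, m = 1024, 8                               # IVF lists, PQ segments
        cpu = faiss.index_factory(d, f"IVF{nlist},PQ{m}",
                                  faiss.METRIC_INNER_PRODUCT)
        index = faiss.index_cpu_to_gpu(res, 0, cpu)
        index.train(X)                                   # train the quantizers
    index.add(X)
    _, nbrs = index.search(X, K + 1)       # K+1 results; drop self-matches
    rows = np.repeat(np.arange(n), K)
    cols = nbrs[:, 1:].reshape(-1)
    A_K = sp.csr_matrix((np.ones(n * K), (rows, cols)), shape=(n, n))
    inv_deg = 1.0 / np.maximum(A_K.sum(axis=1).A1, 1)    # D_K^{-1}
    return sp.diags(inv_deg) @ A_K                       # row-stochastic P_K
```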

8 Experiments

We evaluate the proposed ANCKA and its competitors in terms of clustering quality and efficiency. We also evaluate the performance of ANCKA-GPU on all clustering tasks. In the experiments, we uniformly refer to our method as ANCKA while making it clear in context whether ANCKA is used for AHC (i.e., AHCKA), AGC, or AMGC. All experiments are conducted on a Linux machine powered by Intel Xeon(R) Gold 6226R CPUs, 384 GB RAM, and an NVIDIA RTX 3090 GPU. A maximum of 16 CPU threads are available unless otherwise stated. The code is at https://github.com/gongyguo/ANCKA.

8.1 Experimental setup

8.1.1 Datasets

Table 1 provides the statistics of 17 real-world attributed networks used in experiments, including attributed hypergraphs (HG), undirected graphs (UG), directed graphs (DG), and multiplex graphs (MG). \(|\mathcal {V}|\) and \(|\mathcal {E}|\) are the number of nodes and edges (or hyperedges), respectively, d is the attribute dimension and k is the number of ground-truth clusters.

Table 1 Dataset statistics

We gather 8 attributed hypergraph datasets. The Query dataset [5] is a Web query hypergraph, where nodes represent queries, hyperedges represent query sessions, and node attributes comprise keyword embeddings and associated webpages. Cora-CA, Cora-CC, Citeseer, and DBLP are four benchmark datasets used in prior work [46]. All of them are originally collected from academic databases, where each node represents a publication, node attributes are binary word vectors of the abstract, and research topics serve as ground-truth clusters. Hyperedges correspond to co-authorship in Cora-CA and DBLP and to co-citation in Cora-CC and Citeseer. The 20News dataset [47] consists of messages taken from Usenet newsgroups: messages are nodes, messages containing the same keyword are connected by a corresponding hyperedge, and the TF-IDF vector of each message serves as its node attributes. The Amazon dataset is constructed from the 5-core subset of the Amazon reviews dataset [48], where each node represents a product and a hyperedge contains the products reviewed by a user; for each product, the associated textual metadata serves as node attributes and the product category as its cluster label. The MAG-PM dataset is extracted from the Microsoft Academic Graph [49], where nodes, co-authorship hyperedges, attributes, and cluster labels are obtained as in the other academic datasets (i.e., Cora-CA, Cora-CC, Citeseer, and DBLP).

In Table 1, we also consider 6 attributed graphs commonly used for AGC [13, 27, 50, 51]. Cora, Citeseer-UG, Wiki, and Amazon2M are undirected, while Citeseer-DG and TWeibo are directed. TWeibo [13] and Amazon2M [51] are two large-scale attributed graphs: TWeibo is a social network where each node represents a user and directed edges represent relationships between users, and Amazon2M is constructed from the co-purchasing network of products on Amazon. Cora, Citeseer-UG, and Citeseer-DG are citation networks where nodes represent publications, a pair of nodes are connected if one cites the other, and nodes carry binary word vectors as features. Wiki is a webpage network where each edge indicates that one webpage links to the other, and the node attributes are TF-IDF feature vectors.

Moreover, three attributed multiplex graphs, namely ACM, IMDB, and DBLP-MG, are considered for AMGC [22,23,24]. ACM is an academic publication network comprising co-author and co-subject graph layers, as well as bag-of-words attributes of keywords. IMDB is a movie network with plot text embeddings as attributes and two graph layers representing the co-director (directed by the same director) and co-actor (starring the same actor) relations, respectively. DBLP-MG is a researcher network including publication keyword vectors as attributes and three graph layers: co-author, co-conference (publishing at the same conference), and co-term (sharing common key terms). ACM and DBLP-MG have research areas labeled as ground truth clusters, while IMDB is labeled by movie genres.

8.1.2 Competitors and parameter settings

The 19 competitors for AHC are summarized as follows:

  • 3 plain hypergraph clustering methods including HNCut [52], HyperAdj [53], and KaHyPar [54];

  • the extended AHC versions of the 3 methods above (dubbed ATHNCut, ATHyperAdj, and ATKaHyPar), which work on an augmented hypergraph with attribute-KNN hyperedges of all nodes merged into the input hypergraph;

  • ATMetis, which applies the traditional graph clustering algorithm Metis [55] to a graph constructed by clique expansion of the input hypergraph with attribute KNN graph augmentation; and Infomap [56], Louvain [57], k-MQI, and k-Nibble (extended from MQI [58] and PageRank-Nibble [59] for k-way clustering via \(k-1\) consecutive bisections, as described in the technical report [40]) on the same KNN-augmented clique-expansion graph;

  • 3 AHC algorithms including the recent GRAC [11] and NMF-based approaches (GNMF [11, 60] and JNMF [1]);

  • ACMin-C and ACMin-S, obtained by applying the attributed graph clustering method ACMin [13] to the graphs reduced from hypergraphs by clique expansion and star expansion, respectively; and the probabilistic model CESNA [3] with clique expansion;

  • k-means and HAC (hierarchical agglomerative clustering [61]) algorithms applied to the node attribute matrix.

To evaluate the ANCKA framework, we compare 16 competitors for AGC, including k-means, HAC and the following:

  • 6 AGC approaches including the NMF-based algorithm GNMF [60], the graph convolution algorithm AGCGCN [50], the probabilistic model CESNA [3], FGC [62] that performs spectral clustering on fine-grained graphs, the attributed random walk approach ACMin [13], and the clustering framework GRACE [27] generalized from GRAC.

  • NCut [37] and Metis [55] that are conventional graph clustering methods applied to the input graph;

  • ATNCut and ATMetis that are NCut and Metis applied to the augmented graph with attribute KNN; Infomap [56], Louvain [57], k-MQI [58] and k-Nibble [59] on the augmented graph with attribute KNN.

We compare ANCKA with 16 competitors for AMGC task, including k-means, HAC and the following:

  • 5 AMGC methods: a multi-view graph auto-encoder model O2MAC [25], HDMI [24] that learns node embeddings via a higher-order mutual information loss, MCGC [22] and MAGC [23], which perform graph filtering and find a consensus graph for spectral clustering, and GRACE [27], a general graph convolution clustering method;

  • NCut [37] and Metis [55] that apply traditional graph clustering methods over the aggregation of the adjacency matrices of all graph layers in the input multiplex graph;

  • ATNCut and ATMetis that apply NCut and Metis to the aggregated matrix of all layers’ adjacency matrices and the attribute KNN graph; Infomap [56], Louvain [57], k-MQI [58] and k-Nibble [59] in the same way;

  • CESNA [3] that treats the aggregated adjacency matrix of all layers as an attributed graph.

For all competitors, we adopt the default parameter settings suggested in their respective papers. Hyperparameters of the AMGC algorithms MCGC and MAGC are tuned as instructed in the corresponding papers, and we report the best results acquired. As for ANCKA on attributed hypergraphs, i.e., AHCKA [6], unless otherwise specified, we use the following setting on all datasets: \(\alpha =0.2\), \(\beta =0.5\), \(\gamma =3\), \(K=10\) for KNN construction, convergence threshold \(\epsilon _Q = 0.005\), and iteration counts \(T_a=1000\) and \(T_i=25\). The interval parameter \(\tau \) is set to 5 on all datasets except the large and dense hypergraph Amazon, where we set \(\tau =1\) to expedite early termination in light of the immense per-iteration overhead. On the large datasets Amazon and MAG-PM, \(T_i\) is set to 1 and \(\beta =0.4\). For attributed graphs and multiplex graphs, ANCKA fixes \(K=50\), except on the large datasets TWeibo and Amazon2M where \(K=10\). In particular, we find it necessary to adjust \(\beta \) for certain instances, following the practice in recent works [22, 23, 27]: \(\beta \) is set to 0.5 on Cora and Wiki and to 0.4 on Citeseer-UG, Citeseer-DG, TWeibo, and Amazon2M, and for multiplex graphs we tune \(\beta \) in [0.1, 0.9] with step size 0.1. All remaining hyperparameters of ANCKA follow the default setting of AHCKA, summarized in the snippet below. The parameter settings of GPU-based ANCKA-GPU are identical to those of ANCKA.
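For quick reference, the default AHC (AHCKA) setting above can be collected into a plain configuration dictionary; the variable names here are ours.

```python
# Default AHCKA hyperparameters as stated above; names are ours.
AHCKA_DEFAULTS = dict(
    alpha=0.2,    # restart probability of the random walk
    beta=0.5,     # topology vs. attribute balance (0.4 on Amazon, MAG-PM)
    gamma=3,      # number of random walk steps
    K=10,         # neighbors in the attribute KNN graph
    eps_Q=0.005,  # convergence threshold
    T_a=1000,     # maximum number of iterations
    T_i=25,       # inner iterations (1 on Amazon, MAG-PM)
    tau=5,        # MHC evaluation interval (1 on Amazon)
)
```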

8.2 Performance evaluation

In this section, we report the clustering quality and efficiency of all methods on all datasets. For each method, we repeat each experiment 10 times and report the average performance.

8.2.1 Quality evaluation

The clustering quality is measured by 4 classic metrics: overall accuracy (Acc), average per-class F1 score (F1), normalized mutual information (NMI), and adjusted Rand index (ARI). The first three metrics lie in the range [0, 1], whereas ARI ranges from \(-0.5\) to 1. We also sort all methods by each metric and calculate their average Quality Rank for AHC, AGC, and AMGC, provided in the last column of Tables 3, 5 and 6.
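As a usage note: NMI and ARI are invariant to label permutations, while Acc (and per-class F1) requires matching predicted clusters to ground-truth classes first. A common way to do this, shown in our sketch below (not necessarily the paper's exact evaluation code), uses the Hungarian algorithm.

```python
# Sketch of the evaluation metrics; clustering_accuracy aligns predicted
# cluster ids to ground-truth classes via Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # co-occurrence counts
    row, col = linear_sum_assignment(-cost)  # maximize matched pairs
    return cost[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])        # same clustering, permuted labels
acc = clustering_accuracy(y_true, y_pred)           # -> 1.0 after alignment
nmi = normalized_mutual_info_score(y_true, y_pred)  # permutation-invariant
ari = adjusted_rand_score(y_true, y_pred)           # ranges in [-0.5, 1]
```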

AHC. Tables 2 and 3 present the Acc, F1, NMI, and ARI scores of each method on small and medium/large attributed hypergraph datasets, respectively. The first observation from Tables 2 and 3 is that ANCKA on attributed hypergraphs (i.e., AHCKA) consistently achieves outstanding performance over all competitors on all datasets under almost all metrics, often by a significant margin. ANCKA has a quality rank of 1.3, far better than the runner-ups ATMetis (4.9) and GRAC (5.2). On all four small datasets (i.e., Query, Cora-CA, Cora-CC, and Citeseer), ANCKA outperforms the best competitors (underlined in Table 2) by at least 1.9% in terms of Acc and NMI. On all four medium/large attributed hypergraphs (i.e., 20News, DBLP, Amazon, and MAG-PM), ANCKA also yields remarkable improvements over the competitors, by up to 12.6%, 10.4%, 6.5%, and 13.6% in Acc, F1, NMI, and ARI, respectively. In the few exceptions, ANCKA still leads in three out of the four metrics, demonstrating the best overall performance. The results in Tables 2 and 3 also confirm the effectiveness of ANCKA over attributed hypergraphs from different application domains, e.g., web queries, news messages, and review data. The performance of ANCKA is ascribed to our optimizations based on KNN augmentation and MHC in Sects. 3 and 4, and the framework for generating high-quality BCM matrices in Sect. 5.

AGC. Tables 4 and 5 present the Acc, F1, NMI, and ARI scores of each method on all attributed graphs for the AGC task. ANCKA consistently outperforms existing competitors under most metrics, with a few exceptions where ANCKA is comparable to the best. ANCKA has a quality rank of 1.3, far better than the runner-up with quality rank 4.0. For example, on Citeseer-UG in Table 4, ANCKA achieves higher Acc, F1, NMI, and ARI than the underlined runner-up. On the two large datasets, TWeibo and Amazon2M in Table 5, ANCKA also produces clusters of high quality, while GNMF, AGCGCN, and FGC run out of memory or cannot finish within 12 h. Notably, on Amazon2M, ANCKA surpasses all methods on all metrics except F1 (0.006 behind ATMetis), achieving 0.494 accuracy (runner-up Louvain at 0.463) and 0.545 ARI (runner-up Louvain at 0.520). The effectiveness of ANCKA validates the versatility of the proposed techniques across clustering tasks such as AGC. Besides, ATMetis and ATNCut generally outperform Metis and NCut in AGC performance, respectively, exhibiting the efficacy of the proposed KNN augmentation.

AMGC. Table 6 reports the Acc, F1, NMI, and ARI scores of all methods on all attributed multiplex graphs, where ANCKA has the best quality rank. On ACM and DBLP-MG, ANCKA achieves the best clustering quality among all methods under all metrics, with NMI and ARI leading by at least 3% on ACM; on IMDB, it is the second best in three metrics. As shown later in Table 9, on these datasets, ANCKA is also faster than existing native AMGC methods by at least an order of magnitude. With the intuitive design of the random walk transition matrix \(\textbf{P}_{N} \) on multiplex graphs in Sect. 6.1, ANCKA can utilize the proposed KNN augmentation, clustering objective, and optimization techniques to maintain its excellent performance on the AMGC task.

8.2.2 Efficiency evaluation

Tables 7, 8, and 9 report the runtime (in seconds, with KNN construction included) and memory overhead (in gigabytes) for AHC, AGC, and AMGC, respectively. For ease of comparing the trade-off between quality and efficiency, the last column of Tables 7, 8, and 9 contains the corresponding quality ranks from Tables 3, 5, and 6, respectively. In each table, the methods are separated into two categories: non-native methods extended from other clustering problems and native methods for the corresponding task. For instance, in Table 7, there are 4 native AHC methods in the last 4 rows, while the non-native methods are in the rows above.

In Tables 7, 8, and 9, although certain non-native methods are efficient, their clustering quality ranks are typically low. Hence, in the following, we mainly compare the efficiency of ANCKA against the native methods for each task. A method is terminated early if it runs out of memory (OOM) or cannot finish within 12 h.

AHC. In Table 7, compared with native AHC methods, ANCKA is significantly faster on most datasets, often by orders of magnitude. For example, on the small dataset Citeseer, ANCKA takes 0.635 s, while the fastest native AHC competitor GRAC needs 13.15 s, i.e., ANCKA is \(20.7\times \) faster. On large attributed hypergraphs including Amazon and MAG-PM, most existing AHC solutions fail to finish due to OOM errors, whereas ANCKA achieves \(11.4\times \) and \(2.6\times \) speedups over the only viable native AHC competitor GRAC on Amazon and MAG-PM, respectively. An exception is 20News, which contains only 100 hyperedges; there, ANCKA is slower than GRAC. Recall from Table 3, however, that GRAC yields far inferior clustering accuracy on 20News, which highlights the advantages of ANCKA over GRAC. Additionally, while ATMetis is fast, its average quality rank of 4.9 falls short of the 1.7 attained by ANCKA; as shown in Tables 2 and 3, ANCKA surpasses ATMetis in all metrics but one, and ATMetis encounters OOM on Amazon. As for memory consumption (including the space to store hypergraphs), ANCKA has memory overheads comparable to the native AHC competitors on small datasets and up to \(3.1\times \) memory reduction on medium/large ones.

Table 2 Attributed hypergraph clustering (AHC) quality on small datasets
Table 3 Attributed hypergraph clustering (AHC) quality on medium/large datasets
Table 4 Attributed graph clustering (AGC) quality on Cora, Citeseer-UG & Wiki

AGC. In Table 8 for AGC, ANCKA has running time comparable to ACMin, a recent AGC method optimized for efficiency, while being faster than the other native AGC methods. However, the quality rank of ANCKA is 1.3, far better than the 5.6 of ACMin. Specifically, in Tables 4 and 5, ANCKA consistently achieves better clustering quality than ACMin on all six attributed graphs under all metrics. Moreover, ANCKA remains the runner-up in terms of running time on the first five datasets and is the fastest on the largest dataset, Amazon2M. Memory-wise, ANCKA consumes a moderate amount of memory that stays below 1 GB on the first four small datasets and remains moderate on the two large datasets, TWeibo and Amazon2M.

AMGC. In Table 9, ANCKA achieves significant speedups over the native AMGC baselines, often by an order of magnitude, while being memory efficient. Specifically, ANCKA achieves speedups of \(15.0\times \), \(13.9\times \), and \(9.5\times \) on the three datasets over the fastest native AMGC competitors, MAGC and GRACE. The memory consumption of ANCKA is also lower than that of the majority of existing native AMGC methods.

8.2.3 Evaluation on ANCKA-GPU

We compare the clustering quality and efficiency of the CPU-based ANCKA against ANCKA-GPU of Sect. 7, with results reported in Table 10 for the three tasks (AHC, AGC, and AMGC) over all datasets. First, observe that ANCKA-GPU achieves similarly high-quality clustering results as the CPU-based ANCKA across all datasets for all three tasks, and the quality difference between ANCKA-GPU and ANCKA is often negligible in terms of Acc, F1, NMI, and ARI.

The last column of Table 10 provides the running time of ANCKA-GPU and of ANCKA with 16 CPU threads. For the AHC task, the speedup of ANCKA-GPU is less significant on the small attributed hypergraphs (Query, Cora-CA, Cora-CC, and Citeseer). We ascribe this to the numerous SVD operations on small \(k\times k\) matrices in Discretize-GPU, as small input matrix dimensions are known to hurt the efficiency of GPU-based SVD [63]. On the medium/large attributed hypergraphs (20News, DBLP, Amazon, and MAG-PM), ANCKA-GPU achieves speedup ratios of 30.5, 70.2, 8.44, and 11.4, respectively, over the CPU version ANCKA. These speedup ratios, often exceeding an order of magnitude, validate the efficiency of the technical designs elaborated in Sect. 7, especially on large-scale hypergraphs. For the AGC task, similarly, on the small attributed graphs Cora, Citeseer-UG, Wiki, and Citeseer-DG, ANCKA-GPU is faster than ANCKA but the speedup ratio is usually below 10, for the same reason explained above. On the large attributed graphs (TWeibo and Amazon2M), ANCKA-GPU is more efficient than ANCKA by an order of magnitude. For the AMGC task, ANCKA-GPU is also consistently faster than ANCKA on all attributed multiplex graphs. The memory consumption of ANCKA-GPU is measured in GPU video memory (VRAM) and that of ANCKA in RAM, reported in GBs in the second-to-last column of Table 10. The two are not directly comparable due to the different computational architectures and libraries used on GPUs and CPUs. Note that the major memory consumption of our implementations lies in the KNN augmentation step. On small or medium-sized datasets, e.g., Query and Cora-CA, the VRAM usage of ANCKA-GPU is higher than the RAM usage of ANCKA because ANCKA-GPU uses GPU-based Faiss for nearest-neighbor search, and Faiss allocates about 700 MB of VRAM for temporary storage. On large datasets, ANCKA requires substantial RAM due to its use of the ScaNN algorithm for KNN, while GPU-based Faiss in ANCKA-GPU requires less VRAM.

Table 5 Attributed graph clustering (AGC) quality on Citeseer-DG, TWeibo & Amazon2M

We then enhance GRACE [27] with GPU acceleration using the CuPy and cuML libraries, resulting in GRACE-GPU for comparison. We also compare with a GPU-based implementation of the Spectral Modularity Maximization [64] clustering method, dubbed SMM-GPU, which operates on the graph adjacency matrix for AGC (the clique expansion of the hypergraph for AHC, or the sum of multiplex adjacency matrices for AMGC) with attribute KNN augmentation. The results for AHC, AGC, and AMGC are presented in Tables 11, 12, and 13, respectively. On the first six smaller datasets in Table 11 for AHC, SMM-GPU exhibits lower quality in terms of Acc, F1, NMI, and ARI despite efficiency comparable to ANCKA-GPU, which delivers significantly better clustering quality. ANCKA-GPU outperforms GRACE-GPU in both quality and efficiency across all AHC datasets. Notably, on the large datasets Amazon and MAG-PM in Table 11, ANCKA-GPU efficiently produces satisfactory clusters, whereas GRACE-GPU and SMM-GPU run out of memory due to their need to expand hypergraphs into graphs. Similar patterns are observed for AGC and AMGC in Tables 12 and 13: ANCKA-GPU delivers superior clustering quality and efficiency on most datasets, except IMDB where it is the second best, while SMM-GPU yields lower-quality outcomes and GRACE-GPU falls behind our method in speed. We conclude that ANCKA-GPU offers high clustering quality with remarkable efficiency.

8.3 Experimental analysis

Varying \({\varvec{K}}\). Figure 7 depicts the Acc, F1, NMI scores and the KNN computation time of ANCKA on the 8 attributed hypergraphs (AHC) when varying K from 2 to 1000. We make the following observations. First, on most hypergraphs, the clustering accuracy of ANCKA first grows as K increases from 2 to 10 and then declines, especially once K exceeds 50. The reasons are as follows. When K is small, the KNN graph \(\mathcal {G}_K\) fails to capture the key information in the attribute matrix \(\textbf{X} \), leading to limited result quality; when K is large, more noisy or distorted information is introduced into \(\mathcal {G}_K\), causing accuracy loss. This coincides with our observation in the preliminary study in Fig. 2. Moreover, as K grows, the KNN construction time increases on all datasets. Figures 8 and 9 show the Acc, F1, NMI scores and KNN computation time of ANCKA on the 6 attributed graphs and 3 attributed multiplex graphs for AGC and AMGC, respectively, when varying K from 2 to 1000. On the small graphs in Figs. 8a–d and 9, the cluster quality increases as K goes from 2 to 50 and then declines on datasets such as Citeseer-UG, Wiki, and ACM. On the large datasets TWeibo and Amazon2M in Fig. 8e and f, a turning point appears around \(K=10\). Therefore, we set K to 50 and 10 on these small and large datasets, respectively.

Table 6 Attributed multiplex graph clustering (AMGC) quality
Table 7 Efficiency of attributed hypergraph clustering (AHC) (time in seconds, RAM in GBs)
Table 8 Efficiency of attributed graph clustering (AGC) algorithms (time in seconds, RAM in GBs)
Table 9 Efficiency of attributed multiplex graph clustering (AMGC) algorithms (time in seconds, RAM in GBs)
Table 10 Evaluation between ANCKA and ANCKA-GPU
Table 11 Additional GPU baselines for AHC
Table 12 Additional GPU baselines for AGC
Table 13 Additional GPU baselines for AMGC

Varying \(\pmb {\beta }\). Recall that in the generalized \((\alpha ,\beta ,\gamma )\)-random walk model, the parameter \(\beta \) balances the topological proximities from the network topology \(\mathcal {N}_O\) against the attribute similarities from the KNN graph \(\mathcal {G}_K\). Figure 10 displays the AHC performance of ANCKA on the 8 attributed hypergraph datasets when \(\beta \) varies from 0 to 1. When \(\beta =0\), ANCKA degrades to a hypergraph clustering method that ignores all attribute information, whereas with \(\beta =1\) ANCKA clusters only the KNN graph \(\mathcal {G}_K\) regardless of the topology of \(\mathcal {H}\). From Fig. 10, we can see that a large \(\beta \) (e.g., 0.7–0.8) brings greater performance gains on the small/medium datasets (Query, Cora-CA, Cora-CC, Citeseer, 20News, and DBLP), meaning that attribute information plays a more important role there: as listed in Table 1, these datasets have limited connections in the original hypergraph structure (or connections too dense to be informative, e.g., on Query) and thus rely on attribute similarities from the augmented KNN graph \(\mathcal {G}_K\) for improved clustering. By contrast, on Amazon and MAG-PM, ANCKA achieves the best clustering quality with a small \(\beta \) in [0.1, 0.4], indicating that graph topology carries more weight on these datasets. Figures 11 and 12 report the Acc, F1, and NMI scores for the AGC and AMGC tasks, respectively. Similarly, as \(\beta \) increases from 0, the cluster quality generally improves, stabilizes around 0.4–0.5, and decreases as \(\beta \) approaches 1. On DBLP-MG in Fig. 12, the highest clustering quality is acquired with a small \(\beta \) around 0.1; we infer that node attributes in this dataset are of limited significance for clustering, while on ACM and IMDB the best quality is achieved when \(\beta \) appropriately balances graph topology and attributes. The sketch below makes the role of \(\beta \) explicit.
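When \(\textbf{B}=\beta \textbf{I}\), which matches how \(\beta \) is described but is our simplification, one step of the combined walk in Eq. (23) reduces to a convex combination of the two transition matrices:

```python
# With B = beta * I (our simplifying assumption), one step of the
# combined walk interpolates between topology (P_N) and attributes (P_K).
def combined_transition(P_N, P_K, beta):
    return (1 - beta) * P_N + beta * P_K
```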

Table 14 Ablation analysis on AHC (time in seconds)

Varying \(\varvec{\gamma }\). We evaluate ANCKA in terms of AHC quality and running time when varying \(\gamma \). Figure 13 displays the Acc, F1, NMI scores and runtime on two representative datasets when \(\gamma \) varies from 1 to 5. The results on other datasets are similar and thus omitted for space. Observe that in practice the Acc, F1, and NMI scores obtained by ANCKA first increase and then remain stable once \(\gamma \) exceeds 3 on Cora-CC and 2 on Citeseer. By contrast, the running time goes up as \(\gamma \) increases. Therefore, we set \(\gamma =3\) in experiments.

Table 15 Ablation analysis on AGC (time in seconds)
Table 16 Ablation analysis on AMGC (time in seconds)

Effectiveness Evaluation of InitBCM and Discretize. On attributed hypergraphs, to verify the effectiveness of \(\texttt {InitBCM} \) for BCM initialization, we compare ANCKA with the ablated version ANCKA-random-init, where the BCM matrix \(\textbf{Y} ^{(0)}\) is initialized at random. In Table 14, ANCKA obtains remarkable improvements over ANCKA-random-init in Acc, F1, and NMI with comparable processing time. For instance, on Amazon, ANCKA outperforms ANCKA-random-init by a large margin of 3.7% Acc, 19.5% F1, and 6.8% NMI while taking 24 s less. On MAG-PM, ANCKA needs additional time compared to ANCKA-random-init: ANCKA-random-init starts with a low-quality BCM and converges to local optima with suboptimal MHC, whereas ANCKA bypasses such pitfalls with a good initial BCM from InitBCM and keeps searching for better solutions over more iterations, which in turn yields a considerable gap in clustering quality. In addition, we validate the effectiveness of Discretize, used in ANCKA to transform the k leading eigenvectors \(\textbf{Q} \) into the BCM matrix \(\textbf{Y} \). Table 14 reports the accuracy of ANCKA and of a variant ANCKA-k-means obtained by replacing Discretize with k-means, on all datasets. Compared with ANCKA-k-means, ANCKA outputs high-quality BCM matrices \(\textbf{Y} \) with substantially higher clustering accuracy while being up to \(3.2\times \) faster. The ablation results on AGC and AMGC are in Tables 15 and 16, respectively. Regarding clustering quality (Acc, F1, NMI), Table 15 shows that for AGC, ANCKA surpasses its ablated counterparts on all datasets across most effectiveness metrics, except for the Citeseer datasets; for example, ANCKA with InitBCM achieves an Acc 4.2% higher than ANCKA-random-init on Amazon2M. In Table 16 for AMGC, ANCKA performs the best on all three datasets. For efficiency, in Tables 15 and 16, ANCKA is similar to ANCKA-random-init, while ANCKA-k-means is slower. These results confirm the effectiveness of the proposed techniques for AGC and AMGC.

8.4 Convergence analysis

We provide an empirical analysis of the convergence of ANCKA for attributed hypergraph clustering. To do so, we first disable the early termination strategy at Line 10 in Algorithm 1. We also set \(\tau =1\) so as to evaluate the MHC (denoted as \(\phi _t\)) of the BCM matrix \(\textbf{Y} ^{(t)}\) generated in each t-th iteration of ANCKA and ANCKA-random-init, where t runs from 0 until convergence. Furthermore, we calculate the Acc, F1, and NMI scores against the ground truth for each BCM matrix \(\textbf{Y} ^{(t)}\) generated throughout the iterative procedure of ANCKA. Figure 14 shows the MHC \(\phi _t\), Acc, F1, and NMI scores based on the BCM matrix of each iteration of ANCKA, as well as the MHC of ANCKA-random-init, over all datasets. Notably, MHC \(\phi _t\) declines sharply as t increases from 0 to 50 on most hypergraphs, while the Acc, F1, and NMI scores grow significantly. Moreover, the MHC curves of ANCKA are mostly lower (better) than those of ANCKA-random-init at the same iteration t on all datasets. These phenomena demonstrate the effectiveness of InitBCM in facilitating fast convergence of ANCKA. However, as t keeps increasing, these scores either remain stable or deteriorate: for instance, MHC scores grow significantly after 10 iterations on Amazon, and there is a big drop in Acc and F1 scores when \(t\ge 45\) on DBLP. This indicates that more iterations do not necessarily yield better solutions. Hence, the early termination proposed in ANCKA serves as an effective remedy for this issue.

Fig. 7 Varying K for AHC (best viewed in color)

8.5 Runtime analysis

Figure 15 reports the time breakdown of ANCKA and ANCKA-GPU into four parts: KNN construction, orthogonal iterations, discretization, and greedy initialization with MHC evaluation, on all attributed hypergraphs. We first explain the results of ANCKA on CPUs. On all datasets, the four parts of ANCKA all take considerable time, except on 20News and DBLP, where KNN construction dominates since these datasets contain many nodes but relatively few hyperedges. Then, we compare the time breakdown of ANCKA-GPU with that of ANCKA. On the small attributed hypergraphs (Query, Cora-CA, Cora-CC, and Citeseer) in Fig. 15a–d, ANCKA-GPU significantly reduces the KNN time, while the other costs are on par with those of ANCKA, consistent with the results in Sect. 8.2.3. On the medium/large attributed hypergraphs in Fig. 15e–h, ANCKA-GPU significantly improves the efficiency of all four parts. From the results on Amazon and MAG-PM, we observe that the scalability of ANCKA-GPU is primarily constrained by KNN construction, while the overhead of the CPU-based ANCKA is more evenly distributed across the four parts.

Fig. 8 Varying K for AGC (best viewed in color)

Fig. 9 Varying K for AMGC (best viewed in color)

Fig. 10 Varying \(\beta \) for AHC (best viewed in color)

9 Related work

Hypergraph Clustering. Motivated by applications in circuit manufacturing, partitioning algorithms such as hMetis [65] and KaHyPar [14] have been developed to divide hypergraphs into partitions/clusters. These methods typically adopt a three-stage framework consisting of coarsening, initial clustering, and refinement stages: they perform clustering directly on a coarsened hypergraph of relatively small size, run a portfolio of clustering algorithms, and select the best outcome. They rely on a set of clustering heuristics and lack the extensibility to exploit node attribute information. Hypergraph Normalized Cut (HNCut) [52] is a conductance measure for hypergraph clusters, from which the normalized hypergraph Laplacian \(\varDelta =\textbf{I}- \varTheta \) is derived for spectral clustering, where \( \varTheta = \textbf{D} _V^{-1/2} \textbf{H} ^\textsf{T} \textbf{D} _E^{-1} \textbf{H} \textbf{D} _V^{-1/2} \) (a sketch of computing \(\varTheta \) is given below). Alternatively, hGraclus [5] optimizes the HNCut objective using a multi-level kernel k-means algorithm. Non-negative matrix factorization has also been applied to hypergraph clustering [15]. Despite their theoretical soundness, these algorithms are less efficient than the aforementioned partitioning algorithms, and they do not utilize node attributes either. For the problem of hypergraph local clustering, i.e., finding a high-quality cluster containing a specified node, a sweep cut method [66] finds the cluster based on hypergraph Personalized PageRank (PPR) values. In this paper, we focus on global clustering, a different problem from local clustering.
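As a reference for the formula above, a short sketch of \(\varTheta \) follows, assuming an unweighted \(|\mathcal {E}|\times |\mathcal {V}|\) incidence matrix \(\textbf{H}\) in sparse form.

```python
# Sketch of the normalized hypergraph operator Theta from HNCut,
# Theta = Dv^{-1/2} H^T De^{-1} H Dv^{-1/2}, with H an |E| x |V|
# binary incidence matrix; degree matrices are inverted entrywise.
import numpy as np
import scipy.sparse as sp

def hypergraph_theta(H):
    d_v = H.sum(axis=0).A1            # node degrees (column sums)
    d_e = H.sum(axis=1).A1            # hyperedge sizes (row sums)
    Dv_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d_v, 1)))
    De_inv = sp.diags(1.0 / np.maximum(d_e, 1))
    return Dv_inv_sqrt @ H.T @ De_inv @ H @ Dv_inv_sqrt
```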

Fig. 11 Varying \(\beta \) for AGC (best viewed in color)

Fig. 12 Varying \(\beta \) for AMGC (best viewed in color)

Fig. 13 Varying \(\gamma \) on attributed hypergraphs

Fig. 14 Convergence analysis (best viewed in color)

Fig. 15 Runtime breakdown of CPU-based ANCKA and ANCKA-GPU in seconds

Attributed Hypergraph Clustering. Several studies design dedicated clustering algorithms for attributed hypergraphs. JNMF [1] is an AHC algorithm based on non-negative matrix factorization (NMF). With the normalized hypergraph Laplacian [52] matrix \(\varDelta =\textbf{I}- \varTheta \) and attribute matrix \(\textbf{X} \), JNMF optimizes a joint objective that includes a basic NMF part as well as a symmetric NMF part: \(\min _{\textbf{W}, \textbf{M}, \tilde{\textbf{M}}\ge 0} ||\textbf{X}-\textbf{W} \textbf{M} ||_F^2 + \alpha ||\varTheta -\tilde{\textbf{M}}^\textsf{T}\textbf{M} ||_F^2 + \beta ||\tilde{\textbf{M}}-\textbf{M} ||_F^2\). Optimized with a block coordinate descent (BCD) scheme, the matrix \(\textbf{M} \) is expected to encode cluster memberships. MEGA [5] extends the JNMF clustering objective for semi-supervised clustering of multi-view data containing a hypergraph, node attributes, and a pair-wise similarity graph; its clustering performance is further enhanced by initialization with the hGraclus algorithm. The GNMF [60] algorithm was originally proposed for high-dimensional data clustering, and the authors of [11] extend its objective with the normalized hypergraph Laplacian [52], yielding baseline methods for AHC. Although NMF-based algorithms sometimes produce clusters of good quality, their scalability is underwhelming, as shown in our experiments. As the state-of-the-art algorithm for attributed hypergraph clustering, GRAC [11] performs hypergraph convolution [46] on node attributes, which resembles the hypergraph diffusion process with mediators [67]; clusters are then predicted from the propagated features via a spectral algorithm.

Attributed Graph Clustering. There is a collection of studies on attributed graph clustering. Some perform attributed graph clustering with probabilistic models that combine graph structure with attributes, including discriminative models such as PCL-DC [68] and generative models such as BAGC [2]. Nevertheless, these methods are typically limited to categorical attributes, and inference over the probability distribution of \(O(2^n)\) possible hyperedges poses a significant challenge to their generalization to hypergraphs. GNMF [60] is an NMF-based algorithm that modifies the Laplacian regularizer in traditional NMF to use the Laplacian matrix constructed from the graph structure. Within the random walk framework, the SA-Cluster [12] algorithm augments the original graph with virtual nodes representing each possible attribute-value pair and performs k-Medoids clustering using a random walk distance measure. ACMin [13] defines attributed random walks by adding virtual attribute nodes as bridges and combines them with graph random walks into a joint transition matrix. In a fashion similar to GCN [17], AGCGCN [50] performs graph convolution on node attributes to produce smooth feature representations that incorporate network structure information and subsequently applies spectral clustering; the authors also design heuristics to keep the propagated features from over-smoothing, which undermines cluster quality. GRACE [27] adopts graph convolution on node attributes to fuse all available information and performs a spectral algorithm based on GRAC [11]. FGC [62] exploits both node features and structure information via graph convolution and applies spectral clustering on a fine-grained graph that encodes higher-order relations.

Attributed Multiplex Graph Clustering. Via unsupervised learning on attributed multiplex graphs, neural network models can learn node embeddings for clustering, e.g., O2MAC [25] and HDMI [24]. GRACE [27] constructs a multiplex graph Laplacian and uses this matrix for graph convolution. Other methods find a single graph that encodes the node proximity relations in all graph layers and attributes. MCGC [22] performs graph filtering on attributes and learns a consensus graph leveraging contrastive regularization, while MAGC [23] exploits higher-order proximity to learn consensus graphs without deep neural networks.

10 Conclusion

This paper presents ANCKA, a versatile, effective, and efficient attributed network clustering method for AHC, AGC, and AMGC computation. The improvements of ANCKA over existing solutions in terms of efficiency and effectiveness are attributed to (i) an effective KNN augmentation strategy that exploits useful attribute information, (ii) a novel problem formulation based on a random walk model, and (iii) an efficient iterative optimization framework with speedup techniques. To further boost efficiency, we leverage GPUs and develop ANCKA-GPU, which is faster than its CPU-parallel counterpart ANCKA on large datasets while retaining high cluster quality. We conduct extensive experiments on real-world data to validate the outstanding performance of our methods. In the future, we plan to extend ANCKA to cope with evolving attributed networks and to enhance its scalability via distributed KNN construction and matrix computation.