-
HeteroMILE: a Multi-Level Graph Representation Learning Framework for Heterogeneous Graphs
Authors:
Yue Zhang,
Yuntian He,
Saket Gurukar,
Srinivasan Parthasarathy
Abstract:
Heterogeneous graphs are ubiquitous in real-world applications because they can represent various relationships between different types of entities. Therefore, learning embeddings in such graphs is a critical problem in graph machine learning. However, existing solutions for this problem fail to scale to large heterogeneous graphs due to their high computational complexity. To address this issue,…
▽ More
Heterogeneous graphs are ubiquitous in real-world applications because they can represent various relationships between different types of entities. Therefore, learning embeddings in such graphs is a critical problem in graph machine learning. However, existing solutions for this problem fail to scale to large heterogeneous graphs due to their high computational complexity. To address this issue, we propose a Multi-Level Embedding framework of nodes on a heterogeneous graph (HeteroMILE) - a generic methodology that allows contemporary graph embedding methods to scale to large graphs. HeteroMILE repeatedly coarsens the large sized graph into a smaller size while preserving the backbone structure of the graph before embedding it, effectively reducing the computational cost by avoiding time-consuming processing operations. It then refines the coarsened embedding to the original graph using a heterogeneous graph convolution neural network. We evaluate our approach using several popular heterogeneous graph datasets. The experimental results show that HeteroMILE can substantially reduce computational time (approximately 20x speedup) and generate an embedding of better quality for link prediction and node classification.
△ Less
Submitted 31 March, 2024;
originally announced April 2024.
-
PolicyClusterGCN: Identifying Efficient Clusters for Training Graph Convolutional Networks
Authors:
Saket Gurukar,
Shaileshh Bojja Venkatakrishnan,
Balaraman Ravindran,
Srinivasan Parthasarathy
Abstract:
Graph convolutional networks (GCNs) have achieved huge success in several machine learning (ML) tasks on graph-structured data. Recently, several sampling techniques have been proposed for the efficient training of GCNs and to improve the performance of GCNs on ML tasks. Specifically, the subgraph-based sampling approaches such as ClusterGCN and GraphSAINT have achieved state-of-the-art performanc…
▽ More
Graph convolutional networks (GCNs) have achieved huge success in several machine learning (ML) tasks on graph-structured data. Recently, several sampling techniques have been proposed for the efficient training of GCNs and to improve the performance of GCNs on ML tasks. Specifically, the subgraph-based sampling approaches such as ClusterGCN and GraphSAINT have achieved state-of-the-art performance on the node classification tasks. These subgraph-based sampling approaches rely on heuristics -- such as graph partitioning via edge cuts -- to identify clusters that are then treated as minibatches during GCN training. In this work, we hypothesize that rather than relying on such heuristics, one can learn a reinforcement learning (RL) policy to compute efficient clusters that lead to effective GCN performance. To that end, we propose PolicyClusterGCN, an online RL framework that can identify good clusters for GCN training. We develop a novel Markov Decision Process (MDP) formulation that allows the policy network to predict ``importance" weights on the edges which are then utilized by a clustering algorithm (Graclus) to compute the clusters. We train the policy network using a standard policy gradient algorithm where the rewards are computed from the classification accuracies while training GCN using clusters given by the policy. Experiments on six real-world datasets and several synthetic datasets show that PolicyClusterGCN outperforms existing state-of-the-art models on node classification task.
△ Less
Submitted 25 June, 2023;
originally announced June 2023.
-
FairMILE: Towards an Efficient Framework for Fair Graph Representation Learning
Authors:
Yuntian He,
Saket Gurukar,
Srinivasan Parthasarathy
Abstract:
Graph representation learning models have demonstrated great capability in many real-world applications. Nevertheless, prior research indicates that these models can learn biased representations leading to discriminatory outcomes. A few works have been proposed to mitigate the bias in graph representations. However, most existing works require exceptional time and computing resources for training…
▽ More
Graph representation learning models have demonstrated great capability in many real-world applications. Nevertheless, prior research indicates that these models can learn biased representations leading to discriminatory outcomes. A few works have been proposed to mitigate the bias in graph representations. However, most existing works require exceptional time and computing resources for training and fine-tuning. To this end, we study the problem of efficient fair graph representation learning and propose a novel framework FairMILE. FairMILE is a multi-level paradigm that can efficiently learn graph representations while enforcing fairness and preserving utility. It can work in conjunction with any unsupervised embedding approach and accommodate various fairness constraints. Extensive experiments across different downstream tasks demonstrate that FairMILE significantly outperforms state-of-the-art baselines in terms of running time while achieving a superior trade-off between fairness and utility.
△ Less
Submitted 17 October, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
MultiBiSage: A Web-Scale Recommendation System Using Multiple Bipartite Graphs at Pinterest
Authors:
Saket Gurukar,
Nikil Pancha,
Andrew Zhai,
Eric Kim,
Samson Hu,
Srinivasan Parthasarathy,
Charles Rosenberg,
Jure Leskovec
Abstract:
Graph Convolutional Networks (GCN) can efficiently integrate graph structure and node features to learn high-quality node embeddings. These embeddings can then be used for several tasks such as recommendation and search. At Pinterest, we have developed and deployed PinSage, a data-efficient GCN that learns pin embeddings from the Pin-Board graph. The Pin-Board graph contains pin and board entities…
▽ More
Graph Convolutional Networks (GCN) can efficiently integrate graph structure and node features to learn high-quality node embeddings. These embeddings can then be used for several tasks such as recommendation and search. At Pinterest, we have developed and deployed PinSage, a data-efficient GCN that learns pin embeddings from the Pin-Board graph. The Pin-Board graph contains pin and board entities and the graph captures the pin belongs to a board interaction. However, there exist several entities at Pinterest such as users, idea pins, creators, and there exist heterogeneous interactions among these entities such as add-to-cart, follow, long-click.
In this work, we show that training deep learning models on graphs that captures these diverse interactions would result in learning higher-quality pin embeddings than training PinSage on only the Pin-Board graph. To that end, we model the diverse entities and their diverse interactions through multiple bipartite graphs and propose a novel data-efficient MultiBiSage model. MultiBiSage can capture the graph structure of multiple bipartite graphs to learn high-quality pin embeddings. We take this pragmatic approach as it allows us to utilize the existing infrastructure developed at Pinterest -- such as Pixie system that can perform optimized random-walks on billion node graphs, along with existing training and deployment workflows. We train MultiBiSage on six bipartite graphs including our Pin-Board graph. Our offline metrics show that MultiBiSage significantly outperforms the deployed latest version of PinSage on multiple user engagement metrics.
△ Less
Submitted 21 May, 2022;
originally announced May 2022.
-
FairEGM: Fair Link Prediction and Recommendation via Emulated Graph Modification
Authors:
Sean Current,
Yuntian He,
Saket Gurukar,
Srinivasan Parthasarathy
Abstract:
As machine learning becomes more widely adopted across domains, it is critical that researchers and ML engineers think about the inherent biases in the data that may be perpetuated by the model. Recently, many studies have shown that such biases are also imbibed in Graph Neural Network (GNN) models if the input graph is biased, potentially to the disadvantage of underserved and underrepresented co…
▽ More
As machine learning becomes more widely adopted across domains, it is critical that researchers and ML engineers think about the inherent biases in the data that may be perpetuated by the model. Recently, many studies have shown that such biases are also imbibed in Graph Neural Network (GNN) models if the input graph is biased, potentially to the disadvantage of underserved and underrepresented communities. In this work, we aim to mitigate the bias learned by GNNs by jointly optimizing two different loss functions: one for the task of link prediction and one for the task of demographic parity. We further implement three different techniques inspired by graph modification approaches: the Global Fairness Optimization (GFO), Constrained Fairness Optimization (CFO), and Fair Edge Weighting (FEW) models. These techniques mimic the effects of changing underlying graph structures within the GNN and offer a greater degree of interpretability over more integrated neural network methods. Our proposed models emulate microscopic or macroscopic edits to the input graph while training GNNs and learn node embeddings that are both accurate and fair under the context of link recommendations. We demonstrate the effectiveness of our approach on four real world datasets and show that we can improve the recommendation fairness by several factors at negligible cost to link prediction accuracy.
△ Less
Submitted 20 October, 2022; v1 submitted 27 January, 2022;
originally announced January 2022.
-
A Machine Learning Model for Nowcasting Epidemic Incidence
Authors:
Saumya Yashmohini Sahai,
Saket Gurukar,
Wasiur R. KhudaBukhsh,
Srinivasan Parthasarathy,
Grzegorz A. Rempala
Abstract:
Due to delay in reporting, the daily national and statewide COVID-19 incidence counts are often unreliable and need to be estimated from recent data. This process is known in economics as nowcasting. We describe in this paper a simple random forest statistical model for nowcasting the COVID - 19 daily new infection counts based on historic data along with a set of simple covariates, such as the cu…
▽ More
Due to delay in reporting, the daily national and statewide COVID-19 incidence counts are often unreliable and need to be estimated from recent data. This process is known in economics as nowcasting. We describe in this paper a simple random forest statistical model for nowcasting the COVID - 19 daily new infection counts based on historic data along with a set of simple covariates, such as the currently reported infection counts, day of the week, and time since first reporting. We apply the model to adjust the daily infection counts in Ohio, and show that the predictions from this simple data-driven method compare favorably both in quality and computational burden to those obtained from the state-of-the-art hierarchical Bayesian model employing a complex statistical algorithm.
△ Less
Submitted 5 April, 2021;
originally announced April 2021.
-
Towards Quantifying the Distance between Opinions
Authors:
Saket Gurukar,
Deepak Ajwani,
Sourav Dutta,
Juho Lauri,
Srinivasan Parthasarathy,
Alessandra Sala
Abstract:
Increasingly, critical decisions in public policy, governance, and business strategy rely on a deeper understanding of the needs and opinions of constituent members (e.g. citizens, shareholders). While it has become easier to collect a large number of opinions on a topic, there is a necessity for automated tools to help navigate the space of opinions. In such contexts understanding and quantifying…
▽ More
Increasingly, critical decisions in public policy, governance, and business strategy rely on a deeper understanding of the needs and opinions of constituent members (e.g. citizens, shareholders). While it has become easier to collect a large number of opinions on a topic, there is a necessity for automated tools to help navigate the space of opinions. In such contexts understanding and quantifying the similarity between opinions is key. We find that measures based solely on text similarity or on overall sentiment often fail to effectively capture the distance between opinions. Thus, we propose a new distance measure for capturing the similarity between opinions that leverages the nuanced observation -- similar opinions express similar sentiment polarity on specific relevant entities-of-interest. Specifically, in an unsupervised setting, our distance measure achieves significantly better Adjusted Rand Index scores (up to 56x) and Silhouette coefficients (up to 21x) compared to existing approaches. Similarly, in a supervised setting, our opinion distance measure achieves considerably better accuracy (up to 20% increase) compared to extant approaches that rely on text similarity, stance similarity, and sentiment similarity
△ Less
Submitted 27 January, 2020;
originally announced January 2020.
-
Twitter Watch: Leveraging Social Media to Monitor and Predict Collective-Efficacy of Neighborhoods
Authors:
Moniba Keymanesh,
Saket Gurukar,
Bethany Boettner,
Christopher Browning,
Catherine Calder,
Srinivasan Parthasarathy
Abstract:
Sociologists associate the spatial variation of crime within an urban setting, with the concept of collective efficacy. The collective efficacy of a neighborhood is defined as social cohesion among neighbors combined with their willingness to intervene on behalf of the common good. Sociologists measure collective efficacy by conducting survey studies designed to measure individuals' perception of…
▽ More
Sociologists associate the spatial variation of crime within an urban setting, with the concept of collective efficacy. The collective efficacy of a neighborhood is defined as social cohesion among neighbors combined with their willingness to intervene on behalf of the common good. Sociologists measure collective efficacy by conducting survey studies designed to measure individuals' perception of their community. In this work, we employ the curated data from a survey study (ground truth) and examine the effectiveness of substituting costly survey questionnaires with proxies derived from social media. We enrich a corpus of tweets mentioning a local venue with several linguistic and topological features. We then propose a pairwise learning to rank model with the goal of identifying a ranking of neighborhoods that is similar to the ranking obtained from the ground truth collective efficacy values. In our experiments, we find that our generated ranking of neighborhoods achieves 0.77 Kendall tau-x ranking agreement with the ground truth ranking. Overall, our results are up to 37% better than traditional baselines.
△ Less
Submitted 14 November, 2019;
originally announced November 2019.
-
Network Representation Learning: Consolidation and Renewed Bearing
Authors:
Saket Gurukar,
Priyesh Vijayan,
Aakash Srinivasan,
Goonmeet Bajaj,
Chen Cai,
Moniba Keymanesh,
Saravana Kumar,
Pranav Maneriker,
Anasua Mitra,
Vedang Patel,
Balaraman Ravindran,
Srinivasan Parthasarathy
Abstract:
Graphs are a natural abstraction for many problems where nodes represent entities and edges represent a relationship across entities. An important area of research that has emerged over the last decade is the use of graphs as a vehicle for non-linear dimensionality reduction in a manner akin to previous efforts based on manifold learning with uses for downstream database processing, machine learni…
▽ More
Graphs are a natural abstraction for many problems where nodes represent entities and edges represent a relationship across entities. An important area of research that has emerged over the last decade is the use of graphs as a vehicle for non-linear dimensionality reduction in a manner akin to previous efforts based on manifold learning with uses for downstream database processing, machine learning and visualization. In this systematic yet comprehensive experimental survey, we benchmark several popular network representation learning methods operating on two key tasks: link prediction and node classification. We examine the performance of 12 unsupervised embedding methods on 15 datasets. To the best of our knowledge, the scale of our study -- both in terms of the number of methods and number of datasets -- is the largest to date.
Our results reveal several key insights about work-to-date in this space. First, we find that certain baseline methods (task-specific heuristics, as well as classic manifold methods) that have often been dismissed or are not considered by previous efforts can compete on certain types of datasets if they are tuned appropriately. Second, we find that recent methods based on matrix factorization offer a small but relatively consistent advantage over alternative methods (e.g., random-walk based methods) from a qualitative standpoint. Specifically, we find that MNMF, a community preserving embedding method, is the most competitive method for the link prediction task. While NetMF is the most competitive baseline for node classification. Third, no single method completely outperforms other embedding methods on both node classification and link prediction tasks. We also present several drill-down analysis that reveals settings under which certain algorithms perform well (e.g., the role of neighborhood context on performance) -- guiding the end-user.
△ Less
Submitted 15 June, 2019; v1 submitted 2 May, 2019;
originally announced May 2019.
-
MILE: A Multi-Level Framework for Scalable Graph Embedding
Authors:
Jiongqian Liang,
Saket Gurukar,
Srinivasan Parthasarathy
Abstract:
Recently there has been a surge of interest in designing graph embedding methods. Few, if any, can scale to a large-sized graph with millions of nodes due to both computational complexity and memory requirements. In this paper, we relax this limitation by introducing the MultI-Level Embedding (MILE) framework -- a generic methodology allowing contemporary graph embedding methods to scale to large…
▽ More
Recently there has been a surge of interest in designing graph embedding methods. Few, if any, can scale to a large-sized graph with millions of nodes due to both computational complexity and memory requirements. In this paper, we relax this limitation by introducing the MultI-Level Embedding (MILE) framework -- a generic methodology allowing contemporary graph embedding methods to scale to large graphs. MILE repeatedly coarsens the graph into smaller ones using a hybrid matching technique to maintain the backbone structure of the graph. It then applies existing embedding methods on the coarsest graph and refines the embeddings to the original graph through a graph convolution neural network that it learns. The proposed MILE framework is agnostic to the underlying graph embedding techniques and can be applied to many existing graph embedding methods without modifying them. We employ our framework on several popular graph embedding techniques and conduct embedding for real-world graphs. Experimental results on five large-scale datasets demonstrate that MILE significantly boosts the speed (order of magnitude) of graph embedding while generating embeddings of better quality, for the task of node classification. MILE can comfortably scale to a graph with 9 million nodes and 40 million edges, on which existing methods run out of memory or take too long to compute on a modern workstation. Our code and data are publicly available with detailed instructions for adding new base embedding methods: \url{https://github.com/jiongqian/MILE}.
△ Less
Submitted 13 August, 2020; v1 submitted 26 February, 2018;
originally announced February 2018.