-
Towards Fairer Health Recommendations: finding informative unbiased samples via Word Sense Disambiguation
Authors:
Gavin Butts,
Pegah Emdad,
Jethro Lee,
Shannon Song,
Chiman Salavati,
Willmar Sosa Diaz,
Shiri Dori-Hacohen,
Fabricio Murai
Abstract:
There have been growing concerns around high-stake applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework t…
▽ More
There have been growing concerns around high-stake applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold standard dataset containing 4,105 excerpts annotated by medical experts for bias from a large corpus. We build on previous work by coauthors which augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings (e.g., "white matter of spinal cord"). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We found LLMs, considered SOTA on many NLP tasks, unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
△ Less
Submitted 11 September, 2024;
originally announced September 2024.
-
Hidden or Inferred: Fair Learning-To-Rank with Unknown Demographics
Authors:
Oluseun Olulana,
Kathleen Cachel,
Fabricio Murai,
Elke Rundensteiner
Abstract:
As learning-to-rank models are increasingly deployed for decision-making in areas with profound life implications, the FairML community has been developing fair learning-to-rank (LTR) models. These models rely on the availability of sensitive demographic features such as race or sex. However, in practice, regulatory obstacles and privacy concerns protect this data from collection and use. As a res…
▽ More
As learning-to-rank models are increasingly deployed for decision-making in areas with profound life implications, the FairML community has been developing fair learning-to-rank (LTR) models. These models rely on the availability of sensitive demographic features such as race or sex. However, in practice, regulatory obstacles and privacy concerns protect this data from collection and use. As a result, practitioners may either need to promote fairness despite the absence of these features or turn to demographic inference tools to attempt to infer them. Given that these tools are fallible, this paper aims to further understand how errors in demographic inference impact the fairness performance of popular fair LTR strategies. In which cases would it be better to keep such demographic attributes hidden from models versus infer them? We examine a spectrum of fair LTR strategies ranging from fair LTR with and without demographic features hidden versus inferred to fairness-unaware LTR followed by fair re-ranking. We conduct a controlled empirical investigation modeling different levels of inference errors by systematically perturbing the inferred sensitive attribute. We also perform three case studies with real-world datasets and popular open-source inference methods. Our findings reveal that as inference noise grows, LTR-based methods that incorporate fairness considerations into the learning process may increase bias. In contrast, fair re-ranking strategies are more robust to inference errors. All source code, data, and experimental artifacts of our experimental study are available here: https://github.com/sewen007/hoiltr.git
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
Reducing Biases towards Minoritized Populations in Medical Curricular Content via Artificial Intelligence for Fairer Health Outcomes
Authors:
Chiman Salavati,
Shannon Song,
Willmar Sosa Diaz,
Scott A. Hale,
Roberto E. Montenegro,
Fabricio Murai,
Shiri Dori-Hacohen
Abstract:
Biased information (recently termed bisinformation) continues to be taught in medical curricula, often long after having been debunked. In this paper, we introduce BRICC, a firstin-class initiative that seeks to mitigate medical bisinformation using machine learning to systematically identify and flag text with potential biases, for subsequent review in an expert-in-the-loop fashion, thus greatly…
▽ More
Biased information (recently termed bisinformation) continues to be taught in medical curricula, often long after having been debunked. In this paper, we introduce BRICC, a firstin-class initiative that seeks to mitigate medical bisinformation using machine learning to systematically identify and flag text with potential biases, for subsequent review in an expert-in-the-loop fashion, thus greatly accelerating an otherwise labor-intensive process. A gold-standard BRICC dataset was developed throughout several years, and contains over 12K pages of instructional materials. Medical experts meticulously annotated these documents for bias according to comprehensive coding guidelines, emphasizing gender, sex, age, geography, ethnicity, and race. Using this labeled dataset, we trained, validated, and tested medical bias classifiers. We test three classifier approaches: a binary type-specific classifier, a general bias classifier; an ensemble combining bias type-specific classifiers independently-trained; and a multitask learning (MTL) model tasked with predicting both general and type-specific biases. While MTL led to some improvement on race bias detection in terms of F1-score, it did not outperform binary classifiers trained specifically on each task. On general bias detection, the binary classifier achieves up to 0.923 of AUC, a 27.8% improvement over the baseline. This work lays the foundations for debiasing medical curricula by exploring a novel dataset and evaluating different training model strategies. Hence, it offers new pathways for more nuanced and effective mitigation of bisinformation.
△ Less
Submitted 21 May, 2024;
originally announced July 2024.
-
General Performance Evaluation for Competitive Resource Allocation Games via Unseen Payoff Estimation
Authors:
N'yoma Diamond,
Fabricio Murai
Abstract:
Many high-stakes decision-making problems, such as those found within cybersecurity and economics, can be modeled as competitive resource allocation games. In these games, multiple players must allocate limited resources to overcome their opponent(s), while minimizing any induced individual losses. However, existing means of assessing the performance of resource allocation algorithms are highly di…
▽ More
Many high-stakes decision-making problems, such as those found within cybersecurity and economics, can be modeled as competitive resource allocation games. In these games, multiple players must allocate limited resources to overcome their opponent(s), while minimizing any induced individual losses. However, existing means of assessing the performance of resource allocation algorithms are highly disparate and problem-dependent. As a result, evaluating such algorithms is unreliable or impossible in many contexts and applications, especially when considering differing levels of feedback. To resolve this problem, we propose a generalized definition of payoff which uses an arbitrary user-provided function. This unifies performance evaluation under all contexts and levels of feedback. Using this definition, we develop metrics for evaluating player performance, and estimators to approximate them under uncertainty (i.e., bandit or semi-bandit feedback). These metrics and their respective estimators provide a problem-agnostic means to contextualize and evaluate algorithm performance. To validate the accuracy of our estimator, we explore the Colonel Blotto ($\mathcal{CB}$) game as an example. To this end, we propose a graph-pruning approach to efficiently identify feasible opponent decisions, which are used in computing our estimation metrics. Using various resource allocation algorithms and game parameters, a suite of $\mathcal{CB}$ games are simulated and used to compute and evaluate the quality of our estimates. These simulations empirically show our approach to be highly accurate at estimating the metrics associated with the unseen outcomes of an opponent's latent behavior.
△ Less
Submitted 8 January, 2024;
originally announced January 2024.
-
Towards Detecting Cascades of Biased Medical Claims on Twitter
Authors:
Libby Tiderman,
Juan Sanchez Mercedes,
Fiona Romanoschi,
Fabricio Murai
Abstract:
Social media may disseminate medical claims that highlight misleading correlations between social identifiers and diseases due to not accounting for structural determinants of health. Our research aims to identify biased medical claims on Twitter and measure their spread. We propose a machine learning framework that uses two models in tandem: RoBERTa to detect medical claims and DistilBERT to clas…
▽ More
Social media may disseminate medical claims that highlight misleading correlations between social identifiers and diseases due to not accounting for structural determinants of health. Our research aims to identify biased medical claims on Twitter and measure their spread. We propose a machine learning framework that uses two models in tandem: RoBERTa to detect medical claims and DistilBERT to classify bias. After identifying original biased medical claims, we conducted a retweet cascade analysis, computing their individual reach and rate of spread. Tweets containing biased claims were found to circulate faster and further than unbiased claims.
△ Less
Submitted 22 December, 2023;
originally announced December 2023.
-
Helping Fact-Checkers Identify Fake News Stories Shared through Images on WhatsApp
Authors:
Julio C. S. Reis,
Philipe Melo,
Fabiano Belém,
Fabricio Murai,
Jussara M. Almeida,
Fabricio Benevenuto
Abstract:
WhatsApp has introduced a novel avenue for smartphone users to engage with and disseminate news stories. The convenience of forming interest-based groups and seamlessly sharing content has rendered WhatsApp susceptible to the exploitation of misinformation campaigns. While the process of fact-checking remains a potent tool in identifying fabricated news, its efficacy falters in the face of the unp…
▽ More
WhatsApp has introduced a novel avenue for smartphone users to engage with and disseminate news stories. The convenience of forming interest-based groups and seamlessly sharing content has rendered WhatsApp susceptible to the exploitation of misinformation campaigns. While the process of fact-checking remains a potent tool in identifying fabricated news, its efficacy falters in the face of the unprecedented deluge of information generated on the Internet today. In this work, we explore automatic ranking-based strategies to propose a "fakeness score" model as a means to help fact-checking agencies identify fake news stories shared through images on WhatsApp. Based on the results, we design a tool and integrate it into a real system that has been used extensively for monitoring content during the 2018 Brazilian general election. Our experimental evaluation shows that this tool can reduce by up to 40% the amount of effort required to identify 80% of the fake news in the data when compared to current mechanisms practiced by the fact-checking agencies for the selection of news stories to be checked.
△ Less
Submitted 28 August, 2023;
originally announced August 2023.
-
DELATOR: Money Laundering Detection via Multi-Task Learning on Large Transaction Graphs
Authors:
Henrique S. Assumpção,
Fabrício Souza,
Leandro Lacerda Campos,
Vinícius T. de Castro Pires,
Paulo M. Laurentys de Almeida,
Fabricio Murai
Abstract:
Money laundering has become one of the most relevant criminal activities in modern societies, as it causes massive financial losses for governments, banks and other institutions. Detecting such activities is among the top priorities when it comes to financial analysis, but current approaches are often costly and labor intensive partly due to the sheer amount of data to be analyzed. Hence, there is…
▽ More
Money laundering has become one of the most relevant criminal activities in modern societies, as it causes massive financial losses for governments, banks and other institutions. Detecting such activities is among the top priorities when it comes to financial analysis, but current approaches are often costly and labor intensive partly due to the sheer amount of data to be analyzed. Hence, there is a growing need for automatic anti-money laundering systems to assist experts. In this work, we propose DELATOR, a novel framework for detecting money laundering activities based on graph neural networks that learn from large-scale temporal graphs. DELATOR provides an effective and efficient method for learning from heavily imbalanced graph data, by adapting concepts from the GraphSMOTE framework and incorporating elements of multi-task learning to obtain rich node embeddings for node classification. DELATOR outperforms all considered baselines, including an off-the-shelf solution from Amazon AWS by 23% with respect to AUC-ROC. We also conducted real experiments that led to the discovery of 7 new suspicious cases among the 50 analyzed ones, which have been reported to the authorities.
△ Less
Submitted 24 October, 2022; v1 submitted 20 May, 2022;
originally announced May 2022.
-
Top-Down Deep Clustering with Multi-generator GANs
Authors:
Daniel de Mello,
Renato Assunção,
Fabricio Murai
Abstract:
Deep clustering (DC) leverages the representation power of deep architectures to learn embedding spaces that are optimal for cluster analysis. This approach filters out low-level information irrelevant for clustering and has proven remarkably successful for high dimensional data spaces. Some DC methods employ Generative Adversarial Networks (GANs), motivated by the powerful latent representations…
▽ More
Deep clustering (DC) leverages the representation power of deep architectures to learn embedding spaces that are optimal for cluster analysis. This approach filters out low-level information irrelevant for clustering and has proven remarkably successful for high dimensional data spaces. Some DC methods employ Generative Adversarial Networks (GANs), motivated by the powerful latent representations these models are able to learn implicitly. In this work, we propose HC-MGAN, a new technique based on GANs with multiple generators (MGANs), which have not been explored for clustering. Our method is inspired by the observation that each generator of a MGAN tends to generate data that correlates with a sub-region of the real data distribution. We use this clustered generation to train a classifier for inferring from which generator a given image came from, thus providing a semantically meaningful clustering for the real distribution. Additionally, we design our method so that it is performed in a top-down hierarchical clustering tree, thus proposing the first hierarchical DC method, to the best of our knowledge. We conduct several experiments to evaluate the proposed method against recent DC methods, obtaining competitive results. Last, we perform an exploratory analysis of the hierarchical clustering tree that highlights how accurately it organizes the data in a hierarchy of semantically coherent patterns.
△ Less
Submitted 24 December, 2021; v1 submitted 6 December, 2021;
originally announced December 2021.
-
On the Dynamics of Political Discussions on Instagram: A Network Perspective
Authors:
Carlos H. G. Ferreira,
Fabricio Murai,
Ana P. C. Silva,
Jussara M. Almeida,
Martino Trevisan,
Luca Vassio,
Marco Mellia,
Idilio Drago
Abstract:
Instagram has been increasingly used as a source of information especially among the youth. As a result, political figures now leverage the platform to spread opinions and political agenda. We here analyze online discussions on Instagram, notably in political topics, from a network perspective. Specifically, we investigate the emergence of communities of co-commenters, that is, groups of users who…
▽ More
Instagram has been increasingly used as a source of information especially among the youth. As a result, political figures now leverage the platform to spread opinions and political agenda. We here analyze online discussions on Instagram, notably in political topics, from a network perspective. Specifically, we investigate the emergence of communities of co-commenters, that is, groups of users who often interact by commenting on the same posts and may be driving the ongoing online discussions. In particular, we are interested in salient co-interactions, i.e., interactions of co-commenters that occur more often than expected by chance and under independent behavior. Unlike casual and accidental co-interactions which normally happen in large volumes, salient co-interactions are key elements driving the online discussions and, ultimately, the information dissemination. We base our study on the analysis of 10 weeks of data centered around major elections in Brazil and Italy, following both politicians and other celebrities. We extract and characterize the communities of co-commenters in terms of topological structure, properties of the discussions carried out by community members, and how some community properties, notably community membership and topics, evolve over time. We show that communities discussing political topics tend to be more engaged in the debate by writing longer comments, using more emojis, hashtags and negative words than in other subjects. Also, communities built around political discussions tend to be more dynamic, although top commenters remain active and preserve community membership over time. Moreover, we observe a great diversity in discussed topics over time: whereas some topics attract attention only momentarily, others, centered around more fundamental political discussions, remain consistently active over time.
△ Less
Submitted 13 September, 2022; v1 submitted 19 September, 2021;
originally announced September 2021.
-
Heterogeneous download times in bandwidth-homogeneous BitTorrent swarms
Authors:
Fabricio Murai,
Antonio A. de A. Rocha,
Daniel R. Figueiredo,
Edmundo A. de Souza e Silva
Abstract:
Modeling and understanding BitTorrent (BT) dynamics is a recurrent research topic mainly due to its high complexity and tremendous practical efficiency. Over the years, different models have uncovered various phenomena exhibited by the system, many of which have direct impact on its performance. In this paper we identify and characterize a phenomenon that has not been previously observed: homogene…
▽ More
Modeling and understanding BitTorrent (BT) dynamics is a recurrent research topic mainly due to its high complexity and tremendous practical efficiency. Over the years, different models have uncovered various phenomena exhibited by the system, many of which have direct impact on its performance. In this paper we identify and characterize a phenomenon that has not been previously observed: homogeneous peers (with respect to their upload capacities) experience heterogeneous download times. This behavior has direct impact on peer and system performance, such as high variability of download times, unfairness with respect to peer arrival order, bursty departures and content synchronization. Detailed packet-level simulations and prototype-based experiments on the Internet were performed to characterize this phenomenon. We also develop a mathematical model that accurately predicts the heterogeneous download rates of the homogeneous peers as a function of their content. In addition, we apply the model to calculate lower and upper bounds to the number of departures that occur in a burst. The heterogeneous download rates are more prevalent in unpopular swarms (very few peers). Although few works have addressed this kind of swarm, these by far represent the most common type of swarm in BT.
△ Less
Submitted 17 September, 2021;
originally announced September 2021.
-
Fairness via AI: Bias Reduction in Medical Information
Authors:
Shiri Dori-Hacohen,
Roberto Montenegro,
Fabricio Murai,
Scott A. Hale,
Keen Sung,
Michela Blain,
Jennifer Edwards-Johnson
Abstract:
Most Fairness in AI research focuses on exposing biases in AI systems. A broader lens on fairness reveals that AI can serve a greater aspiration: rooting out societal inequities from their source. Specifically, we focus on inequities in health information, and aim to reduce bias in that domain using AI. The AI algorithms under the hood of search engines and social media, many of which are based on…
▽ More
Most Fairness in AI research focuses on exposing biases in AI systems. A broader lens on fairness reveals that AI can serve a greater aspiration: rooting out societal inequities from their source. Specifically, we focus on inequities in health information, and aim to reduce bias in that domain using AI. The AI algorithms under the hood of search engines and social media, many of which are based on recommender systems, have an outsized impact on the quality of medical and health information online. Therefore, embedding bias detection and reduction into these recommender systems serving up medical and health content online could have an outsized positive impact on patient outcomes and wellbeing.
In this position paper, we offer the following contributions: (1) we propose a novel framework of Fairness via AI, inspired by insights from medical education, sociology and antiracism; (2) we define a new term, bisinformation, which is related to, but distinct from, misinformation, and encourage researchers to study it; (3) we propose using AI to study, detect and mitigate biased, harmful, and/or false health information that disproportionately hurts minority groups in society; and (4) we suggest several pillars and pose several open problems in order to seed inquiry in this new space. While part (3) of this work specifically focuses on the health domain, the fundamental computer science advances and contributions stemming from research efforts in bias reduction and Fairness via AI have broad implications in all areas of society.
△ Less
Submitted 5 September, 2021;
originally announced September 2021.
-
Machine Learning for Performance Prediction of Spark Cloud Applications
Authors:
Alexandre Maros,
Fabricio Murai,
Ana Paula Couto da Silva,
Jussara M. Almeida,
Marco Lattuada,
Eugenio Gianniti,
Marjan Hosseini,
Danilo Ardagna
Abstract:
Big data applications and analytics are employed in many sectors for a variety of goals: improving customers satisfaction, predicting market behavior or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the und…
▽ More
Big data applications and analytics are employed in many sectors for a variety of goals: improving customers satisfaction, predicting market behavior or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the underlying resources at runtime. Machine Learning (ML), providing black box solutions to model the relationship between application performance and system configuration without requiring in-detail knowledge of the system, has become a popular way of predicting the performance of big data applications. We investigate the cost-benefits of using supervised ML models for predicting the performance of applications on Spark, one of today's most widely used frameworks for big data analysis. We compare our approach with \textit{Ernest} (an ML-based technique proposed in the literature by the Spark inventors) on a range of scenarios, application workloads, and cloud system configurations. Our experiments show that Ernest can accurately estimate the performance of very regular applications, but it fails when applications exhibit more irregular patterns and/or when extrapolating on bigger data set sizes. Results show that our models match or exceed Ernest's performance, sometimes enabling us to reduce the prediction error from 126-187% to only 5-19%.
△ Less
Submitted 27 August, 2021;
originally announced August 2021.
-
How effective are Graph Neural Networks in Fraud Detection for Network Data?
Authors:
Ronald D. R. Pereira,
Fabrício Murai
Abstract:
Graph-based Neural Networks (GNNs) are recent models created for learning representations of nodes (and graphs), which have achieved promising results when detecting patterns that occur in large-scale data relating different entities. Among these patterns, financial fraud stands out for its socioeconomic relevance and for presenting particular challenges, such as the extreme imbalance between the…
▽ More
Graph-based Neural Networks (GNNs) are recent models created for learning representations of nodes (and graphs), which have achieved promising results when detecting patterns that occur in large-scale data relating different entities. Among these patterns, financial fraud stands out for its socioeconomic relevance and for presenting particular challenges, such as the extreme imbalance between the positive (fraud) and negative (legitimate transactions) classes, and the concept drift (i.e., statistical properties of the data change over time). Since GNNs are based on message propagation, the representation of a node is strongly impacted by its neighbors and by the network's hubs, amplifying the imbalance effects. Recent works attempt to adapt undersampling and oversampling strategies for GNNs in order to mitigate this effect without, however, accounting for concept drift. In this work, we conduct experiments to evaluate existing techniques for detecting network fraud, considering the two previous challenges. For this, we use real data sets, complemented by synthetic data created from a new methodology introduced here. Based on this analysis, we propose a series of improvement points that should be investigated in future research.
△ Less
Submitted 30 May, 2021;
originally announced May 2021.
-
Evaluating the state-of-the-art in mapping research spaces: a Brazilian case study
Authors:
Francisco Galuppo Azevedo,
Fabricio Murai
Abstract:
Scientific knowledge cannot be seen as a set of isolated fields, but as a highly connected network. Understanding how research areas are connected is of paramount importance for adequately allocating funding and human resources (e.g., assembling teams to tackle multidisciplinary problems). The relationship between disciplines can be drawn from data on the trajectory of individual scientists, as re…
▽ More
Scientific knowledge cannot be seen as a set of isolated fields, but as a highly connected network. Understanding how research areas are connected is of paramount importance for adequately allocating funding and human resources (e.g., assembling teams to tackle multidisciplinary problems). The relationship between disciplines can be drawn from data on the trajectory of individual scientists, as researchers often make contributions in a small set of interrelated areas. Two recent works propose methods for creating research maps from scientists' publication records: by using a frequentist approach to create a transition probability matrix; and by learning embeddings (vector representations). Surprisingly, these models were evaluated on different datasets and have never been compared in the literature. In this work, we compare both models in a systematic way, using a large dataset of publication records from Brazilian researchers. We evaluate these models' ability to predict whether a given entity (scientist, institution or region) will enter a new field w.r.t. the area under the ROC curve. Moreover, we analyze how sensitive each method is to the number of publications and the number of fields associated to one entity. Last, we conduct a case study to showcase how these models can be used to characterize science dynamics in the context of Brazil.
△ Less
Submitted 7 April, 2021;
originally announced April 2021.
-
Am I fit for this physical activity? Neural embedding of physical conditioning from inertial sensors
Authors:
Davi Pedrosa de Aguiar,
Fabricio Murai
Abstract:
Inertial Measurement Unit (IMU) sensors are present in everyday devices such as smartphones and fitness watches. As a result, the array of health-related research and applications that tap onto this data has been growing, but little attention has been devoted to the prediction of an individual's heart rate (HR) from IMU data, when undergoing a physical activity. Would that be even possible? If so,…
▽ More
Inertial Measurement Unit (IMU) sensors are present in everyday devices such as smartphones and fitness watches. As a result, the array of health-related research and applications that tap onto this data has been growing, but little attention has been devoted to the prediction of an individual's heart rate (HR) from IMU data, when undergoing a physical activity. Would that be even possible? If so, this could be used to design personalized sets of aerobic exercises, for instance. In this work, we show that it is viable to obtain accurate HR predictions from IMU data using Recurrent Neural Networks, provided only access to HR and IMU data from a short-lived, previously executed activity. We propose a novel method for initializing an RNN's hidden state vectors, using a specialized network that attempts to extract an embedding of the physical conditioning (PCE) of a subject. We show that using a discriminator in the training phase to help the model learn whether two PCEs belong to the same individual further reduces the prediction error. We evaluate the proposed model when predicting the HR of 23 subjects performing a variety of physical activities from IMU data available in public datasets (PAMAP2, PPG-DaLiA). For comparison, we use as baselines the only model specifically proposed for this task and an adapted state-of-the-art model for Human Activity Recognition (HAR), a closely related task. Our method, PCE-LSTM, yields over 10% lower mean absolute error. We demonstrate empirically that this error reduction is in part due to the use of the PCE. Last, we use the two datasets (PPG-DaLiA, WESAD) to show that PCE-LSTM can also be successfully applied when photoplethysmography (PPG) sensors are available, outperforming the state-of-the-art deep learning baselines by more than 30%.
△ Less
Submitted 19 August, 2021; v1 submitted 22 March, 2021;
originally announced March 2021.
-
Characterizing (Un)moderated Textual Data in Social Systems
Authors:
Lucas Henrique Costa de Lima,
Julio Reis,
Philipe Melo,
Fabricio Murai,
Fabricio Benevenuto
Abstract:
Despite the valuable social interactions that online media promote, these systems provide space for speech that would be potentially detrimental to different groups of people. The moderation of content imposed by many social media has motivated the emergence of a new social system for free speech named Gab, which lacks moderation of content. This article characterizes and compares moderated textua…
▽ More
Despite the valuable social interactions that online media promote, these systems provide space for speech that would be potentially detrimental to different groups of people. The moderation of content imposed by many social media has motivated the emergence of a new social system for free speech named Gab, which lacks moderation of content. This article characterizes and compares moderated textual data from Twitter with a set of unmoderated data from Gab. In particular, we analyze distinguishing characteristics of moderated and unmoderated content in terms of linguistic features, evaluate hate speech and its different forms in both environments. Our work shows that unmoderated content presents different psycholinguistic features, more negative sentiment and higher toxicity. Our findings support that unmoderated environments may have proportionally more online hate speech. We hope our analysis and findings contribute to the debate about hate speech and benefit systems aiming at deploying hate speech detection approaches.
△ Less
Submitted 4 January, 2021;
originally announced January 2021.
-
Predicting User Emotional Tone in Mental Disorder Online Communities
Authors:
Bárbara Silveira,
Henrique S. Silva,
Fabricio Murai,
Ana Paula Couto da Silva
Abstract:
In recent years, Online Social Networks have become an important medium for people who suffer from mental disorders to share moments of hardship, and receive emotional and informational support. In this work, we analyze how discussions in Reddit communities related to mental disorders can help improve the health conditions of their users. Using the emotional tone of users' writing as a proxy for e…
▽ More
In recent years, Online Social Networks have become an important medium for people who suffer from mental disorders to share moments of hardship, and receive emotional and informational support. In this work, we analyze how discussions in Reddit communities related to mental disorders can help improve the health conditions of their users. Using the emotional tone of users' writing as a proxy for emotional state, we uncover relationships between user interactions and state changes. First, we observe that authors of negative posts often write rosier comments after engaging in discussions, indicating that users' emotional state can improve due to social support. Second, we build models based on SOTA text embedding techniques and RNNs to predict shifts in emotional tone. This differs from most of related work, which focuses primarily on detecting mental disorders from user activity. We demonstrate the feasibility of accurately predicting the users' reactions to the interactions experienced in these platforms, and present some examples which illustrate that the models are correctly capturing the effects of comments on the author's emotional tone. Our models hold promising implications for interventions to provide support for people struggling with mental illnesses.
△ Less
Submitted 27 July, 2021; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Towards Understanding Political Interactions on Instagram
Authors:
Martino Trevisan,
Luca Vassio,
Idilio Drago,
Marco Mellia,
Fabricio Murai,
Flavio Figueiredo,
Ana Paula Couto da Silva,
Jussara M. Almeida
Abstract:
Online Social Networks (OSNs) allow personalities and companies to communicate directly with the public, bypassing filters of traditional medias. As people rely on OSNs to stay up-to-date, the political debate has moved online too. We witness the sudden explosion of harsh political debates and the dissemination of rumours in OSNs. Identifying such behaviour requires a deep understanding on how peo…
▽ More
Online Social Networks (OSNs) allow personalities and companies to communicate directly with the public, bypassing filters of traditional medias. As people rely on OSNs to stay up-to-date, the political debate has moved online too. We witness the sudden explosion of harsh political debates and the dissemination of rumours in OSNs. Identifying such behaviour requires a deep understanding on how people interact via OSNs during political debates. We present a preliminary study of interactions in a popular OSN, namely Instagram. We take Italy as a case study in the period before the 2019 European Elections. We observe the activity of top Italian Instagram profiles in different categories: politics, music, sport and show. We record their posts for more than two months, tracking "likes" and comments from users. Results suggest that profiles of politicians attract markedly different interactions than other categories. People tend to comment more, with longer comments, debating for longer time, with a large number of replies, most of which are not explicitly solicited. Moreover, comments tend to come from a small group of very active users. Finally, we witness substantial differences when comparing profiles of different parties.
△ Less
Submitted 4 May, 2021; v1 submitted 26 April, 2019;
originally announced April 2019.
-
Inside the Right-Leaning Echo Chambers: Characterizing Gab, an Unmoderated Social System
Authors:
Lucas Lima,
Julio C. S. Reis,
Philipe Melo,
Fabricio Murai,
Leandro Araújo,
Pantelis Vikatos,
Fabrício Benevenuto
Abstract:
The moderation of content in many social media systems, such as Twitter and Facebook, motivated the emergence of a new social network system that promotes free speech, named Gab. Soon after that, Gab has been removed from Google Play Store for violating the company's hate speech policy and it has been rejected by Apple for similar reasons. In this paper we characterize Gab, aiming at understanding…
▽ More
The moderation of content in many social media systems, such as Twitter and Facebook, motivated the emergence of a new social network system that promotes free speech, named Gab. Soon after that, Gab has been removed from Google Play Store for violating the company's hate speech policy and it has been rejected by Apple for similar reasons. In this paper we characterize Gab, aiming at understanding who are the users who joined it and what kind of content they share in this system. Our findings show that Gab is a very politically oriented system that hosts banned users from other social networks, some of them due to possible cases of hate speech and association with extremism. We provide the first measurement of news dissemination inside a right-leaning echo chamber, investigating a social media where readers are rarely exposed to content that cuts across ideological lines, but rather are fed with content that reinforces their current political or social views.
△ Less
Submitted 10 July, 2018;
originally announced July 2018.
-
Modelos de Resposta para Experimentos Randomizados em Redes Sociais de Larga Escala
Authors:
Francisco Galuppo Azevedo,
Bruno Demattos Nogueira,
Fabricio Murai,
Ana Paula Couto da Silva
Abstract:
A/B tests are randomized experiments frequently used by companies that offer services on the Web for assessing the impact of new features. During an experiment, each user is randomly redirected to one of two versions of the website, called treatments. Several response models were proposed to describe the behavior of a user in a social network website, where the treatment assigned to her neighbors…
▽ More
A/B tests are randomized experiments frequently used by companies that offer services on the Web for assessing the impact of new features. During an experiment, each user is randomly redirected to one of two versions of the website, called treatments. Several response models were proposed to describe the behavior of a user in a social network website, where the treatment assigned to her neighbors must be taken into account. However, there is no consensus as to which model should be applied to a given dataset. In this work, we propose a new response model, derive theoretical limits for the estimation error of several models, and obtain empirical results for cases where the response model was misspecified.
△ Less
Submitted 9 March, 2018;
originally announced March 2018.
-
Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps
Authors:
Fabricio Murai,
Bruno Ribeiro,
Don Towsley,
Pinghui Wang
Abstract:
Estimating distributions of node characteristics (labels) such as number of connections or citizenship of users in a social network via edge and node sampling is a vital part of the study of complex networks. Due to its low cost, sampling via a random walk (RW) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can…
▽ More
Estimating distributions of node characteristics (labels) such as number of connections or citizenship of users in a social network via edge and node sampling is a vital part of the study of complex networks. Due to its low cost, sampling via a random walk (RW) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can traverse edges regardless of their direction. Some RW methods have been designed for directed networks where edges coming into a node are not directly observable. In this work, we propose Directed Unbiased Frontier Sampling (DUFS), a sampling method based on a large number of coordinated walkers, each starting from a node chosen uniformly at random. It is applicable to directed networks with invisible incoming edges because it constructs, in real-time, an undirected graph consistent with the walkers trajectories, and due to the use of random jumps which prevent walkers from being trapped. DUFS generalizes previous RW methods and is suited for undirected networks and to directed networks regardless of in-edges visibility. We also propose an improved estimator of node label distributions that combines information from the initial walker locations with subsequent RW observations. We evaluate DUFS, compare it to other RW methods, investigate the impact of its parameters on estimation accuracy and provide practical guidelines for choosing them. In estimating out-degree distributions, DUFS yields significantly better estimates of the head of the distribution than other methods, while matching or exceeding estimation accuracy of the tail. Last, we show that DUFS outperforms uniform node sampling when estimating distributions of node labels of the top 10% largest degree nodes, even when sampling a node uniformly has the same cost as RW steps.
△ Less
Submitted 13 July, 2018; v1 submitted 23 March, 2017;
originally announced March 2017.
-
Selective Harvesting over Networks
Authors:
Fabricio Murai,
Diogo Rennó,
Bruno Ribeiro,
Gisele L. Pappa,
Don Towsley,
Krista Gile
Abstract:
Active search (AS) on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights under a query budget. However, in most networks, nodes, topology and edge weights are all initially unknown. We introduce selective harvesting, a variant of AS where the next node to be queried must be chosen among the neighbors of the current queri…
▽ More
Active search (AS) on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights under a query budget. However, in most networks, nodes, topology and edge weights are all initially unknown. We introduce selective harvesting, a variant of AS where the next node to be queried must be chosen among the neighbors of the current queried node set; the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and their node attributes) and their neighbors (without any node or edge attributes). Therefore, selective harvesting is a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario suffers from a tunnel vision effect: without recourse to independent sampling, the urge to query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of AS methods and standard classifiers. We find that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as an ensemble, but switching between classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried next. This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration and exploitation duality found in classic sequential decision problems. From these observations we propose D3TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity, matching or exceeding the performance of competing methods on seven real network datasets in our evaluation.
△ Less
Submitted 15 March, 2017;
originally announced March 2017.
-
Characterizing Branching Processes from Sampled Data
Authors:
Fabricio Murai,
Bruno Ribeiro,
Don Towsley,
Krista Gile
Abstract:
Branching processes model the evolution of populations of agents that randomly generate offsprings. These processes, more patently Galton-Watson processes, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical…
▽ More
Branching processes model the evolution of populations of agents that randomly generate offsprings. These processes, more patently Galton-Watson processes, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical challenge of modeling real phenomena using a Galton-Watson process is the offspring distribution, which must be measured from the population. In most cases, however, directly measuring the offspring distribution is unrealistic due to lack of resources or the death of agents. So far, researchers have relied on informed guesses to guide their choice of offspring distribution. In this work we propose two methods to estimate the offspring distribution from real sampled data. Using a small sampled fraction of the agents and instrumented with the identity of the ancestors of the sampled agents, we show that accurate offspring distribution estimates can be obtained by sampling as little as 14% of the population.
△ Less
Submitted 23 February, 2013;
originally announced February 2013.
-
On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling
Authors:
Fabricio Murai,
Bruno Ribeiro,
Don Towsley,
Pinghui Wang
Abstract:
In this work we study the set size distribution estimation problem, where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples. This problem has applications to capacity planning, network theory, among other areas. Examples of real-world applications include characterizing in-degree distributions in large…
▽ More
In this work we study the set size distribution estimation problem, where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples. This problem has applications to capacity planning, network theory, among other areas. Examples of real-world applications include characterizing in-degree distributions in large graphs and uncovering TCP/IP flow size distributions on the Internet. We demonstrate that it is hard to estimate the original set size distribution. The recoverability of original set size distributions presents a sharp threshold with respect to the fraction of elements that remain in the sets. If this fraction remains below a threshold, typically half of the elements in power-law and heavier-than-exponential-tailed distributions, then the original set size distribution is unrecoverable. We also discuss practical implications of our findings.
△ Less
Submitted 2 December, 2012; v1 submitted 4 September, 2012;
originally announced September 2012.
-
Heterogeneous download times in a homogeneous BitTorrent swarm
Authors:
Fabricio Murai,
Antonio A de A Rocha,
Daniel R. Figueiredo,
Edmundo de Souza e Silva
Abstract:
Modeling and understanding BitTorrent (BT) dynamics is a recurrent research topic mainly due to its high complexity and tremendous practical efficiency. Over the years, different models have uncovered various phenomena exhibited by the system, many of which have direct impact on its performance. In this paper we identify and characterize a phenomenon that has not been previously observed: homogene…
▽ More
Modeling and understanding BitTorrent (BT) dynamics is a recurrent research topic mainly due to its high complexity and tremendous practical efficiency. Over the years, different models have uncovered various phenomena exhibited by the system, many of which have direct impact on its performance. In this paper we identify and characterize a phenomenon that has not been previously observed: homogeneous peers (with respect to their upload capacities) experience heterogeneous download rates. The consequences of this phenomenon have direct impact on peer and system performance, such as high variability of download times, unfairness with respect to peer arrival order, bursty departures and content synchronization. Detailed packet-level simulations and prototype-based experiments on the Internet were performed to characterize this phenomenon. We also develop a mathematical model that accurately predicts the heterogeneous download rates of the homogeneous peers as a function of their content. Although this phenomenon is more prevalent in unpopular swarms (very few peers), these by far represent the most common type of swarm in BT.
△ Less
Submitted 18 February, 2011; v1 submitted 17 February, 2011;
originally announced February 2011.