Artificial Intuition: Efficient Classification of Scientific Abstracts

Authors: Harsh Sakhrani, Naseela Pervez, Anirudh Ravi Kumar, Fred Morstatter, Alexandra Graddy Reed, Andrea Belz

Abstract: It is desirable to coarsely classify short scientific texts, such as grant or publication abstracts, for strategic insight or research portfolio management. These texts efficiently transmit dense information to experts possessing a rich body of knowledge to aid interpretation. Yet this task is remarkably difficult to automate because of brevity and the absence of context. To address this gap, we have developed a novel approach to generate and appropriately assign coarse domain-specific labels. We show that a Large Language Model (LLM) can provide metadata essential to the task, in a process akin to the augmentation of supplemental knowledge representing human intuition, and propose a workflow. As a pilot study, we use a corpus of award abstracts from the National Aeronautics and Space Administration (NASA). We develop new assessment tools in concert with established performance metrics.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.03594 [pdf, other]

UniPlane: Unified Plane Detection and Reconstruction from Posed Monocular Videos

Authors: Yuzhong Huang, Chen Liu, Ji Hou, Ke Huo, Shiyu Dong, Fred Morstatter

Abstract: We present UniPlane, a novel method that unifies plane detection and reconstruction from posed monocular videos. Unlike existing methods that detect planes from local observations and associate them across the video for the final reconstruction, UniPlane unifies both the detection and the reconstruction tasks in a single network, which allows us to directly optimize final reconstruction quality and fully leverage temporal information. Specifically, we build a Transformers-based deep neural network that jointly constructs a 3D feature volume for the environment and estimates a set of per-plane embeddings as queries. UniPlane directly reconstructs the 3D planes by taking dot products between voxel embeddings and the plane embeddings followed by binary thresholding. Extensive experiments on real-world datasets demonstrate that UniPlane outperforms state-of-the-art methods in both plane detection and reconstruction tasks, achieving +4.6 in F-score in geometry as well as consistent improvements in other geometry and segmentation metrics.

Submitted 3 July, 2024; originally announced July 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2206.07710 by other authors

arXiv:2406.10000 [pdf, other]

OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control

Authors: Yuzhong Huang, Zhong Li, Zhang Chen, Zhiyuan Ren, Guosheng Lin, Fred Morstatter, Yi Xu

Abstract: In the evolving landscape of text-to-3D technology, Dreamfusion has showcased its proficiency by utilizing Score Distillation Sampling (SDS) to optimize implicit representations such as NeRF. This process is achieved through the distillation of pretrained large-scale text-to-image diffusion models. However, Dreamfusion encounters fidelity and efficiency constraints: it faces the multi-head Janus issue and exhibits a relatively slow optimization process. To circumvent these challenges, we introduce OrientDream, a camera orientation conditioned framework designed for efficient and multi-view consistent 3D generation from textual prompts. Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module. This feature effectively utilizes data from MVImgNet, an extensive external multi-view dataset, to refine and bolster its functionality. Subsequently, we utilize the pre-conditioned 2D images as a basis for optimizing a randomly initialized implicit representation (NeRF). This process is significantly expedited by a decoupled back-propagation technique, allowing for multiple updates of implicit parameters per optimization cycle. Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods, as quantified by comparative metrics.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.00020 [pdf, other]

Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

Authors: Rebecca Dorn, Lee Kezar, Fred Morstatter, Kristina Lerman

Abstract: Content moderation on social media platforms shapes the dynamics of online discourse, influencing whose voices are amplified and whose are suppressed. Recent studies have raised concerns about the fairness of content moderation practices, particularly for aggressively flagging posts from transgender and non-binary individuals as toxic. In this study, we investigate the presence of bias in harmful speech classification of gender-queer dialect online, focusing specifically on the treatment of reclaimed slurs. We introduce a novel dataset, QueerReclaimLex, based on 109 curated templates exemplifying non-derogatory uses of LGBTQ+ slurs. Dataset instances are scored by gender-queer annotators for potential harm depending on additional context about speaker identity. We systematically evaluate the performance of five off-the-shelf language models in assessing the harm of these texts and explore the effectiveness of chain-of-thought prompting to teach large language models (LLMs) to leverage author identity context. We reveal a tendency for these models to inaccurately flag texts authored by gender-queer individuals as harmful. Strikingly, across all LLMs the performance is poorest for texts that show signs of being written by individuals targeted by the featured slur (F1 <= 0.24). We highlight an urgent need for fairness and inclusivity in content moderation systems. By uncovering these biases, this work aims to inform the development of more equitable content moderation practices and contribute to the creation of inclusive online spaces for all users.

Submitted 21 June, 2024; v1 submitted 23 May, 2024; originally announced June 2024.

arXiv:2405.20457 [pdf, other]

Online network topology shapes personal narratives and hashtag generation

Authors: J. Hunter Priniski, Bryce Linford, Sai Krishna, Fred Morstatter, Jeff Brantingham, Hongjing Lu

Abstract: While narratives have shaped cognition and cultures for centuries, digital media and online social networks have introduced new narrative phenomena. With increased narrative agency, networked groups of individuals can directly contribute and steer narratives that center our collective discussions of politics, science, and morality. We report the results of an online network experiment on narrative and hashtag generation, in which networked groups of participants interpreted a text-based narrative of a disaster event, and were incentivized to produce matching hashtags with their network neighbors. We found that network structure not only influences the emergence of dominant beliefs through coordination with network neighbors, but also impacts participants' use of causal language in their personal narratives.

Submitted 30 May, 2024; originally announced May 2024.

Comments: Will be published in the 2024 Proceedings of the Cognitive Science Society

arXiv:2404.11045 [pdf, other]

Offset Unlearning for Large Language Models

Authors: James Y. Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, Muhao Chen

Abstract: Despite the strong capabilities of Large Language Models (LLMs) to acquire knowledge from their training corpora, the memorization of sensitive information in the corpora such as copyrighted, harmful, and private content has led to ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either not applicable to black-box LLMs due to required access to model internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose $δ$-unlearning, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, $δ$-unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that $δ$-unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. $δ$-unlearning also effectively incorporates different unlearning algorithms, making our approach a versatile solution to adapting various existing unlearning algorithms to black-box LLMs.

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.00267 [pdf, other]

Secret Keepers: The Impact of LLMs on Linguistic Markers of Personal Traits

Authors: Zhivar Sourati, Meltem Ozcan, Colin McDaniel, Alireza Ziabari, Nuan Wen, Ala Tak, Fred Morstatter, Morteza Dehghani

Abstract: Prior research has established associations between individuals' language usage and their personal traits; our linguistic patterns reveal information about our personalities, emotional states, and beliefs. However, with the increasing adoption of Large Language Models (LLMs) as writing assistants in everyday writing, a critical question emerges: are authors' linguistic patterns still predictive of their personal traits when LLMs are involved in the writing process? We investigate the impact of LLMs on the linguistic markers of demographic and psychological traits, specifically examining three LLMs - GPT3.5, Llama 2, and Gemini - across six different traits: gender, age, political affiliation, personality, empathy, and morality. Our findings indicate that although the use of LLMs slightly reduces the predictive power of linguistic patterns over authors' personal traits, the significant changes are infrequent, and the use of LLMs does not fully diminish the predictive power of authors' linguistic patterns over their personal traits. We also note that some theoretically established lexical-based linguistic markers lose their reliability as predictors when LLMs are used in the writing process. Our findings have important implications for the study of linguistic markers of personal traits in the age of LLMs.

Submitted 3 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.14988 [pdf, other]

Risk and Response in Large Language Models: Evaluating Key Threat Categories

Authors: Bahareh Harandizadeh, Abel Salinas, Fred Morstatter

Abstract: This paper explores the pressing issue of risk assessment in Large Language Models (LLMs) as they become increasingly prevalent in various applications. Focusing on how reward models, which are designed to fine-tune pretrained LLMs to align with human values, perceive and categorize different types of risks, we delve into the challenges posed by the subjective nature of preference-based training data. By utilizing the Anthropic Red-team dataset, we analyze major risk categories, including Information Hazards, Malicious Uses, and Discrimination/Hateful content. Our findings indicate that LLMs tend to consider Information Hazards less harmful, a finding confirmed by a specially developed regression model. Additionally, our analysis shows that LLMs respond less stringently to Information Hazards compared to other risks. The study further reveals a significant vulnerability of LLMs to jailbreaking attacks in Information Hazard scenarios, highlighting a critical security concern in LLM risk assessment and emphasizing the need for improved AI safety measures.

Submitted 22 March, 2024; originally announced March 2024.

Comments: 19 pages, 14 figures

arXiv:2403.04085 [pdf, other]

Don't Blame the Data, Blame the Model: Understanding Noise and Bias When Learning from Subjective Annotations

Authors: Abhishek Anand, Negar Mokhberian, Prathyusha Naresh Kumar, Anweasha Saha, Zihao He, Ashwin Rao, Fred Morstatter, Kristina Lerman

Abstract: Researchers have raised awareness about the harms of aggregating labels especially in subjective tasks that naturally contain disagreements among human annotators. In this work we show that models that are only provided aggregated labels show low confidence on high-disagreement data instances. While previous studies consider such instances as mislabeled, we argue that the reason the high-disagreement text instances have been hard-to-learn is that the conventional aggregated models underperform in extracting useful signals from subjective tasks. Inspired by recent studies demonstrating the effectiveness of learning from raw annotations, we investigate classifying using Multiple Ground Truth (Multi-GT) approaches. Our experiments show an improvement of confidence for the high-disagreement instances.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.13273 [pdf, ps, other]

Operational Collective Intelligence of Humans and Machines

Authors: Nikolos Gurney, Fred Morstatter, David V. Pynadath, Adam Russell, Gleb Satyukov

Abstract: We explore the use of aggregative crowdsourced forecasting (ACF) as a mechanism to help operationalize ``collective intelligence'' of human-machine teams for coordinated actions. We adopt the definition for Collective Intelligence as: ``A property of groups that emerges from synergies among data-information-knowledge, software-hardware, and individuals (those with new insights as well as recognized authorities) that enables just-in-time knowledge for better decisions than these three elements acting alone.'' Collective Intelligence emerges from new ways of connecting humans and AI to enable decision-advantage, in part by creating and leveraging additional sources of information that might otherwise not be included. Aggregative crowdsourced forecasting (ACF) is a recent key advancement towards Collective Intelligence wherein predictions (X\% probability that Y will happen) and rationales (why I believe it is this probability that X will happen) are elicited independently from a diverse crowd, aggregated, and then used to inform higher-level decision-making. This research asks whether ACF, as a key way to enable Operational Collective Intelligence, could be brought to bear on operational scenarios (i.e., sequences of events with defined agents, components, and interactions) and decision-making, and considers whether such a capability could provide novel operational capabilities to enable new forms of decision-advantage.

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.03221 [pdf, other]

"Define Your Terms" : Enhancing Efficient Offensive Speech Classification with Definition

Authors: Huy Nghiem, Umang Gupta, Fred Morstatter

Abstract: The propagation of offensive content through social media channels has garnered the attention of the research community. Multiple works have proposed various semantically related yet subtly distinct categories of offensive speech. In this work, we explore meta-learning approaches to leverage the diversity of offensive speech corpora to enhance their reliable and efficient detection. We propose a joint embedding architecture that incorporates the input's label and definition for classification via Prototypical Network. Our model achieves at least 75% of the maximal F1-score while using less than 10% of the available training data across 4 datasets. Our experimental findings also provide a case study of training strategies valuable to combat resource scarcity.

Submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted to Main Conference, EACL 2024

arXiv:2401.12117 [pdf, other]

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

Authors: Kian Ahrabian, Zhivar Sourati, Kexuan Sun, Jiarui Zhang, Yifan Jiang, Fred Morstatter, Jay Pujara

Abstract: While large language models (LLMs) are still being adopted to new domains and utilized in novel applications, we are experiencing an influx of the new generation of foundation models, namely multi-modal large language models (MLLMs). These models integrate verbal and visual information, opening new possibilities to demonstrate more complex reasoning abilities at the intersection of the two modalities. However, despite the revolutionizing prospect of MLLMs, our understanding of their reasoning abilities is limited. In this study, we assess the nonverbal abstract reasoning abilities of open-source and closed-source MLLMs using variations of Raven's Progressive Matrices. Our experiments reveal the challenging nature of such problems for MLLMs while showcasing the immense gap between open-source and closed-source models. We also uncover critical shortcomings of visual and textual perceptions, subjecting the models to low-performance ceilings. Finally, to improve MLLMs' performance, we experiment with different methods, such as Chain-of-Thought prompting, leading to a significant (up to 100%) boost in performance. Our code and datasets are available at https://github.com/usc-isi-i2/isi-mmlm-rpm.

Submitted 22 August, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

Comments: 21 pages

arXiv:2401.06275 [pdf, other]

The Pulse of Mood Online: Unveiling Emotional Reactions in a Dynamic Social Media Landscape

Authors: Siyi Guo, Zihao He, Ashwin Rao, Fred Morstatter, Jeffrey Brantingham, Kristina Lerman

Abstract: The rich and dynamic information environment of social media provides researchers, policy makers, and entrepreneurs with opportunities to learn about social phenomena in a timely manner. However, using these data to understand social behavior is difficult due to heterogeneity of topics and events discussed in the highly dynamic online information environment. To address these challenges, we present a method for systematically detecting and measuring emotional reactions to offline events using change point detection on the time series of collective affect, and further explaining these reactions using a transformer-based topic model. We demonstrate the utility of the method by successfully detecting major and smaller events on three different datasets, including (1) a Los Angeles Tweet dataset between Jan. and Aug. 2020, in which we revealed the complex psychological impact of the BlackLivesMatter movement and the COVID-19 pandemic, (2) a dataset related to abortion rights discussions in USA, in which we uncovered the strong emotional reactions to the overturn of Roe v. Wade and state abortion bans, and (3) a dataset about the 2022 French presidential election, in which we discovered the emotional and moral shift from positive before voting to fear and criticism after voting. The capability of our method allows for better sensing and monitoring of population's reactions during crises using online data.

Submitted 11 January, 2024; originally announced January 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2307.10245

arXiv:2401.03729 [pdf, other]

The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance

Authors: Abel Salinas, Fred Morstatter

Abstract: Large Language Models (LLMs) are regularly being used to label data across many domains and for myriad tasks. By simply asking the LLM for an answer, or ``prompting,'' practitioners are able to use LLMs to quickly get a response for an arbitrary task. This prompting is done through a series of decisions by the practitioner, from simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. In this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the LLM? We answer this using a series of prompt variations across a variety of text classification tasks. We find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the LLM to change its answer. Further, we find that requesting responses in XML and commonly used jailbreaks can have cataclysmic effects on the data labeled by LLMs.

Submitted 1 April, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

arXiv:2311.09743 [pdf, other]

Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks

Authors: Negar Mokhberian, Myrl G. Marmarelis, Frederic R. Hopp, Valerio Basile, Fred Morstatter, Kristina Lerman

Abstract: Supervised classification heavily depends on datasets annotated by humans. However, in subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters. Annotations have commonly been aggregated by employing methods like majority voting to determine a single ground truth label. In subjective tasks, aggregating labels will result in biased labeling and, consequently, biased models that can overlook minority opinions. Previous studies have shed light on the pitfalls of label aggregation and have introduced a handful of practical approaches to tackle this issue. Recently proposed multi-annotator models, which predict labels individually per annotator, are vulnerable to under-determination for annotators with few samples. This problem is exacerbated in crowdsourced datasets. In this work, we propose \textbf{Annotator Aware Representations for Texts (AART)} for subjective classification tasks. Our approach involves learning representations of annotators, allowing for exploration of annotation behaviors. We show the improvement of our method on metrics that assess the performance on capturing individual annotators' perspectives. Additionally, we demonstrate fairness metrics to evaluate our model's equability of performance for marginalized annotators compared to others.

Submitted 16 May, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2310.08780 [pdf, other]

"Im not Racist but...": Discovering Bias in the Internal Knowledge of Large Language Models

Authors: Abel Salinas, Louis Penafiel, Robert McCormack, Fred Morstatter

Abstract: Large language models (LLMs) have garnered significant attention for their remarkable performance in a continuously expanding set of natural language processing tasks. However, these models have been shown to harbor inherent societal biases, or stereotypes, which can adversely affect their performance in their many downstream applications. In this paper, we introduce a novel, purely prompt-based approach to uncover hidden stereotypes within any arbitrary LLM. Our approach dynamically generates a knowledge representation of internal stereotypes, enabling the identification of biases encoded within the LLM's internal knowledge. By illuminating the biases present in LLMs and offering a systematic methodology for their analysis, our work contributes to advancing transparency and promoting fairness in natural language processing systems.

Submitted 12 October, 2023; originally announced October 2023.

Comments: Warning: This paper discusses and contains content that is offensive or upsetting

arXiv:2308.02053 [pdf, other]

doi 10.1145/3617694.3623257

The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations

Authors: Abel Salinas, Parth Vipul Shah, Yuzhong Huang, Robert McCormack, Fred Morstatter

Abstract: Large Language Models (LLMs) have seen widespread deployment in various real-world applications. Understanding these biases is crucial to comprehend the potential downstream consequences when using LLMs to make decisions, particularly for historically disadvantaged groups. In this work, we propose a simple method for analyzing and comparing demographic bias in LLMs, through the lens of job recommendations. We demonstrate the effectiveness of our method by measuring intersectional biases within ChatGPT and LLaMA, two cutting-edge LLMs. Our experiments primarily focus on uncovering gender identity and nationality bias; however, our method can be extended to examine biases associated with any intersection of demographic identities. We identify distinct biases in both models toward various demographic identities, such as both models consistently suggesting low-paying jobs for Mexican workers or preferring to recommend secretarial roles to women. Our study highlights the importance of measuring the bias of LLMs in downstream applications to understand the potential for harm and inequitable outcomes.

Submitted 9 January, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

Comments: Accepted to EAAMO 2023

arXiv:2307.10245 [pdf, other]

doi 10.1145/3625007.3627477

Measuring Online Emotional Reactions to Events

Authors: Siyi Guo, Zihao He, Ashwin Rao, Eugene Jang, Yuanfeixue Nan, Fred Morstatter, Jeffrey Brantingham, Kristina Lerman

Abstract: The rich and dynamic information environment of social media provides researchers, policy makers, and entrepreneurs with opportunities to learn about social phenomena in a timely manner. However, using this data to understand social behavior is difficult due to the heterogeneity of topics and events discussed in the highly dynamic online information environment. To address these challenges, we present a method for systematically detecting and measuring emotional reactions to offline events using change point detection on the time series of collective affect, and further explaining these reactions using a transformer-based topic model. We demonstrate the utility of the method on a corpus of tweets from a large US metropolitan area between January and August 2020, covering a period of great social change. We demonstrate that our method is able to disaggregate topics to measure the population's emotional and moral reactions. This capability allows for better monitoring of the population's reactions during crises using online data.

Submitted 28 March, 2024; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: Proceedings of the International Conference on Advances in Social Networks Analysis and Mining. 2023

arXiv:2306.09520 [pdf, other]

Ensembled Prediction Intervals for Causal Outcomes Under Hidden Confounding

Authors: Myrl G. Marmarelis, Greg Ver Steeg, Aram Galstyan, Fred Morstatter

Abstract: Causal inference of exact individual treatment outcomes in the presence of hidden confounders is rarely possible. Recent work has extended prediction intervals with finite-sample guarantees to partially identifiable causal outcomes, by means of a sensitivity model for hidden confounding. In deep learning, predictors can exploit their inductive biases for better generalization out of sample. We argue that the structure inherent to a deep ensemble should inform a tighter partial identification of the causal outcomes that they predict. We therefore introduce an approach termed Caus-Modens, for characterizing causal outcome intervals by modulated ensembles. We present a simple approach to partial identification using existing causal sensitivity models and show empirically that Caus-Modens gives tighter outcome intervals, as measured by the necessary interval size to achieve sufficient coverage. The last of our three diverse benchmarks is a novel usage of GPT-4 for observational experiments with unknown but probeable ground truth.

Submitted 1 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.02475 [pdf, other]

Modeling Cross-Cultural Pragmatic Inference with Codenames Duet

Authors: Omar Shaikh, Caleb Ziems, William Held, Aryan J. Pariani, Fred Morstatter, Diyi Yang

Abstract: Pragmatic reference enables efficient interpersonal communication. Prior work uses simple reference games to test models of pragmatic reasoning, often with unidentified speakers and listeners. In practice, however, speakers' sociocultural background shapes their pragmatic assumptions. For example, readers of this paper assume NLP refers to "Natural Language Processing," and not "Neuro-linguistic Programming." This work introduces the Cultural Codes dataset, which operationalizes sociocultural pragmatic inference in a simple word reference game. Cultural Codes is based on the multi-turn collaborative two-player game, Codenames Duet. Our dataset consists of 794 games with 7,703 turns, distributed across 153 unique players. Alongside gameplay, we collect information about players' personalities, values, and demographics. Utilizing theories of communication and pragmatics, we predict each player's actions via joint modeling of their sociocultural priors and the game context. Our experiments show that accounting for background characteristics significantly improves model performance for tasks related to both clue giving and guessing, indicating that sociocultural priors play a vital role in gameplay decisions.

Submitted 4 June, 2023; originally announced June 2023.

Comments: ACL 2023 Findings

arXiv:2305.18533 [pdf, other]

Pandemic Culture Wars: Partisan Differences in the Moral Language of COVID-19 Discussions

Authors: Ashwin Rao, Siyi Guo, Sze-Yuh Nina Wang, Fred Morstatter, Kristina Lerman

Abstract: Effective response to pandemics requires coordinated adoption of mitigation measures, like masking and quarantines, to curb a virus's spread. However, as the COVID-19 pandemic demonstrated, political divisions can hinder consensus on the appropriate response. To better understand these divisions, our study examines a vast collection of COVID-19-related tweets. We focus on five contentious issues: coronavirus origins, lockdowns, masking, education, and vaccines. We describe a weakly supervised method to identify issue-relevant tweets and employ state-of-the-art computational methods to analyze moral language and infer political ideology. We explore how partisanship and moral language shape conversations about these issues. Our findings reveal ideological differences in issue salience and moral language used by different groups. We find that conservatives use more negatively-valenced moral language than liberals and that political elites use moral rhetoric to a greater extent than non-elites across most issues. Examining the evolution and moralization of divisive issues can provide valuable insights into the dynamics of COVID-19 discussions and assist policymakers in better understanding the emergence of ideological divisions.

Submitted 17 October, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:2305.12280 [pdf, other]

Contextualizing Argument Quality Assessment with Relevant Knowledge

Authors: Darshan Deshpande, Zhivar Sourati, Filip Ilievski, Fred Morstatter

Abstract: Automatic assessment of the quality of arguments has been recognized as a challenging task with significant implications for misinformation and targeted speech. While real-world arguments are tightly anchored in context, existing computational methods analyze their quality in isolation, which affects their accuracy and generalizability. We propose SPARK: a novel method for scoring argument quality based on contextualization via relevant knowledge. We devise four augmentations that leverage large language models to provide feedback, infer hidden assumptions, supply a similar-quality argument, or give a counter-argument. SPARK uses a dual-encoder Transformer architecture to enable the original argument and its augmentation to be considered jointly. Our experiments in both in-domain and zero-shot setups show that SPARK consistently outperforms existing techniques across multiple metrics.

Submitted 17 June, 2024; v1 submitted 20 May, 2023; originally announced May 2023.

Comments: Accepted at NAACL 2024

arXiv:2305.10613 [pdf, other]

Temporal Knowledge Graph Forecasting Without Knowledge Using In-Context Learning

Authors: Dong-Ho Lee, Kian Ahrabian, Woojeong Jin, Fred Morstatter, Jay Pujara

Abstract: Temporal knowledge graph (TKG) forecasting benchmarks challenge models to predict future facts using knowledge of past facts. In this paper, we apply large language models (LLMs) to these benchmarks using in-context learning (ICL). We investigate whether and to what extent LLMs can be used for TKG forecasting, especially without any fine-tuning or explicit modules for capturing structural and temporal information. For our experiments, we present a framework that converts relevant historical facts into prompts and generates ranked predictions using token probabilities. Surprisingly, we observe that LLMs, out-of-the-box, perform on par with state-of-the-art TKG models carefully designed and trained for TKG forecasting. Our extensive evaluation presents performances across several models and datasets with different characteristics, compares alternative heuristics for preparing contextual information, and contrasts to prominent TKG methods and simple frequency and recency baselines. We also discover that using numerical indices instead of entity/relation names, i.e., hiding semantic information, does not significantly affect the performance (±0.4% Hit@1). This shows that prior semantic knowledge is unnecessary; instead, LLMs can leverage the existing patterns in the context to achieve such performance. Our analysis also reveals that ICL enables LLMs to learn irregular patterns from the historical context, going beyond simple predictions based on common or recent information.

Submitted 20 October, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted to EMNLP 2023 main conference. 14 pages, 4 figures, 10 tables

arXiv:2303.04837 [pdf, other]

Non-Binary Gender Expression in Online Interactions

Authors: Rebecca Dorn, Negar Mokhberian, Julie Jiang, Jeremy Abramson, Fred Morstatter, Kristina Lerman

Abstract: Many openly non-binary gender individuals participate in social networks. However, the relationship between gender and online interactions is not well understood, which may result in disparate treatment by large language models. We investigate individual identity on Twitter, focusing on gender expression as represented by users' chosen pronouns. We find that non-binary groups tend to receive less attention in the form of likes and followers. We also find that non-binary users send and receive tweets with above-average toxicity. The study highlights the importance of considering gender as a spectrum, rather than a binary, in understanding online interactions and expression.

Submitted 12 September, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

arXiv:2301.11994 [pdf, other]

Gender and Prestige Bias in Coronavirus News Reporting

Authors: Rebecca Dorn, Yiwen Ma, Fred Morstatter, Kristina Lerman

Abstract: Journalists play a vital role in surfacing issues of societal importance, but their choices of what to highlight and who to interview are influenced by societal biases. In this work, we use natural language processing tools to measure these biases in a large corpus of news articles about the Covid-19 pandemic. Specifically, we identify when experts are quoted in news and extract their names and institutional affiliations. We enrich the data by classifying each expert's gender, the type of organization they belong to, and for academic institutions, their ranking. Our analysis reveals disparities in the representation of experts in news. We find a substantial gender gap, where men are quoted three times more than women. The gender gap varies by partisanship of the news source, with conservative media exhibiting greater gender bias. We also identify academic prestige bias, where journalists turn to experts from highly-ranked academic institutions more than experts from less prestigious institutions, even if the latter group has more public health expertise. Liberal news sources exhibit slightly more prestige bias than conservative sources. Equality of representation is essential to enable voices from all groups to be heard. By auditing bias, our methods help identify blind spots in news coverage.

Submitted 27 January, 2023; originally announced January 2023.

arXiv:2301.11429 [pdf, other]

Just Another Day on Twitter: A Complete 24 Hours of Twitter Data

Authors: Juergen Pfeffer, Daniel Matter, Kokil Jaidka, Onur Varol, Afra Mashhadi, Jana Lasser, Dennis Assenmacher, Siqi Wu, Diyi Yang, Cornelia Brantner, Daniel M. Romero, Jahna Otterbacher, Carsten Schwemmer, Kenneth Joseph, David Garcia, Fred Morstatter

Abstract: At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected all 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.

Submitted 11 April, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

arXiv:2211.16480 [pdf, other]

Retweets Amplify the Echo Chamber Effect

Authors: Ashwin Rao, Fred Morstatter, Kristina Lerman

Abstract: The growing prominence of social media in public discourse has led to greater scrutiny of the quality of online information and the role it plays in amplifying political polarization. However, studies of polarization on social media platforms like Twitter have been hampered by the difficulty of collecting data about the social graph, specifically follow links that shape the echo chambers users join as well as what they see in their timelines. As a proxy of the follower graph, researchers use retweets, although it is not clear how this choice affects analysis. Using a sample of the Twitter follower graph and the tweets posted by users within it, we reconstruct the retweet graph and quantify its impact on the measures of echo chambers and exposure. While we find that echo chambers exist in both graphs, they are more pronounced in the retweet graph. We compare the information users see via their follower and retweet networks to show that retweeted accounts share systematically more polarized content. This bias cannot be explained by the activity or polarization within users' own follower graph neighborhoods but by the increased attention they pay to accounts that are ideologically aligned with their own views. Our results suggest that studies relying on the retweet graphs overestimate the echo chamber effects and exposure to polarized information.

Submitted 26 July, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: 8 pages, 8 figures

arXiv:2210.07415 [pdf, other]

Noise Audits Improve Moral Foundation Classification

Authors: Negar Mokhberian, Frederic R. Hopp, Bahareh Harandizadeh, Fred Morstatter, Kristina Lerman

Abstract: Morality plays an important role in culture, identity, and emotion. Recent advances in natural language processing have shown that it is possible to classify moral values expressed in text at scale. Morality classification relies on human annotators to label the moral expressions in text, which provides training data to achieve state-of-the-art performance. However, these annotations are inherently subjective and some of the instances are hard to classify, resulting in noisy annotations due to error or lack of agreement. The presence of noise in training data harms the classifier's ability to accurately recognize moral foundations from text. We propose two metrics to audit the noise of annotations. The first metric is entropy of instance labels, which is a proxy measure of annotator disagreement about how the instance should be labeled. The second metric is the silhouette coefficient of a label assigned by an annotator to an instance. This metric leverages the idea that instances with the same label should have similar latent representations, and deviations from collective judgments are indicative of errors. Our experiments on three widely used moral foundations datasets show that removing noisy annotations based on the proposed metrics improves classification performance.

Submitted 13 October, 2022; originally announced October 2022.
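
The first metric named in the abstract above, entropy of instance labels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example labels, and the filtering threshold are hypothetical, and the sketch assumes plain Shannon entropy (in bits) over the labels that annotators assigned to a single instance.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (bits) of the labels annotators gave to one instance.

    0.0 means all annotators agreed; larger values mean more disagreement.
    """
    if not labels:
        raise ValueError("need at least one annotation")
    counts = Counter(labels)
    total = len(labels)
    # Sum of p * log2(1/p) over the observed label distribution.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def keep_low_noise(annotations, max_entropy):
    """Keep instances whose label entropy is at most max_entropy.

    `annotations` maps an instance id to its list of annotator labels.
    `max_entropy` is a hypothetical cutoff chosen by the user.
    """
    return {
        instance_id: labels
        for instance_id, labels in annotations.items()
        if label_entropy(labels) <= max_entropy
    }


# Hypothetical annotations for two instances.
example = {
    "t1": ["care", "care", "care"],                  # full agreement
    "t2": ["care", "care", "fairness", "fairness"],  # even split
}
filtered = keep_low_noise(example, max_entropy=0.5)
```

With the hypothetical cutoff of 0.5 bits, the evenly split instance is dropped and the unanimous one is kept. The second metric, the silhouette coefficient, needs latent representations of the instances and is not sketched here.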

arXiv:2205.02392 [pdf, other]

Robust Conversational Agents against Imperceptible Toxicity Triggers

Authors: Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, Aram Galstyan

Abstract: Warning: this paper contains content that may be offensive or upsetting. Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language from existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force the system to generate toxic language and the defense against them. Existing work to generate such attacks is either based on human-generated attacks, which are costly and not scalable, or, in the case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language. We then propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow. Through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy. Lastly, we establish the generalizability of such a defense mechanism on language generation models beyond conversational agents.

Submitted 4 May, 2022; originally announced May 2022.

arXiv:2203.01350 [pdf, other]

Partisan Asymmetries in Exposure to Misinformation

Authors: Ashwin Rao, Fred Morstatter, Kristina Lerman

Abstract: Health misinformation is believed to have contributed to vaccine hesitancy during the Covid-19 pandemic, highlighting concerns about the role of social media in polarization and social stability. While previous research has identified a link between political partisanship and misinformation sharing online, the interaction between partisanship and how much misinformation people see within their social networks has not been well studied. As a result, we do not know whether partisanship drives exposure to misinformation or people selectively share misinformation despite being exposed to factual content. We study Twitter discussions about the Covid-19 pandemic, classifying users ideologically along political and factual dimensions. We find partisan asymmetries in both sharing behaviors and exposure, with conservatives more likely to see and share misinformation and moderate liberals seeing the most factual content. We identify multi-dimensional echo chambers that expose users to ideologically congruent content; however, the interaction between political and factual dimensions creates conditions for the highly polarized users -- hardline conservatives and liberals -- to amplify misinformation. Despite this, misinformation receives less attention than factual content and political moderates, who represent the bulk of users in our sample, help filter out misinformation, reducing the amount of low-factuality content in the information ecosystem. Identifying the extent of polarization and how political ideology can exacerbate misinformation can potentially help public health experts and policy makers improve their messaging to promote consensus.

Submitted 2 March, 2022; originally announced March 2022.

Comments: 10 pages, 8 figures

arXiv:2112.03101 [pdf, other]

doi 10.1145/3488560.3498518

Keyword Assisted Embedded Topic Model

Authors: Bahareh Harandizadeh, J. Hunter Priniski, Fred Morstatter

Abstract: By illuminating latent structures in a corpus of text, topic models are an essential tool for categorizing, summarizing, and exploring large collections of documents. Probabilistic topic models, such as latent Dirichlet allocation (LDA), describe how words in documents are generated via a set of latent distributions called topics. Recently, the Embedded Topic Model (ETM) has extended LDA to utilize the semantic information in word embeddings to derive semantically richer topics. As LDA and its extensions are unsupervised models, they aren't defined to make efficient use of a user's prior knowledge of the domain. To this end, we propose the Keyword Assisted Embedded Topic Model (KeyETM), which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors over the vocabulary. Using both quantitative metrics and human responses on a topic intrusion task, we demonstrate that KeyETM produces better topics than other guided, generative models in the literature.

Submitted 22 November, 2021; originally announced December 2021.

Comments: 8 pages, 5 figures, WSDM 2022 Conference

arXiv:2112.02265 [pdf, other]

"Stop Asian Hate!" : Refining Detection of Anti-Asian Hate Speech During the COVID-19 Pandemic

Authors: Huy Nghiem, Fred Morstatter

Abstract: Content warning: This work displays examples of explicit and/or strongly offensive language. Fueled by a surge of anti-Asian xenophobia and prejudice during the COVID-19 pandemic, many have taken to social media to express these negative sentiments. Identifying these posts is crucial for moderation and understanding the nature of hate in online spaces. In this paper, we create and annotate a corpus of tweets to explore anti-Asian hate speech with a finer level of granularity. Our analysis reveals that this emergent form of hate speech often eludes established approaches. To address this challenge, we develop a model and an accompanying efficient training regimen that incorporates agreement between annotators. Our approach produces up to 8.8% improvement in macro F1 scores over a strong established baseline, indicating its effectiveness even in settings where consensus among annotators is low. We demonstrate that we are able to identify hate speech that is systematically missed by established hate speech detectors.

Submitted 28 June, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

arXiv:2109.04726 [pdf, other]

AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction

Authors: Dong-Ho Lee, Ravi Kiran Selvam, Sheikh Muhammad Sarwar, Bill Yuchen Lin, Fred Morstatter, Jay Pujara, Elizabeth Boschee, James Allan, Xiang Ren

Abstract: Deep neural models for named entity recognition (NER) have shown impressive results in overcoming label scarcity and generalizing to unseen entities by leveraging distant supervision and auxiliary information such as explanations. However, the costs of acquiring such additional information are generally prohibitive. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging "entity triggers," which are human-readable cues in the text that help guide the model to make better decisions. Our framework leverages post-hoc explanation to generate rationales and strengthens a model's prior knowledge using an embedding interpolation technique. This approach allows models to exploit triggers to infer entity boundaries and types instead of solely memorizing the entity words themselves. Through experiments on three well-studied NER datasets, AutoTriggER shows strong label-efficiency, is capable of generalizing to unseen entities, and outperforms the RoBERTa-CRF baseline by nearly 0.5 F1 points on average.

Submitted 18 May, 2023; v1 submitted 10 September, 2021; originally announced September 2021.

Comments: 15 pages, 13 figures, EACL 2023

arXiv:2109.03952 [pdf, other]

Attributing Fair Decisions with Attention Interventions

Authors: Ninareh Mehrabi, Umang Gupta, Fred Morstatter, Greg Ver Steeg, Aram Galstyan

Abstract: The widespread use of Artificial Intelligence (AI) in consequential domains, such as healthcare and parole decision-making systems, has drawn intense scrutiny on the fairness of these methods. However, ensuring fairness is often insufficient as the rationale for a contentious decision needs to be audited, understood, and defended. We propose that the attention mechanism can be used to ensure fair outcomes while simultaneously providing feature attributions to account for how a decision was made. Toward this goal, we design an attention-based model that can be leveraged as an attribution framework. It can identify features responsible for both performance and fairness of the model through attention interventions and attention weight manipulation. Using this attribution framework, we then design a post-processing bias mitigation strategy and compare it with a suite of baselines. We demonstrate the versatility of our approach by conducting experiments on two distinct data types, tabular and textual.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2108.05412 [pdf, ps, other]

Analyzing Race and Country of Citizenship Bias in Wikidata

Authors: Zaina Shaik, Filip Ilievski, Fred Morstatter

Abstract: As an open and collaborative knowledge graph created by users and bots, Wikidata may contain knowledge that is biased with regard to multiple factors such as gender, race, and country of citizenship. Previous work has mostly studied the representativeness of Wikidata knowledge in terms of genders of people. In this paper, we examine the race and citizenship bias in general and with regard to STEM representation for scientists, software developers, and engineers. By comparing Wikidata queries to real-world datasets, we identify the differences in representation to characterize the biases present in Wikidata. Through this analysis, we discovered that there is an overrepresentation of white individuals and those with citizenship in Europe and North America; the rest of the groups are generally underrepresented. Based on these findings, we have found and linked to Wikidata additional data about STEM scientists from minority groups. This data is ready to be inserted into Wikidata with a bot. Increasing representation of minority race and country of citizenship groups can create a more accurate portrayal of individuals in STEM.

Submitted 11 August, 2021; originally announced August 2021.

arXiv:2105.14637 [pdf, other]

Organizational Artifacts of Code Development

Authors: Parisa Kaghazgaran, Nichola Lubold, Fred Morstatter

Abstract: Software is the outcome of active and effective communication between members of an organization. This has been noted with Conway's law, which states that "organizations design systems that mirror their own communication structure." However, software developers are often members of multiple organizational groups (e.g., corporate, regional), and it is unclear how association with groups beyond one's company influences the development process. In this paper, we study social effects of country by measuring differences in software repositories associated with different countries. Using a novel dataset we obtain from GitHub, we identify key properties that differentiate software repositories based upon the country of the developers. We propose a novel approach that models repositories by their sequence of development activities as a sequence embedding task; coupled with repo profile features, it achieves 79.2% accuracy in identifying the country of a repository. Finally, we conduct a case study on repos from well-known corporations and find that country can describe the differences in development better than the company affiliation itself. These results have larger implications for software development and indicate the importance of considering the multiple groups developers are associated with when considering the formation and structure of teams.

Submitted 30 May, 2021; originally announced May 2021.

arXiv:2104.09578 [pdf]

Mapping Moral Valence of Tweets Following the Killing of George Floyd

Authors: J. Hunter Priniski, Negar Mokhberian, Bahareh Harandizadeh, Fred Morstatter, Kristina Lerman, Hongjing Lu, P. Jeffrey Brantingham

Abstract: The viral video documenting the killing of George Floyd by Minneapolis police officer Derek Chauvin inspired nation-wide protests that brought national attention to widespread racial injustice and biased policing practices towards black communities in the United States. The use of social media by the Black Lives Matter movement was a primary route for activists to promote the cause and organize over 1,400 protests across the country. Recent research argues that moral discussions on social media are a catalyst for social change. This study sought to shed light on the moral dynamics shaping Black Lives Matter Twitter discussions by analyzing over 40,000 Tweets geo-located to Los Angeles. The goal of this study is to (1) develop computational techniques for mapping the structure of moral discourse on Twitter and (2) understand the connections between social media activism and protest.

Submitted 26 August, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

Comments: 6 pages, 4 figures

arXiv:2103.11320 [pdf, other]

Lawyers are Dishonest? Quantifying Representational Harms in Commonsense Knowledge Resources

Authors: Ninareh Mehrabi, Pei Zhou, Fred Morstatter, Jay Pujara, Xiang Ren, Aram Galstyan

Abstract: Warning: this paper contains content that may be offensive or upsetting. Numerous natural language processing models have tried injecting commonsense by using the ConceptNet knowledge base to improve performance on different tasks. ConceptNet, however, is mostly crowdsourced from humans and may reflect human biases such as "lawyers are dishonest." It is important that these biases are not conflated with the notion of commonsense. We study this missing yet important problem by first defining and quantifying biases in ConceptNet as two types of representational harms: overgeneralization of polarized perceptions and representation disparity. We find that ConceptNet contains severe biases and disparities across four demographic categories. In addition, we analyze two downstream models that use ConceptNet as a source for commonsense knowledge and find the existence of biases in those models as well. We further propose a filter-based bias-mitigation approach and examine its effectiveness. We show that our mitigation approach can reduce the issues in both resource and models but leads to a performance drop, leaving room for future work to build fairer and stronger commonsense models.

Submitted 10 September, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

arXiv:2102.04936 [pdf, other]

Models, Markets, and the Forecasting of Elections

Authors: Rajiv Sethi, Julie Seager, Emily Cai, Daniel M. Benjamin, Fred Morstatter

Abstract: We examine probabilistic forecasts for battleground states in the 2020 US presidential election, using daily data from two sources over seven months: a model published by The Economist, and prices from the PredictIt exchange. We find systematic differences in accuracy over time, with markets performing better several months before the election, and the model performing better as the election approached. A simple average of the two forecasts performs better than either one of them overall, even though no average can outperform both component forecasts for any given state-date pair. This effect arises because the model and the market make different kinds of errors in different states: the model was confidently wrong in some cases, while the market was excessively uncertain in others. We conclude that there is value in using hybrid forecasting methods, and propose a market design that incorporates model forecasts via a trading bot to generate synthetic predictions. We also propose and conduct a profitability test that can be used as a novel criterion for the evaluation of forecasting performance.

Submitted 25 May, 2021; v1 submitted 6 February, 2021; originally announced February 2021.
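
Editorial illustration (not from the paper): the abstract's claim that a simple average can beat both forecasts overall, while never beating both on any single state-date pair, can be checked with a toy calculation. The sketch below assumes the Brier score (squared error of a probability) as the accuracy measure, which the abstract does not name, and uses made-up numbers rather than data from The Economist or PredictIt.

```python
# Toy check: averaging two forecasts that err on different cases.
# Assumption: accuracy is measured with the Brier score (lower is better).
# All numbers are invented for illustration only.

def brier(forecast, outcome):
    """Squared error of a probability forecast for a binary outcome."""
    return (forecast - outcome) ** 2

def mean_brier(forecasts, outcomes):
    """Average Brier score over all forecast/outcome pairs."""
    return sum(brier(f, o) for f, o in zip(forecasts, outcomes)) / len(outcomes)

outcomes = [1, 0]      # what actually happened in two hypothetical cases
model = [0.9, 0.5]     # hypothetical model: good on case 1, poor on case 2
market = [0.5, 0.1]    # hypothetical market: poor on case 1, good on case 2
hybrid = [(m + k) / 2 for m, k in zip(model, market)]  # simple average

model_score = mean_brier(model, outcomes)
market_score = mean_brier(market, outcomes)
hybrid_score = mean_brier(hybrid, outcomes)

# Per-case errors, to compare the hybrid against the better component.
per_case = [
    (brier(h, o), min(brier(m, o), brier(k, o)))
    for h, m, k, o in zip(hybrid, model, market, outcomes)
]

print(model_score, market_score, hybrid_score)
```

In this toy case the hybrid has the lowest mean score, yet on each individual case it is no better than the better of the two components, because an average always lies between the two forecasts it combines.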

arXiv:2012.08723 [pdf, other]

Exacerbating Algorithmic Bias through Fairness Attacks

Authors: Ninareh Mehrabi, Muhammad Naveed, Fred Morstatter, Aram Galstyan

Abstract: Algorithmic fairness has attracted significant attention in recent years, with many quantitative measures suggested for characterizing the fairness of different machine learning algorithms. Despite this interest, the robustness of those fairness measures with respect to an intentional adversarial attack has not been properly addressed. Indeed, most adversarial machine learning has focused on the impact of malicious attacks on the accuracy of the system, without any regard to the system's fairness. We propose new types of data poisoning attacks where an adversary intentionally targets the fairness of a system. Specifically, we propose two families of attacks that target fairness measures. In the anchoring attack, we skew the decision boundary by placing poisoned points near specific target points to bias the outcome. In the influence attack on fairness, we aim to maximize the covariance between the sensitive attributes and the decision outcome and affect the fairness of the model. We conduct extensive experiments that indicate the effectiveness of our proposed attacks.

Submitted 15 December, 2020; originally announced December 2020.
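
Editorial illustration (not from the paper): the quantity the influence attack is said to maximize, the covariance between a sensitive attribute and the decision outcome, can be computed directly. The sketch below only evaluates that covariance on two invented sets of decisions; it does not implement the attack, and the variable names and data are hypothetical.

```python
# Covariance between a binary sensitive attribute and binary decisions.
# Assumption: population covariance over a small invented sample; this
# shows the measured quantity only, not the poisoning procedure.

def covariance(xs, ys):
    """Population covariance of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

sensitive = [0, 0, 1, 1]          # hypothetical group membership
balanced_decisions = [1, 0, 1, 0]  # positive rate is equal across groups
skewed_decisions = [0, 0, 1, 1]    # decisions track group membership exactly

balanced_cov = covariance(sensitive, balanced_decisions)
skewed_cov = covariance(sensitive, skewed_decisions)

print(balanced_cov, skewed_cov)
```

A covariance near zero means decisions do not move with the sensitive attribute; a larger magnitude means they do, which is the direction the abstract says the attack pushes the model.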

arXiv:2011.08498 [pdf, other]

Political Partisanship and Anti-Science Attitudes in Online Discussions about Covid-19

Authors: Ashwin Rao, Fred Morstatter, Minda Hu, Emily Chen, Keith Burghardt, Emilio Ferrara, Kristina Lerman

Abstract: The novel coronavirus pandemic continues to ravage communities across the US. Opinion surveys identified the importance of political ideology in shaping perceptions of the pandemic and compliance with preventive measures. Here, we use social media data to study the complexity of polarization. We analyze a large dataset of tweets related to the pandemic collected between January and May of 2020, and develop methods to classify the ideological alignment of users along the moderacy (hardline vs moderate), political (liberal vs conservative) and science (anti-science vs pro-science) dimensions. While polarization along the science and political dimensions is correlated, politically moderate users are more likely to be aligned with the pro-science views, and politically hardline users with anti-science views. Contrary to expectations, we do not find that polarization grows over time; instead, we see increasing activity by moderate pro-science users. We also show that anti-science conservatives tend to tweet from the Southern US, while anti-science moderates tend to tweet from the Western states. Our findings shed light on the multi-dimensional nature of polarization, and the feasibility of tracking polarized opinions about the pandemic across time and space through social media data.

Submitted 17 November, 2020; originally announced November 2020.

Comments: 10 pages, 5 figures

arXiv:2010.12144 [pdf, other]

One-shot Learning for Temporal Knowledge Graphs

Authors: Mehrnoosh Mirtaheri, Mohammad Rostami, Xiang Ren, Fred Morstatter, Aram Galstyan

Abstract: Most real-world knowledge graphs are characterized by a long-tail relation frequency distribution where a significant fraction of relations occurs only a handful of times. This observation has given rise to recent interest in low-shot learning methods that are able to generalize from only a few examples. The existing approaches, however, are tailored to static knowledge graphs and not easily generalized to temporal settings, where data scarcity poses even bigger problems, e.g., due to the occurrence of new, previously unseen relations. We address this shortcoming by proposing a one-shot learning framework for link prediction in temporal knowledge graphs. Our proposed method employs a self-attention mechanism to effectively encode temporal interactions between entities, and a network to compute a similarity score between a given query and a (one-shot) example. Our experiments show that the proposed algorithm outperforms the state-of-the-art baselines for two well-studied benchmarks while achieving significantly better performance for sparse relations.

Submitted 22 October, 2020; originally announced October 2020.

arXiv:2009.01966 [pdf, other]

Leveraging Clickstream Trajectories to Reveal Low-Quality Workers in Crowdsourced Forecasting Platforms

Authors: Akira Matsui, Emilio Ferrara, Fred Morstatter, Andres Abeliuk, Aram Galstyan

Abstract: Crowdwork often entails tackling cognitively-demanding and time-consuming tasks. Crowdsourcing can be used for complex annotation tasks, from medical imaging to geospatial data, and such data powers sensitive applications, such as health diagnostics or autonomous driving. However, the existence and prevalence of underperforming crowdworkers are well recognized, and can pose a threat to the validity of crowdsourcing. In this study, we propose the use of a computational framework to identify clusters of underperforming workers using clickstream trajectories. We focus on crowdsourced geopolitical forecasting. The framework can reveal different types of underperformers, such as workers with forecasts whose accuracy is far from the consensus of the crowd, those who provide low-quality explanations for their forecasts, and those who simply copy-paste their forecasts from other users. Our study suggests that clickstream clustering and analysis are fundamental tools to diagnose the performance of crowdworkers in platforms leveraging the wisdom of crowds.

Submitted 3 September, 2020; originally announced September 2020.

Comments: 12 pages, 8 figures

arXiv:2005.07293 [pdf, other]

Statistical Equity: A Fairness Classification Objective

Authors: Ninareh Mehrabi, Yuzhong Huang, Fred Morstatter

Abstract: Machine learning systems have been shown to propagate the societal errors of the past. In light of this, a wealth of research focuses on designing solutions that are "fair." Even with this abundance of work, there is no singular definition of fairness, mainly because fairness is subjective and context dependent. We propose a new fairness definition, motivated by the principle of equity, that considers existing biases in the data and attempts to make equitable decisions that account for these previous historical biases. We formalize our definition of fairness, and motivate it with its appropriate contexts. Next, we operationalize it for equitable classification. We perform multiple automatic and human evaluations to show the effectiveness of our definition and demonstrate its utility for aspects of fairness, such as the feedback loop.

Submitted 14 May, 2020; originally announced May 2020.

arXiv:2005.00792 [pdf, other]

ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data

Authors: Woojeong Jin, Rahul Khanna, Suji Kim, Dong-Ho Lee, Fred Morstatter, Aram Galstyan, Xiang Ren

Abstract: Event forecasting is a challenging, yet important task, as humans seek to constantly plan for the future. Existing automated forecasting studies rely mostly on structured data, such as time-series or event-based knowledge graphs, to help predict future events. In this work, we aim to formulate a task, construct a dataset, and provide benchmarks for developing methods for event forecasting with large volumes of unstructured text data. To simulate the forecasting scenario on temporal news documents, we formulate the problem as a restricted-domain, multiple-choice, question-answering (QA) task. Unlike existing QA tasks, our task limits accessible information, and thus a model has to make a forecasting judgement. To showcase the usefulness of this task formulation, we introduce ForecastQA, a question-answering dataset consisting of 10,392 event forecasting questions, which have been collected and verified via crowdsourcing efforts. We present our experiments on ForecastQA using BERT-based models and find that our best model achieves 60.1% accuracy on the dataset, which still lags behind human performance by about 19%. We hope ForecastQA will support future research efforts in bridging this gap.

Submitted 7 June, 2021; v1 submitted 2 May, 2020; originally announced May 2020.

Comments: Accepted to ACL 2021. Project page: https://inklab.usc.edu/ForecastQA/

arXiv:2004.04938 [pdf, other]

Identifying Distributional Perspective Differences from Colingual Groups

Authors: Yufei Tian, Tuhin Chakrabarty, Fred Morstatter, Nanyun Peng

Abstract: Perspective differences exist among different cultures or languages. A lack of mutual understanding among different groups about their perspectives on specific values or events may lead to uninformed decisions or biased opinions. Automatically understanding the group perspectives can provide essential background for many downstream applications of natural language processing techniques. In this paper, we study colingual groups and use language corpora as a proxy to identify their distributional perspectives. We present a novel computational approach to learn shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages. On a held-out set of diverse topics including marriage, corruption, and democracy, our model achieves high correlation with human judgements regarding intra-group values and inter-group differences.

Submitted 12 April, 2021; v1 submitted 10 April, 2020; originally announced April 2020.

arXiv:2004.01820 [pdf, other]

Aggressive, Repetitive, Intentional, Visible, and Imbalanced: Refining Representations for Cyberbullying Classification

Authors: Caleb Ziems, Ymir Vigfusson, Fred Morstatter

Abstract: Cyberbullying is a pervasive problem in online communities. To identify cyberbullying cases in large-scale social networks, content moderators depend on machine learning classifiers for automatic cyberbullying detection. However, existing models remain unfit for real-world applications, largely due to a shortage of publicly available training data and a lack of standard criteria for assigning ground truth labels. In this study, we address the need for reliable data using an original annotation framework. Inspired by social sciences research into bullying behavior, we characterize the nuanced problem of cyberbullying using five explicit factors to represent its social and linguistic aspects. We model this behavior using social network and language-based features, which improve classifier performance. These results demonstrate the importance of representing and modeling cyberbullying as a social phenomenon.

Submitted 3 April, 2020; originally announced April 2020.

Comments: 12 pages, 5 figures, 22 tables, Accepted to the 14th International AAAI Conference on Web and Social Media, ICWSM'20

arXiv:2003.12447 [pdf, other]

Anchor Attention for Hybrid Crowd Forecasts Aggregation

Authors: Yuzhong Huang, Andres Abeliuk, Fred Morstatter, Pavel Atanasov, Aram Galstyan

Abstract: In a crowd forecasting system, aggregation is an algorithm that returns aggregated probabilities for each question based on the probabilities provided per question by each individual in the crowd. Various aggregation methods have been proposed, but simple strategies like linear averaging or selecting the best-performing individual remain competitive. With the recent advances in neural networks, we model forecast aggregation as a machine translation task, which translates from a sequence of individual forecasts into aggregated forecasts, based on proposed Anchor Attention between questions and forecasters. We evaluate our approach using data collected on our forecasting platform and the publicly available Good Judgement Project dataset, and show that our method outperforms current state-of-the-art aggregation approaches by learning a good representation of forecaster and question.

Submitted 16 March, 2022; v1 submitted 3 March, 2020; originally announced March 2020.

arXiv:1910.10872 [pdf, other]

Man is to Person as Woman is to Location: Measuring Gender Bias in Named Entity Recognition

Authors: Ninareh Mehrabi, Thamme Gowda, Fred Morstatter, Nanyun Peng, Aram Galstyan

Abstract: We study the bias in several state-of-the-art named entity recognition (NER) models, specifically a difference in the ability to recognize male and female names as PERSON entity types. We evaluate NER models on a dataset containing 139 years of U.S. census baby names and find that relatively more female names, as opposed to male names, are not recognized as PERSON entities. We study the extent of this bias in several NER systems that are used prominently in industry and academia. We also report a bias in the datasets on which these models were trained. This analysis yields a new benchmark for gender bias evaluation in named entity recognition systems. The data and code for the application of this benchmark will be publicly available for researchers to use.

Submitted 23 October, 2019; originally announced October 2019.

arXiv:1908.09635 [pdf, other]

A Survey on Bias and Fairness in Machine Learning

Authors: Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan

Abstract: With the widespread use of AI systems and applications in our everyday lives, it is important to take fairness issues into consideration while designing and engineering these types of systems. Such systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that the decisions do not reflect discriminatory behavior toward certain groups or populations. We have recently seen work in machine learning, natural language processing, and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming aware of the biases that these applications can contain and have attempted to address them. In this survey we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined in order to avoid the existing bias in AI systems. In addition, we examined different domains and subdomains in AI, showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and how they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We hope that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.

Submitted 25 January, 2022; v1 submitted 22 August, 2019; originally announced August 2019.
