subscribe to arXiv mailings

Foregrounding Artist Opinions: A Survey Study on Transparency, Ownership, and Fairness in AI Generative Art

Authors: Juniper Lovato, Julia Zimmerman, Isabelle Smith, Peter Dodds, Jennifer Karson

Abstract: Generative AI tools are used to create art-like outputs and sometimes aid in the creative process. These tools have potential benefits for artists, but they also have the potential to harm the art workforce and infringe upon artistic and intellectual property rights. Without explicit consent from artists, Generative AI creators scrape artists' digital work to train Generative AI models and produce… ▽ More Generative AI tools are used to create art-like outputs and sometimes aid in the creative process. These tools have potential benefits for artists, but they also have the potential to harm the art workforce and infringe upon artistic and intellectual property rights. Without explicit consent from artists, Generative AI creators scrape artists' digital work to train Generative AI models and produce art-like outputs at scale. These outputs are now being used to compete with human artists in the marketplace as well as being used by some artists in their generative processes to create art. We surveyed 459 artists to investigate the tension between artists' opinions on Generative AI art's potential utility and harm. This study surveys artists' opinions on the utility and threat of Generative AI art models, fair practices in the disclosure of artistic works in AI art training models, ownership and rights of AI art derivatives, and fair compensation. Results show that a majority of artists believe creators should disclose what art is being used in AI training, that AI outputs should not belong to model creators, and express concerns about AI's impact on the art workforce and who profits from their art. We hope the results of this work will further meaningful collaboration and alignment between the art community and Generative AI researchers and developers. △ Less

Submitted 1 October, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

arXiv:2307.08580 [pdf, other]

The Resume Paradox: Greater Language Differences, Smaller Pay Gaps

Authors: Joshua R. Minot, Marc Maier, Bradford Demarest, Nicholas Cheney, Christopher M. Danforth, Peter Sheridan Dodds, Morgan R. Frank

Abstract: Over the past decade, the gender pay gap has remained steady with women earning 84 cents for every dollar earned by men on average. Many studies explain this gap through demand-side bias in the labor market represented through employers' job postings. However, few studies analyze potential bias from the worker supply-side. Here, we analyze the language in millions of US workers' resumes to investi… ▽ More Over the past decade, the gender pay gap has remained steady with women earning 84 cents for every dollar earned by men on average. Many studies explain this gap through demand-side bias in the labor market represented through employers' job postings. However, few studies analyze potential bias from the worker supply-side. Here, we analyze the language in millions of US workers' resumes to investigate how differences in workers' self-representation by gender compare to differences in earnings. Across US occupations, language differences between male and female resumes correspond to 11% of the variation in gender pay gap. This suggests that females' resumes that are semantically similar to males' resumes may have greater wage parity. However, surprisingly, occupations with greater language differences between male and female resumes have lower gender pay gaps. A doubling of the language difference between female and male resumes results in an annual wage increase of $2,797 for the average female worker. This result holds with controls for gender-biases of resume text and we find that per-word bias poorly describes the variance in wage gap. The results demonstrate that textual data and self-representation are valuable factors for improving worker representations and understanding employment inequities. △ Less

Submitted 17 July, 2023; originally announced July 2023.

Comments: 24 pages, 15 figures

arXiv:2306.06794 [pdf, other]

A blind spot for large language models: Supradiegetic linguistic information

Authors: Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Jonathan St. Onge, Mikaela Fudolig, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as "text" or… ▽ More Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as "text" or even "language". We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history. We propose that considering what it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us gain insight into its capabilities in general, and in particular, that its exposure to linguistic training data can be productively reframed as exposure to the diegetic information encoded in language, and its deficits can be reframed as ignorance of extradiegetic information, including supradiegetic linguistic information. Supradiegetic linguistic information consists of those arbitrary aspects of the physical form of language that are not derivable from the one-dimensional relations of context -- frequency, adjacency, proximity, co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the diegetic portion of a word can be thought of as its function, its meaning, as the information in a theoretical vector in a word embedding, while the supradiegetic portion of the word can be thought of as its form, like the shapes of its letters or the sounds of its syllables. We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences. △ Less

Submitted 16 May, 2024; v1 submitted 11 June, 2023; originally announced June 2023.

Comments: 21 pages, 6 figures, 3 tables. Accepted at IC2S2 2024. arXiv admin note: text overlap with arXiv:2206.02608, arXiv:2303.12712, arXiv:2305.10601, arXiv:2305.06424, arXiv:1908.08530 by other authors

Journal ref: Plutonics, Volume 17, 2024, pages 107 - 156

arXiv:2305.12160 [pdf, other]

Park visitation and walkshed demographics in the United States

Authors: Kelsey Linnell, Mikaela Fudolig, Laura Bloomfield, Thomas McAndrew, Taylor H. Ricketts, Jarlath P. M. O'Neil-Dunne, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: A large and growing body of research demonstrates the value of local parks to mental and physical well-being. Recently, researchers have begun using passive digital data sources to investigate equity in usage; exactly who is benefiting from parks? Early studies suggest that park visitation differs according to demographic features, and that the demographic composition of a park's surrounding neigh… ▽ More A large and growing body of research demonstrates the value of local parks to mental and physical well-being. Recently, researchers have begun using passive digital data sources to investigate equity in usage; exactly who is benefiting from parks? Early studies suggest that park visitation differs according to demographic features, and that the demographic composition of a park's surrounding neighborhood may be related to the utilization a park receives. Employing a data set of park visitations generated by observations of roughly 50 million mobile devices in the US in 2019, we assess the ability of the demographic composition of a park's walkshed to predict its yearly visitation. Predictive models are constructed using Support Vector Regression, LASSO, Elastic Net, and Random Forests. Surprisingly, our results suggest that the demographic composition of a park's walkshed demonstrates little to no utility for predicting visitation. △ Less

Submitted 20 May, 2023; originally announced May 2023.

arXiv:2305.08978 [pdf, other]

An assessment of measuring local levels of homelessness through proxy social media signals

Authors: Yoshi Meke Bird, Sarah E. Grobe, Michael V. Arnold, Sean P. Rogers, Mikaela I. Fudolig, Julia Witte Zimmerman, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Recent studies suggest social media activity can function as a proxy for measures of state-level public health, detectable through natural language processing. We present results of our efforts to apply this approach to estimate homelessness at the state level throughout the US during the period 2010-2019 and 2022 using a dataset of roughly 1 million geotagged tweets containing the substring ``hom… ▽ More Recent studies suggest social media activity can function as a proxy for measures of state-level public health, detectable through natural language processing. We present results of our efforts to apply this approach to estimate homelessness at the state level throughout the US during the period 2010-2019 and 2022 using a dataset of roughly 1 million geotagged tweets containing the substring ``homeless.'' Correlations between homelessness-related tweet counts and ranked per capita homelessness volume, but not general-population densities, suggest a relationship between the likelihood of Twitter users to personally encounter or observe homelessness in their everyday lives and their likelihood to communicate about it online. An increase to the log-odds of ``homeless'' appearing in an English-language tweet, as well as an acceleration in the increase in average tweet sentiment, suggest that tweets about homelessness are also affected by trends at the nation-scale. Additionally, changes to the lexical content of tweets over time suggest that reversals to the polarity of national or state-level trends may be detectable through an increase in political or service-sector language over the semantics of charity or direct appeals. An analysis of user account type also revealed changes to Twitter-use patterns by accounts authored by individuals versus entities that may provide an additional signal to confirm changes to homelessness density in a given jurisdiction. While a computational approach to social media analysis may provide a low-cost, real-time dataset rich with information about nationwide and localized impacts of homelessness and homelessness policy, we find that practical issues abound, limiting the potential of social media as a proxy to complement other measures of homelessness. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: 29 pages, 21 figures

arXiv:2305.03092 [pdf, other]

Curating corpora with classifiers: A case study of clean energy sentiment online

Authors: Michael V. Arnold, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: Well curated, large-scale corpora of social media posts containing broad public opinion offer an alternative data source to complement traditional surveys. While surveys are effective at collecting representative samples and are capable of achieving high accuracy, they can be both expensive to run and lag public opinion by days or weeks. Both of these drawbacks could be overcome with a real-time,… ▽ More Well curated, large-scale corpora of social media posts containing broad public opinion offer an alternative data source to complement traditional surveys. While surveys are effective at collecting representative samples and are capable of achieving high accuracy, they can be both expensive to run and lag public opinion by days or weeks. Both of these drawbacks could be overcome with a real-time, high volume data stream and fast analysis pipeline. A central challenge in orchestrating such a data pipeline is devising an effective method for rapidly selecting the best corpus of relevant documents for analysis. Querying with keywords alone often includes irrelevant documents that are not easily disambiguated with bag-of-words natural language processing methods. Here, we explore methods of corpus curation to filter irrelevant tweets using pre-trained transformer-based models, fine-tuned for our binary classification task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95. The low cost and high performance of fine-tuning such a model suggests that our approach could be of broad benefit as a pre-processing step for social media datasets with uncertain corpus boundaries. △ Less

Submitted 9 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: 12 pages, 6 figures

arXiv:2302.08936 [pdf, other]

More Data Types More Problems: A Temporal Analysis of Complexity, Stability, and Sensitivity in Privacy Policies

Authors: Juniper Lovato, Philip Mueller, Parisa Suchdev, Peter S. Dodds

Abstract: Collecting personally identifiable information (PII) on data subjects has become big business. Data brokers and data processors are part of a multi-billion-dollar industry that profits from collecting, buying, and selling consumer data. Yet there is little transparency in the data collection industry which makes it difficult to understand what types of data are being collected, used, and sold, and… ▽ More Collecting personally identifiable information (PII) on data subjects has become big business. Data brokers and data processors are part of a multi-billion-dollar industry that profits from collecting, buying, and selling consumer data. Yet there is little transparency in the data collection industry which makes it difficult to understand what types of data are being collected, used, and sold, and thus the risk to individual data subjects. In this study, we examine a large textual dataset of privacy policies from 1997-2019 in order to investigate the data collection activities of data brokers and data processors. We also develop an original lexicon of PII-related terms representing PII data types curated from legislative texts. This mesoscale analysis looks at privacy policies overtime on the word, topic, and network levels to understand the stability, complexity, and sensitivity of privacy policies over time. We find that (1) privacy legislation correlates with changes in stability and turbulence of PII data types in privacy policies; (2) the complexity of privacy policies decreases over time and becomes more regularized; (3) sensitivity rises over time and shows spikes that are correlated with events when new privacy legislation is introduced. △ Less

Submitted 17 February, 2023; originally announced February 2023.

arXiv:2209.04473 [pdf, other]

Reconstructing the Dynamic Directivity of Unconstrained Speech

Authors: Camille Noufi, Dejan Markovic, Peter Dodds

Abstract: This article presents a method for estimating and reconstructing the spatial energy distribution pattern of natural speech, which is crucial for achieving realistic vocal presence in virtual communication settings. The method comprises two stages. First, recordings of speech captured by a real, static microphone array are used to create an egocentric virtual array that tracks the movement of the s… ▽ More This article presents a method for estimating and reconstructing the spatial energy distribution pattern of natural speech, which is crucial for achieving realistic vocal presence in virtual communication settings. The method comprises two stages. First, recordings of speech captured by a real, static microphone array are used to create an egocentric virtual array that tracks the movement of the speaker over time. This virtual array is used to measure and encode the high-resolution directivity pattern of the speech signal as it evolves dynamically with natural speech and movement. In the second stage, the encoded directivity representation is utilized to train a machine learning model that can estimate the full, dynamic directivity pattern given a limited set of speech signals, such as those recorded using the microphones on a head-mounted display. Our results show that neural networks can accurately estimate the full directivity pattern of natural, unconstrained speech from limited information. The proposed method for estimating and reconstructing the spatial energy distribution pattern of natural speech, along with the evaluation of various machine learning models and training paradigms, provides an important contribution to the development of realistic vocal presence in virtual communication settings. △ Less

Submitted 5 September, 2023; v1 submitted 9 September, 2022; originally announced September 2022.

Comments: In proceedings of I3DA 2023 - The 2023 International Conference on Immersive and 3D Audio. DOI coming soon

arXiv:2208.09496 [pdf, other]

doi 10.1057/s41599-023-01680-4

A decomposition of book structure through ousiometric fluctuations in cumulative word-time

Authors: Mikaela Irene Fudolig, Thayer Alshaabi, Kathryn Cramer, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulati… ▽ More While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulative word-time". Using ousiometrics, a reinterpretation of the valence-arousal-dominance framework of meaning obtained from semantic differentials, we convert text into time series of power and danger scores in cumulative word-time. Each time series is then decomposed using empirical mode decomposition into a sum of constituent oscillatory modes and a non-oscillatory trend. By comparing the decomposition of the original power and danger time series with those derived from shuffled text, we find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend. These fluctuations typically have a period of a few thousand words regardless of the book length or library classification code, but vary depending on the content and structure of the book. Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts. Further, they are consistent with editorial practices that require longer texts to be broken down into sections, such as chapters. Our method also provides a data-driven denoising approach that works for texts of various lengths, in contrast to the more traditional approach of using large window sizes that may inadvertently smooth out relevant information, especially for shorter texts. These results open up avenues for future work in computational literary analysis, particularly the measurement of a basic unit of narrative. △ Less

Submitted 11 May, 2023; v1 submitted 19 August, 2022; originally announced August 2022.

Comments: published in Humanities and Social Sciences Communications

Journal ref: Humanit Soc Sci Commun 10, 187 (2023)

arXiv:2205.15937 [pdf, other]

Spatial changes in park visitation at the onset of the pandemic

Authors: Kelsey Linnell, Mikaela Fudolig, Aaron Schwartz, Taylor H. Ricketts, Jarlath P. M. O'Neil-Dunne, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: The COVID-19 pandemic disrupted the mobility patterns of a majority of Americans beginning in March 2020. Despite the beneficial, socially distanced activity offered by outdoor recreation, confusing and contradictory public health messaging complicated access to natural spaces. Working with a dataset comprising the locations of roughly 50 million distinct mobile devices in 2019 and 2020, we analyz… ▽ More The COVID-19 pandemic disrupted the mobility patterns of a majority of Americans beginning in March 2020. Despite the beneficial, socially distanced activity offered by outdoor recreation, confusing and contradictory public health messaging complicated access to natural spaces. Working with a dataset comprising the locations of roughly 50 million distinct mobile devices in 2019 and 2020, we analyze weekly visitation patterns for 8,135 parks across the United States. Using Bayesian inference, we identify regions that experienced a substantial change in visitation in the first few weeks of the pandemic. We find that regions that did not exhibit a change were likely to have smaller populations, and to have voted more republican than democrat in the 2020 elections. Our study contributes to a growing body of literature using passive observations to explore who benefits from access to nature. △ Less

Submitted 1 April, 2022; originally announced May 2022.

arXiv:2202.03416 [pdf, other]

Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks

Authors: Alexander Richard, Peter Dodds, Vamsi Krishna Ithapu

Abstract: Impulse response estimation in high noise and in-the-wild settings, with minimal control of the underlying data distributions, is a challenging problem. We propose a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning. Our framework is driven by a carefully designed neural network that jointly estimates the impulse response… ▽ More Impulse response estimation in high noise and in-the-wild settings, with minimal control of the underlying data distributions, is a challenging problem. We propose a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning. Our framework is driven by a carefully designed neural network that jointly estimates the impulse response and the (apriori unknown) spectral noise characteristics of an observed signal given the source signal. We demonstrate robustness in estimation, even under low signal-to-noise ratios, and show strong results when learning from spatio-temporal real-world speech data. Our framework provides a natural way to interpolate impulse responses on a spatial grid, while also allowing for efficiently compressing and storing them for real-time rendering applications in augmented and virtual reality. △ Less

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2110.06847 [pdf, other]

Ousiometrics and Telegnomics: The essence of meaning conforms to a two-dimensional powerful-weak and dangerous-safe framework with diverse corpora presenting a safety bias

Authors: P. S. Dodds, T. Alshaabi, M. I. Fudolig, J. W. Zimmerman, J. Lovato, S. Beaulieu, J. R. Minot, M. V. Arnold, A. J. Reagan, C. M. Danforth

Abstract: We define `ousiometrics' to be the study of essential meaning in whatever context that meaningful signals are communicated, and `telegnomics' as the study of remotely sensed knowledge. From work emerging through the middle of the 20th century, the essence of meaning has become generally accepted as being well captured by the three orthogonal dimensions of evaluation, potency, and activation (EPA).… ▽ More We define `ousiometrics' to be the study of essential meaning in whatever context that meaningful signals are communicated, and `telegnomics' as the study of remotely sensed knowledge. From work emerging through the middle of the 20th century, the essence of meaning has become generally accepted as being well captured by the three orthogonal dimensions of evaluation, potency, and activation (EPA). By re-examining first types and then tokens for the English language, and through the use of automatically annotated histograms -- `ousiograms' -- we find here that: 1. The essence of meaning conveyed by words is instead best described by a compass-like power-danger (PD) framework, and 2. Analysis of a disparate collection of large-scale English language corpora -- literature, news, Wikipedia, talk radio, and social media -- shows that natural language exhibits a systematic bias toward safe, low danger words -- a reinterpretation of the Pollyanna principle's positivity bias for written expression. To help justify our choice of dimension names and to help address the problems with representing observed ousiometric dimensions by bipolar adjective pairs, we introduce and explore `synousionyms' and `antousionyms' -- ousiometric counterparts of synonyms and antonyms. We further show that the PD framework revises the circumplex model of affect as a more general model of state of mind. Finally, we use our findings to construct and test a prototype `ousiometer', a telegnomic instrument that measures ousiometric time series for temporal corpora. We contend that our power-danger ousiometric framework provides a complement for entropy-based measurements, and may be of value for the study of a wide variety of communication across biological and artificial life. △ Less

Submitted 29 March, 2023; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 40 pages (34 page main manuscript, 6 page appendix), 15 figures (9 main, 6 appendix), 4 tables

arXiv:2110.00587 [pdf, other]

doi 10.1007/s41109-022-00446-2

Sentiment and structure in word co-occurrence networks on Twitter

Authors: Mikaela Irene Fudolig, Thayer Alshaabi, Michael V. Arnold, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: We explore the relationship between context and happiness scores in political tweets using word co-occurrence networks, where nodes in the network are the words, and the weight of an edge is the number of tweets in the corpus for which the two connected words co-occur. In particular, we consider tweets with hashtags #imwithher and #crookedhillary, both relating to Hillary Clinton's presidential bi… ▽ More We explore the relationship between context and happiness scores in political tweets using word co-occurrence networks, where nodes in the network are the words, and the weight of an edge is the number of tweets in the corpus for which the two connected words co-occur. In particular, we consider tweets with hashtags #imwithher and #crookedhillary, both relating to Hillary Clinton's presidential bid in 2016. We then analyze the network properties in conjunction with the word scores by comparing with null models to separate the effects of the network structure and the score distribution. Neutral words are found to be dominant and most words, regardless of polarity, tend to co-occur with neutral words. We do not observe any score homophily among positive and negative words. However, when we perform network backboning, community detection results in word groupings with meaningful narratives, and the happiness scores of the words in each group correspond to its respective theme. Thus, although we observe no clear relationship between happiness scores and co-occurrence at the node or edge level, a community-centric approach can isolate themes of competing sentiments in a corpus. △ Less

Submitted 1 October, 2021; originally announced October 2021.

Journal ref: Applied Network Science 7, 9 (2022)

arXiv:2109.15173 [pdf]

The role of new nuclear power in the UK's net-zero emissions energy system

Authors: James Price, Ilkka Keppo, Paul Dodds

Abstract: Swift and deep decarbonisation of electricity generation is central to enabling a timely transition to net-zero emission energy systems. While future power systems will likely be dominated by variable renewables (VRE), studies have identified a need for low-carbon dispatchable power such as nuclear. We use a cost-optimising power system model to examine the technoeconomic case for investment in ne… ▽ More Swift and deep decarbonisation of electricity generation is central to enabling a timely transition to net-zero emission energy systems. While future power systems will likely be dominated by variable renewables (VRE), studies have identified a need for low-carbon dispatchable power such as nuclear. We use a cost-optimising power system model to examine the technoeconomic case for investment in new nuclear capacity in the UK's net-zero emissions energy system and consider four sensitivity dimensions: the capital cost of new nuclear, the availability of competing technologies, the expansion of interconnection and weather conditions. We conclude that new nuclear capacity is only cost-effective if ambitious cost and construction times are assumed, competing technologies are unavailable and interconnector expansion is not permitted. We find that BECCS and long-term storage could reduce electricity system costs by 5-21% and that synchronous condensers can provide cost-effective inertia in highly renewable systems with low amounts of synchronous generation. We show that a nearly 100% variable renewable system with very little fossil fuels, no new build nuclear and facilitated by long-term storage is the most cost-effective system design. This suggests that the current favourable UK Government policy towards nuclear is becoming increasingly difficult to justify. △ Less

Submitted 7 October, 2021; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: 26 pages, includes supplementary materials, submitted to Energy. Overnight capital costs now correctly converted to full costs including interest during construction, results largely unchanged

arXiv:2109.09010 [pdf, other]

doi 10.3389/frai.2021.783778

Augmenting semantic lexicons using word embeddings and transfer learning

Authors: Thayer Alshaabi, Colin M. Van Oort, Mikaela Irene Fudolig, Michael V. Arnold, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Sentiment-aware intelligent systems are essential to a wide array of applications. These systems are driven by language models which broadly fall into two paradigms: Lexicon-based and contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researc… ▽ More Sentiment-aware intelligent systems are essential to a wide array of applications. These systems are driven by language models which broadly fall into two paradigms: Lexicon-based and contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researchers to readily determine which words and phrases contribute most to a change in measured sentiment. A challenge for any lexicon-based approach is that the lexicon needs to be routinely expanded with new words and expressions. Here, we propose two models for automatic lexicon expansion. Our first model establishes a baseline employing a simple and shallow neural network initialized with pre-trained word embeddings using a non-contextual approach. Our second model improves upon our baseline, featuring a deep Transformer-based network that brings to bear word definitions to estimate their lexical polarity. Our evaluation shows that both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost. △ Less

Submitted 2 November, 2021; v1 submitted 18 September, 2021; originally announced September 2021.

Comments: 17 pages, 8 figures

Journal ref: Front. Artif. Intell. 4:783778 (2022)

arXiv:2107.06096 [pdf, other]

Blending search queries with social media data to improve forecasts of economic indicators

Authors: Yi Li, Asieh Ahani, Haimao Zhan, Kevin Foley, Thayer Alshaabi, Kelsey Linnell, Peter Sheridan Dodds, Christopher M. Danforth, Adam Fox

Abstract: The forecasting of political, economic, and public health indicators using internet activity has demonstrated mixed results. For example, while some measures of explicitly surveyed public opinion correlate well with social media proxies, the opportunity for profitable investment strategies to be driven solely by sentiment extracted from social media appears to have expired. Nevertheless, the inter… ▽ More The forecasting of political, economic, and public health indicators using internet activity has demonstrated mixed results. For example, while some measures of explicitly surveyed public opinion correlate well with social media proxies, the opportunity for profitable investment strategies to be driven solely by sentiment extracted from social media appears to have expired. Nevertheless, the internet's space of potentially predictive input signals is combinatorially vast and will continue to invite careful exploration. Here, we combine unemployment related search data from Google Trends with economic language on Twitter to attempt to nowcast and forecast: 1. State and national unemployment claims for the US, and 2. Consumer confidence in G7 countries. Building off of a recently developed search-query-based model, we show that incorporating Twitter data improves forecasting of unemployment claims, while the original method remains marginally better at nowcasting. Enriching the input signal with temporal statistical features (e.g., moving average and rate of change) further reduces errors, and improves the predictive utility of the proposed method when applied to other economic indices, such as consumer confidence. △ Less

Submitted 9 July, 2021; originally announced July 2021.

Comments: 12 pages, 7 figures

arXiv:2107.04929 [pdf, other]

Computational Paremiology: Charting the temporal, ecological dynamics of proverb use in books, news articles, and tweets

Authors: E. Davis, C. M. Danforth, W. Mieder, P. S. Dodds

Abstract: Proverbs are an essential component of language and culture, and though much attention has been paid to their history and currency, there has been comparatively little quantitative work on changes in the frequency with which they are used over time. With wider availability of large corpora reflecting many diverse genres of documents, it is now possible to take a broad and dynamic view of the impor… ▽ More Proverbs are an essential component of language and culture, and though much attention has been paid to their history and currency, there has been comparatively little quantitative work on changes in the frequency with which they are used over time. With wider availability of large corpora reflecting many diverse genres of documents, it is now possible to take a broad and dynamic view of the importance of the proverb. Here, we measure temporal changes in the relevance of proverbs within three corpora, differing in kind, scale, and time frame: Millions of books over centuries; hundreds of millions of news articles over twenty years; and billions of tweets over a decade. We find that proverbs present heavy-tailed frequency-of-usage rank distributions in each venue; exhibit trends reflecting the cultural dynamics of the eras covered; and have evolved into contemporary forms on social media. △ Less

Submitted 10 July, 2021; originally announced July 2021.

Comments: Main paper: 16 pages, 9 figures, 1 table; Supplementary: 5 pages, 4 tables, 4 figures

arXiv:2106.10281 [pdf, other]

Say Their Names: Resurgence in the collective attention toward Black victims of fatal police violence following the death of George Floyd

Authors: Henry H. Wu, Ryan J. Gallagher, Thayer Alshaabi, Jane L. Adams, Joshua R. Minot, Michael V. Arnold, Brooke Foucault Welles, Randall Harp, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: The murder of George Floyd by police in May 2020 sparked international protests and renewed attention in the Black Lives Matter movement. Here, we characterize ways in which the online activity following George Floyd's death was unparalleled in its volume and intensity, including setting records for activity on Twitter, prompting the saddest day in the platform's history, and causing George Floyd'… ▽ More The murder of George Floyd by police in May 2020 sparked international protests and renewed attention in the Black Lives Matter movement. Here, we characterize ways in which the online activity following George Floyd's death was unparalleled in its volume and intensity, including setting records for activity on Twitter, prompting the saddest day in the platform's history, and causing George Floyd's name to appear among the ten most frequently used phrases in a day, where he is the only individual to have ever received that level of attention who was not known to the public earlier that same week. Further, we find this attention extended beyond George Floyd and that more Black victims of fatal police violence received attention following his death than during other past moments in Black Lives Matter's history. We place that attention within the context of prior online racial justice activism by showing how the names of Black victims of police violence have been lifted and memorialized over the last 12 years on Twitter. Our results suggest that the 2020 wave of attention to the Black Lives Matter movement centered past instances of police violence in an unprecedented way, demonstrating the impact of the movement's rhetorical strategy to "say their names." △ Less

Submitted 18 June, 2021; originally announced June 2021.

arXiv:2106.05260 [pdf, other]

Sirius: Visualization of Mixed Features as a Mutual Information Network Graph

Authors: Jane L. Adams, Todd F. Deluca, Christopher M. Danforth, Peter S. Dodds, Yuhang Zheng, Konstantinos Anastasakis, Boyoon Choi, Allison Min, Michael M. Bessey

Abstract: Data scientists across disciplines are increasingly in need of exploratory analysis tools for data sets with a high volume of features of mixed data type (quantitative continuous and discrete categorical). We introduce Sirius, a novel visualization package for researchers to explore feature relationships among mixed data types using mutual information. The visualization of feature relationships ai… ▽ More Data scientists across disciplines are increasingly in need of exploratory analysis tools for data sets with a high volume of features of mixed data type (quantitative continuous and discrete categorical). We introduce Sirius, a novel visualization package for researchers to explore feature relationships among mixed data types using mutual information. The visualization of feature relationships aids data scientists in finding meaningful dependence among features prior to the development of predictive modeling pipelines, which can inform downstream analysis such as feature selection, feature extraction, and early detection of potential proxy variables. Using an information theoretic approach, Sirius supports network visualization of heterogeneous data sets (consisting of continuous and discrete data types), and provides a user interface for exploring feature pairs with locally significant mutual information scores. Mutual information algorithm and bivariate chart types are assigned on a data type pairing basis (continuous-continuous, discrete-discrete, and discrete-continuous). We show how this tool can be used for tasks such as hypothesis confirmation, identification of predictive features, suggestions for feature extraction, or early warning of data abnormalities. The accompanying website for this paper can be accessed at https://sirius.universalities.com/. All code and supplemental materials can be accessed at https://osf.io/pdm9r/. △ Less

Submitted 13 August, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

ACM Class: H.5.2; J.0

arXiv:2106.01481 [pdf, other]

Quantifying language changes surrounding mental health on Twitter

Authors: Anne Marie Stupinski, Thayer Alshaabi, Michael V. Arnold, Jane Lydia Adams, Joshua R. Minot, Matthew Price, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: Mental health challenges are thought to afflict around 10% of the global population each year, with many going untreated due to stigma and limited access to services. Here, we explore trends in words and phrases related to mental health through a collection of 1- , 2-, and 3-grams parsed from a data stream of roughly 10% of all English tweets since 2012. We examine temporal dynamics of mental heal… ▽ More Mental health challenges are thought to afflict around 10% of the global population each year, with many going untreated due to stigma and limited access to services. Here, we explore trends in words and phrases related to mental health through a collection of 1- , 2-, and 3-grams parsed from a data stream of roughly 10% of all English tweets since 2012. We examine temporal dynamics of mental health language, finding that the popularity of the phrase 'mental health' increased by nearly two orders of magnitude between 2012 and 2018. We observe that mentions of 'mental health' spike annually and reliably due to mental health awareness campaigns, as well as unpredictably in response to mass shootings, celebrities dying by suicide, and popular fictional stories portraying suicide. We find that the level of positivity of messages containing 'mental health', while stable through the growth period, has declined recently. Finally, we use the ratio of original tweets to retweets to quantify the fraction of appearances of mental health language due to social amplification. Since 2015, mentions of mental health have become increasingly due to retweets, suggesting that stigma associated with discussion of mental health on Twitter has diminished with time. △ Less

Submitted 2 June, 2021; originally announced June 2021.

Comments: 12 pages, 5 figures, 1 table

arXiv:2105.12006 [pdf, other]

The incel lexicon: Deciphering the emergent cryptolect of a global misogynistic community

Authors: Kelly Gothard, David Rushing Dewhurst, Joshua R. Minot, Jane Lydia Adams, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Evolving out of a gender-neutral framing of an involuntary celibate identity, the concept of `incels' has come to refer to an online community of men who bear antipathy towards themselves, women, and society-at-large for their perceived inability to find and maintain sexual relationships. By exploring incel language use on Reddit, a global online message board, we contextualize the incel community… ▽ More Evolving out of a gender-neutral framing of an involuntary celibate identity, the concept of `incels' has come to refer to an online community of men who bear antipathy towards themselves, women, and society-at-large for their perceived inability to find and maintain sexual relationships. By exploring incel language use on Reddit, a global online message board, we contextualize the incel community's online expressions of misogyny and real-world acts of violence perpetrated against women. After assembling around three million comments from incel-themed Reddit channels, we analyze the temporal dynamics of a data driven rank ordering of the glossary of phrases belonging to an emergent incel lexicon. Our study reveals the generation and normalization of an extensive coded misogynist vocabulary in service of the group's identity. △ Less

Submitted 25 May, 2021; originally announced May 2021.

Comments: 18 pages, 11 figures

arXiv:2103.05841 [pdf, other]

Interpretable bias mitigation for textual data: Reducing gender bias in patient notes while maintaining classification performance

Authors: Joshua R. Minot, Nicholas Cheney, Marc Maier, Danne C. Elbers, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Medical systems in general, and patient treatment decisions and outcomes in particular, are affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language mo… ▽ More Medical systems in general, and patient treatment decisions and outcomes in particular, are affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models -- statistical estimates of the relationships between concepts derived from distant reading of corpora. Building on this work, we investigate how word choices made by healthcare practitioners and language models interact with regards to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low- to medium-levels of bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. This work outlines an interpretable approach for using data augmentation to identify and reduce the potential for bias in natural language processing pipelines. △ Less

Submitted 9 March, 2021; originally announced March 2021.

Comments: 31 pages, 22 figures

arXiv:2008.13078 [pdf, other]

Probability-turbulence divergence: A tunable allotaxonometric instrument for comparing heavy-tailed categorical distributions

Authors: P. S. Dodds, J. R. Minot, M. V. Arnold, T. Alshaabi, J. L. Adams, A. J. Reagan, C. M. Danforth

Abstract: Real-world complex systems often comprise many distinct types of elements as well as many more types of networked interactions between elements. When the relative abundances of types can be measured well, we further observe heavy-tailed categorical distributions for type frequencies. For the comparison of type frequency distributions of two systems or a system with itself at different time points… ▽ More Real-world complex systems often comprise many distinct types of elements as well as many more types of networked interactions between elements. When the relative abundances of types can be measured well, we further observe heavy-tailed categorical distributions for type frequencies. For the comparison of type frequency distributions of two systems or a system with itself at different time points in time -- a facet of allotaxonometry -- a great range of probability divergences are available. Here, we introduce and explore `probability-turbulence divergence', a tunable, straightforward, and interpretable instrument for comparing normalizable categorical frequency distributions. We model probability-turbulence divergence (PTD) after rank-turbulence divergence (RTD). While probability-turbulence divergence is more limited in application than rank-turbulence divergence, it is more sensitive to changes in type frequency. We build allotaxonographs to display probability turbulence, incorporating a way to visually accommodate zero probabilities for `exclusive types' which are types that appear in only one system. We explore comparisons of example distributions taken from literature, social media, and ecology. We show how probability-turbulence divergence either explicitly or functionally generalizes many existing kinds of distances and measures, including, as special cases, $L^{(p)}$ norms, the Sørensen-Dice coefficient (the $F_1$ statistic), and the Hellinger distance. We discuss similarities with the generalized entropies of R{é}nyi and Tsallis, and the diversity indices (or Hill numbers) from ecology. We close with thoughts on open problems concerning the optimization of the tuning of rank- and probability-turbulence divergence. △ Less

Submitted 20 August, 2024; v1 submitted 29 August, 2020; originally announced August 2020.

Comments: 14 pages, 7 figures

arXiv:2008.11305 [pdf, other]

Long-term word frequency dynamics derived from Twitter are corrupted: A bespoke approach to detecting and removing pathologies in ensembles of time series

Authors: P. S. Dodds, J. R. Minot, M. V. Arnold, T. Alshaabi, J. L. Adams, D. R. Dewhurst, A. J. Reagan, C. M. Danforth

Abstract: Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in… ▽ More Maintaining the integrity of long-term data collection is an essential scientific practice. As a field evolves, so too will that field's measurement instruments and data storage systems, as they are invented, improved upon, and made obsolete. For data streams generated by opaque sociotechnical systems which may have episodic and unknown internal rule changes, detecting and accounting for shifts in historical datasets requires vigilance and creative analysis. Here, we show that around 10\% of day-scale word usage frequency time series for Twitter collected in real time for a set of roughly 10,000 frequently used words for over 10 years come from tweets with, in effect, corrupted language labels. We describe how we uncovered problematic signals while comparing word usage over varying time frames. We locate time points where Twitter switched on or off different kinds of language identification algorithms, and where data formats may have changed. We then show how we create a statistic for identifying and removing words with pathological time series. While our resulting process for removing `bad' time series from ensembles of time series is particular, the approach leading to its construction may be generalizeable. △ Less

Submitted 27 August, 2020; v1 submitted 25 August, 2020; originally announced August 2020.

Comments: 8 pages, 5 figures

arXiv:2008.07301 [pdf, other]

doi 10.1371/journal.pone.0260592

Computational timeline reconstruction of the stories surrounding Trump: Story turbulence, narrative control, and collective chronopathy

Authors: P. S. Dodds, J. R. Minot, M. V. Arnold, T. Alshaabi, J. L. Adams, A. J. Reagan, C. M. Danforth

Abstract: Measuring the specific kind, temporal ordering, diversity, and turnover rate of stories surrounding any given subject is essential to developing a complete reckoning of that subject's historical impact. Here, we use Twitter as a distributed news and opinion aggregation source to identify and track the dynamics of the dominant day-scale stories around Donald Trump, the 45th President of the United… ▽ More Measuring the specific kind, temporal ordering, diversity, and turnover rate of stories surrounding any given subject is essential to developing a complete reckoning of that subject's historical impact. Here, we use Twitter as a distributed news and opinion aggregation source to identify and track the dynamics of the dominant day-scale stories around Donald Trump, the 45th President of the United States. Working with a data set comprising around 20 billion 1-grams, we first compare each day's 1-gram and 2-gram usage frequencies to those of a year before, to create day- and week-scale timelines for Trump stories for 2016 through 2020. We measure Trump's narrative control, the extent to which stories have been about Trump or put forward by Trump. We then quantify story turbulence and collective chronopathy -- the rate at which a population's stories for a subject seem to change over time. We show that 2017 was the most turbulent overall year for Trump. In 2020, story generation slowed dramatically during the first two major waves of the COVID-19 pandemic, with rapid turnover returning first with the Black Lives Matter protests following George Floyd's murder and then later by events leading up to and following the 2020 US presidential election, including the storming of the US Capitol six days into 2021. Trump story turnover for 2 months during the COVID-19 pandemic was on par with that of 3 days in September 2017. Our methods may be applied to any well-discussed phenomenon, and have potential to enable the computational aspects of journalism, history, and biography. △ Less

Submitted 30 September, 2022; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: 13 pages, 5 figures (4 main, 1 appendix), 1 table. Analysis complete for 6 calendar years, from 2015/01/01 through to 2021/12/31

Journal ref: PLOS ONE, 2021, e0260592

arXiv:2008.02250 [pdf, other]

doi 10.1140/epjds/s13688-021-00260-3

Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts

Authors: Ryan J. Gallagher, Morgan R. Frank, Lewis Mitchell, Aaron J. Schwartz, Andrew J. Reagan, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or… ▽ More A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences. Through several case studies, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives. △ Less

Submitted 5 August, 2020; originally announced August 2020.

Comments: 20 pages, 7 figures, 2 tables

Journal ref: EPJ Data Science, 10(4), 2021

arXiv:2007.12988 [pdf, other]

doi 10.1126/sciadv.abe6534

Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter

Authors: Thayer Alshaabi, Jane L. Adams, Michael V. Arnold, Joshua R. Minot, David R. Dewhurst, Andrew J. Reagan, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.… ▽ More In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.e., popularity) of trending storylines and contemporary cultural phenomena. Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles, numerals, symbols, and emojis. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of extracting and tracking dynamic changes of n-grams can be extended to any similar social media platform. We showcase a few examples of the many possible avenues of study we aim to enable including how social amplification can be visualized through 'contagiograms'. We also present some example case studies that bridge n-gram time series with disparate data sources to explore sociotechnical dynamics of famous individuals, box office success, and social unrest. △ Less

Submitted 16 July, 2021; v1 submitted 25 July, 2020; originally announced July 2020.

Comments: Main text: 15 pages, 6 figures; Supplementary text: 23 pages, 11 figures, 15 tables. Website: https://storywrangling.org/

Journal ref: Sci.Adv. 7 eabe6534 (2021)

arXiv:2007.09124 [pdf, other]

Local information sources received the most attention from Puerto Ricans during the aftermath of Hurricane María

Authors: Benjamin Freixas Emery, Meredith T. Niles, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: In September 2017, Hurricane María made landfall across the Caribbean region as a category 4 storm. In the aftermath, many residents of Puerto Rico were without power or clean running water for nearly a year. Using both English and Spanish tweets from September 16 to October 15 2017, we investigate discussion of María both on and off the island, constructing a proxy for the temporal network of com… ▽ More In September 2017, Hurricane María made landfall across the Caribbean region as a category 4 storm. In the aftermath, many residents of Puerto Rico were without power or clean running water for nearly a year. Using both English and Spanish tweets from September 16 to October 15 2017, we investigate discussion of María both on and off the island, constructing a proxy for the temporal network of communication between victims of the hurricane and others. We use information theoretic tools to compare the lexical divergence of different subgroups within the network. Lastly, we quantify temporal changes in user prominence throughout the event. We find at the global level that Spanish tweets more often contained messages of hope and a focus on those helping. At the local level, we find that information propagating among Puerto Ricans most often originated from sources local to the island, such as journalists and politicians. Critically, content from these accounts overshadows content from celebrities, global news networks, and the like for the large majority of the time period studied. Our findings reveal insight into ways social media campaigns could be deployed to disseminate relief information during similar events in the future. △ Less

Submitted 17 July, 2020; originally announced July 2020.

arXiv:2006.10658 [pdf, other]

Gauging the happiness benefit of US urban parks through Twitter

Authors: A. J. Schwartz, P. S. Dodds, J. P. M. O'Neil-Dunne, T. H. Ricketts, C. M. Danforth

Abstract: The relationship between nature contact and mental well-being has received increasing attention in recent years. While a body of evidence has accumulated demonstrating a positive relationship between time in nature and mental well-being, there have been few studies comparing this relationship in different locations over long periods of time. In this study, we estimate a happiness benefit, the diff… ▽ More The relationship between nature contact and mental well-being has received increasing attention in recent years. While a body of evidence has accumulated demonstrating a positive relationship between time in nature and mental well-being, there have been few studies comparing this relationship in different locations over long periods of time. In this study, we estimate a happiness benefit, the difference in expressed happiness between in- and out-of-park tweets, for the 25 largest cities in the US by population. People write happier words during park visits when compared with non-park user tweets collected around the same time. While the words people write are happier in parks on average and in most cities, we find considerable variation across cities. Tweets are happier in parks at all times of the day, week, and year, not just during the weekend or summer vacation. Across all cities, we find that the happiness benefit is highest in parks larger than 100 acres. Overall, our study suggests the happiness benefit associated with park visitation is on par with US holidays such as Thanksgiving and New Year's Day. △ Less

Submitted 18 June, 2020; originally announced June 2020.

Comments: 13 pages including appendix, 9 figures, 2 tables

arXiv:2006.08527 [pdf, other]

doi 10.1371/journal.pone.0247795

The sociospatial factors of death: Analyzing effects of geospatially-distributed variables in a Bayesian mortality model for Hong Kong

Authors: Thayer Alshaabi, David Rushing Dewhurst, James P. Bagrow, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: Human mortality is in part a function of multiple socioeconomic factors that differ both spatially and temporally. Adjusting for other covariates, the human lifespan is positively associated with household wealth. However, the extent to which mortality in a geographical region is a function of socioeconomic factors in both that region and its neighbors is unclear. There is also little information… ▽ More Human mortality is in part a function of multiple socioeconomic factors that differ both spatially and temporally. Adjusting for other covariates, the human lifespan is positively associated with household wealth. However, the extent to which mortality in a geographical region is a function of socioeconomic factors in both that region and its neighbors is unclear. There is also little information on the temporal components of this relationship. Using the districts of Hong Kong over multiple census years as a case study, we demonstrate that there are differences in how wealth indicator variables are associated with longevity in (a) areas that are affluent but neighbored by socially deprived districts versus (b) wealthy areas surrounded by similarly wealthy districts. We also show that the inclusion of spatially-distributed variables reduces uncertainty in mortality rate predictions in each census year when compared with a baseline model. Our results suggest that geographic mortality models should incorporate nonlocal information (e.g., spatial neighbors) to lower the variance of their mortality estimates, and point to a more in-depth analysis of sociospatial spillover effects on mortality rates. △ Less

Submitted 25 January, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: 26 pages (15 main, 11 appendix), 22 figures (6 main, 11 appendix), 2 tables

arXiv:2006.03526 [pdf, other]

doi 10.1371/journal.pone.0248880

Ratioing the President: An exploration of public engagement with Obama and Trump on Twitter

Authors: Joshua R. Minot, Michael V. Arnold, Thayer Alshaabi, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: The past decade has witnessed a marked increase in the use of social media by politicians, most notably exemplified by the 45th President of the United States (POTUS), Donald Trump. On Twitter, POTUS messages consistently attract high levels of engagement as measured by likes, retweets, and replies. Here, we quantify the balance of these activities, also known as "ratios", and study their dynamics… ▽ More The past decade has witnessed a marked increase in the use of social media by politicians, most notably exemplified by the 45th President of the United States (POTUS), Donald Trump. On Twitter, POTUS messages consistently attract high levels of engagement as measured by likes, retweets, and replies. Here, we quantify the balance of these activities, also known as "ratios", and study their dynamics as a proxy for collective political engagement in response to presidential communications. We find that raw activity counts increase during the period leading up to the 2016 election, accompanied by a regime change in the ratio of retweets-to-replies connected to the transition between campaigning and governing. For the Trump account, we find words related to fake news and the Mueller inquiry are more common in tweets with a high number of replies relative to retweets. Finally, we find that Barack Obama consistently received a higher retweet-to-reply ratio than Donald Trump. These results suggest Trump's Twitter posts are more often controversial and subject to enduring engagement as a given news cycle unfolds. △ Less

Submitted 5 June, 2020; originally announced June 2020.

Comments: 17 pages, 10 figures

arXiv:2004.06790 [pdf, other]

The sleep loss insult of Spring Daylight Savings in the US is absorbed by Twitter users within 48 hours

Authors: Kelsey Linnell, Thayer Alshaabi, Thomas McAndrew, Jeanie Lim, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: Sleep loss has been linked to heart disease, diabetes, cancer, and an increase in accidents, all of which are among the leading causes of death in the United States. Population-scale sleep studies have the potential to advance public health by helping to identify at-risk populations, changes in collective sleep patterns, and to inform policy change. Prior research suggests other kinds of health in… ▽ More Sleep loss has been linked to heart disease, diabetes, cancer, and an increase in accidents, all of which are among the leading causes of death in the United States. Population-scale sleep studies have the potential to advance public health by helping to identify at-risk populations, changes in collective sleep patterns, and to inform policy change. Prior research suggests other kinds of health indicators such as depression and obesity can be estimated using social media activity. However, the inability to effectively measure collective sleep with publicly available data has limited large-scale academic studies. Here, we investigate the passive estimation of sleep loss through a proxy analysis of Twitter activity profiles. We use "Spring Forward" events, which occur at the beginning of Daylight Savings Time in the United States, as a natural experimental condition to estimate spatial differences in sleep loss across the United States. On average, peak Twitter activity occurs roughly 45 minutes later on the Sunday following Spring Forward. By Monday morning however, activity curves are realigned with the week before, suggesting that at least on Twitter, the lost hour of early Sunday morning has been quickly absorbed. △ Less

Submitted 10 April, 2020; originally announced April 2020.

arXiv:2004.03516 [pdf, other]

Divergent modes of online collective attention to the COVID-19 pandemic are associated with future caseload variance

Authors: David Rushing Dewhurst, Thayer Alshaabi, Michael V. Arnold, Joshua R. Minot, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Using a random 10% sample of tweets authored from 2019-09-01 through 2020-04-30, we analyze the dynamic behavior of words (1-grams) used on Twitter to describe the ongoing COVID-19 pandemic. Across 24 languages, we find two distinct dynamic regimes: One characterizing the rise and subsequent collapse in collective attention to the initial Coronavirus outbreak in late January, and a second that rep… ▽ More Using a random 10% sample of tweets authored from 2019-09-01 through 2020-04-30, we analyze the dynamic behavior of words (1-grams) used on Twitter to describe the ongoing COVID-19 pandemic. Across 24 languages, we find two distinct dynamic regimes: One characterizing the rise and subsequent collapse in collective attention to the initial Coronavirus outbreak in late January, and a second that represents March COVID-19-related discourse. Aggregating countries by dominant language use, we find that volatility in the first dynamic regime is associated with future volatility in new cases of COVID-19 roughly three weeks (average 22.49 $\pm$ 3.26 days) later. Our results suggest that surveillance of change in usage of epidemiology-related words on social media may be useful in forecasting later change in disease case numbers, but we emphasize that our current findings are not causal or necessarily predictive. △ Less

Submitted 19 May, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

Comments: 12 + 4 pages, 11 + 4 figures, code + data + figures will soon be available at http://compstorylab.org/covid19ngrams/

arXiv:2003.14291 [pdf, other]

doi 10.1371/journal.pone.0251762

Hurricanes and hashtags: Characterizing online collective attention for natural disasters

Authors: Michael V. Arnold, David Rushing Dewhurst, Thayer Alshaabi, Joshua R. Minot, Jane L. Adams, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: We study collective attention paid towards hurricanes through the lens of $n$-grams on Twitter, a social media platform with global reach. Using hurricane name mentions as a proxy for awareness, we find that the exogenous temporal dynamics are remarkably similar across storms, but that overall collective attention varies widely even among storms causing comparable deaths and damage. We construct `… ▽ More We study collective attention paid towards hurricanes through the lens of $n$-grams on Twitter, a social media platform with global reach. Using hurricane name mentions as a proxy for awareness, we find that the exogenous temporal dynamics are remarkably similar across storms, but that overall collective attention varies widely even among storms causing comparable deaths and damage. We construct `hurricane attention maps' and observe that hurricanes causing deaths on (or economic damage to) the continental United States generate substantially more attention in English language tweets than those that do not. We find that a hurricane's Saffir-Simpson wind scale category assignment is strongly associated with the amount of attention it receives. Higher category storms receive higher proportional increases of attention per proportional increases in number of deaths or dollars of damage, than lower category storms. The most damaging and deadly storms of the 2010s, Hurricanes Harvey and Maria, generated the most attention and were remembered the longest, respectively. On average, a category 5 storm receives 4.6 times more attention than a category 1 storm causing the same number of deaths and economic damage. △ Less

Submitted 31 March, 2020; originally announced March 2020.

Comments: 31 pages (14 main, 17 Supplemental), 19 figures (5 main, 14 appendix)

arXiv:2003.12614 [pdf, other]

doi 10.1371/journal.pone.0244476

How the world's collective attention is being paid to a pandemic: COVID-19 related n-gram time series for 24 languages on Twitter

Authors: T. Alshaabi, J. R. Minot, M. V. Arnold, J. L. Adams, D. R. Dewhurst, A. J. Reagan, R. Muhamad, C. M. Danforth, P. S. Dodds

Abstract: In confronting the global spread of the coronavirus disease COVID-19 pandemic we must have coordinated medical, operational, and political responses. In all efforts, data is crucial. Fundamentally, and in the possible absence of a vaccine for 12 to 18 months, we need universal, well-documented testing for both the presence of the disease as well as confirmed recovery through serological tests for… ▽ More In confronting the global spread of the coronavirus disease COVID-19 pandemic we must have coordinated medical, operational, and political responses. In all efforts, data is crucial. Fundamentally, and in the possible absence of a vaccine for 12 to 18 months, we need universal, well-documented testing for both the presence of the disease as well as confirmed recovery through serological tests for antibodies, and we need to track major socioeconomic indices. But we also need auxiliary data of all kinds, including data related to how populations are talking about the unfolding pandemic through news and stories. To in part help on the social media side, we curate a set of 2000 day-scale time series of 1- and 2-grams across 24 languages on Twitter that are most 'important' for April 2020 with respect to April 2019. We determine importance through our allotaxonometric instrument, rank-turbulence divergence. We make some basic observations about some of the time series, including a comparison to numbers of confirmed deaths due to COVID-19 over time. We broadly observe across all languages a peak for the language-specific word for 'virus' in January 2020 followed by a decline through February and then a surge through March and April. The world's collective attention dropped away while the virus spread out from China. We host the time series on Gitlab, updating them on a daily basis while relevant. Our main intent is for other researchers to use these time series to enhance whatever analyses that may be of use during the pandemic as well as for retrospective investigations. △ Less

Submitted 6 January, 2021; v1 submitted 27 March, 2020; originally announced March 2020.

Comments: 13 pages, 6 figures, 3 tables, website: http://compstorylab.org/covid19ngrams/

arXiv:2003.03667 [pdf, other]

doi 10.1140/epjds/s13688-021-00271-0

The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020

Authors: Thayer Alshaabi, David R. Dewhurst, Joshua R. Minot, Michael V. Arnold, Jane L. Adams, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio':… ▽ More Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the 'contagion ratio': The balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1 -- the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages. △ Less

Submitted 8 March, 2021; v1 submitted 7 March, 2020; originally announced March 2020.

Comments: 26 pages (15 main, 11 appendix), 13 figures (6 main, 7 appendix), and 4 online appendices available at http://compstorylab.org/storywrangler/papers/tlid/

arXiv:2002.09770 [pdf, other]

Allotaxonometry and rank-turbulence divergence: A universal instrument for comparing complex systems

Authors: P. S. Dodds, J. R. Minot, M. V. Arnold, T. Alshaabi, J. L. Adams, D. R. Dewhurst, T. J. Gray, M. R. Frank, A. J. Reagan, C. M. Danforth

Abstract: Complex systems often comprise many kinds of components which vary over many orders of magnitude in size: Populations of cities in countries, individual and corporate wealth in economies, species abundance in ecologies, word frequency in natural language, and node degree in complex networks. Here, we introduce `allotaxonometry' along with `rank-turbulence divergence' (RTD), a tunable instrument fo… ▽ More Complex systems often comprise many kinds of components which vary over many orders of magnitude in size: Populations of cities in countries, individual and corporate wealth in economies, species abundance in ecologies, word frequency in natural language, and node degree in complex networks. Here, we introduce `allotaxonometry' along with `rank-turbulence divergence' (RTD), a tunable instrument for comparing any two ranked lists of components. We analytically develop our rank-based divergence in a series of steps, and then establish a rank-based allotaxonograph which pairs a map-like histogram for rank-rank pairs with an ordered list of components according to divergence contribution. We explore the performance of rank-turbulence divergence, which we view as an instrument of `type calculus', for a series of distinct settings including: Language use on Twitter and in books, species abundance, baby name popularity, market capitalization, performance in sports, mortality causes, and job titles. We provide a series of supplementary flipbooks which demonstrate the tunability and storytelling power of rank-based allotaxonometry. △ Less

Submitted 2 August, 2023; v1 submitted 22 February, 2020; originally announced February 2020.

Comments: 36 pages, 10 main figures, 15 inset figures, 1 table; online appendices: http://compstorylab.org/allotaxonometry/

arXiv:1910.10874 [pdf, ps, other]

Logarithmic submajorization, uniform majorization and Hölder type inequalities for $τ$-measurable operators

Authors: Peter Dodds, Theresa Dodds, Fedor Sukochev, Dmitriy Zanin

Abstract: We extend the notion of the determinant function $Λ$, originally introduced by T.Fack for $τ$-compact operators, to a natural algebra of $τ$-measurable operators affiliated with a semifinite von Neumann algebra which coincides with that defined by Haagerup and Schultz in the finite case and on which the determinant function is shown to be submultiplicative. Application is given to Hölder type ineq… ▽ More We extend the notion of the determinant function $Λ$, originally introduced by T.Fack for $τ$-compact operators, to a natural algebra of $τ$-measurable operators affiliated with a semifinite von Neumann algebra which coincides with that defined by Haagerup and Schultz in the finite case and on which the determinant function is shown to be submultiplicative. Application is given to Hölder type inequalities via general Araki-Lieb-Thirring inequalities due to Kosaki and Han and to a Weyl-type theorem for uniform majorization. △ Less

Submitted 23 October, 2019; originally announced October 2019.

MSC Class: 46L52; 47A63; 46E30; 47A30

arXiv:1910.00149 [pdf, other]

Fame and Ultrafame: Measuring and comparing daily levels of `being talked about' for United States' presidents, their rivals, God, countries, and K-pop

Authors: Peter Sheridan Dodds, Joshua R. Minot, Michael V. Arnold, Thayer Alshaabi, Jane Lydia Adams, David Rushing Dewhurst, Andrew J. Reagan, Christopher M. Danforth

Abstract: When building a global brand of any kind -- a political actor, clothing style, or belief system -- developing widespread awareness is a primary goal. Short of knowing any of the stories or products of a brand, being talked about in whatever fashion -- raw fame -- is, as Oscar Wilde would have it, better than not being talked about at all. Here, we measure, examine, and contrast the day-to-day raw… ▽ More When building a global brand of any kind -- a political actor, clothing style, or belief system -- developing widespread awareness is a primary goal. Short of knowing any of the stories or products of a brand, being talked about in whatever fashion -- raw fame -- is, as Oscar Wilde would have it, better than not being talked about at all. Here, we measure, examine, and contrast the day-to-day raw fame dynamics on Twitter for US Presidents and major US Presidential candidates from 2008 to 2020: Barack Obama, John McCain, Mitt Romney, Hillary Clinton, Donald Trump, and Joe Biden. We assign ``lexical fame'' to be the number and (Zipfian) rank of the (lowercased) mentions made for each individual across all languages. We show that all five political figures have at some point reached extraordinary volume levels of what we define to be ``lexical ultrafame'': An overall rank of approximately 300 or less which is largely the realm of function words and demarcated by the highly stable rank of `god'. By this measure, `trump' has become enduringly ultrafamous, from the 2016 election on. We use typical ranks for country names and function words as standards to improve perception of scale. We quantify relative fame rates and find that in the eight weeks leading up the 2008 and 2012 elections, `obama' held a 1000:757 volume ratio over `mccain' and 1000:892 over `romney', well short of the 1000:544 and 1000:504 volumes favoring `trump' over `hillary' and `biden' in the 8 weeks leading up to the 2016 and 2020 elections. Finally, we track how only one other entity has more sustained ultrafame than `trump' on Twitter: The K-pop (Korean pop) band BTS. We chart the dramatic rise of BTS, finding their Twitter handle `@bts\_twt' has been able to compete with `a' and `the'. Our findings for BTS more generally point to K-pop's growing economic, social, and political power. △ Less

Submitted 29 October, 2021; v1 submitted 30 September, 2019; originally announced October 2019.

Comments: 31 pages (21 pages main text, 10 pages appendix), 8 figures (7 in main text, 1 in appendix), 10 tables (1 in main text, 9 in appendix)

arXiv:1908.07039 [pdf, other]

doi 10.1142/S0218127420502569

Chimera States and Seizures in a Mouse Neuronal Model

Authors: Henry M. Mitchell, Peter Sheridan Dodds, J. Matthew Mahoney, Christopher M. Danforth

Abstract: Chimera states---the coexistence of synchrony and asynchrony in a nonlocally-coupled network of identical oscillators---are often used as a model framework for epileptic seizures. Here, we explore the dynamics of chimera states in a network of modified Hindmarsh-Rose neurons, configured to reflect the graph of the mesoscale mouse connectome. Our model produces superficially epileptiform activity c… ▽ More Chimera states---the coexistence of synchrony and asynchrony in a nonlocally-coupled network of identical oscillators---are often used as a model framework for epileptic seizures. Here, we explore the dynamics of chimera states in a network of modified Hindmarsh-Rose neurons, configured to reflect the graph of the mesoscale mouse connectome. Our model produces superficially epileptiform activity converging on persistent chimera states in a large region of a two-parameter space governing connections (a) between subcortices within a cortex and (b) between cortices. Our findings contribute to a growing body of literature suggesting mathematical models can qualitatively reproduce epileptic seizure dynamics. △ Less

Submitted 23 June, 2020; v1 submitted 2 August, 2019; originally announced August 2019.

Comments: 16 pages, 17 figures

arXiv:1908.02793 [pdf, other]

doi 10.1103/PhysRevE.101.022307

Noncooperative dynamics in election interference

Authors: David Rushing Dewhurst, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Foreign power interference in domestic elections is an existential threat to societies. Manifested through myriad methods from war to words, such interference is a timely example of strategic interaction between economic and political agents. We model this interaction between rational game players as a continuous-time differential game, constructing an analytical model of this competition with a v… ▽ More Foreign power interference in domestic elections is an existential threat to societies. Manifested through myriad methods from war to words, such interference is a timely example of strategic interaction between economic and political agents. We model this interaction between rational game players as a continuous-time differential game, constructing an analytical model of this competition with a variety of payoff structures. All-or-nothing attitudes by only one player regarding the outcome of the game lead to an arms race in which both countries spend increasing amounts on interference and counter-interference operations. We then confront our model with data pertaining to the Russian interference in the 2016 United States presidential election contest. We introduce and estimate a Bayesian structural time series model of election polls and social media posts by Russian Twitter troll accounts. Our analytical model, while purposefully abstract and simple, adequately captures many temporal characteristics of the election and social media activity. We close with a discussion of our model's shortcomings and suggestions for future research. △ Less

Submitted 9 January, 2020; v1 submitted 7 August, 2019; originally announced August 2019.

Comments: 33 pages (22 body, 11 appendix), 33 figures (15 body, 18 appendix), accompanying code at https://gitlab.com/daviddewhurst/red-blue-game

Journal ref: Phys. Rev. E 101, 022307 (2020)

arXiv:1907.12567 [pdf]

Exploring Perceptions of Veganism

Authors: Laura Jennings, Christopher M. Danforth, Peter Sheridan Dodds, Elizabeth Pinel, Lizzy Pope

Abstract: This project examined perceptions of the vegan lifestyle using surveys and social media to explore barriers to choosing veganism. A survey of 510 individuals indicated that non-vegans did not believe veganism was as healthy or difficult as vegans. In a second analysis, Instagram posts using #vegan suggest content is aimed primarily at the female vegan community. Finally, sentiment analysis of roug… ▽ More This project examined perceptions of the vegan lifestyle using surveys and social media to explore barriers to choosing veganism. A survey of 510 individuals indicated that non-vegans did not believe veganism was as healthy or difficult as vegans. In a second analysis, Instagram posts using #vegan suggest content is aimed primarily at the female vegan community. Finally, sentiment analysis of roughly 5 million Twitter posts mentioning 'vegan' found veganism to be portrayed in a more positive light compared to other topics. Results suggest non-vegans' lack of interest in veganism is driven by non-belief in the health benefits of the diet. △ Less

Submitted 29 July, 2019; originally announced July 2019.

arXiv:1907.03920 [pdf, other]

doi 10.1371/journal.pone.0232938

Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings

Authors: Tyler J. Gray, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Stretched words like `heellllp' or `heyyyyy' are a regular feature of spoken language, often used to emphasize or exaggerate the underlying meaning of the root word. While stretched words are rarely found in formal written language and dictionaries, they are prevalent within social media. In this paper, we examine the frequency distributions of `stretchable words' found in roughly 100 billion twee… ▽ More Stretched words like `heellllp' or `heyyyyy' are a regular feature of spoken language, often used to emphasize or exaggerate the underlying meaning of the root word. While stretched words are rarely found in formal written language and dictionaries, they are prevalent within social media. In this paper, we examine the frequency distributions of `stretchable words' found in roughly 100 billion tweets authored over an 8 year period. We introduce two central parameters, `balance' and `stretch', that capture their main characteristics, and explore their dynamics by creating visual tools we call `balance plots' and `spelling trees'. We discuss how the tools and methods we develop here could be used to study the statistical patterns of mistypings and misspellings, along with the potential applications in augmenting dictionaries, improving language processing, and in any area where sequence construction matters, such as genetics. △ Less

Submitted 8 July, 2019; originally announced July 2019.

Comments: 18 pages, 18 figures, and 9 tables. Online appendices at http://compstorylab.org/stretchablewords/

arXiv:1906.11710 [pdf, other]

The shocklet transform: A decomposition method for the identification of local, mechanism-driven dynamics in sociotechnical time series

Authors: David Rushing Dewhurst, Thayer Alshaabi, Dilan Kiley, Michael V. Arnold, Joshua R. Minot, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series---termed the Discrete Shocklet Transform (DST)---and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively-similar anomalous be… ▽ More We introduce a qualitative, shape-based, timescale-independent time-domain transform used to extract local dynamics from sociotechnical time series---termed the Discrete Shocklet Transform (DST)---and an associated similarity search routine, the Shocklet Transform And Ranking (STAR) algorithm, that indicates time windows during which panels of time series display qualitatively-similar anomalous behavior. After distinguishing our algorithms from other methods used in anomaly detection and time series similarity search, such as the matrix profile, seasonal-hybrid ESD, and discrete wavelet transform-based procedures, we demonstrate the DST's ability to identify mechanism-driven dynamics at a wide range of timescales and its relative insensitivity to functional parameterization. As an application, we analyze a sociotechnical data source (usage frequencies for a subset of words on Twitter) and highlight our algorithms' utility by using them to extract both a typology of mechanistic local dynamics and a data-driven narrative of socially-important events as perceived by English-language Twitter. △ Less

Submitted 18 December, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

Comments: 29 pages (20 body, 9 appendix), 20 figures (13 body, 7 appendix), three online appendices available at http://compstorylab.org/shocklets/ (two displaying interactive visualizations and one containing over 10,000 figures), open-source implementation of STAR algorithm and discrete shocklet transform available at https://gitlab.com/compstorylab/discrete-shocklet-transform

arXiv:1807.07982 [pdf]

doi 10.1002/pan3.10045

Visitors to urban greenspace have higher sentiment and lower negativity on Twitter

Authors: Aaron J. Schwartz, Peter Sheridan Dodds, Jarlath P. M. O'Neil-Dunne, Christopher M. Danforth, Taylor H. Ricketts

Abstract: With more people living in cities, we are witnessing a decline in exposure to nature. A growing body of research has demonstrated an association between nature contact and improved mood. Here, we used Twitter and the Hedonometer, a world analysis tool, to investigate how sentiment, or the estimated happiness of the words people write, varied before, during, and after visits to San Francisco's urba… ▽ More With more people living in cities, we are witnessing a decline in exposure to nature. A growing body of research has demonstrated an association between nature contact and improved mood. Here, we used Twitter and the Hedonometer, a world analysis tool, to investigate how sentiment, or the estimated happiness of the words people write, varied before, during, and after visits to San Francisco's urban park system. We found that sentiment was substantially higher during park visits and remained elevated for several hours following the visit. Leveraging differences in vegetative cover across park types, we explored how different types of outdoor public spaces may contribute to subjective well-being. Tweets during visits to Regional Parks, which are greener and have greater vegetative cover, exhibited larger increases in sentiment than tweets during visits to Civic Plazas and Squares. Finally, we analyzed word frequencies to explore several mechanisms theorized to link nature exposure with mental and cognitive benefits. Negation words such as 'no', 'not', and 'don't' decreased in frequency during visits to urban parks. These results can be used by urban planners and public health officials to better target nature contact recommendations for growing urban populations. △ Less

Submitted 27 August, 2019; v1 submitted 20 July, 2018; originally announced July 2018.

Comments: 18 pages, 5 figures

Journal ref: People Nat. 2019; 00: 1- 10

arXiv:1806.07451 [pdf, other]

doi 10.1371/journal.pone.0210484

Social media usage patterns during natural hazards

Authors: Meredith T. Niles, Benjamin F. Emery, Andrew J. Reagan, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: Natural hazards are becoming increasingly expensive as climate change and development are exposing communities to greater risks. Preparation and recovery are critical for climate change resilience, and social media are being used more and more to communicate before, during, and after disasters. While there is a growing body of research aimed at understanding how people use social media surrounding… ▽ More Natural hazards are becoming increasingly expensive as climate change and development are exposing communities to greater risks. Preparation and recovery are critical for climate change resilience, and social media are being used more and more to communicate before, during, and after disasters. While there is a growing body of research aimed at understanding how people use social media surrounding disaster events, most existing work has focused on a single disaster case study. In the present study, we analyze five of the costliest disasters in the last decade in the United States (Hurricanes Irene and Sandy, two sets of tornado outbreaks, and flooding in Louisiana) through the lens of Twitter. In particular, we explore the frequency of both generic and specific food-security related terms, and quantify the relationship between network size and Twitter activity during disasters. We find differences in tweet volume for keywords depending on disaster type, with people using Twitter more frequently in preparation for Hurricanes, and for real-time or recovery information for tornado and flooding events. Further, we find that people share a host of general disaster and specific preparation and recovery terms during these events. Finally, we find that among all account types, individuals with "average" sized networks are most likely to share information during these disasters, and in most cases, do so more frequently than normal. This suggests that around disasters, an ideal form of social contagion is being engaged in which average people rather than outsized influentials are key to communication. These results provide important context for the type of disaster information and target audiences that may be most useful for disaster communication during varying extreme events. △ Less

Submitted 24 October, 2018; v1 submitted 19 June, 2018; originally announced June 2018.

arXiv:1805.09959 [pdf, other]

A Sentiment Analysis of Breast Cancer Treatment Experiences and Healthcare Perceptions Across Twitter

Authors: Eric M. Clark, Ted James, Chris A. Jones, Amulya Alapati, Promise Ukandu, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Background: Social media has the capacity to afford the healthcare industry with valuable feedback from patients who reveal and express their medical decision-making process, as well as self-reported quality of life indicators both during and post treatment. In prior work, [Crannell et. al.], we have studied an active cancer patient population on Twitter and compiled a set of tweets describing the… ▽ More Background: Social media has the capacity to afford the healthcare industry with valuable feedback from patients who reveal and express their medical decision-making process, as well as self-reported quality of life indicators both during and post treatment. In prior work, [Crannell et. al.], we have studied an active cancer patient population on Twitter and compiled a set of tweets describing their experience with this disease. We refer to these online public testimonies as "Invisible Patient Reported Outcomes" (iPROs), because they carry relevant indicators, yet are difficult to capture by conventional means of self-report. Methods: Our present study aims to identify tweets related to the patient experience as an additional informative tool for monitoring public health. Using Twitter's public streaming API, we compiled over 5.3 million "breast cancer" related tweets spanning September 2016 until mid December 2017. We combined supervised machine learning methods with natural language processing to sift tweets relevant to breast cancer patient experiences. We analyzed a sample of 845 breast cancer patient and survivor accounts, responsible for over 48,000 posts. We investigated tweet content with a hedonometric sentiment analysis to quantitatively extract emotionally charged topics. Results: We found that positive experiences were shared regarding patient treatment, raising support, and spreading awareness. Further discussions related to healthcare were prevalent and largely negative focusing on fear of political legislation that could result in loss of coverage. Conclusions: Social media can provide a positive outlet for patients to discuss their needs and concerns regarding their healthcare coverage and treatment needs. Capturing iPROs from online communication can help inform healthcare professionals and lead to more connected and personalized treatment regimens. △ Less

Submitted 12 October, 2018; v1 submitted 24 May, 2018; originally announced May 2018.

arXiv:1803.09745 [pdf, other]

doi 10.1371/journal.pone.0209651

English verb regularization in books and tweets

Authors: Tyler J. Gray, Andrew J. Reagan, Peter Sheridan Dodds, Christopher M. Danforth

Abstract: The English language has evolved dramatically throughout its lifespan, to the extent that a modern speaker of Old English would be incomprehensible without translation. One concrete indicator of this process is the movement from irregular to regular (-ed) forms for the past tense of verbs. In this study we quantify the extent of verb regularization using two vastly disparate datasets: (1) Six year… ▽ More The English language has evolved dramatically throughout its lifespan, to the extent that a modern speaker of Old English would be incomprehensible without translation. One concrete indicator of this process is the movement from irregular to regular (-ed) forms for the past tense of verbs. In this study we quantify the extent of verb regularization using two vastly disparate datasets: (1) Six years of published books scanned by Google (2003--2008), and (2) A decade of social media messages posted to Twitter (2008--2017). We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books. Regularization is also greater for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables such as education or income. △ Less

Submitted 3 January, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

Comments: 16 pages, 10 figures, and 4 tables. Online appendices at https://www.uvm.edu/storylab/share/papers/gray2018a/ ; Updated to journal version with minor differences from first version

Journal ref: PLOS ONE 13(12): e0209651, 2018

arXiv:1710.07580 [pdf, other]

doi 10.1103/PhysRevE.97.062317

Continuum rich-get-richer processes: Mean field analysis with an application to firm size

Authors: David Rushing Dewhurst, Christopher M. Danforth, Peter Sheridan Dodds

Abstract: Classical rich-get-richer models have found much success in being able to broadly reproduce the statistics and dynamics of diverse real complex systems. These rich-get-richer models are based on classical urn models and unfold step-by-step in discrete time. Here, we consider a natural variation acting on a temporal continuum in the form of a partial differential equation (PDE). We first show that… ▽ More Classical rich-get-richer models have found much success in being able to broadly reproduce the statistics and dynamics of diverse real complex systems. These rich-get-richer models are based on classical urn models and unfold step-by-step in discrete time. Here, we consider a natural variation acting on a temporal continuum in the form of a partial differential equation (PDE). We first show that the continuum version of Herbert Simon's canonical preferential attachment model exhibits an identical size distribution. In relaxing Simon's assumption of a linear growth mechanism, we consider the case of an arbitrary growth kernel and find the general solution to the resultant PDE. We then extend the PDE to multiple spatial dimensions, again determining the general solution. Finally, we apply the model to size and wealth distributions of firms. We obtain power law scaling for both to be concordant with simulations as well as observational data. △ Less

Submitted 24 March, 2018; v1 submitted 20 October, 2017; originally announced October 2017.

Comments: 7 pages, 4 figures

Journal ref: Phys. Rev. E 97, 062317 (2018)

arXiv:1708.09697 [pdf, other]

Slightly generalized Generalized Contagion: Unifying simple models of biological and social spreading

Authors: Peter Sheridan Dodds

Abstract: We motivate and explore the basic features of generalized contagion, a model mechanism that unifies fundamental models of biological and social contagion. Generalized contagion builds on the elementary observation that spreading and contagion of all kinds involve some form of system memory. We discuss the three main classes of systems that generalized contagion affords, resembling: simple biologic… ▽ More We motivate and explore the basic features of generalized contagion, a model mechanism that unifies fundamental models of biological and social contagion. Generalized contagion builds on the elementary observation that spreading and contagion of all kinds involve some form of system memory. We discuss the three main classes of systems that generalized contagion affords, resembling: simple biological contagion; critical mass contagion of social phenomena; and an intermediate, and explosive, vanishing critical mass contagion. We also present a simple explanation of the global spreading condition in the context of a small seed of infected individuals. △ Less

Submitted 31 August, 2017; originally announced August 2017.

Comments: 8 pages, 5 figures; chapter to appear in "Spreading Dynamics in Social Systems"; Eds. Sune Lehmann and Yong-Yeol Ahn, Springer Nature

Showing 1–50 of 133 results for author: Dodds, P