-
Deep Lexical Hypothesis: Identifying personality structure in natural language
Authors:
Andrew Cutler,
David M. Condon
Abstract:
Recent advances in natural language processing (NLP) have produced general models that can perform complex tasks such as summarizing long passages and translating across languages. Here, we introduce a method to extract adjective similarities from language models as done with survey-based ratings in traditional psycholexical studies but using millions of times more text in a natural setting. The correlational structure produced through this method is highly similar to that of self- and other-ratings of 435 terms reported by Saucier and Goldberg (1996a). The first three unrotated factors produced using NLP are congruent with those in survey data, with coefficients of 0.89, 0.79, and 0.79. This structure is robust to many modeling decisions: adjective set, including those with 1,710 terms (Goldberg, 1982) and 18,000 terms (Allport & Odbert, 1936); the query used to extract correlations; and language model. Notably, Neuroticism and Openness are only weakly and inconsistently recovered. This is a new source of signal that is closer to the original (semantic) vision of the Lexical Hypothesis. The method can be applied where surveys cannot: in dozens of languages simultaneously, with tens of thousands of items, on historical text, and at extremely large scale for little cost. The code is made public to facilitate reproduction and fast iteration in new directions of research.
Submitted 3 March, 2022;
originally announced March 2022.
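The factor congruence figures quoted in the abstract above (0.89, 0.79, 0.79) can be illustrated with a small sketch. It assumes the coefficients are Tucker congruence coefficients, which the abstract does not state explicitly; the loading values and variable names below are hypothetical and are not taken from the paper.

```python
import math


def congruence(x, y):
    """Tucker's congruence coefficient between two factor-loading vectors.

    This is the cosine of the angle between the vectors; unlike a Pearson
    correlation, the loadings are not mean-centered first.
    """
    if len(x) != len(y):
        raise ValueError("loading vectors must have the same length")
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if denominator == 0:
        raise ValueError("a loading vector is all zeros")
    return numerator / denominator


# Hypothetical loadings of four adjectives on one factor, from two sources.
survey_loadings = [0.7, 0.6, -0.5, 0.1]
nlp_loadings = [0.6, 0.7, -0.4, 0.3]

phi = congruence(survey_loadings, nlp_loadings)
print(f"congruence = {phi:.2f}")
```

A coefficient near 1 means the two sources rank and sign the adjectives on that factor in nearly the same way; a value near 0 means the loadings are unrelated.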
-
Geometry- and Accuracy-Preserving Random Forest Proximities
Authors:
Jake S. Rhodes,
Adele Cutler,
Kevin R. Moon
Abstract:
Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest and measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
Submitted 28 February, 2023; v1 submitted 29 January, 2022;
originally announced January 2022.
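The property claimed in the abstract above, that a proximity-weighted sum reproduces the out-of-bag prediction in regression, can be checked on toy data. The abstract does not give the RF-GAP formula, so the weighting below is only one construction that satisfies the stated property and should not be read as the paper's exact definition. The inputs `leaves` (leaf index per tree and observation) and `inbag` (bootstrap multiplicity per tree and observation) are hypothetical hand-made values, not the output of a fitted forest.

```python
def gap_style_proximities(leaves, inbag):
    """Proximity of i to j: averaged over trees where i is out-of-bag,
    the share of in-bag weight in i's leaf that belongs to j."""
    n_trees, n = len(leaves), len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        used = 0
        for t in range(n_trees):
            if inbag[t][i] > 0:
                continue  # only trees that never saw observation i
            mates = {j: inbag[t][j] for j in range(n)
                     if inbag[t][j] > 0 and leaves[t][j] == leaves[t][i]}
            total = sum(mates.values())
            if total == 0:
                continue  # guard for toy data; real leaves hold in-bag points
            used += 1
            for j, count in mates.items():
                prox[i][j] += count / total
        if used:
            prox[i] = [p / used for p in prox[i]]
    return prox


def oob_predictions(leaves, inbag, y):
    """Out-of-bag regression prediction: mean of in-bag leaf means."""
    n_trees, n = len(leaves), len(y)
    preds = []
    for i in range(n):
        tree_preds = []
        for t in range(n_trees):
            if inbag[t][i] > 0:
                continue
            same = [j for j in range(n) if leaves[t][j] == leaves[t][i]]
            weight = sum(inbag[t][j] for j in same)
            if weight:
                tree_preds.append(sum(inbag[t][j] * y[j] for j in same) / weight)
        preds.append(sum(tree_preds) / len(tree_preds) if tree_preds else None)
    return preds


# Hypothetical forest of 5 trees on 4 observations.
leaves = [[0, 0, 1, 1], [0, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
inbag = [[0, 2, 1, 1], [1, 0, 2, 1], [2, 1, 0, 1], [1, 2, 1, 0], [0, 1, 3, 0]]
y = [1.0, 2.0, 4.0, 8.0]

prox = gap_style_proximities(leaves, inbag)
oob = oob_predictions(leaves, inbag, y)
for i, row in enumerate(prox):
    weighted = sum(p * v for p, v in zip(row, y))
    print(i, round(weighted, 6), round(oob[i], 6))
```

Each printed row shows the proximity-weighted sum next to the out-of-bag prediction for the same observation; in this construction the two agree because every row of the proximity matrix is an average of the in-bag weights that produce the leaf means.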
-
Tree-based Regression for Interval-valued Data
Authors:
Chih-Ching Yeh,
Yan Sun,
Adele Cutler
Abstract:
Regression methods for interval-valued data have been increasingly studied in recent years. Most existing work focuses on linear models, but many practical problems are nonlinear in nature, so the development of nonlinear regression tools for interval-valued data is crucial. In this paper, we propose a tree-based regression method for interval-valued data that is applicable to both linear and nonlinear problems. Unlike linear regression models, which usually require additional constraints to ensure positivity of the predicted interval length, the proposed method estimates the regression function in a nonparametric way, so the predicted length is naturally positive without any constraints. A simulation study compares our method to popular existing regression models for interval-valued data under both linear and nonlinear settings. Furthermore, a real data example is presented in which we apply our method to analyze price range data of the Dow Jones Industrial Average index and its component stocks.
Submitted 9 January, 2022;
originally announced January 2022.
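The remark in the abstract above, that a tree predicts a valid interval length without extra constraints, can be illustrated with a one-split regression tree. This is a minimal sketch under stated assumptions, not the authors' method: intervals are represented by center and half-length, the split criterion is the summed squared error over both, and each leaf predicts the mean of each. Because a leaf's half-length is an average of non-negative training half-lengths, it cannot be negative. All data and function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _sse(rows):
    """Summed squared error of (center, half_length) rows around their means."""
    total = 0.0
    for k in range(2):
        column = [row[k] for row in rows]
        m = _mean(column)
        total += sum((v - m) ** 2 for v in column)
    return total


def fit_stump(x, intervals):
    """Fit a single-split tree to interval targets given as (lower, upper)."""
    rows = [((lo + hi) / 2, (hi - lo) / 2) for lo, hi in intervals]
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        a, b = x[order[k - 1]], x[order[k]]
        if a == b:
            continue  # cannot split between identical predictor values
        left = [rows[i] for i in order[:k]]
        right = [rows[i] for i in order[k:]]
        cost = _sse(left) + _sse(right)
        if best is None or cost < best[0]:
            best = (cost, (a + b) / 2, left, right)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    _, threshold, left, right = best

    def leaf(rs):
        # Leaf prediction: mean center and mean half-length of its rows.
        return _mean([r[0] for r in rs]), _mean([r[1] for r in rs])

    return threshold, leaf(left), leaf(right)


def predict(stump, x_new):
    """Return the predicted (lower, upper) interval for one predictor value."""
    threshold, left, right = stump
    center, half = left if x_new <= threshold else right
    return center - half, center + half


# Hypothetical data: one predictor and interval-valued responses.
x = [1, 2, 3, 10, 11, 12]
intervals = [(0, 1), (0, 2), (1, 2), (10, 14), (9, 15), (11, 13)]

stump = fit_stump(x, intervals)
print(stump[0], predict(stump, 2), predict(stump, 11))
```

A linear model fitted to the half-lengths could extrapolate to a negative value, which is why such models typically need a positivity constraint; the leaf averages used here stay within the range of the observed half-lengths.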
-
Adaptive Modeling Powers Fast Multi-parameter Fitting of CARS Spectra
Authors:
Gregory J. Hunt,
Cody R. Ground,
Andrew D. Cutler
Abstract:
Coherent anti-Stokes Raman Spectroscopy (CARS) is a laser-based measurement technique widely applied across many science and engineering disciplines to perform non-intrusive gas diagnostics. CARS is often used to study combustion, where the measured spectra can be used to simultaneously recover multiple flow parameters from the reacting gas such as temperature and relative species mole fractions. This is typically done by using numerical optimization to find the flow parameters for which a theoretical model of the CARS spectra best matches the actual measurements. The most commonly used theoretical model is the CARSFT spectrum calculator. Unfortunately, CARSFT is computationally expensive, and using it to recover multiple flow parameters can be prohibitively time-consuming, especially when experiments have hundreds or thousands of measurements distributed over time or space. To overcome these issues, several methods have been developed to approximate CARSFT using a library of pre-computed theoretical spectra. In this work we present a new approach that leverages ideas from the machine learning literature to build an adaptively smoothed kernel-based approximator. In application to a simulated dual-pump CARS experiment probing an $H_2$/air flame, we show that the approach can use a small number of library spectra to quickly and accurately recover temperature and four gas species' mole fractions. The method's flexibility allows fine-tuned navigation of the trade-off between speed and accuracy, and makes the approach suitable for a wide range of problems and flow regimes.
Submitted 26 October, 2021;
originally announced November 2021.
-
Supervised Visualization for Data Exploration
Authors:
Jake S. Rhodes,
Adele Cutler,
Guy Wolf,
Kevin R. Moon
Abstract:
Dimensionality reduction is often used as an initial step in data exploration, either as preprocessing for classification or regression or for visualization. Most dimensionality reduction techniques to date are unsupervised; they do not take class labels into account (e.g., PCA, MDS, t-SNE, Isomap). Such methods require large amounts of data and are often sensitive to noise that may obfuscate important patterns in the data. Various attempts at supervised dimensionality reduction methods that take into account auxiliary annotations (e.g., class labels) have been successfully implemented with goals of increased classification accuracy or improved data visualization. Many of these supervised techniques incorporate labels in the loss function in the form of similarity or dissimilarity matrices, thereby creating over-emphasized separation between class clusters, which does not realistically represent the local and global relationships in the data. In addition, these approaches are often sensitive to parameter tuning, which may be difficult to configure without an explicit quantitative notion of visual superiority. In this paper, we describe a novel supervised visualization technique based on random forest proximities and diffusion-based dimensionality reduction. We show, both qualitatively and quantitatively, the advantages of our approach in retaining local and global structures in data, while emphasizing important variables in the low-dimensional embedding. Importantly, our approach is robust to noise and parameter tuning, thus making it simple to use while producing reliable visualizations for data exploration.
Submitted 15 June, 2020;
originally announced June 2020.
-
Inferring Human Traits From Facebook Statuses
Authors:
Andrew Cutler,
Brian Kulis
Abstract:
This paper explores the use of language models to predict 20 human traits from users' Facebook status updates. The data was collected by the myPersonality project, and includes user statuses along with their personality, gender, political identification, religion, race, satisfaction with life, IQ, self-disclosure, fair-mindedness, and belief in astrology. A single interpretable model matches state-of-the-art results for well-studied tasks such as predicting gender and personality, and sets the standard on other traits such as IQ, sensational interests, political identity, and satisfaction with life. Additionally, highly weighted words are published for each trait. These lists are valuable for creating hypotheses about human behavior, as well as for understanding what information a model is extracting. Using performance and extracted features, we analyze models built on social media. The real-world problems we explore include gendered classification bias and Cambridge Analytica's use of psychographic models.
Submitted 25 July, 2018; v1 submitted 22 May, 2018;
originally announced May 2018.
-
Icosahedral Skeletal Polyhedra Realizing Petrie Relatives of Gordan's Regular Map
Authors:
Anthony M. Cutler,
Egon Schulte,
Jorg M. Wills
Abstract:
Every regular map on a closed surface gives rise to generally six regular maps, its "Petrie relatives", that are obtained through iteration of the duality and Petrie operations (taking duals and Petrie-duals). It is shown that the skeletal polyhedra in Euclidean 3-space that realize a Petrie relative of the classical Gordan regular map and have full icosahedral symmetry comprise precisely four infinite families of polyhedra, as well as four individual polyhedra.
Submitted 7 October, 2012;
originally announced October 2012.
-
Remembering Leo Breiman
Authors:
Adele Cutler
Abstract:
Leo Breiman was a highly creative, influential researcher with a down-to-earth personal style and an insistence on working on important real world problems and producing useful solutions. This paper is a short review of Breiman's extensive contributions to the field of applied statistics.
Submitted 5 January, 2011;
originally announced January 2011.
-
Regular Polyhedra of Index Two, II
Authors:
Anthony M. Cutler
Abstract:
A polyhedron in Euclidean 3-space is called a regular polyhedron of index 2 if it is combinatorially regular and its geometric symmetry group has index 2 in its combinatorial automorphism group; thus its automorphism group is flag-transitive but its symmetry group has two flag orbits. The present paper completes the classification of finite regular polyhedra of index 2 in 3-space. In particular, this paper enumerates the regular polyhedra of index 2 with vertices on one orbit under the symmetry group. There are ten such polyhedra.
Submitted 12 November, 2010;
originally announced November 2010.
-
Regular Polyhedra of Index Two, I
Authors:
Anthony M. Cutler,
Egon Schulte
Abstract:
A polyhedron in Euclidean 3-space is called a regular polyhedron of index 2 if it is combinatorially regular but "fails geometric regularity by a factor of 2"; its combinatorial automorphism group is flag-transitive but its geometric symmetry group has two flag orbits. The present paper, and its successor by the first author, describe a complete classification of regular polyhedra of index 2 in 3-space. In particular, the present paper enumerates the regular polyhedra of index 2 with vertices on two orbits under the symmetry group. The subsequent paper will enumerate the regular polyhedra of index 2 with vertices on one orbit under the symmetry group.
Submitted 26 May, 2010;
originally announced May 2010.