Preprint, Article, Version 2. Preserved in Portico. This version is not peer-reviewed.
What Do We Learn from Word Associations? Evaluating Machine Learning Algorithms for the Extraction of Contextual Word Meaning in Natural Language Processing
Version 1: Received: 4 May 2018 / Approved: 7 May 2018 / Online: 7 May 2018 (06:25:55 CEST)
Version 2: Received: 9 May 2018 / Approved: 10 May 2018 / Online: 10 May 2018 (05:56:56 CEST)
How to cite:
Kapetanios, E.; Alshahrani, S.; Angelopoulou, A.; Baldwin, M. What Do We Learn from Word Associations? Evaluating Machine Learning Algorithms for the Extraction of Contextual Word Meaning in Natural Language Processing. Preprints 2018, 2018050102. https://doi.org/10.20944/preprints201805.0102.v2
APA Style
Kapetanios, E., Alshahrani, S., Angelopoulou, A., & Baldwin, M. (2018). What Do We Learn from Word Associations? Evaluating Machine Learning Algorithms for the Extraction of Contextual Word Meaning in Natural Language Processing. Preprints. https://doi.org/10.20944/preprints201805.0102.v2
Chicago/Turabian Style
Kapetanios, E., S. Alshahrani, Anastasia Angelopoulou, and Mark Baldwin. 2018. "What Do We Learn from Word Associations? Evaluating Machine Learning Algorithms for the Extraction of Contextual Word Meaning in Natural Language Processing." Preprints. https://doi.org/10.20944/preprints201805.0102.v2
Abstract
“You shall know a word by the company it keeps!” is one of the most famous slogans attributed to John Rupert Firth (1957). It ignited a whole school of linguistic research known as British empiricist contextualism. Sixty years later, many unsupervised and semi-supervised machine learning algorithms have been successfully designed and implemented to extract word meaning from the context of a text corpus. These algorithms treat words, more or less, as vectors of real numbers representing frequencies of word occurrences within a context, and word meaning as the positions of words in a high-dimensional vector space model. Word associations, in turn, are treated as calculated distances among them. With the rise of Deep Learning (DL) and other architectures based on artificial neural networks, learning the positioning of words and extracting word associations as measured by their distances has further improved. In this paper, however, we revisit the mainstream algorithmic approaches and set the stage for a partly cross-disciplinary evaluation framework to judge the nature of the word associations extracted by state-of-the-art machine learning algorithms. Our preliminary results are based on word associations extracted by applying a DL framework to a Google News text corpus, as well as on comparisons with human-created word association lists such as word collocation dictionaries and psycholinguistic experiments. The results and conclusions provide some insights into the inherent limitations in interpreting the type of word associations and the underlying relations between words, with inevitable consequences in other areas, such as extraction of knowledge graphs or image understanding.
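The idea of treating words as frequency vectors and word associations as distances can be illustrated with a minimal sketch. It is not the authors' method: the toy corpus, the window size, and the helper names `cooccurrence_vectors` and `cosine_similarity` are all assumptions made for illustration, and raw co-occurrence counts stand in for the learned embeddings the abstract refers to.

```python
import math
from collections import Counter


def cooccurrence_vectors(sentences, window=2):
    """Build one sparse count vector per word (hypothetical helper).

    Each vector counts how often other words appear within `window`
    tokens to the left or right of the target word.
    """
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for this example only.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "stocks fell on the market",
]

vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_market = cosine_similarity(vectors["cat"], vectors["market"])

# Words that share more context end up closer together.
print(sim_cat_dog > sim_cat_market)
```

In this sketch the score only says that two words keep similar company. It does not say what kind of relation holds between them, which is the interpretive limitation the abstract points to.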
Keywords
machine learning; algorithms; natural language processing; deep learning; vector space models; semantic similarity; distributional semantics; latent semantic analysis; word2vec
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.