Udo Hahn


2024

Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing
Chung-Chi Chen | Xiaomo Liu | Udo Hahn | Armineh Nourbakhsh | Zhiqiang Ma | Charese Smiley | Veronique Hoste | Sanjiv Ranjan Das | Manling Li | Mohammad Ghassemi | Hen-Hsen Huang | Hiroya Takamura | Hsin-Hsi Chen
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing

2023

DOPA METER – A Tool Suite for Metrical Document Profiling and Aggregation
Christina Lohr | Udo Hahn
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present DOPA METER, a tool suite for the metrical investigation of written language that provides diagnostic means for its division into discourse categories, such as registers, genres, and style. The quantitative basis of our system is a set of 120 metrics covering a wide range of lexical, syntactic, and semantic features relevant for language profiling. The scores can be summarized, compared, and aggregated using visualization tools that can be tailored according to the users’ needs. We also showcase an application scenario for DOPA METER.

2022

“Beste Grüße, Maria Meyer” — Pseudonymization of Privacy-Sensitive Information in Emails
Elisabeth Eder | Michael Wiegand | Ulrike Krieg-Holz | Udo Hahn
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The exploding amount of user-generated content has spurred NLP research to deal with documents from various digital communication formats (tweets, chats, emails, etc.). Using these texts as language resources implies complying with legal data privacy regulations. To protect the personal data of individuals and preclude their identification, we employ pseudonymization. More precisely, we identify those text spans that carry information revealing an individual’s identity (e.g., names of persons, locations, phone numbers, or dates) and subsequently substitute them with synthetically generated surrogates. Based on CodE Alltag, a German-language email corpus, we address two tasks. The first task is to evaluate various architectures for the automatic recognition of privacy-sensitive entities in raw data. The second task examines the applicability of pseudonymized data as training data for such systems since models learned on original data cannot be published for reasons of privacy protection. As outputs of both tasks, we, first, generate a new pseudonymized version of CodE Alltag compliant with the legal requirements of the General Data Protection Regulation (GDPR). Second, we make accessible a tagger for recognizing privacy-sensitive information in German emails and similar text genres, which is trained on already pseudonymized data.

GGPONC 2.0 - The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers
Florian Borchert | Christina Lohr | Luise Modersohn | Jonas Witt | Thomas Langer | Markus Follmann | Matthias Gietzelt | Bert Arnrich | Udo Hahn | Matthieu-P. Schapranow
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Despite remarkable advances in the development of language resources over the recent years, there is still a shortage of annotated, publicly available corpora covering (German) medical language. With the initial release of the German Guideline Program in Oncology NLP Corpus (GGPONC), we have demonstrated how such corpora can be built upon clinical guidelines, a widely available resource in many natural languages with a reasonable coverage of medical terminology. In this work, we describe a major new release for GGPONC. The corpus has been substantially extended in size and re-annotated with a new annotation scheme based on SNOMED CT top level hierarchies, reaching high inter-annotator agreement (γ=.94). Moreover, we annotated elliptical coordinated noun phrases and their resolutions, a common language phenomenon in (not only German) scientific documents. We also trained BERT-based named entity recognition models on this new data set, which achieve high performance on short, coarse-grained entity spans (F1=.89), while the rate of boundary errors increases for long entity spans. GGPONC is freely available through a data use agreement. The trained named entity recognition models, as well as the detailed annotation guide, are also made publicly available.

2021

Proceedings of the Third Workshop on Economics and Natural Language Processing
Udo Hahn | Veronique Hoste | Amanda Stent
Proceedings of the Third Workshop on Economics and Natural Language Processing

Acquiring a Formality-Informed Lexical Resource for Style Analysis
Elisabeth Eder | Ulrike Krieg-Holz | Udo Hahn
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

To track different levels of formality in written discourse, we introduce a novel type of lexicon for the German language, with entries ordered by their degree of (in)formality. We start with a set of words extracted from traditional lexicographic resources, extend it by sentence-based similarity computations, and let crowdworkers assess the enlarged set of lexical items on a continuous informal-formal scale as a gold standard for evaluation. We submit this lexicon to an intrinsic evaluation related to the best regression models and their effect on predicting formality scores and complement our investigation by an extrinsic evaluation of formality on a German-language email corpus.

Towards Label-Agnostic Emotion Embeddings
Sven Buechel | Luise Modersohn | Udo Hahn
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Research in emotion analysis is scattered across different label formats (e.g., polarity types, basic emotion categories, and affective dimensions), linguistic levels (word vs. sentence vs. discourse), and, of course, (few well-resourced but much more under-resourced) natural languages and text genres (e.g., product reviews, tweets, news). The resulting heterogeneity makes data and software developed under these conflicting constraints hard to compare and challenging to integrate. To resolve this unsatisfactory state of affairs we here propose a training scheme that learns a shared latent representation of emotion independent from different label formats, natural languages, and even disparate model architectures. Experiments on a wide range of datasets indicate that this approach yields the desired interoperability without penalizing prediction quality. Code and data are archived under DOI 10.5281/zenodo.5466068.

2020

GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines
Florian Borchert | Christina Lohr | Luise Modersohn | Thomas Langer | Markus Follmann | Jan Philipp Sachs | Udo Hahn | Matthieu-P. Schapranow
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

The lack of publicly accessible text corpora is a major obstacle for progress in natural language processing. For medical applications, unfortunately, all language communities other than English are low-resourced. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions. Moreover, GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield and provides a variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to draw comparisons for the use of medical language to other corpora, medical and non-medical ones.

Allgemeine Musikalische Zeitung as a Searchable Online Corpus
Bernd Kampe | Tinghui Duan | Udo Hahn
Proceedings of the Twelfth Language Resources and Evaluation Conference

The massive digitization efforts related to historical newspapers over the past decades have focused on mass media sources and ordinary people as their primary recipients. Much less attention has been paid to newspapers published for a more specialized audience, e.g., those aiming at scholarly or cultural exchange within intellectual communities much narrower in scope, such as newspapers devoted to music criticism, arts or philosophy. Only a few of these specialized newspapers have been digitized so far, and they are usually not well curated in terms of digitization quality, data formatting, completeness, redundancy (de-duplication), supply of metadata, and, hence, searchability. This paper describes our approach to eliminate these drawbacks for a major German-language newspaper resource of the Romantic Age, the Allgemeine Musikalische Zeitung (General Music Gazette). We here focus on a workflow that copes with a posteriori digitization problems, inconsistent OCRing and index building for searchability. In addition, we provide a user-friendly graphic interface to empower content-centric access to this (and other) digital resource(s), adopting open-source software for the purpose of Web presentation.

CodE Alltag 2.0 — A Pseudonymized German-Language Email Corpus
Elisabeth Eder | Ulrike Krieg-Holz | Udo Hahn
Proceedings of the Twelfth Language Resources and Evaluation Conference

The vast amount of social communication distributed over various electronic media channels (tweets, blogs, emails, etc.), so-called user-generated content (UGC), creates entirely new opportunities for today’s NLP research. Yet, data privacy concerns implied by the unauthorized use of these text streams as a data resource are often neglected. In an attempt to reconcile the diverging needs of unconstrained raw data use and preservation of data privacy in digital communication, we here investigate the automatic recognition of privacy-sensitive stretches of text in UGC and provide an algorithmic solution for the protection of personal data via pseudonymization. Our focus is directed at the de-identification of emails, where personally identifying information refers not only to the sender but also to those people, locations, dates, and other identifiers mentioned in greetings, boilerplates and the content-carrying body of emails. We evaluate several de-identification procedures and systems on two hitherto non-anonymized German-language email corpora (CodE AlltagS+d and CodE AlltagXL), and generate fully pseudonymized versions for both (CodE Alltag 2.0) in which personally identifying information of all social actors addressed in these mails has been camouflaged (to the greatest extent possible).

ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus
Erik Faessler | Luise Modersohn | Christina Lohr | Udo Hahn
Proceedings of the Twelfth Language Resources and Evaluation Conference

Genes and proteins constitute the fundamental entities of molecular genetics. We here introduce ProGene (formerly called FSU-PRGE), a corpus that reflects our efforts to cope with this important class of named entities within the framework of a long-lasting large-scale annotation campaign at the Jena University Language & Information Engineering (JULIE) Lab. We assembled the entire corpus from 11 subcorpora covering various biological domains to achieve an overall subdomain-independent corpus. It consists of 3,308 MEDLINE abstracts with over 36k sentences and more than 960k tokens annotated with nearly 60k named entity mentions. Two annotators strove to carefully assign entity mentions to classes of genes/proteins as well as families/groups, complexes, variants, and enumerations thereof, where genes and proteins are represented by a single class. The main purpose of the corpus is to provide a large body of consistent and reliable annotations for supervised training and evaluation of machine learning algorithms in this relevant domain. Furthermore, we provide an evaluation of two state-of-the-art baseline systems, BioBERT and flair, on the ProGene corpus. We make the evaluation datasets and the trained models available to encourage comparable evaluations of new methods in the future.

Learning and Evaluating Emotion Lexicons for 91 Languages
Sven Buechel | Susanna Rücker | Udo Hahn
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Emotion lexicons describe the affective meaning of words and thus constitute a centerpiece for advanced sentiment and emotion analysis. Yet, manually curated lexicons are only available for a handful of languages, leaving most languages of the world without such a precious resource for downstream applications. Even worse, their coverage is often limited both in terms of the lexical units they contain and the emotional variables they feature. In order to break this bottleneck, we here introduce a methodology for creating almost arbitrarily large emotion lexicons for any target language. Our approach requires nothing but a source language emotion lexicon, a bilingual word translation model, and a target language embedding model. Fulfilling these requirements for 91 languages, we are able to generate representationally rich high-coverage lexicons comprising eight emotional variables with more than 100k lexical entries each. We evaluated the automatically generated lexicons against human judgment from 26 datasets, spanning 12 typologically diverse languages, and found that our approach produces results in line with state-of-the-art monolingual approaches to lexicon creation and even surpasses human reliability for some languages and variables. Code and data are available at https://github.com/JULIELab/MEmoLon archived under DOI 10.5281/zenodo.3779901.

2019

Proceedings of the Second Workshop on Economics and Natural Language Processing
Udo Hahn | Véronique Hoste | Zhu Zhang
Proceedings of the Second Workshop on Economics and Natural Language Processing

A Time Series Analysis of Emotional Loading in Central Bank Statements
Sven Buechel | Simon Junker | Thore Schlaak | Claus Michelsen | Udo Hahn
Proceedings of the Second Workshop on Economics and Natural Language Processing

We examine the affective content of central bank press statements using emotion analysis. Our focus is on two major international players, the European Central Bank (ECB) and the US Federal Reserve Bank (Fed), covering a time span from 1998 through 2019. We reveal characteristic patterns in the emotional dimensions of valence, arousal, and dominance and find—despite the commonly established attitude that emotional wording in central bank communication should be avoided—a correlation between the state of the economy and particularly the dominance dimension in the press releases under scrutiny and, overall, an impact of the president in office.

The Influence of Down-Sampling Strategies on SVD Word Embedding Stability
Johannes Hellrich | Bernd Kampe | Udo Hahn
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

The stability of word embedding algorithms, i.e., the consistency of the word representations they reveal when trained repeatedly on the same data set, has recently raised concerns. We here compare word embedding algorithms on three corpora of different sizes, and evaluate both their stability and accuracy. We find strong evidence that down-sampling strategies (used as part of their training procedures) are particularly influential for the stability of SVD-PPMI-type embeddings. This finding seems to explain diverging reports on their stability and leads us to a simple modification which provides superior stability as well as accuracy on par with skip-gram embeddings.

Modeling Word Emotion in Historical Language: Quantity Beats Supposed Stability in Seed Word Selection
Johannes Hellrich | Sven Buechel | Udo Hahn
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

To understand historical texts, we must be aware that language—including the emotional connotation attached to words—changes over time. In this paper, we aim at estimating the emotion which is associated with a given word in former language stages of English and German. Emotion is represented following the popular Valence-Arousal-Dominance (VAD) annotation scheme. While being more expressive than polarity alone, existing word emotion induction methods are typically not suited for addressing it. To overcome this limitation, we present adaptations of two popular algorithms to VAD. To measure their effectiveness in diachronic settings, we present the first gold standard for historical word emotions, which was created by scholars with proficiency in the respective language stages and covers both English and German. In contrast to claims in previous work, our findings indicate that hand-selecting small sets of seed words with supposedly stable emotional meaning is actually harmful rather than helpful.

At the Lower End of Language—Exploring the Vulgar and Obscene Side of German
Elisabeth Eder | Ulrike Krieg-Holz | Udo Hahn
Proceedings of the Third Workshop on Abusive Language Online

In this paper, we describe a workflow for the data-driven acquisition and semantic scaling of a lexicon that covers lexical items from the lower end of the German language register—terms typically considered as rough, vulgar or obscene. Since the fine semantic representation of grades of obscenity can only inadequately be captured at the categorical level (e.g., obscene vs. non-obscene, or rough vs. vulgar), our main contribution lies in applying best-worst scaling, a rating methodology that has already been shown to be useful for emotional language, to capture the relative strength of obscenity of lexical items. We describe the empirical foundations for bootstrapping such a low-end lexicon for German by starting from manually supplied lexicographic categorizations of a small seed set of rough and vulgar lexical items and automatically enlarging this set by means of distributional semantics. We then determine the degrees of obscenity for the full set of all acquired lexical items by letting crowdworkers comparatively assess their pejorative grade using best-worst scaling. This semi-automatically enriched lexicon already comprises 3,300 lexical items and incorporates 33,000 vulgarity ratings. Using it as a seed lexicon for fully automatic lexical acquisition, we were able to raise its coverage up to slightly more than 11,000 entries.

Continuous Quality Control and Advanced Text Segment Annotation with WAT-SL 2.0
Christina Lohr | Johannes Kiesel | Stephanie Luther | Johannes Hellrich | Tobias Kolditz | Benno Stein | Udo Hahn
Proceedings of the 13th Linguistic Annotation Workshop

Today’s widely used annotation tools were designed for annotating typically short textual mentions of entities or relations, making their interface cumbersome to use for long(er) stretches of text, e.g., sentences running over several lines in a document. They also lack systematic support for hierarchically structured labels, i.e., one label being conceptually more general than another (e.g., anamnesis in relation to family anamnesis). Moreover, as a more fundamental shortcoming of today’s tools, they provide no continuous quality control mechanisms for the annotation process, an essential feature to intrinsically support iterative cycles in the development of annotation guidelines. We alleviated these problems by developing WAT-SL 2.0, an open-source web-based annotation tool for long-segment labeling, hierarchically structured label sets and built-ins for quality control.

De-Identification of Emails: Pseudonymizing Privacy-Sensitive Data in a German Email Corpus
Elisabeth Eder | Ulrike Krieg-Holz | Udo Hahn
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We deal with the pseudonymization of those stretches of text in emails that might allow individual persons to be identified. This task is decomposed into two steps. First, named entities carrying privacy-sensitive information (e.g., names of persons, locations, phone numbers or dates) are identified, and, second, these privacy-bearing entities are replaced by synthetically generated surrogates (e.g., a person originally named ‘John Doe’ is renamed as ‘Bill Powers’). We describe a system architecture for surrogate generation and evaluate our approach on CodE Alltag, a German email corpus.

2018

Word Emotion Induction for Multiple Languages as a Deep Multi-Task Learning Problem
Sven Buechel | Udo Hahn
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Predicting the emotional value of lexical items is a well-known problem in sentiment analysis. While research focused on polarity for quite a long time, this early focus has meanwhile shifted to more expressive emotion representation models (such as Basic Emotions or Valence-Arousal-Dominance). This change resulted in a proliferation of heterogeneous formats and, in parallel, often small-sized, non-interoperable resources (lexicons and corpus annotations). In particular, the limitations in size hampered the application of deep learning methods in this area because they typically require large amounts of input data. We here present a solution to get around this language data bottleneck by rephrasing word emotion induction as a multi-task learning problem. In this approach, the prediction of each independent emotion dimension is considered as an individual task and hidden layers are shared between these dimensions. We investigate whether multi-task learning is more advantageous than single-task learning for emotion prediction by comparing our model against a wide range of alternative emotion and polarity induction methods featuring 9 typologically diverse languages and a total of 15 conditions. Our model turns out to outperform each one of them. Against all odds, the proposed deep learning approach yields the largest gain on the smallest data sets, merely composed of one thousand samples.

Emotion Representation Mapping for Automatic Lexicon Construction (Mostly) Performs on Human Level
Sven Buechel | Udo Hahn
Proceedings of the 27th International Conference on Computational Linguistics

Emotion Representation Mapping (ERM) has the goal to convert existing emotion ratings from one representation format into another one, e.g., mapping Valence-Arousal-Dominance annotations for words or sentences into Ekman’s Basic Emotions and vice versa. ERM can thus not only be considered as an alternative to Word Emotion Induction (WEI) techniques for automatic emotion lexicon construction but may also help mitigate problems that come from the proliferation of emotion representation formats in recent years. We propose a new neural network approach to ERM that outperforms the previous state of the art. Equally important, we present a refined evaluation methodology and gather strong evidence that our model yields results which are (almost) as reliable as human annotations, even in cross-lingual settings. Based on these results we generate new emotion ratings for 13 typologically diverse languages and claim that they have near-gold quality, at least.

JeSemE: Interleaving Semantics and Emotions in a Web Service for the Exploration of Language Change Phenomena
Johannes Hellrich | Sven Buechel | Udo Hahn
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

We here introduce a substantially extended version of JeSemE, an interactive website for visually exploring computationally derived time-variant information on word meanings and lexical emotions assembled from five large diachronic text corpora. JeSemE is designed for scholars in the (digital) humanities as an alternative to consulting manually compiled, printed dictionaries for such information (if available at all). This tool uniquely combines state-of-the-art distributional semantics with a nuanced model of human emotions, two information streams we deem beneficial for a data-driven interpretation of texts in the humanities.

Representation Mapping: A Novel Approach to Generate High-Quality Multi-Lingual Emotion Lexicons
Sven Buechel | Udo Hahn
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Sharing Copies of Synthetic Clinical Corpora without Physical Distribution — A Case Study to Get Around IPRs and Privacy Constraints Featuring the German JSYNCC Corpus
Christina Lohr | Sven Buechel | Udo Hahn
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Proceedings of the First Workshop on Economics and Natural Language Processing
Udo Hahn | Véronique Hoste | Ming-Feng Tsai
Proceedings of the First Workshop on Economics and Natural Language Processing

A Corpus of Corporate Annual and Social Responsibility Reports: 280 Million Tokens of Balanced Organizational Writing
Sebastian G.M. Händschke | Sven Buechel | Jan Goldenstein | Philipp Poschmann | Tinghui Duan | Peter Walgenbach | Udo Hahn
Proceedings of the First Workshop on Economics and Natural Language Processing

We introduce JOCo, a novel text corpus for NLP analytics in the field of economics, business and management. This corpus is composed of corporate annual and social responsibility reports of the top 30 US, UK and German companies in the major (DJIA, FTSE 100, DAX), middle-sized (S&P 500, FTSE 250, MDAX) and technology (NASDAQ, FTSE AIM 100, TECDAX) stock indices, respectively. Altogether, this adds up to 5,000 reports from 270 companies headquartered in three of the world’s most important economies. The corpus spans a time frame from 2000 up to 2015 and contains, in total, 282M tokens. We also feature JOCo in a small-scale experiment to demonstrate its potential for NLP-fueled studies in economics, business and management research.

2017

EmoBank: Studying the Impact of Annotation Perspective and Representation Format on Dimensional Emotion Analysis
Sven Buechel | Udo Hahn
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We describe EmoBank, a corpus of 10k English sentences balancing multiple genres, which we annotated with dimensional emotion metadata in the Valence-Arousal-Dominance (VAD) representation format. EmoBank excels with a bi-perspectival and bi-representational design. On the one hand, we distinguish between writer’s and reader’s emotions, on the other hand, a subset of the corpus complements dimensional VAD annotations with categorical ones based on Basic Emotions. We find evidence for the supremacy of the reader’s perspective in terms of IAA and rating intensity, and achieve close-to-human performance when mapping between dimensional and categorical formats.

Readers vs. Writers vs. Texts: Coping with Different Perspectives of Text Understanding in Emotion Annotation
Sven Buechel | Udo Hahn
Proceedings of the 11th Linguistic Annotation Workshop

We here examine how different perspectives of understanding written discourse, like the reader’s, the writer’s or the text’s point of view, affect the quality of emotion annotations. We conducted a series of annotation experiments on two corpora, a popular movie review corpus and a genre- and domain-balanced corpus of standard English. We found statistical evidence that the writer’s perspective yields superior annotation quality overall. However, the quality one perspective yields compared to the other(s) seems to depend on the domain the utterance originates from. Our data further suggest that the popular movie review data set suffers from an atypical bimodal distribution which may decrease model performance when used as a training resource.

Exploring Diachronic Lexical Semantics with JeSemE
Johannes Hellrich | Udo Hahn
Proceedings of ACL 2017, System Demonstrations

Semedico: A Comprehensive Semantic Search Engine for the Life Sciences
Erik Faessler | Udo Hahn
Proceedings of ACL 2017, System Demonstrations

2016

Do Enterprises Have Emotions?
Sven Buechel | Udo Hahn | Jan Goldenstein | Sebastian G. M. Händschke | Peter Walgenbach
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability
Johannes Hellrich | Udo Hahn
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Feelings from the Past—Adapting Affective Lexicons for Historical Emotion Analysis
Sven Buechel | Johannes Hellrich | Udo Hahn
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

We here describe a novel methodology for measuring affective language in historical text by expanding an affective lexicon and jointly adapting it to prior language stages. We automatically construct a lexicon for word-emotion association of 18th and 19th century German which is then validated against expert ratings. Subsequently, this resource is used to identify distinct emotional patterns and trace long-term emotional trends in different genres of writing spanning several centuries.

UIMA-Based JCoRe 2.0 Goes GitHub and Maven Central ― State-of-the-Art Software Resource Engineering and Distribution of NLP Pipelines
Udo Hahn | Franz Matthies | Erik Faessler | Johannes Hellrich
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce JCoRe 2.0, the relaunch of a UIMA-based open software repository for full-scale natural language processing originating from the Jena University Language & Information Engineering (JULIE) Lab. In an attempt to put the new release of JCoRe on firm software engineering ground, we uploaded it to GitHub, a social coding platform, with an underlying source code versioning system and various means to support collaboration for software development and code modification management. In order to automate the builds of complex NLP pipelines and properly represent and track dependencies of the underlying Java code, we incorporated Maven as part of our software configuration management efforts. In the meantime, we have deployed our artifacts on Maven Central, as well. JCoRe 2.0 offers a broad range of text analytics functionality (mostly) for English-language scientific abstracts and full-text articles, especially from the life sciences domain.

CodE Alltag: A German-Language E-Mail Corpus
Ulrike Krieg-Holz | Christian Schuschnig | Franz Matthies | Benjamin Redling | Udo Hahn
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce CODE ALLTAG, a text corpus composed of German-language e-mails. It is divided into two partitions: the first of these portions, CODE ALLTAG_XL, consists of a bulk-size collection drawn from an openly accessible e-mail archive (roughly 1.5M e-mails), whereas the second portion, CODE ALLTAG_S+d, is much smaller in size (fewer than a thousand e-mails), yet excels with demographic data from each author of an e-mail. CODE ALLTAG, thus, currently constitutes the largest E-Mail corpus ever built. In this paper, we describe, for both parts, the solicitation process for gathering e-mails, present descriptive statistical properties of the corpus, and, for CODE ALLTAG_S+d, reveal a compilation of demographic features of the donors of e-mails.

pdf bib
Bad Company—Neighborhoods in Neural Embedding Spaces Considered Harmful
Johannes Hellrich | Udo Hahn
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We assess the reliability and accuracy of (neural) word embeddings for both modern and historical English and German. Our research provides deeper insights into the empirically justified choice of optimal training methods and parameters. The overall low reliability we observe, nevertheless, casts doubt on the suitability of word neighborhoods in embedding spaces as a basis for qualitative conclusions on synchronic and diachronic lexico-semantic matters, an issue currently high up in the agenda of Digital Humanities.

2014

pdf bib
Collaboratively Annotating Multilingual Parallel Corpora in the Biomedical Domain—some MANTRAs
Johannes Hellrich | Simon Clematide | Udo Hahn | Dietrich Rebholz-Schuhmann
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The coverage of multilingual biomedical resources is high for the English language, yet sparse for non-English languages, an observation which holds for seemingly well-resourced, yet still dramatically low-resourced languages such as Spanish, French or German, but even more so for really under-resourced ones such as Dutch. We here present experimental results for automatically annotating parallel corpora and simultaneously acquiring new biomedical terminology for these under-resourced non-English languages on the basis of two types of language resources, namely parallel corpora (i.e., full translation equivalents at the document unit level) and (admittedly deficient) multilingual biomedical terminologies, with English as their anchor language. We automatically annotate these parallel corpora with biomedical named entities by an ensemble of named entity taggers and harmonize non-identical annotations; the outcome is a so-called silver standard corpus. We conclude with an empirical assessment of this approach to automatically identifying both known and new terms in multilingual corpora.

pdf bib
Disclose Models, Hide the Data - How to Make Use of Confidential Corpora without Seeing Sensitive Raw Data
Erik Faessler | Johannes Hellrich | Udo Hahn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Confidential corpora from the medical, enterprise, security or intelligence domains often contain sensitive raw data which lead to severe restrictions as far as the public accessibility and distribution of such language resources are concerned. The enforcement of strict mechanisms of data protection constitutes a serious barrier to progress in language technology (products) in such domains, since these data are extremely rare or even unavailable for scientists and developers not directly involved in the creation and maintenance of such resources. In order to bypass this problem, we here propose to distribute trained language models which were derived from such resources as a substitute for the original confidential raw data, which remain hidden to the outside world. As an example, we exploit the access-protected German-language medical FRAMED corpus, from which we generate and distribute models for sentence splitting, tokenization and POS tagging based on software taken from OPENNLP, NLTK and JCORE, our own UIMA-based text analytics pipeline.

2012

pdf bib
Iterative Refinement and Quality Checking of Annotation Guidelines — How to Deal Effectively with Semantically Sloppy Named Entity Types, such as Pathological Phenomena
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Erik Faessler | Jenny Traumüller | Susann Schröder | Kerstin Hornbostel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We here discuss a methodology for dealing with the annotation of semantically hard-to-delineate, i.e., sloppy, named entity types. To illustrate sloppiness of entities, we treat an example from the medical domain, namely pathological phenomena. Based on our experience with iterative guideline refinement, we propose to carefully characterize the thematic scope of the annotation by positive and negative coding lists and allow for alternative, short vs. long mention span annotations. Short spans account for canonical entity mentions (e.g., standardized disease names), while long spans cover descriptive text snippets which contain entity-specific elaborations (e.g., anatomical locations, observational details, etc.). Using this stratified approach, evidence for increasing annotation performance is provided by kappa-based inter-annotator agreement measurements over several iterative annotation rounds using continuously refined guidelines. The latter reflects the increasing understanding of the sloppy entity class both from the perspective of guideline writers and users (annotators). Given our data, we have gathered evidence that we can deal with sloppiness in a controlled manner and expect inter-annotator agreement values around 80% for PathoJen, the pathological phenomena corpus currently under development in our lab.

pdf bib
CALBC: Releasing the Final Corpora
Şenay Kafkas | Ian Lewin | David Milward | Erik van Mulligen | Jan Kors | Udo Hahn | Dietrich Rebholz-Schuhmann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

A number of gold standard corpora for named entity recognition are available to the public. However, the existing gold standard corpora are limited in size and semantic entity types. This usually leads to trained solutions that (1) cover only a limited number of semantic entity types and (2) lack generalization capability. In order to overcome these problems, the CALBC project has aimed to automatically generate large-scale corpora annotated with multiple semantic entity types in a community-wide manner, based on the consensus of different named entity solutions. The generated corpus is called the silver standard corpus since the corpus generation process does not involve any manual curation. In this publication, we announce the release of the final CALBC corpora, which include the silver standard corpus in different versions and several gold standard corpora for further usage by the biomedical text mining community. The gold standard corpora are utilised to benchmark the methods used in the silver standard corpus generation process. All the corpora are released in a shared format and are accessible at www.calbc.eu.

2010

pdf bib
A Cognitive Cost Model of Annotations Based on Eye-Tracking Data
Katrin Tomanek | Udo Hahn | Steffen Lohmann | Jürgen Ziegler
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks
Ekaterina Buyko | Udo Hahn
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Proposal for a Configurable Silver Standard
Udo Hahn | Katrin Tomanek | Elena Beisswanger | Erik Faessler
Proceedings of the Fourth Linguistic Annotation Workshop

pdf bib
Book Review: Introduction to Linguistic Annotation and Text Analytics by Graham Wilcock
Udo Hahn
Computational Linguistics, Volume 36, Issue 4 - December 2010

pdf bib
A Comparison of Models for Cost-Sensitive Active Learning
Katrin Tomanek | Udo Hahn
Coling 2010: Posters

pdf bib
The GeneReg Corpus for Gene Expression Regulation Events — An Overview of the Corpus and its In-Domain and Out-of-Domain Interoperability
Ekaterina Buyko | Elena Beisswanger | Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Despite the large variety of corpora in the biomedical domain their annotations differ in many respects, e.g., the coverage of different, highly specialized knowledge domains, varying degrees of granularity of targeted relations, the specificity of linguistic anchoring of relations and named entities in documents, etc. We here present GeneReg (Gene Regulation Corpus), the result of an annotation campaign led by the Jena University Language & Information Engineering (JULIE) Lab. The GeneReg corpus consists of 314 abstracts dealing with the regulation of gene expression in the model organism E. coli. Our emphasis in this paper is on the compatibility of the GeneReg corpus with the alternative Genia event corpus and with several in-domain and out-of-domain lexical resources, e.g., the Specialist Lexicon, FrameNet, and WordNet. The links we established from the GeneReg corpus to these external resources will help improve the performance of the automatic relation extraction engine JREx trained and evaluated on GeneReg.

pdf bib
Annotation Time Stamps — Temporal Metadata from the Linguistic Annotation Process
Katrin Tomanek | Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe the re-annotation of selected types of named entities (persons, organizations, locations) from the MUC7 corpus. The focus of this annotation initiative is on recording the time needed for the linguistic process of named entity annotation. Annotation times are measured on two basic annotation units: sentences vs. complex noun phrases. We gathered evidence that decision times are non-uniformly distributed over the annotation units, while they do not substantially deviate among annotators. This data seems to support the hypothesis that annotation times very much depend on the inherent "hardness" of each single annotation decision. We further show how such time-stamped information can be used for empirically grounded studies of selective sampling techniques, such as Active Learning. We directly compare Active Learning costs on the basis of token-based vs. time-based measurements. The data reveals that Active Learning keeps its competitive advantage over random sampling in both scenarios, though the difference is less marked for the time metric than for the token metric.

pdf bib
The CALBC Silver Standard Corpus for Biomedical Named Entities — A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers
Dietrich Rebholz-Schuhmann | Antonio José Jimeno Yepes | Erik M. van Mulligen | Ning Kang | Jan Kors | David Milward | Peter Corbett | Ekaterina Buyko | Katrin Tomanek | Elena Beisswanger | Udo Hahn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The production of gold standard corpora is time-consuming and costly. We propose an alternative: the "silver standard corpus" (SSC), a corpus that has been generated by the harmonisation of the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences, 15,956,841 tokens). We can demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions, leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generating an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.

2009

pdf bib
How Feasible and Robust is the Automatic Extraction of Gene Regulation Events? A Cross-Method Evaluation under Lab and Real-Life Conditions
Udo Hahn | Katrin Tomanek | Ekaterina Buyko | Jung-jae Kim | Dietrich Rebholz-Schuhmann
Proceedings of the BioNLP 2009 Workshop

pdf bib
Event Extraction from Trimmed Dependency Graphs
Ekaterina Buyko | Erik Faessler | Joachim Wermter | Udo Hahn
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

pdf bib
On Proper Unit Selection in Active Learning: Co-Selection Effects for Named Entity Recognition
Katrin Tomanek | Florian Laws | Udo Hahn | Hinrich Schütze
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

pdf bib
Timed Annotations — Enhancing MUC7 Metadata by the Time It Takes to Annotate Named Entities
Katrin Tomanek | Udo Hahn
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf bib
Semi-Supervised Active Learning for Sequence Labeling
Katrin Tomanek | Udo Hahn
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf bib
Building a BioWordNet Using WordNet Data Structures and WordNet’s Software Infrastructure–A Failure Story
Michael Poprat | Elena Beisswanger | Udo Hahn
Software Engineering, Testing, and Quality Assurance for Natural Language Processing

pdf bib
Multi-Task Active Learning for Linguistic Annotations
Roi Reichart | Katrin Tomanek | Udo Hahn | Ari Rappoport
Proceedings of ACL-08: HLT

pdf bib
Approximating Learning Curves for Active-Learning-Driven Annotation
Katrin Tomanek | Udo Hahn
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Active learning (AL) is getting more and more popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks. A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in efforts. This question is tightly related to estimating the classifier performance after a certain amount of data has already been annotated. While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which - in realistic annotation settings, at least - is often unavailable. We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members. This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain. Considering named entity recognition as a test case we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions.

pdf bib
Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab
Udo Hahn | Elena Beisswanger | Ekaterina Buyko | Michael Poprat | Katrin Tomanek | Joachim Wermter
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We provide an overview of corpus building efforts at the Jena University Language & Information Engineering (JULIE) Lab which are focused on life science documents. Special emphasis is laid on semantic annotations in terms of a large number of biomedical named entities (almost 100 entity types), semantic relations, as well as discourse phenomena, reference relations in particular.

pdf bib
Are Morpho-Syntactic Features More Predictive for the Resolution of Noun Phrase Coordination Ambiguity than Lexico-Semantic Similarity Scores?
Ekaterina Buyko | Udo Hahn
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

pdf bib
Quantitative Data on Referring Expressions in Biomedical Abstracts
Michael Poprat | Udo Hahn
Biological, translational, and clinical language processing

pdf bib
Efficient Annotation with the Jena ANnotation Environment (JANE)
Katrin Tomanek | Joachim Wermter | Udo Hahn
Proceedings of the Linguistic Annotation Workshop

pdf bib
An Annotation Type System for a Data-Driven NLP Pipeline
Udo Hahn | Ekaterina Buyko | Katrin Tomanek | Scott Piao | John McNaught | Yoshimasa Tsuruoka | Sophia Ananiadou
Proceedings of the Linguistic Annotation Workshop

pdf bib
An Approach to Text Corpus Construction which Cuts Annotation Costs and Maintains Reusability of Annotated Data
Katrin Tomanek | Joachim Wermter | Udo Hahn
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
Semantic Atomicity and Multilinguality in the Medical Domain: Design Considerations for the MorphoSaurus Subword Lexicon
Stefan Schulz | Kornél Markó | Philipp Daumke | Udo Hahn | Susanne Hanser | Percy Nohama | Roosewelt Leite de Andrade | Edson Pacheco | Martin Romacker
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present the lexico-semantic foundations underlying a multilingual lexicon the entries of which are constituted by so-called subwords. These subwords reflect semantic atomicity constraints in the medical domain which diverge from canonical lexicological understanding in NLP. We focus here on criteria to identify and delimit reasonable subword units, to group them into functionally adequate synonymy classes and relate them by two types of lexical relations. The lexicon we implemented on the basis of these considerations forms the lexical backbone for MorphoSaurus, a cross-language document retrieval engine for the medical domain.

pdf bib
You Can’t Beat Frequency (Unless You Use Linguistic Knowledge) – A Qualitative Evaluation of Association Measures for Collocation and Term Extraction
Joachim Wermter | Udo Hahn
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

pdf bib
Subword Clusters as Light-Weight Interlingua for Multilingual Document Retrieval
Udo Hahn | Kornel Marko | Stefan Schulz
Proceedings of Machine Translation Summit X: Papers

We introduce a light-weight interlingua for a cross-language document retrieval system in the medical domain. It is composed of equivalence classes of semantically primitive, language-specific subwords which are clustered by interlingual and intralingual synonymy. Each subword cluster represents a basic conceptual entity of the language-independent interlingua. Documents, as well as queries, are mapped to this interlingua level on which retrieval operations are performed. Evaluation experiments reveal that this interlingua-based retrieval model outperforms a direct translation approach.

pdf bib
Paradigmatic Modifiability Statistics for the Extraction of Complex Multi-Word Terms
Joachim Wermter | Udo Hahn
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

pdf bib
Cognate Mapping - A Heuristic Strategy for the Semi-Supervised Acquisition of a Spanish Lexicon from a Portuguese Seed Lexicon
Stefan Schulz | Kornel Markó | Eduardo Sbrissia | Percy Nohama | Udo Hahn
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
High-Performance Tagging on Medical Texts
Udo Hahn | Joachim Wermter
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Collocation Extraction Based on Modifiability Statistics
Joachim Wermter | Udo Hahn
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
An Annotated German-Language Medical Text Corpus as Language Resource
Joachim Wermter | Udo Hahn
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Pumping Documents Through a Domain and Genre Classification Pipeline
Udo Hahn | Joachim Wermter
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

pdf bib
Biomedical text retrieval in languages with a complex morphology
Stefan Schulz | Martin Honeck | Udo Hahn
Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain

pdf bib
Towards Very Large Ontologies for Medical Language Processing
Udo Hahn | Stefan Schulz
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

pdf bib
The SynDiKATe Text Knowledge Base Generator
Udo Hahn | Martin Romacker
Proceedings of the First International Conference on Human Language Technology Research

2000

pdf bib
An Empirical Assessment of Semantic Interpretation
Martin Romacker | Udo Hahn
1st Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
An Integrated Model of Semantic and Conceptual Interpretation from Dependency Structures
Udo Hahn | Martin Romacker
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

1999

pdf bib
Functional Centering – Grounding Referential Coherence on Information Structure
Michael Strube | Udo Hahn
Computational Linguistics, Volume 25, Number 3, September 1999

1998

pdf bib
A Text Understander that Learns
Udo Hahn | Klemens Schnattinger
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
A Text Understander that Learns
Udo Hahn | Klemens Schnattinger
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

1997

pdf bib
A Formal Model of Text Summarization Based on Condensation Operators of a Terminological Logic
Ulrich Reimer | Udo Hahn
Intelligent Scalable Text Summarization

pdf bib
Centering in-the-Large: Computing Referential Discourse Segments
Udo Hahn | Michael Strube
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Message-passing Protocols for Real-world Parsing - An Object-oriented Model and its Preliminary Evaluation
Udo Hahn | Peter Neuhaus | Norbert Broeker
Proceedings of the Fifth International Workshop on Parsing Technologies

We argue for a performance-based design of natural language grammars and their associated parsers in order to meet the constraints imposed by real-world NLP. Our approach incorporates declarative and procedural knowledge about language and language use within an object-oriented specification framework. We discuss several message-passing protocols for parsing and provide reasons for sacrificing completeness of the parse in favor of efficiency based on a preliminary empirical evaluation.

1996

pdf bib
Functional Centering
Michael Strube | Udo Hahn
34th Annual Meeting of the Association for Computational Linguistics

pdf bib
Bridging Textual Ellipses
Udo Hahn | Michael Strube | Katja Markert
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

pdf bib
Restricted Parallelism in Object-Oriented Lexical Parsing
Peter Neuhaus | Udo Hahn
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

1995

pdf bib
ParseTalk about Sentence- and Text-Level Anaphora
Michael Strube | Udo Hahn
Seventh Conference of the European Chapter of the Association for Computational Linguistics

1994

pdf bib
Concurrent Lexicalized Dependency Parsing: The ParseTalk Model
Norbert Broker | Udo Hahn | Susanne Schacht
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

pdf bib
Concurrent Lexicalized Dependency Parsing: A Behavioral View on ParseTalk Events
Susanne Schacht | Udo Hahn | Norbert Broker
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

1992

pdf bib
On Text Coherence Parsing
Udo Hahn
COLING 1992 Volume 1: The 14th International Conference on Computational Linguistics

1986

pdf bib
TOPIC Essentials
Udo Hahn | Ulrich Reimer
Coling 1986 Volume 1: The 11th International Conference on Computational Linguistics

1984

pdf bib
Textual Expertise in Word Experts: An Approach to Text Parsing Based on Topic/Comment Monitoring
Udo Hahn
10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics
