Margot Mieskes


Proceedings of the Sixth Workshop on Teaching NLP
Sana Al-azzawi | Laura Biester | György Kovács | Ana Marasović | Leena Mathur | Margot Mieskes | Leonie Weissweiler
Proceedings of the Sixth Workshop on Teaching NLP

Autism Detection in Speech – A Survey
Nadine Probol | Margot Mieskes
Findings of the Association for Computational Linguistics: EACL 2024

There has been a range of studies of how autism is displayed in voice, speech, and language. We analyse studies from the biomedical, as well as the psychological domain, but also from the NLP domain in order to find linguistic, prosodic and acoustic cues. Our survey looks at all three domains. We define autism and which comorbidities might influence the correct detection of the disorder. We especially look at observations such as verbal and semantic fluency, prosodic features, but also disfluencies and speaking rate. We also show word-based approaches and describe machine learning and transformer-based approaches both on the audio data as well as the transcripts. Lastly, we conclude that, while there already is a lot of research, female patients seem to be severely under-researched. Also, most NLP research focuses on traditional machine learning methods instead of transformers. Additionally, we were unable to find research combining features from both audio and transcripts.

Your Stereotypical Mileage May Vary: Practical Challenges of Evaluating Biases in Multiple Languages and Cultural Contexts
Karen Fort | Laura Alonso Alemany | Luciana Benotti | Julien Bezançon | Claudia Borg | Marthese Borg | Yongjian Chen | Fanny Ducel | Yoann Dupont | Guido Ivetta | Zhijian Li | Margot Mieskes | Marco Naguib | Yuyan Qian | Matteo Radaelli | Wolfgang S. Schmeisser-Nieto | Emma Raimundo Schulz | Thiziri Saci | Sarah Saidi | Javier Torroba Marchante | Shilin Xie | Sergio E. Zanotto | Aurélie Névéol
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting. The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus containing contrasting sentence pairs, one sentence that presents a stereotype concerning an underadvantaged group and another minimally changed sentence, concerning a matching advantaged group. We build on the French CrowS-Pairs corpus and guidelines to provide translations of the existing material into seven additional languages. In total, we produce 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models. We find that language models in all languages favor sentences that express stereotypes in most bias categories. The process of creating a resource that covers a wide range of language types and cultural settings highlights the difficulty of bias evaluation, in particular comparability across languages and contexts.


Investigating Bias in Multilingual Language Models: Cross-Lingual Transfer of Debiasing Techniques
Manon Reusens | Philipp Borchert | Margot Mieskes | Jochen De Weerdt | Bart Baesens
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This paper investigates the transferability of debiasing techniques across different languages within multilingual models. We examine the applicability of these techniques in English, French, German, and Dutch. Using multilingual BERT (mBERT), we demonstrate that cross-lingual transfer of debiasing techniques is not only feasible but also yields promising results. Surprisingly, our findings reveal no performance disadvantages when applying these techniques to non-English languages. Using translations of the CrowS-Pairs dataset, our analysis identifies SentenceDebias as the best technique across different languages, reducing bias in mBERT by an average of 13%. We also find that debiasing techniques with additional pretraining exhibit enhanced cross-lingual effectiveness for the languages included in the analyses, particularly in lower-resource languages. These novel insights contribute to a deeper understanding of bias mitigation in multilingual language models and provide practical guidance for debiasing techniques in different language contexts.

h_da@ReproHumn – Reproduction of Human Evaluation and Technical Pipeline
Margot Mieskes | Jacob Georg Benz
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

How reliable are human evaluation results? Is it possible to replicate human evaluation? This work takes a closer look at the evaluation of the output of a Text-to-Speech (TTS) system. Unfortunately, our results indicate that human evaluation is not as straightforward to replicate as expected. Additionally, we also present results on reproducing the technical background of the TTS system and discuss potential reasons for the reproduction failure.

Emotions in Spoken Language - Do we need acoustics?
Nadine Probol | Margot Mieskes
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Work on emotion detection is often focused on textual data from, e.g., social media. If multimodal data (e.g. speech) is analysed, the focus again is often placed on the transcription. This paper takes a closer look at how crucial acoustic information actually is for the recognition of emotions from multimodal data. To this end we use the IEMOCAP data, which is one of the larger data sets that provides transcriptions, audio recordings and manual emotion categorization. We build models for emotion classification using text only, acoustics only and a combination of both modalities in order to examine the influence of the various modalities on the final categorization. Our results indicate that text-only models outperform acoustics-only models, but that combining text-only and acoustics-only models improves the results. Additionally, we perform a qualitative analysis and find that a range of misclassifications are due to factors not related to the model, but to the data, such as recording quality, a challenging classification task and misclassifications that are unsurprising for humans.

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

Proceedings of the 1st Workshop on Teaching for NLP
Annemarie Friedrich | Stefan Grünewald | Margot Mieskes | Jannik Strötgen | Christian Wartena
Proceedings of the 1st Workshop on Teaching for NLP


Replicability under Near-Perfect Conditions – A Case-Study from Automatic Summarization
Margot Mieskes
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Replication of research results has become more and more important in Natural Language Processing. Nevertheless, we still rely on results reported in the literature for comparison. Additionally, elements of an experimental setup are not always completely reported. This includes, but is not limited to, failing to report specific parameters used or omitting an implementational detail. In our experiment, based on two frequently used data sets from the domain of automatic summarization and the seemingly full disclosure of research artefacts, we examine how well reported results are replicable and what elements influence the success or failure of replication. Our results indicate that publishing research artefacts is far from sufficient and that publishing all relevant parameters in all possible detail is crucial.


Reviewing Natural Language Processing Research
Kevin Cohen | Karën Fort | Margot Mieskes | Aurélie Névéol | Anna Rogers
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

The reviewing procedure has been identified as one of the major issues in the current situation of the NLP field. While it is implicitly assumed that junior researchers learn reviewing during their PhD project, this might not always be the case. Additionally, with the growing NLP community and the efforts in the context of widening the NLP community, researchers joining the field might not have the opportunity to practise reviewing. This tutorial fills this gap by providing an opportunity to learn the basics of reviewing. More experienced researchers might also find this tutorial useful for revising their reviewing procedure.

Proceedings of the Fifth Workshop on Teaching NLP
David Jurgens | Varada Kolhatkar | Lucy Li | Margot Mieskes | Ted Pedersen
Proceedings of the Fifth Workshop on Teaching NLP

Are We Summarizing the Right Way? A Survey of Dialogue Summarization Data Sets
Don Tuggener | Margot Mieskes | Jan Deriu | Mark Cieliebak
Proceedings of the Third Workshop on New Frontiers in Summarization

Dialogue summarization is a long-standing task in the field of NLP, and several data sets with dialogues and associated human-written summaries of different styles exist. However, it is unclear for which type of dialogue which type of summary is most appropriate. For this reason, we apply a linguistic model of dialogue types to derive matching summary items and NLP tasks. This allows us to map existing dialogue summarization data sets into this model and identify gaps and potential directions for future work. As part of this process, we also provide an extensive overview of existing dialogue summarization data sets.


Language Agnostic Automatic Summarization Evaluation
Christopher Tauchmann | Margot Mieskes
Proceedings of the Twelfth Language Resources and Evaluation Conference

So far, work on automatic summarization has dealt primarily with English data. Accordingly, evaluation methods were primarily developed with this language in mind. In our work, we present experiments on adapting available evaluation methods such as ROUGE and PYRAMID to non-English data. We base our experiments on various English and non-English homogeneous benchmark data sets as well as a non-English heterogeneous data set. Our results indicate that ROUGE can indeed be adapted to non-English data – both homogeneous and heterogeneous. Using a recent implementation of automatic PYRAMID evaluation, we also show its adaptability to non-English data.

A Data Set for the Analysis of Text Quality Dimensions in Summarization Evaluation
Margot Mieskes | Eneldo Loza Mencía | Tim Kronsbein
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic evaluation of summarization focuses on developing a metric to represent the quality of the resulting text. However, text quality is represented in a variety of dimensions ranging from grammaticality to readability and coherence. In our work, we analyze the dependencies between a variety of quality dimensions on automatically created multi-document summaries and which dimensions automatic evaluation metrics such as ROUGE, PEAK or JSD are able to capture. Our results indicate that variants of ROUGE are correlated to various quality dimensions and that some automatic summarization methods achieve higher quality summaries than others with respect to individual summary quality dimensions. Our results also indicate that differentiating between quality dimensions facilitates inspection and fine-grained comparison of summarization methods and its characteristics. We make the data from our two summarization quality evaluation experiments publicly available in order to facilitate the future development of specialized automatic evaluation methods.

Reviewing Natural Language Processing Research
Kevin Cohen | Karën Fort | Margot Mieskes | Aurélie Névéol
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

This tutorial will cover the theory and practice of reviewing research in natural language processing. Heavy reviewing burdens on natural language processing researchers have made it clear that our community needs to increase the size of our pool of potential reviewers. Simultaneously, notable “false negatives”—rejection by our conferences of work that was later shown to be tremendously important after acceptance by other conferences—have raised awareness of the fact that our reviewing practices leave something to be desired. We do not often talk about “false positives” with respect to conference papers, but leaders in the field have noted that we seem to have a publication bias towards papers that report high performance, with perhaps not much else of interest in them. It need not be this way. Reviewing is a learnable skill, and you will learn it here via lectures and a considerable amount of hands-on practice.


OCR Quality and NLP Preprocessing
Margot Mieskes | Stefan Schmunk
Proceedings of the 2019 Workshop on Widening NLP

We present initial experiments to evaluate the performance of tasks such as Part of Speech Tagging on data corrupted by Optical Character Recognition (OCR). Our results, based on English and German data and using artificial experiments as well as initial real OCRed data, indicate that even a small drop in OCR quality considerably increases the error rates, which would have a significant impact on subsequent processing steps.

Summarization Evaluation meets Short-Answer Grading
Margot Mieskes | Ulrike Padó
Proceedings of the 8th Workshop on NLP for Computer Assisted Language Learning

Community Perspective on Replicability in Natural Language Processing
Margot Mieskes | Karën Fort | Aurélie Névéol | Cyril Grouin | Kevin Cohen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises how the NLP community views the topic of replicability in general. Using a survey, in which we involve members of the NLP community, we investigate how our community perceives this topic, its relevance and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations, that successful reproducibility requires more than having access to code and data. Additionally, the results show that the topic has to be tackled from the authors’, reviewers’ and community’s side.


Preparing Data from Psychotherapy for Natural Language Processing
Margot Mieskes | Andreas Stiegelmayr
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data
Christopher Tauchmann | Thomas Arnold | Andreas Hanselowski | Christian M. Meyer | Margot Mieskes
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Work Smart - Reducing Effort in Short-Answer Grading
Margot Mieskes | Ulrike Padó
Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning


A Quantitative Study of Data in the NLP community
Margot Mieskes
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

We present results on a quantitative analysis of publications in the NLP domain on collecting, publishing and availability of research data. We find that a wide range of publications rely on data crawled from the web, but few give details on how potentially sensitive data was treated. Additionally, we find that while links to repositories of data are given, they often do not work even a short time after publication. We put together several suggestions on how to improve this situation based on publications from the NLP domain, but also other research areas.


EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres
Steffen Remus | Gerold Hintz | Chris Biemann | Christian M. Meyer | Darina Benikova | Judith Eckle-Kohler | Margot Mieskes | Thomas Arnold
Proceedings of the 10th Web as Corpus Workshop

Bridging the gap between extractive and abstractive summaries: Creation and evaluation of coherent extracts from heterogeneous sources
Darina Benikova | Margot Mieskes | Christian M. Meyer | Iryna Gurevych
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Coherent extracts are a novel type of summary combining the advantages of manually created abstractive summaries, which are fluent but difficult to evaluate, and low-quality automatically created extractive summaries, which lack coherence and structure. We use a corpus of heterogeneous documents to address the issue that information seekers usually face – a variety of different types of information sources. We directly extract information from these, but minimally redact and meaningfully order it to form a coherent text. Our qualitative and quantitative evaluations show that quantitative results are not sufficient to judge the quality of a summary and that other quality criteria, such as coherence, should also be taken into account. We find that our manually created corpus is of high quality and that it has the potential to bridge the gap between reference corpora of abstracts and automatic methods producing extracts. Our corpus is available to the research community for further development.

MDSWriter: Annotation Tool for Creating High-Quality Multi-Document Summarization Corpora
Christian M. Meyer | Darina Benikova | Margot Mieskes | Iryna Gurevych
Proceedings of ACL-2016 System Demonstrations


DKPro Agreement: An Open-Source Java Library for Measuring Inter-Rater Agreement
Christian M. Meyer | Margot Mieskes | Christian Stab | Iryna Gurevych
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations


Knowledge Sources for Bridging Resolution in Multi-Party Dialog
Mark-Christoph Mueller | Margot Mieskes | Michael Strube
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we investigate the coverage of the two knowledge sources WordNet and Wikipedia for the task of bridging resolution. We report on an annotation experiment which yielded pairs of bridging anaphors and their antecedents in spoken multi-party dialog. Manual inspection of the two knowledge sources showed that, with some interesting exceptions, Wikipedia is superior to WordNet when it comes to the coverage of information necessary to resolve the bridging anaphors in our data set. We further describe a simple procedure for the automatic extraction of the required knowledge from Wikipedia by means of an API, and discuss some of the implications of the procedure’s performance.

A Three-stage Disfluency Classifier for Multi Party Dialogues
Margot Mieskes | Michael Strube
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present work on a three-stage system to detect and classify disfluencies in multi party dialogues. The system consists of a regular expression based module and two machine learning based modules. The results are compared to other work on multi party dialogues and we show that our system outperforms previously reported ones.

Parameters for Topic Boundary Detection in Multi-Party Dialogues
Margot Mieskes | Michael Strube
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a topic boundary detection method that searches for connections between sequences of utterances in multi party dialogues. The connections are established based on word identity. We compare our method to a state-of-the-art automatic topic boundary detection method that was also used on multi party dialogues. We checked various methods of preprocessing the data, including stemming, lemmatization and stopword filtering with text-based as well as speech-based stopword lists. Using standard evaluation methods, we found that our method outperformed the state-of-the-art method.


Part-of-Speech Tagging of Transcribed Speech
Margot Mieskes | Michael Strube
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We used four Part-of-Speech taggers, which are available for research purposes and were originally trained on text, to tag a corpus of transcribed multiparty spoken dialogues. The assigned tags were then manually corrected. The correction was first used to evaluate the four taggers, then to retrain them. Despite limited resources in time, money and annotators, we reached results comparable to those reported for the taggers on text. Based on our experience, we present guidelines to produce reliably POS-tagged corpora of new domains.