-
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
Authors:
Emmanouil Zaranis,
Giuseppe Attanasio,
Sweta Agrawal,
André F. T. Martins
Abstract:
The automatic assessment of translation quality has recently become crucial for many stages of the translation pipeline, from data curation to training and decoding. However, while quality estimation metrics have been optimized to align with human judgments, no attention has been given to these metrics' potential biases, particularly in reinforcing visibility and usability for some demographic gro…
▽ More
The automatic assessment of translation quality has recently become crucial for many stages of the translation pipeline, from data curation to training and decoding. However, while quality estimation metrics have been optimized to align with human judgments, no attention has been given to these metrics' potential biases, particularly in reinforcing visibility and usability for some demographic groups over others. This paper is the first to investigate gender bias in quality estimation (QE) metrics and its downstream impact on machine translation (MT). We focus on out-of-English translations where the target language uses grammatical gender. We ask: (RQ1) Do contemporary QE metrics exhibit gender bias? (RQ2) Can the use of contextual information mitigate this bias? (RQ3) How does QE influence gender bias in MT outputs? Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. Masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Moreover, context-aware QE metrics reduce errors for masculine-inflected references but fail to address feminine referents, exacerbating gender disparities. Additionally, we show that QE metrics can perpetuate gender bias in MT systems when used in quality-aware decoding. Our findings highlight the need to address gender bias in QE metrics to ensure equitable and unbiased MT systems.
△ Less
Submitted 14 October, 2024;
originally announced October 2024.
-
Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into German
Authors:
Manuel Lardelli,
Giuseppe Attanasio,
Anne Lauscher
Abstract:
The translation of gender-neutral person-referring terms (e.g., the students) is often non-trivial. Translating from English into German poses an interesting case -- in German, person-referring nouns are usually gender-specific, and if the gender of the referent(s) is unknown or diverse, the generic masculine (die Studenten (m.)) is commonly used. This solution, however, reduces the visibility of…
▽ More
The translation of gender-neutral person-referring terms (e.g., the students) is often non-trivial. Translating from English into German poses an interesting case -- in German, person-referring nouns are usually gender-specific, and if the gender of the referent(s) is unknown or diverse, the generic masculine (die Studenten (m.)) is commonly used. This solution, however, reduces the visibility of other genders, such as women and non-binary people. To counteract gender discrimination, a societal movement towards using gender-fair language exists (e.g., by adopting neosystems). However, gender-fair German is currently barely supported in machine translation (MT), requiring post-editing or manual translations. We address this research gap by studying gender-fair language in English-to-German MT. Concretely, we enrich a community-created gender-fair language dictionary and sample multi-sentence test instances from encyclopedic text and parliamentary speeches. Using these novel resources, we conduct the first benchmark study involving two commercial systems and six neural MT models for translating words in isolation and natural contexts across two domains. Our findings show that most systems produce mainly masculine forms and rarely gender-neutral variants, highlighting the need for future research. We release code and data at https://github.com/g8a9/building-bridges-gender-fair-german-mt.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Classist Tools: Social Class Correlates with Performance in NLP
Authors:
Amanda Cercas Curry,
Giuseppe Attanasio,
Zeerak Talat,
Dirk Hovy
Abstract:
Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age…
▽ More
Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age and gender are somewhat well represented, but Labov's original target, socioeconomic status, is noticeably absent. And yet it matters. We show empirically that NLP disadvantages less-privileged socioeconomic groups. We annotate a corpus of 95K utterances from movies with social class, ethnicity and geographical language variety and measure the performance of NLP systems on three tasks: language modelling, automatic speech recognition, and grammar error correction. We find significant performance disparities that can be attributed to socioeconomic status as well as ethnicity and geographical differences. With NLP technologies becoming ever more ubiquitous and quotidian, they must accommodate all language varieties to avoid disadvantaging already marginalised groups. We argue for the inclusion of socioeconomic class in future language technologies.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
Authors:
Giuseppe Attanasio,
Beatrice Savoldi,
Dennis Fucci,
Dirk Hovy
Abstract:
Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight languag…
▽ More
Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. Our findings reveal clear gender disparities, with the advantaged group varying across languages and models. Surprisingly, those gaps are not explained by acoustic or lexical properties. However, probing internal model states reveals a correlation with gendered performance gap. That is, the easier it is to distinguish speaker gender in a language using probes, the more the gap reduces, favoring female speakers. Our results show that gender disparities persist even in state-of-the-art models. Our findings have implications for the improvement of multilingual ASR systems, underscoring the importance of accessibility to training data and nuanced evaluation to predict and mitigate gender gaps. We release all code and artifacts at https://github.com/g8a9/multilingual-asr-gender-gap.
△ Less
Submitted 3 October, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation
Authors:
Giuseppe Attanasio,
Flor Miriam Plaza-del-Arco,
Debora Nozza,
Anne Lauscher
Abstract:
Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereoty…
▽ More
Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. In this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. Concretely, we compute established gender bias metrics on the WinoMT corpus from English to German and Spanish. We discover that IFT models default to male-inflected translations, even disregarding female occupational stereotypes. Next, using interpretability methods, we unveil that models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. Finally, based on this finding, we propose an easy-to-implement and effective bias mitigation solution based on few-shot learning that leads to significantly fairer translations.
△ Less
Submitted 25 October, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Authors:
Federico Bianchi,
Mirac Suzgun,
Giuseppe Attanasio,
Paul Röttger,
Dan Jurafsky,
Tatsunori Hashimoto,
James Zou
Abstract:
Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning.…
▽ More
Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
△ Less
Submitted 19 March, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Authors:
Eliana Pastor,
Alkis Koudounas,
Giuseppe Attanasio,
Dirk Hovy,
Elena Baralis
Abstract:
Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language understanding (SLU) tasks, and explanations are difficult to interpret for most users. We introduce a new approach to explain speech classification models. We…
▽ More
Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language understanding (SLU) tasks, and explanations are difficult to interpret for most users. We introduce a new approach to explain speech classification models. We generate easy-to-interpret explanations via input perturbation on two information levels. 1) Word-level explanations reveal how each word-related audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody and background noise) answer the counterfactual: ``What would the model prediction be if we edited the audio signal in this way?'' We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian. Our findings demonstrate that the explanations are faithful to the model's inner workings and plausible to humans. Our method and findings pave the way for future research on interpreting speech models.
△ Less
Submitted 14 September, 2023;
originally announced September 2023.
-
Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization
Authors:
Helena Bonaldi,
Giuseppe Attanasio,
Debora Nozza,
Marco Guerini
Abstract:
Recent computational approaches for combating online hate speech involve the automatic generation of counter narratives by adapting Pretrained Transformer-based Language Models (PLMs) with human-curated data. This process, however, can produce in-domain overfitting, resulting in models generating acceptable narratives only for hatred similar to training data, with little portability to other targe…
▽ More
Recent computational approaches for combating online hate speech involve the automatic generation of counter narratives by adapting Pretrained Transformer-based Language Models (PLMs) with human-curated data. This process, however, can produce in-domain overfitting, resulting in models generating acceptable narratives only for hatred similar to training data, with little portability to other targets or to real-world toxic language. This paper introduces novel attention regularization methodologies to improve the generalization capabilities of PLMs for counter narratives generation. Overfitting to training-specific terms is then discouraged, resulting in more diverse and richer narratives. We experiment with two attention-based regularization techniques on a benchmark English dataset. Regularized models produce better counter narratives than state-of-the-art approaches in most cases, both in terms of automatic metrics and human evaluation, especially when hateful targets are not present in the training data. This work paves the way for better and more flexible counter-speech generation models, a task for which datasets are highly challenging to produce.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Authors:
Paul Röttger,
Hannah Rose Kirk,
Bertie Vidgen,
Giuseppe Attanasio,
Federico Bianchi,
Dirk Hovy
Abstract:
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and…
▽ More
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
△ Less
Submitted 1 April, 2024; v1 submitted 2 August, 2023;
originally announced August 2023.
-
ITALIC: An Italian Intent Classification Dataset
Authors:
Alkis Koudounas,
Moreno La Quatra,
Lorenzo Vaiani,
Luca Colomba,
Giuseppe Attanasio,
Eliana Pastor,
Luca Cagliero,
Elena Baralis
Abstract:
Recent large-scale Spoken Language Understanding datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale speech dataset designed for intent classification in Italian. The dataset comprises 16,521 crowdsourced audio samples recorded by 70 speakers from various Itali…
▽ More
Recent large-scale Spoken Language Understanding datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale speech dataset designed for intent classification in Italian. The dataset comprises 16,521 crowdsourced audio samples recorded by 70 speakers from various Italian regions and annotated with intent labels and additional metadata. We explore the versatility of ITALIC by evaluating current state-of-the-art speech and text models. Results on intent classification suggest that increasing scale and running language adaptation yield better speech models, monolingual text models outscore multilingual ones, and that speech recognition on ITALIC is more challenging than on existing Italian benchmarks. We release both the dataset and the annotation scheme to streamline the development of new Italian SLU models and language-specific datasets.
△ Less
Submitted 14 June, 2023;
originally announced June 2023.
-
E Pluribus Unum: Guidelines on Multi-Objective Evaluation of Recommender Systems
Authors:
Patrick John Chia,
Giuseppe Attanasio,
Jacopo Tagliabue,
Federico Bianchi,
Ciro Greco,
Gabriel de Souza P. Moreira,
Davide Eynard,
Fahd Husain
Abstract:
Recommender Systems today are still mostly evaluated in terms of accuracy, with other aspects beyond the immediate relevance of recommendations, such as diversity, long-term user retention and fairness, often taking a back seat. Moreover, reconciling multiple performance perspectives is by definition indeterminate, presenting a stumbling block to those in the pursuit of rounded evaluation of Recom…
▽ More
Recommender Systems today are still mostly evaluated in terms of accuracy, with other aspects beyond the immediate relevance of recommendations, such as diversity, long-term user retention and fairness, often taking a back seat. Moreover, reconciling multiple performance perspectives is by definition indeterminate, presenting a stumbling block to those in the pursuit of rounded evaluation of Recommender Systems. EvalRS 2022 -- a data challenge designed around Multi-Objective Evaluation -- was a first practical endeavour, providing many insights into the requirements and challenges of balancing multiple objectives in evaluation. In this work, we reflect on EvalRS 2022 and expound upon crucial learnings to formulate a first-principles approach toward Multi-Objective model selection, and outline a set of guidelines for carrying out a Multi-Objective Evaluation challenge, with potential applicability to the problem of rounded evaluation of competing models in real-world deployments.
△ Less
Submitted 20 April, 2023;
originally announced April 2023.
-
Is It Worth the (Environmental) Cost? Limited Evidence for Temporal Adaptation via Continuous Training
Authors:
Giuseppe Attanasio,
Debora Nozza,
Federico Bianchi,
Dirk Hovy
Abstract:
Language is constantly changing and evolving, leaving language models to become quickly outdated. Consequently, we should continuously update our models with new data to expose them to new events and facts. However, that requires additional computing, which means new carbon emissions. Do any measurable benefits justify this cost? This paper looks for empirical evidence to support continuous traini…
▽ More
Language is constantly changing and evolving, leaving language models to become quickly outdated. Consequently, we should continuously update our models with new data to expose them to new events and facts. However, that requires additional computing, which means new carbon emissions. Do any measurable benefits justify this cost? This paper looks for empirical evidence to support continuous training. We reproduce existing benchmarks and extend them to include additional time periods, models, and tasks. Our results show that the downstream task performance of temporally adapted English models for social media data do not improve over time. Pretrained models without temporal adaptation are actually significantly more effective and efficient. However, we also note a lack of suitable temporal benchmarks. Our findings invite a critical reflection on when and how to temporally adapt language models, accounting for sustainability.
△ Less
Submitted 4 May, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
ferret: a Framework for Benchmarking Explainers on Transformers
Authors:
Giuseppe Attanasio,
Eliana Pastor,
Chiara Di Bonaventura,
Debora Nozza
Abstract:
As Transformers are increasingly relied upon to solve complex NLP problems, there is an increased need for their decisions to be humanly interpretable. While several explainable AI (XAI) techniques for interpreting the outputs of transformer-based models have been proposed, there is still a lack of easy access to using and comparing them. We introduce ferret, a Python library to simplify the use a…
▽ More
As Transformers are increasingly relied upon to solve complex NLP problems, there is an increased need for their decisions to be humanly interpretable. While several explainable AI (XAI) techniques for interpreting the outputs of transformer-based models have been proposed, there is still a lack of easy access to using and comparing them. We introduce ferret, a Python library to simplify the use and comparisons of XAI methods on transformer-based classifiers. With ferret, users can visualize and compare transformers-based models output explanations using state-of-the-art XAI methods on any free-text or existing XAI corpora. Moreover, users can also evaluate ad-hoc XAI metrics to select the most faithful and plausible explanations. To align with the recently consolidated process of sharing and using transformers-based models from Hugging Face, ferret interfaces directly with its Python library. In this paper, we showcase ferret to benchmark XAI methods used on transformers for sentiment analysis and hate speech detection. We show how specific methods provide consistently better explanations and are preferable in the context of transformer models.
△ Less
Submitted 2 March, 2023; v1 submitted 2 August, 2022;
originally announced August 2022.
-
EvalRS: a Rounded Evaluation of Recommender Systems
Authors:
Jacopo Tagliabue,
Federico Bianchi,
Tobias Schnabel,
Giuseppe Attanasio,
Ciro Greco,
Gabriel de Souza P. Moreira,
Patrick John Chia
Abstract:
Much of the complexity of Recommender Systems (RSs) comes from the fact that they are used as part of more complex applications and affect user experience through a varied range of user interfaces. However, research focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such narrow f…
▽ More
Much of the complexity of Recommender Systems (RSs) comes from the fact that they are used as part of more complex applications and affect user experience through a varied range of user interfaces. However, research focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as reinforcing data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build in the open new methodologies for testing RSs "in the wild".
△ Less
Submitted 12 August, 2022; v1 submitted 12 July, 2022;
originally announced July 2022.
-
Contrastive language and vision learning of general fashion concepts
Authors:
Patrick John Chia,
Giuseppe Attanasio,
Federico Bianchi,
Silvia Terragni,
Ana Rita Magalhães,
Diogo Goncalves,
Ciro Greco,
Jacopo Tagliabue
Abstract:
The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like mo…
▽ More
The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release our model and code to the community.
△ Less
Submitted 18 April, 2023; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists
Authors:
Giuseppe Attanasio,
Debora Nozza,
Dirk Hovy,
Elena Baralis
Abstract:
Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. E.g., neural hate speech detection models are strongly influenced by identity terms like gay, or women, resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity term…
▽ More
Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. E.g., neural hate speech detection models are strongly influenced by identity terms like gay, or women, resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity terms or samples from the target domain during training. However, this approach requires a-priori knowledge and introduces further bias if important terms are neglected. Instead, we propose a knowledge-free Entropy-based Attention Regularization (EAR) to discourage overfitting to training-specific terms. An additional objective function penalizes tokens with low self-attention entropy. We fine-tune BERT via EAR: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian. EAR also reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.
△ Less
Submitted 17 March, 2022;
originally announced March 2022.
-
Contrastive Language-Image Pre-training for the Italian Language
Authors:
Federico Bianchi,
Giuseppe Attanasio,
Raphael Pisoni,
Silvia Terragni,
Gabriele Sarti,
Sri Lakshmi
Abstract:
CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since data in other languages might be not enough and the model needs hi…
▽ More
CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since data in other languages might be not enough and the model needs high-quality translations of the texts to guarantee a good performance. In this paper, we present the first CLIP model for the Italian Language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.
△ Less
Submitted 19 August, 2021;
originally announced August 2021.