arXiv:2407.18528

Time performance of Analog Pixel Test Structures with in-chip operational amplifier implemented in 65 nm CMOS imaging process

Authors: Gianluca Aglieri Rinella, Luca Aglietta, Matias Antonelli, Francesco Barile, Franco Benotto, Stefania Maria Beolè, Elena Botta, Giuseppe Eugenio Bruno, Francesca Carnesecchi, Domenico Colella, Angelo Colelli, Giacomo Contin, Giuseppe De Robertis, Florina Dumitrache, Domenico Elia, Chiara Ferrero, Martin Fransen, Alex Kluge, Shyam Kumar, Corentin Lemoine, Francesco Licciulli, Bong-Hwi Lim, Flavio Loddo, Magnus Mager, Davide Marras, et al. (21 additional authors not shown)

Abstract: In the context of the CERN EP R&D on monolithic sensors and the ALICE ITS3 upgrade, the Tower Partners Semiconductor Co (TPSCo) 65 nm process has been qualified for use in high energy physics, and adopted for the ALICE ITS3 upgrade. An Analog Pixel Test Structure (APTS) featuring fast per pixel operational-amplifier-based buffering for a small matrix of four by four pixels, with a sensor with a small collection electrode and a very non-uniform electric field, was designed to allow detailed characterization of the pixel performance in this technology. Several variants of this chip with different pixel designs have been characterized with a (120 GeV/$c$) positive hadron beam. This result indicates that the APTS-OA prototype variants with the best performance achieve a time resolution of 63 ps with a detection efficiency exceeding 99% and a spatial resolution of 2 $μ$m, highlighting the potential of TPSCo 65nm CMOS imaging technology for high-energy physics and other fields requiring precise time measurement, high detection efficiency, and excellent spatial resolution.

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2403.08952

Characterisation of analogue Monolithic Active Pixel Sensor test structures implemented in a 65 nm CMOS imaging process

Authors: Gianluca Aglieri Rinella, Giacomo Alocco, Matias Antonelli, Roberto Baccomi, Stefania Maria Beole, Mihail Bogdan Blidaru, Bent Benedikt Buttwill, Eric Buschmann, Paolo Camerini, Francesca Carnesecchi, Marielle Chartier, Yongjun Choi, Manuel Colocci, Giacomo Contin, Dominik Dannheim, Daniele De Gruttola, Manuel Del Rio Viera, Andrea Dubla, Antonello di Mauro, Maurice Calvin Donner, Gregor Hieronymus Eberwein, Jan Egger, Laura Fabbietti, Finn Feindt, Kunal Gautam, et al. (69 additional authors not shown)

Abstract: Analogue test structures were fabricated using the Tower Partners Semiconductor Co. CMOS 65 nm ISC process. The purpose was to characterise and qualify this process and to optimise the sensor for the next generation of Monolithic Active Pixels Sensors for high-energy physics. The technology was explored in several variants which differed by: doping levels, pixel geometries and pixel pitches (10-25 $μ$m). These variants have been tested following exposure to varying levels of irradiation up to 3 MGy and $10^{16}$ 1 MeV n$_\text{eq}$ cm$^{-2}$. Here the results from prototypes that feature direct analogue output of a 4$\times$4 pixel matrix are reported, allowing the systematic and detailed study of charge collection properties. Measurements were taken both using $^{55}$Fe X-ray sources and in beam tests using minimum ionizing particles. The results not only demonstrate the feasibility of using this technology for particle detection but also serve as a reference for future applications and optimisations.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2306.06804

Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction

Authors: Manuel Mager, Rajat Bhatnagar, Graham Neubig, Ngoc Thang Vu, Katharina Kann

Abstract: Neural models have drastically advanced state of the art for machine translation (MT) between high-resource languages. Traditionally, these models rely on large amounts of training data, but many language pairs lack these resources. However, an important part of the languages in the world do not have this amount of data. Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any. Here, we present an introduction to the interested reader to the basic challenges, concepts, and techniques that involve the creation of MT systems for these languages. Finally, we discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.

Submitted 11 June, 2023; originally announced June 2023.

Comments: Accepted to AmericasNLP 2023

arXiv:2305.19474

Ethical Considerations for Machine Translation of Indigenous Languages: Giving a Voice to the Speakers

Authors: Manuel Mager, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

Abstract: In recent years machine translation has become very successful for high-resource language pairs. This has also sparked new interest in research on the automatic translation of low-resource languages, including Indigenous languages. However, the latter are deeply related to the ethnic and cultural groups that speak (or used to speak) them. The data collection, modeling and deploying machine translation systems thus result in new ethical questions that must be addressed. Motivated by this, we first survey the existing literature on ethical considerations for the documentation, translation, and general natural language processing for Indigenous languages. Afterward, we conduct and analyze an interview study to shed light on the positions of community leaders, teachers, and language activists regarding ethical concerns for the automatic translation of their languages. Our results show that the inclusion, at different degrees, of native speakers and community members is vital to performing better and more ethical research on Indigenous languages.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted to ACL2023 Main Conference

arXiv:2305.17154

On convex decision regions in deep network representations

Authors: Lenka Tětková, Thea Brüsch, Teresa Karen Scheidt, Fabian Martin Mager, Rasmus Ørtoft Aagaard, Jonathan Foldager, Tommy Sonne Alstrøm, Lars Kai Hansen

Abstract: Current work on human-machine alignment aims at understanding machine-learned latent spaces and their correspondence to human representations. Gärdenfors' conceptual spaces is a prominent framework for understanding human representations. Convexity of object regions in conceptual spaces is argued to promote generalizability, few-shot learning, and interpersonal alignment. Based on these insights, we investigate the notion of convexity of concept regions in machine-learned latent spaces. We develop a set of tools for measuring convexity in sampled data and evaluate emergent convexity in layered representations of state-of-the-art deep networks. We show that convexity is robust to basic re-parametrization and, hence, meaningful as a quality of machine-learned latent spaces. We find that approximate convexity is pervasive in neural representations in multiple application domains, including models of images, audio, human activity, text, and medical images. Generally, we observe that fine-tuning increases the convexity of label regions. We find evidence that pretraining convexity of class label regions predicts subsequent fine-tuning performance.

Submitted 6 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

arXiv:2212.08621

doi 10.1016/j.nima.2023.168589

Digital Pixel Test Structures implemented in a 65 nm CMOS process

Authors: Gianluca Aglieri Rinella, Anton Andronic, Matias Antonelli, Mauro Aresti, Roberto Baccomi, Pascal Becht, Stefania Beole, Justus Braach, Matthew Daniel Buckland, Eric Buschmann, Paolo Camerini, Francesca Carnesecchi, Leonardo Cecconi, Edoardo Charbon, Giacomo Contin, Dominik Dannheim, Joao de Melo, Wenjing Deng, Antonello di Mauro, Jan Hasenbichler, Hartmut Hillemanns, Geun Hee Hong, Artem Isakov, Antoine Junique, Alex Kluge, et al. (27 additional authors not shown)

Abstract: The ALICE ITS3 (Inner Tracking System 3) upgrade project and the CERN EP R&D on monolithic pixel sensors are investigating the feasibility of the Tower Partners Semiconductor Co. 65 nm process for use in the next generation of vertex detectors. The ITS3 aims to employ wafer-scale Monolithic Active Pixel Sensors thinned down to 20 to 40 um and bent to form truly cylindrical half barrels. Among the first critical steps towards the realisation of this detector is to validate the sensor technology through extensive characterisation both in the laboratory and with in-beam measurements. The Digital Pixel Test Structure (DPTS) is one of the prototypes produced in the first sensor submission in this technology and has undergone a systematic measurement campaign whose details are presented in this article. The results confirm the goals of detection efficiency and non-ionising and ionising radiation hardness up to the expected levels for ALICE ITS3 and also demonstrate operation at +20 C and a detection efficiency of 99% for a DPTS irradiated with a dose of $10^{15}$ 1 MeV n$_{\mathrm{eq}}/$cm$^2$. Furthermore, spatial, timing and energy resolutions were measured at various settings and irradiation levels.

Submitted 10 July, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

Comments: v4: Corrected Table 1. v3: Implemented reviewers' comments. v2: Updated threshold calibration method. Implemented colorblind friendly color palette in all figures. Updated references

arXiv:2210.06990

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Authors: Marwa Gaser, Manuel Mager, Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Abstract: Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.

Submitted 30 April, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted to EACL 2023

arXiv:2209.02511

doi 10.1088/1748-0221/18/01/P01038

Performance of the Electromagnetic Pixel Calorimeter Prototype EPICAL-2

Authors: J. Alme, R. Barthel, A. van Bochove, V. Borshchov, R. Bosley, A. van den Brink, E. Broeils, H. Büsching, V. N. Eikeland, O. S. Groettvik, Y. H. Han, N. van der Kolk, J. H. Kim, T. J. Kim, Y. Kwon, M. Mager, Q. W. Malik, E. Okkinga, T. Y. Park, T. Peitzmann, F. Pliquett, M. Protsenko, F. Reidt, S. van Rijk, K. Røed, et al. (9 additional authors not shown)

Abstract: The first evaluation of an ultra-high granularity digital electromagnetic calorimeter prototype using 1.0-5.8 GeV/c electrons is presented. The $25\times10^6$ pixel detector consists of 24 layers of ALPIDE CMOS MAPS sensors, with a pitch of around 30 $μ$m, and has a depth of almost 20 radiation lengths of tungsten absorber. Ultra-thin cables allow for a very compact design. The properties that are critical for physics studies are measured: electromagnetic shower response, energy resolution and linearity. The stochastic energy resolution is comparable with the state-of-the-art resolution for a Si-W calorimeter, with data described well by a simulation model using GEANT and Allpix$^2$. The performance achieved makes this technology a good candidate for use in the ALICE FoCal upgrade, and in general demonstrates the strong potential for future applications in high-energy physics.

Submitted 28 December, 2022; v1 submitted 6 September, 2022; originally announced September 2022.

Comments: 30 pages, 19 figures, submitted to JINST

arXiv:2207.01815

doi 10.1016/j.nima.2022.167539

Results from the EPICAL-2 Ultra-High Granularity Electromagnetic Calorimeter Prototype

Authors: T. Peitzmann, J. Alme, R. Barthel, A. van Bochove, V. Borshchov, R. Bosley, A. van den Brink, E. Broeils, H. Büsching, V. N. Eikeland, O. S. Groettvik, Y. H. Han, N. van der Kolk, J. H. Kim, T. J. Kim, Y. Kwon, M. Mager, Q. W. Malik, E. Okkinga, T. Y. Park, F. Pliquett, M. Protsenko, F. Reidt, S. van Rijk, K. Røed, et al. (9 additional authors not shown)

Abstract: A prototype of a new type of calorimeter has been designed and constructed, based on a silicon-tungsten sampling design using pixel sensors with digital readout. It makes use of the Alpide MAPS sensor developed for the ALICE ITS upgrade. A binary readout is possible due to the pixel size of $\approx 30 \times 30 \, μ\mathrm{m}^2$. This prototype has been successfully tested with cosmic muons and with test beams at DESY and the CERN SPS. We report on performance results obtained at DESY, showing good energy resolution and linearity, and compare to detailed MC simulations. Also shown are preliminary results of the high-energy performance as measured at the SPS. The two-shower separation capabilities are discussed.

Submitted 27 September, 2022; v1 submitted 5 July, 2022; originally announced July 2022.

Comments: Proceedings to PM2021 - The 15. PISA Meeting on Advanced Detectors, updated after referee review

arXiv:2205.12669

doi 10.1016/j.nima.2022.167673

The MAPS foil

Authors: S. Beolé, F. Carnesecchi, G. Contin, R. de Oliveira, A. di Mauro, S. Ferry, H. Hillemanns, A. Junique, A. Kluge, L. Lautner, M. Mager, B. Mehl, K. Rebane, F. Reidt, I. Sanna, M. Šuljić, A. Yüncü

Abstract: We present a method of embedding a Monolithic Active Pixel Sensor (MAPS) into a flexible printed circuit board (FPC) and its interconnection by means of through-hole copper plating. The resulting assembly, baptised "MAPS foil", is a flexible, light, protected, and fully integrated detector module. By using widely available printed circuit board manufacturing techniques, the production of these devices can be scaled easily in size and volume, making it a compelling candidate for future large-scale applications. A first series of prototypes that embed the ALPIDE chip has been produced, functionally tested, and shown to be working.

Submitted 19 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

arXiv:2203.08954

BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Authors: Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

Abstract: Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri--Spanish.

Submitted 16 March, 2022; originally announced March 2022.

Comments: Accepted to Findings of ACL 2022

arXiv:2111.11880

doi 10.1038/s41467-022-30843-1

CRISPR SWAPnDROP -- A multifunctional system for genome editing and large-scale interspecies gene transfer

Authors: Marc Teufel, Carlo A. Klein, Maurice Mager, Patrick Sobetzko

Abstract: The need for diverse chromosomal modifications in biotechnology, synthetic biology and basic research requires the development of new technologies. With CRISPR SWAPnDROP, we extend the limits of genome editing to large-scale in-vivo DNA transfer between bacterial species. Its modular platform approach facilitates species specific adaptation to confer genome editing in various species. In this study, we show the implementation of the CRISPR SWAPnDROP concept for the model organism Escherichia coli and the currently fastest growing and biotechnologically relevant organism Vibrio natriegens. We demonstrate the excision, transfer and integration of 151kb chromosomal DNA between E. coli strains and from E. coli to V. natriegens without size-limiting intermediate DNA extraction. With the transfer of the E. coli MG1655 wild type lac operon, we establish a functional lactose and galactose degradation pathway in V. natriegens to extend its biotechnological spectrum. We also transfer the E. coli DH5alpha lac operon and make V. natriegens capable of alpha-complementation - a step towards an ultra-fast cloning strain. Furthermore, CRISPR SWAPnDROP is designed to be the swiss army knife of genome engineering. Its spectrum of application comprises scarless, marker-free, iterative and parallel insertions and deletions, genome rearrangements, as well as gene transfer between strains and across species. The modular character facilitates DNA library applications and the recycling of standardized parts. Its novel multi-color scarless co-selection system significantly improves editing efficiency to 92% for single edits and 83% for quadruple edits and provides visual quality controls throughout the assembly and editing process.

Submitted 23 November, 2021; originally announced November 2021.

arXiv:2106.16055

IMS' Systems for the IWSLT 2021 Low-Resource Speech Translation Task

Authors: Pavel Denisov, Manuel Mager, Ngoc Thang Vu

Abstract: This paper describes the submission to the IWSLT 2021 Low-Resource Speech Translation Shared Task by IMS team. We utilize state-of-the-art models combined with several data augmentation, multi-task and transfer learning approaches for the automatic speech recognition (ASR) and machine translation (MT) steps of our cascaded system. Moreover, we also explore the feasibility of a full end-to-end speech translation (ST) model in the case of very constrained amount of ground truth labeled data. Our best system achieves the best performance among all submitted systems for Congolese Swahili to English and French with BLEU scores 7.7 and 13.7 respectively, and the second best result for Coastal Swahili to English with BLEU score 14.9.

Submitted 30 June, 2021; originally announced June 2021.

Comments: IWSLT 2021

arXiv:2105.13000

First demonstration of in-beam performance of bent Monolithic Active Pixel Sensors

Authors: ALICE ITS project: G. Aglieri Rinella, M. Agnello, B. Alessandro, F. Agnese, R. S. Akram, J. Alme, E. Anderssen, D. Andreou, F. Antinori, N. Apadula, P. Atkinson, R. Baccomi, A. Badalà, A. Balbino, C. Bartels, R. Barthel, F. Baruffaldi, I. Belikov, S. Beole, P. Becht, A. Bhatti, M. Bhopal, N. Bianchi, et al. (230 additional authors not shown)

Abstract: A novel approach for designing the next generation of vertex detectors foresees to employ wafer-scale sensors that can be bent to truly cylindrical geometries after thinning them to thicknesses of 20-40$μ$m. To solidify this concept, the feasibility of operating bent MAPS was demonstrated using 1.5$\times$3cm ALPIDE chips. Already with their thickness of 50$μ$m, they can be successfully bent to radii of about 2cm without any signs of mechanical or electrical damage. During a subsequent characterisation using a 5.4GeV electron beam, it was further confirmed that they preserve their full electrical functionality as well as particle detection performance. In this article, the bending procedure and the setup used for characterisation are detailed. Furthermore, the analysis of the beam test, including the measurement of the detection efficiency as a function of beam position and local inclination angle, is discussed. The results show that the sensors maintain their excellent performance after bending to radii of 2cm, with detection efficiencies above 99.9% at typical operating conditions, paving the way towards a new class of detectors with unprecedented low material budget and ideal geometrical properties.

Submitted 17 August, 2021; v1 submitted 27 May, 2021; originally announced May 2021.

arXiv:2104.08726

AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages

Authors: Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios, Ivan Meza-Ruiz, Gustavo A. Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando Coto-Solano, Ngoc Thang Vu, Katharina Kann

Abstract: Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen languages has largely been limited to low-level, syntactic tasks, and it remains unclear if zero-shot learning of high-level, semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, an extension of XNLI (Conneau et al., 2018) to 10 indigenous languages of the Americas. We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches. Additionally, we explore model adaptation via continued pretraining and provide an analysis of the dataset by considering hypothesis-only models. We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%. Continued pretraining offers improvements, with an average accuracy of 44.05%. Surprisingly, training on poorly translated data by far outperforms all other methods with an accuracy of 48.72%.

Submitted 16 March, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: Accepted to ACL 2022

arXiv:2010.02804

Tackling the Low-resource Challenge for Canonical Segmentation

Authors: Manuel Mager, Özlem Çetinoğlu, Katharina Kann

Abstract: Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude that canonical segmentation is still a challenging task for low-resource languages.

Submitted 6 October, 2020; originally announced October 2020.

Comments: Accepted to EMNLP 2020

arXiv:2009.10517

doi 10.1016/j.nima.2020.164859

Charge collection properties of TowerJazz 180 nm CMOS Pixel Sensors in dependence of pixel geometries and bias parameters, studied using a dedicated test-vehicle: the Investigator chip

Authors: G. Aglieri Rinella, G. Chaosong, A. di Mauro, J. Eum, H. Hillemanns, A. Junique, M. Keil, D. Kim, H. Kim, T. Kugathasan, S. Lee, M. Mager, V. Manzari, C. A. Marin Tobon, P. Martinengo, H. Mugnier, L. Musa, F. Reidt, J. Rousset, K. Sielewicz, W. Snoeys, M. Šuljić, J. W. van Hoorne, Q. M. Waheed, P. Yang, et al. (1 additional author not shown)

Abstract: This paper contains a compilation of parameters influencing the charge collection process extracted from a comprehensive study of partially depleted Monolithic Active Pixel Sensors with small (<25 um$^2$) collection electrodes fabricated in the TowerJazz 180 nm CMOS process. These results gave guidance for the optimisation of the diode implemented in ALPIDE, the chip used in the second generation Inner Tracking System of ALICE, and serve as reference for future simulation studies of similar devices. The studied parameters include: reverse substrate bias, epitaxial layer thickness, charge collection electrode size and the spacing of the electrode to surrounding in-pixel electronics. The results from pixels of 28 um pitch confirm that even in partially depleted circuits, charge collection can be fast (<10 ns), and quantify the influence of the parameters onto the signal sharing and amplitudes, highlighting the importance of a correct spacing between wells and of the impact of the reverse substrate bias.

Submitted 23 September, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

arXiv:2005.12411 [pdf, other]

The IMS-CUBoulder System for the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion

Authors: Manuel Mager, Katharina Kann

Abstract: In this paper, we present the systems of the University of Stuttgart IMS and the University of Colorado Boulder (IMS-CUBoulder) for SIGMORPHON 2020 Task 2 on unsupervised morphological paradigm completion (Kann et al., 2020). The task consists of generating the morphological paradigms of a set of lemmas, given only the lemmas themselves and unlabeled text. Our proposed system is a modified version of the baseline introduced together with the task. In particular, we experiment with substituting the inflection generation component with an LSTM sequence-to-sequence model and an LSTM pointer-generator network. Our pointer-generator system obtains the best score of all seven submitted systems on average over all languages, and outperforms the official baseline, which was best overall, on Bulgarian and Kannada.

Submitted 25 May, 2020; originally announced May 2020.

arXiv:2005.09123 [pdf, ps, other]

GPT-too: A language-model-first approach for AMR-to-text generation

Authors: Manuel Mager, Ramon Fernandez Astudillo, Tahira Naseem, Md Arafat Sultan, Young-Suk Lee, Radu Florian, Salim Roukos

Abstract: Abstract Meaning Representations (AMRs) are broad-coverage sentence-level semantic graphs. Existing approaches to generating text from AMR have focused on training sequence-to-sequence or graph-to-sequence models on AMR annotated data only. In this paper, we propose an alternative approach that combines a strong pre-trained language model with cycle consistency-based re-scoring. Despite the simplicity of the approach, our experimental results show these models outperform all previous techniques on the English LDC2017T10 dataset, including the recent use of transformer architectures. In addition to the standard evaluation metrics, we provide human evaluation experiments that further substantiate the strength of our approach.

Submitted 27 May, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: Paper accepted to the Annual Meeting of the Association for Computational Linguistics (ACL 2020)

arXiv:1904.01989 [pdf, other]

Subword-Level Language Identification for Intra-Word Code-Switching

Authors: Manuel Mager, Özlem Çetinoğlu, Katharina Kann

Abstract: Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the language identification task to the subword-level, such that it includes splitting mixed words while tagging each part with a language ID. We further propose a model for this task, which is based on a segmental recurrent neural network. In experiments on a new Spanish--Wixarika dataset and on an adapted German--Turkish dataset, our proposed model performs slightly better than or roughly on par with our best baseline, respectively. Considering only mixed words, however, it strongly outperforms all baselines.

Submitted 3 April, 2019; originally announced April 2019.

Comments: NAACL-HLT 2019

arXiv:1807.00286 [pdf, ps, other]

Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages

Authors: Manuel Mager, Elisabeth Mager, Alfonso Medina-Urrea, Ivan Meza, Katharina Kann

Abstract: Machine translation from polysynthetic to fusional languages is a challenging task, which gets further complicated by the limited amount of parallel text available. Thus, translation performance is far from the state of the art for high-resource and more intensively studied language pairs. To shed light on the phenomena which hamper automatic translation to and from polysynthetic languages, we study translations from three low-resource, polysynthetic languages (Nahuatl, Wixarika and Yorem Nokki) into Spanish and vice versa. Doing so, we find that in a morpheme-to-morpheme alignment an important amount of information contained in polysynthetic morphemes has no Spanish counterpart, and its translation is often omitted. We further conduct a qualitative analysis and, thus, identify morpheme types that are commonly hard to align or ignored in the translation process.

Submitted 1 July, 2018; originally announced July 2018.

Comments: To appear in "All Together Now? Computational Modeling of Polysynthetic Languages" Workshop, at COLING 2018

arXiv:1806.04291 [pdf, ps, other]

Challenges of language technologies for the indigenous languages of the Americas

Authors: Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, Ivan Meza

Abstract: Indigenous languages of the American continent are highly diverse. However, they have received little attention from the technological perspective. In this paper, we review the research, the digital resources and the available NLP systems that focus on these languages. We present the main challenges and research questions that arise when distant languages and low-resource scenarios are faced. We would like to encourage NLP research in linguistically rich and diverse areas like the Americas.

Submitted 11 June, 2018; originally announced June 2018.

Comments: In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)

arXiv:1804.06024 [pdf, ps, other]

Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages

Authors: Katharina Kann, Manuel Mager, Ivan Meza-Ruiz, Hinrich Schütze

Abstract: Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approaches (one with, one without need for external unlabeled resources) and two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multi-lingual model for related languages while maintaining comparable or even improved performance, thus reducing the number of parameters by close to 75%. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research.

Submitted 16 April, 2018; originally announced April 2018.

Comments: Long Paper, 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

arXiv:1203.3641 [pdf, ps, other]

doi 10.1016/j.physletb.2012.10.078

Inclusive J/psi production in pp collisions at sqrt(s) = 2.76 TeV

Authors: ALICE Collaboration, B. Abelev, J. Adam, D. Adamova, A. M. Adare, M. M. Aggarwal, G. Aglieri Rinella, A. G. Agocs, A. Agostinelli, S. Aguilar Salazar, Z. Ahammed, A. Ahmad Masoodi, N. Ahmad, S. U. Ahn, A. Akindinov, D. Aleksandrov, B. Alessandro, R. Alfaro Molina, A. Alici, A. Alkin, E. Almaraz Avina, J. Alme, T. Alt, V. Altini, S. Altinpinar , et al. (948 additional authors not shown)

Abstract: The ALICE Collaboration has measured inclusive J/psi production in pp collisions at a center of mass energy sqrt(s)=2.76 TeV at the LHC. The results presented in this Letter refer to the rapidity ranges |y|<0.9 and 2.5<y<4 and have been obtained by measuring the electron and muon pair decay channels, respectively. The integrated luminosities for the two channels are L^e_int=1.1 nb^-1 and L^mu_int=19.9 nb^-1, and the corresponding signal statistics are N_J/psi^e+e-=59 +/- 14 and N_J/psi^mu+mu-=1364 +/- 53. We present dsigma_J/psi/dy for the two rapidity regions under study and, for the forward-y range, d^2sigma_J/psi/dydp_t in the transverse momentum domain 0<p_t<8 GeV/c. The results are compared with previously published results at sqrt(s)=7 TeV and with theoretical calculations.

Submitted 6 November, 2012; v1 submitted 16 March, 2012; originally announced March 2012.

Comments: 7 figures, 3 tables, accepted for publication in Phys. Lett. B

Report number: CERN-PH-EP-2012-055

Journal ref: Phys.Lett.B 718 (2012) 295-306, Phys.Lett.B 748 (2015) 472-473 (erratum)

arXiv:1110.3232 [pdf, other]

Measurement of single event upsets in the ALICE-TPC front-end electronics

Authors: M. Mager, L. Musa, A. Rehman, A. Szczepankiewicz

Abstract: The Time Projection Chamber of the ALICE experiment at the CERN Large Hadron Collider features highly integrated on-detector read-out electronics. It is following the general trend of high energy physics experiments by placing the front-end electronics as close to the detector as possible -- only some 10 cm away from its active volume. Being located close to the beams and the interaction region, the electronics is subject to a moderate radiation load, which allowed us to use commercial off-the-shelf components. However, they needed to be selected and qualified carefully for radiation hardness and means had to be taken to protect their functionality against soft errors, i.e. single event upsets. Here we report on the first measurements of LHC induced radiation effects on ALICE front-end electronics and on how they attest to expectations.

Submitted 14 October, 2011; originally announced October 2011.

arXiv:1001.1950 [pdf, other]

doi 10.1016/j.nima.2010.04.042

The ALICE TPC, a large 3-dimensional tracking device with fast readout for ultra-high multiplicity events

Authors: J. Alme, Y. Andres, H. Appelshauser, S. Bablok, N. Bialas, R. Bolgen, U. Bonnes, R. Bramm, P. Braun-Munzinger, R. Campagnolo, P. Christiansen, A. Dobrin, C. Engster, D. Fehlker, P. Foka, U. Frankenfeld, J. J. Gaardhoje, C. Garabatos, P. Glassel, C. Gonzalez Gutierrez, P. Gros, H. -A. Gustafsson, H. Helstrup, M. Hoch, M. Ivanov , et al. (51 additional authors not shown)

Abstract: The design, construction, and commissioning of the ALICE Time-Projection Chamber (TPC) is described. It is the main device for pattern recognition, tracking, and identification of charged particles in the ALICE experiment at the CERN LHC. The TPC is cylindrical in shape with a volume close to 90 m^3 and is operated in a 0.5 T solenoidal magnetic field parallel to its axis. In this paper we describe in detail the design considerations for this detector for operation in the extreme multiplicity environment of central Pb--Pb collisions at LHC energy. The implementation of the resulting requirements into hardware (field cage, read-out chambers, electronics), infrastructure (gas and cooling system, laser-calibration system), and software led to many technical innovations which are described along with a presentation of all the major components of the detector, as currently realized. We also report on the performance achieved after completion of the first round of stand-alone calibration runs and demonstrate results close to those specified in the TPC Technical Design Report.

Submitted 12 January, 2010; originally announced January 2010.

Comments: 55 pages, 82 figures

Journal ref: Nucl.Instrum.Meth.A622:316-367,2010

Showing 1–26 of 26 results for author: Mager, M