-
Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models
Authors:
Zaid Alyafeai,
Maged S. Alshaibani,
Badr AlKhamissi,
Hamzah Luqman,
Ebrahim Alareqi,
Ali Fadel
Abstract:
Large language models (LLMs), including ChatGPT, a chat-based model built on top of LLMs such as GPT-3.5 and GPT-4, have demonstrated impressive performance on various downstream tasks without requiring fine-tuning. Although languages other than English make up a smaller proportion of their training data, these models also exhibit remarkable capabilities in those languages. In this study, we assess the performance of GPT-3.5 and GPT-4 models on seven distinct Arabic NLP tasks: sentiment analysis, translation, transliteration, paraphrasing, part-of-speech tagging, summarization, and diacritization. Our findings reveal that GPT-4 outperforms GPT-3.5 on five out of the seven tasks. Furthermore, we conduct an extensive analysis of the sentiment analysis task, providing insights into how LLMs achieve exceptional results on a challenging dialectal dataset. Additionally, we introduce a new Python interface, https://github.com/ARBML/Taqyim, that makes these tasks easy to evaluate.
Submitted 28 June, 2023;
originally announced June 2023.
-
Heuristic Algorithm for Univariate Stratification Problem
Authors:
José Brito,
Gustavo Semaan,
Leonardo de Lima,
Augusto Fadel
Abstract:
In sampling theory, stratification is a technique used in surveys that segments a population into homogeneous subpopulations (strata) to produce statistics with a higher level of precision. This article proposes a heuristic to solve the univariate stratification problem, which is widely studied in the literature. One of its versions fixes the number of strata and the precision level and seeks the limits that define the strata so as to minimize the sample size allocated to them. A heuristic based on a stochastic optimization method and an exact optimization method was developed to achieve this goal. The performance of this heuristic was evaluated through computational experiments on various populations used in other works in the literature, based on 20 scenarios that combine different numbers of strata and levels of precision. The results show that the heuristic outperformed four algorithms from the literature in more than 94% of the cases, particularly the well-known algorithms of Kozak and Lavallee-Hidiroglou.
Submitted 19 November, 2022;
originally announced November 2022.
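The objective described in the abstract above (fixed number of strata and precision level, minimize the sample size) can be illustrated with a minimal sketch. It is not the authors' heuristic: it only evaluates the sample size that a given set of cut points would require, and it assumes Neyman allocation and a target coefficient of variation for the estimated mean, neither of which the abstract specifies. The function name `neyman_sample_size` and its parameters are hypothetical.

```python
import math
import statistics


def neyman_sample_size(values, boundaries, target_cv):
    """Total sample size needed to reach target_cv for the stratified mean.

    Assumption (not from the abstract): Neyman allocation, for which
    n = (sum N_h S_h)^2 / (N^2 V + sum N_h S_h^2), with V = (cv * mean)^2.
    """
    # Turn the cut points into half-open intervals (lo, hi] covering the real line.
    edges = [-math.inf, *sorted(boundaries), math.inf]
    strata = [
        [v for v in values if lo < v <= hi]
        for lo, hi in zip(edges, edges[1:])
    ]
    strata = [s for s in strata if s]  # drop empty strata

    population_size = len(values)
    target_variance = (target_cv * statistics.fmean(values)) ** 2

    sum_n_s = 0.0   # sum of N_h * S_h
    sum_n_s2 = 0.0  # sum of N_h * S_h^2
    for stratum in strata:
        # A stratum with a single unit has no within-stratum variability.
        sd = statistics.stdev(stratum) if len(stratum) > 1 else 0.0
        sum_n_s += len(stratum) * sd
        sum_n_s2 += len(stratum) * sd ** 2

    return sum_n_s ** 2 / (population_size ** 2 * target_variance + sum_n_s2)
```

A search procedure, heuristic or exact, would call a function like this repeatedly while moving the cut points, keeping the boundaries that give the smallest result.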
-
Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets
Authors:
Yousef Altaher,
Ali Fadel,
Mazen Alotaibi,
Mazen Alyazidi,
Mishari Al-Mutairi,
Mutlaq Aldhbuiub,
Abdulrahman Mosaibah,
Abdelrahman Rezk,
Abdulrazzaq Alhendi,
Mazen Abo Shal,
Emad A. Alghamdi,
Maged S. Alshaibani,
Jezia Zakraoui,
Wafaa Mohammed,
Kamel Gaanoun,
Khalid N. Elmadani,
Mustafa Ghaleb,
Nouamane Tazi,
Raed Alharbi,
Maraim Masoud,
Zaid Alyafeai
Abstract:
Masader (Alyafeai et al., 2021) created a metadata structure for cataloguing Arabic NLP datasets. However, developing an easy way to explore such a catalogue is a challenging task. To give users and researchers the best experience when exploring the catalogue, several design and user experience challenges must be resolved. Furthermore, user interactions with the website may provide an easy way to improve the catalogue. In this paper, we introduce Masader Plus, a web interface for users to browse Masader. We demonstrate data exploration, filtration, and a simple API that allows users to examine datasets from the backend. Masader Plus can be explored at https://arbml.github.io/masader. A video recording explaining the interface is available at https://www.youtube.com/watch?v=SEtdlSeqchk.
Submitted 1 August, 2022;
originally announced August 2022.
-
Scoring Aave Accounts for Creditworthiness
Authors:
Will Wolf,
Aaron Henry,
Hamza Al Fadel,
Xavier Quintuna,
Julian Gay
Abstract:
Scoring the creditworthiness of accounts that interact with decentralized financial (DeFi) protocols remains an important yet unsolved problem. In this paper, we propose a credit scoring system for those accounts that have interacted with the Aave v2 liquidity protocol. The key component of this system is a tree-based binary classifier that predicts "position delinquency." To the community, we provide our method, results, and the (abridged) dataset on which this system is built.
Submitted 14 July, 2022;
originally announced July 2022.
-
Utilization of 3D segmentation for measurement of pediatric brain tumor outcomes after treatment: review of available free tools, step-by-step instructions, and applications to clinical practice
Authors:
Marina Kazarian,
Sandra Abi Fadel,
Amit Mahajan,
Mariam Aboian
Abstract:
Volumetric measurements are known to provide more information for segmenting tumors than one- and two-dimensional measurements, and thus can lead to better-informed therapy. In this work, we review the free and easily accessible computer platforms available for conducting these 3D measurements, such as Horos and 3D Slicer, and compare the segmentations to those of the commercial Visage software. We compare the time for 3D segmentation of tumors and demonstrate how to use a novel plugin that we developed in 3D Slicer for the efficient and accurate segmentation of the cystic component of a tumor.
Submitted 30 August, 2020;
originally announced August 2020.
-
Large scale simulation of pressure induced phase-field fracture propagation using Utopia
Authors:
Patrick Zulian,
Alena Kopaničáková,
Maria Giuseppina Chiara Nestola,
Andreas Fink,
Nur Aiman Fadel,
Joost Vandevondele,
Rolf Krause
Abstract:
Non-linear phase field models are increasingly used for the simulation of fracture propagation models. The numerical simulation of fracture networks of realistic size requires the efficient parallel solution of large coupled non-linear systems. Although in principle efficient iterative multi-level methods for these types of problems are available, they are not widely used in practice due to the complexity of their parallel implementation.
Here, we present Utopia, an open-source C++ library for parallel non-linear multilevel solution strategies. Utopia provides the advantages of high-level programming interfaces while at the same time providing a framework to access low-level data structures without breaking code encapsulation. Complex numerical procedures can be expressed in a few lines of code and evaluated by different implementations, libraries, or computing hardware. In this paper, we investigate the parallel performance of our implementation of the recursive multilevel trust-region (RMTR) method based on the Utopia library. RMTR is a globally convergent multilevel solution strategy designed to solve non-convex constrained minimization problems. In particular, we solve pressure-induced phase-field fracture propagation in large and complex fracture networks. Solving such problems is deemed challenging even for a few fractures; here, however, we consider networks of realistic size with up to 1000 fractures.
Submitted 25 July, 2020;
originally announced July 2020.
-
Tha3aroon at NSURL-2019 Task 8: Semantic Question Similarity in Arabic
Authors:
Ali Fadel,
Ibraheem Tuffaha,
Mahmoud Al-Ayyoub
Abstract:
In this paper, we describe our team's effort on the semantic text question similarity task of NSURL 2019. Our top-performing system utilizes several innovative data augmentation techniques to enlarge the training data. Then, it takes ELMo pre-trained contextual embeddings of the data and feeds them into an ON-LSTM network with self-attention. This results in sequence representation vectors that are used to predict the relation between the question pairs. The model ranked 1st on the public leaderboard with an F1-score of 96.499 (the same F1-score as second place) and 2nd on the private leaderboard with an F1-score of 94.848 (1.076 behind first place).
Submitted 28 December, 2019;
originally announced December 2019.
-
Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation
Authors:
Ali Fadel,
Ibraheem Tuffaha,
Bara' Al-Jawarneh,
Mahmoud Al-Ayyoub
Abstract:
In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models are built using two main approaches, viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Field (CRF) and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset and the results show that our models are either better or on par with other models, which require language-dependent post-processing steps, unlike ours. Moreover, we show that diacritics in Arabic can be used to enhance the models of NLP tasks such as Machine Translation (MT) by proposing the Translation over Diacritization (ToD) approach.
Submitted 8 November, 2019;
originally announced November 2019.
-
Arabic Text Diacritization Using Deep Neural Networks
Authors:
Ali Fadel,
Ibraheem Tuffaha,
Bara' Al-Jawarneh,
Mahmoud Al-Ayyoub
Abstract:
Diacritization of Arabic text is both an interesting and a challenging problem, with applications ranging from speech synthesis to helping students learn the Arabic language. As with many other problems in Arabic language processing, the limited effort invested in this problem and the lack of available (open-source) resources hinder progress towards solving it. This work provides a critical review of the existing systems, measures, and resources for Arabic text diacritization. Moreover, it introduces a much-needed free-for-all cleaned dataset that can be easily used to benchmark any work on Arabic diacritization. Extracted from the Tashkeela Corpus, the dataset consists of 55K lines containing about 2.3M words. After constructing the dataset, existing tools and systems are tested on it. The results of the experiments show that the neural Shakkala system significantly outperforms traditional rule-based approaches and other closed-source tools, with a Diacritic Error Rate (DER) of 2.88% compared with 13.78%, which is the best DER for the non-neural approach (obtained by the Mishkal tool).
Submitted 25 April, 2019;
originally announced May 2019.
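The Diacritic Error Rate (DER) quoted in the abstract above can be illustrated with a minimal sketch. It assumes one common reading of the metric, namely the percentage of letters whose attached diacritics differ between a reference and a prediction. The abstract does not say which characters are counted or how case endings are treated, so those choices, along with the names `split_diacritics` and `diacritic_error_rate`, are assumptions made for illustration only.

```python
# Arabic diacritics U+064B..U+0652: tanween forms, fatha, damma, kasra, shadda, sukun.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split text into (base character, attached diacritics) pairs."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # a diacritic with no preceding base character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference, prediction):
    """Percentage of letters whose diacritics differ from the reference.

    Assumption: only alphabetic characters are counted, and the order of
    marks on one letter (e.g. shadda + fatha) does not matter.
    """
    ref = split_diacritics(reference)
    pred = split_diacritics(prediction)
    if [base for base, _ in ref] != [base for base, _ in pred]:
        raise ValueError("reference and prediction differ in their base characters")

    total = 0
    errors = 0
    for (base, ref_marks), (_, pred_marks) in zip(ref, pred):
        if not base.isalpha():
            continue  # skip spaces, digits, and punctuation
        total += 1
        if sorted(ref_marks) != sorted(pred_marks):
            errors += 1
    return 100.0 * errors / total if total else 0.0
```

Published DER figures often come in several variants (with or without case endings, with or without undiacritized letters), so numbers from a sketch like this are only comparable to reported results when the counting rules match.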