-
Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT
Authors:
Aisha Khatun,
Anisur Rahman,
Md Saiful Islam,
Hemayet Ahmed Chowdhury,
Ayesha Tasnim
Abstract:
Authorship Attribution is the task of creating an appropriate characterization of text that captures the authors' writing style to identify the original author of a given piece of text. With increased anonymity on the internet, this task has become increasingly crucial in various security and plagiarism detection fields. Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field due to its complex linguistic features and sentence structure. Moreover, existing systems do not scale when the number of authors increases, and their performance drops when there are few samples per author. In this paper, we propose the use of the Average-Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture and an effective transfer learning approach that addresses the problems of complex linguistic feature extraction and scalability for authorship attribution in Bangla Literature (AABL). We analyze the effect of different tokenizations, namely word, sub-word, and character level tokenization, and demonstrate their effectiveness in the proposed model. Moreover, we introduce the publicly available Bangla Authorship Attribution Dataset of 16 authors (BAAD16), containing 17,966 sample texts and 13.4+ million words, to solve the standard dataset scarcity problem, and we release six variations of pre-trained language models for use in any Bangla NLP downstream task. For evaluation, we used our BAAD16 dataset as well as other publicly available datasets. Empirically, our proposed model outperformed state-of-the-art models and achieved 99.8% accuracy on the BAAD16 dataset. Furthermore, we showed that the proposed system scales much better with an increasing number of authors, and that performance remains steady despite few training samples.
Submitted 8 March, 2024;
originally announced March 2024.
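The abstract above compares word, sub-word, and character level tokenization. The following is a minimal illustrative sketch of the two simplest levels only, and is not the authors' pipeline: the function names `word_tokens` and `char_tokens` are hypothetical, whitespace splitting is an assumed stand-in for a real word tokenizer, and sub-word tokenization is omitted because it requires a segmentation model trained on a corpus.

```python
def word_tokens(text):
    """Word-level tokens: split on whitespace (an assumed, simplistic rule)."""
    return text.split()


def char_tokens(text):
    """Character-level tokens: one token per Unicode code point, spaces dropped.

    In Bangla, dependent vowel signs are separate code points, so a single
    written syllable may yield more than one token.
    """
    return [ch for ch in text if not ch.isspace()]


sample = "আমি বাংলায় গান গাই"
print(word_tokens(sample))   # four whitespace-separated words
print(char_tokens("আমি"))    # three code points
```

Character-level vocabularies are small but produce long sequences, while word-level vocabularies are large and sparse; sub-word schemes sit between the two.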
-
BadODD: Bangladeshi Autonomous Driving Object Detection Dataset
Authors:
Mirza Nihal Baig,
Rony Hajong,
Mahdi Murshed Patwary,
Mohammad Shahidur Rahman,
Husne Ara Chowdhury
Abstract:
We propose a comprehensive dataset for object detection in diverse driving environments across 9 districts in Bangladesh. The dataset, collected exclusively from smartphone cameras, provides a realistic representation of real-world scenarios, including day and night conditions. Most existing datasets lack suitable classes for autonomous navigation on Bangladeshi roads, making it challenging for researchers to develop models that can handle the intricacies of road scenarios. To address this issue, we propose a new set of classes based on characteristics rather than local vehicle names. The dataset aims to encourage the development of models that can handle the unique challenges of Bangladeshi road scenarios for the effective deployment of autonomous vehicles. The dataset contains no images sourced online, so that it reflects the real-world conditions faced by autonomous vehicles. The classification of vehicles is challenging because of the diverse range of vehicles on Bangladeshi roads, including those not found elsewhere in the world. The proposed classification system is scalable and can accommodate future vehicles, making it a valuable resource for researchers in the autonomous vehicle sector.
Submitted 19 January, 2024;
originally announced January 2024.
-
Atmospheric Influence on the Path Loss at High Frequencies for Deployment of 5G Cellular Communication Networks
Authors:
Rashed Hasan Ratul,
S M Mehedi Zaman,
Hasib Arman Chowdhury,
Md. Zayed Hassan Sagor,
Mohammad Tawhid Kawser,
Mirza Muntasir Nishat
Abstract:
Over the past few decades, cellular communication technology has developed through several generations, each adding more sophisticated features. Moreover, different high-frequency bands are considered for advanced cellular generations. The presence of updated generations like 4G and 5G is driven by the rising demand for a greater data rate and a better experience for end users. However, because 5G-NR operates at a high frequency and has significant propagation, atmospheric fluctuations like temperature, humidity, and rain rate might result in poorer signal reception and higher path loss, unlike the prior generation, which employed frequencies below 6 GHz. This paper attempts a comparative analysis of the influence of different relative atmospheric conditions on 5G cellular communication for various operating frequencies in an urban microcell (UMi) environment, maintaining real outdoor propagation conditions. In addition, the simulation dataset based on environmental factors has been validated by predicting path loss using multiple regression techniques. Consequently, this study also analyzes the performance of regression techniques for stable estimation of path loss at high frequencies under different atmospheric conditions for 5G, given the radio link quality issues and fluctuations that can arise across seasons in South Asia. Furthermore, in comparison to contemporary studies, the machine learning models performed better in predicting the path loss for the four seasons in South Asian regions.
Submitted 27 July, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
A Continuous Space Neural Language Model for Bengali Language
Authors:
Hemayet Ahmed Chowdhury,
Md. Azizul Haque Imon,
Anisur Rahman,
Aisha Khatun,
Md. Saiful Islam
Abstract:
Language models are generally employed to estimate the probability distribution of various linguistic units, making them one of the fundamental parts of natural language processing. Applications of language models include a wide spectrum of tasks such as text summarization, translation and classification. For a low-resource language like Bengali, research in this area has so far been narrow at best, with only some traditional count-based models proposed. This paper attempts to address the issue and proposes a continuous-space neural language model, or more specifically an ASGD weight-dropped LSTM language model, along with techniques to efficiently train it for the Bengali language. The performance analysis against some existing count-based models presented in this paper also shows that the proposed architecture outperforms its counterparts, achieving an inference perplexity as low as 51.2 on the held-out data set for Bengali.
Submitted 11 January, 2020;
originally announced January 2020.
-
A Subword Level Language Model for Bangla Language
Authors:
Aisha Khatun,
Anisur Rahman,
Hemayet Ahmed Chowdhury,
Md. Saiful Islam,
Ayesha Tasnim
Abstract:
Language models are at the core of natural language processing. The ability to represent natural language gives rise to its applications in numerous NLP tasks including text classification, summarization, and translation. Research in this area is very limited in Bangla due to the scarcity of resources, except for some count-based models and very recently proposed neural language models, which are all based on words and limited in practical tasks due to their high perplexity. This paper attempts to address this issue of perplexity and proposes a subword level neural language model with the AWD-LSTM architecture and various other techniques suitable for training on the Bangla language. The model is trained on a corpus of Bangla newspaper articles of an appreciable size, consisting of more than 28.5 million word tokens. The performance comparison with various other models shows the significant reduction in perplexity the proposed model provides, reaching as low as 39.84 in just 20 epochs.
Submitted 15 November, 2019;
originally announced November 2019.
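The abstract above reports results as perplexity. As a reminder of what that number measures, here is a minimal sketch of the standard definition: the exponential of the average negative log-probability the model assigns to each held-out token. The function name `perplexity` and the input probabilities are hypothetical illustrations, not values or code from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability per token).

    token_probs holds the probability the model assigned to each
    actual next token in the evaluation text (hypothetical values here).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 50 candidates
# has a perplexity of about 50.
print(perplexity([1 / 50] * 10))
```

A lower perplexity means the model is, on average, less uncertain about each next token.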
-
Sentiment Analysis of Comments on Rohingya Movement with Support Vector Machine
Authors:
Hemayet Ahmed Chowdhury,
Tanvir Alam Nibir,
Md. Saiful Islam
Abstract:
The Rohingya Movement and Crisis caused a huge uproar in the political and economic state of Bangladesh. Refugee movement is a recurring event, and a large amount of data in the form of opinions remains on social media such as Facebook, with very little analysis done on it. To analyse the comments on all Rohingya-related posts, we had to create and modify a classifier based on the Support Vector Machine algorithm. The code is implemented in Python and uses the scikit-learn library. A dataset for Rohingya analysis is not currently available, so we had to use our own data set of 2500 positive and 2500 negative comments. We specifically used a support vector machine with a linear kernel. We previously performed an experiment on the same dataset using the naive Bayes algorithm, but it did not yield impressive results.
Submitted 22 March, 2018;
originally announced March 2018.
-
Plagiarism: Taxonomy, Tools and Detection Techniques
Authors:
Hussain A Chowdhury,
Dhruba K Bhattacharyya
Abstract:
To detect plagiarism of any form, it is essential to have broad knowledge of its possible forms and classes, and of the various tools and systems that exist for its detection. Based on impact or severity of damages, plagiarism may occur in an article or in any production in a number of ways. This survey presents a taxonomy of various plagiarism forms and includes a discussion of each of these forms. Over the years, a good number of tools and techniques have been introduced to detect plagiarism. This paper highlights a few promising methods for plagiarism detection based on machine learning techniques. We analyse the pros and cons of these methods, and finally we highlight a list of issues and research challenges related to this evolving research problem.
Submitted 19 January, 2018;
originally announced January 2018.