Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat

Authors: Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng, Aldeida Aleti

Abstract: UI automation tests play a crucial role in ensuring the quality of mobile applications. Despite the growing popularity of machine learning techniques to generate these tests, they still face several challenges, such as the mismatch of UI elements. The recent advances in Large Language Models (LLMs) have addressed these issues by leveraging their semantic understanding capabilities. However, a significant gap remains in applying these models to industrial-level app testing, particularly in terms of cost optimization and knowledge limitation. To address this, we introduce CAT to create cost-effective UI automation tests for industry apps by combining machine learning and LLMs with best practices. Given the task description, CAT employs Retrieval Augmented Generation (RAG) to source examples of industrial app usage as the few-shot learning context, assisting LLMs in generating the specific sequence of actions. CAT then employs machine learning techniques, with LLMs serving as a complementary optimizer, to map the target element on the UI screen. Our evaluations on the WeChat testing dataset demonstrate CAT's performance and cost-effectiveness, achieving 90% UI automation with $0.34 cost, outperforming the state-of-the-art. We have also integrated our approach into the real-world WeChat testing platform, demonstrating its usefulness in detecting 141 bugs and enhancing the developers' testing process.

Submitted 12 September, 2024; originally announced September 2024.

arXiv:2409.03792 [pdf, other]

doi 10.1016/j.jss.2024.112183

Experimental evaluation of architectural software performance design patterns in microservices

Authors: Willem Meijer, Catia Trubiani, Aldeida Aleti

Abstract: Microservice architectures and design patterns enhance the development of large-scale applications by promoting flexibility. Industrial practitioners perceive the importance of applying architectural patterns but they struggle to quantify their impact on system quality requirements. Our research aims to quantify the effect of design patterns on system performance metrics, e.g., service latency and resource utilization, even more so when the patterns operate in real-world environments subject to heterogeneous workloads. We built a cloud infrastructure to host a well-established benchmark system that represents our test bed, complemented by the implementation of three design patterns: Gateway Aggregation, Gateway Offloading, Pipe and Filters. Real performance measurements are collected and compared with model-based predictions that we derived as part of our previous research, thus further consolidating the actual impact of these patterns. Our results demonstrate that, despite the difficulty of parameterizing our benchmark system, model-based predictions are in line with real experimentation, since the performance behaviors of patterns, e.g., bottleneck switches, are mostly preserved. In summary, this is the first work that experimentally demonstrates the performance behavior of microservices-based architectural patterns. Results highlight the complexity of evaluating the performance of design patterns and emphasize the need for complementing theoretical models with empirical data.

Submitted 20 August, 2024; originally announced September 2024.

Comments: The Journal of Systems & Software (2024)

arXiv:2406.11753 [pdf, other]

A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Abstract: Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks. However, full finetuning is usually costly. Existing work, such as parameter-efficient finetuning (PEFT), often focuses on "how to finetune" but neglects the issue of "where to finetune". As a pioneering work on answering where to finetune (at the layer level), we conduct a semantic analysis of the LM inference process. We first propose a virtual transition of the latent representation and then trace its factual transition. Based on the deviation in transitions, we estimate the gain of finetuning each model layer, and further, narrow down the scope for finetuning. We perform extensive experiments across well-known LMs and datasets. The results show that our approach is effective and efficient, and outperforms the existing baselines. Our approach is orthogonal to existing efficient techniques, such as PEFT methods, offering practical value for LM finetuning.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 13 pages, 5 figures, under peer-review

arXiv:2405.03326 [pdf, other]

PAFOT: A Position-Based Approach for Finding Optimal Tests of Autonomous Vehicles

Authors: Victor Crespo-Rodriguez, Neelofar, Aldeida Aleti

Abstract: Autonomous Vehicles (AVs) are poised to revolutionise the transportation industry. However, they must be thoroughly tested to avoid safety violations. Simulation testing plays a crucial role in finding safety violations of Automated Driving Systems (ADSs). This paper proposes PAFOT, a position-based testing framework, which generates adversarial driving scenarios to expose safety violations of ADSs. We introduce a 9-position grid which is virtually drawn around the Ego Vehicle (EV) and modify the driving behaviours of Non-Playable Characters (NPCs) to move within this grid. PAFOT utilises a single-objective genetic algorithm to search for adversarial test scenarios. We demonstrate PAFOT on a well-known high-fidelity simulator, CARLA. The experimental results show that PAFOT can effectively generate safety-critical scenarios to crash ADSs and is able to find collisions in a short simulation time. Furthermore, it outperforms other search-based testing techniques by finding more safety-critical scenarios under the same driving conditions within less effective simulation time.

Submitted 6 May, 2024; originally announced May 2024.

Comments: Pre-print from AST 2024 conference

arXiv:2402.11910 [pdf, other]

Enhancing Large Language Models for Text-to-Testcase Generation

Authors: Saranya Alagarsamy, Chakkrit Tantithamthavorn, Chetan Arora, Aldeida Aleti

Abstract: Context: Test-driven development (TDD) is a widely employed software development practice that involves developing test cases based on requirements prior to writing the code. Although various methods for automated test case generation have been proposed, they are not specifically tailored for TDD, where requirements instead of code serve as input. Objective: In this paper, we introduce a text-to-testcase generation approach based on a large language model (GPT-3.5) that is fine-tuned on our curated dataset with an effective prompt design. Method: Our approach enhances the capabilities of basic GPT-3.5 for the text-to-testcase generation task by fine-tuning it on our curated dataset with an effective prompt design. We evaluated the effectiveness of our approach using five large-scale open-source software projects. Results: Our approach generated 7k test cases for open-source projects, achieving 78.5% syntactic correctness, 67.09% requirement alignment, and 61.7% code coverage, which substantially outperforms all other LLMs (basic GPT-3.5, Bloom, and CodeT5). In addition, our ablation study demonstrates the substantial performance improvement of the fine-tuning and prompting components of the GPT-3.5 model. Conclusions: These findings lead us to conclude that fine-tuning and prompting should be considered in the future when building a language model for the text-to-testcase generation task.

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2401.16184 [pdf, other]

Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Abstract: In-context learning enables language models (LMs) to adapt to downstream data or tasks by incorporating a few samples as demonstrations within the prompts. It offers strong performance without the expense of fine-tuning. However, the performance of in-context learning can be unstable depending on the quality, format, or order of demonstrations, which in turn exacerbates the difficulty of optimization. Prior work, such as kNN Prompting, indexes samples based on the similarities of logits at the output side, in addition to the regular retrieval operation at the input side. It improves in-context learning by leveraging the core ability of next-token prediction, rather than relying solely on the emergent capacity to make analogies. Despite this, the hard-to-optimize issue of in-context learning still exists. In our view, it stems from the process of selecting demonstrations. To address this, we propose complementing in-context learning with an additional clustering operation. We propose a novel approach, "vocabulary-defined semantics". Grounded in LM vocabulary, which is the label space of model outputs, the proposed approach computes semantically equivalent latent representations for output labels. Then, taking the representations as centroids, a clustering operation is performed to align the semantic properties between the language model and the downstream data/tasks. Based on extensive experiments across diverse textual understanding datasets and multiple models, our approach outperforms the state-of-the-art in terms of effectiveness and efficiency. On average, it achieves 3%-49% improvements while requiring only half of the computation time.

Submitted 14 October, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: under peer-review

arXiv:2312.05356 [pdf, other]

Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Abstract: Large Language Models (LLMs) have already gained widespread adoption in software engineering, particularly in code generation tasks. However, updating these models with new knowledge can be prohibitively expensive, yet it is essential to maximize their utility, such as implementing a hotfix technique to address urgent or critical LLM errors. In this paper, we propose MENT, a novel and effective model editing approach to repair LLMs in coding tasks. MENT is effective, efficient, and reliable, capable of correcting a neural model by patching just one or two neurons. As pioneering work on neuron-level model editing of generative models, we formalize the editing process and introduce the involved concepts. We also introduce new measures to evaluate its generalization ability and establish a benchmark for further study. Our approach is evaluated on three coding tasks: line-level code generation, shellcode generation, and intent-to-bash translation. The experimental results demonstrate that the proposed approach significantly outperforms the state-of-the-art in both effectiveness and efficiency measures. Furthermore, we showcase the applications of MENT for LLM reasoning in software engineering. By editing LLM knowledge, the directly or indirectly dependent behaviors of API invocation in the chain-of-thought change accordingly. This illustrates the significance of repairing LLMs in the context of software engineering.

Submitted 5 August, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: 12 pages, 7 figures, 7 tables, under peer-review

arXiv:2312.02392 [pdf, other]

doi 10.1109/TSE.2022.3228334

Instance Space Analysis of Search-Based Software Testing

Authors: Neelofar Neelofar, Kate Smith-Miles, Mario Andres Munoz, Aldeida Aleti

Abstract: Search-based software testing (SBST) is now a mature area, with numerous techniques developed to tackle the challenging task of software testing. SBST techniques have shown promising results and have been successfully applied in the industry to automatically generate test cases for large and complex software systems. Their effectiveness, however, is problem-dependent. In this paper, we revisit the problem of objective performance evaluation of SBST techniques considering recent methodological advances -- in the form of Instance Space Analysis (ISA) -- enabling the strengths and weaknesses of SBST techniques to be visualized and assessed across the broadest possible space of problem instances (software classes) from common benchmark datasets. We identify features of SBST problems that explain why a particular instance is hard for an SBST technique, reveal areas of hard and easy problems in the instance space of existing benchmark datasets, and identify the strengths and weaknesses of state-of-the-art SBST techniques. In addition, we examine the diversity and quality of common benchmark datasets used in experimental evaluations.

Submitted 4 December, 2023; originally announced December 2023.

Journal ref: IEEE Transactions on Software Engineering, 49(4), 2642-2660 (2022)

arXiv:2311.08049 [pdf, other]

Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles

Authors: Neelofar Neelofar, Aldeida Aleti

Abstract: AI-powered systems have gained widespread popularity in various domains, including Autonomous Vehicles (AVs). However, ensuring their reliability and safety is challenging due to their complex nature. Conventional test adequacy metrics, designed to evaluate the effectiveness of traditional software testing, are often insufficient or impractical for these systems. White-box metrics, which are specifically designed for these systems, leverage neuron coverage information. These coverage metrics necessitate access to the underlying AI model and training data, which may not always be available. Furthermore, the existing adequacy metrics exhibit weak correlations with the ability to detect faults in the generated test suite, creating a gap that we aim to bridge in this study. In this paper, we introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics, which can be used to gauge the effectiveness of a test suite. The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing. Additionally, we introduce a framework that permits testers to visualise the diversity and coverage of the test suite in a two-dimensional space, facilitating the identification of areas that require improvement. We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs. A strong correlation, coupled with the short computation time, indicates their effectiveness and efficiency in estimating the adequacy of testing AVs.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 12 pages, 7 figures

arXiv:2309.03554 [pdf, other]

Software Testing of Generative AI Systems: Challenges and Opportunities

Authors: Aldeida Aleti

Abstract: Software Testing is a well-established area in software engineering, encompassing various techniques and methodologies to ensure the quality and reliability of software systems. However, with the advent of generative artificial intelligence (GenAI) systems, new challenges arise in the testing domain. These systems, capable of generating novel and creative outputs, introduce unique complexities that require novel testing approaches. In this paper, I aim to explore the challenges posed by generative AI systems and discuss potential opportunities for future research in the field of testing. I will touch on the specific characteristics of GenAI systems that make traditional testing techniques inadequate or insufficient. By addressing these challenges and pursuing further research, we can enhance our understanding of how to safeguard GenAI and pave the way for improved quality assurance in this rapidly evolving domain.

Submitted 11 September, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

arXiv:2303.06283 [pdf, other]

Closing the Loop for Software Remodularisation -- REARRANGE: An Effort Estimation Approach for Software Clustering-based Remodularisation

Authors: Alvin Jian Jia Tan, Chun Yong Chong, Aldeida Aleti

Abstract: Software remodularization through clustering is a common practice to improve internal software quality. However, the true benefit of software clustering is only realized if developers follow through with the recommended refactoring suggestions, which can be complex and time-consuming. Simply producing clustering results is not enough to realize the benefits of remodularization. For the recommended refactoring operations to have an impact, developers must follow through with them. However, this is often a difficult task due to certain refactoring operations' complexity and time-consuming nature.

Submitted 10 March, 2023; originally announced March 2023.

Comments: Accepted for publication at ICSE23 Poster Track

arXiv:2302.10352 [pdf, other]

A3Test: Assertion-Augmented Automated Test Case Generation

Authors: Saranya Alagarsamy, Chakkrit Tantithamthavorn, Aldeida Aleti

Abstract: Test case generation is an important activity, yet a time-consuming and laborious task. Recently, AthenaTest -- a deep learning approach for generating unit test cases -- was proposed. However, AthenaTest can generate less than one-fifth of the test cases correctly, due to a lack of assertion knowledge and test signature verification. In this paper, we propose A3Test, a DL-based test case generation approach that is augmented by assertion knowledge with a mechanism to verify naming consistency and test signatures. A3Test leverages the domain adaptation principles where the goal is to adapt the existing knowledge from an assertion generation task to the test case generation task. We also introduce a verification approach to verify naming consistency and test signatures. Through an evaluation of 5,278 focal methods from the Defects4j dataset, we find that our A3Test (1) achieves 147% more correct test cases and 15% more method coverage, with a lower number of generated test cases than AthenaTest; (2) still outperforms the existing pre-trained models for the test case generation task; (3) contributes substantially to performance improvement via our own proposed assertion pre-training and the verification components; (4) is 97.2% faster while being more accurate than AthenaTest.

Submitted 20 February, 2023; originally announced February 2023.

Comments: Under Review at ACM Transactions on Software Engineering and Methodology

arXiv:2212.07566 [pdf, other]

Identifying and Explaining Safety-critical Scenarios for Autonomous Vehicles via Key Features

Authors: Neelofar, Aldeida Aleti

Abstract: Ensuring the safety of autonomous vehicles (AVs) is of utmost importance and testing them in simulated environments is a safer option than conducting in-field operational tests. However, generating an exhaustive test suite to identify critical test scenarios is computationally expensive as the representation of each test is complex and contains various dynamic and static features, such as the AV under test, road participants (vehicles, pedestrians, and static obstacles), environmental factors (weather and light), and the road's structural features (lanes, turns, road speed, etc.). In this paper, we present a systematic technique that uses Instance Space Analysis (ISA) to identify the significant features of test scenarios that affect their ability to reveal the unsafe behaviour of AVs. ISA identifies the features that best differentiate safety-critical scenarios from normal driving and visualises the impact of these features on test scenario outcomes (safe/unsafe) in 2D. This visualization helps to identify untested regions of the instance space and provides an indicator of the quality of the test suite in terms of the percentage of feature space covered by testing. To test the predictive ability of the identified features, we train five Machine Learning classifiers to classify test scenarios as safe or unsafe. The high precision, recall, and F1 scores indicate that our proposed approach is effective in predicting the outcome of a test scenario without executing it and can be used for test generation, selection, and prioritization.

Submitted 28 November, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: 28 pages, 6 figures

ACM Class: D.2.5

arXiv:2207.11082 [pdf, other]

doi 10.1007/s10664-024-10503-2

Test-based Patch Clustering for Automatically-Generated Patches Assessment

Authors: Matias Martinez, Maria Kechagia, Anjana Perera, Justyna Petke, Federica Sarro, Aldeida Aleti

Abstract: Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly, and prevents APR tool adoption in practice. Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time required to find a correct patch. We introduce a novel light-weight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools and to provide more information about those patches for facilitating patch assessment. The novelty of xTestCluster lies in using information from execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed of patches that fail on the same generated test cases. The output from xTestCluster gives developers a) a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, b) additional information attached to each patch. After analyzing 902 plausible patches from 21 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze by a median of 50%. xTestCluster can save a significant amount of time for developers that have to review the multitude of patches generated by APR tools, and provides them with new test cases that expose the differences in behavior between generated patches.

Submitted 27 August, 2024; v1 submitted 22 July, 2022; originally announced July 2022.

Comments: Published in Springer Empirical Software Engineering, Volume 29, article number 116 (2024)

arXiv:2110.02682 [pdf, other]

How good does a Defect Predictor need to be to guide Search-Based Software Testing?

Authors: Anjana Perera, Burak Turhan, Aldeida Aleti, Marcel Böhme

Abstract: Defect predictors, static bug detectors and humans inspecting the code can locate the parts of the program that are buggy before they are discovered through testing. Automated test generators such as search-based software testing (SBST) techniques can use this information to direct their search for test cases to likely buggy code, thus speeding up the process of detecting existing bugs. However, often the predictions given by these tools or humans are imprecise, which can misguide the SBST technique and may deteriorate its performance. In this paper, we study the impact of imprecision in defect prediction on the bug detection effectiveness of SBST. Our study finds that the recall of the defect predictor, i.e., the probability of correctly identifying buggy code, has a significant impact on bug detection effectiveness of SBST with a large effect size. On the other hand, the effect of precision, a measure for false alarms, is not of meaningful practical significance as indicated by a very small effect size. In particular, the SBST technique finds 7.5 fewer bugs on average (out of 420 bugs) for every 5% decrement in recall. In the context of combining defect prediction and SBST, our recommendation for practice is to increase the recall of defect predictors at the expense of precision, while maintaining a precision of at least 75%. To account for the imprecision of defect predictors, in particular low recall values, SBST techniques should be designed to search for test cases that also cover the predicted non-buggy parts of the program, while prioritising the parts that have been predicted as buggy.

Submitted 6 October, 2021; originally announced October 2021.

Comments: 12 pages, 4 figures

ACM Class: D.2.5

arXiv:2109.12645 [pdf, other]

doi 10.1145/3324884.3416612

Defect Prediction Guided Search-Based Software Testing

Authors: Anjana Perera, Aldeida Aleti, Marcel Böhme, Burak Turhan

Abstract: Today, most automated test generators, such as search-based software testing (SBST) techniques, focus on achieving high code coverage. However, high code coverage is not sufficient to maximise the number of bugs found, especially when given a limited testing budget. In this paper, we propose an automated test generation technique that is also guided by the estimated degree of defectiveness of the source code. Parts of the code that are likely to be more defective receive more testing budget than the less defective parts. To measure the degree of defectiveness, we leverage Schwa, a notable defect prediction technique. We implement our approach in EvoSuite, a state-of-the-art SBST tool for Java. Our experiments on the Defects4J benchmark demonstrate the improved efficiency of defect prediction guided test generation and confirm our hypothesis that spending more time budget on likely defective parts increases the number of bugs found in the same time budget.

Submitted 26 September, 2021; originally announced September 2021.

Comments: 13 pages, 8 figures

ACM Class: D.2.5

Journal ref: In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE '20), 2020

arXiv:2107.01766 [pdf, other]

E-SC4R: Explaining Software Clustering for Remodularisation

Authors: Alvin Jian Jia Tan, Chun Yong Chong, Aldeida Aleti

Abstract: Maintenance of existing software requires a large amount of time for comprehending the source code. The architecture of a software system, however, may not be clear to maintainers if up-to-date documentation is not available. Software clustering is often used as a remodularisation and architecture recovery technique to help recover a semantic representation of the software design. Due to the diverse domains, structure, and behaviour of software systems, the suitability of different clustering algorithms for different software systems has not been investigated thoroughly. Research that introduces new clustering techniques usually validates its approach on a specific domain, which might limit its generalisability. If the chosen test subjects represent only a narrow perspective of the whole picture, researchers risk not being able to address the external validity of their findings. This work aims to fill this gap by introducing a new approach, Explaining Software Clustering for Remodularisation, to evaluate the effectiveness of different software clustering approaches. This work focuses on hierarchical clustering and Bunch clustering algorithms and provides information about their suitability according to the features of the software, which, as a consequence, enables the selection of the most suitable algorithm and configuration from our existing pool of choices for a particular software system.
The proposed framework is tested on 30 open source software systems with varying sizes and domains, and demonstrates that it can characterise both the strengths and weaknesses of the analysed software clustering algorithms using software features extracted from the code. The proposed approach also provides a better understanding of the algorithms' behaviour through the application of dimensionality reduction techniques.

Submitted 2 October, 2021; v1 submitted 4 July, 2021; originally announced July 2021.

Comments: 31 pages

arXiv:2012.01708 [pdf, other]

Feature-Based Software Design Pattern Detection

Authors: Najam Nazar, Aldeida Aleti, Yaokun Zheng

Abstract: Software design patterns are standard solutions to common problems in software design and architecture. Knowing that a particular module implements a design pattern is a shortcut to design comprehension. Manually detecting design patterns is a time-consuming and challenging task; therefore, researchers have proposed automatic design pattern detection techniques. However, these techniques show low performance for certain design patterns. In this work, we introduce a design pattern detection approach, DPD_F, that improves the performance over the state-of-the-art by using code features with machine learning classifiers to automatically train a design pattern detector. DPD_F creates a semantic representation of Java source code using the code features and the call graph, and applies the Word2Vec algorithm on the semantic representation to construct the word-space geometric model of the Java source code. DPD_F then builds a Machine Learning classifier trained on a labelled dataset and identifies software design patterns with over 80% Precision and over 79% Recall. Additionally, we have compared DPD_F with two existing design pattern detection techniques, namely FeatureMaps and MARPLE-DPD. Empirical results demonstrate that our approach outperforms the state-of-the-art approaches by approximately 35% and 15% respectively in terms of Precision. The run-time performance also supports the practical applicability of our classifier.

Submitted 2 December, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

Comments: Accepted in Journal of Systems and Software (JSS)

arXiv:2002.03968 [pdf, other]

doi 10.1007/s10664-021-09989

E-APR: Mapping the Effectiveness of Automated Program Repair

Authors: Aldeida Aleti, Matias Martinez

Abstract: Automated Program Repair (APR) is a fast growing area with numerous new techniques being developed to tackle one of the most challenging software engineering problems. APR techniques have shown promising results, giving us hope that one day it will be possible for software to repair itself. In this paper, we focus on the problem of objective performance evaluation of APR techniques. We introduce a new approach, Explaining Automated Program Repair (E-APR), which identifies features of buggy programs that explain why a particular instance is difficult for an APR technique. E-APR is used to examine the diversity and quality of the buggy programs used by most researchers, and analyse the strengths and weaknesses of existing APR techniques. E-APR visualises an instance space of buggy programs, with each buggy program represented as a point in the space. The instance space is constructed to reveal areas of hard and easy buggy programs, and enables the strengths and weaknesses of APR techniques to be identified.

Submitted 8 June, 2021; v1 submitted 10 February, 2020; originally announced February 2020.

Journal ref: Empirical Software Engineering 2021

arXiv:2001.02872 [pdf, other]

The Neighbours' Similar Fitness Property for Local Search

Authors: Mark Wallace, Aldeida Aleti

Abstract: For most practical optimisation problems local search outperforms random sampling - despite the "No Free Lunch Theorem". This paper introduces a property of search landscapes termed Neighbours' Similar Fitness (NSF) that underlies the good performance of neighbourhood search in terms of local improvement. Though necessary, NSF is not sufficient to ensure that searching for improvement among the neighbours of a good solution is better than random search. The paper introduces an additional (natural) property which supports a general proof that, for NSF landscapes, neighbourhood search beats random search.

Submitted 9 January, 2020; originally announced January 2020.

arXiv:1912.02535 [pdf, other]

Is perturbation an effective restart strategy?

Authors: Aldeida Aleti, Mark Wallace, Markus Wagner

Abstract: Premature convergence can be detrimental to the performance of search methods, which is why many search algorithms include restart strategies to deal with it. While it is common to perturb the incumbent solution with diversification steps of various sizes with the hope that the search method will find a new basin of attraction leading to a better local optimum, it is usually not clear how big the perturbation step should be. We introduce a new property of fitness landscapes termed "Neighbours with Similar Fitness" and we demonstrate that the effectiveness of a restart strategy depends on this property.

Submitted 5 December, 2019; originally announced December 2019.

arXiv:1910.12415 [pdf, other]

doi 10.1016/j.eswa.2021.115675

Robotic Hierarchical Graph Neurons. A novel implementation of HGN for swarm robotic behaviour control

Authors: Phillip Smith, Aldeida Aleti, Vincent C. S. Lee, Robert Hunjet, Asad Khan

Abstract: This paper explores the use of a novel form of Hierarchical Graph Neurons (HGN) for in-operation behaviour selection in a swarm of robotic agents. This new HGN is called Robotic-HGN (R-HGN), as it matches robot environment observations to environment labels via fusion of match probabilities from both temporal and intra-swarm collections. This approach is novel for HGN as it addresses robotic observations being pseudo-continuous numbers, rather than categorical values. Additionally, the proposed approach is memory and computation-power conservative and thus is acceptable for use in mobile devices such as single-board computers, which are often used in mobile robotic agents. This R-HGN approach is validated against individual behaviour implementation and random behaviour selection. This contrast is made in two sets of simulated environments: environments designed to challenge the held behaviours of the R-HGN, and randomly generated environments which are more challenging for the robotic swarm than R-HGN training conditions. R-HGN has been found to enable appropriate behaviour selection in both these sets, allowing significant swarm performance in pre-trained and unexpected environment conditions.

Submitted 27 October, 2019; originally announced October 2019.

Journal ref: Expert Systems with Applications 2021

arXiv:1910.12412 [pdf, other]

Swarm Behaviour Evolution via Rule Sharing and Novelty Search

Authors: Phillip Smith, Robert Hunjet, Aldeida Aleti, Asad Khan

Abstract: We present in this paper an extension of our previous work, increasing the robustness and coverage of the evolution search via hybridisation with a state-of-the-art novelty search and accelerating the individual agent behaviour searches via a novel behaviour-component sharing technique. Via these improvements, we present Swarm Learning Classifier System 2.0 (SLCS2), a behaviour evolving algorithm which is robust to complex environments, and is seen to out-perform a human behaviour designer in challenging cases of the data-transfer task in a range of environmental conditions. Additionally, we examine the impact of tailoring the SLCS2 rule generator for specific environmental conditions. We find this leads to over-fitting, as might be expected, and thus conclude that for greatest environment flexibility a general rule generator should be utilised.

Submitted 27 October, 2019; originally announced October 2019.

arXiv:1910.09811 [pdf, other]

doi 10.3847/1538-4357/ab4fea

A data-driven model of nucleosynthesis with chemical tagging in a lower-dimensional latent space

Authors: Andrew R. Casey, John C. Lattanzio, Aldeida Aleti, David L. Dowe, Joss Bland-Hawthorn, Sven Buder, Geraint F. Lewis, Sarah L. Martell, Thomas Nordlander, Jeffrey D. Simpson, Sanjib Sharma, Daniel B. Zucker

Abstract: Chemical tagging seeks to identify unique star formation sites from present-day stellar abundances. Previous techniques have treated each abundance dimension as being statistically independent, despite theoretical expectations that many elements can be produced by more than one nucleosynthetic process. In this work we introduce a data-driven model of nucleosynthesis where a set of latent factors (e.g., nucleosynthetic yields) contribute to all stars with different scores, and clustering (e.g., chemical tagging) is modelled by a mixture of multivariate Gaussians in a lower-dimensional latent space. We use an exact method to simultaneously estimate the factor scores for each star, the partial assignment of each star to each cluster, and the latent factors common to all stars, even in the presence of missing data entries. We use an information-theoretic Bayesian principle to estimate the number of latent factors and clusters. Using the second Galah data release we find that six latent factors are preferred to explain N = 2,566 stars with 17 chemical abundances. We identify the rapid- and slow-neutron capture processes, as well as latent factors consistent with Fe-peak and α-element production, and another where K and Zn dominate. When we consider N ~ 160,000 stars with missing abundances we find another 7 factors, as well as 16 components in latent space. Despite these components showing separation in chemistry that is explained through different yield contributions, none show significant structure in their positions or motions.
We argue that more data, and joint priors on cluster membership that are constrained by dynamical models, are necessary to realise chemical tagging at a galactic-scale. We release software that allows for model parameters to be optimised in seconds given a fixed number of latent factors, components, and $10^7$ abundance measurements.

Submitted 22 October, 2019; originally announced October 2019.

Comments: Accepted to ApJ

arXiv:1801.04644 [pdf, other]

doi 10.1016/j.jss.2018.01.010

An Efficient Method for Uncertainty Propagation in Robust Software Performance Estimation

Authors: Aldeida Aleti, Catia Trubiani, André van Hoorn, Pooyan Jamshidi

Abstract: Software engineers often have to estimate the performance of a software system before having full knowledge of the system parameters, such as workload and operational profile. These uncertain parameters inevitably affect the accuracy of quality evaluations, and the ability to judge if the system can continue to fulfil performance requirements if parameter results are different from expected. Previous work has addressed this problem by modelling the potential values of uncertain parameters as probability distribution functions, and estimating the robustness of the system using Monte Carlo-based methods. These approaches require a large number of samples, which results in high computational cost and long waiting times. To address the computational inefficiency of existing approaches, we employ Polynomial Chaos Expansion (PCE) as a rigorous method for uncertainty propagation and further extend its use to robust performance estimation. The aim is to assess if the software system is robust, i.e., it can withstand possible changes in parameter values, and continue to meet performance requirements. PCE is a very efficient technique, and requires significantly fewer computations to accurately estimate the distribution of performance indices. Through three very different case studies from different phases of software development and heterogeneous application domains, we show that PCE can accurately (>97%) estimate the robustness of various performance indices, and saves up to 225 hours of performance evaluation time when compared to Monte Carlo Simulation.

Submitted 14 January, 2018; originally announced January 2018.

Showing 1–25 of 25 results for author: Aleti, A