Abstract
Automated program repair is gaining traction because of its potential to greatly reduce debugging cost. A number of works have shown that automated program repair is feasible, and the research focus is gradually shifting toward the quality of generated patches. One promising direction is to control the quality of generated patches by controlling the quality of the test-suites used for automated program repair. In this paper, we ask the following research question: “Can traditional test-suite metrics proposed for the purpose of software testing also be used for the purpose of automated program repair?” We empirically investigate whether traditional test-suite metrics such as statement/branch coverage and mutation score are effective in controlling the reliability of generated repairs (the likelihood that repairs cause regression errors). We conduct the largest-scale experiments of this kind to date with real-world software, and for the first time perform a correlation study between various test-suite metrics and the reliability of generated repairs. Our results show that, in general, the reliability of repairs tends to increase as traditional test-suite metrics increase. This trend is strongest for statement coverage. Our results imply that the traditional test-suite metrics proposed for software testing can also be used in automated program repair to improve the reliability of repairs.
Notes
One exception is DirectFix (Mechtaev et al. 2015), in which the fault localization and edit parts are fused.
A mutant m is considered killed when the test result of m for at least one test in the provided test-suite differs from the test result of the original program for the same test.
Only positive tests are considered; an output change for negative tests is not a regression.
We used the original GenProg benchmark. At the time of writing this paper, the benchmark had been updated after a few problems in the test scripts of php and libtiff were reported in Qi et al. (2015).
The grep subject in CoREBench contains real errors, unlike the grep in SIR, which contains seeded errors.
Although php contains 8471 tests, we randomly selected 200 of them to cope with the long running time of the php tests.
tot_info includes non-linear arithmetic expressions, which are not currently supported by the underlying SMT solver that SemFix uses.
We extended its parser to handle the large subjects (php, libtiff, grep, and findutils).
The minimum statement/branch coverage of php is 0 because some tests do not execute the marked source files.
The “coverage” referred to in Smith et al. (2015) essentially means how many tests of a given test-universe are covered.
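The notes above on when a mutant counts as killed and on which output changes count as regressions can be illustrated with a minimal Python sketch. This is not the tooling used in the study. The dictionary representation of test results and the names is_killed, mutation_score, and regressions are hypothetical, and the score is assumed to be the plain fraction of killed mutants, with no special handling of equivalent mutants.

```python
from typing import Dict, List

# Hypothetical representation: test name -> observed result (e.g. output or verdict).
TestResults = Dict[str, str]


def is_killed(original: TestResults, mutant: TestResults) -> bool:
    """A mutant is killed if at least one test's result differs from the
    result of the original program for the same test."""
    return any(mutant.get(test) != result for test, result in original.items())


def mutation_score(original: TestResults, mutants: List[TestResults]) -> float:
    """Fraction of mutants killed by the test-suite (assumed definition)."""
    if not mutants:
        return 0.0
    killed = sum(1 for m in mutants if is_killed(original, m))
    return killed / len(mutants)


def regressions(original: TestResults, repaired: TestResults,
                positive_tests: List[str]) -> List[str]:
    """Positive tests whose result changed after the repair.
    Negative tests are ignored: an output change there is not a regression."""
    return [t for t in positive_tests if repaired.get(t) != original.get(t)]


# Toy data: t1 and t2 are positive tests, t3 is the negative (failing) test.
original = {"t1": "pass", "t2": "pass", "t3": "fail"}
mutants = [
    {"t1": "pass", "t2": "fail", "t3": "fail"},  # differs on t2, so killed
    {"t1": "pass", "t2": "pass", "t3": "fail"},  # identical results, so it survives
]
repaired = {"t1": "pass", "t2": "fail", "t3": "pass"}

score = mutation_score(original, mutants)
broken = regressions(original, repaired, ["t1", "t2"])
```

In this toy data the repair makes the negative test t3 pass but changes the result of the positive test t2, which is the kind of regression the reliability measure is meant to capture.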
References
Andrews JH, Briand LC, Labiche Y, Namin AS (2006) Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Trans Softw Eng 32(8):608–624
Artzi S, Dolby J, Tip F, Pistoia M (2010) Directed test generation for effective fault localization. In: Proceedings of the 19th International Symposium on Software Testing and Analysis, ISSTA ’10, pp 49–60
Assiri FY, Bieman JM (2014) An assessment of the quality of automated program operator repair. In: Proceedings of the 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation, ICST ’14, pp 273–282
Baudry B, Fleurey F, Le Traon Y (2006) Improving test suites for efficient fault localization. In: 82–91
Böhme M, Roychoudhury A (2014) CoREBench: Studying complexity of regression errors. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA ’14, pp 105–115
Böhme M, Oliveira BCdS, Roychoudhury A (2013a) Partition-based regression verification. In: Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp 302–311
Böhme M, Oliveira BCdS, Roychoudhury A (2013b) Regression tests to expose change interaction errors. In: Proceedings of the 2013 Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE ’13, pp 334–344
Cadar C, Engler D (2005) Execution generated test cases: How to make systems code crash itself. In: Proceedings of the 12th International Conference on Model Checking Software, SPIN ’05, pp 2–23
Cadar C, Dunbar D, Engler D (2008) KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI ’08, pp 209–224
Dallmeier V, Zeller A, Meyer B (2009) Generating fixes from object behavior anomalies. In: Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, ASE ’09, pp 550–554
Debroy V, Wong WE (2010) Using mutation to automatically suggest fixes for faulty programs. In: Proceedings of the Third International Conference on Software Testing, Verification and Validation, ICST ’10, pp 65–74
Debroy V, Wong WE (2014) Combining mutation and fault localization for automated program debugging. J Syst Softw 90:45–60
Do H, Elbaum SG, Rothermel G (2005) Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empir Softw Eng 10(4):405–435
Elkarablieh B, Khurshid S (2008) Juzi: A tool for repairing complex data structures. In: Proceedings of the 30th International Conference on Software Engineering, ICSE ’08, pp 855–858
Godefroid P, Klarlund N, Sen K (2005) DART: Directed automated random testing. In: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pp 213–223
Gopinath D, Malik MZ, Khurshid S (2011) Specification-based program repair using SAT. In: Proceedings of the 17th International Conference on Tools and Algorithms for the Construction and Analysis of Systems: Part of the Joint European Conferences on Theory and Practice of Software, TACAS ’11/ETAPS ’11, pp 173–188
He H, Gupta N (2004) Automated debugging using path-based weakest preconditions. In: Proceedings of the 7th International Conference on Fundamental Approaches to Software Engineering, FASE ’04, pp 267–280
Jia Y, Harman M (2011) An analysis and survey of the development of mutation testing. IEEE Trans Softw Eng 37(5):649–678
Jobstmann B, Griesmayer A, Bloem R (2005) Program repair as a game. In: Proceedings of the 17th International Conference on Computer Aided Verification, CAV ’05, pp 226–238
Jones JA, Harrold MJ, Stasko JT (2002) Visualization of test information to assist fault localization. In: Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, pp 467–477
Ke Y, Stolee KT, Le Goues C, Brun Y (2015) Repairing programs with semantic code search (t). In: Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering, ASE ’15, pp 295–306
Kendall MG (1945) The treatment of ties in ranking problems. Biometrika 33(3):239–251
Kim D, Nam J, Song J, Kim S (2013) Automatic patch generation learned from human-written patches. In: Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp 802–811
Kong X, Zhang L, Wong WE, Li B (2015) Experience report: How do techniques, programs, and tests impact automated program repair? In: Proceedings of the 2015 IEEE 26th International Symposium on Software Reliability Engineering, ISSRE ’15, pp 194–204
Könighofer R, Bloem R (2011) Automated error localization and correction for imperative programs. In: Proceedings of the International Conference on Formal Methods in Computer-Aided Design, FMCAD ’11, pp 91–100
Le Goues C, Dewey-Vogt M, Forrest S, Weimer W (2012a) A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In: Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, pp 3–13
Le Goues C, Nguyen T, Forrest S, Weimer W (2012b) GenProg: A generic method for automatic software repair. IEEE Trans Softw Eng 38(1):54–72
Le Goues C, Forrest S, Weimer W (2013) Current challenges in automatic software repair. Softw Qual J 21(3):421–443
Liblit B, Aiken A, Zheng AX, Jordan MI (2003) Bug isolation via remote program sampling. In: Proceedings of the ACM SIGPLAN 2003 conference on Programming Language Design and Implementation, PLDI ’03, pp 141–154
Long F, Rinard M (2015) Staged program repair with condition synthesis. In: Proceedings of the 2015 Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE ’15, pp 166–178
Long F, Rinard M (2016a) An analysis of the search spaces for generate and validate patch generation systems. In: Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pp 702–713
Long F, Rinard M (2016b) Automatic patch generation by learning correct code. In: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’16, pp 298–312
Long F, Sidiroglou-Douskos S, Rinard M (2014) Automatic runtime error repair and containment via recovery shepherding. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pp 227–238
Maldonado JC, Delamaro ME, Fabbri SCPF, Simão A d S, Sugeta T, Vincenzi AMR, Masiero PC (2001) Proteum: A family of tools to support specification and program testing based on mutation. In: Wong W E (ed) Mutation Testing for the New Century, Kluwer Academic Publishers, Norwell, pp 113–116
Mechtaev S, Yi J, Roychoudhury A (2015) DirectFix: Looking for simple program repairs. In: Proceedings of the 37th IEEE/ACM International Conference on Software Engineering, ICSE ’15, pp 448–458
Mechtaev S, Yi J, Roychoudhury A (2016) Angelix: Scalable multiline program patch synthesis via symbolic analysis. In: Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pp 691–701
Miller W, Spooner DL (1976) Automatic generation of floating-point test data. IEEE Trans Softw Eng 2(3):223–226
Namin AS, Andrews JH (2009) The influence of size and coverage on test suite effectiveness. In: Proceedings of the 8th International Symposium on Software Testing and Analysis, ISSTA ’09, pp 57–68
Nguyen HDT, Qi D, Roychoudhury A, Chandra S (2013) SemFix: Program repair via semantic analysis. In: Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp 772–781
Pearson K (1895) Note on regression and inheritance in the case of two parents. Proc Royal Soc Lond 58:240–242
Pei Y, Furia C, Nordio M, Wei Y, Meyer B, Zeller A (2014) Automated fixing of programs with contracts. IEEE Trans Softw Eng 40(5):427–449
Perkins JH, Kim S, Larsen S, Amarasinghe S, Bachrach J, Carbin M, Pacheco C, Sherwood F, Sidiroglou S, Sullivan G, Wong WF, Zibin Y, Ernst MD, Rinard M (2009) Automatically patching errors in deployed software. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pp 87–102
Person S, Yang G, Rungta N, Khurshid S (2011) Directed incremental symbolic execution. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’11, pp 504–515
Qi Y, Mao X, Lei Y (2013) Efficient automated program repair through fault-recorded testing prioritization. In: Proceedings of the 2013 IEEE International Conference on Software Maintenance, ICSM ’13, pp 180–189
Qi Y, Mao X, Lei Y, Dai Z, Wang C (2014) The strength of random search on automated program repair. In: Proceedings of the 36th International Conference on Software Engineering, ICSE ’14, pp 254–265
Qi Z, Long F, Achour S, Rinard M (2015) An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In: Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA ’15, pp 24–36
Samimi H, Aung ED, Millstein T (2010) Falling back on executable specifications. In: Proceedings of the 24th European Conference on Object-oriented Programming, ECOOP ’10, pp 552–576
Samimi H, Schäfer M, Artzi S, Millstein T, Tip F, Hendren L (2012) Automated repair of HTML generation errors in PHP applications using string constraint solving. In: Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, pp 277–287
Santelices R, Chittimalli PK, Apiwattanapong T, Orso A, Harrold MJ (2008) Test-suite augmentation for evolving software. In: Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering, ASE ’08, pp 218–227
Schoenauer M, Xanthakis S (1993) Constrained GA optimization. In: Proceedings of the 5th International Conference on Genetic Algorithms, ICGA ’93, pp 573–580
Smith EK, Barr ET, Le Goues C, Brun Y (2015) Is the cure worse than the disease? Overfitting in automated program repair. In: Proceedings of the 2015 Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE ’15, pp 532–543
Tan SH, Roychoudhury A (2015) relifix: Automated repair of software regressions. In: Proceedings of the 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, ICSE ’15, pp 471–482
Tan SH, Yoshida H, Prasad MR, Roychoudhury A (2016) Anti-patterns in search-based program repair. In: Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE ’16, pp 727–738
Weimer W, Fry ZP, Forrest S (2013) Leveraging program equivalence for adaptive program repair: Models and first results. In: Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, ASE ’13, pp 356–366
White DR, Arcuri A, Clark JA (2011) Evolutionary improvement of programs. IEEE Trans Evol Comput 15(4):515–538
Xuan J, Martinez M, Demarco F, Clement M, Marcote SRL, Durieux T, Berre DL, Monperrus M (2017) Nopol: Automatic repair of conditional statement bugs in Java programs. IEEE Trans Softw Eng 43(1):34–55
Yao X, Harman M, Jia Y (2014) A study of equivalent and stubborn mutation operators using human analysis of equivalence. In: Proceedings of the 36th International Conference on Software Engineering, ICSE ’14, pp 919–930
Acknowledgements
This research is supported in part by the National Research Foundation, Prime Minister’s Office, Singapore under its National Cybersecurity R&D Program (TSUNAMi project, Award No. NRF2014NCR-NCR001-21) and administered by the National Cybersecurity R&D Directorate. The first author thanks Innopolis University for its support.
Additional information
Communicated by: Martin Monperrus and Westley Weimer
Cite this article
Yi, J., Tan, S.H., Mechtaev, S. et al. A correlation study between automated program repair and test-suite metrics. Empir Software Eng 23, 2948–2979 (2018). https://doi.org/10.1007/s10664-017-9552-y
DOI: https://doi.org/10.1007/s10664-017-9552-y