
Comparing software fault predictions of pure and zero-inflated Poisson regression models. (English) Zbl 1101.68480

Summary: Predicting software quality prior to system tests and operations has proven useful for achieving effective reliability improvements. Pure Poisson regression is the most commonly used count modelling technique for predicting the expected number of faults in software modules. It is best suited to equidispersed fault data, that is, data in which the mean of the dependent variable (the fault count) equals its variance. However, software fault data often contain a large proportion of zeros (no faults), especially in high-assurance systems. In such cases a pure Poisson regression model (PRM) may yield inaccurate fault predictions. A zero-inflated Poisson (ZIP) model changes the mean structure of a PRM, resulting in improved predictive quality. To illustrate this, we examined software data collected from a full-scale industrial software system. Fault prediction models were calibrated using both pure Poisson and ZIP regression techniques. To prevent claims based on a biased data split (into fit and test data sets), the data set was randomly split 50 times, and models were calibrated on each of these splits. A comparative hypothesis test between the pure Poisson and ZIP modelling techniques revealed that the ZIP model fitted better than its counterpart. Our empirical comparative study showed that the ZIP model yielded better predictions than the PRM and also demonstrated better robustness in prediction accuracy across the 50 data splits.
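
As a rough illustration of why excess zeros matter, the following minimal sketch contrasts a pure Poisson distribution with a zero-inflated one. It uses the standard textbook ZIP form, in which a count is a structural zero with probability `pi` and Poisson-distributed otherwise. The names `lam` and `pi` and the numeric values are assumptions chosen for illustration; the summary does not give the paper's parameterization, and the paper's regression models would additionally link these parameters to software metrics.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson count with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """P(Y = k) under a zero-inflated Poisson.

    With probability pi the count is a structural zero;
    otherwise it is drawn from Poisson(lam).
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def zip_mean_var(lam: float, pi: float) -> tuple[float, float]:
    """Closed-form mean and variance of the ZIP distribution."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var


# Illustrative (assumed) values: 40% structural zeros, Poisson mean 2.
lam, pi = 2.0, 0.4
mean, var = zip_mean_var(lam, pi)

# Compare the zero probability with a pure Poisson of the same mean.
p0_zip = zip_pmf(0, lam, pi)
p0_pois = poisson_pmf(0, mean)

print(f"ZIP mean={mean:.2f}, variance={var:.2f}")
print(f"P(0 faults): ZIP={p0_zip:.3f}, Poisson with same mean={p0_pois:.3f}")
```

Under these assumed values the ZIP variance exceeds its mean, so the equidispersion assumption of the pure Poisson model fails, and a pure Poisson with the same mean assigns noticeably less probability to zero faults.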

MSC:

68N99 Theory of software
62P30 Applications of statistics in engineering and industry; control charts
62M20 Inference from stochastic processes and prediction
