Training data selection for cross-project defect prediction

Steffen Herbold

Abstract

Software defect prediction has been a popular research topic in recent years and is considered a means of optimizing quality assurance activities. Defect prediction can be done in a within-project or a cross-project scenario. The within-project scenario produces high-quality results, but requires historical data from the project, which is often not available. For cross-project prediction, data availability is not an issue, as data from other projects is readily available, e.g., in repositories like PROMISE. However, the quality of cross-project defect prediction results is too low for practical use. Recent research showed that the selection of appropriate training data can improve the quality of cross-project defect predictions. In this paper, we propose distance-based strategies for the selection of training data based on distributional characteristics of the available data. We evaluate the proposed strategies in a large case study with 44 data sets obtained from 14 open source projects. Our results show that our training data selection strategy significantly improves the achieved success rate of cross-project defect predictions. However, the quality of the results still cannot compete with within-project defect prediction.
Keywords: machine learning, defect-prediction, cross-project prediction
Document Type: Articles in Conference Proceedings
Booktitle: Proceedings of the 9th International Conference on Predictive Models in Software Engineering
Series: PROMISE'13
Publisher: ACM
Pages: 6:1-6:10
Month: 10
Year: 2013
DOI: 10.1145/2499393.2499395