Cross-project defect prediction using a connectivity-based unsupervised classifier
Proceedings of the 38th International Conference on Software Engineering, 2016
Defect prediction on projects with limited historical data has attracted great interest from both researchers and practitioners. Cross-project defect prediction has been the main area of progress by reusing classifiers from other projects. However, existing approaches require some degree of homogeneity (e.g., a similar distribution of metric values) between the training projects and the target project. Satisfying the homogeneity requirement often requires significant effort (currently a very active area of research).
An unsupervised classifier does not require any training data, so the heterogeneity challenge no longer applies. In this paper, we examine two types of unsupervised classifiers: a) distance-based classifiers (e.g., k-means); and b) connectivity-based classifiers. While distance-based unsupervised classifiers have previously been used in the defect prediction literature with disappointing performance, connectivity-based classifiers have not been explored before in our community.
We compare the performance of unsupervised classifiers versus supervised classifiers using data from 26 projects from three publicly available datasets (i.e., AEEEM, NASA, and PROMISE). In the cross-project setting, our proposed connectivity-based classifier (via spectral clustering) ranks as one of the top classifiers among five widely used supervised classifiers (i.e., random forest, naive Bayes, logistic regression, decision tree, and logistic model tree) and five unsupervised classifiers (i.e., k-means, partitioning around medoids, fuzzy C-means, neural-gas, and spectral clustering). In the within-project setting (i.e., models are built and applied on the same project), our spectral classifier ranks in the second tier, while only random forest ranks in the first tier. Hence, connectivity-based unsupervised classifiers offer a viable solution for both cross-project and within-project defect prediction.
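To make the idea of a connectivity-based classifier concrete, the following is a minimal sketch of two-way spectral clustering over a matrix of software metrics. It is illustrative only and is not taken from the abstract: the function name `spectral_defect_labels`, the z-score normalization, the clipped dot-product similarity, and the rule that the cluster with the larger average metric values is treated as defective are all assumptions made for this example.

```python
import numpy as np


def spectral_defect_labels(X):
    """Split entities (rows of X) into two clusters via spectral clustering.

    Returns a boolean array where True marks the cluster assumed defective.
    All design choices here are illustrative assumptions, not the paper's
    confirmed procedure.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n < 2:
        return np.zeros(n, dtype=bool)

    # Assumption: z-score each metric so no single metric dominates.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std

    # Assumption: similarity is the dot product between entities,
    # with negative weights clipped to zero and no self-loops.
    W = Z @ Z.T
    np.fill_diagonal(W, 0.0)
    W = np.maximum(W, 0.0)

    # Symmetric normalized graph Laplacian: I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    connected = d > 0
    d_inv_sqrt[connected] = 1.0 / np.sqrt(d[connected])
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; take the eigenvector
    # of the second-smallest one and split the entities by its sign.
    _, vecs = np.linalg.eigh(L)
    v = d_inv_sqrt * vecs[:, 1]
    in_a = v > 0
    if in_a.all() or not in_a.any():
        return np.zeros(n, dtype=bool)  # degenerate split: flag nothing

    # Assumption (heuristic): the cluster whose entities have the larger
    # average sum of normalized metrics is labeled defective. The sign of
    # an eigenvector is arbitrary, so some such rule is needed.
    score = Z.sum(axis=1)
    if score[in_a].mean() >= score[~in_a].mean():
        return in_a
    return ~in_a
```

Because the split depends only on the target project's own metric values, no training project is involved, which is the property the abstract highlights. A production implementation would more likely rely on an established library routine than on this hand-rolled version.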
