
The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction

Published: 01 September 2011

Abstract

This empirical study investigates two commonly used decision tree classification algorithms in the context of cost-sensitive learning. A review of the literature shows that the cost-based performance of a software quality prediction model is usually determined after the model-training process has been completed. In contrast, we incorporate cost-sensitive learning during the model-training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with or without a cost-sensitive learning technique. The paper investigates six cost-sensitive learning techniques: AdaCost, AdaC2, CSB2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data comprise 15 software measurement datasets obtained from several high-assurance systems. In addition to providing a unique insight into the cost-based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model-training process. The practical appeal of this research is that it gives a software quality practitioner a clear process for considering the cost-based performance of a defect prediction model during model training and for analyzing it during model evaluation. RUS is ranked as the best cost-sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1:448-459. DOI: 10.1002/widm.38
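The contrast the abstract draws, between evaluating misclassification costs after training and folding them into training itself, can be made concrete with a short sketch. The following Python fragment is not the paper's experimental setup: scikit-learn's CART trees stand in for C4.5, the dataset is synthetic, and the cost ratio (10:1) and balanced 1:1 undersampling target are invented for illustration. It trains one tree with the false-negative/false-positive cost ratio encoded as class weights during fitting, and one Random Forest on a randomly undersampled (RUS) training set, then compares their average misclassification cost.

```python
# Minimal sketch of two cost-sensitive training strategies (assumed values,
# not the paper's): (1) a misclassification-cost ratio injected into training
# via class weights, and (2) Random Undersampling (RUS) of the majority class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a software measurement dataset:
# class 1 = fault-prone modules (rare), class 0 = not fault-prone (common).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

COST_FP = 1    # cost of flagging a good module for inspection
COST_FN = 10   # assumed cost of missing a fault-prone module

# (1) Cost-sensitive learning during training: the cost ratio biases the
# split criterion and leaf labelling, rather than re-thresholding afterwards.
weighted_tree = DecisionTreeClassifier(
    class_weight={0: COST_FP, 1: COST_FN}, random_state=0).fit(X_tr, y_tr)

# (2) RUS: discard majority-class examples until the classes are balanced,
# then train an unweighted Random Forest on the reduced sample.
majority = np.flatnonzero(y_tr == 0)
minority = np.flatnonzero(y_tr == 1)
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([keep, minority])
rus_forest = RandomForestClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])

def expected_cost(model, X, y):
    """Average misclassification cost per module under the assumed costs."""
    pred = model.predict(X)
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return (COST_FP * fp + COST_FN * fn) / len(y)

for name, model in [("cost-weighted tree", weighted_tree),
                    ("RUS forest", rus_forest)]:
    print(f"{name}: expected cost = {expected_cost(model, X_te, y_te):.3f}")
```

Encoding the cost ratio as a class weight is only one way to make training cost-sensitive; the meta-techniques named in the abstract (AdaCost, AdaC2, CSB2, MetaCost) instead reweight or relabel training instances around a base learner, which is why the study treats them as interchangeable wrappers over C4.5 and Random Forest.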



      Published In

      Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 1, Issue 5
      September 2011
      91 pages

      Publisher

      John Wiley & Sons, Inc.

      United States



      Cited By

      • (2022) Data quality issues in software fault prediction: a systematic literature review. Artificial Intelligence Review 56(8):7839-7908. https://doi.org/10.1007/s10462-022-10371-6
      • (2021) An Empirical Examination of the Impact of Bias on Just-in-time Defect Prediction. Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1-12. https://doi.org/10.1145/3475716.3475791
      • (2020) Deep learning based software defect prediction. Neurocomputing 385(C):100-110. https://doi.org/10.1016/j.neucom.2019.11.067
      • (2018) Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning. Soft Computing 22(10):3461-3472. https://doi.org/10.1007/s00500-018-3093-1
      • (2017) An Improved SDA Based Defect Prediction Framework for Both Within-Project and Cross-Project Class-Imbalance Problems. IEEE Transactions on Software Engineering 43(4):321-339. https://doi.org/10.1109/TSE.2016.2597849
      • (2017) An empirical study for software change prediction using imbalanced data. Empirical Software Engineering 22(6):2806-2851. https://doi.org/10.1007/s10664-016-9488-7
      • (2016) Multiple kernel ensemble learning for software defect prediction. Automated Software Engineering 23(4):569-590. https://doi.org/10.1007/s10515-015-0179-1
      • (2015) Cost-sensitive and ensemble-based prediction model for outsourced software project risk prediction. Decision Support Systems 72(C):11-23. https://doi.org/10.1016/j.dss.2015.02.003
      • (2014) Dictionary learning based software defect prediction. Proceedings of the 36th International Conference on Software Engineering, 414-423. https://doi.org/10.1145/2568225.2568320
