
The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction

Published: 01 September 2011

Abstract

This empirical study investigates two commonly used decision tree classification algorithms in the context of cost-sensitive learning. A review of the literature shows that the cost-based performance of a software quality prediction model is usually determined after the model-training process has been completed. In contrast, we incorporate cost-sensitive learning during the model-training process. The C4.5 and Random Forest decision tree algorithms are used to build defect predictors either with or without a cost-sensitive learning technique. The paper investigates six cost-sensitive learning techniques: AdaCost, AdaC2, CSB2, MetaCost, Weighting, and Random Undersampling (RUS). The case study data comprise 15 software measurement datasets obtained from several high-assurance systems. In addition to providing a unique insight into the cost-based performance of defect prediction models, this study is one of the first to use misclassification cost as a parameter during the model-training process. The practical appeal of this research is that it gives a software quality practitioner a clear process for considering the cost-based performance of a defect prediction model during model training and for analyzing it during model evaluation. RUS is ranked as the best cost-sensitive technique among those considered in this study. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1:448-459. DOI: 10.1002/widm.38
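The contrast the abstract draws, between evaluating misclassification costs after training and folding them into training itself, can be made concrete with a short sketch. The following Python fragment is not the paper's experimental setup: scikit-learn's CART trees stand in for C4.5, the dataset is synthetic, and the cost ratio (10:1) and balanced 1:1 undersampling target are invented for illustration. It trains one tree with the false-negative/false-positive cost ratio encoded as class weights during fitting, and one Random Forest on a randomly undersampled (RUS) training set, then compares their average misclassification cost.

```python
# Minimal sketch of two cost-sensitive training strategies (assumed values,
# not the paper's): (1) a misclassification-cost ratio injected into training
# via class weights, and (2) Random Undersampling (RUS) of the majority class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a software measurement dataset:
# class 1 = fault-prone modules (rare), class 0 = not fault-prone (common).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

COST_FP = 1    # cost of flagging a good module for inspection
COST_FN = 10   # assumed cost of missing a fault-prone module

# (1) Cost-sensitive learning during training: the cost ratio biases the
# split criterion and leaf labelling, rather than re-thresholding afterwards.
weighted_tree = DecisionTreeClassifier(
    class_weight={0: COST_FP, 1: COST_FN}, random_state=0).fit(X_tr, y_tr)

# (2) RUS: discard majority-class examples until the classes are balanced,
# then train an unweighted Random Forest on the reduced sample.
majority = np.flatnonzero(y_tr == 0)
minority = np.flatnonzero(y_tr == 1)
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([keep, minority])
rus_forest = RandomForestClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])

def expected_cost(model, X, y):
    """Average misclassification cost per module under the assumed costs."""
    pred = model.predict(X)
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return (COST_FP * fp + COST_FN * fn) / len(y)

for name, model in [("cost-weighted tree", weighted_tree),
                    ("RUS forest", rus_forest)]:
    print(f"{name}: expected cost = {expected_cost(model, X_te, y_te):.3f}")
```

Encoding the cost ratio as a class weight is only one way to make training cost-sensitive; the meta-techniques named in the abstract (AdaCost, AdaC2, CSB2, MetaCost) instead reweight or relabel training instances around a base learner, which is why the study treats them as interchangeable wrappers over C4.5 and Random Forest.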



      Published In

      Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 1, Issue 5
      September 2011
      91 pages

      Publisher

      John Wiley & Sons, Inc.

      United States



      Cited By

      • (2022) Data quality issues in software fault prediction: a systematic literature review. Artificial Intelligence Review 56(8):7839-7908. https://doi.org/10.1007/s10462-022-10371-6
      • (2021) An Empirical Examination of the Impact of Bias on Just-in-time Defect Prediction. Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1-12. https://doi.org/10.1145/3475716.3475791
      • (2020) Deep learning based software defect prediction. Neurocomputing 385(C):100-110. https://doi.org/10.1016/j.neucom.2019.11.067
      • (2018) Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning. Soft Computing 22(10):3461-3472. https://doi.org/10.1007/s00500-018-3093-1
      • (2017) An Improved SDA Based Defect Prediction Framework for Both Within-Project and Cross-Project Class-Imbalance Problems. IEEE Transactions on Software Engineering 43(4):321-339. https://doi.org/10.1109/TSE.2016.2597849
      • (2017) An empirical study for software change prediction using imbalanced data. Empirical Software Engineering 22(6):2806-2851. https://doi.org/10.1007/s10664-016-9488-7
      • (2016) Multiple kernel ensemble learning for software defect prediction. Automated Software Engineering 23(4):569-590. https://doi.org/10.1007/s10515-015-0179-1
      • (2015) Cost-sensitive and ensemble-based prediction model for outsourced software project risk prediction. Decision Support Systems 72(C):11-23. https://doi.org/10.1016/j.dss.2015.02.003
      • (2014) Dictionary learning based software defect prediction. Proceedings of the 36th International Conference on Software Engineering, 414-423. https://doi.org/10.1145/2568225.2568320
