
On the Robustness of Decision Tree Learning Under Label Noise

  • Conference paper

In: Advances in Knowledge Discovery and Data Mining (PAKDD 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10234)

Abstract

In most practical problems of classifier learning, the training data suffers from label noise. Most theoretical results on robustness to label noise involve either estimation of noise rates or non-convex optimization, and none of these results are applicable to standard decision tree learning algorithms. This paper presents a theoretical analysis showing that, under some assumptions, many popular decision tree learning algorithms are inherently robust to label noise. We also present sample complexity results that bound the sample size needed for this robustness to hold with high probability. Through extensive simulations we illustrate this robustness.
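As a quick illustration of the kind of robustness studied in the paper (a minimal sketch only, not the authors' experimental setup: the synthetic dataset, the noise rate, and the min_samples_leaf value are assumptions made here), one can corrupt the training labels with symmetric noise and compare a standard Gini-based decision tree trained on clean versus noisy labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Minimal sketch: train a Gini-based decision tree on clean labels and on
# labels corrupted by symmetric noise, then compare accuracy on a clean test set.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

eta = 0.3                                      # symmetric noise rate (< 1/2)
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < eta             # flip each training label w.p. eta
y_noisy = np.where(flip, 1 - y_tr, y_tr)

for labels, name in [(y_tr, "clean labels"), (y_noisy, "noisy labels")]:
    # min_samples_leaf keeps node sample sizes large, matching the regime in
    # which the robustness results apply (no pruning is used).
    tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=50,
                                  random_state=0).fit(X_tr, labels)
    print(name, tree.score(X_te, y_te))        # test accuracy on clean labels
```

With enough samples per leaf, the two test accuracies typically come out close, which is the behaviour the analysis predicts for symmetric noise rates below 1/2.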


Notes

  1. For simplicity, we do not consider pruning of the tree.



Author information

Correspondence to P. S. Sastry.

A Proof Sketch of Lemmas 1, 2

Let \(n^+\) (\({\tilde{n}}^+\)) and \(n^-\) (\({\tilde{n}}^-\)) denote the numbers of positive and negative samples at the node in the noise-free (noisy) case. Taking the positive class to be the majority, we have \(\rho = (n^+ - n^-)/n\). Using the Hoeffding bound it is easy to show that \(\Pr [{\tilde{n}}^+ - {\tilde{n}}^- < 0]\le \exp \left( -\frac{\rho ^2 n (1 - 2 \eta )^2}{2}\right) \). Hence \(n > \frac{2}{\rho ^2 (1 - 2 \eta )^2} \ln (\frac{1}{\delta })\) samples suffice to keep this probability below \(\delta \), completing the proof of Lemma 1.
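This bound is easy to sanity-check numerically. The sketch below (an illustration only; the values of \(\rho \), \(\eta \) and \(\delta \) are arbitrary choices) draws the prescribed number of labels at a node, flips each one independently with probability \(\eta \), and estimates how often the noisy minority overtakes the noisy majority, comparing the estimate with the Hoeffding bound above.

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_flip_prob(n, rho, eta, trials=5000):
    """Estimate Pr[tilde_n_plus - tilde_n_minus < 0] at a node with n samples,
    label margin rho = (n_plus - n_minus)/n and symmetric noise rate eta."""
    n_pos = int(round(n * (1 + rho) / 2))        # majority (positive) count
    labels = np.r_[np.ones(n_pos), -np.ones(n - n_pos)]
    flips = rng.random((trials, n)) < eta        # each label flips w.p. eta
    noisy = np.where(flips, -labels, labels)
    return np.mean(noisy.sum(axis=1) < 0)

rho, eta, delta = 0.2, 0.3, 0.05                 # arbitrary illustrative values
n = int(np.ceil(2.0 / (rho**2 * (1 - 2 * eta)**2) * np.log(1.0 / delta)))
hoeffding = np.exp(-rho**2 * n * (1 - 2 * eta)**2 / 2)
print(n, hoeffding, majority_flip_prob(n, rho, eta))
# The empirical flip probability stays below the Hoeffding bound, which in
# turn is at most delta for this choice of n.
```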

Let \(n, n_l, n_r\) be the numbers of samples at \(v, v_l, v_r\), and recall that \(n_l=an\) and \(n_r=(1-a)n\). Recall also that \({\tilde{p}}, {\tilde{p}}_l, {\tilde{p}}_r\) are the fractions of positive samples at \(v, v_l, v_r\), and that \(p^{\eta }, p^{\eta }_l, p^{\eta }_r\) are their large-sample values. Then, using Hoeffding bounds we get (with \(\epsilon _1=\epsilon \), \(\epsilon _2=\epsilon /\sqrt{a}\) and \(\epsilon _3=\epsilon /\sqrt{1-a}\)),

$$\begin{aligned} \Pr \left[ \big (|\tilde{p}-p^{\eta }|\ge \epsilon _1 \big ) \cup \big (|\tilde{p}_l-p_l^{\eta }|\ge \epsilon _2\big ) \cup \big (|\tilde{p}_r-p^{\eta }_r|\ge \epsilon _3\big )\right] \le 6e^{-2n\epsilon ^2} \end{aligned}$$
(5)

When none of these deviations occurs (an event of probability at least \(1-6e^{-2n\epsilon ^2}\)), some algebraic manipulation shows that, for the Gini impurity, \(|\hat{\text {gain}}_{\text {Gini}}^{\eta }(f)-\text {gain}_{\text {Gini}}^{\eta }(f)|\le 6(1-2\eta )\epsilon \), where \(\hat{\text {gain}}_{\text {Gini}}^{\eta }\) is the empirical Gini gain under noise with sample size n and \(\text {gain}_{\text {Gini}}^{\eta }\) is its large-sample limit. This gives the bound needed in Lemma 2. The lemma can be proved similarly for the other splitting criteria.
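For intuition on why the noisy gain tracks the noise-free gain, recall that under symmetric noise the large-sample positive fraction at any node becomes \(p^{\eta } = (1-\eta )p + \eta (1-p)\). The short numerical sketch below (an illustration only; the randomly drawn values of \(a, p_l, p_r\) are arbitrary) checks that this transformation scales every large-sample Gini gain by the common factor \((1-2\eta )^2\), so the split maximizing the gain is unchanged in the large-sample limit.

```python
import numpy as np

def gini(p):                     # Gini impurity of a node with positive fraction p
    return 2.0 * p * (1.0 - p)

def gini_gain(p, p_l, p_r, a):   # a = fraction of the node's samples going left
    return gini(p) - a * gini(p_l) - (1.0 - a) * gini(p_r)

def noisy(p, eta):               # large-sample positive fraction under symmetric noise
    return (1.0 - eta) * p + eta * (1.0 - p)

eta = 0.3
rng = np.random.default_rng(1)
for _ in range(5):
    a, p_l, p_r = rng.random(3)              # arbitrary split parameters
    p = a * p_l + (1.0 - a) * p_r            # parent fraction is the mixture
    clean = gini_gain(p, p_l, p_r, a)
    noised = gini_gain(noisy(p, eta), noisy(p_l, eta), noisy(p_r, eta), a)
    # Both printed values agree: the ratio equals (1 - 2*eta)**2 = 0.16.
    print(round(noised / clean, 6), (1 - 2 * eta) ** 2)
```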


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ghosh, A., Manwani, N., Sastry, P.S. (2017). On the Robustness of Decision Tree Learning Under Label Noise. In: Kim, J., Shim, K., Cao, L., Lee, J.G., Lin, X., Moon, Y.S. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science (LNAI), vol 10234. Springer, Cham. https://doi.org/10.1007/978-3-319-57454-7_53


  • DOI: https://doi.org/10.1007/978-3-319-57454-7_53

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-57453-0

  • Online ISBN: 978-3-319-57454-7

  • eBook Packages: Computer Science, Computer Science (R0)
