Abstract
In most practical problems of classifier learning, the training data suffers from label noise. Most theoretical results on robustness to label noise involve either estimating the noise rates or non-convex optimization, and none of them apply to standard decision tree learning algorithms. This paper presents a theoretical analysis showing that, under some assumptions, many popular decision tree learning algorithms are inherently robust to label noise. We also present sample complexity results that bound the sample size needed for this robustness to hold with high probability. We illustrate the robustness through extensive simulations.
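As a rough illustration of the kind of simulation referred to above (a minimal sketch, not the authors' experimental protocol; the synthetic data set, noise rate, and tree settings are arbitrary choices), one can flip training labels symmetrically with rate \(\eta\) and compare the test accuracy of a standard decision tree learned from clean versus noisy labels:

```python
# Minimal robustness check under symmetric label noise (illustrative only;
# the data set, noise rate eta, and tree parameters are arbitrary choices).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

eta = 0.3                                   # symmetric noise rate, must be < 0.5
flip = rng.random(len(y_tr)) < eta
y_noisy = np.where(flip, 1 - y_tr, y_tr)    # each training label flipped w.p. eta

for labels, name in [(y_tr, "clean"), (y_noisy, "noisy")]:
    tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=50,
                                  random_state=0).fit(X_tr, labels)
    print(name, "labels -> test accuracy:", round(tree.score(X_te, y_te), 3))
```

With enough samples per node (enforced crudely here via min_samples_leaf), the two accuracies are typically close, which is the behaviour the analysis in the paper predicts.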
Notes
1. For simplicity, we do not consider pruning of the tree.
A Proof Sketch of Lemmas 1, 2
Let \(n^+\) (\({\tilde{n}}^+\)) and \(n^-\) (\({\tilde{n}}^-\)) denote the number of positive and negative samples at the node in the noise-free (noisy) case. Taking the positive class to be the majority, we write \(\rho = (n^+ - n^-)/n\). Using the Hoeffding bound, it is easy to show that \(\Pr [{\tilde{n}}^+ - {\tilde{n}}^- < 0]\le \exp \left( -\frac{\rho ^2 n (1 - 2 \eta )^2}{2}\right) \). Hence the number of samples needed is bounded as \(n > \frac{2}{\rho ^2 (1 - 2 \eta )^2} \ln (\frac{1}{\delta })\), which completes the proof of Lemma 1.
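As a hypothetical numerical check of this bound (the values of \(\rho\), \(\eta\), and \(\delta\) below are arbitrary illustrative choices, not taken from the paper), one can evaluate the sample-size expression and estimate the flip probability by Monte Carlo:

```python
# Numerical check of the Lemma 1 bound (rho, eta, delta are arbitrary choices).
import numpy as np

rho, eta, delta = 0.2, 0.3, 0.05
n = int(np.ceil(2.0 / (rho**2 * (1 - 2 * eta) ** 2) * np.log(1.0 / delta)))
print("samples needed by the bound:", n)

rng = np.random.default_rng(0)
n_pos = int(round(n * (1 + rho) / 2))           # noise-free positive count n^+
n_neg = n - n_pos                               # noise-free negative count n^-
trials = 20000
flip_pos = rng.random((trials, n_pos)) < eta    # positives flipped to negative
flip_neg = rng.random((trials, n_neg)) < eta    # negatives flipped to positive
noisy_pos = n_pos - flip_pos.sum(axis=1) + flip_neg.sum(axis=1)
prob_flip = np.mean(noisy_pos < n - noisy_pos)  # event {tilde n^+ - tilde n^- < 0}
print("estimated flip probability:", prob_flip)
print("Hoeffding bound           :", np.exp(-rho**2 * n * (1 - 2 * eta) ** 2 / 2))
```

The estimated probability should come out below the Hoeffding bound, which in turn is at most \(\delta\) for this choice of n.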
Let \(n, n_l, n_r\) be the number of samples at \(v, v_l, v_r\), and recall that \(n_l=an\) and \(n_r=(1-a)n\). Recall also that \({\tilde{p}}, {\tilde{p}}_l, {\tilde{p}}_r\) are the fractions of positive samples at \(v, v_l, v_r\) and \(p^{\eta }, p^{\eta }_l, p^{\eta }_r\) are their large-sample values. Then, using Hoeffding bounds (with \(\epsilon _1=\epsilon \), \(\epsilon _2=\epsilon /\sqrt{a}\) and \(\epsilon _3=\epsilon /\sqrt{1-a}\)), we get \(\Pr \left[ |{\tilde{p}}-p^{\eta }|\le \epsilon _1,\; |{\tilde{p}}_l-p^{\eta }_l|\le \epsilon _2,\; |{\tilde{p}}_r-p^{\eta }_r|\le \epsilon _3\right] \ge 1-6\exp (-2n\epsilon ^2)\).
When this event occurs, some algebraic manipulation shows that, for the Gini impurity, \(|\hat{\text {gain}}_{\text {Gini}}^{\eta }(f)-\text {gain}_{\text {Gini}}^{\eta }(f)|\le 6(1-2\eta )\epsilon \), where \(\hat{\text {gain}}_{\text {Gini}}^{\eta }\) is the (random) Gini gain under noise computed from n samples and \(\text {gain}_{\text {Gini}}^{\eta }\) is its large-sample limit. This gives the bound needed in Lemma 2. The lemma can be proved similarly for the other splitting criteria.
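To illustrate this concentration empirically (a sketch under the assumption of symmetric noise, for which the noisy positive fraction is \((1-2\eta )q + \eta \) when the noise-free fraction is q; the split parameters a, \(p_l\), \(p_r\) and the noise rate below are arbitrary choices), one can draw n noisy samples for a fixed split and compare the empirical Gini gain with its large-sample value:

```python
# Concentration of the noisy Gini gain around its large-sample limit
# (split parameters a, p_l, p_r and noise rate eta are arbitrary choices).
import numpy as np

def gini(p):
    return 2 * p * (1 - p)

def noisy(q, eta):
    # positive fraction after symmetric label noise with rate eta
    return (1 - 2 * eta) * q + eta

a, p_l, p_r, eta = 0.4, 0.8, 0.3, 0.2
p = a * p_l + (1 - a) * p_r                  # noise-free positive fraction at v
gain_limit = (gini(noisy(p, eta))
              - a * gini(noisy(p_l, eta))
              - (1 - a) * gini(noisy(p_r, eta)))

rng = np.random.default_rng(0)
for n in [500, 5000, 50000]:
    n_l = int(a * n); n_r = n - n_l          # n_l = a*n, n_r = (1-a)*n
    devs = []
    for _ in range(200):
        pos_l = rng.random(n_l) < noisy(p_l, eta)   # noisy labels at v_l
        pos_r = rng.random(n_r) < noisy(p_r, eta)   # noisy labels at v_r
        pt_l, pt_r = pos_l.mean(), pos_r.mean()
        pt = (pos_l.sum() + pos_r.sum()) / n
        gain_hat = gini(pt) - a * gini(pt_l) - (1 - a) * gini(pt_r)
        devs.append(abs(gain_hat - gain_limit))
    print(f"n={n:6d}  mean |gain_hat - gain_limit| = {np.mean(devs):.4f}")
```

The deviation shrinks at roughly the \(1/\sqrt{n}\) rate that the Hoeffding argument predicts.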
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ghosh, A., Manwani, N., Sastry, P.S. (2017). On the Robustness of Decision Tree Learning Under Label Noise. In: Kim, J., Shim, K., Cao, L., Lee, J.-G., Lin, X., Moon, Y.-S. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science, vol. 10234. Springer, Cham. https://doi.org/10.1007/978-3-319-57454-7_53