Finding Influential Training Samples for Gradient Boosted Decision Trees

Sharchilev, Boris; Ustinovsky, Yury; Serdyukov, Pavel; de Rijke, Maarten


arXiv:1802.06640 (cs)

[Submitted on 19 Feb 2018 (v1), last revised 12 Mar 2018 (this version, v2)]

Abstract: We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model's predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.
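
To make the leave-one-out formulation in the abstract concrete, here is a minimal Python sketch. It is not the authors' method: it uses a single regression stump with a hand-picked split threshold (a toy stand-in for a GBDT ensemble), refits only the leaf means after dropping each training sample, and reports how the prediction at one test point changes. The names `fit_stump`, `predict`, and `loo_influences`, the data, and the threshold are all hypothetical and chosen for illustration only. Holding the split fixed mirrors the abstract's assumption that tree structures remain fixed.

```python
def fit_stump(xs, ys, threshold):
    """Fit the two leaf values (means) of a one-split stump with a FIXED threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    # An empty leaf falls back to 0.0; this is an arbitrary choice for the toy.
    left_value = sum(left) / len(left) if left else 0.0
    right_value = sum(right) / len(right) if right else 0.0
    return left_value, right_value


def predict(stump, threshold, x):
    """Route x to the left or right leaf and return that leaf's value."""
    left_value, right_value = stump
    return left_value if x <= threshold else right_value


def loo_influences(xs, ys, threshold, x_test):
    """Change in the prediction at x_test when each training sample is left out."""
    full_pred = predict(fit_stump(xs, ys, threshold), threshold, x_test)
    influences = []
    for i in range(len(xs)):
        # Brute-force retraining without sample i (structure, i.e. threshold, unchanged).
        xs_loo = xs[:i] + xs[i + 1:]
        ys_loo = ys[:i] + ys[i + 1:]
        loo_pred = predict(fit_stump(xs_loo, ys_loo, threshold), threshold, x_test)
        influences.append(loo_pred - full_pred)
    return influences


# Hypothetical toy data: three samples fall in each leaf for threshold 5.0.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [1.0, 1.0, 4.0, 10.0, 10.0, 10.0]
influences = loo_influences(xs, ys, threshold=5.0, x_test=2.5)
print(influences)  # [0.5, 0.5, -1.0, 0.0, 0.0, 0.0]
```

In this toy, only samples that share a leaf with the test point have non-zero influence, and the sample with target 4.0 has the largest effect. Brute-force retraining like this costs one refit per training sample, which is the expense that efficient approximations of the kind the abstract describes aim to avoid.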

Comments:	Added the "Acknowledgements" section
Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1802.06640 [cs.LG]
	(or arXiv:1802.06640v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1802.06640

Submission history

From: Boris Sharchilev
[v1] Mon, 19 Feb 2018 14:19:40 UTC (94 KB)
[v2] Mon, 12 Mar 2018 19:12:03 UTC (94 KB)
