scikit-learn: _permutation_importance.py at 38fba057 (scikit-learn/scikit-learn, GitHub)

This Python file contains code for calculating permutation importance, a technique for evaluating the importance of features in machine learning models. It takes in an already fitted estimator, data X, targets y, and optionally other parameters like the scoring metric, number of repeats, and number of jobs. It calculates a baseline score for the estimator on the data, then repeatedly shuffles each feature column and re-scores to calculate the drop in performance, which is the permutation importance of that feature. It returns a Bunch object containing the mean, standard deviation, and raw scores of feature importances.
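
The procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the scikit-learn implementation: the names `accuracy` and `permutation_importance_sketch`, the use of the standard `random` and `statistics` modules, and the toy "model" (a function that predicts the value of the first feature) are all assumptions made for the example.

```python
import random
import statistics


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the target."""
    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Return (means, stds) of the score drop for each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    means, stds = [], []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column; every other column stays untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            # Importance = baseline score minus score on permuted data.
            drops.append(baseline - accuracy(predict, X_perm, y))
        means.append(statistics.mean(drops))
        # Population standard deviation, matching NumPy's default ddof=0.
        stds.append(statistics.pstdev(drops))
    return means, stds


# Hypothetical stand-in for a fitted model: predict the first feature.
X = [[1, 9, 9], [1, 9, 9], [1, 9, 9], [0, 9, 9], [0, 9, 9], [0, 9, 9]]
y = [1, 1, 1, 0, 0, 0]
means, stds = permutation_importance_sketch(
    lambda row: row[0], X, y, n_repeats=10
)
```

Because the second and third columns are constant, shuffling them cannot change any prediction, so their importance is exactly zero; only the first column can show a drop in accuracy.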

Uploaded by

June June

2beed55847

scikit-learn / sklearn / inspection / _permutation_importance.py

thomasjpfan
FIX Passes global configuration when spawning joblib jobs (#176… …

"""Permutation importance for estimators."""
import numpy as np
from joblib import Parallel

from ..metrics import check_scoring
from ..utils import Bunch
from ..utils import check_random_state
from ..utils import check_array
from ..utils.validation import _deprecate_positional_args
from ..utils.fixes import delayed


def _weights_scorer(scorer, estimator, X, y, sample_weight):
    if sample_weight is not None:
        return scorer(estimator, X, y, sample_weight)
    return scorer(estimator, X, y)


def _calculate_permutation_scores(estimator, X, y, sample_weight, col_idx,
                                  random_state, n_repeats, scorer):
    """Calculate score when `col_idx` is permuted."""
    random_state = check_random_state(random_state)

    # Work on a copy of X to ensure thread-safety in case of threading based
    # parallelism. Furthermore, making a copy is also useful when the joblib
    # backend is 'loky' (default) or the old 'multiprocessing': in those cases,
    # if X is large it will automatically be backed by a readonly memory map
    # (memmap). X.copy() on the other hand is always guaranteed to return a
    # writable data-structure whose columns can be shuffled inplace.
    X_permuted = X.copy()
    scores = np.zeros(n_repeats)
    shuffling_idx = np.arange(X.shape[0])
    for n_round in range(n_repeats):
        random_state.shuffle(shuffling_idx)
        if hasattr(X_permuted, "iloc"):
            col = X_permuted.iloc[shuffling_idx, col_idx]
            col.index = X_permuted.index
            X_permuted.iloc[:, col_idx] = col
        else:
            X_permuted[:, col_idx] = X_permuted[shuffling_idx, col_idx]
        feature_score = _weights_scorer(
            scorer, estimator, X_permuted, y, sample_weight
        )
        scores[n_round] = feature_score

    return scores


@_deprecate_positional_args
def permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5,
                           n_jobs=None, random_state=None, sample_weight=None):
    """Permutation importance for feature evaluation [BRE]_.

    The :term:`estimator` is required to be a fitted estimator. `X` can be the
    data set used to train the estimator or a hold-out set. The permutation
    importance of a feature is calculated as follows. First, a baseline metric,
    defined by :term:`scoring`, is evaluated on a (potentially different)
    dataset defined by the `X`. Next, a feature column from the validation set
    is permuted and the metric is evaluated again. The permutation importance
    is defined to be the difference between the baseline metric and metric from
    permuting the feature column.

    Read more in the :ref:`User Guide <permutation_importance>`.

    Parameters
    ----------
    estimator : object
        An estimator that has already been :term:`fitted` and is compatible
        with :term:`scorer`.

    X : ndarray or DataFrame, shape (n_samples, n_features)
        Data on which permutation importance will be computed.

    y : array-like or None, shape (n_samples, ) or (n_samples, n_classes)
        Targets for supervised or `None` for unsupervised.

    scoring : string, callable or None, default=None
        Scorer to use. It can be a single
        string (see :ref:`scoring_parameter`) or a callable (see
        :ref:`scoring`). If None, the estimator's default scorer is used.

    n_repeats : int, default=5
        Number of times to permute a feature.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel. The computation is done by computing
        permutation score for each column and parallelized over the columns.
        `None` means 1 unless in a :obj:`joblib.parallel_backend` context.
        `-1` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    random_state : int, RandomState instance, default=None
        Pseudo-random number generator to control the permutations of each
        feature.
        Pass an int to get reproducible results across function calls.
        See :term:`Glossary <random_state>`.

    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights used in scoring.

        .. versionadded:: 0.24

    Returns
    -------
    result : :class:`~sklearn.utils.Bunch`
        Dictionary-like object, with the following attributes.

        importances_mean : ndarray, shape (n_features, )
            Mean of feature importance over `n_repeats`.
        importances_std : ndarray, shape (n_features, )
            Standard deviation over `n_repeats`.
        importances : ndarray, shape (n_features, n_repeats)
            Raw permutation importance scores.

    References
    ----------
    .. [BRE] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32,
             2001. https://doi.org/10.1023/A:1010933404324

    Examples
    --------
    >>> from sklearn.linear_model import LogisticRegression
    >>> from sklearn.inspection import permutation_importance
    >>> X = [[1, 9, 9],[1, 9, 9],[1, 9, 9],
    ...      [0, 9, 9],[0, 9, 9],[0, 9, 9]]
    >>> y = [1, 1, 1, 0, 0, 0]
    >>> clf = LogisticRegression().fit(X, y)
    >>> result = permutation_importance(clf, X, y, n_repeats=10,
    ...                                 random_state=0)
    >>> result.importances_mean
    array([0.4666..., 0.       , 0.       ])
    >>> result.importances_std
    array([0.2211..., 0.       , 0.       ])
    """
    if not hasattr(X, "iloc"):
        X = check_array(X, force_all_finite='allow-nan', dtype=None)

    # Precompute random seed from the random state to be used
    # to get a fresh independent RandomState instance for each
    # parallel call to _calculate_permutation_scores, irrespective of
    # the fact that variables are shared or not depending on the active
    # joblib backend (sequential, thread-based or process-based).
    random_state = check_random_state(random_state)
    random_seed = random_state.randint(np.iinfo(np.int32).max + 1)

    scorer = check_scoring(estimator, scoring=scoring)
    baseline_score = _weights_scorer(scorer, estimator, X, y, sample_weight)

    scores = Parallel(n_jobs=n_jobs)(delayed(_calculate_permutation_scores)(
        estimator, X, y, sample_weight, col_idx, random_seed, n_repeats, scorer
    ) for col_idx in range(X.shape[1]))

    importances = baseline_score - np.array(scores)
    return Bunch(importances_mean=np.mean(importances, axis=1),
                 importances_std=np.std(importances, axis=1),
                 importances=importances)
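
As a usage illustration, the public function can be called on a hold-out set and the returned Bunch inspected feature by feature. This is a minimal sketch under stated assumptions: the synthetic data from `make_classification`, the choice of `LogisticRegression`, and the train/test split are picked only for the example and are not part of the file above.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this example): 5 features, 2 informative.
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The estimator must already be fitted before it is passed in.
clf = LogisticRegression().fit(X_train, y_train)

# Score on the hold-out set; each column is permuted n_repeats times.
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=0, n_jobs=1)

# Rank features by mean importance, largest score drop first.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}"
          f" +/- {result.importances_std[idx]:.3f}")
```

`result.importances` has shape `(n_features, n_repeats)`, and `importances_mean` and `importances_std` are its row-wise mean and standard deviation, as the final `return` statement of the function shows.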
