
[MRG] Estimator tags #8022

Merged
merged 219 commits into from
Feb 23, 2019

Conversation

amueller
Member

@amueller amueller commented Dec 8, 2016

Fixes #6599, #7737, #6554, #6715

Also see #6510

Todo

  • remove fixme's, enable all tests again
  • rename tags so they are false by default
  • rewrite tagging mechanism
  • rethink _required_parameters

For another PR:

  • add tests on compound estimators like pipeline, gridsearchcv, ...
  • fix MultiOutputEstimator
  • replace _estimator_type with tags
  • tag for sample weight support: has_fit_parameter doesn't work for meta-estimators
  • remove cross_decomposition special cases (maybe)
  • remove positive data special cases
  • input type for sparse data?
  • binary only classifiers
  • biclustering
  • make sure that enough tags are defined? ChainClassifier had no ClassifierMixin :-/

I decided to leave the _estimator_type for later because this PR is already too big.

Fun issue I just thought about: Can we even define the tags for a pipeline?
How do we know whether a pipeline can handle missing data? The scalers pass it through (and the tag says they "handle" missing data) while the imputers actually impute it. So either these need different tags, or we punt on defining tags for pipelines: if you want to run the common tests on a pipeline, its author has to set the tags for that specific pipeline. That doesn't seem ideal, though.
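To make the ambiguity concrete, here is a toy sketch of one naive composition rule (the `combine_pipeline_tags` helper and the `handles_missing` tag are purely illustrative, not part of this PR):

```python
# Illustrative sketch only: naively AND a hypothetical "handles_missing"
# tag across a pipeline's steps. The helper and tag name are invented
# for this example and are not this PR's API.

def combine_pipeline_tags(steps):
    """Combine tags of (name, estimator) steps by AND-ing them."""
    combined = {"handles_missing": True}
    for _, est in steps:
        # Fall back to "doesn't handle it" if a step defines no tags.
        tags = est._get_tags() if hasattr(est, "_get_tags") else {}
        combined["handles_missing"] &= tags.get("handles_missing", False)
    return combined
```

AND-ing says a scaler-plus-imputer pipeline cannot handle missing values even when the imputer removes them before any later step sees them, while OR-ing would claim too much; neither rule recovers the pipeline's real behavior, which is exactly the problem described above.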

@amueller amueller changed the title make common tests work on estimator instances, not classes [WIP] make common tests work on estimator instances, not classes Dec 8, 2016
@amueller
Member Author

amueller commented Dec 9, 2016

This starts to look good, though I have no idea what's happening with GaussianRandomProjectionHash.

There's an issue with DummyRegressor and DummyClassifier not reaching the accuracy we're testing for (and the same is true for some of the NB classifiers). I'm not entirely sure how to handle that with tags -- interestingly, I haven't used any tags at all so far.
But a tag saying "this usually yields bad results" seems kinda odd.

@amueller amueller mentioned this pull request Dec 12, 2016
18 tasks
@amueller amueller changed the title [WIP] make common tests work on estimator instances, not classes [WIP] Estimator tags and making common tests work on classes... Dec 12, 2016
@amueller
Member Author

Solves #6079 #6715 #7737 #6981 (potentially but not yet for the last one).
Also #7289.

@amueller
Member Author

So I'm allowing np.float64 as a parameter to __init__ because that was used in some of the vectorizers. Alternatively we could change the default to a string instead of a type.

@amueller
Member Author

This should be of interest to @GaelVaroquaux @jnothman @jakevdp and @mblondel.
It would be great to get some initial feedback.
This PR does three things:

  • it changes the common tests to work on instances instead of classes
  • it replaces all the hard-coded assumptions about estimators with tags and the _required_parameters class attribute.
  • it fixes API inconsistencies to make the tests pass.

If you want, I can make the first thing its own PR, because this looks like it's gonna be a big one.
Changing to instances instead of classes for tests is motivated by using instance methods for the tags and #7289. With instance-level common tests, we can write
check_estimator(make_pipeline(PCA(), SVC(kernel="linear"))), which I think is good.
Though the separate PR for the instance-level tests will be pretty short compared to the rest.

Tests are currently failing because I'm still working on the third bullet point.

@amueller
Member Author

OneVsOneClassifier returns a decision_function of shape (n_samples, 2) for a two-class problem. It's the only classifier that does so, as far as I can see. OneVsRestClassifier returns (n_samples, 1).

@amueller
Member Author

amueller commented Dec 13, 2016

Hm, so will _get_tags() be part of the API requirements, or do I need to hasattr every time? We probably shouldn't rely on it being present, so a helper it is.

@amueller
Member Author

I decided to make TfidfTransformer support dense input. That was simpler than writing a new input validation tool that always converts to sparse, no matter what the input. It also seems slightly more sensible than the previous behavior. It does mean, though, that if you provide dense input, it produces dense output, which is a change from before.

@amueller
Member Author

Ok, so @GaelVaroquaux @jnothman I need some input. My current solution for _get_tags is not working. This is a bit lengthy but I'm stuck and I don't understand the motivation for our current class hierarchy.

I'm confused by the method resolution order in our class hierarchy.
We always write BaseEstimator to the left, which means it comes before the mixins in the MRO, which means the mixins can't actually change any of the BaseEstimator behavior.
So if I declare some basic tags in BaseEstimator and want to extend them in ClassifierMixin, I need to call super in BaseEstimator, and then handle the return from the mixin. That seems really weird to me.

I assumed I can call super in the mixin and they'll all resolve before BaseEstimator, but exactly the opposite is the case.

I can see the following ways to fix this:

  1. actually call super in BaseEstimator and in all mixins and check if super has the _get_tags method (it could always be object as far as I can see). The code in BaseEstimator then is a bit weird because it needs to take the defaults and overwrite them with what was given by the super call. This solution feels really wrong to me.

  2. Make all mixins inherit from BaseEstimator and remove BaseEstimator from all the estimators (if we add it to the mixins, Python requires us to remove it from the estimators to have a well-defined MRO). This puts BaseEstimator at the top of the class hierarchy, it will always resolve last, and everything is good.

  3. Change the inheritance order in all estimators to have BaseEstimator at the very right. That puts it last in the MRO and everything is good.

  4. Add a base class (I call it KingOfDiamonds) on top of BaseEstimator and all the mixins, and implement the default _get_tags() there. BaseEstimator doesn't even need to implement _get_tags then, only KingOfDiamonds does. It will always be last in the MRO and everything is good.

From a code perspective, 4 is the least intrusive (fewest lines changed), but it adds another layer to the inheritance hierarchy.
3 is the least intrusive in terms of class hierarchy; it "only" changes the MRO. It relies on the order of the classes in multiple inheritance, though, which seems somewhat fragile (*).
2 is somewhere in between because it changes both the inheritance structure and the code, but it's relatively robust and doesn't introduce another class.

  • With the current setup, if two mixins try to change the same tag, but in opposite directions, then the order of the mixins will matter. That seems like a bad idea that can be avoided in practice, though.

What is the motivation to have BaseEstimator to the left in the first place, and why don't we let the mixins inherit from it?
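The asymmetry is easy to demonstrate with toy classes (a standalone sketch, not the real scikit-learn classes):

```python
# Minimal demonstration of the MRO problem described above.

class Base:
    def _get_tags(self):
        return {"multioutput": False}

class Mixin:
    def _get_tags(self):
        # Only reaches Base if Base comes *after* Mixin in the MRO.
        tags = super()._get_tags()
        tags.update(multioutput=True)
        return tags

class MixinLeft(Mixin, Base):   # option 3: base class on the right
    pass

class BaseLeft(Base, Mixin):    # the current "BaseEstimator to the left" layout
    pass

print(MixinLeft()._get_tags())  # {'multioutput': True}
print(BaseLeft()._get_tags())   # {'multioutput': False} -- Mixin never runs
```

With the base class on the left, its `_get_tags` shadows the mixin's entirely, which is exactly why declaring defaults in BaseEstimator and extending them in mixins doesn't work in the current hierarchy.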

@jnothman
Member

I think mixins are meant to come before base classes in the MRO, although I've not confirmed this intuition with online resources (I'm offline). From the perspective of purity, 3 is therefore my preferred solution, though I appreciate it is technically not backwards compatible, and is somewhat brittle.

I think another partial solution is that _get_tags could be conventionally defined with a helper, like:

def _get_tags(self):
    return base.extend_tags(self, tags={'a': 5, 'b': 6}, allow_overwrite=['b'])

extend_tags would be implemented to allow super's _get_tags to not exist. allow_overwrite would ensure that no tag overwriting can take place, except where whitelisted. At least if I understand the problem, this should help...?
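For illustration, the merge semantics could look something like this (hypothetical code; extend_tags does not exist in sklearn.base, and this variant takes the already-collected tags instead of self to sidestep the super() lookup question):

```python
# Hypothetical sketch of extend_tags merge semantics: new tags extend
# the already-collected ones, and overwriting an existing tag is
# silently ignored unless the key is whitelisted in allow_overwrite.

def extend_tags(existing, new, allow_overwrite=()):
    merged = dict(existing)
    for key, value in new.items():
        if key not in merged or key in allow_overwrite:
            merged[key] = value
        # else: keep the earlier definition (silent ignore)
    return merged
```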

@jnothman jnothman left a comment

A skim. I'm yet to look at estimator_checks. I'm not sure this interface is perfect, and I've been wondering about just having an attribute on the objects for each trait. But that violates everything :)


The current set of estimator tags are:

input_validation - whether the estimator does input-validation. This is only meant for stateless and dummy transformers!
Member

what of ensembles/metaestimators that largely delegate their validation to some base/wrapped estimator?

Member Author

I'm currently basing all this on the tests, which I think is ok as a first approach and given the issues that were motivating this. metaestimators and ensembles are instantiated with well-behaved estimators for now, so they pass the tests.

Member

I don't get it, what does this mean? Only stateless and dummy transformers should set this to True? or to False? and why?

Might be clearer to document the default tag values here too

Member

Can I check whether text feature extractors are considered to validate?

Member

This list should be formatted as a RST definition list


input_validation - whether the estimator does input-validation. This is only meant for stateless and dummy transformers!
multioutput - whether a regressor supports multi-target outputs or a classifier supports multi-class multi-output.
multilabel - whether the estimator supports multilabel output
Member

another one for sparse multilabel?

Member Author

So far I only added tags that were needed for the test to pass without special cases - actually I didn't need the sparse data one so far or the meta-estimator!

sklearn/base.py Outdated
@@ -12,8 +12,18 @@
from .utils.fixes import signature
from . import __version__

_DEFAULT_TAGS = {
'input_types': ['2darray'],
Member

by input you mean first arg to Estimator().fit?

Member Author

yes. Maybe "data_types"? We have 1d, 2d, 3d (patch extractor), sparse, dictionaries and text. Also tuples for dict-vectorizer (or was it hashing vectorizer?) I think.

Member

Would X_types be better than data_types?

sklearn/base.py Outdated
"""Mixin class for all meta estimators in scikit-learn."""
# this is just a tag for the moment
def _get_tags(self):
    tags = super(MetaEstimatorMixin, self)._get_tags().copy()
    tags.update(is_meta_estimator=True)
Member

Means what?

Member Author

actually, this is unused right now and might be unnecessary.


n_samples, n_features = X.shape

if self.sublinear_tf:
    np.log(X.data, X.data)
    X.data += 1
if sp.issparse(X):
Member

Does this need its own test?

Member Author

probably. I have not added any new tests, there are still enough existing tests failing. Still wip, mostly wanted feedback on the mixins.

@@ -217,6 +218,7 @@ def fit(self, X, y):

return self

@if_delegate_has_method(['_first_estimator', 'estimator'])
Member

needs testing? I think this should just be delegating to estimator. if the base estimator does not support partial_fit, nor can the fitted _first_estimator, no?

Member Author

True

@@ -505,6 +511,7 @@ def fit(self, X, y):

return self

@if_delegate_has_method(delegate='estimator')
Member

needs testing?

@@ -569,6 +576,8 @@ def predict(self, X):
Predicted multi-class targets.
"""
Y = self.decision_function(X)
if self.n_classes_ == 2:
    return self.classes_[(Y > 0).astype(np.int)]
Member

needs testing?

Member Author

See #9100

@@ -601,7 +610,8 @@ def decision_function(self, X):
                           for est, Xi in zip(self.estimators_, Xs)]).T
Y = _ovr_decision_function(predictions,
                           confidences, len(self.classes_))

if self.n_classes_ == 2:
    return Y[:, 1]
Member

needs testing?

Member Author

Yeah, I'll add more explicit testing for all of the changes. They are all exercised indirectly (that's obviously how I found the issues in the first place), but explicit regression tests are clearly good.

@jnothman
Member

And I meant to say it's great that you're finding all these bugs!

@amueller
Member Author

@jnothman yeah so far mostly meta-estimators and "weird" estimators that we didn't test :-/

@amueller
Member Author

I think another partial solution is that _get_tags could be conventionally defined with a helper, like:

def _get_tags(self):
    return base.extend_tags(self, tags={'a': 5, 'b': 6}, allow_overwrite=['b'])

extend_tags would be implemented to allow super's _get_tags to not exist. allow_overwrite would ensure that no tag overwriting can take place, except where whitelisted. At least if I understand the problem, this should help...?

I'm not sure I understand the solution. base is sklearn.base, right?
What would happen if something tries to overwrite something that is already defined? Silently ignored, right?

So with silent pass I guess this would work if BaseEstimator has everything in allow_overwrite, and all the others have nothing in allow_overwrite. Because BaseEstimator._get_tags is called last, it would then not overwrite all the tags defined in the mixins which have been defined earlier.
So yes, I think that would work. I don't think it's pretty, though.

What do you not like about the interface? Is it this exact inheritance issue? Having one attribute per tag would have the same issue, right?
I'm actually quite happy with this interface. I was first a bit bummed that I need a helper function but I don't think there's a way around that.

Actually, another way to resolve the issue is to not give BaseEstimator a _get_tags at all and assume that all calls are done with the helper.
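As a sketch of that direction, a single _get_tags on the base class could walk the MRO and collect per-class overrides from a conventional hook, with no super() calls in the mixins at all (illustrative tag names and hook name; a sketch, not necessarily the final API):

```python
# Sketch: avoid super() chains entirely. One _get_tags walks the MRO
# and merges per-class "_more_tags" overrides, with more specific
# classes overriding less specific ones.
import inspect

_DEFAULT_TAGS = {"multioutput": False, "requires_y": False}

class BaseEstimator:
    def _get_tags(self):
        collected = dict(_DEFAULT_TAGS)
        # Iterate object -> mixins -> concrete class, so later (more
        # specific) classes win.
        for klass in reversed(inspect.getmro(self.__class__)):
            if hasattr(klass, "_more_tags"):
                collected.update(klass._more_tags(self))
        return collected

class ClassifierMixin:
    def _more_tags(self):
        return {"requires_y": True}

class MyClassifier(BaseEstimator, ClassifierMixin):
    def _more_tags(self):
        return {"multioutput": True}

print(MyClassifier()._get_tags())
# {'multioutput': True, 'requires_y': True}
```

With this layout the order of BaseEstimator and the mixins in a class statement no longer matters for tag collection, since the defaults are merged first and each class's hook is visited exactly once.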

@amueller
Member Author

Here's a blog complaining about people doing mixins wrong (i.e. the way we do it):
https://www.ianlewis.org/en/mixins-and-python

Another person running into this here:
http://nedbatchelder.com/blog/201210/multiple_inheritance_is_hard.html

But "The hitchhikers guide to Python" and therefore requests does mixins to the right, while the Python Cookbook does Mixins to the left... I'm gonna take this to the twitters

@amueller
Member Author

CPython does "to the left" here: https://hg.python.org/cpython/file/3.5/Lib/socketserver.py#l639

@amueller
Member Author

David Beazley agrees with us: https://twitter.com/dabeaz/status/809084586487664641

@amueller
Member Author

Currently I've made some breaking changes to see if it's possible to make things go smoothly.
I'll probably add deprecations and put some special cases back into the tests; those special cases will go away once the deprecation period finishes.


SUPPORT_STRING = ['SimpleImputer', 'MissingIndicator']
Member Author

I didn't add a tag to SimpleImputer but the tests are not failing? hm...

@amueller
Member Author

Should be good now.

@amueller
Member Author

failures are related to the extract patches docstring (unrelated to this PR)

@jnothman
Member

@glemaitre, if you have a moment before the sprint, please approve this and merge?

@glemaitre glemaitre merged commit ab2f539 into scikit-learn:master Feb 23, 2019
@glemaitre
Member

LGTM. Looking forward to using this in contrib. Thanks @amueller

@amueller
Member Author

yay thank you all!! @glemaitre @jnothman @rth I'm so happy!

@albertcthomas
Contributor

Looking forward to using this for my custom estimators! Thank you @amueller and everyone involved!

@amueller
Member Author

@albertcthomas let us know what's missing please!

@amueller amueller removed this from PR phase in Andy's pets Apr 4, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues.

Implement estimator tags