
[MRG] Estimator tags #8022

Merged
merged 219 commits into from
Feb 23, 2019

Conversation

amueller
Member

@amueller amueller commented Dec 8, 2016

Fixes #6599, #7737, #6554, #6715

Also see #6510

Todo

  • remove fixme's, enable all tests again
  • rename tags so they are false by default
  • rewrite tagging mechanism
  • rethink _required_parameters

For another PR:

  • add tests on compound estimators like pipeline, gridsearchcv, ...
  • fix MultiOutputEstimator
  • replace _estimator_type with tags
  • tag for sample weight support: has_fit_parameter doesn't work for meta-estimators
  • remove cross_decomposition special cases (maybe)
  • remove positive data special cases
  • input type for sparse data?
  • binary only classifiers
  • biclustering
  • make sure that enough tags are defined? ChainClassifier had no ClassifierMixin :-/

I decided to leave the _estimator_type for later because this PR is already too big.

Fun issue I just thought about: Can we even define the tags for a pipeline?
How do we know whether a pipeline can handle missing data? The scalers pass it through (and the tag says they "handle" missing data) while the imputers actually impute it. So either these need different tags, or we punt on defining tags for pipelines: if you want to run the common tests on a pipeline, its author has to set the tags for that specific pipeline. That doesn't seem ideal, though.
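To make the ambiguity concrete, here is a toy sketch of one naive composition rule (the `combine_pipeline_tags` helper and the `handles_missing` tag are purely illustrative, not part of this PR):

```python
# Illustrative sketch only: naively AND a hypothetical "handles_missing"
# tag across a pipeline's steps. The helper and tag name are invented
# for this example and are not this PR's API.

def combine_pipeline_tags(steps):
    """Combine tags of (name, estimator) steps by AND-ing them."""
    combined = {"handles_missing": True}
    for _, est in steps:
        # Fall back to "doesn't handle it" if a step defines no tags.
        tags = est._get_tags() if hasattr(est, "_get_tags") else {}
        combined["handles_missing"] &= tags.get("handles_missing", False)
    return combined
```

AND-ing says a scaler-plus-imputer pipeline cannot handle missing values even when the imputer removes them before any later step sees them, while OR-ing would claim too much; neither rule recovers the pipeline's real behavior, which is exactly the problem described above.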

@amueller amueller changed the title make common tests work on estimator instances, not classes [WIP] make common tests work on estimator instances, not classes Dec 8, 2016
@amueller
Member Author

amueller commented Dec 9, 2016

This starts to look good, though I have no idea what's happening with GaussianRandomProjectionHash.

There's an issue with DummyRegressor and DummyClassifier not reaching the accuracy we're testing for (and the same is true for some of the NB classifiers). I'm not entirely sure how to handle that with tags -- interestingly, I haven't used any tags at all so far.
But a tag saying "this usually yields bad results" seems kinda odd.

@amueller amueller mentioned this pull request Dec 12, 2016
18 tasks
@amueller amueller changed the title [WIP] make common tests work on estimator instances, not classes [WIP] Estimator tags and making common tests work on classes... Dec 12, 2016
@amueller
Member Author

Solves #6079 #6715 #7737 #6981 (potentially but not yet for the last one).
Also #7289.

@amueller
Member Author

So I'm allowing np.float64 as a parameter to __init__ because that was used in some of the vectorizers. Alternatively we could change the default to a string instead of a type.

@amueller
Member Author

This should be of interest to @GaelVaroquaux @jnothman @jakevdp and @mblondel.
It would be great to get some initial feedback.
This PR does three things:

  • it changes the common tests to work on instances instead of classes
  • it replaces all the hard-coded assumptions about estimators with tags and the _required_parameters class attribute.
  • it fixes API inconsistencies to make the tests pass.

If you want, I can make the first thing its own PR, because this looks like it's gonna be a big one.
Changing to instances instead of classes for tests is motivated by using instance methods for the tags and #7289. With instance-level common tests, we can write
check_estimator(make_pipeline(PCA(), SVC(kernel="linear"))), which I think is good.
Though the separate PR for the instance-level tests will be pretty short compared to the rest.

Tests are currently failing because I'm still working on the third bullet point.

@amueller
Member Author

OneVsOneClassifier returns a decision_function of shape (n_samples, 2) for a two-class problem. It's the only classifier that does so, as far as I can see. OneVsRestClassifier returns (n_samples, 1).

@amueller
Member Author

amueller commented Dec 13, 2016

Hm, so will _get_tags() be part of the API requirements, or do I need to hasattr every time? We probably shouldn't rely on it being present, so a helper it is.

@amueller
Member Author

I decided to make TfidfTransformer support dense input. That was simpler than writing a new input validation tool that always converts to sparse, no matter what the input. It also seems slightly more sensible than the previous behavior. It does mean, though, that if you provide dense input, it produces dense output, which is a change from before.

@amueller
Member Author

Ok, so @GaelVaroquaux @jnothman I need some input. My current solution for _get_tags is not working. This is a bit lengthy but I'm stuck and I don't understand the motivation for our current class hierarchy.

I'm confused by the method resolution order in our class hierarchy.
We always write BaseEstimator to the left, which means it comes before the mixins in the MRO, which means the mixins can't actually change any of the BaseEstimator behavior.
So if I declare some basic tags in BaseEstimator and want to extend them in ClassifierMixin, I need to call super in BaseEstimator, and then handle the return from the mixin. That seems really weird to me.

I assumed I can call super in the mixin and they'll all resolve before BaseEstimator, but exactly the opposite is the case.

I can see the following ways to fix this:

  1. actually call super in BaseEstimator and in all mixins and check if super has the _get_tags method (it could always be object as far as I can see). The code in BaseEstimator then is a bit weird because it needs to take the defaults and overwrite them with what was given by the super call. This solution feels really wrong to me.

  2. Make all mixins inherit from BaseEstimator and remove BaseEstimator from all the estimators (if we add it to the mixins, Python requires us to remove it from the estimators to have a well-defined MRO). This puts BaseEstimator at the top of the class hierarchy, it will always resolve last, and everything is good.

  3. Change the inheritance order in all estimators to have BaseEstimator at the very right. That puts it last in the MRO and everything is good.

  4. Add a base class (I call it KingOfDiamonds) on top of BaseEstimator and all the mixins, and implement the default _get_tags() there. BaseEstimator doesn't even need to implement _get_tags then, only KingOfDiamonds does. It will always be last in the MRO and everything is good.

From a code perspective, 4 is the least intrusive (fewest lines changed), but it adds another layer to the inheritance hierarchy.
3 is the least intrusive in terms of class hierarchy; it "only" changes the MRO. It relies on the order of the classes in multiple inheritance, though, which seems somewhat fragile (*).
2 is somewhere in between because it changes both the inheritance structure and the code, but it's relatively robust and doesn't introduce another class.

  • With the current setup, if two mixins try to change the same tag, but in opposite directions, then the order of the mixins will matter. That seems like a bad idea that can be avoided in practice, though.

What is the motivation to have BaseEstimator to the left in the first place, and why don't we let the mixins inherit from it?
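The asymmetry is easy to demonstrate with toy classes (a standalone sketch, not the real scikit-learn classes):

```python
# Minimal demonstration of the MRO problem described above.

class Base:
    def _get_tags(self):
        return {"multioutput": False}

class Mixin:
    def _get_tags(self):
        # Only reaches Base if Base comes *after* Mixin in the MRO.
        tags = super()._get_tags()
        tags.update(multioutput=True)
        return tags

class MixinLeft(Mixin, Base):   # option 3: base class on the right
    pass

class BaseLeft(Base, Mixin):    # the current "BaseEstimator to the left" layout
    pass

print(MixinLeft()._get_tags())  # {'multioutput': True}
print(BaseLeft()._get_tags())   # {'multioutput': False} -- Mixin never runs
```

With the base class on the left, its `_get_tags` shadows the mixin's entirely, which is exactly why declaring defaults in BaseEstimator and extending them in mixins doesn't work in the current hierarchy.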

@jnothman
Member

I think mixins are meant to come before base classes in the MRO, although I've not confirmed this intuition with online resources (I'm offline). From the perspective of purity, 3 is therefore my preferred solution, though I appreciate it is technically not backwards compatible, and is somewhat brittle.

I think another partial solution is that _get_tags could be conventionally defined with a helper, like:

def _get_tags(self):
    return base.extend_tags(self, tags={'a': 5, 'b': 6}, allow_overwrite=['b'])

extend_tags would be implemented to allow super's _get_tags to not exist. allow_overwrite would ensure that no tag overwriting can take place, except where whitelisted. At least if I understand the problem, this should help...?
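For illustration, the merge semantics could look something like this (hypothetical code; extend_tags does not exist in sklearn.base, and this variant takes the already-collected tags instead of self to sidestep the super() lookup question):

```python
# Hypothetical sketch of extend_tags merge semantics: new tags extend
# the already-collected ones, and overwriting an existing tag is
# silently ignored unless the key is whitelisted in allow_overwrite.

def extend_tags(existing, new, allow_overwrite=()):
    merged = dict(existing)
    for key, value in new.items():
        if key not in merged or key in allow_overwrite:
            merged[key] = value
        # else: keep the earlier definition (silent ignore)
    return merged
```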

@jnothman jnothman left a comment

A skim. I'm yet to look at estimator_checks. I'm not sure this interface is perfect, and I've been wondering about just having an attribute on the objects for each trait. But that violates everything :)


The current set of estimator tags are:

input_validation - whether the estimator does input-validation. This is only meant for stateless and dummy transformers!
Member

what of ensembles/metaestimators that largely delegate their validation to some base/wrapped estimator?

Member Author

I'm currently basing all this on the tests, which I think is ok as a first approach and given the issues that were motivating this. metaestimators and ensembles are instantiated with well-behaved estimators for now, so they pass the tests.

Member

I don't get it, what does this mean? Only stateless and dummy transformers should set this to True? or to False? and why?

Might be clearer to document the default tag values here too

Member

Can I check whether text feature extractors are considered to validate?

Member

This list should be formatted as a RST definition list


input_validation - whether the estimator does input-validation. This is only meant for stateless and dummy transformers!
multioutput - whether a regressor supports multi-target outputs or a classifier supports multi-class multi-output.
multilabel - whether the estimator supports multilabel output
Member

another one for sparse multilabel?

Member Author

So far I only added tags that were needed for the test to pass without special cases - actually I didn't need the sparse data one so far or the meta-estimator!

sklearn/base.py Outdated
@@ -12,8 +12,18 @@
from .utils.fixes import signature
from . import __version__

_DEFAULT_TAGS = {
'input_types': ['2darray'],
Member

by input you mean first arg to Estimator().fit?

Member Author

yes. Maybe "data_types"? We have 1d, 2d, 3d (patch extractor), sparse, dictionaries and text. Also tuples for dict-vectorizer (or was it hashing vectorizer?) I think.

Member

Would X_types be better than data_types?

sklearn/base.py Outdated
"""Mixin class for all meta estimators in scikit-learn."""
# this is just a tag for the moment
def _get_tags(self):
    tags = super(MetaEstimatorMixin, self)._get_tags().copy()
    tags.update(is_meta_estimator=True)
Member

Means what?

Member Author

actually, this is unused right now and might be unnecessary.


n_samples, n_features = X.shape

if self.sublinear_tf:
    np.log(X.data, X.data)
    X.data += 1
if sp.issparse(X):
Member

Does this need its own test?

Member Author

probably. I have not added any new tests, there are still enough existing tests failing. Still wip, mostly wanted feedback on the mixins.

@@ -217,6 +218,7 @@ def fit(self, X, y):

return self

@if_delegate_has_method(['_first_estimator', 'estimator'])
Member

needs testing? I think this should just be delegating to estimator. if the base estimator does not support partial_fit, nor can the fitted _first_estimator, no?

Member Author

True

@@ -505,6 +511,7 @@ def fit(self, X, y):

return self

@if_delegate_has_method(delegate='estimator')
Member

needs testing?

@@ -569,6 +576,8 @@ def predict(self, X):
Predicted multi-class targets.
"""
Y = self.decision_function(X)
if self.n_classes_ == 2:
    return self.classes_[(Y > 0).astype(np.int)]
Member

needs testing?

Member Author

See #9100

@@ -601,7 +610,8 @@ def decision_function(self, X):
                           for est, Xi in zip(self.estimators_, Xs)]).T
Y = _ovr_decision_function(predictions,
                           confidences, len(self.classes_))

if self.n_classes_ == 2:
    return Y[:, 1]
Member

needs testing?

Member Author

Yeah, I'll add more explicit testing for all of the changes. They are all exercised indirectly (that's obviously how I found the issues in the first place), but explicit regression tests are clearly good.

@jnothman
Member

And I meant to say it's great that you're finding all these bugs!

@amueller
Member Author

@jnothman yeah so far mostly meta-estimators and "weird" estimators that we didn't test :-/

@amueller
Member Author

I think another partial solution is that _get_tags could be conventionally defined with a helper, like:

def _get_tags(self):
    return base.extend_tags(self, tags={'a': 5, 'b': 6}, allow_overwrite=['b'])

extend_tags would be implemented to allow super's _get_tags to not exist. allow_overwrite would ensure that no tag overwriting can take place, except where whitelisted. At least if I understand the problem, this should help...?

I'm not sure I understand the solution. base is sklearn.base, right?
What would happen if something tries to overwrite something that is already defined? Silently ignored, right?

So with silent pass I guess this would work if BaseEstimator has everything in allow_overwrite, and all the others have nothing in allow_overwrite. Because BaseEstimator._get_tags is called last, it would then not overwrite all the tags defined in the mixins which have been defined earlier.
So yes, I think that would work. I don't think it's pretty, though.

What do you not like about the interface? Is it this exact inheritance issue? Having one attribute per tag would have the same issue, right?
I'm actually quite happy with this interface. I was first a bit bummed that I need a helper function but I don't think there's a way around that.

Actually, another way to resolve the issue is to not give BaseEstimator a _get_tags at all and assume that all calls are done with the helper.
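As a sketch of that direction, a single _get_tags on the base class could walk the MRO and collect per-class overrides from a conventional hook, with no super() calls in the mixins at all (illustrative tag names and hook name; a sketch, not necessarily the final API):

```python
# Sketch: avoid super() chains entirely. One _get_tags walks the MRO
# and merges per-class "_more_tags" overrides, with more specific
# classes overriding less specific ones.
import inspect

_DEFAULT_TAGS = {"multioutput": False, "requires_y": False}

class BaseEstimator:
    def _get_tags(self):
        collected = dict(_DEFAULT_TAGS)
        # Iterate object -> mixins -> concrete class, so later (more
        # specific) classes win.
        for klass in reversed(inspect.getmro(self.__class__)):
            if hasattr(klass, "_more_tags"):
                collected.update(klass._more_tags(self))
        return collected

class ClassifierMixin:
    def _more_tags(self):
        return {"requires_y": True}

class MyClassifier(BaseEstimator, ClassifierMixin):
    def _more_tags(self):
        return {"multioutput": True}

print(MyClassifier()._get_tags())
# {'multioutput': True, 'requires_y': True}
```

With this layout the order of BaseEstimator and the mixins in a class statement no longer matters for tag collection, since the defaults are merged first and each class's hook is visited exactly once.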

@amueller
Member Author

Here's a blog complaining about people doing mixins wrong (i.e. the way we do it):
https://www.ianlewis.org/en/mixins-and-python

Another person running into this here:
http://nedbatchelder.com/blog/201210/multiple_inheritance_is_hard.html

But "The hitchhikers guide to Python" and therefore requests does mixins to the right, while the Python Cookbook does Mixins to the left... I'm gonna take this to the twitters

@amueller
Member Author

CPython does "to the left" here: https://hg.python.org/cpython/file/3.5/Lib/socketserver.py#l639

@amueller
Member Author

David Beazley agrees with us: https://twitter.com/dabeaz/status/809084586487664641

@amueller
Member Author

Currently I've made some breaking changes to see if it's possible to make things go smoothly.
I'll probably add deprecations and put some special cases back into the tests; those special cases will go away once the deprecation period finishes.


SUPPORT_STRING = ['SimpleImputer', 'MissingIndicator']
Member Author

I didn't add a tag to SimpleImputer but the tests are not failing? hm...

@amueller
Member Author

Should be good now.

@amueller
Member Author

failures are related to the extract patches docstring (unrelated to this PR)

@jnothman
Member

@glemaitre, if you have a moment before the sprint, please approve this and merge?

@glemaitre glemaitre merged commit ab2f539 into scikit-learn:master Feb 23, 2019
@glemaitre
Member

LGTM. Looking forward to using this in contrib. Thanks @amueller

@amueller
Member Author

yay thank you all!! @glemaitre @jnothman @rth I'm so happy!

@albertcthomas
Contributor

Looking forward to using this for my custom estimators! Thank you @amueller and everyone involved!

@amueller
Member Author

@albertcthomas let us know what's missing please!

@amueller amueller removed this from PR phase in Andy's pets Apr 4, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues.

Implement estimator tags