Advanced Scikit Learn
Classification
Regression
Clustering
Semi-Supervised Learning
Feature Selection
Feature Extraction
Manifold Learning
Dimensionality Reduction
Kernel Approximation
Hyperparameter Optimization
Evaluation Metrics
Out-of-core learning
…
Overview
● Reminder: Basic sklearn concepts
● Model building and evaluation:
– Pipelines and Feature Unions
– Randomized Parameter Search
– Scoring Interface
● Out of Core learning
– Feature Hashing
– Kernel Approximation
● New stuff in 0.16.0
– Overview
– Calibration
Supervised Machine Learning

[Diagram: Training Data + Training Labels → clf.fit(X_train, y_train) → Model]

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
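A minimal sketch of the full fit/predict workflow; the dataset, the train/test split, and the accuracy check are illustrative additions, not from the slides:

from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()                      # illustrative data; any (X, y) arrays work the same way
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)                   # learn the model from training data and labels
y_pred = clf.predict(X_test)                # apply the model to unseen data
print(clf.score(X_test, y_test))            # mean accuracy on the held-out set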
Unsupervised Transformations

pca = PCA(n_components=3)
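Unsupervised transformers use the same fit API but are applied with transform; a minimal sketch on illustrative data:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # illustrative, unlabeled view of the data
pca = PCA(n_components=3)
pca.fit(X)                                  # learn the projection without labels
X_reduced = pca.transform(X)                # shape: (n_samples, 3)
print(X_reduced.shape)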
Basic API

estimator.fit(X, [y])

estimator.predict → Classification
estimator.transform → Preprocessing, Feature extraction
Cross-Validation

>> [ 0.92  1.    1.    1.    1.  ]

cv_labels = LeaveOneLabelOut(labels)
scores_pout = cross_val_score(SVC(), X, y, cv=cv_labels)
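Scores like the ones above come from a plain cross_val_score call; a minimal sketch, assuming the iris data and five folds (both illustrative, not from the slides):

from sklearn.cross_validation import cross_val_score
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()                           # illustrative dataset
X, y = iris.data, iris.target
scores = cross_val_score(SVC(), X, y, cv=5)  # fits and scores a fresh SVC on each of 5 splits
print(scores)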
Cross-Validated Grid Search

from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
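A minimal sketch of the grid-search workflow these imports set up; the parameter grid and the data split are illustrative, not from the slides:

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

# X, y: any classification dataset (e.g. the iris data above)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
param_grid = {'C': 10. ** np.arange(-3, 3), 'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid)   # cross-validates every parameter combination
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))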
[Diagram: Training Data + Training Labels → Model]

[Diagram: Training Data + Training Labels → Feature Extraction → Scaling → Feature Selection → Model]

Cross Validation

[Diagram: the same chain of Feature Extraction → Scaling → Feature Selection → Model, evaluated with cross-validation]
Pipelines

from sklearn.pipeline import make_pipeline
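A pipeline turns the whole chain into a single estimator, so cross-validation re-runs the preprocessing inside every split. A minimal sketch with illustrative steps (scaling plus an SVC), and X, y as before:

from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())   # step names become 'standardscaler' and 'svc'
# every CV split refits the scaler on the training part only, so nothing leaks from the test fold
print(cross_val_score(pipe, X, y, cv=5))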
Combining Pipelines and Grid Search

Proper cross-validation:

param_grid = {'svc__C': 10. ** np.arange(-3, 3),
              'svc__gamma': 10. ** np.arange(-3, 3)}
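Plugging that grid into a grid search over the whole pipeline (pipe as sketched above; note the '<step name>__<parameter>' convention):

from sklearn.grid_search import GridSearchCV

grid = GridSearchCV(pipe, param_grid=param_grid)   # 'svc__C' means parameter C of the step named 'svc'
grid.fit(X_train, y_train)                         # the scaler is refit inside every CV split
print(grid.best_params_)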
Combining Pipelines and Grid Search II

Searching over parameters of the preprocessing step
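The same double-underscore syntax reaches parameters of preprocessing steps. A minimal sketch, assuming a text pipeline of a CountVectorizer and a LinearSVC (illustrative, not necessarily the slide's exact pipeline):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

text_pipe = make_pipeline(CountVectorizer(), LinearSVC())
param_grid = {'countvectorizer__ngram_range': [(1, 1), (1, 2)],  # preprocessing parameter
              'linearsvc__C': 10. ** np.arange(-3, 3)}           # model parameter
grid = GridSearchCV(text_pipe, param_grid=param_grid)
grid.fit(text_train, y_train)   # text_train: a list of documents (illustrative name)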
Feature Union

[Diagram: Training Data → Feature Extraction I and Feature Extraction II in parallel, features concatenated → Model]

char_and_word = make_union(CountVectorizer(analyzer="char"),
                           CountVectorizer(analyzer="word"))
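The randomized-search snippets below refer to a text_pipe, presumably this union feeding a linear classifier. A sketch of one plausible definition (not necessarily the slide's exact one):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# character n-grams and word counts side by side, concatenated into one feature matrix
char_and_word = make_union(CountVectorizer(analyzer="char"),
                           CountVectorizer(analyzer="word"))
text_pipe = make_pipeline(char_and_word, LinearSVC())
# parameter names: 'featureunion__countvectorizer-1__...', 'featureunion__countvectorizer-2__...', 'linearsvc__C'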
Randomized Parameter Search

[Figure: grid layout vs. random sampling of a 2D parameter space. Source: Bergstra and Bengio]

params = {'featureunion__countvectorizer-1__ngram_range':
              [(1, 3), (1, 5), (2, 5)],
          'featureunion__countvectorizer-2__ngram_range':
              [(1, 1), (1, 2), (2, 2)],
          'linearsvc__C': 10. ** np.arange(-3, 3)}
Randomized Parameter Search

params = {'featureunion__countvectorizer-1__ngram_range':
              [(1, 3), (1, 5), (2, 5)],
          'featureunion__countvectorizer-2__ngram_range':
              [(1, 1), (1, 2), (2, 2)],
          'linearsvc__C': expon()}

rs = RandomizedSearchCV(text_pipe,
                        param_distributions=params, n_iter=50)
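Here expon is scipy.stats.expon, a continuous distribution the search samples C from. Continuing the snippet above (text_train / y_train are illustrative names):

from scipy.stats import expon                 # the distribution behind 'linearsvc__C': expon()
from sklearn.grid_search import RandomizedSearchCV

rs.fit(text_train, y_train)                   # draws n_iter=50 parameter settings, cross-validating each
print(rs.best_params_, rs.best_score_)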
Randomized Parameter Search

● Always use distributions for continuous variables.
● Don't use for low-dimensional spaces.
● Future: Bayesian optimization based search.
Generalized Cross-Validation and Path Algorithms

rfe = RFE(LogisticRegression())
param_grid = {'n_features_to_select': range(1, n_features)}
gridsearch = GridSearchCV(rfe, param_grid)
gridsearch.fit(X, y)

rfecv = RFECV(LogisticRegression())
rfecv.fit(X, y)
Linear Models: RidgeCV, RidgeClassifierCV, LarsCV, ElasticNetCV, ...
Feature Selection: RFECV
Tree-Based models [possible]: [RandomForestClassifierCV], [GradientBoostingClassifierCV]
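These *CV estimators pick their regularization parameter with built-in, efficient cross-validation or path algorithms, so no outer GridSearchCV is needed. A minimal RidgeCV sketch on illustrative regression data:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=20, noise=1.0, random_state=0)
ridge = RidgeCV(alphas=10. ** np.arange(-3, 3))   # candidate regularization strengths
ridge.fit(X, y)                                   # selects alpha via efficient (generalized) cross-validation
print(ridge.alpha_)                               # the chosen regularization strength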
Scoring Functions

GridSearchCV
RandomizedSearchCV
cross_val_score
...CV

Default:
Accuracy (classification)
R² (regression)
Scoring with imbalanced data

cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
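An accuracy of 0.9 is meaningless if 90% of the samples belong to one class. Passing a different metric through the scoring parameter exposes this; a minimal sketch (X_train / y_train stand for an imbalanced binary problem):

from sklearn.cross_validation import cross_val_score
from sklearn.svm import SVC

print(cross_val_score(SVC(), X_train, y_train))                     # accuracy, flattered by the majority class
print(cross_val_score(SVC(), X_train, y_train, scoring="roc_auc"))  # rank-based, sensitive to the minority class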
Available metrics

from sklearn.metrics import SCORERS
print(SCORERS.keys())
>> ['adjusted_rand_score',
'f1',
'mean_absolute_error',
'r2',
'recall',
'median_absolute_error',
'precision',
'log_loss',
'mean_squared_error',
'roc_auc',
'average_precision',
'accuracy']
Defining your own scoring
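Any metric with the signature metric(y_true, y_pred) can be wrapped into a scorer via make_scorer; a minimal sketch (the choice of fbeta_score is illustrative):

from sklearn.cross_validation import cross_val_score
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.svm import SVC

ftwo_scorer = make_scorer(fbeta_score, beta=2)     # greater_is_better=True by default
print(cross_val_score(SVC(), X_train, y_train, scoring=ftwo_scorer))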
Out of Core Learning

Or: save ourselves the effort
Think twice!
● Old laptop: 4 GB RAM
● 1,073,741,824 float32 values
● Or 1 million data points with 1,000 features
● EC2: 256 GB RAM
● 68,719,476,736 float32 values
● Or 68 million data points with 1,000 features
Supported Algorithms
● All SGDClassifier derivatives
● Naive Bayes
● MiniBatchKMeans
● IncrementalPCA
● MiniBatchDictionaryLearning
Out of Core Learning

from sklearn.linear_model import SGDClassifier
import cPickle

sgd = SGDClassifier()
for i in range(9):
    X_batch, y_batch = cPickle.load(open("batch_%02d" % i))
    # classes must list all labels up front; a single batch may not contain them all
    sgd.partial_fit(X_batch, y_batch, classes=range(10))
Stateless Transformers
● Normalizer
● HashingVectorizer
● RBFSampler (and other kernel approx)
Text data and the hashing trick
Bag Of Word Representations

CountVectorizer / TfidfVectorizer

[Diagram: document → tokenizer → bag-of-words vector]
Hashing Trick

HashingVectorizer

[Diagram: document → tokenizer → hashing → [0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0]]
Out of Core Text Classification

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import cPickle

sgd = SGDClassifier()
hashing_vectorizer = HashingVectorizer()
for i in range(9):
    text_batch, y_batch = cPickle.load(open("text_%02d" % i))
    # stateless: transform needs no fit, so it works batch by batch
    X_batch = hashing_vectorizer.transform(text_batch)
    sgd.partial_fit(X_batch, y_batch, classes=range(10))
Kernel Approximations

Reminder: Kernel Trick

A linear classifier needs only inner products x_i · x_j → replace them with kernel values k(x_i, x_j):

Linear:      k(x, x') = x · x'
Polynomial:  k(x, x') = (x · x' + c)^d
RBF:         k(x, x') = exp(−γ ‖x − x'‖²)
Sigmoid:     k(x, x') = tanh(γ x · x' + c)
Complexity
● Solving kernelized SVM:
~O(n_samples ** 3)
● Solving linear (primal) SVM:
~O(n_samples * n_features)
Undoing the Kernel Trick

● Kernel approximation: find an explicit map φ̂ with k(x, x') ≈ φ̂(x) · φ̂(x')
● For the RBF kernel: RBFSampler
Usage

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
import cPickle

sgd = SGDClassifier()
kernel_approximation = RBFSampler(gamma=.001, n_components=400)
for i in range(9):
    X_batch, y_batch = cPickle.load(open("batch_%02d" % i))
    if i == 0:
        # fit the random feature map once, on the first batch
        kernel_approximation.fit(X_batch)
    X_transformed = kernel_approximation.transform(X_batch)
    sgd.partial_fit(X_transformed, y_batch, classes=range(10))
Highlights from 0.16.0
● Multinomial Logistic Regression,
LogisticRegressionCV.
● IncrementalPCA.
● Probability calibration of classifiers.
● Birch clustering.
● LSHForest.
● More robust integration with pandas.
Probability Calibration
SVC().decision_function()
→ CalibratedClassifierCV(SVC()).predict_proba()
RandomForestClassifier().predict_proba()
→ CalibratedClassifierCV(RandomForestClassifier()).predict_proba()
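A minimal calibration sketch: the wrapper cross-validates internally, fitting the base classifier and a calibrator on each split (the data names are illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

calibrated_svc = CalibratedClassifierCV(SVC(), method='sigmoid', cv=3)
calibrated_svc.fit(X_train, y_train)            # fits an SVC plus a sigmoid calibrator per CV split
proba = calibrated_svc.predict_proba(X_test)    # calibrated class probabilities, even for SVC
print(proba[:5])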
CDS is hiring Research Engineers
Thank you for your attention.
@t3kcit
@amueller
t3kcit@gmail.com
Bias-Variance Tradeoff
(why we do cross-validation and grid searches)
Overfitting and Underfitting

[Plot: accuracy vs. model complexity. Training accuracy keeps rising with complexity, while generalization accuracy peaks at a sweet spot and then drops; left of the sweet spot is underfitting, right of it is overfitting.]
Linear SVM
(RBF) Kernel SVM
Decision Trees
Random Forests
Know where you are on the bias-variance tradeoff
Validation Curves

from sklearn.learning_curve import validation_curve

# param_range: array of gamma values to evaluate
train_scores, test_scores = validation_curve(SVC(), X, y,
                                             param_name="gamma", param_range=param_range)
Learning Curves

from sklearn.learning_curve import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, train_sizes=train_sizes)