30. Splitting a matrix
Code that splits an np.array into chunks so the array can be processed piece by piece. I only just noticed that the convenient numpy.array_split already exists…
def split_array(xs):
    N = len(xs)
    for start, end in split(N):
        yield xs[start:end]
……
# equivalent to deflate(inflate(xs))
new_xs = np.r_[tuple(
    deflate(inflate(xs_))
    for xs_ in split_array(xs))]
self.model.fit(new_xs, ys)

def split(N, divide=10):
    M = N // divide  # integer division; N / divide is a float in Python 3
    for i in range(divide):
        start = i * M
        end = (i + 1) * M
        if i == divide - 1:
            end = N  # the last chunk absorbs the remainder
        yield start, end
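For comparison, a minimal use of numpy.array_split, which does the same job in one call; note that it spreads the remainder over the leading chunks instead of giving it all to the last one:

import numpy as np

xs = np.arange(23)
chunks = np.array_split(xs, 10)   # the size need not divide evenly
print([len(c) for c in chunks])   # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]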
58. 1st-level models
-Model 1: RandomForest(R). Dataset: X
-Model 2: Logistic Regression(scikit). Dataset: Log(X+1)
-Model 3: Extra Trees Classifier(scikit). Dataset: Log(X+1) (but could be raw)
-Model 4: KNeighborsClassifier(scikit). Dataset: Scale( Log(X+1) )
-Model 7: Multinomial Naive Bayes(scikit). Dataset: Log(X+1)
Log(X+1): I had actually computed and plotted this one, but the shape looked unconvincing, so I never adopted it as a feature…
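The transforms named above map onto numpy/scikit-learn directly; a minimal sketch, assuming Scale means per-feature standardization (the write-up does not spell this out):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.poisson(1.0, size=(100, 93)).astype(float)  # stand-in for the count data
log_X = np.log1p(X)                                   # Log(X+1): Models 2, 3, 7
scaled_log_X = StandardScaler().fit_transform(log_X)  # Scale( Log(X+1) ): Model 4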
59. 1st-level models: FM
-Model 5: libfm. Dataset: Sparse(X). Each feature value is a unique level.
libfm* is an implementation of Factorization Machines**: by combining a matrix-factorization approach (as in SVD) with an SVM-style model, FMs perform well on the sparse data that SVMs struggle with.
* http://www.libfm.org/
** http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
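For reference, the model equation from the Rendle paper: an FM adds a factorized pairwise interaction term to a linear model,

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j

where each feature i has a learned factor vector v_i ∈ R^k. Since every pairwise weight is an inner product of factor vectors, it can be estimated even for feature pairs that rarely co-occur, which is what makes FMs work on sparse data.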
60. 1st-level models: NN
-Model 6: H2O NN. Bag of 10 runs. Dataset: sqrt( X + 3/8 )
H2O is a library that makes it easy to implement multi-layer NNs (so-called deep learning).
http://0xdata.com/product/
61. 1st-level models: NN
-Model 8: Lasagne NN(CPU). Bag of 2 NN runs. First with Dataset Scale( Log(X+1) ) and second with Dataset Scale( X )
-Model 9: Lasagne NN(CPU). Bag of 6 runs. Dataset: Scale( Log(X+1) )
Lasagne is a Python library that makes it easy to implement multi-layer neural networks.
http://lasagne.readthedocs.org/en/latest/index.html
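To give the flavor of the API, a minimal Lasagne stack; the hidden-layer size is an arbitrary example, not the team's:

from lasagne.layers import InputLayer, DenseLayer, DropoutLayer
from lasagne.nonlinearities import rectify, softmax

# 93 inputs / 9 classes as in the Otto data; 512 hidden units is made up
net = InputLayer(shape=(None, 93))
net = DropoutLayer(net, p=0.15)
net = DenseLayer(net, num_units=512, nonlinearity=rectify)
net = DenseLayer(net, num_units=9, nonlinearity=softmax)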
62. 1st-level models: t-SNE
-Model 10: T-sne. Dimension reduction to 3 dimensions. Also stacked 2 kmeans features using the T-sne 3 dimensions. Dataset: Log(X+1)
t-SNE is short for "t-distributed stochastic neighbor embedding", a dimensionality-reduction method.
t-distributed stochastic neighbor embedding - Wikipedia, the free encyclopedia
http://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
sklearn.manifold.TSNE — scikit-learn 0.16.1 documentation
http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
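A rough scikit-learn rendering of Model 10; the cluster counts and all other hyperparameters are guesses, since the write-up only says "2 kmeans features":

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def tsne_features(X):
    # 3-dimensional t-SNE embedding of the log-transformed counts
    emb = TSNE(n_components=3, random_state=0).fit_transform(np.log1p(X))
    # two k-means labelings stacked onto the embedding; the cluster
    # counts 9 and 25 are arbitrary choices
    km1 = KMeans(n_clusters=9, random_state=0).fit_predict(emb)
    km2 = KMeans(n_clusters=25, random_state=0).fit_predict(emb)
    return np.column_stack([emb, km1, km2])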
63. 1st-level models: Sofia
-Model 11: Sofia(R). Trained one against all with learner_type="logreg-pegasos" and loop_type="balanced-stochastic". Dataset: Scale(X)
-Model 12: Sofia(R). Trained one against all with learner_type="logreg-pegasos" and loop_type="balanced-stochastic". Dataset: Scale(X, T-sne Dimension, some 3 level interactions between 13 most important features based on randomForest importance )
-Model 13: Sofia(R). Trained one against all with learner_type="logreg-pegasos" and loop_type="combined-roc". Dataset: Log(1+X, T-sne Dimension, some 3 level interactions between 13 most important features based on randomForest importance )
65. 1st-level models: xgboost
-Model 14: Xgboost(R). Trained one against all. Dataset: (X, feature sum(zeros) by row ). Replaced zeros with NA.
-Model 15: Xgboost(R). Trained Multiclass Soft-Prob. Dataset: (X, 7 Kmeans features with different number of clusters, rowSums(X==0), rowSums(Scale(X)>0.5), rowSums(Scale(X)< -0.5) )
-Model 16: Xgboost(R). Trained Multiclass Soft-Prob. Dataset: (X, T-sne features, Some Kmeans clusters of X)
-Model 17: Xgboost(R): Trained Multiclass Soft-Prob. Dataset: (X, T-sne features, Some Kmeans clusters of log(1+X) )
-Model 18: Xgboost(R): Trained Multiclass Soft-Prob. Dataset: (X, T-sne features, Some Kmeans clusters of Scale(X) )
xgboost is one implementation of GBDT (gradient-boosted decision trees).
https://github.com/dmlc/xgboost
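The team used the R interface; the multiclass soft-prob setup looks like this in the Python one (all hyperparameters below are guesses):

import numpy as np
import xgboost as xgb

# toy stand-in data: 93 count features, 9 classes as in Otto
X = np.random.poisson(1.0, size=(200, 93))
y = np.random.randint(9, size=200)

params = {'objective': 'multi:softprob',  # emit per-class probabilities
          'num_class': 9, 'eval_metric': 'mlogloss',
          'eta': 0.05, 'max_depth': 8}
booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=50)
proba = booster.predict(xgb.DMatrix(X))   # shape (n_samples, 9)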
66. 1st-level models: NN (GPU)
-Model 19: Lasagne NN(GPU). 2-Layer. Bag of 120 NN runs with different number of epochs.
-Model 20: Lasagne NN(GPU). 3-Layer. Bag of 120 NN runs with different number of epochs.
GPU implementations of the NNs, built with Lasagne.
67. 1st-level models: xgboost
-Model 21: XGboost. Trained on raw features.
Extremely bagged (30 times averaged).
xgboost run 30 times with no particular tricks, and the results averaged. My guess: right after downloading the data, they fired off a "run xgboost repeatedly" job first and only then started on the other implementations. A sketch of the bagging follows.
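What "30 times averaged" presumably amounts to: retrain with different seeds, with subsampling turned on so the seed actually matters, then average the probability matrices. Hyperparameters here are guesses.

import numpy as np
import xgboost as xgb

def bagged_xgb(X, y, X_test, runs=30):
    dtrain, dtest = xgb.DMatrix(X, label=y), xgb.DMatrix(X_test)
    params = {'objective': 'multi:softprob', 'num_class': 9,
              'subsample': 0.9, 'colsample_bytree': 0.9}  # per-run randomness
    preds = []
    for seed in range(runs):
        params['seed'] = seed  # only the RNG seed differs between runs
        bst = xgb.train(params, dtrain, num_boost_round=250)
        preds.append(bst.predict(dtest))
    return np.mean(preds, axis=0)  # averaged class probabilities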
68. 1st-level features: KNN
-Model 22: KNN on features X + int(X == 0)
-Model 23: KNN on features X + int(X == 0) + log(X + 1)
-Model 24: KNN on raw with 2 neighbours
-Model 25: KNN on raw with 4 neighbours
:
-Model 33: KNN on raw with 1024 neighbours
Model 4 was already a KNN on Scale( Log(X+1) ), yet KNN gets added as features too. My guess is that different team members implemented these independently.
69. 1st-level features: distances
-Feature 1: Distances to nearest neighbours of each class
-Feature 2: Sum of distances of 2 nearest neighbours of each class
-Feature 3: Sum of distances of 4 nearest neighbours of each class
-Feature 4: Distances to nearest neighbours of each class in TFIDF space
-Feature 5: Distances to nearest neighbours of each class in T-SNE space (3 dimensions)
The distance to the nearest neighbours of each class (sketched below).
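A sketch of Feature 1 with scikit-learn; Features 2 and 3 would sum over the 2 or 4 nearest neighbours instead:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def class_distance_features(X_train, y_train, X):
    # one column per class: the distance from each row of X to its
    # nearest training example of that class
    cols = []
    for c in np.unique(y_train):
        nn = NearestNeighbors(n_neighbors=1).fit(X_train[y_train == c])
        dist, _ = nn.kneighbors(X)
        cols.append(dist[:, 0])
    return np.column_stack(cols)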
71. 1st-level features: non-zeros
-Feature 7: Number of non-zero elements in each row
The count of non-zero features.
-Feature 8: X (That feature was used only in NN 2nd level training)
The raw data itself.
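Feature 7 is a one-liner in numpy:

import numpy as np

X = np.array([[0, 3, 0, 1],
              [2, 0, 0, 0]])    # toy count matrix
print((X != 0).sum(axis=1))     # Feature 7: [2 1]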
75. Computational cost
> I used AWS Amazon CPU server with 32 cores
for calculating XGboost, etc. And used AWS
Amazon GPU server with 4 Nvidia Grid K520 for
calculating NNs. Models on 2nd level is really
computationally hard.
The 32-core instance is c3.8xlarge: $1.68 per hour, or $0.256 using spot instances. How many hours it ran is unknown.
77. The 6th-place team
“it seems like this competition was all about two
things: finding ways to get different biases in
the individual models, and having a good
method for ensembling beyond just
averaging.”
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14296/competition-write-up-optimistically-convergent
79. Neural networks
All ReLU; the output layer is softmax.
LB=0.446: Input -> 0.13 Dropout -> 2500 -> 0.77 Dropout -> 1300 -> 0.35 Dropout -> 40 -> 0.1 Dropout -> Output (AdaGrad, 600 epochs, batch=1024, linearly decreasing learning rate starting at 0.04, ending at 0.005)
LB=0.441: Input -> 0.15 Drop -> 2000 -> 0.7 Drop -> 1200 -> 0.4 Drop -> 40 -> Output (AdaGrad, 400 epochs, batch=512, learning rate from 0.04 to 0.001)
LB=0.463: Input -> 0.15 Drop -> 180 -> 0.1 GaussianNoise -> 0.1 Drop -> 180 -> 0.05 Drop -> 180 -> 0.05 Drop -> 140 -> 0.05 Drop -> 140 -> 0.05 Drop -> 140 -> 0.1 Drop -> 140 -> 0.1 GaussianNoise -> 160 -> Output (AdaGrad, 100 epochs, batch=512, rate = 0.03 to 0.005)
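Read left to right, the LB=0.446 line decodes into the stack below, shown in Lasagne for concreteness (the slides don't say which framework the team actually used; 93 inputs / 9 classes assumed from the Otto data):

from lasagne.layers import InputLayer, DenseLayer, DropoutLayer
from lasagne.nonlinearities import rectify, softmax

net = InputLayer(shape=(None, 93))
net = DropoutLayer(net, p=0.13)   # "0.13 Dropout"
net = DenseLayer(net, num_units=2500, nonlinearity=rectify)
net = DropoutLayer(net, p=0.77)
net = DenseLayer(net, num_units=1300, nonlinearity=rectify)
net = DropoutLayer(net, p=0.35)
net = DenseLayer(net, num_units=40, nonlinearity=rectify)
net = DropoutLayer(net, p=0.1)
net = DenseLayer(net, num_units=9, nonlinearity=softmax)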
80. Neural networks
>One thing is that even though the network with
2000 nodes in the first layer had a better LB score,
it ensembled significantly worse than the one
with 2500 nodes.
The net with 2000 units in its first layer scored better on the LB, but the 2500-unit one ensembled better.
→ My guess: the 2500-unit net just happened to pick up some useful features by luck.
87. "No feature engineering needed"
“Minimal feature engineering was needed, as all
the features were count of some events not
disclosed to participants”
I had overlooked the explanation that "the features are counts of some events"…
Given that, treating them as categorical labels was a poor approach.
https://kaggle2.blob.core.windows.net/forum-message-attachments/79384/2507/summary.pdf?sv=2012-02-12&se=2015-05-28T02%3A29%3A24Z&sr=b&sp=r&sig=GSMFMgUpNYb%2B4xZaWtxF%2BNzT3s%2FVe3kwvoHjHdqS9qM%3D
98. Ensembling with a GA
Uses DEAP*
"the gene was a vector length 20 composed of different model numbers."
The gene encodes which 20 models to pick from the 20-plus available, and the score of the average of the selected models' predictions is what gets optimized (a DEAP sketch follows).
* https://github.com/deap/deap
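A sketch of how this could be wired up in DEAP. Here preds (each model's validation probability matrix) and y_valid are assumed to already exist, and every GA hyperparameter is a guess:

import random
import numpy as np
from deap import algorithms, base, creator, tools
from sklearn.metrics import log_loss

N_MODELS, GENE_LEN = 33, 20   # 33 candidate models is an example count

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("attr", random.randrange, N_MODELS)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr, GENE_LEN)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def evaluate(ind):
    # average the chosen models' predictions (repeats are allowed, which
    # effectively weights a model) and score with multiclass log loss
    blend = np.mean([preds[m] for m in ind], axis=0)
    return (log_loss(y_valid, blend),)

toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=0, up=N_MODELS - 1, indpb=0.1)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=50)
algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=40)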
105. Running it
Epoch 1 of 500 took 144.361s
training loss: 1.344822
validation loss: 0.466205
validation accuracy: 87.58 %
Epoch 2 of 500 took 140.929s
training loss: 0.592950
validation loss: 0.332910
validation accuracy: 90.45 %
If one of the 500 epochs takes about 2 minutes, simple arithmetic says the whole run will take about 17 hours (500 × 2 min = 1000 min ≈ 16.7 h)…
106. Still running
About 1 hour in:
Epoch 30 of 500 took 154.024s
training loss: 0.130883
validation loss: 0.089886
validation accuracy: 97.31 %
About 2.5 hours in:
Epoch 60 of 500 took 160.892s
training loss: 0.077516
validation loss: 0.067623
validation accuracy: 97.86 %
About 3.5 hours in:
Epoch 170 of 500 took 80.355s
training loss: 0.026319
validation loss: 0.056770
validation accuracy: 98.43 %
107. Training finished
About 15 hours later:
Epoch 500 of 500 took 157.920s
training loss: 0.007425
validation loss: 0.064233
validation accuracy: 98.48 %
python mnist.py 40219.81s user 130.37s system 74% cpu 14:58:08.57 total
But the best performance was at:
Epoch 397 of 500 took 94.733s
training loss: 0.010540
validation loss: 0.061012
validation accuracy: 98.59 %