Winning The Kaggle Algorithmic Trading Challenge With The Composition of Many Models and Feature Engineering
Abstract
This letter presents the ideas and methods of the winning solution2 for the Kaggle
Algorithmic Trading Challenge. This analysis challenge took place between 11th
November 2011 and 8th January 2012, and 264 competitors submitted solutions.
The objective of this competition was to develop empirical predictive models to
explain stock market prices following a liquidity shock. The winning system builds
upon the optimal composition of several models and a feature extraction and se-
lection strategy. We used Random Forest as a modeling technique to train all
sub-models as a function of an optimal feature set. The modeling approach can
cope with highly complex data having low Maximal Information Coefficients be-
tween the dependent variable and the feature set and provides a feature ranking
metric which we used in our feature selection algorithm.
1 Introduction
The goal of the Kaggle Algorithmic Trading Challenge was to encourage the development
of empirical models to predict the short term response of Order-Driven Markets (ODM)
following large liquidity shocks [1]. A liquidity shock is defined as any trade that changes
the best bid or ask price. Liquidity shocks occur when a large trade (or series of smaller
trades) consumes all available volume at the best price.
2 Solution designed and implemented by Ildefons Magrans de Abril.
This letter presents an empirical model meant to predict the short-term response of
the top of the bid and ask books following a liquidity shock. This kind of model can be
used as a core component of a simulation tool to optimize execution strategies of large
transactions. Unlike existing finance research models [2] [3], we were not interested in
understanding the underlying processes responsible for the price dynamics. Nevertheless,
by chasing the optimal predictor we may have uncovered interesting insights that could be
a source of research inspiration.
The challenge data consists of training and test datasets. The training dataset is
meant to fit a predictive model and contestants are asked to submit predictions based
on the test dataset using this model. The training dataset consists of 754018 samples of
trade and quote data observations before and after a liquidity shock for several different
securities of the London Stock Exchange (LSE). Changes to the state of the order book
occur in the form of trades and quotes. A quote event occurs whenever the best bid or
best ask price is updated. A trade event takes place when shares are bought or sold.
The test dataset consists of 50000 samples similar to those in the training dataset but
without the post-liquidity shock observations (i.e., time interval 51-100). Due to a bug in
the data, the quotes at times 50 and 51 were identical. Therefore, the final objective was
to predict the post-liquidity shock observations in the interval 52-100. In addition to the
bid and ask price time series, each training and test sample contains a few additional
variables: an identifier of the particular security (security id), an indicator of whether the
trade was initiated by a buyer or a seller (initiator), the volume-weighted average price of
the trade causing the liquidity shock (trade vwap), and the total size of that trade
(trade volume) [1].
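To make the sample layout concrete, the sketch below separates one sample's pre-shock series, post-shock targets, and auxiliary variables. It is a minimal sketch under assumptions: the file name and the column names (bid1..bid100, ask1..ask100, security_id, etc.) are placeholders for illustration; the actual schema is documented on the challenge data page [1].

```r
## Hypothetical sketch of the sample layout (file and column names are
## assumptions; see the challenge data page [1] for the actual schema).
train <- read.csv("training.csv")        # one liquidity-shock event per row

pre_bid  <- paste0("bid", 1:50)          # observations up to the shock (t = 1..50)
pre_ask  <- paste0("ask", 1:50)
post_bid <- paste0("bid", 52:100)        # prediction targets (t = 52..100)
post_ask <- paste0("ask", 52:100)
aux      <- c("security_id", "initiator", "trade_vwap", "trade_volume")

x_features <- train[, c(aux, pre_bid, pre_ask)]   # inputs to feature extraction
y_targets  <- train[, c(post_bid, post_ask)]      # post-shock prices to predict
```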
2 Model
The search for an optimal model was guided by one hypothesis and an additional self-
imposed constraint:
Hypothesis: The predictive potential should be higher close to the liquidity shock and
should degrade with distance from it. The rationale of this hypothesis is that future events
will also depend on post-liquidity shock events that still need to be predicted. Therefore,
the prediction error will tend to increase with the distance from the liquidity shock.
Constraint: Feature extraction should generate semantically meaningful features. This
self-imposed constraint was motivated by the authors' aim to build a predictive model
with the highest possible explanatory capacity.
In the following sections we show how these two guiding principles helped us to reach a
good solution and, ultimately, to win the competition.
2.1 Architecture
The model architecture consists of separate models for bid and ask. Bid and ask models
are each further divided into K sub-models responsible for predicting a constant price at
specific future time intervals between 52 and 100. The set P consists of K disjoint time
intervals whose union is the full interval 52–100:
\[
M_{\mathrm{bid}}(t) = \sum_{i=1}^{K} a_{i,t}\, M_{\mathrm{bid},i}(t), \qquad
M_{\mathrm{ask}}(t) = \sum_{i=1}^{K} a_{i,t}\, M_{\mathrm{ask},i}(t),
\]
\[
\text{where } a_{i,t} =
\begin{cases}
1 & \text{if } t \in C_i,\\
0 & \text{otherwise},
\end{cases}
\qquad t \in [52, 100].
\]
Ci represents the i-th of the K post-liquidity shock time intervals and Mbid/ask,i (t) is the
sub-model responsible for predicting the constant price on the interval Ci .
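As a concrete illustration of this composition, the sketch below assembles a full 52–100 forecast from per-interval constant predictions: the indicator weights simply select the sub-model whose interval contains t. It is a minimal sketch under assumptions; the interval list, the fitted sub-model objects, and the single-row feature input are placeholders rather than the exact objects used in the winning solution.

```r
## Sketch of the model composition: each sub-model i contributes a constant
## price over its interval C_i; the weights a_{i,t} select the sub-model
## whose interval contains t.  Intervals and sub-models are placeholders.
compose_prediction <- function(sub_models, intervals, new_sample) {
  ## intervals:  list of c(start, end) pairs covering 52..100
  ## sub_models: one fitted model per interval (e.g., randomForest objects)
  prediction <- numeric(0)
  for (i in seq_along(intervals)) {
    start <- intervals[[i]][1]
    end   <- intervals[[i]][2]
    const_price <- predict(sub_models[[i]], new_sample)   # one constant value
    prediction  <- c(prediction, rep(const_price, end - start + 1))
  }
  names(prediction) <- 52:100
  prediction
}
```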
Dividing the future time interval into the set of intervals P should avoid mixing the
low signal-to-noise-ratio of “far-away” prices with the more predictable prices close to
the liquidity shock. An additional feature of the model is that the time intervals should
have increasing lengths (i.e., length(Ci+1 ) ≥ length(Ci )). This feature is a consequence of
the main hypothesis: since the prediction error keeps increasing with the distance from the
shock, progressively longer price time series may need to be averaged to obtain a constant
price prediction with an acceptable error.
Algorithm 1 is responsible for dividing the future time interval 52–100 into disjoint
and consecutive sub-intervals of increasing length (line 3). It is implemented as a greedy
algorithm and partitions the post-liquidity shock time interval with O(n) time complexity.
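The exact listing of Algorithm 1 is not reproduced in this excerpt, but a hedged sketch of a greedy partitioning of this kind is shown below. The stopping rule (a tolerance on the spread of per-time-step validation errors) is an assumption used for illustration; only the structural constraints stated in the text (disjoint, consecutive sub-intervals of non-decreasing length, built in a single pass over 52–100) are taken from the paper.

```r
## Hedged sketch of a greedy partition of the post-shock interval 52-100 into
## disjoint, consecutive sub-intervals of non-decreasing length.
## Assumption: 'rmse' is a numeric vector of per-time-step validation errors
## for t = 52..100, and an interval is extended while the errors inside it
## stay within a tolerance 'tol' of each other.
greedy_partition <- function(rmse, t_start = 52, t_end = 100, tol = 0.01) {
  stopifnot(length(rmse) == t_end - t_start + 1)
  intervals <- list()
  start   <- t_start
  min_len <- 1                        # enforces length(C[i+1]) >= length(C[i])
  while (start <= t_end) {
    end <- min(start + min_len - 1, t_end)
    ## extend the current interval while its errors remain homogeneous
    while (end < t_end &&
           diff(range(rmse[(start - t_start + 1):(end - t_start + 2)])) < tol) {
      end <- end + 1
    }
    intervals[[length(intervals) + 1]] <- c(start, end)
    min_len <- end - start + 1        # next interval must be at least as long
    start   <- end + 1
  }
  intervals                           # e.g., list(c(52, 52), c(53, 53), ...)
}
```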
The following two sections describe in detail the feature extraction and selection methods.
However, a key component of the feature selection method, the feature selection algorithm,
will be presented later in Section 2.3 because it has a strong dependency on the modeling
approach that we have chosen.
Feature extraction: It generates a feature sub-set common to all sub-models that describes
the future bid price (Fb ) and a second feature sub-set common to all sub-models that
describes the future ask price (Fa ).
Feature selection algorithm: It chooses the suitable feature sets (i.e., Fb and Fa ). The
details of this algorithm will be presented in the following section together with our
modeling approach.
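The abstract notes that Random Forest provides a feature-ranking metric used in our feature selection algorithm, and the conclusions mention a simple backward feature elimination method. A minimal sketch of such a backward elimination loop is given below, assuming the randomForest package [6] and an RMSE criterion on a held-out validation set; the exact elimination criterion, forest settings, and stopping rule used in the competition are assumptions.

```r
## Hedged sketch of backward feature elimination driven by the Random Forest
## importance ranking.  The validation split, ntree, and stopping rule are
## assumptions for illustration, not the competition settings.
library(randomForest)

backward_elimination <- function(x, y, x_val, y_val, min_features = 5) {
  features  <- colnames(x)
  best_rmse <- Inf
  best_set  <- features
  while (length(features) > min_features) {
    fit  <- randomForest(x[, features, drop = FALSE], y, ntree = 200)
    pred <- predict(fit, x_val[, features, drop = FALSE])
    rmse <- sqrt(mean((pred - y_val)^2))
    if (rmse < best_rmse) {
      best_rmse <- rmse
      best_set  <- features
    }
    ## drop the least important feature according to the forest's ranking
    imp      <- importance(fit)[, 1]
    features <- setdiff(features, names(which.min(imp)))
  }
  list(features = best_set, rmse = best_rmse)
}
```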
3 Validation
The first step to validate the model ideas was to select a suitable feature subset. We
applied the feature selection algorithm described in Section 2.4 to the same sample dataset
used in the same section to evaluate the suitability of different modeling approaches.
According to the discussion in Section 2.2.2, we should have applied our feature selection
algorithm separately to single-piece bid and ask models defined on the full time interval
52–100. However, due to time constraints, we only optimized one side, Fb ; Fa was estimated
from Fb by simply taking the ask side of the price features. The selected feature sets Fb and Fa
were used by all bid and ask sub-models respectively as suggested by the optimization
problem relaxation described in Section 2.2.2.
The following step was to learn the optimal set P of time intervals. We again used the
sample dataset from Section 2.3 (the one used to evaluate the suitability of different
modeling approaches) and computed the Fb feature subset found in the previous step. The
application of the partitioning algorithm (Section 2.1) to the bid model (i.e., Mbid (t))
delivered the following set of time intervals: {52–52, 53–53, 54–55, 56–58, 59–64, 65–73, 74–100}.
Our final model-fitting setup consisted of three datasets of 50000 samples, each randomly
sampled from the training dataset with the same security id proportions as the test
dataset. We computed Fb and Fa for each sample and trained a complete model on each of
the three datasets. Finally, the models were applied to the test dataset and the three
predictions from each sub-model were averaged.
Figure 1: Evolution of public (dashed line) and private (solid line) scores. Feature extrac-
tion, selection, model partitioning and final prediction average were performed sequen-
tially as indicated in the plot.
Fig. 1 shows the evolution of the public and private scores during the two months of the
competition. The private score is the RMSE on the full test set and was only disclosed
after the competition finished; the public score is the RMSE computed on approximately
30% of the test data and was available to all competitors in real time.
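The final fitting and averaging procedure described above can be sketched as follows. The helpers fit_full_model() and predict_full() stand in for fitting the composed bid/ask sub-models and producing the 52–100 forecasts; they, together with the security_id column name, are hypothetical placeholders rather than the actual code of the winning solution.

```r
## Sketch of the final setup: three 50000-sample training subsets drawn with
## the test set's security_id proportions, one complete model fitted per
## subset, and the three predictions averaged.  fit_full_model() and
## predict_full() are hypothetical helpers.
set.seed(1)

sample_like_test <- function(train, test, n = 50000) {
  ## draw rows of 'train' so that security_id proportions match the test set
  target <- round(prop.table(table(test$security_id)) * n)
  idx <- unlist(lapply(names(target), function(s) {
    rows <- which(train$security_id == s)
    rows[sample.int(length(rows), min(target[[s]], length(rows)))]
  }))
  train[idx, ]
}

subsets     <- lapply(1:3, function(i) sample_like_test(train, test))
models      <- lapply(subsets, fit_full_model)            # bid/ask sub-models
predictions <- lapply(models, function(m) predict_full(m, test))
final_pred  <- Reduce(`+`, predictions) / length(predictions)
```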
Public and private scores are more correlated during the initial process of adding new
features than during the second process of selecting the optimal feature subset. This
lack of correlation could be due to a methodological bug during the execution of the
feature selection algorithm: during this step we used a single dataset of 50000 samples
instead of the three datasets used in the other validation stages. This could have biased
our feature selection towards features that better explain this particular dataset. Finally,
the best solution was obtained by averaging the predictions obtained from each of the
three models fitted respectively with the three sample datasets (the last three
submissions). In order to test the main hypothesis, we also submitted an additional
solution based on a single-piece model (4th and 5th last submissions). The many-piece
model solution clearly provided the final edge to win. Therefore, we consider that this is
a strong positive indicator for the model hypothesis discussed in Section 2. It is worth
mentioning that, according to information disclosed in the challenge forums, a common
modeling choice among most competitors was to use a single time interval with constant
post-liquidity-shock bid and ask prices. This is an additional clue that the implementation
of our main hypothesis was a key component of our solution.
4 Conclusions
This letter presented our solution for the Kaggle Algorithmic Trading Challenge. Our
main design hypothesis was that the predictive potential should be higher close to the
liquidity shock and should degrade with distance from it. This hypothesis guided the
design of our model architecture and required a complex feature extraction and selection
strategy. An additional self-imposed constraint on this strategy was to generate only
semantically meaningful features. This constraint was motivated by the authors' aim to
produce a predictive model with the highest explanatory potential, but it also helped to
identify features which had not been initially selected using a simple backward feature
elimination method.
Acknowledgments: The authors were partially supported by the FIRST program.
References
[1] Kaggle Algorithmic Trading Challenge, http://www.kaggle.com/c/AlgorithmicTradingChallenge/data
[2] F. Lillo, J. D. Farmer and R. N. Mantegna, "Master curve for price-impact function," Nature, vol. 421, pp. 129–130, 2003.
[3] T. Foucault, O. Kadan and E. Kandel, "Limit Order Book as a Market for Liquidity," Review of Financial Studies, vol. 18, no. 4, pp. 1171–1217, 2005.
[4] J. Ulrich, "Package TTR: Technical Trading Rules," CRAN Repository: http://cran.r-project.org/web/packages/TTR/TTR.pdf
[6] A. Liaw, "Package randomForest: Breiman and Cutler's Random Forest for Classification and Regression," CRAN Repository: http://cran.r-project.org/web/packages/randomForest/randomForest.pdf
[9] T. Van Nguyen and B. Mishra, "Modeling Hospitalization Outcomes with Random Decision Trees and Bayesian Feature Selection," Unpublished, Site: http://www.cs.nyu.edu/mishra/PUBLICATIONS/mypub.html