Developing & Backtesting Systematic Trading Strategies
Brian G. Peterson
updated 24 March 2017
• execution platform
• access and proximity to markets
• products you can trade
• data availability and management
• business structure
• capital available
• fees/rebates
Choosing a benchmark
The choice of what benchmarks to measure yourself against provides opportunities for better understanding your performance.
1. archetypal strategies
In many cases, you can use a well-known archetypal strategy as a benchmark. For example, a momentum strategy could use an MA-cross strategy as a benchmark. If you are comparing two or more candidate indicators or signal processes, you can also use one as a benchmark for the other (more on this later).
2. alternative indices
EDHEC, Barclays, HFR, and others offer alternative investment style indices. One of these may be a good fit as a benchmark for your strategy. Be aware that they tend to have a downward bias imposed by averaging, or are even averages of averages, so “beating the benchmark” in your backtest would be a bare minimum first step if using one of these as your benchmark. As an example, many CTAs are trend followers, so a momentum strategy in commodities may often be fairly compared to a CTA index.
3. market observables
Market observable benchmarks are perhaps the simplest available benchmarks, and can tie in well to business objectives and to optimization.
Some easily observable benchmarks in this category (see the sketch after this list):
• $/day, %/day
• $/contract
• % daily volume
• % open interest
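As a minimal sketch, these benchmarks can be computed directly from a small table of daily results. The R code below assumes a hypothetical data frame of daily P&L, contracts traded, market volume, and open interest; the column names are illustrative, not a standard schema.

    # hypothetical daily results for a single strategy/product
    daily <- data.frame(
      date          = as.Date("2017-03-01") + 0:4,
      pnl           = c(1200, -300, 800, 450, -150),  # $ P&L per day
      contracts     = c(40, 35, 50, 30, 25),          # contracts traded
      market_volume = c(90000, 85000, 110000, 95000, 80000),
      open_interest = c(240000, 241000, 239500, 242000, 240500)
    )

    dollars_per_day      <- mean(daily$pnl)                        # $/day
    dollars_per_contract <- sum(daily$pnl) / sum(daily$contracts)  # $/contract
    pct_daily_volume     <- 100 * mean(daily$contracts / daily$market_volume)
    pct_open_interest    <- 100 * mean(daily$contracts / daily$open_interest)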
Choosing an Objective³

³ “When measuring results against objectives, start by making sure the objectives are correct.” - Ben Horowitz (2014)

In strategy creation, it is very dangerous to start without a clear objective.
Market observable benchmarks (see above) may form the core of
the first objectives you use to evaluate strategy ideas.
All of these ratios have the properties that you want to maximize
them in optimization, and that they are roughly comparable across
different trading systems.
Additional properties of the trading system that you may target
when evaluating strategy performance but which we would not put
into the category of business requirements include:
Many ideas will fail the process at this point. The idea will be intriguing or interesting, but not able to be formulated as a conjecture. Or, once formulated as a conjecture, you won’t be able to formulate an expected outcome. Often, the conjecture is a statement about the nature of markets. This may be an interesting, and even verifiable, point, but not be a prediction about market behavior. So a good conjecture also makes a prediction, or an observation that amounts to a prediction.
Evaluating Indicators
points that precede a short term trend centered around ten periods.
The downside of this type of empirical analysis is that it is inherently
a one-off process, developed for a specific indicator.
the robustness of the bar generating process itself, e.g. by varying the start time of the first bar. We will almost never run complete strategy tests on bar data, preferring to generate the periodic indicator and then apply the signal and rules processes to higher frequency data. In this way the data used to generate the indicator is just what is required by the indicator model, with more realistic market data used for generating signals and for evaluating rules and orders.
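A minimal sketch of the bar-robustness check, assuming a hypothetical data frame of tick timestamps and prices: bars are rebuilt at several start-time offsets and the indicator is recomputed for each. A robust indicator should change little as the offset varies.

    set.seed(2)
    ticks <- data.frame(  # placeholder tick data; replace with real trades
      time  = as.POSIXct("2017-03-01 09:30:00", tz = "UTC") + cumsum(rexp(5000, 1/2)),
      price = 100 + cumsum(rnorm(5000, 0, 0.01))
    )

    # aggregate ticks into bars of 'width' seconds, shifting the first bar by 'offset'
    make_bars <- function(ticks, width = 300, offset = 0) {
      bin <- floor((as.numeric(ticks$time) - offset) / width)
      tapply(ticks$price, bin, function(p) p[length(p)])  # bar closes
    }

    offsets <- c(0, 60, 120, 180, 240)
    sma_by_offset <- lapply(offsets, function(o) {
      closes <- make_bars(ticks, width = 300, offset = o)
      stats::filter(closes, rep(1/10, 10), sides = 1)     # 10-bar SMA indicator
    })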
The analysis on indicators described here provides another oppor-
tunity to reject your hypothesis, and go back to the drawing board.
Evaluating Signals
entry and exit signals, the distribution of the period between entries
and exits, degree of overlap between entry and exit signals, and so
on.
Given the empirical data about the signal process, it is possible to
develop a simulation of statistically similar processes. A simulation
of this sort is described by Aronson (2006) for use in evaluating entire
strategy backtest output (more on this below), but it is also useful at
this earlier stage. These randomly simulated signal processes should display (random) distributions of returns, which can be used to assess return bias introduced by the historical data. If you parameterize the statistical properties of the simulation, you can compare these parameterized simulations to the expectation generated from each parameterization of the strategy’s signal generating process.
It is important to separate a parameterized and conditional simu-
lation from an unconditional simulation. Many trading system Monte
Carlo tools utilize unconditional sampling. They then erroneously
declare that a good system must be due to luck because its expecta-
tions are far in the right-hand tail of the unconditional distribution.
One of the only correct inferences that you can deduce from an unconditional simulation is the mean return bias of the historical series.
It would make little sense to simulate a system that scalps every
tick when you are evaluating a system with three signals per week.
It is critical to center the simulation parameters on the statistical properties of the system you are trying to evaluate. The simulated signal processes should have frequency parameters (number of enter/exit signals, holding period, etc.) nearly indistinguishable from those of the process you are evaluating, to ensure that you are really looking at comparable things.
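A minimal sketch of such a conditional simulation, with the trade count and holding-period distribution treated as parameters to be matched to the real strategy (the values below are placeholders):

    set.seed(42)
    prices   <- 100 + cumsum(rnorm(2500, 0.02, 1))  # placeholder price series
    n_trades <- 75                                  # match the real signal frequency
    hold     <- pmax(1, round(rlnorm(n_trades, log(10), 0.5)))  # match holding periods

    simulate_once <- function() {
      entries <- sort(sample(seq_len(length(prices) - max(hold)), n_trades))
      exits   <- pmin(entries + hold, length(prices))
      sum(prices[exits] - prices[entries])          # $ P&L of one random signal set
    }

    random_pnl <- replicate(1000, simulate_once())
    # where does the real backtest P&L fall in this conditional distribution?
    quantile(random_pnl, c(0.05, 0.5, 0.95))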
Because every signal is a prediction, when analyzing signal processes we can begin to fully apply the literature on model specification and testing of predictions. From the simplest available methods, such as mean squared model error or kernel distance from an ideal process, through extensive evaluation as suggested for Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), effective number of parameters, and the cross validation of Hastie, Tibshirani, and Friedman (2009), and including time series specific models such as the data driven “revealed performance” approach of Racine and Parmeter (2012): all available tools from the forecasting literature should be considered for evaluating proposed signal processes.
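As a minimal sketch of the simplest of these methods, each candidate signal can be scored as a predictor of forward returns, here with mean squared error and AIC from simple linear models (the signal series below are synthetic placeholders):

    set.seed(1)
    fwd_ret <- rnorm(500, 0, 0.01)             # forward returns to be predicted
    sig_a   <- fwd_ret + rnorm(500, 0, 0.02)   # noisy but informative signal
    sig_b   <- rnorm(500, 0, 1)                # pure noise, for comparison

    mse <- function(sig, y) mean(lm(y ~ sig)$residuals^2)
    c(mse_a = mse(sig_a, fwd_ret), mse_b = mse(sig_b, fwd_ret))

    # information-criterion comparison of the two candidate signal models
    AIC(lm(fwd_ret ~ sig_a), lm(fwd_ret ~ sig_b))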
It should be clear that evaluating the signal generating process
offers multiple opportunities to re-evaluate assumptions about the
method of action of the strategy, and to detect information bias or
luck before moving on to (developing and) testing the rest of the
strategy.
Evaluating Rules
By the time you get to this stage, you should have experimental confirmation that the indicator(s) and signal process(es) provide statistically significant information about the instruments you are examining. If not, stop and go back and reexamine your hypotheses. Assuming that you do have both a theoretical and empirical basis on which to proceed, it is time to define the strategy’s trading rules.
Much of the work involved in evaluating “technical trading rules”
described in the literature is really an evaluation of signal processes,
described in depth above. Rules should refine the way the strategy
‘listens’ to signals, producing path-dependent actions based on the
current state of the market, your portfolio, and the indicators and
signals. Separate from whether a signal has predictive power or not,
as described above, evaluation of rules is an evaluation of the actions
taken in response to the rule.
entry rules
Most signal processes designed by analysts and described in the literature correspond to trade entry. Described another way, every entry rule will likely be tightly coupled to a signal (possibly a composite signal). If the signal (prediction) has a positive expectation, then the rule should have potential to make money. With any proposed system, it is most likely valuable to test both aggressive and passive entry order rules.
If the system makes money in the backtest with a passive entry
on a signal, but loses money with an aggressive entry which crosses
the bid-ask spread, or requires execution within a very short time
frame after the signal, the likelihood that the strategy will work in
production is greatly reduced.
Conversely, if the system is relatively insensitive in the backtest
to the exact entry rule from the signal process, there will likely be a
positive expectation for the entry in production.
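A minimal sketch of this sensitivity test: re-price the same entries under passive and aggressive fill assumptions, where an aggressive entry pays the offer and a passive entry (when filled) earns the bid. The trades and half-spread below are hypothetical, and the sketch ignores the probability that passive orders go unfilled:

    half_spread <- 0.5   # assumed half bid-ask spread, in price units
    trades <- data.frame(
      entry_mid = c(100.00, 101.50,  99.75, 102.25),
      exit_mid  = c(101.00, 101.00, 100.50, 103.00)
    )
    pnl_aggressive <- trades$exit_mid - (trades$entry_mid + half_spread)  # pay the offer
    pnl_passive    <- trades$exit_mid - (trades$entry_mid - half_spread)  # earn the bid

    # if total P&L flips sign between the two assumptions, the edge may
    # not survive real execution
    c(aggressive = sum(pnl_aggressive), passive = sum(pnl_passive))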
Another analysis of entry rules that may be carried out, both on the backtest and in post-trade analysis, is to extract the distribution of durations between entering the order and getting a fill. Differences between the backtest and production will help you calibrate the backtest’s expectations, while information from post trade analysis will help you calibrate your execution and microstructure parameters.
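A sketch of the fill-latency extraction, assuming order placement and fill timestamps are available from the backtest or from production order logs:

    orders <- data.frame(  # hypothetical order and fill timestamps
      placed = as.POSIXct("2017-03-01 09:30:00") + c(0, 45, 120, 300, 610),
      filled = as.POSIXct("2017-03-01 09:30:00") + c(2, 95, 121, 420, 615)
    )
    fill_latency <- as.numeric(difftime(orders$filled, orders$placed, units = "secs"))
    summary(fill_latency)
    quantile(fill_latency, c(0.5, 0.9, 0.99))  # compare backtest vs. production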
exit rules
There are two primary classes of exit rules, signal based and empirical; evaluation and risks for these two groupings are different. Exit rules, whether driven by the same signal process as the entry rules or based on empirical evidence from the backtest, are often the difference between a profitable and an unprofitable strategy.
Signal driven exit rules may follow the same signal process as the entry rules (e.g. band and midpoint models, or RSI-style overbought/oversold models), or they may have their own signal process. For example, a process that seeks to identify slowing or reversal of momentum may be different from a process to describe the formation and direction of a trend. When evaluating signal based exits, the same process from above for testing aggressive vs. passive order logic is likely valuable. Additionally, testing trailing order logic after the signal may “let the winners run” in a statistically significant manner.
Empirical exit rules are usually identified after initial tests of other types of exits, or after parameter optimization (see below). They include classic risk stops (see below) and profit targets, as well as trailing take profits or pullback stops. Empirical profit rules are usually identified using the outputs of measures like Mean Adverse Excursion (MAE) / Mean Favorable Excursion (MFE) (a computational sketch follows this list), for example:
• MFE shows that trades that have advanced x % or ticks are unlikely to advance further, so the trade should be taken off
• a path fit is still strongly positive, even though the signal process indicates to be on the lookout for an exit opportunity, so a trailing take profit may be in order
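A minimal sketch of the excursion calculation, assuming the price path of each (long) trade from entry to exit is available:

    paths <- list(  # hypothetical per-trade price paths, entry first
      c(100, 100.5, 101.2, 100.8, 101.5),
      c(100,  99.4,  99.0,  99.8, 100.2),
      c(100, 100.1,  99.7, 101.0, 102.0)
    )
    excursions <- t(sapply(paths, function(p) {
      c(MAE   = min(p) - p[1],          # worst excursion against the trade
        MFE   = max(p) - p[1],          # best excursion in favor of the trade
        final = p[length(p)] - p[1])    # realized result
    }))
    excursions
    # plotting 'final' against MFE suggests profit-target levels; against
    # MAE, it suggests where a risk stop would have cut off the worst trades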
risk rules
There are several types of risk rules that may be tested in the backtest, and the goals of adding them should be derived from your business objectives. By the time you get to adding risk rules in the backtest, the strategy should be built on a confirmed positive expectation for the signal process, signal-based entry and exit rules should have been developed and confirmed, and any empirical profit taking rules should have been evaluated. There are two additional “risk” rule types that should commonly be considered for addition to the strategy: empirical risk stops, and business constraints.
Empirical risk stops are generally placed such that they would cut off the worst losing trades. This is very strategy dependent. You may be able to get some hints from the signal analysis described above. Or, you may place an empirical risk stop based on what you see after doing parameter optimization or post trade analysis (see tools below), to cut off the trade after there is a very low expectation of a winning trade.
If done poorly, exit or risk rules added after parameter optimization are just cherry-picking the best returns; if done properly, they may be used to take advantage of what you’ve learned about the statistical expectations for the system, and may improve out of sample correspondence to the in-sample periods.
order sizing
We tend to do minimal order sizing in backtests. Most of our backtests are run on “one lot” or equivalent portfolios. The reason is that the backtest is trying to confirm the hypothetical validity of the trading system, not really trying to optimize execution beyond a certain point. How much order sizing makes sense in a backtest is dependent on the strategy and your business objectives. Typically, we will do “leveling” or “pyramiding” studies while backtesting, starting from our known-profitable business practices.
We will not do “compounding” or “percent of equity” studies, because our goal in backtesting is to confirm the validity of the positive expectation and rules embodied in the system, not to project revenue opportunities. Once we have a strategy model we believe in, and especially once we have been able to calibrate the model on live trades, we will tend to drive order sizing from microstructure analysis, post trade analysis of our execution, and portfolio optimization.
rule burden
Beware of rule burden. Too many rules will make a backtest look excellent in-sample, and may even work in walk forward analysis, but are very dangerous in production. One clue that you are overfitting by adding too many rules is that you add a rule or rules to the backtest after running an exhaustive analysis and being disappointed in the results. This introduces data snooping bias. Some data snooping is unavoidable, but if you’ve run multiple parameter optimizations on your training set or (worse) multiple walk forward analyses, and then added rules after each run, you’re likely introducing dangerous biases into your output.
Parameter optimization¹⁰

¹⁰ “Every trading system is in some form an optimization.” (Tomasini)

Parameter optimization is important for indicators, signals, rules, and complete trading systems. Care needs to be taken to do this as safely as possible, and that care needs to start with clearly defined objectives. One danger of using walk forward analysis is that you can introduce data snooping biases if you apply it multiple times with differing goals, looking for the best outcomes.¹¹ As with objective and hypothesis generation, you should be clear about the definition of success before performing the test and polluting your data set with prior knowledge.

¹¹ See the cross validation section, below, for additional cautionary notes about correct use of walk forward analysis.
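A skeleton of the walk forward loop, with a toy moving-average objective standing in for a real strategy; the window lengths and parameter grid are assumptions to be set from your own objectives:

    set.seed(7)
    returns <- rnorm(1000, 0.0002, 0.01)    # placeholder daily returns

    ret_by_param <- function(returns, n) {  # toy objective: SMA-filtered exposure
      sig <- stats::filter(returns, rep(1/n, n), sides = 1)
      sig[is.na(sig)] <- 0
      pos <- c(0, sign(sig)[-length(sig)])  # trade on the prior bar's signal
      sum(returns * pos)
    }

    is_len <- 250; oos_len <- 50; params <- c(5, 10, 20, 40)
    oos_results <- c()
    for (start in seq(1, length(returns) - is_len - oos_len + 1, by = oos_len)) {
      is_ret  <- returns[start:(start + is_len - 1)]
      oos_ret <- returns[(start + is_len):(start + is_len + oos_len - 1)]
      best <- params[which.max(sapply(params, function(p) ret_by_param(is_ret, p)))]
      oos_results <- c(oos_results, ret_by_param(oos_ret, best))
    }
    sum(oos_results)  # OOS performance with time-varying parameters; compare
                      # against the in-sample figures to measure deterioration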
Regime Analysis
Evaluating Trades
Entire books have been written extolling the virtues or lamenting the
problems of one performance measure over another. We have chosen
to take a rather inclusive approach both to trade and P&L based
measures and to return based measures (covered later). Generally,
we run as many metrics as we can, and look for consistently good
metrics across all common return and cash based measures. Trade
and P&L based measures have an advantage of being precise and
reconcilable to clearing statements, but disadvantages of not being
easily comparable between products, after compounding, etc.
1. FIFO
FIFO is “first in, first out”, and pairs entry and exit transactions
by time priority. We generally do not calculate statistics on FIFO
because it is impossible to match P&L to clearing statements; very
few institutional investors will track to FIFO. FIFO comes from
accounting for physical inventory, where old (first) inventory is
accounted for in the first sales of that inventory. It can be very
difficult to calculate a cost basis if the quantities of your orders
vary, or you get a large number of partial fills, as any given closing
fill may be an amalgam of multiple opening fills.
2. tax lots
Tax lot “trades” pair individual entry and exit transactions together to gain some tax advantage, such as avoiding short term gains, harvesting losses, lowering the realized gain, or shifting the realized gain or loss to another tax period or tax jurisdiction. This type of analysis can be very beneficial, though it will be very dependent on the goals of the tax lot construction.
3. flat to flat
Flat to flat “trade” analysis marks the beginning of the trade with the first transaction to move the position off zero, and marks the end of the “trade” with the transaction that brings the position back to zero, or “flat”. It will match brokerage statements of realized P&L when the position is flat, and the average cost of open positions.
4. flat to reduced
The flat to reduced methodology marks the beginning of the “trade” with the first transaction to increase the position, and ends a “trade” when the position is reduced at all, going closer to zero. Brokerage accounting practices match this methodology exactly: they will adjust the average cost of the open position as positions get larger or further from flat, and will realize gains whenever the position gets smaller (closer to flat). Flat to reduced is our preferred methodology most of the time. This methodology works poorly for bootstrap or other trade resampling methodologies, as some of the aggregate statistics are highly intercorrelated because multiple “trades” share the same entry point. You need to be aware of the skew in some statistics that may be created by this methodology, and either correct for it or not use those statistics. Specific problems will depend on the strategy and trading pattern.
The flat to reduced method is a good default choice for “trade”
definition because it is the easiest to reconcile to statements, other
execution and analysis software, and accounting and regulatory
standards.
5. increased to reduced
An analytically superior alternative to FIFO is “increased to reduced”. This approach marks the beginning of a “trade” any time the position increases (gets further from flat/zero) and marks the end of a trade when the position is decreased (gets smaller in absolute terms, closer to flat), utilizing average cost for the cost basis. Typically, flat to flat periods will be extracted, and then broken into pieces matching each reduction to an increase, in expanding order from the first reduction. This is “time and quantity” priority, and is analytically more repeatable and manageable than FIFO. If you have a reason for utilizing FIFO-like analysis, consider using “increased to reduced” instead. This methodology is sometimes called “average cost FIFO”, and there is an average cost LIFO variant as well. In contrast to traditional FIFO, realized P&L on a per-period basis will match the brokerage statements using this methodology.
Much of this analysis can also be applied to the data much earlier in the process, when confirming the power of the strategy components.
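As a concrete sketch of the flat to reduced bookkeeping described above, the loop below pairs each reduction against an average-cost basis from a signed transaction list (it does not handle fills that cross through zero, which would need to be split):

    txns <- data.frame(qty   = c( 2,  1, -1,   -2,  -1),  # signed fill quantities
                       price = c(10, 11, 12, 11.5,  13))
    pos <- 0; avg_cost <- 0; realized <- c()
    for (i in seq_len(nrow(txns))) {
      q <- txns$qty[i]; p <- txns$price[i]
      if (pos == 0 || sign(q) == sign(pos)) {  # increase: update average cost
        avg_cost <- (avg_cost * pos + p * q) / (pos + q)
        pos <- pos + q
      } else {                                 # reduce: realize a "trade"
        realized <- c(realized, (p - avg_cost) * abs(q) * sign(pos))
        pos <- pos + q
        if (pos == 0) avg_cost <- 0
      }
    }
    realized  # per-reduction realized P&L, as a brokerage statement would show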
The goal of many trading strategies is to produce a smoothly upward sloping equity curve. To compare the output of your strategy to an idealized representation, most methodologies utilize linear models. The linear fit of the equity curve may be tested for slope, goodness of fit, and confidence bounds. Kestner (2003) proposes the K-Ratio, which is the t-test statistic on the slope of the linear model. Pardo (2008, 202–9) emphasizes the number of trades and the slope of the resulting curve. You can extend the analysis beyond the simple linear model to generalized linear models or generalized additive models, which can provide more robust fits that do not vary as much with small changes to the inputs.
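A minimal sketch of these linear-model diagnostics on a placeholder equity curve; the final line computes a slope-over-standard-error statistic in the spirit of the K-Ratio, though published definitions differ in their scaling:

    set.seed(3)
    equity <- cumsum(rnorm(500, 0.5, 4))   # placeholder cumulative equity
    t_idx  <- seq_along(equity)
    fit    <- lm(equity ~ t_idx)

    summary(fit)$r.squared                 # goodness of fit
    confint(fit, "t_idx")                  # confidence bounds on the slope
    coef(summary(fit))["t_idx", "Estimate"] /
      coef(summary(fit))["t_idx", "Std. Error"]  # slope / SE, K-Ratio-like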
Dangers of aggregate statistics . . .
insight into methods of action in the strategy, and can lead to further
strategy development.
It is important when evaluating MAE/MFE to do this type of analysis in your test set. One thing that you want to test out of sample is whether the MAE threshold is stable over time. You want to avoid, as with other parts of the strategy, going over and “snooping” the data for the entire test period, or all your target instruments.
Microstructure Analysis
Evaluating Returns
rather than analysis on cash P&L. Some examples include:
• tail risk measures
• volatility analysis
• factor analysis
• factor model Monte Carlo
• style analysis
• applicability to asset allocation (see below)
What if you don’t have enough data? Let’s suppose you want 500 observations for each of 50 synthetic assets, based on the rule of thumb above. That is approximately two years of daily returns. This number of observations would likely produce a high degree of confidence if you had been running the strategy on 50 synthetic assets for two years in a stable way. If you want to allocate capital in alignment with your business objectives before you have enough data, you can do a number of things:
• use the data you have, and re-optimize frequently to check for stability and add more data
• use higher frequency data, e.g. hourly instead of daily
• use a technique such as Factor Model Monte Carlo (Jiang 2007; Zivot 2011, 2012) to construct equal histories
• optimize over fewer assets, requiring a smaller history
Fewer assets? Suppose you have three strategies, each with 20 configurations. Using the rule of thumb above, you’d want a minimum of 600 days of production results. If instead you use the aggregate return of each of the three strategies, you could likely get a directionally correct optimization result from 30 or so observations, or a month and a half of daily data with all three strategies running. More data is generally better, if available. This compares well in practice to the three years of monthly returns (36 observations) often used for fund of hedge funds investments.
Also discuss:
• rebalancing implications
• discuss implications of Factor Model Monte Carlo
• techniques for backing out from weights to capital to order sizes
Probability of Overfitting¹³

¹³ “We should recognize the reality that any simulated (backtest) performance presented to us likely overstates future prospects. By how much?” - Antti Ilmanen (2011), p. 112
This entire paper has been devoted to avoiding overfitting. At the end of the modeling process, we still need to evaluate how likely it is that the model may be overfit, and develop a haircut for the model that may identify how much out of sample deterioration could be expected.
With all of the methods described in this section, it is important to note that you are no longer measuring performance; that was done earlier in the process.
resampled trades
Tomasini (2009, 104–9) describes a basic resampling mechanism for trades (Figure 2). The period returns for all “flat to flat” trades in the backtest (and the flat periods, with period returns of zero) are sampled without replacement. After all trades or flat periods have been sampled, a new time series is constructed by applying the original index to the resampled returns. This gives a number of series which will have the same mean and net return as the original backtest, but differing drawdowns and tail risk measures. This allows confidence bounds to be placed on the drawdown and tail risk statistics based on the resampled returns.

[Figure 2: source: Tomasini]
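A minimal sketch of this resampling, using a short vector of placeholder flat to flat trade returns (the zeros are flat periods); shuffling without replacement preserves the mean and net return while the drawdown distribution varies:

    set.seed(11)
    trade_rets <- c(0.04, -0.01, 0.02, 0, 0, -0.03, 0.05, 0, 0.01, -0.02)

    max_drawdown <- function(r) {          # worst peak-to-trough, as a fraction
      eq <- cumprod(1 + r)
      max((cummax(eq) - eq) / cummax(eq))
    }

    dd_dist <- replicate(2000, max_drawdown(sample(trade_rets)))
    quantile(dd_dist, c(0.5, 0.95, 0.99))  # confidence bounds on drawdown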
This analysis can be extended to resample without replacement only the duration and quantity of the trades. These entries, exits, and flat periods are then applied to the original price data. This will generate a series of returns which are potentially very different from the original backtest.¹⁴ In this model, average trade duration, percent time in market, and percent time long and short will remain the same, but all the performance statistics will vary based on the new path of the returns. This should allow a much more coherent placement of the chosen strategy configuration versus other strategy configurations with similar time series statistical properties.

¹⁴ We are applying these methodologies here to gain or refute confidence in backtested results. All of these analytical methods may also be applied to post trade analysis to gain insight into real trades and execution processes.
Burns (2006) describes a number of tests that may be applied to the resampled returns to evaluate the key question of skill versus luck. You should be able to determine, via the p-values of the control statistics, some confidence that the strategy is the product of skill rather than luck, as well as the strength of the predictive power of the strategy, in a manner similar to that obtained earlier on the signal processes. You can also apply all the analysis that was utilized in evaluating the strategy and its components along the way to the resampled returns, to see where in the distribution of statistics the chosen strategy configuration fits.
What if your strategy is never flat? These resampling methods presume flat to flat construction, where trades are marked from flat points in the backtest. Things get substantially more complicated if the strategy is rarely flat: consider, for example, a long only trend follower which adds and removes position around a core bias position over very long periods of time, or a market making strategy which adds and reduces positions around some carried inventory. In these cases, constructing resampled returns that are statistically similar to the original strategy is very difficult. One potential choice is to make use of the “increased to reduced” trade definition methodology. If you are considering only one instrument, or a series of independent instruments, the increased to reduced methodology can be appropriately applied.¹⁵ If the instruments are dependent or highly correlated, the difficulty level of drawing useful statistical inference goes up yet again.

¹⁵ One of the only cases in which a FIFO-style methodology provides useful statistical information.
In the case of dependent or highly correlated instruments or positions, you may need to rely on portfolio-level Monte Carlo analysis rather than attempting to resample trades. While it may be useful to learn if the dependent structure adds to overall returns over the random independent resampled case, it is unlikely to give you much insight into improving the strategy, or much confidence about the strategy configuration that has made it all the way to these final analyses.
what can we learn from resampling methods?
what would be incorrect inferences?
Monte Carlo
If there is strong correlation or dependence between instruments in the backtest, then you will probably have to resample or perform Monte Carlo analysis from portfolio-level returns, rather than trades. You lose the ability to evaluate trade statistics, but will still be able to assess the returns of the backtest against the sampled portfolios for your primary business objectives and benchmarks. As was discussed in more detail under Evaluating Signals, it is important that any resampled data preserve the autocorrelation structure of the original series.
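One simple way to do this is a fixed-length block bootstrap: resampling contiguous blocks, rather than individual observations, preserves short-range autocorrelation. The sketch below hand-rolls the resampler on simulated autocorrelated returns; packaged implementations such as tsboot in the boot package offer similar functionality:

    set.seed(5)
    port_ret  <- as.numeric(arima.sim(list(ar = 0.3), n = 1000, sd = 0.01))
    block_len <- 20   # assumed block length; tune to the autocorrelation horizon

    block_boot <- function(r, l) {
      starts <- sample(seq_len(length(r) - l + 1),
                       ceiling(length(r) / l), replace = TRUE)
      out <- unlist(lapply(starts, function(s) r[s:(s + l - 1)]))
      out[seq_along(r)]
    }

    boot_sharpe <- replicate(1000, {
      b <- block_boot(port_ret, block_len)
      mean(b) / sd(b) * sqrt(252)          # annualized Sharpe of one resample
    })
    quantile(boot_sharpe, c(0.05, 0.5, 0.95))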
cross validation
Cross validation is a widely used statistical technique for model evaluation. In its classical statistical form, the data is cut in half, the model is trained on one half, and then the trained model is tested on the half of the data that was “held out”. As such, this type of model will often be referred to as a single hold out cross validation model. In time series analysis, the classical formulation is often modified to use a smaller hold-out period than half the data, in order to create a larger training set. Challenges with single hold out cross validation include that one out of sample set is hard to draw firm inferences from, and that the OOS period may additionally be too short to generate enough signals and trades to gain confidence in the model.
In many ways, walk forward analysis is related to cross validation. The OOS periods in walk forward analysis are effectively validation sets, as in cross validation. You can and should measure the out of sample deterioration of your walk forward model between the IS performance and the OOS performance of the model. One advantage of walk forward is that it allows parameters to change with the data. One disadvantage is that there is a temptation to make the OOS periods for walk forward analysis rather small, making it very difficult to measure deterioration from the training period. Another potential disadvantage is that the IS periods are overlapping, which can be expected to create autocorrelation among the parameters. This autocorrelation is mixed from an analysis perspective: a degree of parameter stability is usually considered an advantage, but the IS periods are not all independent draws from the data, and the OOS periods will later be used as IS periods, so any analytical technique that assumes i.i.d. observations should be viewed at least with skepticism.
k-fold cross validation improves on the classical single hold-out approach.
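A minimal sketch of k-fold validation adapted to time series: the folds are ordered, and each fold is predicted using a model trained only on earlier folds, so no future information leaks into the training set (classical k-fold, which trains on all other folds, would leak):

    set.seed(9)
    d    <- data.frame(x = rnorm(600))     # placeholder signal
    d$y  <- 0.3 * d$x + rnorm(600)         # placeholder response
    k    <- 5
    fold <- cut(seq_len(nrow(d)), breaks = k, labels = FALSE)

    oos_mse <- sapply(2:k, function(f) {   # first fold is training only
      fit  <- lm(y ~ x, data = d[fold < f, ])
      test <- d[fold == f, ]
      mean((test$y - predict(fit, newdata = test))^2)
    })
    oos_mse  # out-of-sample error per fold; compare with in-sample error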
• linear models such as Bailey, Borwein, López de Prado, et al. (2014) and Bailey and López de Prado (2014)
• modifying existing expectations
• CSCV sampling
• Harvey and Liu (2015), Harvey and Liu (2014), and Harvey and Liu (2013) look at Type I vs. Type II error in evaluating backtests, and look at appropriate haircuts based on this.
Acknowledgements
Colophon
References
//www.ma.utexas.edu/users/mks/statmistakes/datasnooping.html.
Sullivan, Ryan, Allan Timmermann, and Halbert White. 1999. “Data Snooping, Technical Trading Rule Performance, and the Bootstrap.” The Journal of Finance 54 (5): 1647–91.
Tomasini, Emilio, and Urban Jaekle. 2009. Trading Systems: A New
Approach to System Development and Portfolio Optimisation.
Tukey, John W. 1962. “The Future of Data Analysis.” The Annals of Mathematical Statistics. JSTOR, 1–67. http://projecteuclid.org/euclid.aoms/1177704711.
Vince, Ralph. 2009. The Leverage Space Trading Model: Reconciling
Portfolio Management Strategies and Economic Theory. John Wiley &
Sons.
White, Halbert L. 2000. “System and Method for Testing Prediction Models and/or Entities.” Google Patents. http://www.google.com/patents/US6088676.
Xie, Yihui. 2014. “R Markdown — Dynamic Documents for R.”
http://rmarkdown.rstudio.com/.