
An Application of Deep Reinforcement Learning to Algorithmic Trading

Thibaut Théate^(a,∗), Damien Ernst^(a)

^(a) Montefiore Institute, University of Liège (Allée de la découverte 10, 4000 Liège, Belgium)

Abstract
This scientific research paper presents an innovative approach based on deep reinforcement learning (DRL) to solve the
algorithmic trading problem of determining the optimal trading position at any point in time during a trading activity
in the stock market. It proposes a novel DRL trading policy so as to maximise the resulting Sharpe ratio performance
indicator on a broad range of stock markets. Denominated the Trading Deep Q-Network algorithm (TDQN), this new DRL approach is inspired by the popular DQN algorithm and significantly adapted to the specific algorithmic trading
problem at hand. The training of the resulting reinforcement learning (RL) agent is entirely based on the generation of
artificial trajectories from a limited set of stock market historical data. In order to objectively assess the performance
of trading strategies, the research paper also proposes a novel, more rigorous performance assessment methodology.
Following this new performance assessment approach, promising results are reported for the TDQN algorithm.
Keywords: Artificial intelligence, deep reinforcement learning, algorithmic trading, trading policy.

1. Introduction

For the past few years, the interest in artificial intelligence (AI) has grown at a very fast pace, with numerous research papers published every year. A key element for this growing interest is related to the impressive successes of deep learning (DL) techniques which are based on deep neural networks (DNN) - mathematical models directly inspired by the human brain structure. These specific techniques are nowadays the state of the art in many applications such as speech recognition, image classification or natural language processing. In parallel to DL, another field of research has recently gained much more attention from the research community: deep reinforcement learning (DRL). This family of techniques is concerned with the learning process of an intelligent agent (i) interacting in a sequential manner with an unknown environment, (ii) aiming to maximise its cumulative rewards and (iii) using DL techniques to generalise the information acquired from the interaction with the environment. The many recent successes of DRL techniques highlight their ability to solve complex sequential decision-making problems.

Nowadays, an emerging industry which is growing extremely fast is the financial technology industry, generally referred to by the abbreviation FinTech. The objective of FinTech is pretty simple: to extensively take advantage of technology in order to innovate and improve activities in finance. In the coming years, the FinTech industry is expected to revolutionise the way many decision-making problems related to the financial sector are addressed, including the problems related to trading, investment, risk management, portfolio management, fraud detection and financial advising, to cite a few. Such decision-making problems are extremely complex to solve as they generally have a sequential nature and are highly stochastic, with an environment partially observable and potentially adversarial. In particular, algorithmic trading, which is a key sector of the FinTech industry, presents particularly interesting challenges. Also called quantitative trading, algorithmic trading is the methodology to trade using computers and a specific set of mathematical rules.

The main objective of this research paper is to answer the following question: how to design a novel trading policy (algorithm) based on AI techniques that could compete with the popular algorithmic trading strategies widely adopted in practice? To answer this question, this scientific article presents and analyses a novel DRL solution to tackle the algorithmic trading problem of determining the optimal trading position (long or short) at any point in time during a trading activity in the stock market. The algorithmic solution presented in this research paper is inspired by the popular Deep Q-Network (DQN) algorithm, which has been adapted to the particular sequential decision-making problem at hand. The research question to be answered is all the more relevant as the trading environment presents very different characteristics from the environments which have already been successfully handled by DRL approaches, mainly significant stochasticity and extremely poor observability.

∗ Corresponding author.
Email addresses: thibaut.theate@uliege.be (Thibaut Théate), dernst@uliege.be (Damien Ernst)
The scientific research paper is structured as follows. First of all, a brief review of the scientific literature around the algorithmic trading field and its main AI-based contributions is presented in Section 2. Afterwards, Section 3 introduces and rigorously formalises the particular algorithmic trading problem considered. Additionally, this section makes the link with the reinforcement learning (RL) approach. Then, Section 4 covers the complete design of the TDQN trading strategy based on DRL concepts. Subsequently, Section 5 proposes a novel methodology to objectively assess the performance of trading strategies. Section 6 is concerned with the presentation and discussion of the results achieved by the TDQN trading strategy. To end this research paper, Section 7 discusses interesting leads as future work and draws meaningful conclusions.

2. Literature review

To begin this brief literature review, two facts have to be emphasised. Firstly, it is important to be aware that many sound scientific works in the field of algorithmic trading are not publicly available. As explained in Li (2017), due to the huge amount of money at stake, private FinTech firms are very unlikely to make their latest research results public. Secondly, it should be acknowledged that making a fair comparison between trading strategies is a challenging task, due to the lack of a common, well-established framework to properly evaluate their performance. Instead, the authors generally define their own framework with their evident bias. Another major problem is related to the trading costs which are variously defined or even omitted.

First of all, most of the works in algorithmic trading are techniques developed by mathematicians, economists and traders who do not exploit AI. Typical examples of classical trading strategies are the trend following and mean reversion strategies, which are covered in detail in Chan (2009), Chan (2013) and Narang (2009). Then, the majority of works applying machine learning (ML) techniques in the algorithmic trading field focus on forecasting. If the financial market evolution is known in advance with a reasonable level of confidence, the optimal trading decisions can easily be computed. Following this approach, DL techniques have already been investigated with good results, see e.g. Arévalo et al. (2016) introducing a trading strategy based on a DNN, and especially Bao et al. (2017) using wavelet transforms, stacked autoencoders and long short-term memory (LSTM). Alternatively, several authors have already investigated RL techniques to solve this algorithmic trading problem. For instance, Moody and Saffell (2001) introduced a recurrent RL algorithm for discovering new investment policies without the need to build forecasting models, and Dempster and Leemans (2006) used adaptive RL to trade in foreign exchange markets. More recently, a few works investigated DRL techniques in a scientifically sound way to solve this particular algorithmic trading problem. For instance, one can first mention Deng et al. (2017) which introduced the fuzzy recurrent deep neural network structure to obtain a technical-indicator-free trading system taking advantage of fuzzy learning to reduce the time series uncertainty. One can also mention Carapuço et al. (2018) which studied the application of the deep Q-learning algorithm for trading in foreign exchange markets. Finally, there exist a few interesting works studying the application of DRL techniques to algorithmic trading in specific markets, such as in the field of energy, see e.g. the article Boukas et al. (2020).

To finish with this short literature review, a sensitive problem in the scientific literature is the tendency to prioritise the communication of good results or findings, sometimes at the cost of a proper scientific approach with objective criticism. Going even further, Ioannidis (2005) even states that most published research findings in certain sensitive fields are probably false. Such concern appears to be all the more relevant in the field of financial sciences, especially when the subject directly relates to trading activities. Indeed, Bailey et al. (2014) claims that many scientific publications in finance suffer from a lack of a proper scientific approach, instead getting closer to pseudo-mathematics and financial charlatanism than rigorous sciences. Aware of these concerning tendencies, the present research paper intends to deliver an unbiased scientific evaluation of the novel DRL algorithm proposed.

3. Algorithmic trading problem formalisation

In this section, the sequential decision-making algorithmic trading problem studied in this research paper is presented in detail. Moreover, a rigorous formalisation of this particular problem is performed. Additionally, the link with the RL formalism is highlighted.

3.1. Algorithmic trading

Algorithmic trading, also called quantitative trading, is a subfield of finance, which can be viewed as the approach of automatically making trading decisions based on a set of mathematical rules computed by a machine. This commonly accepted definition is adopted in this research paper, although other definitions exist in the literature. Indeed, several authors differentiate the trading decisions (quantitative trading) from the actual trading execution (algorithmic trading). For the sake of generality, algorithmic trading and quantitative trading are considered synonyms in this research paper, defining the entire automated trading process. Algorithmic trading has already proven to be very beneficial to markets, the main benefit being the significant improvement in liquidity, as discussed in Hendershott et al. (2011). For more information about this specific field, please refer to Treleaven et al. (2013) and Nuti et al. (2011).
There are many different markets suitable to apply algorithmic trading strategies. Stocks and shares can be traded in the stock markets, FOREX trading is concerned with foreign currencies, or a trader could invest in commodity futures, to only cite a few. The recent rise of cryptocurrencies, such as the Bitcoin, offers new interesting possibilities as well. Ideally, the DRL algorithms developed in this research paper should be applicable to multiple markets. However, the focus will be set on stock markets for now, with an extension to various other markets planned in the future.

In fact, a trading activity can be viewed as the management of a portfolio, which is a set of assets including diverse stocks, bonds, commodities, currencies, etc. In the scope of this research paper, the portfolio considered consists of one single stock together with the agent cash. The portfolio value v_t is then composed of the trading agent cash value v_t^c and the share value v_t^s, which continuously evolves over time t. Buying and selling operations are simply cash and share exchanges. The trading agent interacts with the stock market through an order book, which contains the entire set of buying orders (bids) and selling orders (asks). An example of a simple order book is depicted in Table 1. An order represents the willingness of a market participant to trade and is composed of a price p, a quantity q and a side s (bid or ask). For a trade to occur, a match between bid and ask orders is required, an event which can only happen if p_max^bid ≥ p_min^ask, with p_max^bid (p_min^ask) being the maximum (minimum) price of a bid (ask) order. Then, a trading agent faces a very difficult task in order to generate profit: what, when, how, at which price and which quantity to trade. This is the complex algorithmic trading sequential decision-making problem studied in this scientific research paper.

Table 1: Example of a simple order book

    Side s   Quantity q   Price p
    Ask      3000         107
    Ask      1500         106
    Ask      500          105
    Bid      1000         95
    Bid      2000         94
    Bid      4000         93

3.2. Timeline discretisation

Since trading decisions can be issued at any time, the trading activity is a continuous process. In order to study the algorithmic trading problem described in this research paper, a discretisation operation of the continuous timeline is performed. The trading timeline is discretised into a high number of discrete trading time steps t of constant duration ∆t. In this research paper, for the sake of clarity, the increment (decrement) operations t + 1 (t − 1) are used to model the discrete transition from time step t to time step t + ∆t (t − ∆t).

The duration ∆t is closely linked to the trading frequency targeted by the trading agent (very high trading frequency, intraday, daily, monthly, etc.). Such a discretisation operation inevitably imposes a constraint with respect to this trading frequency. Indeed, because the duration ∆t between two time steps cannot be chosen as small as possible due to technical constraints, the maximum trading frequency achievable, equal to 1/∆t, is limited. In the scope of this research paper, this constraint is met as the trading frequency targeted is daily, meaning that the trading agent makes a new decision once every day.

3.3. Trading strategy

The algorithmic trading approach is rule based, meaning that the trading decisions are made according to a set of rules: a trading strategy. In technical terms, a trading strategy can be viewed as a programmed policy π(a_t | i_t), either deterministic or stochastic, which outputs a trading action a_t according to the information i_t available to the trading agent at time step t. Additionally, a key characteristic of a trading strategy is its sequential aspect, as illustrated in Figure 1. An agent executing its trading strategy sequentially applies the following steps (a minimal code sketch of this loop follows the list):

1. Update of the available market information i_t.
2. Execution of the policy π(a_t | i_t) to get action a_t.
3. Application of the designated trading action a_t.
4. Next time step t → t + 1, loop back to step 1.

Figure 1: Illustration of a trading strategy execution
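To make the sequential nature of this execution loop concrete, the following minimal Python sketch implements the four steps above. The TradingStrategy and market interfaces are hypothetical placeholders introduced for illustration only; they are not part of the paper.

```python
# Minimal sketch of the sequential trading loop described above.
# The market object and the TradingStrategy interface are hypothetical placeholders.

class TradingStrategy:
    def decide(self, information):
        """Programmed policy pi(a_t | i_t): map the available information to a trading action."""
        raise NotImplementedError


def run_trading_strategy(strategy: TradingStrategy, market, horizon: int):
    information = market.get_information()      # step 1: update the available market information i_t
    for t in range(horizon):
        action = strategy.decide(information)   # step 2: execute the policy to get the action a_t
        market.apply(action)                     # step 3: apply the designated trading action a_t
        information = market.get_information()  # step 4: move to time step t+1 and loop back
```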
In the following subsection, the algorithmic trading sequential decision-making problem, which shares similarities with other problems successfully tackled by the RL community, is cast as an RL problem.

3.4. Reinforcement learning problem formalisation

As illustrated in Figure 2, reinforcement learning is concerned with the sequential interaction of an agent with its environment. At each time step t, the RL agent firstly observes the RL environment of internal state s_t and retrieves an observation o_t. It then executes the action a_t resulting from its RL policy π(a_t | h_t), where h_t is the RL agent history, and receives a reward r_t as a consequence of its action. In this RL context, the agent history can be expressed as h_t = {(o_τ, a_τ, r_τ) | τ = 0, 1, ..., t}.

Reinforcement learning techniques are concerned with the design of policies π maximising an optimality criterion, which directly depends on the immediate rewards r_t observed over a certain time horizon.
The most popular optimality criterion is the expected discounted sum of rewards over an infinite time horizon. Mathematically, the resulting optimal policy π* is expressed as the following:

    \pi^* = \arg\max_{\pi} \mathbb{E}[R \mid \pi]    (1)

    R = \sum_{t=0}^{\infty} \gamma^t r_t    (2)

The parameter γ is the discount factor (γ ∈ [0, 1]). It determines the importance of future rewards. For instance, if γ = 0, the RL agent is said to be myopic as it only considers the current reward and totally discards the future rewards. When the discount factor increases, the RL agent tends to become more long-term oriented. In the extreme case where γ = 1, the RL agent considers each reward equally. This key parameter should be tuned according to the desired behaviour.

Figure 2: Reinforcement learning core building blocks
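As a small illustration of Equation 2, the discounted sum of rewards can be computed from a finite sequence of rewards as in the sketch below. This is a truncated illustration only; the formal criterion is defined over an infinite horizon.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards R = sum_t gamma^t * r_t (Equation 2),
    truncated to the available reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With gamma close to 0 the agent is myopic, with gamma close to 1
# every reward counts almost equally.
daily_returns = [0.01, -0.02, 0.005, 0.03]
print(discounted_return(daily_returns, gamma=0.0))   # only the first reward matters
print(discounted_return(daily_returns, gamma=1.0))   # plain sum of all rewards
```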
3.4.1. RL observations

In the scope of this algorithmic trading problem, the RL environment is the entire complex trading world gravitating around the RL agent. In fact, this trading environment can be viewed as an abstraction including the trading mechanisms together with every single piece of information capable of having an effect on the trading activity of the agent. A major challenge of the algorithmic trading problem is the extremely poor observability of this environment. Indeed, a significant amount of information is simply hidden to the trading agent, ranging from some companies' confidential information to the other market participants' strategies. In fact, the information available to the RL agent is extremely limited compared to the complexity of the environment. Moreover, this information can take various forms, both quantitative and qualitative. Correctly processing such information and re-expressing it using relevant quantitative figures while minimising the subjective bias is capital. Finally, there are significant time correlation complexities to deal with. Therefore, the information retrieved by the RL agent at each time step should be considered sequentially as a series of information rather than individually.

At each trading time step t, the RL agent observes the stock market whose internal state is s_t ∈ S. The limited information collected by the agent on this complex trading environment is denoted by o_t ∈ O. Ideally, this observation space O should encompass all the information capable of influencing the market prices. Because of the sequential aspect of the algorithmic trading problem, an observation o_t has to be considered as a sequence of both the information gathered during the previous τ time steps (history) and the newly available information at time step t. In this research paper, the RL agent observations can be mathematically expressed as the following:

    o_t = \{ S(t'), D(t'), T(t'), I(t'), M(t'), N(t'), E(t') \}_{t'=t-\tau}^{t}    (3)

where:

• S(t) represents the state information of the RL agent at time step t (current trading position, number of shares owned by the agent, available cash).

• D(t) is the information gathered by the agent at time step t concerning the OHLCV (Open-High-Low-Close-Volume) data characterising the stock market. More precisely, D(t) can be expressed as follows:

    D(t) = \{ p_t^O, p_t^H, p_t^L, p_t^C, V_t \}    (4)

  where:
  – p_t^O is the stock market price at the opening of the time period [t − ∆t, t[.
  – p_t^H is the highest stock market price over the time period [t − ∆t, t[.
  – p_t^L is the lowest stock market price over the time period [t − ∆t, t[.
  – p_t^C is the stock market price at the closing of the time period [t − ∆t, t[.
  – V_t is the total volume of shares exchanged over the time period [t − ∆t, t[.

• T(t) is the agent information regarding the trading time step t (date, weekday, time).

• I(t) is the agent information regarding multiple technical indicators about the stock market targeted at time step t. There exist many technical indicators providing extra insights about diverse financial phenomena, such as moving average convergence divergence (MACD), relative strength index (RSI) or average directional index (ADX), to only cite a few.

• M(t) gathers the macroeconomic information at the disposal of the agent at time step t. There are many interesting macroeconomic indicators which could potentially be useful to forecast markets' evolution, such as the interest rate or the exchange rate.
• N(t) represents the news information gathered by the agent at time step t. These news data can be extracted from various sources such as social media (Twitter, Facebook, LinkedIn), the newspapers, specific journals, etc. Complex sentiment analysis models could then be built to extract meaningful quantitative figures (quantity, sentiment polarity and subjectivity, etc.) from the news. The benefits of such information have already been demonstrated by several authors, see e.g. Leinweber and Sisk (2011), Bollen et al. (2011) and Nuij et al. (2014).

• E(t) is any extra useful information at the disposal of the trading agent at time step t, such as other market participants' trading strategies, companies' confidential information, similar stock market behaviours, rumours, experts' advice, etc.

Observation space reduction:

In the scope of this research paper, it is assumed that the only information considered by the RL agent is the classical OHLCV data D(t) together with the state information S(t). Especially, the reduced observation space O encompasses the current trading position together with a series of the previous τ + 1 daily open-high-low-close prices and daily traded volume. With such an assumption, the reduced RL observation o_t can be expressed as the following:

    o_t = \left\{ \{ p_{t'}^O, p_{t'}^H, p_{t'}^L, p_{t'}^C, V_{t'} \}_{t'=t-\tau}^{t},\; P_t \right\}    (5)

with P_t being the trading position of the RL agent at time step t (either long or short, as explained in the next subsection of this research paper).
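The reduced observation of Equation 5 can be assembled from a table of daily OHLCV data as in the sketch below. The column names and the encoding of the trading position are assumptions made for illustration, not specifications taken from the paper.

```python
import numpy as np
import pandas as pd

def build_observation(ohlcv: pd.DataFrame, t: int, tau: int, position: int):
    """Reduced RL observation o_t (Equation 5): the last tau+1 rows of daily
    OHLCV data up to time step t, together with the current trading position P_t.

    Assumed inputs: a DataFrame with columns ['Open', 'High', 'Low', 'Close',
    'Volume'] indexed by trading day, and a position encoded as +1 (long) / -1 (short).
    """
    window = ohlcv.iloc[t - tau: t + 1][['Open', 'High', 'Low', 'Close', 'Volume']]
    return {'prices': window.to_numpy(dtype=np.float64), 'position': position}
```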
3.4.2. RL actions

At each time step t, the RL agent executes a trading action a_t ∈ A resulting from its policy π(a_t | h_t). In fact, the trading agent has to answer several questions: whether, how and how much to trade? Such decisions can be modelled by the quantity of shares bought by the trading agent at time step t, represented by Q_t ∈ Z. Therefore, the RL actions can be expressed as the following:

    a_t = Q_t    (6)

Three cases can occur depending on the value of Q_t:

• Q_t > 0: The RL agent buys shares on the stock market, by posting new bid orders on the order book.

• Q_t < 0: The RL agent sells shares on the stock market, by posting new ask orders on the order book.

• Q_t = 0: The RL agent holds, meaning that it does not buy nor sell any shares on the stock market.

Actually, the real actions occurring in the scope of a trading activity are the orders posted on the order book. The RL agent is assumed to communicate with an external module responsible for the synthesis of these true actions according to the value of Q_t: the trading execution system. Despite being out of the scope of this paper, it should be mentioned that multiple execution strategies can be considered depending on the general trading purpose.

The trading actions have an impact on the two components of the portfolio value, namely the cash and share values. Assuming that the trading actions occur close to the market closure at price p_t ≃ p_t^C, the updates of these components are governed by the following equations:

    v_{t+1}^c = v_t^c - Q_t\, p_t    (7)

    v_{t+1}^s = \underbrace{(n_t + Q_t)}_{n_{t+1}}\, p_{t+1}    (8)

with n_t ∈ Z being the number of shares owned by the trading agent at time step t. In the scope of this research paper, negative values are allowed for this quantity. Despite being surprising at first glance, a negative number of shares simply corresponds to shares borrowed and sold, with the obligation to repay the lender in shares in the future. Such a mechanism is particularly interesting as it introduces new possibilities for the trading agent.

Two important constraints are assumed concerning the quantity of traded shares Q_t. Firstly, contrarily to the share value v_t^s which can be either positive or negative, the cash value v_t^c has to remain positive for every trading time step t. This constraint imposes an upper bound on the number of shares that the trading agent is capable of purchasing, this volume of shares being easily derived from Equation 7. Secondly, there exists a risk associated with the impossibility to repay the share lender if the agent suffers significant losses. To prevent such a situation from happening, the cash value v_t^c is constrained to be sufficiently large when a negative number of shares is owned, in order to be able to get back to a neutral position (n_t = 0). A maximum relative change in prices, expressed in % and denoted ε ∈ R+, is assumed by the RL agent prior to the trading activity. This parameter corresponds to the maximum market daily evolution supposed by the agent over the entire trading horizon, so that the trading agent should always be capable of paying back the share lender as long as the market variation remains below this value. Therefore, the constraints acting upon the RL actions at time step t can be mathematically expressed as follows:

    v_{t+1}^c \geq 0    (9)

    v_{t+1}^c \geq -n_{t+1}\, p_t\, (1 + \epsilon)    (10)
with the following condition assumed to be satisfied:

    \frac{p_{t+1} - p_t}{p_t} \leq \epsilon    (11)

Trading costs consideration:

Actually, the modelling represented by Equation 7 is inaccurate and will inevitably lead to unrealistic results. Indeed, whenever simulating trading activities, the trading costs should not be neglected. Such an omission is generally misleading, as a trading strategy which is highly profitable in simulations may generate large losses in real trading situations due to these trading costs, especially when the trading frequency is high. The trading costs can be subdivided into two categories. On the one hand, there are explicit costs which are induced by transaction costs and taxes. On the other hand, there are implicit costs, called slippage costs, which are composed of three main elements and are associated with some of the dynamics of the trading environment. The different slippage costs are detailed hereafter:

• Spread costs: These costs are related to the difference between the minimum ask price p_min^ask and the maximum bid price p_max^bid, called the spread. Because the complete state of the order book is generally too complex to efficiently process or even not available, the trading decisions are mostly based on the middle price p_mid = (p_max^bid + p_min^ask)/2. However, a buying (selling) trade issued at p_mid inevitably occurs at a price p ≥ p_min^ask (p ≤ p_max^bid). Such costs are all the more significant when the stock market liquidity is low compared to the volume of shares traded.

• Market impact costs: These costs are induced by the impact of the trader's actions on the market. Each trade (both buying and selling orders) is potentially capable of influencing the price. This phenomenon is all the more important when the stock market liquidity is low with respect to the volume of shares traded.

• Timing costs: These costs are related to the time required for a trade to physically happen once the trading decision is made, knowing that the market price is continuously evolving. The first cause is the inevitable latency which delays the posting of the orders on the market order book. The second cause is the intentional delays generated by the trading execution system. For instance, a large trade could be split into multiple smaller trades spread over time in order to limit the market impact costs.

An accurate modelling of the trading costs is required to realistically reproduce the dynamics of the real trading environment. While explicit costs are relatively easy to take into account, the valid modelling of slippage costs is a truly complex task. In this research paper, the integration of both costs into the RL environment is performed through a heuristic. When a trade is executed, a certain amount of capital equivalent to a percentage C of the amount of money invested is lost. This parameter was realistically chosen equal to 0.1% in the forthcoming simulations.

Practically, these trading costs are directly withdrawn from the trading agent cash. Following the heuristic previously introduced, Equation 7 can be re-expressed with a corrective term modelling the trading costs:

    v_{t+1}^c = v_t^c - Q_t\, p_t - \underbrace{C\, |Q_t|\, p_t}_{\text{trading costs}}    (12)

Moreover, the trading costs have to be properly considered in the constraint expressed in Equation 10. Indeed, the cash value v_t^c should be sufficiently large to get back to a neutral position (n_t = 0) when the maximum market variation ε occurs, the trading costs being included. Consequently, Equation 10 is re-expressed as follows:

    v_{t+1}^c \geq -n_{t+1}\, p_t\, (1 + \epsilon)(1 + C)    (13)
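The sketch below implements the portfolio update of Equations 8 and 12, i.e. the cash and share value dynamics including the trading cost heuristic. The function name and the simple tuple-based state are illustrative choices rather than the paper's actual implementation.

```python
def update_portfolio(cash, shares, quantity, price, next_price, cost_rate=0.001):
    """Apply a trading action Q_t at price p_t and return the portfolio
    components at time step t+1 (Equations 8 and 12).

    cash       -- cash value v_t^c
    shares     -- number of shares n_t (may be negative for a short position)
    quantity   -- traded quantity Q_t (> 0 buy, < 0 sell, 0 hold)
    price      -- trading price p_t (close to the closing price p_t^C)
    next_price -- next closing price p_{t+1}
    cost_rate  -- trading cost percentage C (0.1% in the paper's simulations)
    """
    next_cash = cash - quantity * price - cost_rate * abs(quantity) * price   # Eq. 12
    next_shares = shares + quantity                                           # n_{t+1}
    next_share_value = next_shares * next_price                               # Eq. 8
    return next_cash, next_shares, next_share_value
```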
Eventually, the RL action space A can be defined as the discrete set of acceptable values for the quantity of traded shares Q_t. Derived in detail in Appendix A, the RL action space A is mathematically expressed as the following:

    A = \{ Q_t \in \mathbb{Z} \cap [\underline{Q}_t, \overline{Q}_t] \}    (14)

where:

    \overline{Q}_t = \frac{v_t^c}{p_t\,(1 + C)}

    \underline{Q}_t =
    \begin{cases}
      \dfrac{\Delta_t}{\epsilon\, p_t\,(1 + C)} & \text{if } \Delta_t \geq 0, \\[1ex]
      \dfrac{\Delta_t}{p_t\,(2C + \epsilon\,(1 + C))} & \text{if } \Delta_t < 0,
    \end{cases}

with \Delta_t = -v_t^c - n_t\, p_t\, (1 + \epsilon)(1 + C).

Action space reduction:

In the scope of this scientific research paper, the action space A is reduced in order to lower the complexity of the algorithmic trading problem. The reduced action space is composed of only two RL actions which can be mathematically expressed as the following:

    a_t = Q_t \in \{ Q_t^{Long}, Q_t^{Short} \}    (15)

The first RL action Q_t^{Long} maximises the number of shares owned by the trading agent, by converting as much cash value v_t^c as possible into share value v_t^s. It can be mathematically expressed as follows:

    Q_t^{Long} =
    \begin{cases}
      \left\lfloor \dfrac{v_t^c}{p_t\,(1 + C)} \right\rfloor & \text{if } a_{t-1} \neq Q_{t-1}^{Long}, \\[1ex]
      0 & \text{otherwise.}
    \end{cases}    (16)
The action Q_t^{Long} is always valid as it is obviously included into the original action space A defined by Equation 14. As a result of this action, the trading agent owns a number of shares N_t^{Long} = n_t + Q_t^{Long}. On the contrary, the second RL action, designated by Q_t^{Short}, converts share value v_t^s into cash value v_t^c, such that the RL agent owns a number of shares equal to -N_t^{Long}. This operation can be mathematically expressed as the following:

    \hat{Q}_t^{Short} =
    \begin{cases}
      -2 n_t - \left\lfloor \dfrac{v_t^c}{p_t\,(1 + C)} \right\rfloor & \text{if } a_{t-1} \neq Q_{t-1}^{Short}, \\[1ex]
      0 & \text{otherwise.}
    \end{cases}    (17)

However, the action \hat{Q}_t^{Short} may violate the lower bound \underline{Q}_t of the action space A when the price significantly increases over time. Eventually, the second RL action Q_t^{Short} is expressed as follows:

    Q_t^{Short} = \max\left\{ \hat{Q}_t^{Short},\; \underline{Q}_t \right\}    (18)

To conclude this subsection, it should be mentioned that the two reduced RL actions are actually related to the next trading position of the agent, designated as P_{t+1}. Indeed, the first action Q_t^{Long} induces a long trading position because the number of owned shares is positive. On the contrary, the second action Q_t^{Short} always results in a number of shares which is negative, which is generally referred to as a short trading position in finance.
3.4.3. RL rewards bull and bear markets (respectively strong increasing and
For this algorithmic trading problem, a natural choice decreasing price trends), with different levels of volatility.
for the RL rewards is the strategy daily returns. Intu- Therefore, the research paper’s core objective is the devel-
itively, it makes sense to favour positive returns which are opment of a novel trading strategy based on DRL tech-
an evidence of a profitable strategy. Moreover, such quan- niques to maximise the average Sharpe ratio computed on
tity has the advantage of being independent of the number the entire set of existing stock markets.
of shares nt currently owned by the agent. This choice is
also motivated by the fact that it allows to avoid a sparse Despite the fact that the ultimate objective is the max-
reward setup, which is more complex to deal with. The RL imisation of the Sharpe ratio, the DRL algorithm adopted
rewards can be mathematically expressed as the following: in this scientific paper actually maximises the expected
discounted sum of rewards (daily returns) over an infinite
vt+1 − vt time horizon. This optimisation criterion, which does not
rt = (19)
vt exactly corresponds to maximising profits but is very close
to that, can in fact be seen as a relaxation of the Sharpe ra-
3.5. Objective tio criterion. A future interesting research direction would
Objectively assessing the performance of a trading strat- be to narrow the gap between these two objectives.
egy is a tricky task, due to the numerous quantitative and
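The annualisation procedure described above translates directly into the following sketch, which takes a series of daily portfolio values as input (NumPy is assumed).

```python
import numpy as np

def annualised_sharpe_ratio(portfolio_values, trading_days_per_year=252):
    """Annualised Sharpe ratio computed as described above: daily returns
    rho_t = (v_t - v_{t-1}) / v_{t-1}, mean over standard deviation, scaled by
    the square root of the number of trading days in a year (the risk-free
    return being neglected, as in Equation 20)."""
    values = np.asarray(portfolio_values, dtype=np.float64)
    daily_returns = np.diff(values) / values[:-1]
    return np.sqrt(trading_days_per_year) * daily_returns.mean() / daily_returns.std()
```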
Moreover, a well-performing trading strategy should ideally be capable of achieving acceptable performance on diverse markets presenting very different patterns. For instance, the trading strategy should properly handle both bull and bear markets (respectively strong increasing and decreasing price trends), with different levels of volatility. Therefore, the research paper's core objective is the development of a novel trading strategy based on DRL techniques to maximise the average Sharpe ratio computed on the entire set of existing stock markets.

Despite the fact that the ultimate objective is the maximisation of the Sharpe ratio, the DRL algorithm adopted in this scientific paper actually maximises the expected discounted sum of rewards (daily returns) over an infinite time horizon. This optimisation criterion, which does not exactly correspond to maximising profits but is very close to that, can in fact be seen as a relaxation of the Sharpe ratio criterion. A future interesting research direction would be to narrow the gap between these two objectives.

4. Deep reinforcement learning algorithm design

In this section, a novel DRL algorithm is designed to solve the algorithmic trading problem previously introduced. The resulting trading strategy, denominated the Trading Deep Q-Network algorithm (TDQN), is inspired by the successful DQN algorithm presented in Mnih et al. (2013) and is significantly adapted to the specific decision-making problem at hand. Concerning the training of the RL agent, artificial trajectories are generated from a limited set of stock market historical data.
Figure 3: Illustration of the DQN algorithm

4.1. Deep Q-Network algorithm

The Deep Q-Network algorithm, generally referred to as DQN, is a DRL algorithm capable of successfully learning control policies from high-dimensional sensory inputs. It is in a way the successor of the popular Q-learning algorithm introduced in Watkins and Dayan (1992). This DRL algorithm is said to be model-free, meaning that a complete model of the environment is not required and that trajectories are sufficient. Belonging to the Q-learning family of algorithms, it is based on the learning of an approximation of the state-action value function, which is represented by a DNN. In such context, learning the Q-function amounts to learning the parameters θ of this DNN. Finally, the DQN algorithm is said to be off-policy as it exploits in batch mode previous experiences e_t = (s_t, a_t, r_t, s_{t+1}) collected at any point during training.

For the sake of brevity, the DQN algorithm is illustrated in Figure 3, but is not extensively presented in this paper. Besides the original publications (Mnih et al. (2013) and Mnih et al. (2015)), there exists a great scientific literature around this algorithm, see for instance van Hasselt et al. (2015), Wang et al. (2015), Schaul et al. (2016), Bellemare et al. (2017), Fortunato et al. (2018) and Hessel et al. (2017). Concerning DL techniques, interesting resources are LeCun et al. (2015), Goodfellow et al. (2015) and Goodfellow et al. (2016). For more information about RL, the reader can refer to the following textbooks and surveys: Sutton and Barto (2018), Szepesvari (2010), Busoniu et al. (2010), Arulkumaran et al. (2017) and Shao et al. (2019).

4.2. Artificial trajectories generation

In the scope of the algorithmic trading problem, a complete model of the environment E is not available. The training of the TDQN algorithm is entirely based on the generation of artificial trajectories from a limited set of stock market historical daily OHLCV data. A trajectory τ is defined as a sequence of observations o_t ∈ O, actions a_t ∈ A and rewards r_t from an RL agent for a certain number T of trading time steps t:

    \tau = \big( \{o_0, a_0, r_0\}, \{o_1, a_1, r_1\}, ..., \{o_T, a_T, r_T\} \big)

Initially, although the environment E is unknown, one disposes of a single real trajectory, corresponding to the historical behaviour of the stock market, i.e. the particular case of the RL agent being inactive. This original trajectory is composed of the historical prices and volumes together with long actions executed by the RL agent with no money at its disposal, to represent the fact that no shares are actually traded. For this algorithmic trading problem, new fictive trajectories are then artificially generated from this unique true trajectory to simulate interactions with the environment E. The historical stock market behaviour is simply considered unaffected by the new actions performed by the trading agent. The artificial trajectories generated are simply composed of the sequence of historical real observations associated with various sequences of trading actions from the RL agent. For such practice to be scientifically acceptable and lead to realistic simulations, the trading agent should not be able to influence the stock market behaviour. This assumption generally holds when the number of shares traded by the trading agent is low with respect to the liquidity of the stock market.

In addition to the generation of artificial trajectories just described, a trick is employed to slightly improve the exploration of the RL agent. It relies on the fact that the reduced action space A is composed of only two actions: long (Q_t^{Long}) and short (Q_t^{Short}). At each trading time step t, the chosen action a_t is executed on the trading environment E and the opposite action a_t^− is executed on a copy of this environment E^−. Although this trick does not completely solve the challenging exploration/exploitation trade-off, it enables the RL agent to continuously explore at a small extra computational cost.
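A minimal sketch of this exploration trick is given below. The env.step interface, the Experience tuple and the opposite helper are assumptions introduced for illustration only.

```python
import copy
from collections import namedtuple

# Hypothetical experience tuple stored in the replay memory.
Experience = namedtuple('Experience', ['obs', 'action', 'reward', 'next_obs'])

def step_with_opposite_action(env, obs, action, opposite, replay_memory):
    """Execute the chosen action on the environment E and the opposite action
    on a copy E^-, storing both resulting experiences (exploration trick)."""
    env_copy = copy.deepcopy(env)                                 # copy E^- of the environment
    next_obs, reward = env.step(action)                           # interaction with E
    next_obs_opp, reward_opp = env_copy.step(opposite(action))    # interaction with E^-
    replay_memory.append(Experience(obs, action, reward, next_obs))
    replay_memory.append(Experience(obs, opposite(action), reward_opp, next_obs_opp))
    return next_obs, reward
```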
4.3. Diverse modifications and improvements

The DQN algorithm was chosen as the starting point for the novel DRL trading strategy developed, but was significantly adapted to the specific algorithmic trading decision-making problem at hand. The diverse modifications and improvements, which are mainly based on the numerous simulations performed, are summarised hereafter:

• Deep neural network architecture: The first difference with respect to the classical DQN algorithm is the architecture of the DNN approximating the action-value function Q(s, a). Due to the different nature of the input (time-series instead of raw images), the convolutional neural network (CNN) has been replaced by a classical feedforward DNN with some leaky rectified linear unit (Leaky ReLU) activation functions (a code sketch following this list illustrates such an architecture).
• Double DQN: The DQN algorithm suffers from substantial overestimations, this overoptimism harming the algorithm performance. In order to reduce the impact of this undesired phenomenon, the article van Hasselt et al. (2015) presents the double DQN algorithm which is based on the decomposition of the target max operation into both action selection and action evaluation.

• ADAM optimiser: The classical DQN algorithm implements the RMSProp optimiser. However, the ADAM optimiser, introduced in Kingma and Ba (2015), experimentally proves to improve both the training stability and the convergence speed of the DRL algorithm.

• Huber loss: While the classical DQN algorithm implements a mean squared error (MSE) loss, the Huber loss experimentally improves the stability of the training phase. Such an observation is explained by the fact that the MSE loss significantly penalises large errors, which is generally desired but has a negative side-effect for the DQN algorithm because the DNN is supposed to predict values that depend on its own input. This DNN should not radically change in a single training update because this would also lead to a significant change in the target, which could actually result in a larger error. Ideally, the update of the DNN should be performed in a slower and more stable manner. On the other hand, the mean absolute error (MAE) has the drawback of not being differentiable at 0. A good trade-off between these two losses is the Huber loss H (also included in the code sketch following this list):

    H(x) =
    \begin{cases}
      \frac{1}{2} x^2 & \text{if } |x| \leq 1, \\
      |x| - \frac{1}{2} & \text{otherwise.}
    \end{cases}    (21)

Figure 4: Comparison of the MSE, MAE and Huber losses

• Gradient clipping: The gradient clipping technique is implemented in the TDQN algorithm to solve the exploding gradient problem which induces significant instabilities during the training of the DNN.

• Xavier initialisation: While the classical DQN algorithm simply initialises the DNN weights randomly, the Xavier initialisation is implemented to improve the algorithm convergence. The idea is to set the initial weights so that the gradient variance remains constant across the DNN layers.

• Batch normalisation layers: This DL technique, introduced by Ioffe and Szegedy (2015), consists in normalising the inputs of a layer by adjusting and scaling the activations. It brings many benefits including a faster and more robust training phase as well as an improved generalisation.

• Regularisation techniques: Because a strong tendency to overfit was observed during the first experiments with the DRL trading strategy, three regularisation techniques are implemented: Dropout, L2 regularisation and Early Stopping.

• Preprocessing and normalisation: The training loop of the TDQN algorithm is preceded by both a preprocessing and a normalisation operation of the RL observations o_t. Firstly, because the high-frequency noise present in the trading data was experimentally observed to lower the algorithm generalisation, a low-pass filtering operation is executed. However, such a preprocessing operation has a cost as it modifies or even destroys some potentially useful trading patterns and introduces a non-negligible lag. Secondly, the resulting data are transformed in order to convey more meaningful information about market movements. Typically, the daily evolution of prices is considered rather than the raw prices. Thirdly, the remaining data are normalised.

• Data augmentation techniques: A key challenge of this algorithmic trading problem is the limited amount of available data, which are in addition generally of poor quality. As a counter to this major problem, several data augmentation techniques are implemented: signal shifting, signal filtering and artificial noise addition. The application of such data augmentation techniques will artificially generate new trading data which are slightly different but which result in the same financial phenomena.
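To make several of the modifications above concrete (feedforward architecture with Leaky ReLU activations, batch normalisation, dropout, Xavier initialisation, and the Huber loss of Equation 21), here is a minimal PyTorch sketch. The layer sizes and dropout rate are arbitrary illustrative values, not the hyperparameters used in the paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feedforward approximation of the action-value function Q(o, a) for the
    two reduced actions (long, short)."""

    def __init__(self, observation_size, number_of_actions=2, hidden_size=512, dropout=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.BatchNorm1d(hidden_size),      # batch normalisation layer
            nn.LeakyReLU(),                   # Leaky ReLU activation
            nn.Dropout(dropout),              # dropout regularisation
            nn.Linear(hidden_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, number_of_actions),
        )
        # Xavier initialisation of the linear layers' weights.
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                nn.init.zeros_(module.bias)

    def forward(self, observation):
        return self.layers(observation)


def huber_loss(x: torch.Tensor) -> torch.Tensor:
    """Huber loss H(x) of Equation 21: quadratic near zero, linear beyond 1."""
    return torch.where(x.abs() <= 1.0, 0.5 * x ** 2, x.abs() - 0.5)
```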
Finally, the algorithm underneath the TDQN trading strategy is depicted in detail in Algorithm 1.

Algorithm 1 TDQN algorithm

Initialise the experience replay memory M of capacity C.
Initialise the main DNN weights θ (Xavier initialisation).
Initialise the target DNN weights θ− = θ.
for episode = 1 to N do
    Acquire the initial observation o_1 from the environment E and preprocess it.
    for t = 1 to T do
        With probability ε, select a random action a_t from A.
        Otherwise, select a_t = argmax_{a ∈ A} Q(o_t, a; θ).
        Copy the environment E− = E.
        Interact with the environment E (action a_t) and get the new observation o_{t+1} and reward r_t.
        Perform the same operation on E− with the opposite action a_t^−, getting o_{t+1}^− and r_t^−.
        Preprocess both new observations o_{t+1} and o_{t+1}^−.
        Store both experiences e_t = (o_t, a_t, r_t, o_{t+1}) and e_t^− = (o_t, a_t^−, r_t^−, o_{t+1}^−) in M.
        if t % T' = 0 then
            Randomly sample from M a minibatch of N_e experiences e_i = (o_i, a_i, r_i, o_{i+1}).
            Set y_i = r_i if the state s_{i+1} is terminal,
                and y_i = r_i + γ Q(o_{i+1}, argmax_{a ∈ A} Q(o_{i+1}, a; θ); θ−) otherwise.
            Compute and clip the gradients based on the Huber loss H(y_i, Q(o_i, a_i; θ)).
            Optimise the main DNN parameters θ based on these clipped gradients.
            Update the target DNN parameters θ− = θ every N− steps.
        end if
        Anneal the ε-greedy exploration parameter ε.
    end for
end for
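The inner update of Algorithm 1 (double-DQN target, Huber loss and gradient clipping) can be sketched in PyTorch as follows. The replay-memory batch layout and the clipping threshold are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def training_step(main_net, target_net, optimiser, batch, gamma, max_grad_norm=1.0):
    """One optimisation step of Algorithm 1: double-DQN target, Huber (smooth L1)
    loss and gradient clipping. `batch` is assumed to contain tensors
    (observations, actions, rewards, next_observations, terminal_flags)."""
    obs, actions, rewards, next_obs, terminal = batch

    # Q-values of the actions actually taken, from the main network.
    q_values = main_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the main network selects the action, the target network evaluates it.
        best_actions = main_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, best_actions).squeeze(1)
        targets = rewards + gamma * next_q * (1.0 - terminal)

    loss = F.smooth_l1_loss(q_values, targets)     # Huber loss
    optimiser.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(main_net.parameters(), max_grad_norm)   # gradient clipping
    optimiser.step()
    return loss.item()
```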

5. Performance assessment

An accurate performance evaluation approach is capital in order to produce meaningful results. As previously hinted, this procedure is all the more critical because there has been a real lack of a proper performance assessment methodology in the algorithmic trading field. In this section, a novel, more reliable methodology is presented to objectively assess the performance of algorithmic trading strategies, including the TDQN algorithm.

5.1. Testbench

In the literature, the performance of a trading strategy is generally assessed on a single instrument (stock market or others) for a certain period of time. Nevertheless, the analysis resulting from such a basic approach should not be entirely trusted, as the trading data could have been specifically selected so that a trading strategy looks profitable, even though it is not the case in general. To eliminate such bias, the performance should ideally be assessed on multiple instruments presenting diverse patterns. Aiming to produce trustful conclusions, this research paper proposes a testbench composed of 30 stocks presenting diverse characteristics (sectors, regions, volatility, liquidity, etc.). The testbench is depicted in Table 2. To avoid any confusion, the official reference for each stock (ticker) is specified in parentheses. To avoid any ambiguities concerning the training and evaluation protocols, it should be mentioned that a new trading strategy is trained for each stock included in the testbench. Nevertheless, for the sake of generality, all the algorithm hyperparameters remain unchanged over the entire testbench.

Regarding the trading horizon, the eight years preceding the publication year of the research paper are selected to be representative of the current market conditions. Such a short time period could be criticised because it may be too limited to be representative of the entire set of financial phenomena. For instance, the financial crisis of 2008 is rejected, even though it could be interesting to assess the robustness of trading strategies with respect to such an extraordinary event. However, this choice was motivated by the fact that a shorter trading horizon is less likely to contain significant market regime shifts which would seriously harm the training stability of the trading strategies. Finally, the trading horizon of eight years is divided into both training and test sets as follows:

• Training set: 01/01/2012 → 31/12/2017.
• Test set: 01/01/2018 → 31/12/2019.

A validation set is also considered as a subset of the training set for the tuning of the numerous TDQN algorithm hyperparameters. Note that the RL policy DNN parameters θ are fixed during the execution of the trading strategy on the entire test set, meaning that the new experiences acquired are not valued for extra training. Nevertheless, such practice constitutes an interesting future research direction.

To end this subsection, it should be noted that the proposed testbench could be improved thanks to even more diversification. The obvious addition would be to include more stocks with different financial situations and properties. Another interesting addition would be to consider different training/testing time periods while excluding the significant market regime shifts. Nevertheless, this last idea was discarded in this scientific article due to the important time already required to produce results for the proposed testbench.
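Assuming the historical daily data is held in a pandas DataFrame indexed by date (an assumption made for illustration), the training/test split above amounts to the following.

```python
import pandas as pd

def split_training_test(daily_data: pd.DataFrame):
    """Split a DataFrame of daily stock data (DatetimeIndex assumed) into the
    training and test sets used in the paper's performance assessment."""
    training_set = daily_data.loc['2012-01-01':'2017-12-31']
    test_set = daily_data.loc['2018-01-01':'2019-12-31']
    return training_set, test_set
```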
Table 2: Performance assessment testbench (one row per sector, with the American, European and Asian stocks listed for each)

    Trading index:       American: Dow Jones (DIA), S&P 500 (SPY), NASDAQ (QQQ); European: FTSE 100 (EZU); Asian: Nikkei 225 (EWJ)
    Technology:          American: Apple (AAPL), Google (GOOGL), Amazon (AMZN), Facebook (FB), Microsoft (MSFT), Twitter (TWTR); European: Nokia (NOK), Philips (PHIA.AS), Siemens (SIE.DE); Asian: Sony (6758.T), Baidu (BIDU), Tencent (0700.HK), Alibaba (BABA)
    Financial services:  American: JPMorgan Chase (JPM); European: HSBC (HSBC); Asian: CCB (0939.HK)
    Energy:              American: ExxonMobil (XOM); European: Shell (RDSA.AS); Asian: PetroChina (PTR)
    Automotive:          American: Tesla (TSLA); European: Volkswagen (VOW3.DE); Asian: Toyota (7203.T)
    Food:                American: Coca Cola (KO); European: AB InBev (ABI.BR); Asian: Kirin (2503.T)

5.2. Benchmark trading strategies

In order to properly assess the strengths and weaknesses of the TDQN algorithm, some benchmark algorithmic trading strategies were selected for comparison purposes. Only the classical trading strategies commonly used in practice were considered, excluding for instance strategies based on DL techniques or other advanced approaches. Despite the fact that the TDQN algorithm is an active trading strategy, both passive and active strategies are taken into consideration. For the sake of fairness, the strategies share the same input and output spaces presented in Section 3.4.2 (O and A). The following list summarises the benchmark strategies selected:

• Buy and hold (B&H).
• Sell and hold (S&H).
• Trend following with moving averages (TF).
• Mean reversion with moving averages (MR).

For the sake of brevity, a detailed description of each strategy is not provided in this research paper. The reader can refer to Chan (2009), Chan (2013) or Narang (2009) for more information. The first two benchmark trading strategies (B&H and S&H) are said to be passive, as there are no changes in trading position over the trading horizon. On the contrary, the other two benchmark strategies (TF and MR) are active trading strategies, issuing multiple changes in trading positions over the trading horizon. On the one hand, a trend following strategy is concerned with the identification and the follow-up of significant market trends, as depicted in Figure 5. On the other hand, a mean reversion strategy, illustrated in Figure 6, is based on the tendency of a stock market to get back to its previous average price in the absence of clear trends. By design, a trend following strategy generally makes a profit when a mean reversion strategy does not, the opposite being true as well. This is due to the fact that these two families of trading strategies adopt opposite positions: a mean reversion strategy always denies and goes against the trends while a trend following strategy follows the movements.

Figure 5: Illustration of a typical trend following trading strategy

Figure 6: Illustration of a typical mean reversion trading strategy
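As an illustration of what such moving-average benchmarks can look like, the sketch below implements generic TF and MR decision rules (go with the trend when a short moving average crosses above a longer one, or against it for mean reversion). The window lengths and the exact crossover rule are assumptions made for illustration; the paper does not specify the benchmark parameters here.

```python
import pandas as pd

def moving_average_signals(closing_prices: pd.Series, short_window=20, long_window=50):
    """Generic moving-average benchmark signals: +1 stands for a long position,
    -1 for a short position. Trend following (TF) goes with the detected trend,
    mean reversion (MR) takes the opposite position."""
    short_ma = closing_prices.rolling(short_window).mean()
    long_ma = closing_prices.rolling(long_window).mean()
    trend_following = (short_ma > long_ma).astype(int) * 2 - 1   # +1 if uptrend, else -1
    mean_reversion = -trend_following
    return trend_following, mean_reversion
```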
5.3. Quantitative performance assessment

The quantitative performance assessment consists in defining one or more performance indicators to numerically quantify the performance of an algorithmic trading strategy. Because the core objective of a trading strategy is to be profitable, its performance should be linked to the amount of money earned. However, such reasoning omits to consider the risk associated with the trading activity, which should be efficiently mitigated.
Table 3: Quantitative performance assessment indicators

Performance indicator Description


Sharpe ratio Return of the trading activity compared to its riskiness.
Profit & loss Money gained or lost at the end of the trading activity.
Annualised return Annualised return generated during the trading activity.
Annualised volatility Modelling of the risk associated with the trading activity.
Profitability ratio Percentage of winning trades made during the trading activity.
Profit and loss ratio Ratio between the trading activity trades average profit and loss.
Sortino ratio Similar to the Sharpe ratio with the negative risk penalised only.
Maximum drawdown Largest loss from a peak to a trough during the trading activity.
Maximum drawdown duration Time duration of the trading activity maximum drawdown.

Generally, a trading strategy achieving a small but stable profit is preferred to a trading strategy achieving a huge profit in a very unstable way after suffering from multiple losses. It eventually depends on the investor profile and the willingness to take extra risks to potentially earn more.

Multiple performance indicators were selected to accurately assess the performance of a trading strategy. As previously introduced in Section 3.5, the most important one is certainly the Sharpe ratio. This performance indicator, widely used in the field of algorithmic trading, is particularly informative as it combines both profitability and risk. Besides the Sharpe ratio, this research paper considers multiple other performance indicators to provide extra insights. Table 3 presents the entire set of performance indicators employed to quantify the performance of a trading strategy.

Complementarily to the computation of these numerous performance indicators, it is interesting to graphically represent the trading strategy behaviour. Plotting both the stock market price p_t and portfolio value v_t evolutions together with the trading actions a_t issued by the trading strategy seems appropriate to accurately analyse the trading policy. Moreover, such a visualisation could also provide extra insights about the performance, strengths and weaknesses of the strategy analysed.

6. Results and discussion

In this section, the TDQN trading strategy is evaluated following the performance assessment methodology previously described. Firstly, a detailed analysis is performed for both a case that gives good results and a case for which the results were mitigated. This highlights the strengths, weaknesses and limitations of the TDQN algorithm. Secondly, the performance achieved by the DRL trading strategy on the entire testbench is summarised and analysed. Finally, some additional discussions about the discount factor parameter, the trading costs influence and the main challenges faced by the TDQN algorithm are provided. The experimental code supporting the results presented is publicly available at the following link: https://github.com/ThibautTheate/An-Application-of-Deep-Reinforcement-Learning-to-Algorithmic-Trading.

6.1. Good results - Apple stock

The first detailed analysis concerns the execution of the TDQN trading strategy on the Apple stock, resulting in promising results. Similar to many DRL algorithms, the TDQN algorithm is subject to a non-negligible variance. Multiple training experiments with the exact same initial conditions will inevitably lead to slightly different trading strategies of varying performance. As a consequence, both a typical run of the TDQN algorithm and its expected performance are presented hereafter.

Typical run: Firstly, Table 4 presents the performance achieved by each trading strategy considered, the initial amount of money being equal to $100,000. The TDQN algorithm achieves good results from both an earnings and a risk mitigation point of view, clearly outperforming all the benchmark active and passive trading strategies. Secondly, Figure 7 plots both the stock market price p_t and RL agent portfolio value v_t evolutions, together with the actions a_t outputted by the TDQN algorithm. It can be observed that the DRL trading strategy is capable of accurately detecting and benefiting from major trends, while being more hesitant during market behavioural shifts when the volatility increases. It can also be seen that the trading agent generally lags slightly behind the market trends, meaning that the TDQN algorithm learned to be more reactive than proactive for this particular stock. This behaviour is expected with such a limited observation space O not including the reasons for the future market directions (new product announcement, financial report, macroeconomics, etc.). However, this does not mean that the policies learned are purely reactive. Indeed, it was observed that the RL agent may decide to adapt its trading position before a trend inversion by noticing an increase in volatility, therefore anticipating and being proactive.
12
Table 4: Performance assessment for the Apple stock

Performance indicator B&H S&H TF MR TDQN


Sharpe ratio 1.239 -1.593 1.178 -0.609 1.484
Profit & loss [$] 79823 -80023 68738 -34630 100288
Annualised return [%] 28.86 -100.00 25.97 -19.09 32.81
Annualised volatility [%] 26.62 44.39 24.86 28.33 25.69
Profitability ratio [%] 100 0.00 42.31 56.67 52.17
Profit and loss ratio ∞ 0.00 3.182 0.492 2.958
Sortino ratio 1.558 -2.203 1.802 -0.812 1.841
Max drawdown [%] 38.51 82.48 14.89 51.12 17.31
Max drawdown duration [days] 62 250 20 204 25

Figure 7: TDQN algorithm execution for the Apple stock (test set)
Figure 8: TDQN algorithm expected performance for the Apple stock
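The expected performance reported in Figure 8 is obtained by repeating the whole training procedure and averaging the results. A minimal sketch of such a protocol is given below; train_tdqn and evaluate_sharpe are hypothetical placeholder routines, not the functions of the released implementation.

```python
import numpy as np

# Hypothetical stand-ins for the training and evaluation routines of the
# released code; the real interfaces may differ.
def train_tdqn(stock: str, episodes: int, seed: int):
    """Train a TDQN agent on the training set of the given stock (placeholder)."""
    rng = np.random.default_rng(seed)
    return {"stock": stock, "episodes": episodes, "noise": rng.normal()}

def evaluate_sharpe(agent, dataset: str) -> float:
    """Return the Sharpe ratio of the agent on the given dataset (placeholder)."""
    base = 1.6 if dataset == "training" else 1.4
    return base + 0.1 * agent["noise"]

def expected_performance(stock: str, episodes: int = 50, runs: int = 50):
    """Average the Sharpe ratio over several independent trainings,
    mirroring the 50-iteration averaging behind Figure 8."""
    train_scores, test_scores = [], []
    for seed in range(runs):
        agent = train_tdqn(stock, episodes=episodes, seed=seed)
        train_scores.append(evaluate_sharpe(agent, "training"))
        test_scores.append(evaluate_sharpe(agent, "test"))
    return (np.mean(train_scores), np.std(train_scores)), (np.mean(test_scores), np.std(test_scores))
```

Repeating the training with different seeds is what exposes both the expected performance and the variance of the algorithm discussed in this section.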

Figure 8 plots the averaged (over 50 iterations) performance of the TDQN algorithm on both the training and test sets with respect to the number of training episodes. This expected performance is comparable to the performance achieved during the typical run of the algorithm. It can also be noticed that the overfitting tendency of the RL agent seems to be properly handled for this specific market. Please note that the test set performance being temporarily superior to the training set performance is not a mistake: it simply indicates an easier to trade and more profitable market during the test set trading period for the Apple stock. This example perfectly illustrates a major difficulty of the algorithmic trading problem: the training and test sets do not share the same distribution. Indeed, the distribution of the daily returns is continuously changing, which complicates both the training of the DRL trading strategy and its performance evaluation.

6.2. Mitigated results - Tesla stock

The same detailed analysis is performed on the Tesla stock, which presents very different characteristics compared to the Apple stock, such as a pronounced volatility. In contrast to the promising performance achieved on the previous stock, this case was specifically selected to highlight the limitations of the TDQN algorithm.

Typical run: Similar to the previous analysis, Table 5 presents the performance achieved by every trading strategy considered, the initial amount of money being equal to $100,000. The mitigated results achieved by the benchmark active strategies suggest that the Tesla stock is quite difficult to trade, which is partly due to its significant volatility. Even though the TDQN algorithm achieves a positive Sharpe ratio, almost no profit is generated. Moreover, the risk level associated with this trading activity cannot really be considered acceptable. For instance, the maximum drawdown duration is particularly large, which would result in a stressful situation for the operator responsible for the trading strategy. Figure 9, which plots both the stock market price pt and the RL agent portfolio value vt together with the actions at outputted by the TDQN algorithm, confirms this observation. Moreover, it can be clearly observed that the pronounced volatility of the Tesla stock induces a higher trading frequency (changes in trading positions, corresponding to the situations where at ≠ at−1), despite the non-negligible trading costs, which further increases the riskiness of the DRL trading strategy.
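The trading frequency mentioned above can be measured directly from the sequence of actions as the fraction of time steps at which the position changes (at ≠ at−1). The snippet below is a minimal illustration; the integer action encoding is an assumption made for the example, not the paper's exact convention.

```python
import numpy as np

def trading_frequency(actions) -> float:
    """Fraction of time steps at which the trading position changes,
    i.e. the situations where a_t differs from a_{t-1}.

    Actions are assumed to be encoded as integers (e.g. 0 for short,
    1 for long), purely for illustration purposes.
    """
    a = np.asarray(actions)
    changes = np.count_nonzero(a[1:] != a[:-1])
    return changes / max(len(a) - 1, 1)

# Example: a policy switching often versus a quieter one.
print(trading_frequency([1, 0, 1, 0, 1, 1]))  # 0.8
print(trading_frequency([1, 1, 1, 1, 1, 0]))  # 0.2
```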
Table 5: Performance assessment for the Tesla stock

Performance indicator B&H S&H TF MR TDQN


Sharpe ratio 0.508 -0.154 -0.987 0.358 0.261
Profit & loss [$] 29847 -29847 -73301 8600 98
Annualised return [%] 24.11 -7.38 -100.00 19.02 12.80
Annualised volatility [%] 53.14 46.11 52.70 58.05 52.09
Profitability ratio [%] 100 0.00 34.38 67.65 38.18
Profit and loss ratio ∞ 0.00 0.534 0.496 1.621
Sortino ratio 0.741 -0.205 -1.229 0.539 0.359
Max drawdown [%] 52.83 54.09 79.91 65.31 58.95
Max drawdown duration [days] 205 144 229 159 331

Figure 9: TDQN algorithm execution for the Tesla stock (test set)
Figure 10: TDQN algorithm expected performance for the Tesla stock

Expected performance: Figure 10 plots the expected performance of the TDQN algorithm on both the training and test sets as a function of the number of training episodes (averaged over 50 iterations). It can be directly noticed that this expected performance is significantly better than the performance achieved by the typical run previously analysed, which can therefore be considered as not really representative of the average behaviour. This highlights a key limitation of the TDQN algorithm: its substantial variance, which may result in selecting policies performing poorly compared to the expected performance. The significantly higher performance achieved on the training set also suggests that the DRL algorithm is subject to overfitting in this specific case, despite the multiple regularisation techniques implemented. This overfitting phenomenon can be partially explained by the observation space O, which is too limited to efficiently apprehend the Tesla stock. Even though this overfitting phenomenon does not seem to be too harmful in this particular case, it may lead to poor performance for other stocks.

6.3. Global results - Testbench

As previously suggested in this research paper, the TDQN algorithm is evaluated on the testbench introduced in Section 5.1, in order to draw more robust and trustworthy conclusions. Table 6 presents the expected Sharpe ratio achieved by both the TDQN and benchmark trading strategies on the entire set of stocks included in this testbench.

Regarding the performance achieved by the benchmark trading strategies, it is important to differentiate the passive strategies (B&H and S&H) from the active ones (TF and MR). Indeed, this second family of trading strategies has more potential, at the cost of an extra non-negligible risk: continuous speculation. Because the stock markets were mostly bullish (price pt mainly increasing over time), with some instabilities during the test set trading period, it is not surprising to see the buy and hold strategy outperforming the other benchmark trading strategies. In fact, neither the trend following nor the mean reversion strategy managed to generate satisfying results on average on this testbench, which clearly indicates that actively trading in such market conditions is particularly difficult. This poorer performance can also be explained by the fact that such strategies are generally well suited to exploit specific financial patterns, but they lack versatility and thus often fail to achieve good average performance on a large set of stocks presenting diverse characteristics.
Table 6: Performance assessment for the entire testbench

Stock    B&H    S&H    TF    MR    TDQN    (Sharpe ratio)
Dow Jones (DIA) 0.684 -0.636 -0.325 -0.214 0.684
S&P 500 (SPY) 0.834 -0.833 -0.309 -0.376 0.834
NASDAQ 100 (QQQ) 0.845 -0.806 0.264 0.060 0.845
FTSE 100 (EZU) 0.088 0.026 -0.404 -0.030 0.103
Nikkei 225 (EWJ) 0.128 -0.025 -1.649 0.418 0.019
Google (GOOGL) 0.570 -0.370 0.125 0.555 0.227
Apple (AAPL) 1.239 -1.593 1.178 -0.609 1.424
Facebook (FB) 0.371 -0.078 0.248 -0.168 0.151
Amazon (AMZN) 0.559 -0.187 0.161 -1.193 0.419
Microsoft (MSFT) 1.364 -1.390 -0.041 -0.416 0.987
Twitter (TWTR) 0.189 0.314 -0.271 -0.422 0.238
Nokia (NOK) -0.408 0.565 1.088 1.314 -0.094
Philips (PHIA.AS) 1.062 -0.672 -0.167 -0.599 0.675
Siemens (SIE.DE) 0.399 -0.265 0.525 0.526 0.426
Baidu (BIDU) -0.699 0.866 -1.209 0.167 0.080
Alibaba (BABA) 0.357 -0.139 -0.068 0.293 0.021
Tencent (0700.HK) -0.013 0.309 0.179 -0.466 -0.198
Sony (6758.T) 0.794 -0.655 -0.352 0.415 0.424
JPMorgan Chase (JPM) 0.713 -0.743 -1.325 -0.004 0.722
HSBC (HSBC) -0.518 0.725 -1.061 0.447 0.011
CCB (0939.HK) 0.026 0.165 -1.163 -0.388 0.202
ExxonMobil (XOM) 0.055 0.132 -0.386 -0.673 0.098
Shell (RDSA.AS) 0.488 -0.238 -0.043 0.742 0.425
PetroChina (PTR) -0.376 0.514 -0.821 -0.238 0.156
Tesla (TSLA) 0.508 -0.154 -0.987 0.358 0.621
Volkswagen (VOW3.DE) 0.384 -0.208 -0.361 0.601 0.216
Toyota (7203.T) 0.352 -0.242 -1.108 -0.378 0.304
Coca Cola (KO) 1.031 -0.871 -0.236 -0.394 1.068
AB InBev (ABI.BR) -0.058 0.275 0.036 -1.313 0.187
Kirin (2503.T) 0.106 0.156 -1.441 0.313 0.852
Average 0.369 -0.202 -0.331 -0.056 0.404

Moreover, such strategies are generally more impacted by the trading costs due to their higher trading frequency (for relatively short moving average durations, as is the case in this research paper).

Concerning the innovative trading strategy, the TDQN algorithm achieves promising results on the testbench, outperforming the benchmark active trading strategies on average. Nevertheless, the DRL trading strategy only barely surpasses the buy and hold strategy on these particular bullish markets, which are so favourable to this simple passive strategy. Interestingly, it should be noted that the performance of the TDQN algorithm is identical or very close to the performance of the passive trading strategies (B&H and S&H) for multiple stocks. This is explained by the fact that the DRL strategy efficiently learns to tend toward a passive trading strategy when the uncertainty associated with active trading increases. It should also be emphasised that the TDQN algorithm is neither a trend following nor a mean reversion trading strategy, as both financial patterns can be efficiently handled in practice. Thus, the main advantage of the DRL trading strategy is certainly its versatility and its ability to efficiently handle various markets presenting diverse characteristics.
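For reference, the trend following (TF) and mean reversion (MR) benchmarks can be expressed with simple moving-average rules. The sketch below uses a common crossover formulation with illustrative window lengths; both the exact rule and the durations are assumptions rather than the specification given earlier in the paper.

```python
import numpy as np

def moving_average(prices, window):
    """Trailing simple moving average over the last `window` prices."""
    prices = np.asarray(prices, dtype=float)
    return np.convolve(prices, np.ones(window) / window, mode="valid")

def trend_following_signal(prices, short=20, long=50):
    """Go long (+1) when the short moving average is above the long one,
    short (-1) otherwise. Window lengths are illustrative assumptions."""
    short_ma = moving_average(prices, short)[-1]
    long_ma = moving_average(prices, long)[-1]
    return 1 if short_ma > long_ma else -1

def mean_reversion_signal(prices, short=20, long=50):
    """Opposite bet: assume the price reverts toward its long-run average."""
    return -trend_following_signal(prices, short, long)
```

Such fixed rules exploit one specific pattern each, which is precisely why they lack the versatility discussed above.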
6.4. Discount factor discussion

As previously explained in Section 3.4, the discount factor γ is concerned with the importance of future rewards. In the scope of this algorithmic trading problem, the proper tuning of this parameter is not trivial due to the significant uncertainty of the future. On the one hand, the desired trading policy should be long-term oriented (γ → 1), in order to avoid a too high trading frequency and the exposure to considerable trading costs that comes with it. On the other hand, it would be unwise to place too much importance on a stock market future which is particularly uncertain (γ → 0). Therefore, a trade-off intuitively exists for the discount factor parameter.

This reasoning is validated by the multiple experiments performed to tune the parameter γ. Indeed, it was observed that there is an optimal value for the discount factor, which is neither too small nor too large. Additionally, these experiments highlighted the hidden link between the discount factor and the trading frequency, induced by the trading costs. From the point of view of the RL agent, these costs represent an obstacle to overcome for a change in trading position to occur, due to the immediately reduced (and often negative) reward received. It models the fact that the trading agent should be sufficiently confident about the future in order to overcome the extra risk associated with the trading costs. Since the discount factor determines the importance assigned to the future, a small value of the parameter γ will inevitably reduce the tendency of the RL agent to change its trading position, which decreases the trading frequency of the TDQN algorithm.
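This link can be made concrete with a stylised back-of-the-envelope argument, which is an illustration rather than a result from the paper. Assume a position change incurs an immediate cost c and then yields a constant expected per-step advantage δ:

```latex
% Stylised condition for a position change to be worthwhile,
% assuming an immediate cost c and a constant expected advantage \delta:
-c + \sum_{k=1}^{\infty} \gamma^{k} \, \delta > 0
\quad \Longleftrightarrow \quad
\frac{\gamma}{1-\gamma} > \frac{c}{\delta} .
```

The smaller γ is, the larger the expected advantage must be before a position change becomes worthwhile, which directly lowers the trading frequency.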
6.5. Trading costs discussion

The analysis of the influence of the trading costs on the behaviour and performance of a trading strategy is essential, as such costs represent an extra risk to mitigate. A major motivation for studying DRL solutions rather than pure prediction techniques, which could also be based on DL architectures, is related to the trading costs. As previously explained in Section 3, the RL formalism enables the consideration of these additional costs directly within the decision-making process: the optimal policy is learned according to the trading costs value. On the contrary, a purely predictive approach would only output predictions about the future market direction or prices, without any indication regarding an appropriate trading strategy taking the trading costs into account. Although this last approach offers more flexibility and could certainly lead to well-performing trading strategies, it is less efficient by design.

In order to illustrate the ability of the TDQN algorithm to automatically and efficiently adapt to different trading costs, Figure 11 presents the behaviour of the DRL trading strategy for three different cost values, all other parameters remaining unchanged. It can clearly be observed that the TDQN algorithm effectively reduces its trading frequency when the trading costs increase, as expected. When these costs become too high, the DRL algorithm simply stops actively trading and adopts a passive approach (buy and hold or sell and hold strategies).
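As an illustration of how such costs can enter the decision-making process, the sketch below applies a simplified portfolio update in which a proportional cost is charged on every position change, so that the resulting reward is automatically net of trading costs. This is a toy formulation, not the exact update equations of the paper.

```python
def step_portfolio(cash, shares, price_t, price_t1, action, cost_rate=0.001):
    """One simplified portfolio update with proportional trading costs.

    `action` is the target position, +1 (long) or -1 (short), in this toy
    illustration; the paper's full formulation (action space bounds,
    short-selling mechanics) is deliberately simplified here.
    """
    value_before = cash + shares * price_t
    target_shares = action * value_before / (price_t * (1 + cost_rate))
    traded = target_shares - shares
    cash -= traded * price_t + cost_rate * abs(traded) * price_t  # costs charged here
    shares = target_shares
    value_after = cash + shares * price_t1
    reward = (value_after - value_before) / value_before  # daily return net of costs
    return cash, shares, reward
```

Because the cost is charged inside the update, the reward signal, and hence the learned Q-values, automatically account for it.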
6.6. Core challenges

Nowadays, the main DRL solutions successfully applied to real-life problems concern specific environments with particular properties, such as games (see e.g. the famous AlphaGo algorithm developed by Google DeepMind, Silver et al. (2016)). In this research paper, an entirely different environment, characterised by a significant complexity and a considerable uncertainty, is studied through the algorithmic trading problem. Obviously, multiple challenges were faced during the research around the TDQN algorithm, the major ones being summarised hereafter.

Firstly, the extremely poor observability of the trading environment is a characteristic that significantly limits the performance of the TDQN algorithm. Indeed, the amount of information at the disposal of the RL agent is really not sufficient to accurately explain the financial phenomena occurring during training, which is necessary to efficiently learn to trade. Secondly, although the distribution of the daily returns is continuously changing, the past is required to be representative enough of the future for the TDQN algorithm to achieve good results. This makes the DRL trading strategy particularly sensitive to significant market regime shifts. Thirdly, the overfitting tendency of the TDQN algorithm has to be properly handled in order to obtain a reliable trading strategy. As suggested in Zhang et al. (2018), more rigorous evaluation protocols are required in RL due to the strong tendency of common DRL techniques to overfit.
Figure 11: Impact of the trading costs on the TDQN algorithm, for the Apple stock. (a) Trading costs: 0%; (b) Trading costs: 0.1%; (c) Trading costs: 0.2%.

More research on this particular topic is required for DRL techniques to fit a broader range of real-life applications. Lastly, the substantial variance of DRL algorithms such as DQN makes it rather difficult to successfully apply these algorithms to certain problems, especially when the training and test sets differ considerably. This is a key limitation of the TDQN algorithm, which was previously highlighted for the Tesla stock.

7. Conclusion

This scientific research paper presents the Trading Deep Q-Network algorithm (TDQN), a deep reinforcement learning (DRL) solution to the algorithmic trading problem of determining the optimal trading position at any point in time during a trading activity in stock markets. Following a rigorous performance assessment, this innovative trading strategy achieves promising results, surpassing on average the benchmark trading strategies. Moreover, the TDQN algorithm demonstrates multiple benefits compared to more classical approaches, such as an appreciable versatility and a remarkable robustness to diverse trading costs. Additionally, such a data-driven approach presents the major advantage of suppressing the complex task of defining explicit rules suited to the particular financial markets considered.

Nevertheless, the performance of the TDQN algorithm could still be improved, from both a generalisation and a reproducibility point of view, among other aspects. Several research directions are suggested to upgrade the DRL solution, such as the use of LSTM layers in the deep neural network, which should help to better process the financial time-series data, see e.g. Hausknecht and Stone (2015). Another example is the consideration of the numerous improvements implemented in the Rainbow algorithm, which are detailed in Sutton and Barto (2018), van Hasselt et al. (2015), Wang et al. (2015), Schaul et al. (2016), Bellemare et al. (2017), Fortunato et al. (2018) and Hessel et al. (2017). Another interesting research direction is the comparison of the TDQN algorithm with Policy Optimisation DRL algorithms such as the Proximal Policy Optimisation algorithm (PPO, Schulman et al. (2017)).

The last major research direction suggested concerns the formalisation of the algorithmic trading problem into a reinforcement learning one. Firstly, the observation space O should be extended to enhance the observability of the trading environment. Similarly, some constraints on the action space A could be relaxed in order to enable new trading possibilities. Secondly, advanced RL reward engineering should be performed to narrow the gap between the RL objective and the Sharpe ratio maximisation objective. Finally, an interesting and promising research direction is the consideration of distributions instead of expected values in the TDQN algorithm, in order to encompass the notion of risk and to better handle uncertainty.

Acknowledgements

Thibaut Théate is a Research Fellow of the F.R.S.-FNRS, of which he acknowledges the financial support.

References

Arévalo, A., Niño, J., Hernández, G., and Sandoval, J. (2016). High-Frequency Trading Strategy Based on Deep Neural Networks. ICIC.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). A Brief Survey of Deep Reinforcement Learning. CoRR, abs/1708.05866.
Bailey, D. H., Borwein, J. M., de Prado, M. L., and Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the American Mathematical Society, pages 458–471.
Bao, W. N., Yue, J., and Rao, Y. (2017). A Deep Learning Framework for Financial Time Series using Stacked Autoencoders and Long-Short Term Memory. PloS one, 12.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. CoRR, abs/1707.06887.
Bollen, J., Mao, H., and jun Zeng, X. (2011). Twitter Mood Predicts the Stock Market. J. Comput. Science, 2:1–8.
Boukas, I., Ernst, D., Théate, T., Bolland, A., Huynen, A., Buchwald, M., Wynants, C., and Cornélusse, B. (2020). A Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding. ArXiv, abs/2004.05940.
Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming using Function Approximators. CRC Press.
Carapuço, J., Neves, R. F., and Horta, N. (2018). Reinforcement Learning applied to Forex Trading. Appl. Soft Comput., 73:783–794.
Chan, E. P. (2009). Quantitative Trading: How to Build Your Own Algorithmic Trading Business. Wiley.
Chan, E. P. (2013). Algorithmic Trading: Winning Strategies and Their Rationale. Wiley.
Dempster, M. A. H. and Leemans, V. (2006). An Automated FX Trading System using Adaptive Reinforcement Learning. Expert Syst. Appl., 30:543–552.
Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. (2017). Deep Direct Reinforcement Learning for Financial Signal Representation and Trading. IEEE Transactions on Neural Networks and Learning Systems, 28:653–664.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Hessel, M., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. (2018). Noisy Networks for Exploration. CoRR, abs/1706.10295.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Goodfellow, I. J., Bengio, Y., and Courville, A. C. (2015). Deep Learning. Nature, 521:436–444.
Hausknecht, M. J. and Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. CoRR, abs/1507.06527.
Hendershott, T., Jones, C. M., and Menkveld, A. J. (2011). Does Algorithmic Trading Improve Liquidity? Journal of Finance, 66:1–33.
Hessel, M., Modayil, J., van Hasselt, H. P., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. CoRR, abs/1710.02298.
Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Med, 2:124.
Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR, abs/1502.03167.
Kingma, D. P. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep Learning. Nature, 521.
Leinweber, D. and Sisk, J. (2011). Event-Driven Trading and the "New News". The Journal of Portfolio Management, 38:110–124.
Li, Y. (2017). Deep Reinforcement Learning: An Overview. CoRR, abs/1701.07274.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. (2013). Playing Atari with Deep Reinforcement Learning. CoRR, abs/1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518:529–533.
Moody, J. E. and Saffell, M. (2001). Learning to Trade via Direct Reinforcement. IEEE Transactions on Neural Networks, 12(4):875–89.
Narang, R. K. (2009). Inside the Black Box. Wiley.
Nuij, W., Milea, V., Hogenboom, F., Frasincar, F., and Kaymak, U. (2014). An Automated Framework for Incorporating News into Stock Trading Strategies. IEEE Transactions on Knowledge and Data Engineering, 26:823–835.
Nuti, G., Mirghaemi, M., Treleaven, P. C., and Yingsaeree, C. (2011). Algorithmic Trading. Computer, 44:61–69.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized Experience Replay. CoRR, abs/1511.05952.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347.
Shao, K., Tang, Z., Zhu, Y., Li, N., and Zhao, D. (2019). A Survey of Deep Reinforcement Learning in Video Games. ArXiv, abs/1912.10944.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529:484–489.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition.
Szepesvari, C. (2010). Algorithms for Reinforcement Learning. Morgan and Claypool Publishers.
Treleaven, P. C., Galas, M., and Lalchand, V. (2013). Algorithmic Trading Review. Commun. ACM, 56:76–85.
van Hasselt, H. P., Guez, A., and Silver, D. (2015). Deep Reinforcement Learning with Double Q-Learning. CoRR, abs/1509.06461.
Wang, Z., de Freitas, N., and Lanctot, M. (2015). Dueling Network Architectures for Deep Reinforcement Learning. CoRR, abs/1511.06581.
Watkins, C. J. C. H. and Dayan, P. (1992). Technical Note: Q-Learning. Machine Learning, 8:279–292.
Zhang, C., Vinyals, O., Munos, R., and Bengio, S. (2018). A Study on Overfitting in Deep Reinforcement Learning. CoRR, abs/1804.06893.
Appendix A. Derivation of action space A

Theorem 1. The RL action space A admits an upper bound $\overline{Q}_t$ such that:

$$\overline{Q}_t = \frac{v_t^c}{p_t \, (1 + C)}$$

Proof. The upper bound of the RL action space A is derived from the fact that the cash value $v_t^c$ has to remain positive over the entire trading horizon (Equation 9). Making the hypothesis that $v_t^c \geq 0$, the number of shares $Q_t$ traded by the RL agent at time step $t$ has to be set such that $v_{t+1}^c \geq 0$ as well. Introducing this condition into Equation 12, which expresses the update of the cash value, the following expression is obtained:

$$v_t^c - Q_t \, p_t - C \, |Q_t| \, p_t \geq 0$$

Two cases arise depending on the value of $Q_t$:

Case of $Q_t < 0$: The previous expression becomes
$$v_t^c - Q_t \, p_t + C \, Q_t \, p_t \geq 0 \;\Leftrightarrow\; Q_t \leq \frac{v_t^c}{p_t \, (1 - C)}.$$
The expression on the right side of the inequality is always positive due to the hypothesis that $v_t^c \geq 0$. Because $Q_t$ is negative in this case, the condition is always satisfied.

Case of $Q_t \geq 0$: The previous expression becomes
$$v_t^c - Q_t \, p_t - C \, Q_t \, p_t \geq 0 \;\Leftrightarrow\; Q_t \leq \frac{v_t^c}{p_t \, (1 + C)}.$$
This condition represents the (positive) upper bound of the RL action space A.

Theorem 2. The RL action space A admits a lower bound $\underline{Q}_t$ such that:

$$\underline{Q}_t = \begin{cases} \dfrac{\Delta_t}{p_t \, \epsilon \, (1 + C)} & \text{if } \Delta_t \geq 0, \\[2ex] \dfrac{\Delta_t}{p_t \, (2C + \epsilon(1 + C))} & \text{if } \Delta_t < 0, \end{cases}$$

with $\Delta_t = -v_t^c - n_t \, p_t \, (1 + \epsilon)(1 + C)$.

Proof. The lower bound of the RL action space A is derived from the fact that the cash value $v_t^c$ has to be sufficient to get back to a neutral position ($n_t = 0$) over the entire trading horizon (Equation 13). Making the hypothesis that this condition is satisfied at time step $t$, the number of shares $Q_t$ traded by the RL agent should be such that this condition remains true at the next time step $t+1$. Introducing this constraint into Equation 12, the following inequality is obtained:

$$v_t^c - Q_t \, p_t - C \, |Q_t| \, p_t \geq -(n_t + Q_t) \, p_t \, (1 + C)(1 + \epsilon)$$

Two cases arise depending on the value of $Q_t$:

Case of $Q_t \geq 0$: The previous expression becomes
$$v_t^c - Q_t \, p_t - C \, Q_t \, p_t \geq -(n_t + Q_t) \, p_t \, (1 + C)(1 + \epsilon)$$
$$\Leftrightarrow\; v_t^c \geq -n_t \, p_t \, (1 + C)(1 + \epsilon) - Q_t \, p_t \, \epsilon \, (1 + C)$$
$$\Leftrightarrow\; Q_t \geq \frac{-v_t^c - n_t \, p_t \, (1 + C)(1 + \epsilon)}{p_t \, \epsilon \, (1 + C)}$$
The expression on the right side of the inequality represents the first lower bound for the RL action space A.

Case of $Q_t < 0$: The previous expression becomes
$$v_t^c - Q_t \, p_t + C \, Q_t \, p_t \geq -(n_t + Q_t) \, p_t \, (1 + C)(1 + \epsilon)$$
$$\Leftrightarrow\; v_t^c \geq -n_t \, p_t \, (1 + C)(1 + \epsilon) - Q_t \, p_t \, (2C + \epsilon + \epsilon C)$$
$$\Leftrightarrow\; Q_t \geq \frac{-v_t^c - n_t \, p_t \, (1 + C)(1 + \epsilon)}{p_t \, (2C + \epsilon(1 + C))}$$
The expression on the right side of the inequality represents the second lower bound for the RL action space A.

Both lower bounds previously derived have the same numerator, which is denoted $\Delta_t$ from now on. This quantity represents the difference between the maximum assumed cost to get back to a neutral position at the next time step $t+1$ and the current cash value of the agent $v_t^c$. The expression tests whether or not the agent can pay its debt in the worst assumed case at the next time step, if nothing is done at the current time step ($Q_t = 0$). Two cases arise depending on the sign of the quantity $\Delta_t$:

Case of $\Delta_t < 0$: The trading agent has no problem paying its debt in the situation previously described. This is always true when the agent owns a positive number of shares ($n_t \geq 0$). This is also always true when the agent owns a negative number of shares ($n_t < 0$) and the price decreases ($p_t < p_{t-1}$), due to the hypothesis that Equation 13 was verified at time step $t$. In this case, the most constraining lower bound of the two is the following:
$$\underline{Q}_t = \frac{\Delta_t}{p_t \, (2C + \epsilon(1 + C))}$$

Case of $\Delta_t \geq 0$: The trading agent may have problems paying its debt in the situation previously described. Following a reasoning similar to that of the previous case, the most constraining lower bound of the two is the following:
$$\underline{Q}_t = \frac{\Delta_t}{p_t \, \epsilon \, (1 + C)}$$
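These bounds translate directly into code. The sketch below implements the formulas derived above; the variable names are illustrative and the function is not taken from the released implementation.

```python
def action_space_bounds(cash, shares, price, cost_rate, epsilon):
    """Upper and lower bounds on the number of shares Q_t that may be traded,
    following the formulas derived in Appendix A.

    cash      : current cash value v_t^c
    shares    : current number of shares owned n_t
    price     : current stock price p_t
    cost_rate : proportional trading cost C
    epsilon   : maximum assumed relative price increase used in the derivation
    """
    # Upper bound: the cash value must remain positive after the trade.
    upper = cash / (price * (1 + cost_rate))

    # Lower bound: the agent must remain able to return to a neutral position.
    delta = -cash - shares * price * (1 + epsilon) * (1 + cost_rate)
    if delta >= 0:
        lower = delta / (price * epsilon * (1 + cost_rate))
    else:
        lower = delta / (price * (2 * cost_rate + epsilon * (1 + cost_rate)))
    return lower, upper

# Example: $100,000 in cash, no shares, price of $100, 0.1% costs, epsilon of 1%.
print(action_space_bounds(100000, 0, 100.0, 0.001, 0.01))
```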
