
This article has been accepted for publication in a future issue of IEEE Access, but has not been fully edited. Content may change prior to final publication.
Digital Object Identifier 10.1109/ACCESS.2021.3127209

Practical Algorithmic Trading Using State Representation Learning and Imitative Reinforcement Learning
DEOG-YEONG PARK AND KI-HOON LEE
School of Computer and Information Engineering, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea
Corresponding author: Ki-Hoon Lee (e-mail: kihoonlee@kw.ac.kr)
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the
Ministry of Education (NRF-2018R1D1A1B07043727). The present research has been conducted by the Research Grant of Kwangwoon
University in 2020.

ABSTRACT Algorithmic trading allows investors to avoid emotional and irrational trading decisions and
helps them make profits using modern computer technology. In recent years, reinforcement learning has
yielded promising results for algorithmic trading. Two prominent challenges in algorithmic trading with
reinforcement learning are (1) extracting robust features and (2) learning a profitable trading policy. In addition, it was
previously often assumed that both long and short positions are always possible in stock trading; however, taking a
short position is risky or sometimes impossible in practice. We propose a
practical algorithmic trading method, SIRL-Trader, which achieves good profit using only long positions.
SIRL-Trader uses offline/online state representation learning (SRL) and imitative reinforcement learning.
In offline SRL, we apply dimensionality reduction and clustering to extract robust features whereas, in
online SRL, we co-train a regression model with a reinforcement learning model to provide accurate state
information for decision-making. In imitative reinforcement learning, we incorporate a behavior cloning
technique with the twin-delayed deep deterministic policy gradient (TD3) algorithm and apply multistep
learning and dynamic delay to TD3. The experimental results show that SIRL-Trader yields higher profits
and offers superior generalization ability compared with state-of-the-art methods.

INDEX TERMS algorithmic trading, deep learning, state representation learning, imitation learning,
reinforcement learning.

I. INTRODUCTION

Algorithmic trading, which enables investors to trade stocks without human intervention, has started playing an important role in modern stock markets. Algorithmic trading is a subset of quantitative trading, which relies heavily on quantitative analysis and machine-learning methods. In particular, reinforcement learning methods can be used to learn trading strategies in the process of interacting with the stock market environment. The application of reinforcement learning to algorithmic trading is not trivial because financial time-series data are nonstationary and contain considerable noise. Two prominent challenges presented by algorithmic trading with reinforcement learning are (1) extracting robust features and (2) learning a profitable trading policy. To accommodate these challenges, recent methods [1]–[6] use deep reinforcement learning and deliver good performance in terms of profitability.

Previously, an assumption was often made that both long and short positions are always possible in stock trading. These positions represent the direction of the bets that a stock price is expected to either rise or fall. A long position involves buying stocks with the intention of future selling, whereas a short position involves selling stocks borrowed from a stockbroker first and then buying them back to close the position.

We note that taking a short position is risky or sometimes impossible in practice, particularly for individual investors. First, the maximum gain in a short position is 100% when the stock price falls to zero, whereas the potential loss is theoretically infinite as the price has no upper bound. On the other hand, the maximum loss in a long position is limited because the price can only decrease to zero, but the possible gain is theoretically infinite. Second, the stockbroker may close a short position immediately, without the investor's


consent. The borrowed stocks are essentially loans from stockbrokers. If the stock price rises, the stockbroker requires additional capital, and if the investor cannot provide capital, the stockbroker can close the position, resulting in a loss. Third, the stock market generally increases over time. Fourth, economic regulators can severely restrict or temporarily ban short-selling during an economic crisis. In summary, a short position is not recommended for individual investors. However, obtaining good profits using only long positions is challenging in algorithmic trading.

In this paper, we propose a practical algorithmic trading method named SIRL-Trader, which generates good profits using only long positions. First, we devise an offline/online state representation learning (SRL) method. The offline unsupervised SRL reduces the dimensionality and applies clustering to extract robust features from observations. The online supervised SRL co-trains a regression model that predicts the next price with a reinforcement learning model to provide accurate state information for decision-making. Second, we combine imitation learning with reinforcement learning by cloning the behavior of a prophetic expert who has information about subsequent price movements. Third, we extend the twin-delayed deep deterministic policy gradient (TD3) algorithm [7], which is a state-of-the-art reinforcement learning method, to incorporate offline/online SRL, behavior cloning, multistep learning, and dynamic delay. Fourth, compared with state-of-the-art algorithmic trading methods, SIRL-Trader yields higher profit and has superior generalization ability for different stocks.

The remainder of this paper is organized as follows. Section II introduces reinforcement learning methods, and Section III reviews existing work. Sections IV and V present SIRL-Trader and experimental results, respectively. Finally, Section VI presents our conclusions and suggestions for future work. For ease of reading, Table V in the Appendix lists the abbreviations used in this paper.

II. BACKGROUND

A. REINFORCEMENT LEARNING
In reinforcement learning, an agent learns to act in an environment to maximize the total reward. At each time step, the environment provides a state s to the agent, the agent selects and takes an action a, and then the environment provides a reward r and the next state s′. This interaction can be formalized as a Markov decision process (MDP), which is a tuple ⟨S, A, P, R, γ⟩, where S is a finite set of states, A is a finite set of actions, P(s, a, s′) is a state transition probability, R(s, a) is a reward function, and γ ∈ [0, 1] is the discount factor, a trade-off between immediate and long-term rewards. In reinforcement learning for algorithmic trading, the state is not directly given and needs to be constructed from a history of observations. To accommodate this, the MDP model was extended with an observation probability P(o|s, a). The extended model is referred to as the partially observable MDP (POMDP) model [8].

The agent selects an action using a deterministic policy µ(s) or a stochastic policy π(a|s) that defines the probability distribution over actions for each state. The discounted sum of future rewards collected by the agent from the state s_t is defined as the discounted return G_t in Equation (1).

G_t = Σ_{i=0}^{∞} γ^i r_{t+i} = r_t + γ G_{t+1}    (1)

B. DEEP Q-NETWORK (DQN)
In value-based reinforcement learning, the agent learns an estimate of the expected discounted return, or value, for each state (Equation (2)) or for each state and action pair (Equation (3)).

V^π(s) = E_{a∼π}[G_t | S_t = s]    (2)

Q^π(s, a) = E_{a∼π}[G_t | S_t = s, A_t = a]    (3)

A common way of deriving a new policy π′ from Q^π(s, a) is to act ε-greedily with respect to actions. With probability (1 − ε), the agent takes the action with the highest Q-value (the greedy action), that is, π′(s) = argmax_{a∈A} Q^π(s, a). With probability ε, the agent takes a random action to introduce exploration.

The deep Q-network (DQN) [9] introduces deep neural networks to approximate the Q-value for large state and action spaces. DQN uses a replay buffer to store past experiences as tuples of ⟨s, a, r, s′⟩ and learns by sampling batches from the replay buffer. DQN uses two neural networks: online (Q_θ) and target (Q_{θ′}) networks. The parameters θ of the online network are periodically copied to the target network during training. The loss function of DQN in Equation (5) is the mean squared error (MSE) between Q_θ(s, a) and the target value Y_DQN in Equation (4), which uses the Bellman equation [10]. The techniques of experience replay and the target network enable stable learning.

Y_DQN = r + γ max_{a′} Q_{θ′}(s′, a′)    (4)

L_DQN = E[(Y_DQN − Q_θ(s, a))²]    (5)

Because we take the maximum value for the target network in Equation (4), we often obtain an overestimated value. Double DQN (DDQN) [11] solves this overestimation problem of DQN by decoupling the selection of the action from its evaluation, as shown in Equation (6).

Y_DDQN = r + γ Q_{θ′}(s′, argmax_{a′} Q_θ(s′, a′))    (6)

Rainbow DQN [12] is an improvement on DQN and combines several features including DDQN, prioritized experience replay [13], dueling networks [14], multistep learning [15], distributional reinforcement learning [16], and noisy networks [17].
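To make Equation (1) and Equations (4)–(6) concrete, here is a minimal NumPy sketch that computes the discounted return of a reward sequence and the DQN and DDQN targets; the toy Q-value arrays and reward values are illustrative assumptions, not values from the paper.

    import numpy as np

    def discounted_return(rewards, gamma=0.99):
        # G_t = r_t + gamma * G_{t+1} (Equation (1)), accumulated backwards over an episode
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # Toy Q-values for a single next state s' with three actions (illustrative only)
    q_online = np.array([1.0, 2.5, 0.3])   # Q_theta(s', a')
    q_target = np.array([0.8, 2.0, 0.5])   # Q_theta'(s', a')
    r, gamma = 0.1, 0.99

    y_dqn = r + gamma * q_target.max()                   # Equation (4)
    y_ddqn = r + gamma * q_target[np.argmax(q_online)]   # Equation (6): select with the online network,
                                                         # evaluate with the target network
    print(discounted_return([1.0, 0.0, 2.0]), y_dqn, y_ddqn)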

C. ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC (A3C)
In policy-based reinforcement learning, the agent directly learns a policy function π(a|s). Actor-critic methods combine policy-based and value-based reinforcement learning methods. In these methods, the critic network (V_θ) learns the value function, and the actor network (π_φ) learns the policy in the direction suggested by the critic. In general, the loss function of the critic network is defined in Equation (8), which uses the Bellman equation; the loss function of the actor network is defined in Equation (9), which uses the stochastic policy gradient theorem [18].

Y = r + γ V_θ(s′)    (7)

L_critic = E[(Y − V_θ(s))²]    (8)

L_actor = E[− log π_φ(a|s) (Y − V_θ(s))]    (9)

Asynchronous advantage actor-critic (A3C) is an actor-critic method that asynchronously executes multiple agents in parallel instead of using experience replay. In A3C, the critic network estimates the advantage of action a in state s, or A(s, a) = Q(s, a) − V(s). A3C uses n-step returns to accelerate convergence and adds the entropy of the policy π(a|s) to the loss function of the actor network to introduce exploration.

D. DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
The deterministic policy gradient (DPG) [19] is an actor-critic method that learns a deterministic policy µ(s) rather than a stochastic policy π(a|s). It is a special case of the stochastic policy gradient when the variance approaches zero. DPG is efficient and effective for high-dimensional continuous action spaces.

The deep DPG (DDPG) [20] is an actor-critic method that combines DPG and DQN. As in DQN, DDPG uses a replay buffer and target networks. For exploration, DDPG adds noise N to the policy, as shown in Equation (10). The loss function of the critic network is defined as in Equation (12), which uses the Bellman equation, and that of the actor network in Equation (13), which uses the deterministic policy gradient theorem [19]. After updating the online actor and critic networks, the target actor and critic networks are soft-updated from the online networks, as in Equation (14).

a = µ_φ(s) + N    (10)

Y = r + γ Q_{θ′}(s′, µ_{φ′}(s′))    (11)

L_critic = E[(Y − Q_θ(s, a))²]    (12)

L_actor = E[−Q_θ(s, µ_φ(s))]    (13)

θ′ ← τθ + (1 − τ)θ′,  φ′ ← τφ + (1 − τ)φ′    (14)

E. TWIN-DELAYED DDPG (TD3)
The twin-delayed DDPG (TD3) improves DDPG in three ways. First, TD3 adds Gaussian noise to the target action, as in Equation (15), which is used to obtain the target value Y in Equation (16). This technique, known as target policy smoothing, serves as a regularizer to avoid overfitting to sharp peaks in the Q-value estimate. The noise is clipped to limit its impact.

a′ = µ_{φ′}(s′) + ε,  ε ∼ clip(N, −c, c)    (15)

Second, TD3 uses two critics (and two target critics) to solve the overestimation problem of DDPG. The minimum value in the pair of target critics is used to compute the target value, as shown in Equation (16). This technique is known as clipped double Q-learning.

Y = r + γ min_{j=1,2} Q_{θ′_j}(s′, a′)    (16)

Third, TD3 updates the actor network µ_φ (and the target actor µ_{φ′} and critic networks Q_{θ′_j}) less frequently than the critic networks Q_{θj}. This technique, known as delayed policy updates, aims to delay policy updates until the Q-value converges.
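As a concrete illustration of Equations (15) and (16), the following sketch computes a TD3 target with clipped Gaussian noise on the target action and the minimum over two target critics; the actor and critic stand-ins are plain Python functions chosen only for brevity, not the networks defined later in the paper.

    import numpy as np

    def td3_target(r, s_next, actor_target, critic_targets, gamma=0.99, sigma=0.2, c=0.5):
        # Target policy smoothing (Equation (15)): a' = mu_phi'(s') + clipped noise
        noise = float(np.clip(np.random.normal(0.0, sigma), -c, c))
        a_next = actor_target(s_next) + noise
        # Clipped double Q-learning (Equation (16)): minimum of the two target critics
        q_min = min(q(s_next, a_next) for q in critic_targets)
        return r + gamma * q_min

    # Illustrative stand-ins for the target actor and the two target critics
    actor_target = lambda s: float(np.tanh(s.sum()))
    critic_1 = lambda s, a: float(s.sum()) + a
    critic_2 = lambda s, a: float(s.sum()) + 0.9 * a
    y = td3_target(r=0.05, s_next=np.array([0.1, -0.2]),
                   actor_target=actor_target, critic_targets=[critic_1, critic_2])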

III. RELATED WORK
Considerable research has been devoted to algorithmic trading, including supervised learning methods [21]–[27] that predict price trends and reinforcement learning methods [1]–[6] that directly learn a profitable trading policy. In recent years, deep reinforcement learning methods have shown promising results in algorithmic trading. There are two challenges in deep reinforcement learning for algorithmic trading: extracting robust features from observations to represent the states (i.e., SRL) and learning a profitable trading policy.

Deng et al. [1] used a fuzzy deep neural network for SRL and proposed a policy-based reinforcement learning method with a recurrent neural network. Li et al. [2] used a stacked denoising autoencoder for SRL and proposed DDQN-extended and A3C-extended methods with a long short-term memory (LSTM) network [28]. The A3C-extended method yields more profits than the DDQN-extended method. Fengqian and Chao [3] decomposed candlesticks (or K-lines) into components such as the lengths of the upper shadow line, lower shadow line, and body. Each component is then clustered, and the cluster centers and the color of the body are used to represent the state. For deep reinforcement learning, [3] used a policy gradient method with ε-greedy exploration. Wu et al. [6] used a gated recurrent unit (GRU) network [29] for SRL and proposed DDQN-extended and DDPG-extended methods, GDQN and GDPG, respectively. GDPG provides more stable returns than GDQN does. Lei et al. [4] used a gate structure to select features, GRU to capture long-term dependency, and a temporal attention mechanism to weight past states based on the current state. They proposed a policy gradient method, known as time-driven feature-aware jointly deep reinforcement learning (TFJ-DRL), which combines SRL and a policy gradient method using an auto-encoder. The decoding part of the SRL model is used to predict the next closing price, where the real price is used as the feedback signal. The encoding part of the SRL model is used as the state representation for the reinforcement learning. Liu et al. [5] used GRU for SRL and introduced imitation learning techniques, such as a demonstration buffer and behavior cloning, to the DPG algorithm.

The above-mentioned studies have resulted in several significant improvements in algorithmic trading. However, it is unclear whether these improvements are complementary and can be combined to obtain positive results. This study resulted in a comprehensive solution that integrates existing improvements with new ideas, as explained in the next section.

IV. SIRL-TRADER
In this section, we propose the novel algorithmic trading method named SIRL-Trader.

A. ARCHITECTURE
We propose an actor-critic reinforcement learning method that extends TD3 to incorporate offline/online SRL, imitation learning, multistep learning, and dynamic delay. Fig. 1 shows the proposed architecture. Each component is explained in detail in the following subsections.

FIGURE 1: Architecture for SIRL-Trader

B. STATE REPRESENTATION LEARNING
SRL models learn state representations to help the agent learn a good policy. In other words, SRL models learn how to map observations to states. We use the candlestick components and technical indicators in Table I as observations.

TABLE I: Input Features

Feature Group | Features
Candlestick components [3] | the lengths of the upper shadow line, lower shadow line, and body; the body color
Overlap studies [30] | BBANDS, DEMA, EMA, HT-TRENDLINE, KAMA, MA, MAMA, MIDPOINT, MIDPRICE, SAR, SAREXT, SMA, T3, TEMA, TRIMA, WMA
Momentum indicators [30] | ADX, ADXR, APO, AROON, AROONOSC, BOP, CCI, CMO, DX, MACD, MACDEXT, MACDFIX, MFI, MINUS_DI, MINUS_DM, MOM, PLUS_DI, PLUS_DM, PPO, ROC, ROCP, ROCR, RSI, STOCH, STOCHF, STOCHRSI, TRIX, ULTOSC, WILLR
Volume indicators [30] | AD, ADOSC, OBV
Volatility indicators [30] | ATR, NATR, TRANGE

Our SRL method consists of (1) offline unsupervised SRL that occurs before training the reinforcement learning model and (2) online supervised SRL that occurs while the reinforcement learning model is being trained.

The offline SRL extracts a low-dimensional robust representation from high-dimensional observations, as shown in Fig. 2. First, we normalize each input feature using the z-score standardization method. Second, for each feature group in Table I with high dimensionality, we reduce the dimensionality of the feature space to the threshold F in Fig. 2 using principal components analysis (PCA). Third, we cluster each feature using fuzzy c-means clustering (FCM) [31]. After clustering, each feature value except the body color is represented by the cluster center to which it belongs.

FIGURE 2: Architecture for offline SRL
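The following is a minimal sketch of the offline SRL pipeline described above (tumbling-window z-score, PCA, then clustering) using scikit-learn; KMeans stands in for the fuzzy c-means clustering [31] used in the paper, and the random input matrix stands in for one feature group of Table I.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def offline_srl(features, n_components=8, n_clusters=20, window=20):
        # features: (T, D) array of one feature group from Table I
        x = features.astype(float).copy()
        # 1) z-score standardization over tumbling windows of `window` rows
        for start in range(0, len(x), window):
            block = x[start:start + window]
            x[start:start + window] = (block - block.mean(axis=0)) / (block.std(axis=0) + 1e-8)
        # 2) dimensionality reduction to the threshold F (here n_components)
        x = PCA(n_components=min(n_components, x.shape[1])).fit_transform(x)
        # 3) cluster each reduced feature and replace every value by its cluster center
        #    (KMeans is a stand-in; the paper uses fuzzy c-means [31])
        out = np.empty_like(x)
        for j in range(x.shape[1]):
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(x[:, [j]])
            out[:, j] = km.cluster_centers_[km.labels_, 0]
        return out

    robust_features = offline_srl(np.random.rand(500, 16))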

The online SRL, of which the architecture is shown in Fig. 3, takes as input a sliding window of outputs from the offline SRL to see historical data. To focus on important features in each window, we assign weights to the features using the gate structure [4]. The gate g shown in Fig. 4 uses a sigmoid activation function σ, as in Equation (17), where f denotes the input feature vector. Parameters W and b are learned using end-to-end training. In subsequent steps, we use the weighted feature vector f′ in Equation (18), where ⊙ denotes the element-wise multiplication of vectors g and f.

g = σ(W · f + b)    (17)

f′ = g ⊙ f    (18)

FIGURE 3: Architecture for online SRL

FIGURE 4: Gate structure for weighting features

After weighting the features, we apply an LSTM layer to learn temporal characteristics. The input of the LSTM layer is a sliding window of weighted feature vectors, and its output is the hidden state of the last time step as shown in Fig. 3. We term the network up to the LSTM layer as the online SRL network. Each actor and critic network has a corresponding online SRL network. We co-train the online SRL networks with actor and critic networks, respectively.
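A minimal PyTorch sketch of the online SRL network just described: a sigmoid gate (Equations (17)–(18)) weights each feature vector, an LSTM consumes the sliding window, and the hidden state of the last time step is returned as the state. The window length and LSTM width follow Section V; the feature dimension and batch size are arbitrary illustrative values.

    import torch
    import torch.nn as nn

    class OnlineSRL(nn.Module):
        # Gate (Equations (17)-(18)) followed by an LSTM over a sliding window of features
        def __init__(self, feature_dim, hidden_dim=128):
            super().__init__()
            self.gate = nn.Linear(feature_dim, feature_dim)   # W and b of Equation (17)
            self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)

        def forward(self, window):                  # window: (batch, Nw, feature_dim)
            g = torch.sigmoid(self.gate(window))    # g = sigma(W f + b)
            weighted = g * window                   # f' = g ⊙ f (element-wise product)
            _, (h_n, _) = self.lstm(weighted)
            return h_n[-1]                          # hidden state of the last time step

    srl = OnlineSRL(feature_dim=24)                 # 24 offline-SRL features is an assumption
    state = srl(torch.randn(32, 5, 24))             # window size 5 -> state of shape (32, 128)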
In addition to the online SRL networks, we train a regression network whose structure is shown in Fig. 5(a), which predicts the next closing price to provide accurate state information for the actor network. The MSE between the real and predicted prices is used as the loss function for the regression network. The predicted price does not participate in training the actor network, but the underlying online SRL network of the regression network does because the online SRL network is shared with the actor network as explained in the next section.

FIGURE 5: Neural network structures

C. IMITATIVE REINFORCEMENT LEARNING

1) Action
The trading action a_t ∈ {buy, hold, sell} = {1, 0, −1} is taken on each trading day. Because we use only the long position, the buy action precedes the sell action. The agent of the SIRL-Trader starts with a certain amount of capital. For the sake of simplicity, the agent buys the maximum number of shares of a stock and sells all shares of the stock at the closing price.

2) Reward
We define the reward r_t for action a_t as the change rate of the portfolio value as in Equation (19). The portfolio value V^p_t is the sum of the stock value V^s_t and remaining cash balance V^c_t, as in Equation (20).

r_t = a_t × (V^p_{t+1} − V^p_t) / V^p_t    (19)

V^p_t = V^s_t + V^c_t    (20)

The stock value V^s_t is the current value of the stock, which is computed by multiplying the closing price p^c_t of the stock by the number of shares n^s owned, as in Equation (21). To simulate a real trading environment, we include a transaction cost term ζ.

V^s_t = n^s × p^c_t × (1 − ζ)    (21)
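A short sketch of Equations (19)–(21); the share count, prices, and cash balance are made-up numbers, and the transaction cost defaults to the 0.25% used in Section V.

    def stock_value(n_shares, closing_price, cost=0.0025):
        # Equation (21): V^s_t = n^s * p^c_t * (1 - zeta)
        return n_shares * closing_price * (1.0 - cost)

    def portfolio_value(n_shares, closing_price, cash, cost=0.0025):
        # Equation (20): V^p_t = V^s_t + V^c_t
        return stock_value(n_shares, closing_price, cost) + cash

    def reward(action, v_now, v_next):
        # Equation (19): r_t = a_t * (V^p_{t+1} - V^p_t) / V^p_t, with a_t in {1, 0, -1}
        return action * (v_next - v_now) / v_now

    v_t = portfolio_value(n_shares=10, closing_price=100.0, cash=500.0)
    v_t1 = portfolio_value(n_shares=10, closing_price=103.0, cash=500.0)
    r_t = reward(action=1, v_now=v_t, v_next=v_t1)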
3) Algorithm
We incorporate offline/online SRL, behavior cloning, multistep learning, and dynamic delay into TD3. Algorithm 1 presents the proposed reinforcement learning algorithm. As shown in Fig. 6, we use two critic networks Q_{θ1}, Q_{θ2}, two online SRL networks o_{η1}, o_{η2} for Q_{θ1}, Q_{θ2}, an actor network µ_φ, an online SRL network λ_υ, and a regression network for λ_υ. The input for the online SRL networks is a sliding window w_t of weighted feature vectors {x_{t−Nw+1}, ..., x_{t−1}, x_t} obtained from the offline SRL.

To accelerate the training process, we use multistep learning, which collects transitions of ⟨w_t, a_t, r_t, w_{t+1}⟩ using the N-step buffer D (lines 9 to 11). An N-step transition of ⟨w_{t−N+1:t+1}, a_{t−N+1:t}, r_{t−N+1:t}⟩ is constructed from the transitions stored in D (line 13) and used to compute the target value Y (line 16).

The actor network µ_φ, whose structure is shown in Fig. 5(b), is combined with the online SRL network λ_υ as shown in Fig. 6. To support discrete actions, we use the softmax layer as the output layer of the actor network. To select an action with exploration noise, we add a different amount of noise to each output of the softmax layer (line 9) and then apply argmax. As in the original TD3 algorithm, we use target policy smoothing as a regularization strategy (line 15). When the actor network µ_φ is updated, the online SRL network λ_υ is also updated by backpropagation.

FIGURE 6: Architecture for reinforcement learning

Algorithm 1 The reinforcement learning algorithm of SIRL-Trader

Input: Sliding-window data w_t ← {x_{t−Nw+1}, ..., x_{t−1}, x_t} obtained from the offline SRL
1. Initialize critic networks Q_{θ1}, Q_{θ2} and online SRL networks o_{η1}, o_{η2} for the critics
2. Initialize an actor network µ_φ, an online SRL network λ_υ for the actor, and a regression network for λ_υ
3. Initialize target networks: θ′_{1,2} ← θ_{1,2}, η′_{1,2} ← η_{1,2}, φ′ ← φ, υ′ ← υ
4. Initialize a replay buffer B
5. for e = 1 to N_epochs do
6.     Initialize an N-step buffer D
7.     Compute the delay value for each epoch: d ← (e mod α) + β
8.     for t = Nw to T − 1 do
9.         Select an action with exploration noise ε ∼ N(0, σ): a_t ← µ_φ(s_t) + ε where s_t = λ_υ(w_t)
10.        Observe a reward r_t and the next input w_{t+1}
11.        Store a transition ⟨w_t, a_t, r_t, w_{t+1}⟩ to D
12.        if t ≥ Nw + N − 1 then
13.            Obtain an N-step transition ⟨w_{t−N+1:t+1}, a_{t−N+1:t}, r_{t−N+1:t}⟩ from D and store it to B
14.            Sample a mini-batch of B transitions of length N from B
15.            Smooth the target policy with ε ∼ clip(N(0, σ′), −c, c): a′_{t+1} ← µ_{φ′}(s_{t+1}) + ε where s_{t+1} = λ_{υ′}(w_{t+1})
16.            Y ← Σ_{i=0}^{N−1} γ^i r_{t−N+1+i} + γ^N min_{j=1,2} Q_{θ′_j}(s^j_{t+1}, a′_{t+1}) where s^j_{t+1} = o_{η′_j}(w_{t+1})
17.            Update the critics θ_j by the MSE loss: (1/B) Σ (Y − Q_{θj}(s^j_{t−N+1}, a_{t−N+1}))² where s^j_{t−N+1} = o_{ηj}(w_{t−N+1})
18.            if t mod d = 0 then
19.                Update the actor φ by the deterministic policy gradient: (1/B) Σ ∇Q_{θ1}(s_{t−N+1}, a_{t−N+1}) where s_{t−N+1} = o_{η1}(w_{t−N+1}), a_{t−N+1} = µ_φ(λ_υ(w_{t−N+1}))
20.                Update the regression network by the MSE loss: (1/B) Σ (close^real_{t−N+2} − close^predicted_{t−N+2})²
21.                Update the actor φ by the CE loss for the behavior cloning: (1/B) Σ ∇CE(a_{t−N+1}, a^expert) where a_{t−N+1} = µ_φ(λ_υ(w_{t−N+1}))
22.                Soft-update the target networks: θ′_{1,2} ← τθ_{1,2} + (1 − τ)θ′_{1,2}, η′_{1,2} ← τη_{1,2} + (1 − τ)η′_{1,2}, φ′ ← τφ + (1 − τ)φ′, υ′ ← τυ + (1 − τ)υ′
23.            end if
24.        end if
25.    end for
26. end for
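Two details of Algorithm 1 are sketched below: the per-epoch dynamic delay of line 7 (Equation (22)) and the discrete action selection of line 9, where independent noise is added to each softmax output before the argmax. The softmax probabilities are placeholders for the actor output, and the index-to-action mapping assumes the ⟨buy, hold, sell⟩ ordering stated in Section IV-C.

    import numpy as np

    def dynamic_delay(epoch, alpha=4, beta=2):
        # Line 7 / Equation (22): d = (e mod alpha) + beta
        return (epoch % alpha) + beta

    def select_action(softmax_probs, sigma=0.7):
        # Line 9: add a different noise sample to each softmax output, then take the argmax;
        # indices 0/1/2 are assumed to correspond to the actions buy/hold/sell = 1/0/-1.
        noisy = softmax_probs + np.random.normal(0.0, sigma, size=softmax_probs.shape)
        return {0: 1, 1: 0, 2: -1}[int(np.argmax(noisy))]

    for epoch in range(1, 9):
        d = dynamic_delay(epoch)
        # inside the time-step loop, the actor and target networks would only be
        # updated every d steps (lines 18-23 of Algorithm 1)
        action = select_action(np.array([0.2, 0.5, 0.3]))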


The critic networks Q_{θ1}, Q_{θ2}, whose structure is shown in Fig. 5(c), are combined with the online SRL networks o_{η1}, o_{η2}, respectively. To solve the overestimation problem, we use the clipped double Q-learning technique of TD3 with multistep learning (lines 16 to 17). When the critic networks are updated, the corresponding online SRL networks are also updated by backpropagation.

To provide accurate state information for the actor network, we update the regression network using the MSE between the real and predicted prices (line 20). When the regression network is updated, the online SRL network λ_υ is also updated by backpropagation. The actor network is indirectly affected by this update through the underlying online SRL network λ_υ, which is shared with the regression network as shown in Fig. 6.

For imitation learning, we introduce a behavior-cloning technique to guide the actor network training. We create a prophetic trading expert who selects an action at day_{t−N+1} using information about today's closing price close_{t−N+1} and tomorrow's closing price close_{t−N+2}. The expert buys when close_{t−N+2} > h × close_{t−N+1} and sells when close_{t−N+2} < h × close_{t−N+1}, where h ≥ 1 is a hyperparameter. Otherwise, the expert holds the stock. We train the actor network to minimize the cross-entropy (CE) loss between the softmax output vector of ⟨buy, hold, sell⟩ and the action a^expert of the expert, which is represented as a one-hot vector (line 21).
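The prophetic expert and the behavior-cloning term of line 21 can be sketched as follows; the cross-entropy is written out directly in NumPy rather than with a deep-learning framework, and the actor probabilities are placeholder numbers.

    import numpy as np

    def expert_action(close_today, close_tomorrow, h=1.001):
        # Prophetic expert: buy if tomorrow's close exceeds h * today's close,
        # sell if it falls below, otherwise hold; returned as a one-hot vector
        # over <buy, hold, sell>.
        if close_tomorrow > h * close_today:
            return np.array([1.0, 0.0, 0.0])   # buy
        if close_tomorrow < h * close_today:
            return np.array([0.0, 0.0, 1.0])   # sell
        return np.array([0.0, 1.0, 0.0])       # hold

    def behavior_cloning_loss(actor_probs, expert_one_hot, eps=1e-8):
        # Cross-entropy between the actor's softmax output and the expert's one-hot action
        return -np.sum(expert_one_hot * np.log(actor_probs + eps))

    a_expert = expert_action(close_today=100.0, close_tomorrow=101.5)
    loss = behavior_cloning_loss(np.array([0.6, 0.3, 0.1]), a_expert)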
For more stable and efficient training, we propose a dynamic delay technique for updating the actor and target networks. In the original TD3 algorithm, the delay is fixed to a constant value, and it is hard to find an optimal value. The dynamic delay technique allows us to try various delay values while the reinforcement learning model is being trained. For each epoch, we compute the delay value d using Equation (22) (used in line 7). In Equation (22), α is a constant for adjusting the variance in delay values, and β is a constant for setting the minimum delay value.

d = (e mod α) + β    (22)
D. DISCUSSION
Table II summarizes algorithmic trading methods along three dimensions: offline SRL, online SRL, and reinforcement learning. SIRL-Trader is the only method that integrates all the techniques (dimensionality reduction, clustering, gate structure, regression model, imitation learning, multistep learning, and dynamic delay) for the three dimensions.

TABLE II: Comparison of algorithmic trading methods

Method | Offline SRL: dim. reduction | Offline SRL: clustering | Online SRL: gate | Online SRL: regression | RL: imitation | RL: multistep | RL: dynamic delay
SIRL-Trader | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦
K-line [3] | × | ◦ | × | × | × | × | ×
TFJ-DRL [4] | × | × | ◦ | ◦ | × | × | ×
iRDPG [5] | × | × | × | × | ◦ | × | ×
GDPG [6] | × | × | × | × | × | × | ×

SIRL-Trader can be easily extended to support short positions by redefining the action space and reward of reinforcement learning. The action space is redefined as a_t ∈ {long, hold, short} = {1, 0, −1}. For the reward, the stock value V^s_t, which is used in Equation (20), is redefined as Equation (23). The stock value V^long_t for the long position is computed using Equation (24), which is the same as Equation (21). The stock value V^short_t for the short position is computed by multiplying the difference between p^c_t and p^c_opened by the number of shares n^short, as in Equation (25), where p^c_opened is the closing price of the stock when the short position is opened, and ζ is the transaction cost. For the sake of simplicity, we take either a long or a short position at a time.

V^s_t = V^long_t + V^short_t    (23)

V^long_t = n^long × p^c_t × (1 − ζ)    (24)

V^short_t = n^short × (p^c_t − p^c_opened × (1 − ζ))    (25)
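A literal transcription of Equations (23)–(25) as Python functions, mirroring the formulas as written; the share counts and prices in the example are illustrative.

    def long_value(n_long, close, cost=0.0025):
        # Equation (24): V^long_t = n^long * p^c_t * (1 - zeta)
        return n_long * close * (1.0 - cost)

    def short_value(n_short, close, close_opened, cost=0.0025):
        # Equation (25) as written: V^short_t = n^short * (p^c_t - p^c_opened * (1 - zeta))
        return n_short * (close - close_opened * (1.0 - cost))

    def stock_value_long_short(n_long, n_short, close, close_opened, cost=0.0025):
        # Equation (23): V^s_t = V^long_t + V^short_t (only one position is open at a time)
        return long_value(n_long, close, cost) + short_value(n_short, close, close_opened, cost)

    v_s = stock_value_long_short(n_long=0, n_short=10, close=95.0, close_opened=100.0)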
V. EXPERIMENTS
In this section, we present experiments designed to answer the following questions.
• Can SIRL-Trader outperform state-of-the-art methods?
• What leads to the gain obtained by SIRL-Trader?
• How do hyperparameters affect the performance of SIRL-Trader?
• Is SIRL-Trader robust to high transaction costs?

A. EXPERIMENTAL SETUP

1) Datasets
We test algorithmic trading methods using two datasets with different numbers of stocks included in the S&P 500 index. We obtain stock data consisting of opening, high, low, closing, and volume values from Yahoo Finance [32]. We use a smaller dataset to compare SIRL-Trader with other methods in detail. To ensure the diversity of price trends, we select six stocks with different price trends (upward, sideways, and downward), as shown in Fig. 7. We use a larger dataset to verify the generalization ability of the methods. Similar to [4], we select 56 stocks from twelve different sectors as shown in Table IV. For the two datasets, the training period is from Jan. 2014 to Dec. 2018, and the test period is from Jan. 2019 to Dec. 2020. To simulate the real trading environment, we do not use any information from the current trading day for testing.

2) Evaluation Metrics
For each method, we evaluate the rate of return (V_end − V_start)/V_start obtained using starting capital V_start of

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2021.3127209, IEEE Access

D.-Y. Park and K.-H. Lee: Practical Algorithmic Trading Using State Representation Learning and Imitative Reinforcement Learning

$10,000 after a trading test period. We also evaluate the Sharpe ratio [33], which measures the return of an investment compared to its risk. In Equation (26), E[R] is the expected return, and σ[R] is the standard deviation of the return, which is a measure of fluctuations, that is, the risk. A greater Sharpe ratio indicates a higher risk-adjusted return rate. We use the change rate of the portfolio value as the return in Equation (26).

Sharpe_ratio = E[R] / σ[R]    (26)
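A small sketch of the two evaluation metrics: the rate of return over the test period and the Sharpe ratio of Equation (26), computed here from daily change rates of a made-up portfolio-value series.

    import numpy as np

    def rate_of_return(v_start, v_end):
        # (V_end - V_start) / V_start over the test period
        return (v_end - v_start) / v_start

    def sharpe_ratio(returns):
        # Equation (26): E[R] / sigma[R], where R is the change rate of the portfolio value
        returns = np.asarray(returns, dtype=float)
        return returns.mean() / (returns.std() + 1e-8)

    values = np.array([10000.0, 10100.0, 10050.0, 10300.0])   # illustrative portfolio values
    daily_returns = np.diff(values) / values[:-1]
    print(rate_of_return(values[0], values[-1]), sharpe_ratio(daily_returns))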
3) Baseline Methods
The state-of-the-art methods compared with SIRL-Trader are listed below. For each method, network structures and hyperparameters are manually optimized as stated below¹. We use the same reward function in Equation (19) and the Adam optimizer for all methods.
• Buy and Hold (B&H) method buys the stock on the first day of the test and holds it throughout the test period. This directly reflects price trends.
• K-line [3] clusters candlestick components using the FCM method. The policy network comprises three ReLU dense layers with 128 units and a softmax output layer. We set the number of clusters to 5, the sliding window size to 10, the learning rate to 0.0002, and ε to 0.9 with a decay of 0.95.
• TFJ-DRL [4] uses a gate structure, GRU, a temporal attention mechanism, and a regression network for SRL. It combines SRL with a policy gradient method. The policy and regression networks comprise two ReLU dense layers with 128 and 64 units, a dropout layer with an elimination fraction of 0.3, and a softmax output layer. We set the sliding window size to 3, the mini-batch size to 32, the learning rate to 0.0001, and ε to 0.7 with a decay of 0.9.
• iRDPG [5] uses GRU and imitative actor-critic reinforcement learning. The actor network comprises a ReLU dense layer with 16 units and a softmax output layer. The critic network comprises a ReLU dense layer with 16 units and a linear output layer. We set the sliding window size to 10, the standard deviation of the noise (or the noise size) for the exploration to 0.9, the mini-batch size to 32, and the learning rate to 0.0001.
• GDPG [6] uses GRU and DDPG. The actor network comprises two GRU layers with 20 and 24 units, a dropout layer with an elimination fraction of 0.3, and a softmax output layer. The critic network comprises a concatenate layer, two ReLU dense layers with 64 and 16 units, and a linear output layer. We set the sliding window size to 5, the noise size for the exploration to 0.6, the mini-batch size to 64, and the learning rate to 0.001.

¹ Unstated network structures and hyperparameters are set to the same as in the original paper.

4) Implementation Details of SIRL-Trader
In the offline SRL, we set the tumbling window size for the z-score normalization to 20, the dimensionality threshold F in Fig. 2 to eight, and the number of clusters to 20. In the online SRL, we set the size of the input sliding window to five and the number of units of the LSTM layer to 128. In the reinforcement learning, we set the transaction cost ζ in Equation (21) to 0.25%, the N of the multistep learning to two, the h for determining the action of the expert to 1.001, the α and β of the dynamic delay to four and two, the noise size σ for the exploration to 0.7, the noise size σ′ for the regularization to 0.7, the clipping size c to 1, the mini-batch size to 64, and the learning rate to 0.0001.

FIGURE 7: Stocks with different price trends: (a) upward trending, (b) sideways trending, (c) downward trending

B. EXPERIMENTAL RESULTS

1) Comparison with Other Methods
We compare SIRL-Trader with other methods in detail using the smaller dataset with various price trends. As shown in Table III, for the stocks trending upward such as AAPL and AMD, all methods make good profits, but SIRL-Trader has the best return rate and Sharpe ratio. For the stocks trending sideways such as DUK and K, SIRL-Trader is the best in K; GDPG is the best in DUK, but the difference between the return rates of GDPG and SIRL-Trader is small. For the stocks trending downward such as CCL and OXY, most of the methods suffer losses, but SIRL-Trader makes a good profit. Particularly in CCL, all other methods suffer significant losses, but SIRL-Trader makes a huge profit. Fig. 8, which is extracted from the test period in CCL, shows that SIRL-Trader can fully exploit the price fluctuations compared with other methods in the dotted area. We can also observe that TFJ-DRL trades too frequently and iRDPG too rarely. In

summary, SIRL-Trader can yield significant profits in the stocks with different trends by integrating all the techniques in Table II.

TABLE III: Experimental results on the smaller dataset

Rate of Return
Trend | Stocks | K-line | GDPG | B&H | iRDPG | TFJ-DRL | SIRL-Trader
Upward Trending | AAPL | 60.5% | 271.1% | 235.0% | 237.6% | 265.7% | 344.7%
Upward Trending | AMD | 94.8% | 195.0% | 385.8% | 388.9% | 355.7% | 427.7%
Sideways Trending | DUK | -23.5% | 24.6% | 7.8% | 6.7% | 8.7% | 20.0%
Sideways Trending | K | -1.2% | 8.6% | 9.6% | 8.3% | 11.1% | 43.3%
Downward Trending | CCL | -39.6% | -35.8% | -56.5% | -43.7% | -14.3% | 120.0%
Downward Trending | OXY | -39.5% | -10.6% | -72.0% | -29.1% | 27.9% | 39.3%
Minimum | | -39.6% | -35.8% | -72.0% | -43.7% | -14.3% | 20.0%
Maximum | | 94.8% | 271.1% | 385.8% | 388.9% | 355.7% | 427.7%
Average | | 8.6% | 75.5% | 84.9% | 94.8% | 109.1% | 165.8%

Sharpe Ratio
Trend | Stocks | K-line | GDPG | B&H | iRDPG | TFJ-DRL | SIRL-Trader
Upward Trending | AAPL | 1.37 | 3.05 | 2.53 | 2.55 | 2.75 | 3.39
Upward Trending | AMD | 1.28 | 1.77 | 2.34 | 2.36 | 2.31 | 2.51
Sideways Trending | DUK | -0.48 | 1.12 | 0.39 | 0.37 | 0.41 | 0.66
Sideways Trending | K | 0.13 | 0.41 | 0.43 | 0.40 | 0.86 | 1.38
Downward Trending | CCL | -0.75 | 0.02 | -0.05 | 0.15 | 0.16 | 1.56
Downward Trending | OXY | -0.38 | 0.10 | -0.49 | -1.57 | 1.09 | 0.79
Minimum | | -0.75 | 0.02 | -0.49 | -1.57 | 0.16 | 0.66
Maximum | | 1.37 | 3.05 | 2.53 | 2.55 | 2.75 | 3.39
Average | | 0.20 | 1.08 | 0.86 | 0.71 | 1.26 | 1.71

Number of Trading Actions
Trend | Stocks | K-line | GDPG | B&H | iRDPG | TFJ-DRL | SIRL-Trader
Upward Trending | AAPL | 234 | 30 | 1 | 1 | 5 | 27
Upward Trending | AMD | 124 | 75 | 1 | 1 | 3 | 25
Sideways Trending | DUK | 123 | 104 | 1 | 1 | 6 | 20
Sideways Trending | K | 109 | 67 | 1 | 1 | 52 | 73
Downward Trending | CCL | 132 | 151 | 1 | 15 | 118 | 78
Downward Trending | OXY | 78 | 62 | 1 | 8 | 22 | 37

FIGURE 8: The trading actions performed in CCL

We verify the generalization ability of all methods using the larger dataset. As shown in Table IV and Fig. 9, SIRL-Trader outperforms all other methods in terms of the minimum, maximum, and average return rate and Sharpe ratio. SIRL-Trader achieves an average return rate of 57.8%, which is 14.1 percentage points higher than the second-highest method, iRDPG. The average Sharpe ratio of SIRL-Trader is 1.06, which is 0.25 higher than the second-highest method, iRDPG. These results indicate that the reinforcement learning algorithm of SIRL-Trader, which is integrated with offline/online SRL, is effective for generalization.

FIGURE 9: Experimental results on the larger dataset

2) Ablation Study
We conduct an ablation study to demonstrate the contribution of each component of SIRL-Trader. We exclude the components one by one and report the results in Fig. 10. The excluded components are dimensionality reduction (Dim), clustering (Clu), multistep learning (Mul), regression model (Reg), and imitation learning (Imi); ‘All’ denotes SIRL-Trader with all the components. We evaluate the minimum, maximum, and average performance on the smaller dataset. The results show that all the components contribute to improving the performance. In particular, the offline SRL (Dim and Clu) is very crucial.

FIGURE 10: Experimental results of the ablation study

To demonstrate the effectiveness of the dynamic delay, we compare it with static delays ranging from two to five. As shown in Fig. 11, the dynamic delay significantly improves the performance compared with the static delays.

FIGURE 11: Comparison of static and dynamic delays

3) Comparison with Other Hyperparameter Values
We conduct experiments to evaluate the effect of important hyperparameters of SIRL-Trader, including the threshold for

the dimensionality reduction, the number of clusters, the sliding window size, and the noise size for the exploration. We evaluate the average return rate on the smaller dataset.

FIGURE 12: Comparison with other hyperparameter values

From Figs. 12(a) and (b), we can see that if the dimensionality threshold (or the number of clusters) is too small or too large, it degrades the performance. This is because of the trade-off between information loss (if too small) and noise inclusion (if too large). Fig. 12(c) shows that as the sliding window size increases, the performance decreases. This is because price information older than one week does not aid in decision-making and just increases the amount of noise. Fig. 12(d) shows that if the noise size is too small or too large, it degrades the performance. This is because of the exploration-exploitation trade-off. Greater noise means more exploration and less exploitation. Smaller noise indicates the opposite.

4) Robustness Study
In a real trading environment, transaction costs such as transaction fees, taxes, and trading slippage² exist. We study the robustness of SIRL-Trader by varying the transaction cost ζ in Equation (21). We evaluate the average return rate on the smaller dataset. Fig. 13 shows that the average return decreases as the transaction cost increases for all methods. However, SIRL-Trader shows the best results regardless of the transaction cost. We observe that, though the transaction cost of 0.35% is much higher than that of real trading environments, SIRL-Trader still makes a good profit.

FIGURE 13: Experimental results of the robustness study

² The difference between the expected price at which a trade takes place and the actual price at which the trade is executed.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a practical algorithmic trading method, SIRL-Trader, which achieves good profit using only long positions. We used offline/online SRL and imitative reinforcement learning to learn a profitable trading policy from nonstationary and noisy stock data. In the offline SRL, we used dimensionality reduction and clustering to extract robust features. In the online SRL, we co-trained a regression model with a reinforcement learning model to provide accurate state information for decision-making. In the imitative reinforcement learning, we incorporated a behavior cloning technique with the TD3 algorithm and applied multistep learning and dynamic delay to TD3. The experimental results showed that SIRL-Trader yields significantly higher profits and has superior generalization ability compared with state-of-the-art methods. We expect our approach to be generalizable to other complex sequential decision-making problems. Finally, we plan to apply our work to futures markets where both long and short positions are always possible.

REFERENCES
[1] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 653–664, 2017.
[2] Y. Li, W. Zheng, and Z. Zheng, “Deep robust reinforcement learning for practical algorithmic trading,” IEEE Access, vol. 7, pp. 108014–108022, 2019.
[3] D. Fengqian and L. Chao, “An adaptive financial trading system using deep reinforcement learning with candlestick decomposing features,” IEEE Access, vol. 8, pp. 63666–63678, 2020.
[4] K. Lei, B. Zhang, Y. Li, M. Yang, and Y. Shen, “Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading,” Expert Syst. Appl., vol. 140, p. 112872, 2020.
[5] Y. Liu, Q. Liu, H. Zhao, Z. Pan, and C. Liu, “Adaptive quantitative trading: An imitative deep reinforcement learning approach,” in Proc. AAAI, vol. 34, no. 2, 2020, pp. 2128–2135.
[6] X. Wu, H. Chen, J. Wang, L. Troiano, V. Loia, and H. Fujita, “Adaptive stock trading strategies with deep reinforcement learning methods,” Inf. Sci., vol. 538, pp. 142–158, 2020.
[7] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. ICML, 2018, pp. 1587–1596.
[8] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artif. Intell., vol. 101, no. 1-2, pp. 99–134, 1998.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[10] R. Bellman, “On the theory of dynamic programming,” PNAS, vol. 38, no. 8, p. 716, 1952.
[11] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI, vol. 30, no. 1, 2016.


TABLE IV: Experimental results on the larger dataset


Sectors | Stocks | Rate of Return (K-line, B&H, GDPG, TFJ-DRL, iRDPG, SIRL-Trader) | Sharpe Ratio (K-line, B&H, GDPG, TFJ-DRL, iRDPG, SIRL-Trader)
CMCSA 26.1% 51.9% 21.8% 36.8% 48.7% 85.4% 1.04 1.19 0.70 1.28 1.14 2.06
EA 70.9% 78.0% 17.5% 50.0% 76.0% 82.3% 1.44 1.37 1.26 1.09 1.35 1.42
FB 51.4% 99.9% 117.4% 85.4% 98.9% 106.6% 1.11 1.57 1.78 2.43 1.56 1.64
Communication Services
NFLX 98.5% 100.5% 60.8% 86.2% 104.2% 108.1% 1.62 1.50 1.20 1.39 1.54 1.63
OMC -4.6% -14.4% 3.9% 5.0% -11.3% -9.5% 0.10 -0.04 0.28 0.31 0.03 0.07
T -17.3% -2.9% 6.4% 7.0% -3.8% 10.8% -0.32 0.13 0.49 0.37 0.10 0.71
BBY 26.1% 85.5% 79.8% 94.5% 96.2% 101.7% 0.79 1.31 1.26 1.46 1.59 1.76
CCL -39.6% -56.5% -35.8% -14.3% -43.7% 120.0% -0.75 -0.05 0.02 0.16 0.15 1.56
F -19.7% 11.0% 27.9% 3.5% 21.3% 51.9% -0.15 0.47 0.84 0.25 0.62 1.07
Consumer Discretionary
GPC 12.7% 5.7% -7.8% 12.7% 32.4% 57.8% 0.49 0.37 0.10 0.49 0.80 1.35
HRB -32.7% -38.1% -18.1% -29.3% 14.9% -17.6% -1.33 -0.41 -1.13 -0.24 0.54 -0.43
WHR 3.1% 66.4% 62.8% 81.8% 185.6% 73.7% 0.29 1.08 1.61 1.22 2.35 1.14
HSY -8.8% 43.7% 24.1% 51.5% 42.0% 45.9% -0.11 1.10 0.78 1.28 1.08 1.17
K -1.2% 9.6% 8.6% 11.1% 8.3% 43.3% 0.13 0.43 0.41 0.86 0.40 1.38
MO -15.4% -17.0% -14.7% -1.9% -9.6% -7.5% -0.42 -0.21 -0.22 0.13 -0.10 0.02
Consumer Staple
PG 9.6% 51.8% 55.4% 26.8% 57.6% 54.0% 0.47 1.33 1.48 0.85 1.79 1.38
SYY 2.8% 19.4% 48.5% 21.4% 17.8% 20.3% 0.38 0.61 1.73 0.63 1.52 0.64
WMT 25.8% 54.0% 60.4% 25.7% 54.0% 60.4% 1.34 1.42 1.67 1.31 1.42 1.57
ABT -3.7% 56.8% 68.3% 51.0% 55.3% 62.3% 0.09 1.25 1.49 1.33 1.23 1.41
COO 20.4% 43.6% 28.5% 26.6% 40.9% 69.8% 0.73 1.05 0.80 0.77 1.01 1.69
Health Care
VRTX 11.3% 43.0% 31.1% 37.4% 40.8% 70.3% 0.48 0.95 1.03 0.91 0.92 1.43
XRAY 9.1% 37.7% 30.8% 23.3% 76.9% 43.9% 0.42 0.85 0.76 1.24 1.68 0.96
APA 55.8% -47.7% -20.0% -39.9% -36.5% -28.1% 0.94 0.22 -1.01 0.27 -1.68 0.42
COP -29.7% -36.9% -24.2% -27.1% -5.0% 24.5% -0.49 -0.21 -0.11 -0.05 -0.28 0.68
Energy OXY -39.5% -72.0% -10.6% 27.9% -29.1% 39.3% -0.38 -0.49 0.10 1.09 -1.57 0.79
SLB -14.3% -41.3% -27.8% -37.9% -28.0% -18.9% 0.01 -0.21 -0.06 -0.17 -0.71 0.14
XOM -43.2% -40.9% -36.1% -32.4% -23.4% -10.1% -0.83 -0.66 -1.44 -1.00 -0.26 0.06
AMG 26.8% 3.3% 17.9% 24.1% 37.7% 52.5% 0.75 0.38 0.74 0.82 0.86 0.99
COF 30.8% 27.5% 26.3% 13.3% 26.3% 29.1% 0.77 0.70 0.69 0.64 0.69 0.72
Financials
NTRS 7.9% 10.5% 7.5% 9.7% 9.1% 13.3% 0.40 0.46 0.39 0.43 0.44 0.50
STT -24.5% 13.5% 37.2% 3.4% 13.0% 22.8% -0.50 0.52 0.84 0.30 0.51 0.64
EXPD 17.0% 40.8% 43.8% 21.6% 68.9% 57.9% 0.68 1.09 1.16 0.71 2.30 1.48
Industrials GE 3.4% 39.2% 33.4% 78.8% 38.0% 86.7% 0.29 0.81 0.76 1.17 0.81 1.22
JCI 16.7% 52.2% 16.1% 43.6% 46.4% 119.9% 0.58 1.15 0.58 1.07 1.16 2.36
AAPL 60.5% 235.0% 271.1% 265.7% 237.6% 344.7% 1.37 2.53 3.05 2.75 2.55 3.39
AMD 94.8% 385.8% 195.0% 355.7% 388.9% 427.7% 1.28 2.34 1.77 2.31 2.36 2.51
CSCO 21.6% 3.9% 5.5% 2.7% 31.4% 15.7% 0.81 0.32 0.34 0.30 0.88 0.55
Information Technology
IBM -15.3% 8.9% -2.0% 7.1% 7.6% 12.5% -0.44 0.42 0.13 0.38 0.39 0.49
INTC -16.2% 5.5% 27.9% 33.6% 33.3% 19.2% -0.27 0.39 0.78 1.29 0.83 0.95
ORCL 4.6% 42.7% 42.7% 15.8% 42.0% 50.9% 0.30 1.01 1.02 0.56 1.00 1.20
APD 30.0% 70.0% 65.8% 72.1% 68.7% 72.1% 0.87 1.36 1.32 1.54 1.34 1.58
FCX 81.8% 150.5% 150.8% 111.5% 155.4% 201.7% 1.41 1.52 1.79 1.43 1.55 2.36
Materials NEM -0.9% 73.3% 72.4% 27.8% 65.9% 73.9% 0.21 1.31 1.32 1.14 1.23 1.34
NUE -2.7% 1.6% -4.5% 7.3% 10.5% 15.3% 0.05 0.31 0.19 0.40 0.80 0.54
VMC 42.8% 51.4% 19.8% 49.5% 41.0% 64.6% 0.98 1.01 0.61 1.00 0.89 1.31
DRE 10.5% 59.2% 47.0% 31.2% 57.1% 73.2% 0.53 1.19 1.08 1.13 1.17 1.47
EQR 17.5% -7.2% 19.5% 1.7% 19.4% 18.9% 0.72 0.12 0.68 0.21 0.65 0.60
Real Estate HST 49.3% -11.0% -10.4% 3.3% 44.6% 38.8% 1.16 0.19 -0.53 0.32 0.89 1.01
IRM -6.2% -8.4% 0.3% 13.0% 12.8% 9.2% -0.26 0.07 0.29 0.61 1.04 0.42
VNO 1.3% -38.1% -18.2% -13.9% -35.7% -16.2% 0.26 -0.31 -0.28 0.09 -1.33 0.08
AES 37.5% 65.3% 22.4% 67.9% 91.4% 76.6% 0.85 1.13 0.71 1.39 1.76 1.74
DUK -23.5% 7.8% 24.6% 8.7% 6.7% 20.0% -0.48 0.39 1.12 0.41 0.37 0.66
Utilities ETR -6.9% 18.6% 42.8% 4.3% 17.1% 35.6% -0.07 0.59 1.19 0.38 0.57 0.89
FE -16.3% -16.8% 8.2% -5.2% -13.6% -0.8% -0.11 -0.05 0.42 0.13 -0.11 -0.39
PPL -19.9% 0.1% -6.1% 0.1% -2.0% 7.8% -0.25 0.25 0.13 0.25 0.21 0.40
ETF SPY 7.7% 47.9% 50.5% 50.5% 48.5% 54.0% 0.46 1.29 1.35 1.87 1.30 1.42
Minimum -43.2% -72.0% -36.1% -39.9% -43.7% -28.1% -1.33 -0.66 -1.44 -1.00 -1.68 -0.43
Maximum 98.5% 385.8% 271.1% 355.7% 388.9% 427.7% 1.62 2.53 3.05 2.75 2.55 3.39
Average 10.4% 30.7% 31.5% 33.7% 43.7% 57.8% 0.35 0.69 0.70 0.80 0.81 1.06

[12] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in Proc. AAAI, vol. 32, no. 1, 2018.
[13] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in Proc. ICLR, 2016.
[14] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in Proc. ICML, 2016, pp. 1995–2003.
[15] R. S. Sutton, A. G. Barto et al., Introduction to reinforcement learning. Cambridge: MIT Press, 1998, vol. 135.
[16] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in Proc. ICML, 2017, pp. 449–458.
[17] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis et al., “Noisy networks for exploration,” in Proc. ICLR, 2018.
[18] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proc. NIPS, 2000, pp. 1057–1063.
[19] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. ICML, 2014, pp. 387–395.
[20] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,”


in Proc. ICLR, 2016.


[21] L.-J. Cao and F. E. H. Tay, “Support vector machine with adaptive
parameters in financial time series forecasting,” IEEE Trans. Neural Netw.,
vol. 14, no. 6, pp. 1506–1518, 2003.
[22] X. Ding, Y. Zhang, T. Liu, and J. Duan, “Deep learning for event-driven
stock prediction,” in Proc. IJCAI, 2015.
[23] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and
A. Iosifidis, “Forecasting stock prices from the limit order book using
convolutional neural networks,” in Proc. CBI, vol. 1, 2017, pp. 7–12.
[24] E. Chong, C. Han, and F. C. Park, “Deep learning networks for stock
market analysis and prediction: Methodology, data representations, and
case studies,” Expert Syst. Appl., vol. 83, pp. 187–205, 2017.
[25] L. Zhang, C. Aggarwal, and G.-J. Qi, “Stock price prediction via discov-
ering multi-frequency trading patterns,” in Proc. KDD, 2017, pp. 2141–
2149.
[26] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj, “Temporal
attention-augmented bilinear network for financial time-series data analy-
sis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1407–1418,
2018.
[27] F. Feng, X. He, X. Wang, C. Luo, Y. Liu, and T.-S. Chua, “Temporal
relational ranking for stock prediction,” ACM Trans. Inf. Syst., vol. 37,
no. 2, pp. 1–30, 2019.
[28] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[29] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN
encoder-decoder for statistical machine translation,” in Proc. EMNLP,
2014.
[30] TA-Lib: Technical analysis library. [Accessed: 11-Sept-2021]. [Online].
Available: http://ta-lib.org/
[31] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-means clustering
algorithm,” Comput. Geosci., vol. 10, no. 2-3, pp. 191–203, 1984.
[32] Yahoo Finance. [Accessed: 11-Sept-2021]. [Online]. Available: https:
//finance.yahoo.com/
[33] W. F. Sharpe, “The Sharpe ratio,” J. Portf. Manag., vol. 21, no. 1, pp. 49–
58, 1994.

VII. APPENDIX
TABLE V: List of abbreviations

Abbreviation Description
A3C Asynchronous Advantage Actor-Critic
B&H Buy and Hold
CE Cross Entropy
DPG Deterministic Policy Gradient
DDPG Deep Deterministic Policy Gradient
DDQN Double Deep Q-Network
DQN Deep Q-Network
FCM Fuzzy C-Means
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
MDP Markov Decision Process
MSE Mean Squared Error
SIRL State representation learning and Imitative Reinforcement Learning
SRL State Representation Learning
TD3 Twin-Delayed Deep Deterministic policy gradient

