Practical Algorithmic Trading Using State Representation Learning and Imitative Reinforcement Learning
D.-Y. Park and K.-H. Lee
ABSTRACT Algorithmic trading allows investors to avoid emotional and irrational trading decisions and
helps them make profits using modern computer technology. In recent years, reinforcement learning has
yielded promising results for algorithmic trading. Two prominent challenges in algorithmic trading with
reinforcement learning are (1) extracting robust features and (2) learning a profitable trading policy. Another
challenge is that it was previously often assumed that both long and short positions are always possible in
stock trading; however, taking a short position is risky or sometimes impossible in practice. We propose a
practical algorithmic trading method, SIRL-Trader, which achieves good profit using only long positions.
SIRL-Trader uses offline/online state representation learning (SRL) and imitative reinforcement learning.
In offline SRL, we apply dimensionality reduction and clustering to extract robust features, whereas in online SRL we co-train a regression model with a reinforcement learning model to provide accurate state
information for decision-making. In imitative reinforcement learning, we incorporate a behavior cloning
technique with the twin-delayed deep deterministic policy gradient (TD3) algorithm and apply multistep
learning and dynamic delay to TD3. The experimental results show that SIRL-Trader yields higher profits
and offers superior generalization ability compared with state-of-the-art methods.
INDEX TERMS algorithmic trading, deep learning, state representation learning, imitation learning,
reinforcement learning.
consent. The borrowed stocks are essentially loans from stockbrokers. If the stock price rises, the stockbroker requires additional capital, and if the investor cannot provide capital, the stockbroker can close the position, resulting in a loss. Third, the stock market generally increases over time. Fourth, economic regulators can severely restrict or temporarily ban short-selling during an economic crisis. In summary, a short position is not recommended for individual investors. However, obtaining good profits using only long positions is challenging in algorithmic trading.

In this paper, we propose a practical algorithmic trading method named SIRL-Trader, which generates good profits using only long positions. First, we devise an offline/online state representation learning (SRL) method. The offline unsupervised SRL reduces the dimensionality and applies clustering to extract robust features from observations. The online supervised SRL co-trains a regression model that predicts the next price with a reinforcement learning model to provide accurate state information for decision-making. Second, we combine imitation learning with reinforcement learning by cloning the behavior of a prophetic expert who has information about subsequent price movements. Third, we extend the twin-delayed deep deterministic policy gradient (TD3) algorithm [7], which is a state-of-the-art reinforcement learning method, to incorporate offline/online SRL, behavior cloning, multistep learning, and dynamic delay. Fourth, compared with state-of-the-art algorithmic trading methods, SIRL-Trader yields higher profit and has superior generalization ability for different stocks.

The remainder of this paper is organized as follows. Section II introduces reinforcement learning methods, and Section III reviews existing work. Sections IV and V present SIRL-Trader and experimental results, respectively. Finally, Section VI presents our conclusions and suggestions for future work. For ease of reading, Table V in the Appendix lists the abbreviations used in this paper.

II. BACKGROUND

A. REINFORCEMENT LEARNING
In reinforcement learning, an agent learns to act in an environment to maximize the total reward. At each time step, the environment provides a state s to the agent, the agent selects and takes an action a, and then the environment provides a reward r and the next state s'. This interaction can be formalized as a Markov decision process (MDP), which is a tuple ⟨S, A, P, R, γ⟩, where S is a finite set of states, A is a finite set of actions, P(s, a, s') is a state transition probability, R(s, a) is a reward function, and γ ∈ [0, 1] is the discount factor, a trade-off between immediate and long-term rewards. In reinforcement learning for algorithmic trading, the state is not directly given and needs to be constructed from a history of observations. To accommodate this, the MDP model was extended with an observation probability P(o | s, a). The extended model is referred to as the partially observable MDP (POMDP) model [8].

The agent selects an action using a deterministic policy µ(s) or a stochastic policy π(a | s) that defines the probability distribution over actions for each state. The discounted sum of future rewards collected by the agent from the state s_t is defined as the discounted return G_t in Equation (1).

G_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i} = r_t + \gamma G_{t+1}    (1)

B. DEEP Q-NETWORK (DQN)
In value-based reinforcement learning, the agent learns an estimate of the expected discounted return, or value, for each state (Equation (2)) or for each state and action pair (Equation (3)).

V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[G_t \mid S_t = s]    (2)

Q^{\pi}(s, a) = \mathbb{E}_{a \sim \pi}[G_t \mid S_t = s, A_t = a]    (3)

A common way of deriving a new policy π' from Q^π(s, a) is to act ε-greedily with respect to the actions. With probability (1 − ε), the agent takes the action with the highest Q-value (the greedy action), that is, π'(s) = argmax_{a ∈ A} Q^π(s, a). With probability ε, the agent takes a random action to introduce exploration.

The deep Q-network (DQN) [9] introduces deep neural networks to approximate the Q-value for large state and action spaces. DQN uses a replay buffer to store past experiences as tuples ⟨s, a, r, s'⟩ and learns by sampling batches from the replay buffer. DQN uses two neural networks: the online network (Q_θ) and the target network (Q_{θ'}). The parameters θ of the online network are periodically copied to the target network during training. The loss function of DQN in Equation (5) is the mean squared error (MSE) between Q_θ(s, a) and the target value Y_DQN in Equation (4), which uses the Bellman equation [10]. The techniques of experience replay and the target network enable stable learning.

Y_{DQN} = r + \gamma \max_{a'} Q_{\theta'}(s', a')    (4)

L_{DQN} = \mathbb{E}[(Y_{DQN} - Q_{\theta}(s, a))^2]    (5)

Because we take the maximum value over the target network in Equation (4), we often obtain an overestimated value. Double DQN (DDQN) [11] solves this overestimation problem of DQN by decoupling the selection of the action from its evaluation, as shown in Equation (6).

Y_{DDQN} = r + \gamma Q_{\theta'}(s', \arg\max_{a'} Q_{\theta}(s', a'))    (6)

Rainbow DQN [12] is an improvement on DQN and combines several features including DDQN, prioritized experience replay [13], dueling networks [14], multistep learning [15], distributional reinforcement learning [16], and noisy networks [17].
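To make the DQN and Double DQN targets in Equations (4)–(6) concrete, the following minimal PyTorch sketch computes both for a batch of transitions; the network interfaces and tensor shapes are our assumptions for illustration, not code from the paper.

```python
import torch

def dqn_and_ddqn_targets(rewards, next_states, q_online, q_target, gamma=0.99):
    """Illustrative DQN (Eq. (4)) and Double DQN (Eq. (6)) targets; not the paper's code.
    rewards: (batch,) tensor; q_online/q_target map states to (batch, num_actions) Q-values."""
    with torch.no_grad():
        q_next = q_target(next_states)                        # Q_theta'(s', .)
        # DQN: bootstrap from the maximum target-network Q-value (Eq. (4)).
        y_dqn = rewards + gamma * q_next.max(dim=1).values
        # Double DQN: the online network selects a', the target network evaluates it (Eq. (6)).
        a_star = q_online(next_states).argmax(dim=1, keepdim=True)
        y_ddqn = rewards + gamma * q_next.gather(1, a_star).squeeze(1)
    return y_dqn, y_ddqn
```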
C. ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC (A3C)
In policy-based reinforcement learning, the agent directly learns a policy function π(a | s). Actor-critic methods combine policy-based and value-based reinforcement learning methods. In these methods, the critic network (V_θ) learns the value function, and the actor network (π_φ) learns the policy in the direction suggested by the critic. In general, the loss function of the critic network is defined in Equation (8), which uses the Bellman equation; the loss function of the actor network is defined in Equation (9), which uses the stochastic policy gradient theorem [18].

Y = r + \gamma V_{\theta}(s')    (7)

L_{critic} = \mathbb{E}[(Y - V_{\theta}(s))^2]    (8)

First, TD3 adds clipped random noise to the target action, as in Equation (15), which is used to obtain the target value Y in Equation (16). This technique, known as target policy smoothing, serves as a regularizer to avoid overfitting to sharp peaks in the Q-value estimate. The noise is clipped to limit its impact.

a' = \mu_{\phi'}(s') + \epsilon, \quad \epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma'), -c, c)    (15)

Second, TD3 uses two critics (and two target critics) to solve the overestimation problem of DDPG. The minimum value of the pair of target critics is used to compute the target value, as shown in Equation (16). This technique is known as clipped double Q-learning.

Y = r + \gamma \min_{j=1,2} Q_{\theta'_j}(s', a')    (16)
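The two TD3 techniques above can be read together as follows; this is a generic, illustrative PyTorch sketch of Equations (15) and (16) under assumed network interfaces and default constants, not the authors' implementation.

```python
import torch

def td3_target(rewards, next_states, actor_target, critic1_target, critic2_target,
               gamma=0.99, sigma_prime=0.2, c=0.5):
    """Illustrative TD3 target using target policy smoothing (Eq. (15))
    and clipped double Q-learning (Eq. (16)); network interfaces are assumed."""
    with torch.no_grad():
        mu = actor_target(next_states)                                 # mu_phi'(s')
        noise = (torch.randn_like(mu) * sigma_prime).clamp(-c, c)      # eps ~ clip(N(0, sigma'), -c, c)
        next_actions = mu + noise                                      # a' (Eq. (15))
        q_min = torch.min(critic1_target(next_states, next_actions),
                          critic2_target(next_states, next_actions))   # min_j Q_theta'_j(s', a')
        return rewards + gamma * q_min                                 # Y (Eq. (16))
```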
IV. SIRL-TRADER
In this section, we describe SIRL-Trader, the proposed algorithmic trading method.
A. ARCHITECTURE
We propose an actor-critic reinforcement learning method
that extends TD3 to incorporate offline/online SRL, imitation
learning, multistep learning, and dynamic delay. Fig. 1 shows
the proposed architecture. Each component is explained in
detail in the following subsections.
g = \sigma(W \cdot f + b)    (17)

f' = g \odot f    (18)

FIGURE 3: Architecture for online SRL

of shares of a stock and sells all shares of the stock at the closing price.

2) Reward
We define the reward r_t for action a_t as the change rate of the portfolio value, as in Equation (19). The portfolio value V^p_t is the sum of the stock value V^s_t and the remaining cash balance V^c_t, as in Equation (20).

r_t = a_t \times (V^p_{t+1} - V^p_t) / V^p_t    (19)

V^p_t = V^s_t + V^c_t    (20)
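As a concrete reading of Equations (17)–(19), the following minimal NumPy sketch shows the gating operation and the reward computation; the function signatures are hypothetical and the code is only an illustration of the formulas.

```python
import numpy as np

def gate(f, W, b):
    """Feature gating (Eqs. (17)-(18)): g = sigmoid(W·f + b), f' = g ⊙ f.
    W is assumed to be a square weight matrix so that g has the same shape as f."""
    g = 1.0 / (1.0 + np.exp(-(W @ f + b)))
    return g * f

def reward(a_t, v_p_t, v_p_next):
    """Reward (Eq. (19)): the action-weighted change rate of the portfolio value V^p."""
    return a_t * (v_p_next - v_p_t) / v_p_t
```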
5.  for e = 1 to N_epochs do
6.      Initialize an N-step buffer D
7.      Compute the delay value for each epoch: d ← (e mod α) + β
8.      for t = N_w to T − 1 do
9.          Select an action with exploration noise ε ∼ N(0, σ): a_t ← µ_φ(s_t) + ε, where s_t = λ_υ(w_t)
10.         Observe a reward r_t and the next input w_{t+1}
11.         Store a transition ⟨w_t, a_t, r_t, w_{t+1}⟩ to D
12.         if t ≥ N_w + N − 1 then
13.             Obtain an N-step transition ⟨w_{t−N+1:t+1}, a_{t−N+1:t}, r_{t−N+1:t}⟩ from D and store it to B
14.             Sample a mini-batch of B transitions of length N from B
15.             Smooth the target policy with ε ∼ clip(N(0, σ′), −c, c): a′_{t+1} ← µ_{φ′}(s_{t+1}) + ε, where s_{t+1} = λ_{υ′}(w_{t+1})
16.             Y ← Σ_{i=0}^{N−1} γ^i r_{t−N+1+i} + γ^N min_{j=1,2} Q_{θ′_j}(s^j_{t+1}, a′_{t+1}), where s^j_{t+1} = o_{η′_j}(w_{t+1})
17.             Update the critics θ_j by the MSE loss: (1/B) Σ (Y − Q_{θ_j}(s^j_{t−N+1}, a_{t−N+1}))², where s^j_{t−N+1} = o_{η_j}(w_{t−N+1})
21.             Update the actor φ by the CE loss for the behavior cloning: (1/B) Σ ∇CE(a_{t−N+1}, a_expert), where a_{t−N+1} = µ_φ(λ_υ(w_{t−N+1}))
22.             Soft-update the target networks: θ′_{1,2} ← τ θ_{1,2} + (1 − τ) θ′_{1,2}, η′_{1,2} ← τ η_{1,2} + (1 − τ) η′_{1,2}, φ′ ← τ φ + (1 − τ) φ′, υ′ ← τ υ + (1 − τ) υ′
23.             end if
24.         end if
25.     end for
26. end for
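The key numeric steps of this pseudocode (the dynamic delay in step 7, the N-step clipped double-Q target in step 16, and the soft target update in step 22) can be sketched in PyTorch as follows; tensor shapes and interfaces are our assumptions, and the state encoders λ_υ and o_η are omitted for brevity.

```python
import torch

def dynamic_delay(epoch, alpha, beta):
    """Step 7: actor/target update delay for the current epoch."""
    return (epoch % alpha) + beta

def n_step_target(rewards, q1_next, q2_next, gamma):
    """Step 16: N-step return plus the discounted clipped double-Q bootstrap.
    rewards: (batch, N) tensor; q1_next, q2_next: (batch,) target-critic values."""
    n = rewards.shape[1]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype, device=rewards.device)
    return (rewards * discounts).sum(dim=1) + (gamma ** n) * torch.min(q1_next, q2_next)

def soft_update(target_net, online_net, tau):
    """Step 22: Polyak averaging of the target-network parameters."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```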
$10,000 after a trading test period. We also evaluate the Sharpe ratio [33], which measures the return of an investment compared to its risk. In Equation (26), E[R] is the expected return, and σ[R] is the standard deviation of the return, which is a measure of fluctuations, that is, the risk. A greater Sharpe ratio indicates a higher risk-adjusted return rate. We use the change rate of the portfolio value as the return in Equation (26).

Sharpe\_ratio = \mathbb{E}[R] / \sigma[R]    (26)

3) Baseline Methods
The state-of-the-art methods compared with SIRL-Trader are listed below. For each method, network structures and hyperparameters are manually optimized as stated below¹. We use the same reward function in Equation (19) and the Adam optimizer for all methods.
• The Buy and Hold (B&H) method buys the stock on the first day of the test and holds it throughout the test period. This directly reflects price trends.
• K-line [3] clusters candlestick components using the FCM method. The policy network comprises three ReLU dense layers with 128 units and a softmax output layer. We set the number of clusters to 5, the sliding window size to 10, the learning rate to 0.0002, and ε to 0.9 with a decay of 0.95.
• TFJ-DRL [4] uses a gate structure, GRU, temporal attention mechanism, and a regression network for SRL. It combines SRL with a policy gradient method. The

¹ Unstated network structures and hyperparameters are set to the same as in the original paper.

4) Implementation Details of SIRL-Trader
In the offline SRL, we set the tumbling window size for the z-score normalization to 20, the dimensionality threshold F in Fig. 2 to eight, and the number of clusters to 20. In the online SRL, we set the size of the input sliding window to five and the number of units of the LSTM layer to 128. In the reinforcement learning, we set the transaction cost ζ in Equation (21) to 0.25%, the N of the multistep learning to two, the h for determining the action of the expert to 1.001, the α and β of the dynamic delay to four and two, the noise size σ for the exploration to 0.7, the noise size σ′ for the regularization to 0.7, the clipping size c to 1, the mini-batch size to 64, and the learning rate to 0.0001.

B. EXPERIMENTAL RESULTS

1) Comparison with Other Methods
We compare SIRL-Trader with other methods in detail using the smaller dataset with various price trends. As shown in Table III, for the stocks trending upward such as AAPL and AMD, all methods make good profits, but SIRL-Trader has the best return rate and Sharpe ratio. For the stocks trending sideways such as DUK and K, SIRL-Trader is the best in K; GDPG is the best in DUK, but the difference between the return rates of GDPG and SIRL-Trader is small. For the stocks trending downward such as CCL and OXY, most of the methods suffer losses, but SIRL-Trader makes a good profit. Particularly in CCL, all other methods suffer significant losses, but SIRL-Trader makes a huge profit. Fig. 8, which is extracted from the test period in CCL, shows that SIRL-Trader can fully exploit the price fluctuations compared with other methods in the dotted area. We can also observe that TFJ-DRL trades too frequently and iRDPG too rarely.
In summary, SIRL-Trader can yield significant profits in the stocks with different trends by integrating all the techniques in Table II.

TABLE III: Experimental results on the smaller dataset

Rate of Return
                     K-line   GDPG     B&H      iRDPG    TFJ-DRL  SIRL-Trader
Upward     AAPL      60.5%    271.1%   235.0%   237.6%   265.7%   344.7%
Trending   AMD       94.8%    195.0%   385.8%   388.9%   355.7%   427.7%
Sideways   DUK       -23.5%   24.6%    7.8%     6.7%     8.7%     20.0%
Trending   K         -1.2%    8.6%     9.6%     8.3%     11.1%    43.3%
Downward   CCL       -39.6%   -35.8%   -56.5%   -43.7%   -14.3%   120.0%
Trending   OXY       -39.5%   -10.6%   -72.0%   -29.1%   27.9%    39.3%
           Minimum   -39.6%   -35.8%   -72.0%   -43.7%   -14.3%   20.0%
           Maximum   94.8%    271.1%   385.8%   388.9%   355.7%   427.7%
           Average   8.6%     75.5%    84.9%    94.8%    109.1%   165.8%

Sharpe Ratio
                     K-line   GDPG     B&H      iRDPG    TFJ-DRL  SIRL-Trader
Upward     AAPL      1.37     3.05     2.53     2.55     2.75     3.39
Trending   AMD       1.28     1.77     2.34     2.36     2.31     2.51
Sideways   DUK       -0.48    1.12     0.39     0.37     0.41     0.66
Trending   K         0.13     0.41     0.43     0.40     0.86     1.38
Downward   CCL       -0.75    0.02     -0.05    0.15     0.16     1.56
Trending   OXY       -0.38    0.10     -0.49    -1.57    1.09     0.79
           Minimum   -0.75    0.02     -0.49    -1.57    0.16     0.66
           Maximum   1.37     3.05     2.53     2.55     2.75     3.39
           Average   0.20     1.08     0.86     0.71     1.26     1.71

Number of Trading Actions
                     K-line   GDPG     B&H      iRDPG    TFJ-DRL  SIRL-Trader
Upward     AAPL      234      30       1        1        5        27
Trending   AMD       124      75       1        1        3        25
Sideways   DUK       123      104      1        1        6        20
Trending   K         109      67       1        1        52       73
Downward   CCL       132      151      1        15       118      78
Trending   OXY       78       62       1        8        22       37

These results indicate that the reinforcement learning algorithm of SIRL-Trader, which is integrated with offline/online SRL, is effective for generalization.

FIGURE 9: Experimental results on the larger dataset

2) Ablation Study
We conduct an ablation study to demonstrate the contribution of each component of SIRL-Trader. We exclude the components one by one and report the results in Fig. 10. The excluded components are dimensionality reduction (Dim), clustering (Clu), multistep learning (Mul), regression model (Reg), and imitation learning (Imi); 'All' denotes SIRL-Trader with all the components. We evaluate the minimum, maximum, and average performance on the smaller dataset. The results show that all the components contribute to improving the performance. In particular, the offline SRL (Dim and Clu) is very crucial.
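The Sharpe ratios reported in Table III follow Equation (26); the minimal sketch below shows one way to compute the metric from a sequence of portfolio values, under the assumption that the per-step return is the change rate of the portfolio value. It is an illustration, not the authors' evaluation code.

```python
import numpy as np

def sharpe_ratio(portfolio_values):
    """Eq. (26): mean of the per-step portfolio returns divided by their standard deviation."""
    values = np.asarray(portfolio_values, dtype=float)
    returns = np.diff(values) / values[:-1]   # change rate of the portfolio value
    return returns.mean() / returns.std()
```

For example, sharpe_ratio([10000.0, 10100.0, 10050.0, 10200.0]) evaluates the metric over a short portfolio-value trajectory.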
the dimensionality reduction, number of clusters, sliding window size, and noise size for the exploration. We evaluate the average return rate on the smaller dataset.

From Figs. 12(a) and (b), we can see that if the dimensionality threshold (or the number of clusters) is too small or too large, it degrades the performance. This is because of the trade-off between information loss (if too small) and noise inclusion (if too large). Fig. 12(c) shows that as the sliding window size increases, the performance decreases. This is because price information older than one week does not aid in decision-making and just increases the amount of noise. Fig. 12(d) shows that if the noise size is too small or too large, it degrades the performance. This is because of the exploration-exploitation trade-off. Greater noise means more exploration and less exploitation; smaller noise indicates the opposite.

4) Robustness Study
In a real trading environment, transaction costs such as transaction fees, taxes, and trading slippages² exist. We study the robustness of SIRL-Trader by varying the transaction cost ζ in Equation (21). We evaluate the average return rate on the smaller dataset. Fig. 13 shows that the average return decreases as the transaction cost increases for all methods. However, SIRL-Trader shows the best results regardless of the transaction cost. We observe that, even though a transaction cost of 0.35% is much higher than that of real trading environments, SIRL-Trader still makes a good profit.

² The difference between the expected price at which a trade takes place and the actual price at which the trade is executed.

FIGURE 13: Experimental results of the robustness study

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a practical algorithmic trading method, SIRL-Trader, which achieves good profit using only long positions. We used offline/online SRL and imitative reinforcement learning to learn a profitable trading policy from nonstationary and noisy stock data. In the offline SRL, we used dimensionality reduction and clustering to extract robust features. In the online SRL, we co-trained a regression model with a reinforcement learning model to provide accurate state information for decision-making. In the imitative reinforcement learning, we incorporated a behavior cloning technique with the TD3 algorithm and applied multistep learning and dynamic delay to TD3. The experimental results showed that SIRL-Trader yields significantly higher profits and has superior generalization ability compared with state-of-the-art methods. We expect our approach to be generalizable to other complex sequential decision-making problems. Finally, we plan to apply our work to futures markets where both long and short positions are always possible.

REFERENCES
[1] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 653–664, 2017.
[2] Y. Li, W. Zheng, and Z. Zheng, “Deep robust reinforcement learning for practical algorithmic trading,” IEEE Access, vol. 7, pp. 108014–108022, 2019.
[3] D. Fengqian and L. Chao, “An adaptive financial trading system using deep reinforcement learning with candlestick decomposing features,” IEEE Access, vol. 8, pp. 63666–63678, 2020.
[4] K. Lei, B. Zhang, Y. Li, M. Yang, and Y. Shen, “Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading,” Expert Syst. Appl., vol. 140, p. 112872, 2020.
[5] Y. Liu, Q. Liu, H. Zhao, Z. Pan, and C. Liu, “Adaptive quantitative trading: An imitative deep reinforcement learning approach,” in Proc. AAAI, vol. 34, no. 2, 2020, pp. 2128–2135.
[6] X. Wu, H. Chen, J. Wang, L. Troiano, V. Loia, and H. Fujita, “Adaptive stock trading strategies with deep reinforcement learning methods,” Inf. Sci., vol. 538, pp. 142–158, 2020.
[7] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. ICML, 2018, pp. 1587–1596.
[8] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artif. Intell., vol. 101, no. 1-2, pp. 99–134, 1998.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[10] R. Bellman, “On the theory of dynamic programming,” PNAS, vol. 38, no. 8, p. 716, 1952.
[11] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI, vol. 30, no. 1, 2016.
[12] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in Proc. AAAI, vol. 32, no. 1, 2018.
[13] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in Proc. ICLR, 2016.
[14] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in Proc. ICML, 2016, pp. 1995–2003.
[15] R. S. Sutton, A. G. Barto et al., Introduction to Reinforcement Learning. Cambridge: MIT Press, 1998, vol. 135.
[16] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in Proc. ICML, 2017, pp. 449–458.
[17] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis et al., “Noisy networks for exploration,” in Proc. ICLR, 2018.
[18] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proc. NIPS, 2000, pp. 1057–1063.
[19] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. ICML, 2014, pp. 387–395.
[20] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proc. ICLR, 2016.
VII. APPENDIX
TABLE V: List of abbreviations
Abbreviation Description
A3C Asynchronous Advantage Actor-Critic
B&H Buy and Hold
CE Cross Entropy
DPG Deterministic Policy Gradient
DDPG Deep Deterministic Policy Gradient
DDQN Double Deep Q-Network
DQN Deep Q-Network
FCM Fuzzy C-Means
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
MDP Markov Decision Process
MSE Mean Squared Error
SIRL State representation learning and Imitative Reinforcement Learning
SRL State Representation Learning
TD3 Twin-Delayed Deep Deterministic policy gradient