
2023 IEEE 8th International Conference on Big Data Analytics (ICBDA)

A Futures Quantitative Trading Strategy Based on a Deep Reinforcement Learning Algorithm

Xuemei Chen
School of Information Management, Wuhan University, Wuhan, China
18262629056@139.com

Haoran Guo
Ningbo Institute of Artificial Intelligence, Shanghai Jiao Tong University, Ningbo, China
2016579609@qq.com

Abstract—Deep reinforcement learning (DRL) is a class of machine learning algorithms that has attracted considerable attention for its applications in finance. Based on the proximal policy optimization (PPO) algorithm in deep reinforcement learning, this paper designs a trading strategy for the Chinese futures market and realizes an end-to-end decision-making process from futures data to trading actions. Using domestic rebar futures data, multiple historical periods were selected for backtesting and compared with traditional trading strategies. The results show that the strategy is profitable in 83.3% of the 12 selected test periods, better than the 33.3% of mean reversion (MR) and the 25% of trend following (TF). Compared with traditional methods, the proposed strategy adapts well whether the futures market rises or falls, and can reduce losses through trading even when the market price changes sharply, thereby increasing the return on investment.

Keywords—machine learning, Proximal Policy Optimization algorithm, deep reinforcement learning, financial trading

I. INTRODUCTION

Financial trading includes technical analysis, fundamental analysis, and algorithmic trading[1]. Electronic limit order books are now used in almost all markets; hence algorithmic trading, usually defined as the use of computer algorithms to automatically make trading decisions, submit orders, and manage those orders after submission, has become widespread[2]. Reinforcement learning interacts with the environment, receives a reward according to the environment's feedback, adjusts its policy based on that reward, and ultimately maximizes the expected reward[3]. The combination of deep learning and reinforcement learning increases the adaptability of reinforcement learning to different situations[4]. Since the introduction of deep reinforcement learning, a series of related algorithms have been developed and widely applied to news recommendation, robot control, and financial trading.

In 2015, AlphaGo, developed by Google, used deep learning to defeat the European Go champion[5]. Improved versions based on deep reinforcement learning subsequently achieved even better results, which drew further attention from scholars. Deep learning has many applications in time series forecasting, and some papers using deep learning to predict financial time series have been highly influential[6][7]. The ultimate purpose of forecasting financial time series is to use the forecast as a reference for trading decisions. Realizing an end-to-end process from financial data to transaction decision-making can further improve the automation of trading. Deep reinforcement learning has the ability to self-learn, which makes it well suited to realizing this end-to-end process from raw financial data to trading decisions.

Many scholars have used deep reinforcement learning to implement timing strategies, stock selection strategies, statistical arbitrage strategies, and portfolio strategies[8][9][10][11]. Current research mainly uses US stock data, and the trading frequency is generally measured in days[12]. The following issues remain to be explored.

First, research on China's financial sector is insufficient, and research on China's futures market data is especially rare. The factors that affect price changes differ between futures and stocks, so conclusions drawn from US stock data cannot be applied directly to futures. Second, when the trading frequency is measured in days, the interval between trading moments is large, prices on different days may be affected by news and other information, and the price dynamics before and after may differ considerably, which degrades model performance. If millisecond-level high-frequency data is used instead, algorithm performance is strongly constrained by hardware, so a suitable time scale for trading must be chosen. In addition, new deep reinforcement learning algorithms continue to emerge, and the deep Q-learning algorithms used in some earlier studies perform worse than more recent methods that have achieved good results on other problems. These newer methods still need to be applied to trading decision-making.

This paper addresses the above issues. Based on the proximal policy optimization algorithm, we build a futures trading model, use high-quality futures data from the Chinese futures market, select the contract with the largest daily trading volume, and conduct intra-day timing trading. The experiments show that, for the selected rebar futures data, trading at minute intervals makes it easier to capture price trends and good trading opportunities when combined with information such as trading volume. Finally, this paper conducts backtesting on rebar futures data, compares the trading performance of the deep reinforcement learning timing strategy with traditional trading strategies, and finds that the proposed scheme achieves better results than traditional methods.

II. DATASETS AND PREPROCESSING

The data used in this article is high-quality rebar futures data from China's futures market, which records information such as the transaction price and transaction volume of rebar futures during daily trading hours. A record is taken every 500 milliseconds, covering December 2018 to October 2021. On a given day there are generally 12 rebar futures contracts with different expiry dates. The contract with the largest volume over a multi-day time frame is called the dominant contract during that period. As shown in Figure 1, the dominant contract corresponding to different times differs.

Fig. 1. The main contract changes over time

In the experiment, the price of the dominant contract is used as the transaction reference price. The data is divided into several small datasets. Each small dataset contains ten days of data, of which eight days are used for training and two days for testing. When demarcating the small datasets, we ensure that the ten days fall within two consecutive weeks and correspond to the same contract. Because the raw records are at the millisecond level, the data is aggregated into new records that describe how the market changes every minute.

The futures trading price at time t is denoted p_t. The total transaction volume from time t-1 to time t is denoted vol_t, and the total transaction value over this period divided by the transaction volume, denoted m_t, gives the volume-weighted average price. From the changes of p_t, the technical indicators moving average convergence divergence (MACD) and Bollinger Bands (Boll) are calculated: the difference line (DIF) of MACD at time t is denoted macd_t and its signal line (DEA) is denoted macds_t; the middle line of the Bollinger Bands is denoted B_t, and the upper and lower bands, set at 2 standard deviations, are denoted Bup_t and Blp_t respectively.

The daily trading hours are divided into three sessions: 21:00-23:00 the previous evening, 9:00-11:30 in the morning, and 13:30-15:00 in the afternoon. Because prices are still unstable at the beginning of trading, trading starts at 21:20 and 9:20, 20 minutes after each opening.

III. PROBLEM FORMALISATION

The problem this paper aims to solve is to use a deep reinforcement learning algorithm to make trading decisions given a fixed initial capital, and to obtain better returns than traditional strategies within a fixed period of time. The following basic assumptions are made:

Assumption 1: The liquidity of all market assets is high enough that each trade can be executed immediately at the ordered price, i.e., zero slippage.

Assumption 2: The actions of the agent do not affect the market.

Assumption 3: The volume of each asset is large enough that the model can buy or sell it on any trading day[9].

In addition, futures have a margin mechanism. The margin mechanism scales the profit amount by a certain proportion, but it does not determine whether a trade is profitable. Therefore, it is assumed that neither the reinforcement learning strategy nor the traditional strategies consider the margin mechanism when trading.

Fig. 2. The agent-environment interaction in reinforcement learning[9]

The structure of the reinforcement learning part is shown in Figure 2. The trading strategy is made by the agent, and the state, action, and return are defined below[13]. We first introduce some mathematical notation:

p_t: futures price at time t.

F_t: the transaction fee at time t; the fee rate is set to 1/10000.

N_t: the number of futures contracts held.

When the position is switched from long to short or from short to long, the existing position must be closed first. Formula (1) gives the change of cash when the position is closed:

C'_t = C_{t-1} ± p_t N_{t-1} - F'_t    (1)

N_t is then the largest integer not exceeding C'_t / p_t.

H_t: the contract value of the position at time t, calculated from formula (2):

H_t = N_t p_t    (2)

C_t: the amount of cash remaining at time t. After the old position is closed and a new one is opened, the cash at time t is given by formula (3):

C_t = C'_t ± H_t - F''_t    (3)

V_t: total assets at time t, that is, the total value of the contracts held plus the remaining cash. The initial capital is set to one million:

V_t = H_t + C_t    (4)

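The cash and position bookkeeping of equations (1)-(4) can be illustrated with a short Python sketch. This is an illustration only, not the authors' code: the function and variable names are invented, and the sign handling is one plausible reading of the ± in formulas (1) and (3); the 1/10000 fee rate follows the text above.

import math

FEE_RATE = 1e-4  # transaction fee rate of 1/10000, as stated above

def flip_position(cash_prev, contracts_prev, price, old_dir):
    # Close the current position and open one in the opposite direction.
    # old_dir is +1 if the position being closed is long, -1 if it is short.

    # Equation (1): closing the old position changes cash by +/- p_t * N_{t-1},
    # minus the closing fee F'_t.
    close_value = price * contracts_prev
    cash_closed = cash_prev + old_dir * close_value - FEE_RATE * close_value

    # N_t is the largest integer not exceeding C'_t / p_t.
    contracts_new = math.floor(cash_closed / price)

    # Equation (2): H_t = N_t * p_t, the value of the newly opened position.
    position_value = contracts_new * price

    # Equation (3): opening the new position in the opposite direction
    # changes cash by -/+ H_t, minus the opening fee F''_t.
    new_dir = -old_dir
    cash_new = cash_closed - new_dir * position_value - FEE_RATE * position_value

    # Equation (4): total assets V_t = H_t + C_t.
    total_assets = position_value + cash_new
    return cash_new, contracts_new, position_value, total_assets

Starting from the initial capital of one million, applying this update at each position change yields the sequence of total assets V_t used in the backtests.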
P_t: indicates whether the current position is long or short. -1 means the agent currently holds a short (sell) position, and +1 means it currently holds a long (buy) position.

a_t: trading action at time t. 0 means sell, 1 means buy.

s_t: state at time t, a tuple consisting of

(p_{t-i}, m_{t-i}, vol_{t-i}, macd_{t-i}, macds_{t-i}, Bup_{t-i}, B_{t-i}, Blp_{t-i}, ..., P_t)

where i = 0, 1, ..., 14. The 15 observations from time t-14 to time t, combined with the position status, are input to the actor network as the state, and the output is the action at time t+1.

R_t: reward at time t, defined as R_t = (V_t - V_{t-1}) / V_t.

π_t(a | s): the probability that a_t = a if s_t = s.

IV. METHOD

Proximal Policy Optimization (PPO)[9] is an actor-critic method in which an actor selects an action given an input state. The estimated value function is known as the critic, as it evaluates the action chosen by the actor. The critic learns and evaluates the policy adopted by the actor and uses the temporal difference error to update the policy. The overall structure is shown in Figure 3[14].

Fig. 3. Actor Critic Structure

PPO is introduced to control the policy gradient update and ensure that the new policy does not differ too much from the previous one[15].

Let r_t(θ) denote the probability ratio:

r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)    (5)

The clipped surrogate objective function of PPO is:

J^CLIP(θ) = E[ min( r_t(θ) Â(s_t, a_t), clip(r_t(θ), 1-ε, 1+ε) Â(s_t, a_t) ) ]    (6)

where Â(s_t, a_t) is the estimated advantage function. The function clip(r_t(θ), 1-ε, 1+ε) clips the ratio r_t(θ) to lie within [1-ε, 1+ε].

The temporal difference error is denoted δ_t:

δ_t = r_t + γ V(s_{t+1}) - V(s_t)    (7)

Here t is the time index in [0, T] within a given trajectory segment of length T, γ is the discount factor, and λ is the decay parameter of the advantage estimator. The estimated advantage is:

Â_t = δ_t + (γλ) δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1}    (8)

The structure of the actor and critic networks is shown in Figure 4 and Figure 5.

Fig. 4. Actor network

Fig. 5. Critic network

The comparison methods used in the paper are as follows[16]: buy and hold (B&H), sell and hold (S&H), trend following with moving averages (TF), and mean reversion with moving averages (MR). B&H and S&H are said to be passive, as the trading position does not change over the trading horizon. TF and MR are active trading strategies, issuing multiple changes in trading position over the trading horizon[17].

V. RESULTS AND ANALYSIS

A. Dataset Specific Transaction Performance

Using the above method, we train on several small datasets and then evaluate the trading performance on the corresponding test sets. Consider one dataset and plot the rise and fall of funds during trading, as shown in Figure 6, taking the dataset of 20200706-20200717 as an example. The upper part of Figure 6 shows the change of the rebar futures price over time, and the lower part shows the change of the profit and loss amount over time. A small red triangle indicates a buy at that moment, and a small green triangle indicates a sell at that moment.

It can be seen that when the price of rebar futures dropped sharply in the middle of the period, the algorithm avoided large losses by continuously adjusting its trading actions. In the later stage of the test set, the algorithm captured the upward price trend and achieved substantial profits.

Fig. 6. A test set of trading conditions and changes in profit and loss

Then, as shown in Figure 7, we plot the distribution of profit and loss per trade and the distribution of holding time on the training set and test set of the small dataset.

Fig. 7. Trading profit & loss and position time distribution

Panels (a) and (b) of Figure 7 show the distribution of profit and loss per transaction over all trading actions in the training set and the test set, respectively. The profit and loss of the trades approximately follow a normal distribution.

Panels (c) and (d) of Figure 7 show the distribution of position holding time over the trades in the training set and test set, respectively. Both approximately follow an exponential distribution.

The distributions of the transaction data are consistent between the training set and the test set, which indicates that our algorithm has learned the regularities in the data and performs stably on the test set.

B. Comparison of Trading Performance of Different Trading Strategies in Multiple Datasets

Abbreviating our approach from Section IV as DRLPPO, we compare it with the four traditional methods mentioned in Section IV on multiple datasets. Table 1 shows the trading profit and loss of the different methods on multiple test sets under the same initial conditions. The numbers in the table indicate the amount of profit and loss. Three experiments were conducted on each test set, and the table reports the average profit and loss over the three experiments.

TABLE 1. PROFIT & LOSS PERFORMANCE OF DIFFERENT METHODS ON THE TEST SET
(Comparison of the profit and loss of different trading strategies in different test sets)

Train and test set data range | B&H | S&H | MR | TF | DRLPPO
20181217-20181228 | -691 | 276 | 1574 | -5065 | 2674
20190415-20190426 | -3340 | 2924 | -23355 | -13774 | -2964
20190513-20190524 | 6839 | -7272 | 11014 | -18975 | 4620
20200706-20200717 | -1488 | 1014 | -17786 | -16095 | 6645
20200803-20200814 | 3300 | -3699 | -14787 | -9361 | 241
20201026-20201106 | 2772 | -3191 | -6247 | -28853 | 3446
20201109-20201120 | 10521 | -10883 | -2265 | -18019 | 2367
20201207-20201218 | 26529 | -26741 | 7467 | -43126 | 14461
20210222-20210305 | -20306 | 20062 | -30191 | 10429 | -6850
20210315-20210326 | 28312 | -28857 | 14696 | -34853 | 4692
20210517-20210528 | 45090 | -45055 | -30794 | 5590 | 30781
20210705-20210716 | 6103 | -6479 | -35982 | 1462 | 14539
Profitable test set proportion | 66.7% | 33.3% | 33.3% | 25% | 83.3%

It can be seen from the table that the DRLPPO method proposed in this paper is profitable in 10 of the 12 test periods and is the best among all strategies on multiple test sets. The strategy of buying futures, holding the position, and closing it only at the end (B&H) is simple, and because it does not trade frequently its transaction costs are low. Its performance is acceptable when the market is rising, but it may suffer larger losses in a falling market. The strategy of selling and closing the position only at the end of the test set (S&H) works well in a falling market, but it shares the same weakness as B&H. At minute-level trading frequency, prices change frequently, and the traditional trend following (TF) and mean reversion (MR) strategies trade so often that they incur a large amount of handling fees, so their trading performance is not ideal. The strategy we propose is profitable in rising markets and in some falling markets, and can also reduce losses through buying and selling when prices fluctuate sharply, showing better adaptability over many test intervals.
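As a point of reference for the TF and MR columns in Table 1, both baselines can be written as simple moving-average rules. The sketch below is one common formulation, not necessarily the exact rules used in the paper; the window lengths are illustrative assumptions.

import pandas as pd

def trend_following_signal(price: pd.Series, fast: int = 5, slow: int = 20) -> pd.Series:
    # Trend following (TF): long when the fast moving average is above the
    # slow one, short otherwise. Window lengths are illustrative.
    fast_ma = price.rolling(fast).mean()
    slow_ma = price.rolling(slow).mean()
    return (fast_ma > slow_ma).astype(int) * 2 - 1  # +1 long, -1 short

def mean_reversion_signal(price: pd.Series, window: int = 20) -> pd.Series:
    # Mean reversion (MR): short when price is above its moving average
    # (expecting a pull-back), long when it is below.
    ma = price.rolling(window).mean()
    return (price < ma).astype(int) * 2 - 1  # +1 long, -1 short

Because such rules flip the position every time the signal changes sign, at minute frequency they generate many round trips, and the 1/10000 fee on each leg accumulates, which is consistent with the weak MR and TF results in Table 1.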

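For completeness, the PPO update behind DRLPPO, i.e., the probability ratio, clipped surrogate objective, and advantage estimate of equations (5)-(8), can be sketched in a few lines of NumPy. This is an illustrative fragment rather than the authors' implementation; the default values of gamma, lam, and eps are assumptions.

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Generalized advantage estimation, equations (7)-(8).
    # rewards: r_t for t = 0..T-1; values: V(s_t) for t = 0..T (one bootstrap value).
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # equation (7)
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):  # equation (8), accumulated backwards
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    # Clipped surrogate objective of PPO, equations (5)-(6).
    ratio = np.exp(np.asarray(new_logp) - np.asarray(old_logp))  # r_t(theta), equation (5)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))  # equation (6), to be maximized

In training, this objective is maximized with respect to the actor parameters while the critic is fitted to the observed returns; Figures 4 and 5 show the corresponding actor and critic network structures.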
VI. CONCLUSION

With the continuing opening up of the domestic financial sector and the continuing entry of foreign investment institutions, research on financial investment is crucial to national financial security. Artificial intelligence algorithms continue to develop and are widely used in the financial field. In this context, using deep reinforcement learning algorithms for futures trading has important theoretical and practical significance.

Based on the proximal policy optimization (PPO) algorithm, this paper designs a futures trading strategy that realizes an end-to-end decision-making process from the historical trading price and volume data of rebar futures to the trading action. The proposed strategy was compared with traditional trading strategies in backtests run in a simulated environment. The results show that our strategy is profitable in 10 of the 12 selected test periods. At the minute-level trading scale, prices fluctuate frequently and the traditional strategies trade more often. The PPO-based strategy proposed in this paper jointly considers the impact of volume-price data and transaction costs on trading profit, showing stability superior to the traditional strategies. This indicates that our strategy adapts well whether the futures market rises or falls, and can reduce losses through trading even when the market price changes sharply, thereby increasing the return on investment.

Future research can focus on the loss-making periods, analyze the causes of the losses, and further optimize trading performance. In addition to volume and price data, the impact of financial news on price trends can be considered to improve the accuracy of algorithmic trading[18].

REFERENCES

[1] ZHANG Z, ZOHREN S, ROBERTS S. Deep reinforcement learning for trading [J]. The Journal of Financial Data Science, 2020, 2(2): 25-40.
[2] HENDERSHOTT T, RIORDAN R. Algorithmic trading and information [J]. Manuscript, University of California, Berkeley, 2009.
[3] LI Y. Deep reinforcement learning: An overview [J]. arXiv preprint arXiv:1701.07274, 2017.
[4] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing Atari with deep reinforcement learning [J]. arXiv preprint arXiv:1312.5602, 2013.
[5] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search [J]. Nature, 2016, 529(7587): 484-489.
[6] FISCHER T, KRAUSS C. Deep learning with long short-term memory networks for financial market predictions [J]. European Journal of Operational Research, 2018, 270(2): 654-669.
[7] KRAUSS C, DO X A, HUCK N. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500 [J]. European Journal of Operational Research, 2017, 259(2): 689-702.
[8] KAWY R A, ABDELMOEZ W M, SHOUKRY A. Financial portfolio construction for quantitative trading using deep learning technique [C]// Proceedings of the International Conference on Artificial Intelligence and Soft Computing. Springer, 2021.
[9] ZHANG H, JIANG Z, SU J. A deep deterministic policy gradient-based strategy for stocks portfolio management [C]// Proceedings of the 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA). IEEE, 2021.
[10] JIANG Z, XU D, LIANG J. A deep reinforcement learning framework for the financial portfolio management problem [J]. arXiv preprint arXiv:1706.10059, 2017.
[11] LI J, WANG X, LIN Y, et al. Generating realistic stock market order streams [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2020.
[12] XIONG Z, LIU X-Y, ZHONG S, et al. Practical deep reinforcement learning approach for stock trading [J]. arXiv preprint arXiv:1811.07522, 2018.
[13] YANG H, LIU X-Y, ZHONG S, et al. Deep reinforcement learning for automated stock trading: An ensemble strategy [J]. Available at SSRN, 2020.
[14] SUTTON R S, BARTO A G. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 1998.
[15] SCHULMAN J, MORITZ P, LEVINE S, et al. High-dimensional continuous control using generalized advantage estimation [J]. arXiv preprint arXiv:1506.02438, 2015.
[16] CHAN E P. Quantitative Trading: How to Build Your Own Algorithmic Trading Business [M]. John Wiley & Sons, 2021.
[17] THÉATE T, ERNST D. An application of deep reinforcement learning to algorithmic trading [J]. Expert Systems with Applications, 2021, 173: 114632.
[18] WANG J, SHI J, HAN D, et al. Internet financial news and prediction for stock market: An empirical analysis of tourism plate based on LDA and SVM [J]. Journal of Advances in Information Technology, 2019, 10(3).
