Article
Reinforcement Learning for Options Trading
Wen Wen 1,2 , Yuyu Yuan 1,2, * and Jincui Yang 1,2
1 School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and
Telecommunications, Beijing 100876, China; wen-wen@bupt.edu.cn (W.W.); jincuiyang@bupt.edu.cn (J.Y.)
2 Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education,
Beijing 100876, China
* Correspondence: yuanyuyu@bupt.edu.cn
Abstract: Reinforcement learning has been applied to trading various types of financial assets, such as stocks, futures, and cryptocurrencies. Options, as a kind of derivative, have their own characteristics: there are many option contracts for one underlying asset, their price behavior differs from contract to contract, and the validity period of an option contract is relatively short. To apply reinforcement learning to options trading, we propose the options trading reinforcement learning (OTRL) framework. We use the options' underlying asset data to train the reinforcement learning model, utilizing candle data at different time intervals. A protective closing strategy is added to the model to prevent unbearable losses. Our experiments demonstrate that the most stable algorithm for obtaining high returns is proximal policy optimization (PPO) with the protective closing strategy. The deep Q network (DQN) can exceed the buy and hold strategy in options trading, as can soft actor critic (SAC). The effectiveness of the OTRL framework is thus verified.
Keywords: reinforcement learning; options trading; data augmentation; protective closing strategy
sell the agreed amount from the option seller at the agreed time and the agreed price.
Several papers have studied options trading or portfolio construction in recent years. The Kelly criterion has been used to construct options portfolios [12]. Zhao et al. [13] proposed an efficient BSUM-M-based algorithm [14] to solve the portfolio design problem under the Markowitz mean-variance framework. Reference [15] incorporated volatility forecasting, utilizing investors' sentiment in the decision-making process, to improve options trading performance. Reference [16] converted index future trading into option trading strategies via an LSTM architecture and achieved superior performance.
Although the methods mentioned above demonstrate good performance in options trading or portfolio construction, they are not robust to the actual dynamic market and cannot be directly applied to algorithmic trading. Reinforcement learning agents can interact with the environment and learn decision-making parameters automatically, which makes them well suited to trading financial assets, including options. However, as far as we know, very few papers have addressed options trading with reinforcement learning, and several problems remain to be solved. First of all, there are many different options contracts for one underlying asset, and each contract has its own characteristics. Moreover, the period from a contract's listing to its expiration is relatively short, usually less than six months, and the period of active trading is even shorter, usually less than two months. Training a reinforcement learning model requires plenty of homogeneous data, so training on one contract, or even on all contracts, cannot meet this demand. Secondly, options, as a kind of derivative, are highly leveraged and volatile by nature. If options trading is not appropriately constrained, the principal may suffer a considerable drawdown.
To address the problems mentioned above, we put forward a framework for trading options using reinforcement learning. The underlying asset's trading data is used to train the RL model. Compared with option contracts, the underlying assets have a longer and more continuous trading history, which makes them suitable for RL training. To further augment the training dataset, we use candle data at different time intervals (a day, 60 min, 30 min, 15 min, and 5 min) to train the model. Different RL algorithms, such as DQN, SAC, and proximal policy optimization (PPO), are used to test our methods. To mitigate the effect of options' volatility, we add a protective closing strategy to the model, which prevents unbearable losses. The proposed model was tested on actual market data for options trading and made reliable profits on various options contracts. Furthermore, we compared the performance of models trained on datasets of different time intervals to determine the proper time interval for the training data.
Our study has research significance and practical value for algorithmic options trading. It can also serve as a reference for trading other financial assets and for other training data augmentation problems. The main contributions of this paper are as follows:
• We propose a framework to trade options using reinforcement learning. The model is
trained on the underlying asset’s trading data. Candle data in different time intervals
(a day, 60 min, 30 min, 15 min, and 5 min) are utilized to augment the training datasets.
By this means, adequate data is available for training the RL model, leading to a
satisfactory result.
• A protective closing strategy is added to the RL model to prevent unbearable losses. Due to the high volatility of option prices, the strategy protects the principal from suffering a massive drawdown. Moreover, it can improve investment returns to some extent.
• The experimental results show that the proposed model with the protective closing strategy obtains decent returns compared to the buy and hold strategy in options trading, which indicates that our model learns how to make profits from underlying asset data and applies this knowledge to options trading.
The remaining parts of this paper are organized as follows. Section 2 describes the
background on reinforcement learning, options trading, and data augmentation. Section 3
introduces the options trading reinforcement learning (OTRL) framework. Section 4 pro-
vides experimental settings, results, and evaluation. Finally, Section 5 concludes this paper.
and one type of option (calls or puts). For example, for Shanghai 50 ETF (510050), which
currently trades at 3.5, you can trade call options with strike prices of 3.4, 3.5, 3.6, 3.7, etc.,
and put options with the strike prices of 3.3, 3.4, 3.5, 3.6, etc.
3. Methodology
In this section, we introduce the options trading reinforcement learning (OTRL) framework and its protective closing strategy. Section 3.1 presents the training data for the RL model. Section 3.2 details the OTRL framework, Section 3.3 introduces the algorithms used in the framework, and Section 3.4 introduces the protective closing strategy that the framework utilizes.
However, in reality, the daily trading data of the underlying assets is not sufficient for RL training. The Shanghai 50 ETF was listed on the exchange on 23 February 2005, and the Shanghai and Shenzhen 300 ETF was listed on 28 May 2012. With about 250 trading days in a year, only a few thousand rows of daily trading data are available, which is not enough to train an RL model that needs millions of steps to converge and may lead to over-fitting. To further augment the training dataset, we use the underlying asset candle data at different time intervals (a day, 60 min, 30 min, 15 min, and 5 min) to train the RL model. Minute candle data can be used to train RL agents for data augmentation [29]; the agent is then used to guide daily trading. Data at a larger time interval (such as 60 min) may be more similar to daily data in price behavior, but there is less of it, and vice versa. In our experiment, the proper training data can be picked out from the data at different time intervals.
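As an illustration of how such multi-interval training sets could be prepared, the sketch below aggregates 5-min candles of the underlying asset into coarser intervals with pandas. The file name and column names are assumptions for illustration, not the exact pipeline used in this work.

```python
# A minimal sketch (assumed file and column names): build candle datasets at
# several time intervals from one 5-min candle file of the underlying asset.
import pandas as pd

def resample_candles(df: pd.DataFrame, rule: str) -> pd.DataFrame:
    """Aggregate OHLCV candles indexed by timestamp into a coarser interval."""
    agg = {"open": "first", "high": "max", "low": "min",
           "close": "last", "volume": "sum"}
    return df.resample(rule).agg(agg).dropna()

# Hypothetical input file with columns: datetime, open, high, low, close, volume.
candles_5min = pd.read_csv("sh510050_5min.csv",
                           parse_dates=["datetime"], index_col="datetime")

# One training dataset per interval, mirroring the intervals listed above.
datasets = {rule: resample_candles(candles_5min, rule)
            for rule in ["5min", "15min", "30min", "60min", "1D"]}
```

The daily ("1D") dataset could equally be built from exchange daily candles; resampling intraday data is just a convenient way to keep all intervals in one format.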
The trading data used to train the RL model contains five columns: open price, highest price, lowest price, close price, and trading volume. Trading data at different time intervals all have these features. When multiple features exist simultaneously and their value ranges differ greatly, the feature with the wider value range is more likely to dominate the model; in our training data, this is the trading volume. Normalization processes the data with some algorithm and limits it to a specific range. In our experiment, the features mentioned above were normalized to a mean of 0 and a variance of 1. The normalization formula is as follows:
z = \frac{x - \mu}{\sigma} (1)
where µ and σ are the mean and standard deviation of the original data, respectively. Due to the large fluctuations in stock and option prices, it is more reasonable to normalize the features over a sliding window of a certain length, rather than over the entire dataset.
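A minimal sketch of such sliding-window normalization is given below; the 50-step window length is an assumed value for illustration, and the input is taken from the resampling sketch above.

```python
# Sliding-window z-score normalization (Equation (1)); the 50-step window
# length is an illustrative assumption, not a setting reported in this paper.
import pandas as pd

def rolling_zscore(features: pd.DataFrame, window: int = 50) -> pd.DataFrame:
    """Normalize each column by the mean/std of its own trailing window."""
    mean = features.rolling(window).mean()
    std = features.rolling(window).std()
    return ((features - mean) / (std + 1e-8)).dropna()  # epsilon avoids division by zero

# Example: normalize the five OHLCV features of the 60-min training set
# produced by the resampling sketch above.
ohlcv = datasets["60min"][["open", "high", "low", "close", "volume"]]
normalized = rolling_zscore(ohlcv)
```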
Our experiment uses several RL algorithms for training, i.e., DQN, PPO, and SAC; PPO and SAC are introduced in Section 3.3. After training, the models trained on datasets at different time intervals are applied to trade the underlying assets for validation. The best-performing model is selected by trading on datasets that differ from the training datasets. In this way, the proper time interval of training data is determined for each RL algorithm, and the selected model is then used for trading options contracts.
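The selection step can be summarized schematically as below; train_agent and evaluate_return are placeholder helpers standing in for the training and backtesting routines, not functions from this paper.

```python
# Schematic sketch of interval selection by validation; train_agent and
# evaluate_return are hypothetical helpers, not part of the OTRL code base.
def select_time_interval(train_sets, valid_sets, algo="PPO"):
    """Train one agent per candle interval and keep the one with the best
    return on underlying-asset data not seen during training."""
    best = (None, None, float("-inf"))          # (interval, agent, return)
    for interval, train_data in train_sets.items():
        agent = train_agent(algo, train_data)            # DQN / PPO / SAC
        ret = evaluate_return(agent, valid_sets[interval])
        if ret > best[2]:
            best = (interval, agent, ret)
    return best[0], best[1]
```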
There are many options contracts for one underlying asset. For example, the Shanghai 50 ETF option quotations on one day are listed in Table 1, and Table 2 shows the candle data of a 50 ETF call option on that day; the close price of the 50 ETF on that day was 3.399. In our experiment, the model trades one option contract per test, to show a complete trading result on one contract during a specific period. In reality, the agent can monitor many options contracts at the same time, and we can buy the contract on which the agent takes a buy action earliest, in order to seize the trading opportunity in time, as sketched below.
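The following illustration shows this monitoring idea; the agent.act interface and the BUY constant are assumptions, not part of the original implementation.

```python
# Illustrative only: scan several option contracts with one trained agent and
# return the first one it wants to buy. agent.act and BUY are assumed names.
BUY = 1

def first_buy_signal(agent, contract_windows):
    """contract_windows maps a contract code to its latest normalized
    observation window; return the first contract with a buy signal."""
    for code, window in contract_windows.items():
        if agent.act(window) == BUY:
            return code
    return None
```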
DeepMind successfully trained DQN on a set of 49 Atari games and demonstrated the efficiency of the approach when applied to complicated environments [31]. The steps of the DQN algorithm are shown in Algorithm 1.
Algorithm 1: DQN.
1. Initialize the parameters of Q(s, a) and Q^target(s, a) with random weights and empty the replay buffer.
2. With probability ε, select a random action a; otherwise, a(s) = argmax_a Q(s, a).
3. Execute action a in an emulator and observe the reward r and the next state s'.
4. Store the transition (s, a, r, s') in the replay buffer.
5. Sample a random mini-batch of transitions from the replay buffer.
6. For every transition in the buffer, calculate the target y = r if the episode has ended at this step, otherwise y = r + γ max_{a' ∈ A} Q^target(s', a').
7. Calculate the loss: L = (Q(s, a) − y)².
8. Update Q(s, a) using the stochastic gradient descent algorithm by minimizing the loss with respect to the model parameters.
9. Every N steps, copy the weights from Q to Q^target.
10. Repeat from step 2 until converged.
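For concreteness, a compact PyTorch sketch of steps 5–9 is given below; the tensor layout and hyperparameters are illustrative assumptions rather than the settings used in the experiments.

```python
# A compact PyTorch sketch of steps 5-9 of Algorithm 1 (illustrative only).
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a sampled mini-batch of transitions."""
    states, actions, rewards, next_states, dones = batch   # pre-built tensors

    # Q(s, a) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y = r if the episode ended, otherwise r + gamma * max_a' Q_target(s', a') (step 6).
    with torch.no_grad():
        next_max = target_net(next_states).max(dim=1).values
        y = rewards + gamma * next_max * (1.0 - dones)

    loss = nn.functional.mse_loss(q_sa, y)   # L = (Q(s, a) - y)^2 (step 7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # step 8
    return loss.item()

def sync_target(q_net, target_net):
    """Every N steps, copy weights from Q to Q_target (step 9)."""
    target_net.load_state_dict(q_net.state_dict())
```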
typically taking multiple steps of (usually minibatch) SGD to maximize the objective. Here,
L is given by:
L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s, a),\ \mathrm{clip}\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_k}}(s, a) \right), (4)

in which ε is a (small) hyperparameter, which roughly says how far away the new policy is allowed to go from the old.
There is another simplified version of this objective:
L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s, a),\ g\left(\epsilon, A^{\pi_{\theta_k}}(s, a)\right) \right), (5)

where

g(\epsilon, A) = \begin{cases} (1 + \epsilon) A & A \geq 0 \\ (1 - \epsilon) A & A < 0. \end{cases} (6)
Clipping serves as a regularizer by removing incentives for the policy to change dramatically. The hyperparameter ε corresponds to how far the new policy can go from the old while still benefiting the objective.
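A short sketch of this clipped surrogate objective, computed over a batch, is shown below; the tensor names (logp, logp_old, adv) are placeholders for the new and old log-probabilities and the advantage estimates.

```python
# Sketch of the clipped surrogate objective in Equation (4); logp / logp_old are
# log pi_theta(a|s) under the new and old policies, adv is A^{pi_theta_k}(s, a).
import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    ratio = torch.exp(logp - logp_old)                  # pi_theta / pi_theta_k
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(ratio, 1-eps, 1+eps)
    # The objective is maximized, so minimize the negative of its batch mean.
    return -torch.min(ratio * adv, clipped * adv).mean()
```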
Entropy is a quantity that says how random a random variable is. Let x be a random
variable, with probability mass P. The entropy H of x is computed from its distribution P,
according to:
H(P) = \mathbb{E}_{x \sim P}[-\log P(x)]. (7)
where α > 0 is the trade-off coefficient. The policy of SAC should, in each state, act to
maximize the expected future return, plus expected future entropy.
The original SAC algorithm is designed for continuous action spaces and is not directly applicable to discrete actions. To deal with the discrete action space of our trading task, we use an alternative version of SAC that is suitable for discrete actions and is competitive with other model-free, state-of-the-art RL algorithms [33].
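With a categorical policy over discrete trading actions, the entropy of Equation (7) and the expectation in the actor objective can be computed exactly rather than estimated from samples, which is the key idea of the discrete-action SAC variant [33]. The sketch below illustrates this; it is not the authors' implementation.

```python
# Illustrative sketch of discrete-action SAC quantities; not the authors' code.
import torch

def categorical_entropy(probs):
    """H(P) = E_{x~P}[-log P(x)] for a batch of categorical distributions."""
    return -(probs * torch.log(probs + 1e-8)).sum(dim=-1)

def discrete_sac_actor_loss(probs, q_values, alpha=0.2):
    """Actor loss: expectation of (alpha * log pi - Q) under the policy,
    computed in closed form over the discrete action set."""
    log_probs = torch.log(probs + 1e-8)
    return (probs * (alpha * log_probs - q_values)).sum(dim=-1).mean()
```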
4. Experiment Result
To evaluate the effectiveness of the proposed OTRL framework, we designed an experimental procedure, as shown in Algorithm 2. Firstly, the proper time interval of the trading data is selected through the training and validation process. Secondly, the proper threshold of the protective closing strategy is determined through the retraining and validation process. Thirdly, the final models are used for trading options.
Figures 4 and 5 show the training results on the 50 ETF at five different time intervals, using the three RL algorithms. Each trial graph plots the mean over sessions with an error envelope of one standard deviation. As we can see from Figure 4, the models trained on data at larger time intervals (such as 60 min or 1 day) tend to perform better during training. Trading data at shorter time intervals (such as 5 min or 15 min) contains more noise, and it is difficult for models to learn a pattern from such noisy data. However, limited training data may lead to overfitting. To select a proper time
interval for training data, we use models trained on 60-min and 1-day trading data for the
validation process.
Figure 4. Training results on 50 ETF in two different time intervals using PPO, DQN, and SAC.
Figure 5. Training results on 50 ETF in three different time intervals using PPO, DQN, and SAC.
The agents trained on 1-day trading data perform well on the training sets but do not always outperform the B and H strategy on the validation sets. The amount of 1-day trading data in the training set is only a quarter of that of the 60-min trading data, so we infer that the poor validation performance of the agents trained on 1-day data is due to overfitting. In addition, the agents trained using PPO tend to hold the asset and reduce trading frequency, while the agents trained using DQN tend to stay out of the market and wait for trading opportunities. Furthermore, the agents trained using SAC tend to trade frequently, or even randomly.
The validation results show that the agents trained on 60-min trading data using the three algorithms perform well on trading dates that differ from the training dataset, which verifies their effectiveness. To further improve the performance of options trading, the protective closing strategy is added to the models in the next stage.
Since the models with the protective closing strategy performed well on the validation set, they can be applied to options trading in the next stage. Although the SAC model cannot be trained well with the protective closing strategy, it is still used to trade options for comparison.
Figure 10 shows the trading performance on options of the 50 ETF and the 300 ETF. The option in Figure 10a is the 2.45 call of the 50 ETF from 26 January 2017 to 27 September 2017, and the option in Figure 10b is the 4.5 call of the 300 ETF from 23 January 2020 to 23 September 2020. The DQN and PPO models were trained with a 2% stop-loss threshold and were used for trading options with the same threshold. The SAC model was trained without the protective closing strategy, but it was used to trade options with a 2% stop-loss threshold for comparison. From Figure 10 and Table 5, we can see that the RL models with the protective closing strategy outperformed the B and H trading strategy, which verifies the effectiveness of our OTRL framework. In addition, the PPO model performed best among the three models, which is likely because its learned tendency to hold positions works well with the stop-loss mechanism in options trading.
Figure 10. Trading performance on options: (a) 50 ETF 2.45 call option; (b) 300 ETF 4.5 call option.
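A sketch of the protective closing check applied during option trading is given below; measuring the loss relative to the entry price is our assumption about the exact rule, while the 2% threshold matches the experiments reported above.

```python
# Protective closing (stop-loss) sketch; the loss definition relative to the
# entry price is an assumption, the 2% threshold follows the experiments above.
def should_force_close(entry_price: float, current_price: float,
                       threshold: float = 0.02) -> bool:
    """Return True if an open long option position should be closed to cap losses."""
    drawdown = (entry_price - current_price) / entry_price
    return drawdown >= threshold

# Example: a contract bought at 0.0500 and now quoted at 0.0488 (-2.4%) is closed.
assert should_force_close(0.0500, 0.0488)
```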
5. Conclusions
In this paper, we propose the OTRL framework for trading options with RL algorithms. It provides a set of executable processes, including training, validation, and testing. The underlying assets' trading data at different time intervals was used as the training set. In our experiment, PPO, DQN, and SAC were chosen to train the models; in practice, any RL algorithm can be applied to options trading within the OTRL framework. Trading data of 50 ETF and 300 ETF options in the A-share market was used as the test set. The models with the protective closing strategy performed well on options trading compared with the B and H trading strategy, which verifies the effectiveness of the OTRL framework: it addresses the lack of training data and reduces trading risk. The PPO model with the protective closing strategy performed best among the three models, which may serve as a reference for future research.
While the proposed framework has been verified to be effective, some directions are still worth studying in the future. Multi-leg options strategies can be considered to further reduce trading risk, and meta-learning may be utilized to train the model on options with different strike prices simultaneously.
Author Contributions: Conceptualization, Y.Y. and J.Y.; methodology, W.W. and Y.Y.; software W.W.;
writing—original draft preparation, W.W., Y.Y. and J.Y.; writing—review and editing, W.W. and J.Y.
All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the National Natural Science Foundation of China (No. 91118002).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Meng, T.L.; Khushi, M. Reinforcement learning in financial markets. Data 2019, 4, 110. [CrossRef]
2. Lei, K.; Zhang, B.; Li, Y.; Yang, M.; Shen, Y. Time-driven feature-aware jointly deep reinforcement learning for financial signal
representation and algorithmic trading. Expert Syst. Appl. 2020, 140, 112872. [CrossRef]
3. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
2014, arXiv:1412.3555.
4. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked denoising autoencoders: Learning useful
representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
6. Li, Y.; Zheng, W.; Zheng, Z. Deep robust reinforcement learning for practical algorithmic trading. IEEE Access 2019, 7, 108014–108022.
[CrossRef]
7. Li, Y.; Ni, P.; Chang, V. Application of deep reinforcement learning in stock trading strategies and stock forecasting. Computing
2020, 102, 1305–1322. [CrossRef]
8. Bisht, K.; Kumar, A. Deep Reinforcement Learning based Multi-Objective Systems for Financial Trading. In Proceedings of
the 2020 5th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE), Jaipur, India, 1–3
December 2020; pp. 1–6.
9. Zhang, Z.; Zohren, S.; Roberts, S. Deep reinforcement learning for trading. J. Financ. Data Sci. 2020, 2, 25–40. [CrossRef]
10. Si, W.; Li, J.; Ding, P.; Rao, R. A multi-objective deep reinforcement learning approach for stock index future’s intraday trading.
In Proceedings of the 2017 10th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China,
9–10 December 2017; Volume 2, pp. 431–436.
11. Lucarelli, G.; Borrotti, M. A deep reinforcement learning approach for automated cryptocurrency trading. In Proceedings of the
IFIP International Conference on Artificial Intelligence Applications and Innovations, Hersonissos, Crete, Greece, 24–26 May
2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 247–258.
12. Wu, M.E.; Chung, W.H. A novel approach of option portfolio construction using the Kelly criterion. IEEE Access 2018, 6, 53044–53052.
[CrossRef]
13. Zhao, L.; Palomar, D.P. A markowitz portfolio approach to options trading. IEEE Trans. Signal Process. 2018, 66, 4223–4238.
[CrossRef]
14. Hong, M.; Chang, T.H.; Wang, X.; Razaviyayn, M.; Ma, S.; Luo, Z.Q. A block successive upper bound minimization method of
multipliers for linearly constrained convex optimization. arXiv 2014, arXiv:1401.7079.
15. Mutum, K. Volatility Forecast Incorporating Investors’ Sentiment and its Application in Options Trading Strategies: A Behavioural
Finance Approach at Nifty 50 Index. Vision 2020, 24, 217–227. [CrossRef]
16. Wu, J.M.T.; Wu, M.E.; Hung, P.J.; Hassan, M.M.; Fortino, G. Convert index trading to option strategies via LSTM architecture.
Neural Comput. Appl. 2020, 1–18. [CrossRef]
17. Hu, J.; Niu, H.; Carrasco, J.; Lennox, B.; Arvin, F. Voronoi-based multi-robot autonomous exploration in unknown environments
via deep reinforcement learning. IEEE Trans. Veh. Technol. 2020, 69, 14413–14423. [CrossRef]
18. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [CrossRef]
19. Théate, T.; Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. 2021, 173, 114632.
[CrossRef]
20. Cui, X.; Goel, V.; Kingsbury, B. Data augmentation for deep neural network acoustic modeling. IEEE ACM Trans. Audio Speech
Lang. Process. 2015, 23, 1469–1477.
21. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [CrossRef]
22. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1.
23. Fons, E.; Dawson, P.; Zeng, X.j.; Keane, J.; Iosifidis, A. Evaluating data augmentation for financial time series classification. arXiv
2020, arXiv:2010.15111.
24. Teng, X.; Wang, T.; Zhang, X.; Lan, L.; Luo, Z. Enhancing stock price trend prediction via a time-sensitive data augmentation
method. Complexity 2020, 2020, 6737951. [CrossRef]
25. Mallat, S.G. A theory for multiresolution signal decomposition: The wavelet representation. In Fundamental Papers in Wavelet
Theory; Princeton University Press: Princeton, NJ, USA, 2009; pp. 494–513.
26. Le Guennec, A.; Malinowski, S.; Tavenard, R. Data augmentation for time series classification using convolutional neural
networks. In Proceedings of the ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, Würzburg,
Germany, 20 September 2016.
27. Um, T.T.; Pfister, F.M.; Pichler, D.; Endo, S.; Lang, M.; Hirche, S.; Fietzek, U.; Kulić, D. Data augmentation of wearable sensor
data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International
Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 216–220.
28. Kamycki, K.; Kapuscinski, T.; Oszust, M. Data augmentation with suboptimal warping for time-series classification. Sensors 2020, 20, 98.
[CrossRef]
29. Yuan, Y.; Wen, W.; Yang, J. Using Data Augmentation Based Reinforcement Learning for Daily Stock Trading. Electronics 2020, 9, 1384.
[CrossRef]
30. Lapan, M. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients,
TRPO, AlphaGo Zero and More; Packt Publishing Ltd.: Birmingham, UK, 2018.
31. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.;
Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef] [PubMed]
32. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International
Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 1889–1897.
33. Christodoulou, P. Soft actor-critic for discrete action settings. arXiv 2019, arXiv:1910.07207.
34. Wu, M.E.; Syu, J.H.; Lin, J.C.W.; Ho, J.M. Evolutionary ORB-based model with protective closing strategies. Knowl. Based Syst.
2021, 216, 106769. [CrossRef]