Joint Learning of Volume Scheduling and Order Placement Policies for Optimal Order Execution
Abstract
1. Introduction
- To distill multi-granularity embeddings from noisy market data, we develop a new representation learning module based on the variational recurrent neural network (VRNN) [11].
- We propose a hierarchical RL-based order execution approach that jointly learns the policies for volume scheduling and order placement. The hierarchical decision-making is built upon the state embeddings extracted by the representation learning module (a conceptual sketch of the hierarchy follows this list).
- Extensive experiment results demonstrate that the proposed approach outperforms the state-of-the-art baselines in order execution. Ablation studies validate the effectiveness of the various components in the proposed approach.
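To make the hierarchy concrete, the sketch below is our own illustration, not the authors' implementation: it shows how a high-level volume-scheduling policy and a low-level order-placement policy could interact within one trading episode. All names (`encode_state`, `high_policy`, `low_policy`, `market`, and the decision interval `H`) are hypothetical placeholders.

```python
# Conceptual sketch of the hierarchical execution loop (hypothetical names,
# not the paper's code).

def execute_order(total_qty, T, H, encode_state, high_policy, low_policy, market):
    """Execute `total_qty` shares over T steps; the high-level policy is queried
    every H steps (volume scheduling), and the low-level policy is queried at
    every step within that window (order placement via limit prices)."""
    remaining = total_qty
    for t in range(0, T, H):
        # High level: decide how much of the remaining inventory to trade in
        # the next H steps, from a coarse-granularity state embedding.
        z_high = encode_state(market.observe(), granularity="coarse")
        sub_qty = min(remaining, high_policy(z_high, remaining))

        unfilled = sub_qty
        for _ in range(H):
            # Low level: choose a limit price for the current sub-order,
            # from a fine-granularity state embedding.
            z_low = encode_state(market.observe(), granularity="fine")
            price = low_policy(z_low, unfilled)
            unfilled -= market.submit_limit_order(price, unfilled)
            market.step()

        remaining -= (sub_qty - unfilled)
    return remaining  # leftover inventory corresponds to non-fulfillment risk
```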
2. Related Work
2.1. Human Trader Workflow for Order Execution
2.2. Traditional Approaches for Order Execution
2.3. Reinforcement Learning for Order Execution
3. Preliminaries
3.1. OHLCV Data
3.2. Market Orders and Limit Orders
3.3. Limit Order Book
3.4. Matching System
3.5. Markov Decision Processes
4. Approach
4.1. Multi-Granularity Representation Learning
4.2. Hierarchical Order Execution Policy Learning
4.2.1. The High-Level MDP for Volume Scheduling
4.2.2. The Low-Level MDP for Order Placement
4.2.3. High-Level Policy Learning
4.2.4. Low-Level Policy Learning
4.2.5. Training Scheme
5. Experiments
- Q1: Can the proposed approach, OEHRL, successfully learn a proper order execution strategy that beats the market? (Section 6)
- Q2: Does the high-level agent in the hierarchical framework achieve effective price discovery? (Section 6.1)
- Q3: Does the low-level agent aid in further improving the performance? (Section 6.2)
- Q4: Is the proposed representation learning module effective? (Section 6.3)
- Q5: Can the proposed hindsight expert dataset improve high-level policy learning? (Section 6.4)
5.1. Datasets for the Experiments
5.2. Baselines
- Rule-Based and Traditional Methods
- TWAP strategy evenly divides an order into T segments and executes an equal share quantity at each time step [15].
- VWAP strategy aims to closely align the execution price with the true market average price, allocating orders based on the empirically estimated market transaction volume from the previous 10 trading days [17].
- AC (Almgren–Chriss) analytically determines the efficient frontier for optimal execution; to ensure a fair comparison, we focus only on the temporary price impact [13]. (A scheduling sketch of the rule-based baselines is given after this list.)
- RL-Based Methods
- DDQN (Double Deep Q-Network) is a value-based RL method that adopts hand-crafted state engineering [6].
- PPO is a policy-based RL method [21] that trains the agent with a sparse reward and uses a recurrent neural network for feature extraction.
- OPD leverages RL with policy distillation to determine the size of each sub-order (volume scheduling) and place market orders [7].
- HALOP first automatically scopes an action subset according to the market status and then chooses a specific discrete limit price from that subset using a discrete-control agent [9].
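For concreteness, the following sketch (ours, not code from the paper or the cited works) shows how the three rule-based baselines could turn a parent order of `X` shares into a schedule over `T` steps. Here `hist_volume` stands for the per-step volume profile estimated from the previous 10 trading days, and the Almgren–Chriss schedule uses the standard closed-form inventory trajectory with a risk parameter `kappa`, whose calibration is omitted.

```python
import numpy as np

def twap_schedule(X, T):
    """TWAP: split the parent order into T equal child orders."""
    return np.full(T, X / T)

def vwap_schedule(X, hist_volume):
    """VWAP: allocate in proportion to the (estimated) per-step market volume."""
    profile = np.asarray(hist_volume, dtype=float)
    return X * profile / profile.sum()

def almgren_chriss_schedule(X, T, kappa):
    """Almgren-Chriss: child order sizes are differences of the standard
    closed-form remaining-inventory curve x_t = X*sinh(kappa*(T-t))/sinh(kappa*T)."""
    t = np.arange(T + 1)
    inventory = X * np.sinh(kappa * (T - t)) / np.sinh(kappa * T)
    return -np.diff(inventory)  # shares to execute in each of the T intervals

# Example: schedule 10,000 shares over 240 one-minute steps.
print(twap_schedule(10_000, 240)[:3])
print(almgren_chriss_schedule(10_000, 240, kappa=0.01)[:3])
```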
5.3. Evaluation Metrics
- Price Advantage over VWAP (PA). We use the average excess return over the VWAP-with-market-order benchmark as the return metric, reported in basis points (bps, 1 bp = 0.01%). Note that here the VWAP benchmark is calculated with the true trading volume, not with the estimate used by the VWAP baseline.
- Win Ratio (WR). WR measures the fraction of trading days on which the agent beats VWAP (see the formulas after this list).
- Gain–Loss Ratio (GLR). GLR is the ratio of the average gain on winning days to the average loss on losing days (see the formulas after this list).
- Averaged Final Inventory (AFI). AFI is the average remaining inventory at the end of the trading period and measures the non-fulfillment risk.
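The exact formulas appear in the main text; as a reference, the definitions commonly used in this line of work (cf. [7]), which the descriptions above follow, can be sketched as below. This assumes a sell order, with D the set of test days, \bar{p}_d the strategy's average execution price on day d, and \tilde{p}_d the market VWAP on day d; the paper's precise formulas may differ slightly.

```latex
% Sketch of the standard metric definitions (cf. [7]); sell side assumed.
% D: set of test days; \bar{p}_d: average execution price of the strategy
% on day d; \tilde{p}_d: market VWAP on day d.
\[
\mathrm{PA}_d = 10^4\!\left(\frac{\bar{p}_d}{\tilde{p}_d} - 1\right), \qquad
\mathrm{PA} = \frac{1}{|D|}\sum_{d \in D}\mathrm{PA}_d, \qquad
\mathrm{WR} = \frac{1}{|D|}\sum_{d \in D}\mathbf{1}\!\left[\mathrm{PA}_d > 0\right], \qquad
\mathrm{GLR} = \frac{\mathbb{E}\!\left[\mathrm{PA}_d \mid \mathrm{PA}_d > 0\right]}{\bigl|\mathbb{E}\!\left[\mathrm{PA}_d \mid \mathrm{PA}_d < 0\right]\bigr|}.
\]
```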
5.4. Hyperparameter Setup
6. Results and Discussion
6.1. Analysis of the High-Level Policy
6.2. Ablation Study on RL-Based Low-Level Policy
6.3. Ablation Study on Multi-Granularity Representations
6.4. Ablation Study on Imitation Learning
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cartea, Á.; Sánchez-Betancourt, L. Optimal execution with stochastic delay. Financ. Stoch. 2023, 27, 1–47. [Google Scholar] [CrossRef]
- Carmona, R.; Leal, L. Optimal execution with quadratic variation inventories. SIAM J. Financ. Math. 2023, 14, 751–776. [Google Scholar] [CrossRef]
- Kath, C.; Ziel, F. Optimal order execution in intraday markets: Minimizing costs in trade trajectories. arXiv 2020, arXiv:2009.07892. [Google Scholar]
- Cont, R.; Kukanov, A. Optimal order placement in limit order markets. Quant. Financ. 2017, 17, 21–39. [Google Scholar] [CrossRef]
- Hendricks, D.; Wilcox, D. A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution. In Proceedings of the 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), London, UK, 27–28 March 2014; pp. 457–464. [Google Scholar]
- Ning, B.; Lin, F.H.T.; Jaimungal, S. Double Deep Q-Learning for Optimal Execution. Appl. Math. Financ. 2021, 28, 361–380. [Google Scholar] [CrossRef]
- Fang, Y.; Ren, K.; Liu, W.; Zhou, D.; Zhang, W.; Bian, J.; Yu, Y.; Liu, T.Y. Universal Trading for Order Execution with Oracle Policy Distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021. [Google Scholar]
- Nevmyvaka, Y.; Feng, Y.; Kearns, M. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006. [Google Scholar]
- Pan, F.; Zhang, T.; Luo, L.; He, J.; Liu, S. Learn Continuously, Act Discretely: Hybrid Action-Space Reinforcement Learning For Optimal Execution. In Proceedings of the International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022. [Google Scholar]
- Chen, D.; Zhu, Y.; Liu, M.; Li, J. Cost-Efficient Reinforcement Learning for Optimal Trade Execution on Dynamic Market Environment. In Proceedings of the Third ACM International Conference on AI in Finance, New York, NY, USA, 2–4 November 2022; pp. 386–393. [Google Scholar]
- Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A.C.; Bengio, Y. A recurrent latent variable model for sequential data. arXiv 2015, arXiv:1506.02216. [Google Scholar]
- Niu, H.; Li, S.; Li, J. MacMic: Executing Iceberg Orders via Hierarchical Reinforcement Learning. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Jeju, Republic of Korea, 3–9 August 2024; Larson, K., Ed.; pp. 6008–6016, Main Track. [Google Scholar] [CrossRef]
- Almgren, R.; Chriss, N. Optimal execution of portfolio transactions. J. Risk 2001, 3, 5–40. [Google Scholar] [CrossRef]
- Huberman, G.; Stanzl, W. Optimal liquidity trading. Rev. Financ. 2005, 9, 165–200. [Google Scholar] [CrossRef]
- Bertsimas, D.; Lo, A.W. Optimal control of execution costs. J. Financ. Mark. 1998, 1, 1–50. [Google Scholar] [CrossRef]
- Wang, J.; Zhang, C. Dynamic Focus Strategies for Electronic Trade Execution in Limit Order Markets. In Proceedings of the 8th IEEE International Conference on E-Commerce Technology and The 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services (CEC/EEE’06), San Francisco, CA, USA, 26–29 June 2006; p. 26. [Google Scholar] [CrossRef]
- Kakade, S.M.; Kearns, M.; Mansour, Y.; Ortiz, L.E. Competitive algorithms for VWAP and limit order trading. In Proceedings of the 5th ACM conference on Electronic Commerce, New York, NY, USA, 17–20 May 2004; pp. 189–198. [Google Scholar]
- Dabérius, K.; Granat, E.; Karlsson, P. Deep Execution-Value and Policy Based Reinforcement Learning for Trading and Beating Market Benchmarks. 2019. Available online: https://ssrn.com/abstract=3374766 (accessed on 1 November 2024).
- Hu, R. Optimal Order Execution Using Stochastic Control and Reinforcement Learning. Master’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2016. [Google Scholar]
- Lin, S.; Beling, P.A. Optimal liquidation with deep reinforcement learning. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Deep Reinforcement Learning Workshop, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Lin, S.; Beling, P.A. An End-to-End Optimal Trade Execution Framework based on Proximal Policy Optimization. In Proceedings of the International Joint Conference on Artificial Intelligence, Virtual, 7–15 January 2021. [Google Scholar]
- Watkins, C.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
- Akbarzadeh, N.; Tekin, C.; van der Schaar, M. Online Learning in Limit Order Book Trade Execution. IEEE Trans. Signal Process. 2018, 66, 4626–4641. [Google Scholar] [CrossRef]
- Gould, M.D.; Porter, M.A.; Williams, S.; McDonald, M.; Fenn, D.J.; Howison, S.D. Limit order books. Quant. Financ. 2013, 13, 1709–1742. [Google Scholar] [CrossRef]
- Puterman, M.L. Markov decision processes. In Handbooks in Operations Research and Management Science; Elsevier: Amsterdam, The Netherlands, 1990; Volume 2, pp. 331–434. [Google Scholar]
- Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent advances in recurrent neural networks. arXiv 2017, arXiv:1801.01078. [Google Scholar]
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.P.; Glorot, X.; Botvinick, M.M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
- Esser, A.; Mönch, B. The navigation of an iceberg: The optimal use of hidden orders. Financ. Res. Lett. 2007, 4, 68–81. [Google Scholar] [CrossRef]
- Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning (PMLR), Beijing, China, 22–24 June 2014; pp. 387–395. [Google Scholar]
- Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.V.; Lanctot, M.; de Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
- Li, S.; Zheng, L.; Wang, J.; Zhang, C. Learning subgoal representations with slow dynamics. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
| Dataset | LOB | Number of Stocks | From | To |
|---|---|---|---|---|
| CSI100 | Level-II | 100 | 2022-09-01 | 2023-09-01 |
| NASDAQ100 | Level-II | 100 | 2022-06-08 | 2023-06-08 |
| Parameter | Value |
|---|---|
| Length L | 128 |
| Episode length T | 240 |
| Execution time | 1 |
| Price level | 7 |
| Imitation coefficient | 1.0 |
| Minibatch size | 4 |
| VAE coefficient | 0.5 |
| Learning rate | 3 × 10⁻⁴ |
| Methods | PA (bps) ↑ | PA-Std (bps) ↓ | WR ↑ | GLR ↑ | AFI ↓ |
|---|---|---|---|---|---|
| TWAP | −0.12 | 3.12 | 0.49 | 0.97 | 0 |
| VWAP | −3.29 | 4.87 | 0.39 | 0.78 | 0.04 |
| AC | −1.24 | 4.21 | 0.45 | 1.02 | 0 |
| DDQN | −1.21 | 7.09 | 0.47 | 0.99 | 0.05 |
| PPO | −0.98 | 7.01 | 0.52 | 0.92 | 0.03 |
| OPD | 0.32 | 6.78 | 0.53 | 1.07 | 0.10 |
| HALOP | 2.89 | 6.12 | 0.65 | 1.12 | 0.07 |
| OEHRL | 3.31 | 6.23 | 0.72 | 1.37 | 0.02 |
| Methods | PA (bps) ↑ | PA-Std (bps) ↓ | WR ↑ | GLR ↑ | AFI ↓ |
|---|---|---|---|---|---|
| TWAP | −0.19 | 4.31 | 0.49 | 1.01 | 0 |
| VWAP | −2.89 | 4.72 | 0.35 | 0.79 | 0.02 |
| AC | −2.31 | 4.39 | 0.39 | 0.96 | 0 |
| DDQN | −1.98 | 6.89 | 0.45 | 0.97 | 0.03 |
| PPO | −1.20 | 6.79 | 0.46 | 0.96 | 0.04 |
| OPD | 1.03 | 6.82 | 0.54 | 1.05 | 0.12 |
| HALOP | 2.75 | 5.87 | 0.63 | 1.13 | 0.05 |
| OEHRL | 3.14 | 5.69 | 0.74 | 1.28 | 0.01 |
| Methods | PA (bps) ↑ | WR ↑ |
|---|---|---|
| w/o low-level learning | 2.29 | 0.65 |
| w/ low-level learning | 3.31 | 0.72 |
| Methods | PA (bps) ↑ | WR ↑ |
|---|---|---|
| w/o representation learning | −5.26 | 0.35 |
| w/ representation learning | 3.31 | 0.72 |
| Methods | PA (bps) ↑ | WR ↑ |
|---|---|---|
| w/o imitation learning | 1.36 | 0.43 |
| w/ imitation learning | 3.31 | 0.72 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, S.; Niu, H.; Lu, J.; Liu, P. Joint Learning of Volume Scheduling and Order Placement Policies for Optimal Order Execution. Mathematics 2024, 12, 3440. https://doi.org/10.3390/math12213440