Deep Learning in Quantitative Trading
Zihao Zhang
University of Oxford
Stefan Zohren
University of Oxford
Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467
[Link]
Information on this title: [Link]/9781009707121
DOI: 10.1017/9781009707091
© Zihao Zhang and Stefan Zohren 2025
This publication is in copyright. Subject to statutory exception and to the provisions
of relevant collective licensing agreements, no reproduction of any part may take
place without the written permission of Cambridge University Press & Assessment.
When citing this work, please include a reference to the DOI 10.1017/9781009707091
First published 2025
A catalogue record for this publication is available from the British Library
ISBN 978-1-009-70712-1 Hardback
ISBN 978-1-009-70711-4 Paperback
ISSN 2631-8571 (online)
ISSN 2631-8563 (print)
Additional resources for this publication at [Link]/deeplearningquant
Cambridge University Press & Assessment has no responsibility for the persistence
or accuracy of URLs for external or third-party internet websites referred to in this
publication and does not guarantee that any content on such websites is, or will
remain, accurate or appropriate.
For EU product safety concerns, contact us at Calle de José Abascal, 56, 1◦ , 28003
Madrid, Spain, or email eugpsr@[Link]
Deep Learning in Quantitative Trading
DOI: 10.1017/9781009707091
First published online: September 2025
Zihao Zhang
University of Oxford
Stefan Zohren
University of Oxford
Contents
Preface 1
1 Introduction 2
PART I: FOUNDATIONS 6
8 Conclusions 143
Acronyms 147
References 167
Preface
Over the past decade, deep learning has attracted considerable interest, prima-
rily due to its exceptional performance across a range of application domains,
with image recognition and natural language processing standing out as two of
the most notable examples. Deep learning algorithms possess the ability to learn
complex, nonlinear relationships from large volumes of data. Unlike traditional
mathematical or statistical models, which often struggle in such data-rich settings,
deep learning models excel at uncovering intricate patterns and making pre-
dictions. This capacity to manage and learn from data at scale has
made deep learning models a transformative technology across industries like
healthcare, finance, entertainment, and many others.
Given its successful applications in other fields, deep learning has also
become a natural candidate for applications to quantitative trading, as trading
firms and investment managers continuously seek innovative ways to uncover
“alpha,” or excess returns. With the rise of electronic trading, exchanges now
process billions of messages daily, generating vast amounts of data well suited
for deep learning algorithms. Additionally, investors have access to a
growing range of alternative data sources, such as mobile app downloads,
social media trends, and search engine activity (e.g., Google Trends), which can
be used to further improve decision-making. As a result, deep learning tech-
niques are increasingly becoming powerful tools for quantitative researchers
and traders, enabling more sophisticated strategies and potentially higher
returns.
A significant body of research has explored the diverse financial applica-
tions of deep learning, including areas such as alpha generation, time-series
forecasting, and portfolio optimization. The goal of this Element is to weave
these disparate threads together, placing a particular emphasis on how deep
learning algorithms can be leveraged to develop quantitative trading strategies
and systems. Whether an experienced quantitative trader aiming to enhance
strategies, a data scientist exploring opportunities within the financial sector,
or a student eager to delve into cutting-edge financial technology, the reader of
this Element should come away with a comprehensive understanding of how
deep learning is transforming the landscape of quantitative trading. By com-
bining theoretical foundations with practical applications, we seek to equip
readers with the insights and tools necessary to excel in this rapidly evolv-
ing domain. Our objective is to navigate the complexities of the field while
inspiring innovation in the integration of deep learning within quantitative
finance.
1 Introduction
Quantitative trading boasts a rich and fascinating history, with its origins dat-
ing back to the groundbreaking work of Louis Bachelier in 1900. In his seminal
thesis, Bachelier introduced the concept of Brownian motion as a framework
for modeling the stochastic behavior of financial price series. This pioneering
work established the basis for the mathematical modeling of financial mar-
kets and set the stage for modern quantitative finance (Bachelier, 1900). Over
the years, the field has undergone remarkable evolution, propelled by advances
in mathematics, statistics, and computation. From the intro-
duction of fundamental theories like the Black-Scholes model in the 1970s to
the emergence of algorithmic trading in the late twentieth century, quantitative
trading has consistently been at the forefront of financial innovation. Key devel-
opments have been documented in works such as Cesa (2017), which offers a
detailed exploration of quantitative finance’s historical trajectory and major
milestones.
As computational power and data availability have both increased, the
field has expanded further, incorporating machine learning and deep learning
techniques into its toolkit. Today, quantitative trading represents a dynamic
intersection of finance, mathematics, and computer science, continuing to
evolve as new methods and technologies emerge. Experts from diverse fields
PART I: FOUNDATIONS
2.1 Returns
Returns are a key metric in the field of finance, playing an important role
in evaluating investment performance over time. They reflect the profit or
loss achieved relative to the initial value of an investment, offering
insights into the potential profitability and risks associated with different traded
financial assets, including stocks, bonds, mutual funds, and other instruments.
Rt = (pt − pt−1) / pt−1,

where pt denotes the price of a security at time t. The aforementioned can easily
be generalized to returns over multiple time steps from t − L to t.
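As an illustrative sketch (the price array below is hypothetical, not from the text), simple and log returns can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical daily closing prices p_t of a security.
prices = np.array([100.0, 101.0, 99.0, 102.0, 103.5])

# One-step simple returns: R_t = (p_t - p_{t-1}) / p_{t-1}.
simple_returns = prices[1:] / prices[:-1] - 1.0

# One-step log returns: r_t = ln(p_t / p_{t-1}); these add up over time.
log_returns = np.diff(np.log(prices))

# Multi-step simple return over L steps, from t - L to t.
L = 3
multi_step = prices[L:] / prices[:-L] - 1.0
```

Log returns are often preferred for multi-step aggregation because they sum across periods, whereas simple returns compound.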
Understanding and analyzing financial returns is crucial for several reasons.
First, returns directly impact an investor’s wealth and financial planning, as
they determine the growth of investments over time. Second, returns are used
when assessing investment risks, and effective risk control is the key to ensur-
ing long-term investment success. Third, analyzing historical returns helps
investors identify trends and patterns, informing future investment decisions
and strategy development. Finally, financial institutions and fund managers
rely heavily upon return analysis to manage large portfolios and ensure they
meet their performance benchmarks. By examining returns, they can allocate
assets more effectively, diversify their portfolios, and carry out risk manage-
ment strategies that can protect profits against adverse market movements.
In summary, financial returns are a cornerstone of investment analysis and
decision-making. They provide a complete view of the performance and risk
from scipy import stats
import matplotlib.pyplot as plt

plt.subplot(122)
a = stats.probplot(ret, dist="norm", plot=plt)
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Ordered Values")
negatively skewed distribution has values in the left tail that lie further
from the mean than those in the right tail.
In finance, skewness may stem from diverse market forces. Investor sen-
timent can lead to asymmetrical buying or selling pressures as market par-
ticipants overreact to news or trends. Economic news can also introduce
sudden, unidirectional shocks to asset prices as markets rapidly adjust to new
information. Market microstructure might also contribute to skewness when
imbalances in order flow, liquidity constraints, or trading mechanisms create
price distortions. There are many other possible causes for deviations from a
normal distribution in the returns.
The skewness of a return distribution can typically inform the reward profile
of a security or strategy. A canonical example of a strategy that is negatively
skewed is a reversion strategy. We can expect many small positive rewards
when assets revert as expected, but we can also suffer large losses if reversion
does not occur, say, due to an unexpected news event. Selling options and VIX
futures are other examples of strategies with negatively skewed return distribu-
tions. Vice versa, a positively skewed return distribution typically corresponds
to many small losses with a few large gains – a canonical example being
momentum strategies. The most favorable type of skewness depends upon the
risk preferences of investors.
Kurtosis is the fourth normalized statistical moment, describing the tail and
peak of a distribution. In particular, kurtosis informs us whether a distribution
includes more extreme values than a normal distribution. All normal distribu-
tions, regardless of mean and variance, have a kurtosis of 3. If a distribution is
highly peaked and has fat tails, its kurtosis is greater than 3, and, vice versa, a
flatter distribution has a kurtosis less than 3. Excess kurtosis can be attributed
to market shocks, economic crises, and other rare but impactful events that
significantly affect asset prices.
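The sample skewness and kurtosis discussed above can be estimated with SciPy; the return series below are simulated for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated returns: a symmetric Gaussian series and a negatively skewed one.
normal_ret = rng.normal(0.0, 0.01, size=100_000)
neg_skew_ret = -rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

# fisher=False reports raw kurtosis, which is 3 for a normal distribution;
# the default (fisher=True) reports excess kurtosis instead.
print(stats.skew(normal_ret))                    # near 0
print(stats.kurtosis(normal_ret, fisher=False))  # near 3
print(stats.skew(neg_skew_ret))                  # clearly negative
```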
4 The Z-score quantifies the number of standard deviations by which a data point deviates from
the mean and is typically employed for large samples or when the population variance is known.
The t-score is used for small samples or unknown variances to compare means, while the chi-
square statistic tests categorical data for goodness-of-fit or independence. The F-statistic, used
in regression, evaluates variance ratios to assess group mean differences or model fit.
t = (X̄ − µ0) / (s / √n),

where X̄ and s are the sample mean and standard deviation, and µ0 is the hypothesized mean under H0. Under H0, this
statistic approximately follows a t-distribution with n − 1 degrees of freedom.
After computing the test statistic using sample data, we derive a value that
can be compared against a critical value for a given alpha. The probability of
observing a test statistic, under H0 , that is more extreme than the one we com-
puted from our data is called the p-value. The p-value determines the statistical
significance of our results relative to the null hypothesis. In the previous
example, we can find p = 2 × P(Tn−1 > |tobs |) where tobs is the observed value
of the statistic, and Tn−1 denotes a random variable following the t-distribution
with n−1 degrees of freedom. If the resulting p-value is smaller than the signifi-
cance level, for example, α = 0.05, then we can say the test result is significant,
indicating strong evidence against the null hypothesis.
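As a sketch of this procedure, SciPy's ttest_1samp computes both tobs and the two-sided p-value; the return series below is simulated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated daily returns; H0: the mean return equals zero.
ret = rng.normal(0.0005, 0.01, size=250)

res = stats.ttest_1samp(ret, popmean=0.0)
t_obs, p_value = res.statistic, res.pvalue

# Reject H0 at significance level alpha = 0.05 when p < alpha.
reject = p_value < 0.05
print(t_obs, p_value, reject)
```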
In the previous section, we used graphical tools to assess data distributions,
but we can now also use statistical hypothesis testing to validate data proper-
ties. For instance, the Jarque-Bera test can be utilized to assess the validity of
the normality assumption. This widely recognized statistical method evaluates
whether the sample skewness and kurtosis are consistent with those
expected in a normal distribution, thereby determining if the return distribution
adheres to normality.
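For illustration (with simulated data), SciPy provides the Jarque-Bera test directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

gaussian = rng.normal(size=5_000)
fat_tailed = rng.standard_t(df=3, size=5_000)  # heavier tails than normal

res_g = stats.jarque_bera(gaussian)
res_t = stats.jarque_bera(fat_tailed)

# A small p-value is strong evidence against normality.
print(res_g.pvalue)  # typically large: normality not rejected
print(res_t.pvalue)  # typically tiny: normality rejected
```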
eliminating the linear effects of intermediate lags. Unlike the ACF, which
includes the cumulative effect of all previous lags, the PACF isolates the direct
effect of a specific lag. For instance, the PACF at lag k measures the correla-
tion between Xt and Xt+k after removing the effects of lags 1 through k − 1. This
allows for a clearer understanding of the underlying relationship at each spe-
cific lag, making it easier to identify the appropriate number of lags to include
in an autoregressive model.
Mathematically, we can define the PACF at lag k as the correlation between
Xt and Xt+k that is not accounted for by their mutual correlation with Xt+1, Xt+2,
· · · , Xt+k−1 . We can obtain PACF values by fitting a linear model with Xt and
the regressors standardized:

Xt+k = ϕk,1 Xt+k−1 + ϕk,2 Xt+k−2 + · · · + ϕk,k Xt + ϵt+k,
where ϕk,k is the PACF value for lag k, and ranges from −1 to 1. With stand-
ardization, the regression slopes become the partial correlation coefficient, as
correlation is effectively the slope we get when both the response and predic-
tors have been reduced to dimensionless “z-scores.” The PACF plot is used in
conjunction with the ACF plot to identify the order of an autoregressive (AR)
model. While the ACF helps in understanding the overall autocorrelation struc-
ture, the PACF helps pinpoint the specific lags that should be included in the
AR component of an ARMA model, ensuring a more accurate estimation.
In summary, the ACF and PACF are powerful tools that enable a deeper
understanding of time-series data, guiding the development of robust and
effective forecasting models. Their combined use allows for the precise iden-
tification of temporal structures, leading to improved predictions and better
decision-making in fields where time-series data is prevalent. Figure 2 shows
an example of ACF and PACF plots for the same underlying data. The shaded
area in the plot represents an approximate confidence interval around zero cor-
relation. In other words, it is a visual guide for checking which autocorrelation
(or partial autocorrelation) lags are statistically significant from zero. We can
make these plots using the following code:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

plot_acf(ret)
plot_pacf(ret)
plt.show()
by its own past values, which can be the case for stock prices or interest rates.
For example, an AR(1) model, where the output depends only on the immediate
past observation, is defined as:

Xt = c + ϕ1 Xt−1 + ϵt,

where the model suggests that the current value of the time-series is influenced
directly by the observation at time t − 1.
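To make this concrete, the sketch below simulates an AR(1) process Xt = c + ϕ1 Xt−1 + ϵt with hypothetical parameters and recovers them by least squares (the same fit is available through library routines such as statsmodels' ARIMA):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate an AR(1) process X_t = c + phi * X_{t-1} + eps_t
# with hypothetical parameters c = 0.1 and phi = 0.6.
c, phi, n = 0.1, 0.6, 50_000
eps = rng.normal(0.0, 1.0, size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = c + phi * x[t - 1] + eps[t]

# Recover (c, phi) by regressing X_t on a constant and X_{t-1}.
X = np.column_stack([np.ones(n - 1), x[:-1]])
coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
c_hat, phi_hat = coef
```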
Differently than the AR model, the MA model represents the current value
of a time-series as a linear combination of its previous error terms. The MA
model of order q, symbolized as MA(q), is defined as:

Xt = µ + ϵt + θ1 ϵt−1 + · · · + θq ϵt−q,

where µ is the mean of the series, ϵt, · · · , ϵt−q are error terms, and θ1, · · · , θq are
the model parameters that represent the influence of past errors on the current
value. The MA model captures the influence of past shocks or disturbances on
the current observation, making it useful for modeling time-series with short-
term dependencies. This model is effective when the series is subject to random
shocks that have a lasting but diminishing impact over time. For instance, an
MA(1) model defines that the observation at time t is only influenced by the
immediate past error:
Xt = µ + ϵt + θ1 ϵt−1, (13)
2.7 Extras
Alpha and Beta In quantitative finance, the notions of alpha and beta
are very important to understanding and evaluating the performance of invest-
ment strategies. These metrics are derived from the Capital Asset Pricing Model
(CAPM) and are used to measure the returns and risk associated with individ-
ual assets or portfolios relative to a benchmark, typically a market index. In
quantitative trading, where strategies are often driven by mathematical models
and algorithms, alpha and beta provide essential insights into the effectiveness
and characteristics of trading approaches.
Alpha assesses an investment’s performance relative to a benchmark index.
More specifically, it represents the surplus return that an investment or port-
folio achieves beyond the expected return predicted by the CAPM. In other
words, alpha signifies the additional value that a trader or investment strat-
egy contributes over what is anticipated based on the asset’s systematic risk.
Conversely, beta measures an investment’s responsiveness to market fluctua-
tions. It quantifies the relationship between the investment’s returns and those
of the overall market or benchmark, indicating the extent to which the invest-
ment’s returns are expected to vary in reaction to changes in the market index.
Mathematically, we define alpha (α) and beta (β) as:

α = Ri − [Rf + β (Rm − Rf)],
β = Cov(Ri, Rm) / Var(Rm),

where Ri is the return of the investment, Rf is the risk-free rate, and Rm is the
return of the market. A positive alpha signifies that the investment has sur-
passed the benchmark, whereas a negative alpha indicates underperformance.
In quantitative trading, generating alpha is the primary goal as it reflects the
ability of a trading strategy to consistently beat its benchmark through superior
stock selection, timing, or other factors. Beta values have different meanings.
A beta exceeding 1 signifies that the investment is more volatile than the mar-
ket, indicating that it tends to amplify market movements.
Conversely, a beta below 1 indicates that the asset’s returns are less sensitive to
market movements than the market index itself. If an investment has a negative
beta, it means that the investment moves inversely to the benchmark. We can
also think of beta as the covariance between strategy and market returns scaled
by the market’s variance.
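The covariance view of beta can be checked numerically; the sketch below uses simulated strategy and market returns with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated daily market returns and a strategy with known alpha and beta.
n = 2_000
rf = 0.0001                            # daily risk-free rate
rm = rng.normal(0.0004, 0.01, size=n)  # market returns
true_alpha, true_beta = 0.0002, 1.2
ri = rf + true_alpha + true_beta * (rm - rf) + rng.normal(0.0, 0.005, size=n)

# Beta: covariance of excess returns scaled by the market's variance.
exc_i, exc_m = ri - rf, rm - rf
beta = np.cov(exc_i, exc_m)[0, 1] / np.var(exc_m, ddof=1)

# Alpha: average excess return not explained by market exposure.
alpha = exc_i.mean() - beta * exc_m.mean()
```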
A strategy that consistently generates positive alpha is considered successful,
as it indicates the ability to surpass the market performance on a risk-adjusted
basis. On the other hand, beta helps traders understand the risk profile of their
strategies and manage risk exposure to market volatility. For instance, a trader
seeking to minimize risk might construct a low-beta portfolio, while one aiming
for higher returns might opt for higher-beta assets. By utilizing alpha and beta
metrics, quantitative traders can make well-informed decisions and enhance
their trading performance.
There are two popular models used to capture and analyze volatility cluster-
ing: the Autoregressive Conditional Heteroskedasticity (ARCH) model and the
Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model.
These models help in capturing the changing variance over time and provide a
better fit for the distribution of returns. Instead of predicting returns Rt , we now
model the variance of returns. An ARCH process of order p, ARCH(p), is defined as:

σt² = ω + α1 Rt−1² + · · · + αp Rt−p²,

and the GARCH(p, q) model augments this with lagged variances:

σt² = ω + α1 Rt−1² + · · · + αp Rt−p² + β1 σt−1² + · · · + βq σt−q²,

where the GARCH model presents a dual dependence that is better at modeling
both short-term shocks and sustained persistence in volatility over time.
In practical terms, volatility clustering means that markets experience peri-
ods of turmoil and periods of calm. ARCH and GARCH models offer powerful
methods for analyzing this phenomenon, enabling more accurate forecast-
ing, risk management, and pricing of financial instruments. By recognizing
the temporal dependencies in volatility, these models enable us to better
understand market behavior and enhance decision-making in various financial
applications.
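The clustering behavior can be reproduced in a few lines by simulating a GARCH(1,1) process with hypothetical parameters (dedicated libraries, such as the arch package, offer full estimation routines):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate GARCH(1,1): R_t = sigma_t * z_t,
# sigma_t^2 = omega + alpha * R_{t-1}^2 + beta * sigma_{t-1}^2.
omega, alpha, beta, n = 1e-6, 0.10, 0.85, 20_000
r = np.zeros(n)
sigma2 = np.full(n, omega / (1.0 - alpha - beta))  # start at long-run variance
for t in range(1, n):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.normal()

# Volatility clustering: squared returns are positively autocorrelated,
# and the return distribution has fatter tails than a Gaussian.
autocorr_sq = np.corrcoef(r[:-1] ** 2, r[1:] ** 2)[0, 1]
kurt = np.mean((r - r.mean()) ** 4) / np.var(r) ** 2
```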
its predictions and the actual results. This methodology includes choosing
suitable algorithms, adjusting hyperparameters, and assessing the models’
effectiveness. In this section, we will examine these concepts comprehen-
sively, establishing a robust foundation for comprehending and implementing
supervised learning methodologies.
Additionally, we will introduce various neural network architectures, which
have become the cornerstone of modern machine learning. Neural networks,
modeled after the architecture of the human brain, are composed of intercon-
nected layers of nodes (neurons) that process and transform input data. We
will cover canonical neural network models, including feed-forward neural net-
works and state-of-the-art networks such as transformers, each designed for
specific types of data and tasks.
Upon finishing this section, you will have a detailed understanding of the
core concepts underpinning supervised learning and will better understand var-
ious types of neural networks. This knowledge will equip you with the skills
to apply these powerful techniques to a wide range of applications, unlocking
new possibilities in data analysis, prediction, and decision-making.
A supervised learning algorithm infers a function f that best defines the inter-
play between inputs and outputs by utilizing training data. The inferred function
can then be used to make estimates for new inputs. The function f can be as sim-
ple as a linear function or it can also be a highly nonlinear function as obtained
through deep learning models. During training, the true output values (labels)
are available, and our goal is to reduce the differences between the predicted
results and these actual labels. In mathematical terms, this reads:
L(y, ŷ) = (1/N) Σi=1,...,N (yi − ŷi)², (20)

where the loss is the mean of the squared residuals, ϵi = yi − ŷi, which we
aim to minimize to obtain a good fit to the data.5 The MSE loss is symmetric
and places greater emphasis on larger errors in the dataset.
5 We can also obtain the MSE for a linear model by a Maximum Likelihood approach, where we
start with the likelihood of the data, ∏i p(ϵi), and assume that the distribution of the residuals
p(ϵi) is Gaussian. Maximizing the likelihood of the data is equivalent to minimizing the negative
log-likelihood which, up to a constant, is equivalent to the MSE.
Table 1 Common loss functions for regression problems.

Metrics                          Formula
Root mean squared error (RMSE)   √((1/N) Σi=1,...,N (yi − ŷi)²)
Mean squared log error (MSLE)    (1/N) Σi=1,...,N (ln(1 + yi) − ln(1 + ŷi))²
Mean absolute error (MAE)        (1/N) Σi=1,...,N |yi − ŷi|
Median absolute error (MedAE)    median(|y1 − ŷ1|, · · · , |yN − ŷN|)
Huber loss (HL)                  (1/N) Σi=1,...,N { (1/2)(yi − ŷi)²  if |yi − ŷi| ≤ δ;  δ|yi − ŷi| − (1/2)δ²  otherwise }
There are numerous options for objective functions. For example, the mean-
squared logarithmic error can be applied to outputs that exhibit exponential
growth, imposing an asymmetric penalty that is less harsh on negative errors
than on positive ones. Both the mean absolute error (MAE) and the median
absolute error (MedAE) are symmetric and do not assign additional weight to
larger errors. Moreover, Huber loss, which merges aspects of the mean squared
error and the mean absolute error, is resistant to outliers and can be used to
stabilize training when working with noisy data. Table 1 summarizes some
common loss functions for regression problems. It is also very straightforward
to implement these losses:
from sklearn.metrics import (mean_squared_error,
                             mean_squared_log_error,
                             mean_absolute_error,
                             median_absolute_error)
from scipy.special import huber
import numpy as np
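As a quick sanity check (with made-up targets and predictions), the library results agree with the formulas in Table 1 computed by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from scipy.special import huber

# Made-up targets and predictions for illustration.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 1.5, 2.0, 5.0])

# RMSE computed from its formula matches the library value.
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y, y_hat)))

# Huber loss with delta = 1: quadratic for small errors, linear beyond.
delta = 1.0
err = np.abs(y - y_hat)
hl = np.where(err <= delta,
              0.5 * err ** 2,
              delta * err - 0.5 * delta ** 2).mean()
assert np.isclose(hl, huber(delta, err).mean())
```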
The best choice of objective function depends on the specific task. Some-
times, we can create customized loss functions to ensure that performance
metrics best reflect the consequences of incorrect predictions. For example,
in applications like medical diagnosis or fraud detection, false negatives may
be more costly than false positives. In such cases, loss functions can be tailored
to penalize certain types of errors more severely, aligning the model’s training
with the specific needs of the problem.
The functional form of the cross-entropy loss might be less intuitive than
the MSE, but it can still be understood within the context of maximum likeli-
hood estimation.6 If we deal with a multi-class classification problem (M > 2),
a separate loss is needed for each class label and a summation is taken at
the end:
− Σc=1,...,M yo,c log(po,c), (26)

where yo,c is a binary indicator (0 or 1) that equals 1 when c is the correct
label for observation o, and po,c is the output from the algorithm which
indicates the predicted probability of observation o for class c. The loss of the
data is then obtained by summing the multi-class cross-entropy of each point.
Once the predicted probabilities are transformed into predictions of one class
or another, we can evaluate model performance through several metrics. To
illustrate those metrics we focus on binary classification problems for simplic-
ity. A frequently employed measure is the misclassification rate which can be
defined as the fraction of misclassified labels:
Misclassification rate = (1/N) Σi=1,...,N I(yi ≠ ŷi). (27)
The confusion matrix is another important tool that can be used to visu-
alize various metrics. Table 2 illustrates a confusion matrix that enumerates
the quantities of correct and incorrect predictions for every class. For exam-
ple, the False Positives in the top right corner represent errors where an actual
label is negative but a prediction is positive. In the context of a stock price
reversion example, this would be a case when a stock price does not revert
but we predicted that it would revert. Such an error is much more costly to us
than a False Negative, where a stock does actually revert but we predicted it
would not.
Following the notation in the confusion matrix, we can thus introduce other
popular evaluation metrics, which are shown in Table 3. For instance, accuracy
is computed by summing the diagonal entries in the confusion matrix and then
6 The likelihood of the data is ∏i p̂i^yi (1 − p̂i)^(1−yi), which merely assigns a probability p̂i to each
point with label yi = 1 and a probability 1 − p̂i to each point with label yi = 0. The negative
log-likelihood then corresponds to our loss function.
Table 2 Confusion matrix.

                                   Actual
                        Positive               Negative
Predicted  Positive     True Positive (TP)     False Positive (FP)
           Negative     False Negative (FN)    True Negative (TN)

Table 3 Common evaluation metrics for classification.

Metrics      Formula
Accuracy     (TP + TN) / (TP + FP + TN + FN)
Precision    TP / (TP + FP)
Recall       TP / (TP + FN)
F1           2 × (Precision × Recall) / (Precision + Recall)
dividing by the total number of predicted samples. Accuracy thus represents the
proportion of total predictions that are correct. Precision indicates the fraction
of predicted positives that are truly positive, while recall measures the fraction
of actual positives correctly identified. Lastly, the F1 score balances precision
and recall by using their harmonic mean.
It is very important to check all evaluation metrics when analyzing model
performance, since a single performance metric can be misleading.
For example, in an unbalanced data set, where 90% of labels are +1, we can
get an accuracy score of 90% by simply predicting everything as +1, even
though the model has not learned anything. Another issue arises when we assign
different importance to different types of errors. For example, a mean rever-
sion strategy usually makes frequent small gains but can make infrequent large
losses when a stock does not revert. Such a strategy might demonstrate a high
accuracy for predicting stock reversion but still lead to significant losses. To
implement these metrics, we can use the following code:
from sklearn.metrics import (accuracy_score, f1_score,
                             confusion_matrix, classification_report)

acc = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average='macro')
report = classification_report(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print("Accuracy:", acc)
print("F1 Score (macro):", f1_macro)
print("Classification Report:")
print(report)
print("Confusion Matrix:")
print(cm)
TPR = TP / (TP + FN),
FPR = FP / (FP + TN). (28)
Random predictions, on average, yield a diagonal line on the ROC curve,
along which the TPR equals the FPR. This diagonal line is the benchmark case,
so if the curve lies above the diagonal, toward the upper left, the learned model
is better than random guessing. The further the curve lies from the diagonal, the better the
classifier (shown in Figure 4). We refer to the area under the ROC curve as
AUC and it is a summary measure that tells how good a classifier is. A higher
AUC score indicates a better algorithm.
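As an illustration, the ROC curve and AUC can be computed with scikit-learn; the labels and probability scores below are hypothetical model outputs:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical binary labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3])

# roc_curve sweeps the decision threshold and returns (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)  # 1.0 means perfect ranking; 0.5 is random guessing
```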
L = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2,    (30)
where N is the number of sample points. We can therefore optimize the model
parameters by setting the partial derivatives of L with respect to w and b
to 0:

∂L/∂w = 0  and  ∂L/∂b = 0,    (31)

and these can be solved analytically. In fact, by setting b = 0 for
simplicity, and writing X = (x_1, ..., x_N)^T and y = (y_1, ..., y_N)^T, the
solution can easily be obtained as:

w = (X^T X)^{−1} X^T y.

Moreover, one can directly recover the general case with b ≠ 0 by noting that
we can always interpret the bias b as a weight w_0 of a constant predictor x_0 = 1.
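As a quick numerical check, the closed-form solution can be computed directly with NumPy; the synthetic data and true weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - 3*x2 + small noise
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.01 * rng.normal(size=100)

# Closed-form least-squares solution: w = (X^T X)^{-1} X^T y
# (solve is preferred over explicitly inverting X^T X)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to [2, -3]
```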
While the analytical solution provides a direct method to find the optimal
Figure 5 An FCN with two hidden layers in which each hidden layer has
five neurons.
weight matrix and bias vector, respectively. The function g(1) (·) is called the
activation function. We then define the following hidden layers as:
where the l-th hidden layer (h(l) ∈ RNl ) has weights W (l) ∈ RNl ×Nl−1 and
biases b(l) ∈ RNl . To better illustrate this, we present an example of an MLP
in Figure 5. At its core, each hidden layer computes a linear transformation of
the previous layer’s output, followed by a nonlinear activation. The ultimate
output is determined by the target’s nature and is once again derived from the
preceding hidden layer. The discrepancy between the model’s predictions and
the true targets is quantified using a specified loss or objective function. Gra-
dient descent is then employed to adjust the model parameters in an effort to
minimize this loss. We can easily build a fully connected network with PyTorch
using the following code snippet:
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, seq_length, n_features):
        super(MLP, self).__init__()
        self.flat_dim = seq_length * n_features
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(self.flat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        return self.net(x)
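A minimal usage sketch of such a network (restating the class for completeness; the batch size, sequence length, and feature count below are illustrative):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, seq_length, n_features):
        super().__init__()
        self.flat_dim = seq_length * n_features
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(self.flat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(seq_length=20, n_features=5)
x = torch.randn(8, 20, 5)   # (batch, time, features)
y_hat = model(x)            # forward pass
print(y_hat.shape)          # torch.Size([8, 1])

# One gradient computation on a mean-squared-error loss
y = torch.randn(8, 1)
loss = nn.functional.mse_loss(y_hat, y)
loss.backward()             # gradients now populate model.parameters()
```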
7 Mathematical theory shows that a single hidden layer MLP with infinitely many neurons can
represent any continuous function. In practice, however, one aims for deeper, rather than wider,
networks as they make it easier to learn good feature representations.
Figure 6 Plots of various activation functions.
image, the CNN architecture enables the model to learn the same feature for
different parts of an image by sliding a convolutional filter across it.
Subsequently, CNNs were studied and applied in many domains. We
now demonstrate how the same concept can be used for time-series problems.
Time-series data, characterized by its sequential and temporal nature, can ben-
efit from the unique ability of CNNs to detect patterns and trends over different
scales. By adapting the convolutional operations used in image processing to
time-series data, CNNs can effectively capture local dependencies and extract
representative features. These qualities make them strong tools for time-series
forecasting, anomaly detection, and classification.
Unlike MLPs that receive inputs in vector format, CNNs are adept at process-
ing grid-structured input data through the use of two specialized layer types:
convolutional layers and pooling layers. Convolutional layers constitute the
primary components of a CNN, with each convolutional layer containing multi-
ple convolutional filters designed to extract local spatial relationships from the
input data. Convolutional filters, also known as kernels or feature detectors,
are designed to traverse and transform input data by detecting specific
features or patterns. In essence, a convolutional filter is a small weight matrix
that slides across the input data, performing a dot product with each localized
region of the input. This procedure is referred to as the convolution operation.
We denote a standard convolutional filter as K; it processes the input data
X ∈ R^{N_T × N_x} via a convolution operation:

S(i, j) = (X ∗ K)(i, j) = Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} X(i + m, j + n) K(m, n),    (36)

where S signifies the resultant matrix (feature map) and (i, j) correspond to the
indices of its rows and columns. We denote the convolution operation as ∗.
A single convolutional layer is capable of containing multiple filters, each
of which convolves the input data using a distinct set of parameters. The matri-
ces produced by these filters are often termed feature maps. Similar to MLPs,
these feature maps can be transmitted to subsequent convolutional layers and
subjected to activation functions to incorporate nonlinearities into the model.
In time-series modeling, the primary strategy involves applying convolutional
filters along the temporal axis, thereby enabling the network to discern and
learn temporal dependencies and patterns.
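As a sketch, a temporal convolution over a multivariate time-series in PyTorch (the channel counts and kernel size are illustrative):

```python
import torch
import torch.nn as nn

# 5 input features per time step, 16 filters, each sliding over 3 time steps
conv = nn.Conv1d(in_channels=5, out_channels=16, kernel_size=3)

x = torch.randn(8, 5, 100)   # (batch, features, time)
feature_maps = conv(x)       # each filter produces one feature map
print(feature_maps.shape)    # torch.Size([8, 16, 98]): length shrinks by kernel_size - 1
```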
Another crucial component of a CNN is the pooling layer, which also fea-
tures a grid-like structure. This layer condenses the information from specific
areas of the feature maps by applying statistical operations to nearby outputs.
For example, the widely used max-pooling layer (Y.-T. Zhou & Chellappa,
1988) selects the highest value within a designated region of the feature maps,
whereas average pooling computes the mean value of that region. The study
by Boureau, Ponce, and LeCun (2010) explores the application of various
pooling methods in different contexts. However, in most scenarios, selecting
the appropriate pooling technique necessitates domain expertise and empirical
experimentation.
Pooling layers are utilized across various applications to make the resulting
feature maps relatively invariant to small changes in the input data. This type of
invariance is beneficial when the focus is on detecting the existence of particu-
lar features rather than their exact positions (Goodfellow, Bengio, & Courville,
2016). For instance, in certain image classification problems, it is only neces-
sary to recognize that an image contains objects with specific characteristics
without needing to pinpoint their exact locations. Conversely, in time-series
analysis, the precise timing or placement of features is often essential, and
therefore the use of pooling layers must be approached with caution.
In addition to convolutional and pooling layers, a CNN has additional pos-
sible operations: padding and stride. Padding is employed to preserve the
dimensions of the feature maps, as convolution operations would otherwise
“shrink” the dimension of original inputs (demonstrated in Figure 7). Padding
solves this by adding, or “padding,” the original inputs with zeros around
the borders (zero-padding) so that the resulting feature maps have the same
dimension as before (the top-right figure of Figure 7).
3.4 WaveNet
CNNs are naturally well suited to stochastic financial time-series,
as convolutional layers have smoothing properties that help extract
valuable information and discard noise. In addition, a convolutional filter
can be configured to have fewer trainable weights than fully connected layers.
To some extent, this remedies the problem of overfitting (defined in Section 4).
where M_l denotes the number of channels and d is the dilation factor. A dilated
convolutional filter operates on every d-th element of the input and can
therefore access a broader range of inputs than a standard convolutional filter.
The causal nature of the convolutions ensures that the model does not violate
the temporal order of the time-series, making it suitable for prediction tasks.
We can stack multiple such layers to extract even longer dependencies. For
a network with L dilated convolutional layers, we double the dilation factor
at each layer, so that d ∈ {2^0, 2^1, · · · , 2^{L−1}}, and the filter size w
is 1 × k := 1 × 2. As a result, the dilation rate increases exponentially with
each layer, allowing the network to efficiently model prolonged dependencies
over sequences. An example of a dilated convolutional network that consists
Figure 9 A WaveNet with three layers. The dilation factors for the first,
second, and third hidden layers are 1, 2, and 4 respectively.
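A minimal sketch of such a stack of dilated causal convolutions in PyTorch, mirroring the three-layer example with dilation factors 1, 2, and 4 (the channel sizes are illustrative; left-padding enforces causality):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Three Conv1d layers with kernel size 2 and dilations 1, 2, 4."""
    def __init__(self, channels=16):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in (1, 2, 4)
        ])

    def forward(self, x):
        for layer in self.layers:
            d = layer.dilation[0]
            # Pad on the left only, so outputs never see future time steps
            x = layer(F.pad(x, (d, 0)))
        return x

net = DilatedCausalStack()
x = torch.randn(4, 16, 50)   # (batch, channels, time)
out = net(x)
print(out.shape)  # torch.Size([4, 16, 50]); receptive field spans 8 time steps
```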
Figure 11 A recurrent network that processes information from the input and
the past hidden state.
where W h ∈ RNh ×Nh , W x ∈ RNh ×Nx , b ∈ RNh constitute the linear weights
and biases for the hidden state, while g(·) designates the activation function.
The quantity Nh corresponds to the number of hidden units, and Nx refers to
the number of input features observed at any time t. An illustrative example of
such an RNN is depicted in Figure 11.
Nevertheless, due to the model’s recursive architecture, taking the derivative
of the objective function with respect to its parameters involves a sequence
of multiplicative terms that could lead to vanishing or exploding gradients
for RNNs (Bengio, Simard, & Frasconi, 1994). This issue complicates the
back-propagation of gradients, resulting in an unstable training procedure and
limiting RNNs’ effectiveness in modeling long-term dependencies.
A significant breakthrough came in 1997 when Hochreiter and Schmidhuber
(1997) introduced the Long Short-Term Memory (LSTM) network. LSTMs
addressed the vanishing gradient problem by introducing memory cells and
gating mechanisms that allow the model to retain and selectively update
where we define ht−1 as the LSTM’s hidden state at time step t−1 and apply the
sigmoid activation function σ(·). The parameters W and b represent the model’s
weights and biases. The resulting cell state and hidden state at the current time
step are then described by:
Cell state: c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_{c,h} h_{t−1} + W_{c,x} x_t + b_c),
Hidden state: h_t = o_t ⊙ tanh(c_t),    (42)
where ⊙ is the element-wise product, W c,x ∈ RNh ×Nx , bc ∈ RNh , W c,h ∈ RNh ×Nh ,
and tanh(·) is the hyperbolic tangent activation function. Figure 12 plots an
LSTM cell with all gating mechanisms.
LSTMs have been applied successfully in numerous fields because of their
unique properties for dealing with prolonged sequences. For instance, LSTMs
are widely used in language modeling, where they predict the next word in a
sequence, as well as in machine translation, where they translate text from one
language to another. For financial applications, LSTMs are also well studied
and there exists a large amount of literature that applies LSTMs to predict finan-
cial time-series. Despite their success, LSTMs still suffer from several issues.
Firstly, due to the gating mechanism and cell structure, LSTMs are complex,
with a considerable number of parameters that must be learned. As a result,
overfitting can be severe in certain applications. LSTMs are also
computationally intensive, requiring lengthy training schedules.
In an effort to address the complications and drawbacks of LSTMs, Cho et al.
(2014) proposed the Gated Recurrent Units (GRUs) as a more straightforward
alternative. GRUs also aim to mitigate the vanishing gradient problem but
do so by utilizing a reduced parameter set. This makes them computationally
more efficient while often achieving performance on par with LSTMs. Unlike
LSTMs, GRUs merge the forget and input gates into an update gate and also
combine the cell and hidden states into a single vector. This results in fewer
parameters and a leaner design. In a GRU, there are two primary gates: the
update gate and the reset gate. We can summarize the operation of a GRU as
follows:
z_t = σ(W_{z,h} h_{t−1} + W_{z,x} x_t + b_z),
r_t = σ(W_{r,h} h_{t−1} + W_{r,x} x_t + b_r),
h̃_t = tanh(W_h (r_t ⊙ h_{t−1}) + W_{h,x} x_t),    (43)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,
where zt functions as the update gate, rt serves as the reset gate, h̃t denotes the
candidate hidden state, and ht corresponds to the new hidden state.
Overall, GRUs feature a more streamlined architecture than LSTMs, mak-
ing them less complex to implement and quicker to train. With fewer gates
and combined states, GRUs have fewer parameters, reducing the risk of over-
fitting. GRUs offer an efficient alternative that retains the key advantages of
LSTMs. Understanding the differences and trade-offs between LSTMs and
GRUs allows practitioners to choose the appropriate architecture for their
specific needs.
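The reduced parameter count can be verified directly with PyTorch's built-in recurrent layers (the input and hidden sizes below are illustrative):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=10, hidden_size=64, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=64, batch_first=True)

# The LSTM has four gate blocks while the GRU has three,
# so the parameter counts sit in roughly a 4:3 ratio
print(n_params(lstm), n_params(gru))
```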
information flows from past to future for time-series. We could thus expect
that features that are meaningful for short-term predictions could be used for
long-term predictions. Therefore it would be a waste to treat them independ-
ently. Here, we introduce the Sequence-to-Sequence model (Seq2Seq) and the
Attention mechanism that enable us to make multi-horizon forecasts. Both
models have an encoder-decoder structure and we can simultaneously forecast
all horizons of interest.
hidden state serves as a summary of the entire input. In Seq2Seq models, this
final hidden state is often taken as the context vector c, functioning as the
“bridge” between the encoder and decoder. For the decoder, the hidden state
h′_t is defined as:

h′_t = f(h′_{t−1}, y_{t−1}, c),
P(y_t | y_{t−1}, · · · , y_1, c) = g(h′_t, c),

where f and g can be various functions, but g needs to produce valid
probabilities, which can be achieved through a softmax activation function
(Equation 47). Figure 13 shows an example of a standard Seq2Seq network.
softmax(z)_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j),   i = 1, · · · , K.    (47)
Seq2Seq models have advanced a wide array of tasks in NLP and other
fields. In machine translation, they revolutionized the field by providing
more accurate and fluent translations than traditional methods. In text
summarization, they enabled the generation of concise summaries from long
documents, aiding information extraction and content curation. Further,
Seq2Seq models powered early chatbots and virtual assistants, allowing for
context-aware responses in dialogues.
A primary drawback of traditional Seq2Seq architectures is that they com-
press the entire input sequence into one fixed-dimensional context vector.
For short sequences, this approach works reasonably well. But, for longer
sequences, it becomes problematic as the context vector may not encapsulate all
the relevant information. This can potentially lead to a loss of important details
and degrade the quality of the generated output. Consequently, the decoder
can find it challenging to generate precise and coherent outputs, especially
for tasks that require retaining detailed information over extended sequences.
These shortcomings led to the subsequent development of the attention
mechanism.
3.6.2 Attention
The attention mechanism, introduced by Bahdanau, Cho, and Bengio (2014),
enables a model to dynamically attend to different parts of an input sequence
instead of relying solely upon a single fixed-size context vector. This approach
leverages a system of alignment scores, attention weights, and context vectors
to provide the model with greater flexibility, thereby improving its ability to
handle longer sequences effectively.
In attention-based models, alignment scores are first calculated to assess the
relevance of each encoder hidden state to the current decoder state, indicat-
ing how each input token influences the token being generated. These scores
are then normalized with a softmax function to yield attention weights, which
dynamically control how much emphasis each input token receives at each
decoding step. Next, a weighted sum of the encoder hidden states is taken
according to these attention weights, resulting in a context vector that highlights
the most pertinent aspects of the input. This context vector is then used by the
decoder to generate the next token in the output sequence.
The attention mechanism also follows an encoder-decoder architecture. We
can denote the encoder’s hidden state at time t by ht :
Encoder: ht = f(ht−1, xt ), (48)
where f is a non-linear function that is similar to a Seq2Seq model. The dif-
ference lies in the decoder structure as we now need to compute attention
weights, alignment scores, and context vectors. Specifically, we define the
context vector ct and attention weights at the time stamp t as:
Context vector: c_t = Σ_{i=1}^{T} α_{t,i} h_i,
Attention weight: α_{t,i} = exp(e(h′_{t−1}, h_i)) / Σ_{j=1}^{T} exp(e(h′_{t−1}, h_j)),    (49)
where e(h′_{t−1}, h_i) is the attention score that indicates the weight placed
by the context vector on each time step of the encoder. The work of Luong,
Pham, and Manning (2015) introduces three methods to compute the score:

e(h′_{t−1}, h_i) =  h_i^T h′_{t−1}                  (dot),
                    h_i^T W_a h′_{t−1}              (general),    (50)
                    tanh(W_a [h_i^T ; h′_{t−1}])    (concatenate).
Finally, similar to the process for a Seq2Seq model, the context vector ct is
fed to the decoder:
Decoder: h′_t = f(h′_{t−1}, y_{t−1}, c_t),
P(y_t | y_{t−1}, y_{t−2}, · · · , y_1, c_t) = g(h′_t, c_t),    (51)
where ht′ denotes the hidden state at time t and the activation function is denoted
by g. An illustrative example of the Attention mechanism is shown in Figure 14.
In essence, the attention mechanism was conceived to address the drawbacks
of Seq2Seq models – namely, their dependence on a fixed-size context vector
and the ensuing information bottleneck. By enabling the model to selectively
focus on different regions of the input sequence, attention mechanisms sub-
stantially improve the handling of lengthy inputs and the retention of crucial
contextual information.
By granting the decoder access to all the encoder hidden states, rather than
relying on a single fixed-size context vector, the information bottleneck issue is
significantly alleviated. Moreover, attention mechanisms promote better gradi-
ent flow during training, helping to mitigate the vanishing gradient problem in
RNNs and enhancing the model’s capacity to capture long-range dependencies.
3.7 Transformers
The attention mechanism is very powerful as it enables context vectors to incor-
porate information across longer sequences. However, such a model possesses
a chain structure that is very slow to train. This problem worsens as input
lengths increase. To address this issue, the Transformer network was designed
by Vaswani et al. (2017) and represents a major advancement in leveraging
attention mechanisms.
Unlike Seq2Seq models which have a recurrent structure that is slow to train,
the Transformer architecture introduces a parallelizable attention mechanism
3.7.1 Encoder
In traditional machine learning models, data is often represented in raw, high-
dimensional forms. For transformers, this raw data is transformed into dense,
continuous representations known as Input Embeddings. In the context of NLP
applications, the input embedding layer transforms discrete tokens – like words
or subwords – into dense vectors of a predefined dimension dmodel . These vec-
tors capture semantic relationships and contextual meanings of the tokens,
allowing the model to learn complex dependencies within the data. By map-
ping tokens into a continuous space, embeddings facilitate more efficient and
effective learning and processing by the model. For time-series, we can use,
for example, a 1-D convolutional layer to carry out the embedding step.
Unlike RNNs or CNNs, transformers do not inherently process data in a
sequential manner. This poses a challenge for capturing the order of inputs
in a sequence. Transformers address this need by employing Positional Encod-
ings, which are combined with the input embeddings to inject positional context
into the model. These encodings are designed to be unique for each position in
the sequence and can be generated using various methods, such as sinusoidal
functions:
PE_{pos,2i} = sin(pos / 10000^{2i/d_model}),
PE_{pos,2i+1} = cos(pos / 10000^{2i/d_model}),    (52)

where i is the dimension and pos is the position. This form is used because it
allows the model to easily learn to attend to relative positions. Positional
encodings ensure that the model can distinguish between different positions in
the sequence, thereby preserving the order and relational information that is
vital for understanding sequential data.
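A sketch of the sinusoidal encoding of Equation (52) in NumPy (the sequence length and d_model below are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```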
Together, input embeddings and position encodings enable transformers to
handle sequential data with high flexibility and efficiency. They transform raw
features into meaningful representations and incorporate positional informa-
tion, allowing transformers to model intricate interconnections and dependen-
cies. This combination is a key factor behind the impressive performance of
transformers across a variety of tasks in NLP problems and beyond.
The core strength of the Transformer is rooted in the attention mechanism,
specifically self-attention, which enables the model to assign varying levels of
significance to different elements of the input sequence when encoding each
token. Leveraging well-established mathematical foundations, this mechanism
effectively manages long-range dependencies. Within the encoder, the Self-
Attention Mechanism allows the model to assign varying degrees of importance
to different segments of the input sequence for each token. The first step in this
process consists of linear projections:
Q_i = W^Q x_i,
K_i = W^K x_i,    (53)
V_i = W^V x_i,
where each token’s embedding is transformed into Query (Q), Key (K), and
Value (V) vectors via learned weight matrices. Here xi represents the internal
representation of a single token for NLP tasks or a single timestamp for time-
series problems.
Attention scores are determined by the dot product of the Query and Key
vectors, scaled by the square root of the Key vector dimension dk to keep the
variance close to 1. These scaled scores are then passed through a softmax
function to produce the attention weights:
attention weights(x_i, x_j) = softmax( Q_i^T K_j / √d_k ),    (54)
where the final representation for each token is computed as a weighted sum
of the Value vectors:
output_i = Σ_j attention weights(x_i, x_j) · V_j.    (55)
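A minimal NumPy sketch of single-head scaled dot-product self-attention; the randomly initialized matrices below stand in for the learned projections W^Q, W^K, and W^V:

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of Values

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 8, 4                            # 6 tokens (time steps)
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 4): one d_k-dimensional representation per token
```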
where the outputs from each attention head are concatenated and then passed
through a learned weight matrix:

MultiHead(X) = Concat(head_1, · · · , head_H) W^O,

where H indicates the number of parallel heads and W^O is a learned
projection. After the self-attention step, each token
in the sequence is independently processed by a position-wise feed-forward
network, introducing additional nonlinearity and enhancing the transformer’s
capacity to capture complex features beyond what self-attention alone can
achieve. This network is composed of two linear transformations with a ReLU
activation in between. Formally, for each token, this can be represented as:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2,

where W_1 and W_2 are trainable weight matrices, b_1 and b_2 are learnable biases,
and max(0, ·) represents the ReLU activation function. The FFN is applied inde-
pendently to each position (i.e., each token embedding) and transforms the
embeddings into a different feature space. In a two-layer FFN, the first linear
transformation is often used to expand the dimensionality, while the second
rescales it back to the original size.
Both the self-attention and feed-forward sub-layers incorporate residual con-
nections and layer normalization, which help stabilize training and enhance
overall performance. Layer normalization is a technique used to stabilize train-
ing by normalizing activations within each training example across the features
of a given layer. Through these residual connections, the input to each sub-layer
is added directly to its output, and layer normalization is applied to the sum to
maintain numerical stability and convergence:

LayerNorm(x + Sublayer(x)),

where Sublayer(·) denotes the self-attention or feed-forward operation.
3.7.2 Decoder
The decoder in a Transformer architecture is integral to generating output
sequences for purposes such as machine translation, text generation, and
where Qi and K j are the Query and Key vectors, respectively. To prevent the
leakage of future information, a mask M is applied to the scores:
masked score(x_i, x_j) = Q_i^T K_j / √d_k + M_{i,j},    (61)
where Mi,j is −∞ if j > i, ensuring that the softmax function will yield zero
weights for future tokens:
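A minimal sketch of this masking in PyTorch; the scores tensor here is a random stand-in for Q^T K / √d_k, and the sequence length is illustrative:

```python
import torch

T = 5
# M[i, j] = -inf for j > i, 0 otherwise, so softmax zeroes out future positions
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.randn(T, T)                 # stand-in for Q^T K / sqrt(d_k)
weights = torch.softmax(scores + mask, dim=-1)
print(weights[0])  # first token can only attend to itself: [1, 0, 0, 0, 0]
```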
In the decoder, multi-head attention helps the model capture various aspects
of the relationships between the decoder’s tokens and the encoder’s output.
Through cross-attention, the decoder focuses on the encoder’s output. For each
head h, the cross-attention mechanism computes:
where Qhdec are the Query vectors from the decoder, and K henc and V henc are the
Key and Value vectors from the encoder. We can concatenate the outputs from
each head and transform them as:
where the network enables the decoder to capture complex feature interactions.
Residual connections are applied around each sub-layer (self-attention, cross-
attention, and feed-forward network), followed by layer normalization. For a
given sub-layer output, the layer normalization is:
Patch-wise tokenization strikes a good balance in this trade-off, which has led
PatchTST to outperform many other architectures in forecasting benchmarks.
In addition, many other models apply one of these attention modules,
including the Autoformer (H. Wu et al., 2021), the Crossformer
(Y. Zhang & Yan, 2023), and others. Readers might also find some of the earlier
works in this area useful; these are widely covered in an earlier survey
by Lim and Zohren (2021). Note that there are still interesting modules that we
have not covered here. For example, the Temporal Fusion Transformer (TFT) of
Lim, Arık, Loeff, and Pfister (2021) is a transformer architecture designed
specifically for multi-horizon forecasting, combining the strengths of
transformers with recurrent layers to handle both static and time-varying
features.
Overall, we think that it is important to recognize the scope and prog-
ress that has been made on the development of transformers for time-series
applications. It has become clearer which Transformers are best suited for
specific time-series challenges, whether that involves capturing nuanced local
trends for anomaly detection or learning broad seasonal patterns for long-
term forecasting. The interplay between the nature of the data and the chosen
architecture continues to shape ongoing innovations in transformer-based time-
series modeling. This will continue to pave the way for increasingly accurate
and robust models. For interested readers, the aforementioned recent review
paper (Y. Wang et al., 2024) is a good place to start any further reading.
Basics of Networks and Graphs The core strength of GNNs stems from their
capacity to learn representations of nodes (or entire graphs) that encapsulate not
only their features but also the rich context provided by their connections. Such
operations are achieved through mechanisms like message passing, aggregating
information across neighboring nodes, and iteratively refining their represen-
tations. This process allows GNNs to capture both local structures and global
graph topology, offering a nuanced understanding of graph-structured data.
Before delving into GNNs, we need to understand the basics of networks and
graphs. To start, we define a graph G as:
G = (V, E), (68)
where V = {v1, · · · , vn } denotes the set of n nodes and E represents the set
of edges. An edge eij = (vi, vj ) ∈ E indicates a connection between nodes vi
and vj .
• Undirected graphs: Graphs with edges that lack direction, meaning each connection between two nodes is inherently bidirectional.
• Directed graphs (digraphs): Graphs in which edges carry a direction, representing a one-way relationship from one node to another. For example, e_ij = (v_i, v_j) ∈ E denotes an edge pointing from node v_i to node v_j.
• Bipartite graphs: A distinct type of graph in which nodes are divided into two separate groups, and every edge connects a node from one group to a node in the other group, with no edges existing within the same group.
• Homogeneous graphs: Graphs where all nodes and edges are of a single type.
• Heterogeneous graphs: Graphs that contain multiple types of nodes and/or edges. For example, we can denote a graph as G = (V, E, t : V → A, τ : E → R), where each node v_i ∈ V is assigned a type a_i ∈ A by function t and each edge e_ij ∈ E is assigned a type r_ij ∈ R by function τ.
• Dynamic graphs: A dynamic graph is defined as a sequence of graphs G_seq = {G_1, · · · , G_T}, where each G_i = (V_i, E_i) for i = 1, · · · , T. In this sequence, V_i and E_i represent the sets of nodes and edges of the i-th graph, respectively.
Figure 18 A graph convolution layer that pools information for node A from
its neighbors.
One notable approach along these lines is Time-LLM (Jin et al., 2023). This
method cleverly reprograms a large language model to treat time-series obser-
vations as tokens in a sequence. Specifically, the time-indexed data points are
formatted into a textual prompt in which the LLM is asked to “complete” the
sequence effectively performing a forecast. Despite being originally designed
for language tasks, this work provides an example of how the LLM’s internal
attention mechanisms and capacity for pattern recognition can be extended to
temporal prediction. Time-LLM has shown promising results on a variety of
benchmarks, demonstrating that large language models can be repurposed for
time-series forecasting with relatively minimal changes to their architecture.
By leveraging training on massive text corpora, Time-LLM highlights a new
direction for cross-domain learning, where the underlying skills of an LLM are
refocused on numerical patterns and trends over time.
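The prompting idea can be illustrated with a toy prompt builder. Note that this is a simplified sketch of the general "forecasting as sequence completion" framing; the actual Time-LLM pipeline of Jin et al. (2023) reprograms the inputs rather than relying on plain text prompts, and the format below is our own invention:

```python
def build_forecast_prompt(values, horizon, description="daily asset returns"):
    """Format a numeric time series as a text-completion task for an LLM.
    A toy illustration of the prompting idea, not the Time-LLM pipeline."""
    history = ", ".join(f"{v:.4f}" for v in values)
    return (
        f"The following is a series of {description}: {history}. "
        f"Continue the series with the next {horizon} values."
    )

prompt = build_forecast_prompt([0.012, -0.004, 0.007], horizon=2)
```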
Despite their advanced capabilities, the use of LLMs in quantitative finance is still in its early stages. In the second part of this Element, we
will discuss how LLMs can be used for volatility forecasting and portfolio
optimization. In this section, we discuss some limitations that LLMs face when
applied to the domain of quantitative finance. These limitations stem from the
unique challenges and requirements of the financial sector, including the need
for precise numerical analysis, real-time decision-making, and understanding
of complex financial instruments and markets.
LLMs excel at processing and generating text but often struggle with under-
standing and manipulating numerical data to the extent required in quantitative
finance. Financial analysis often involves complex mathematical models and
statistical methods that are beyond the current capabilities of language-based
models. Integrating LLMs with specialized numerical processing systems
remains a challenge. Furthermore, the financial markets are dynamic, with
conditions that change rapidly. LLMs trained on historical data may not
adapt quickly enough to real-time data or sudden market shifts. The latency
in processing new information and updating models can be a limitation in
time-sensitive financial applications.
On the one hand, LLMs can be fine-tuned with financial texts to understand
domain-specific language. However, truly grasping the intricacies of financial
instruments, regulatory environments, and market mechanisms requires a level
of expertise that LLMs may not achieve solely through language training. This
gap can lead to inaccuracies or oversimplified analyses when processing com-
plex financial scenarios. On the other hand, there is a continual concern with
respect to overfitting, a scenario where a model excels on its training dataset but
fails to perform well with new, unseen data as future market conditions can dif-
fer significantly from historical patterns. Ensuring that LLMs generalize well to
new, unseen market conditions without overfitting to past data remains a chal-
lenge. Also, most existing LLMs are trained on data up to a recent cutoff date, so they cannot be used for historical backtests without introducing look-ahead bias through information leakage.
While large language models hold strong potential for revolutionizing many
aspects of quantitative finance, addressing these limitations is important for
their effective and responsible application. Ongoing research and development
efforts are focused on overcoming these challenges and show promise for the
improvement of the capabilities of LLMs in financial analysis, prediction, and
decision-making.
Beck et al. (2024) recently proposed xLSTM, which expands upon the traditional LSTM network to tackle certain inherent shortcomings of standard recurrent networks while enhancing their capabilities for complex sequence modeling tasks. While LSTMs have been shown to be very effective in sequential modeling, they can struggle with certain types of data patterns, especially very long sequences or highly nonlinear and intricate relationships between data points.
One of the key innovations in xLSTM is the ability to dynamically adjust
its memory and learning mechanisms based on the complexity and nature of
the data it encounters. Traditional LSTMs use fixed gates for controlling the
flow of information, which can be limiting when faced with varying data char-
acteristics. In contrast, xLSTM introduces adaptive mechanisms that allow the
network to modulate its memory retention and forgetfulness more effectively.
This adaptability enables xLSTM to maintain a high level of performance even
when dealing with sequences that have non-stationary patterns or when the rele-
vant information spans a wide range of time steps. By extending the core LSTM
architecture, xLSTM is better equipped to capture complex dependencies that
might be missed by more rigid models.
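To give a flavor of the exponential-gating idea, the following is a heavily simplified scalar sketch of an sLSTM-style cell update (the actual architecture in Beck et al. (2024) adds stabilizer states, output gates, matrix memories, and multiple heads, none of which are shown here):

```python
import math

def slstm_step(c_prev, n_prev, z, i_pre, f_pre):
    """One scalar sLSTM-style update with exponential gates:
    c_t = f_t*c_{t-1} + i_t*z_t,  n_t = f_t*n_{t-1} + i_t,  h_t = c_t / n_t.
    The normalizer state n_t keeps the output bounded despite the exp gates."""
    i_t = math.exp(i_pre)          # exponential input gate
    f_t = math.exp(f_pre)          # exponential forget gate
    c_t = f_t * c_prev + i_t * z   # cell state
    n_t = f_t * n_prev + i_t       # normalizer state
    h_t = c_t / n_t                # normalized hidden output
    return c_t, n_t, h_t

# Feed a short input sequence through the toy cell.
c, n = 0.0, 1e-8
for z in [0.5, -0.2, 0.1]:
    c, n, h = slstm_step(c, n, z, i_pre=0.0, f_pre=0.0)
```

With unit gates (pre-activations of zero), the normalized output h_t reduces to a running mean of the inputs, illustrating how the normalizer prevents the exponential gates from blowing up the output.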
The introduction of xLSTM is a significant breakthrough in the ongo-
ing development of neural network architectures for sequence modeling.
Kong, Wang, et al. (2024) build on xLSTM to model multivariate time-series in particular. They improve and revise the memory storage of xLSTM to fit time-series analysis and adopt patching techniques to ensure that long-term dependencies can be captured.
our interest. We have introduced several methods to source market price data in
Appendix B and, importantly, we need to choose the frequency of interest and
price formats. High-frequency microstructure market data or down-sampled
price and volume data are two possible examples specific to quantitative trad-
ing. Different formats might influence network architectures and change the
amount of training data required.
Beyond obtaining the right dataset, data preparation is a vital step and might
affect our model performance in unexpected ways if it is carried out poorly.
Missing data is one of the common problems that we might encounter when
dealing with time-series. Hence, we need to be extremely careful to make sure that there is no leakage of future information (also known as look-ahead bias) when choosing to impute these missing values. Having access to future information might erroneously boost training performance but will lead to very poor out-of-sample results.
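As a minimal illustration of leakage-free imputation, forward-filling uses only past observations, whereas interpolation or backward-filling would peek at future values (the function below is our own sketch):

```python
def forward_fill(series, missing=None):
    """Impute missing values using only past information (no look-ahead):
    each gap is filled with the most recently observed value."""
    filled, last = [], None
    for x in series:
        if x is missing:
            filled.append(last)  # leading gaps stay None: nothing observed yet
        else:
            last = x
            filled.append(x)
    return filled

prices = [100.0, None, None, 103.0, None]
filled = forward_fill(prices)  # → [100.0, 100.0, 100.0, 103.0, 103.0]
```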
Additionally, it is important to store data in a format that permits swift
exploration and iteration. Beyond databases, popular choices are pickle, HDF,
or Parquet formats – each with its own advantages and disadvantages. For
data exceeding available memory or requiring distributed processing across
multiple machines, parallel computing can also be employed.
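As a small example of serialized storage using only the Python standard library (Parquet and HDF would instead go through third-party libraries such as pyarrow or h5py):

```python
import os
import pickle
import tempfile

# Serialize a prepared dataset to disk and load it back.
dataset = {"dates": ["2024-01-02", "2024-01-03"], "returns": [0.004, -0.002]}

path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")
with open(path, "wb") as f:
    pickle.dump(dataset, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as f:
    restored = pickle.load(f)
```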
way to search for optimal parameters. This process of searching for optimal
parameters is called hyperparameter tuning. These parameters are not part of
the so-called “inner” optimization of the model, such as learning the weights
of the neural network using gradient descent. During this inner optimization,
the hyperparameters are kept fixed. In hyperparameter optimization, which is sometimes also called “outer” optimization, we repeat the inner optimization multiple times for different choices of hyperparameters with the aim of finding the model with the lowest cross-validation error. There are many ways
to search for optimal hyperparameters and we introduce three popular methods
here.
The most basic hyperparameter tuning method is grid search in which we fit
a model for each possible combination of hyperparameters over a grid of pos-
sible values. Obviously, if we have a large number of hyperparameters to tune,
this method would be extremely time-consuming and inefficient. An alternative
to grid search is random search. Random search is different from grid search
in the sense that we do not come up with an exhaustive list of combinations.
Rather, we can give a statistical distribution for each hyperparameter and sam-
ple a candidate value from that distribution. This gives better coverage of the individual hyperparameters. Indeed, empirical evidence suggests that only a few of the hyperparameters matter, which makes grid search a poor choice for dealing with a larger number of candidates.
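Both methods fit in a few lines of code. The sketch below implements random search with a stand-in evaluation function (in practice, `evaluate` would fit and cross-validate an actual model; the parameter names and distributions are our own choices):

```python
import random

def evaluate(params):
    """Stand-in for a cross-validated model error; lower is better."""
    return (params["lr"] - 0.01) ** 2 + 0.1 * (params["layers"] - 2) ** 2

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # Sample each hyperparameter from its own distribution.
        params = {
            "lr": 10 ** rng.uniform(-4, -1),  # log-uniform learning rate
            "layers": rng.randint(1, 5),      # uniform over layer counts
        }
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(n_trials=100)
```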
The previous two methods perform individual evaluations of hyperparame-
ters without learning from previous hyperparameter evaluations. The advantage
of these approaches is that they allow for trivial parallelization. However, we
discard the information from previous evaluations that could otherwise be used
to inform regions where we are more likely to find better hyperparameters.
For example, if initial evaluations show that the generalization error plateaus
quickly after reducing the learning rate, it might be less likely to find better
models when reducing the value even further.
Bayesian optimization (Frazier, 2018) is a sequential model-based optimization (SMBO) algorithm that can be used to inform at which point in the hyperparameter space to evaluate a model’s performance next, given generalization errors
obtained from previous evaluations. It is specifically designed for scenarios
in which each evaluation of a target function is complex or expensive to run.
To implement such an approach, we first construct a model with some hyper-
parameters and obtain a score v according to some evaluation metric. Next, a
posterior distribution of the hyperparameter is computed and the choices for
the next experiment can be sampled according to this posterior expectation.
We would then repeat this process until convergence.
In practice, Gaussian Processes (GPs) are often used to model the objective
function. An intuitive way of thinking about a GP is as a Gaussian distribution
over continuous functions. Any finite number of points on this function are
distributed according to a multi-variate Gaussian – thus another way of thinking
about the GP is as a multi-variate Gaussian where the number of possible points
goes to infinity. The correlation between points is given by a kernel function
which depends on the distance between the points. Thus, the closer the points
the more correlated they are, which enforces the continuity of the GP.
In Bayesian optimization, one typically starts by specifying a GP prior over
the model’s generalization error across the hyperparameter space, often with
a zero mean and constant variance for simplicity. An initial evaluation is
performed on a random hyperparameter setting, after which the posterior dis-
tribution is updated based on the observed outcome. This updated posterior
then guides the selection of the next hyperparameters to explore, aiming to
efficiently locate optimal configurations. Intuitively, when choosing the next
point to evaluate the model, we have to trade off exploration and exploitation:
It makes sense to search further in regions where the GP indicates that the objective function is improving (exploitation). However, we also want to search in areas where the uncertainty is large and we have no knowledge yet regarding how good the objective might be (exploration).
In practice, we can carry out hyperparameter tuning by using Optuna
(Akiba et al., 2019), which is an open-source optimization framework designed
for hyperparameter tuning. It leverages techniques such as Bayesian optimi-
zation to systematically explore large search spaces and find optimal con-
figurations. We can easily integrate it with cross-validation to ensure that
optimizations are evaluated on multiple splits of data for reliable results. By
intelligently selecting the most promising hyperparameter settings to evaluate
at each step, Optuna minimizes the amount of training required and reduces the
need for extensive manual tuning.
MLlib for machine learning, and GraphX for graph processing. Its widespread
adoption ensures a wealth of resources and community support.
The choice between platforms depends on the problem to be solved and, in
some cases, you might even integrate these platforms to achieve desired out-
comes. For example, we could use Ray to build a start-to-finish framework for
deep learning models. For initial data preprocessing, we can use Ray’s remote functions (@ray.remote) to parallelize data fetching and use Pandas-like libraries built on Ray to normalize or extract meaningful features. Such libraries provide us with Pandas-style operations at a much larger scale. For deep learning models, Ray integrates seamlessly with frameworks like TensorFlow and PyTorch,
distributing the training process and making efficient use of available compu-
tational resources. In terms of hyperparameter tuning and cross-validation, Ray
Tune is an excellent tool that empowers us to distribute the search for the best
model parameters across multiple workers simultaneously. This is particularly
beneficial when experimenting with large models or when you need to iterate
quickly over many hyperparameter combinations.
This strategy highlights the importance of interest rates and funding costs in
trading decisions.
The second part of the section shifts focus to classical cross-sectional
strategies, which are important in the equity market. We explore the “long-
short” strategy via cross-sectional momentum, in which long positions are
taken in stocks showing strong performance and short positions in those with
weak performance. This method aims to capitalize on the relative momentum
across different securities, hedging market-wide risk by maintaining bal-
anced portfolio-level long and short exposures. We next discuss “Statistical
Arbitrage” (StatArb) strategies, which involve employing statistical models to
identify and exploit price inefficiencies between closely related assets. By ana-
lyzing historical price relationships and using statistical methods to identify
deviations from expected values, traders can execute high-frequency trades to
take advantage of temporary mispricings, all while managing risk and exposure
through sophisticated mathematical models.
The third part is the core of this section, in which we address the trans-
formative potential of deep learning to refine and revolutionize such classical
quantitative strategies. By leveraging deep learning algorithms, with their
ability to analyze vast datasets, traders can uncover complex nonlinear pat-
terns, and improve the predictive accuracy of models. This section covers how
deep learning can be integrated into both futures/FX and equity strategies,
from augmenting trend analyses in CTA-style strategies to refining the selec-
tion process in long-short equity approaches and improving the detection of
arbitrage opportunities in StatArb.
By providing insights into these cutting-edge techniques, this section aims
to equip readers with the knowledge to harness the power of deep learning,
pushing the boundaries of traditional quantitative trading strategies to achieve
enhanced performance and risk management in increasingly complex market environments.
“roll forward” the contract by closing the one set to expire and opening a new one with a later expiry. In essence, if the new contract is 20% cheaper, you would be able to buy 25% more of it for the same dollar amount.
By contrast, the continuous futures approach reflects actual price movements by linking successive contracts in a way that eliminates price distortions (price gaps) at rollover points. This alternate linked-contract representation can
thus be used for back-testing and more accurately reflects the hypothetical gains
and losses of a trader. However, the trade-off is that the price series from continuous futures contracts will not match actual historical prices, whereas those generated by the nearest futures approach do. In some cases, we might even observe negative prices in a series produced by the continuous futures approach. As a
result, the appropriate method to join futures contracts together depends on
the specific use case. Generally, the nearest futures contracts should be used if
the actual historical price is important, but if the goal is to simulate the gains
and losses of a strategy, the continuous contracts approach should be adopted
instead.
the level of risk remains stable over time. This method is especially significant in futures and FX trading, where market conditions fluctuate
significantly. By considering volatility – a primary measure of risk – investors
can potentially enhance risk-adjusted returns and better manage the drawdowns
associated with periods of high market turbulence.
The core idea behind volatility targeting involves scaling an asset’s investment exposure according to the ratio of a target volatility level to the current or expected volatility of that asset. This adjustment factor can be defined as:

A = σ_target / σ_current,  (70)

where σ_current is the current asset volatility, typically estimated using the standard deviation of historical returns over a specified look-back period. The target volatility (σ_target) is a predetermined level of risk that the investor aims to maintain. Its determination is guided by the investor’s risk tolerance, investment timeline, and perspective on market conditions.
The trading positions are then scaled by the adjustment factor A to align
the volatility with the target level. Hence, if an asset’s current volatility is
higher than the target, its exposure is reduced (and vice versa), thereby aiming
to stabilize the risk profile. Figure 26 shows an example of a long-only S&P
500 benchmark strategy which has a Sharpe ratio of 0.461. It also includes a
version of the strategy that uses volatility targeting (to an annual volatility of
σtgt = 15%) to scale positions and consequently increases the Sharpe ratio to
0.632.
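The adjustment factor of Equation 70 can be sketched in a few lines (an illustrative implementation with our own naming; the look-back window and annualization convention are choices, not prescriptions):

```python
import statistics

def vol_target_scale(returns, target_vol, lookback=60, periods_per_year=252):
    """Adjustment factor A = sigma_target / sigma_current, where current
    volatility is the annualized standard deviation of recent returns."""
    recent = returns[-lookback:]
    current_vol = statistics.stdev(recent) * periods_per_year ** 0.5
    return target_vol / current_vol

# Example: scale a unit position to a 15% annualized volatility target.
daily_returns = [0.01, -0.008, 0.012, -0.01, 0.006, -0.004] * 10
scale = vol_target_scale(daily_returns, target_vol=0.15)
position = 1.0 * scale  # exposure shrinks when current vol exceeds the target
```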
In practice, implementing a volatility targeting strategy involves continuous
monitoring of market conditions and trading performance. As market volatility
changes, the risk exposure must be periodically adjusted to maintain the tar-
get risk level. This dynamic rebalancing requires a disciplined approach and
an efficient execution mechanism to minimize transaction costs and slippage.
Moreover, investors often employ advanced forecasting tools that consider fac-
tors like market sentiment, economic metrics, and geopolitical conditions, that
allow them to adjust their risk exposure in anticipation of potential volatility.
These models can range from simple historical volatility measures to complex
GARCH models and machine learning algorithms.
As we discuss in greater detail in the next section, volatility targeting across
multiple instruments can also be interpreted as a simple form of portfolio con-
struction. In particular, when assuming that the covariance matrix of portfolio
constituents is a diagonal matrix with respective variances on its diagonal
entries, then a standard mean-variance portfolio reduces to volatility targeting.
While assuming a diagonal covariance matrix tends to be a poor assumption for equity markets, we can see that the covariance matrix of a universe of futures contracts is roughly block-diagonal with very small off-diagonal terms (Figure 27).
where σ_tgt is the annualized volatility target and σ_t^i is an estimate of current market volatility, which can be calculated by using an exponentially weighted moving standard deviation of r^i_{t,t+1}. Note that in the previous formulation, when working with returns (and ignoring or only using linear transaction costs), the result does not depend on the actual overall position size. However, in practice, one would actually target a dollar volatility, such as σ_tgt^USD = 10 million USD, rather than a percentage volatility of, say, σ_tgt = 15%. Then sign(r^i_{t−k:t}) σ_tgt^USD / σ_t^{i,USD} would correspond to the actual target trading position in USD.
In this case, sign(r^i_{t−k:t}) is essentially the time-series momentum factor, where we go long if the 12-month return is positive and vice versa. In practice, there are various ways to decide the direction of our positions, and we use Y_t to indicate trading directions in the more general case. Here we introduce two popular trend-following strategies: simple moving-average crossover (SMA) and moving average convergence divergence (MACD).
Here, a MACD signal has two time-scales: S, which captures short-term movement, and L, which captures the long-term trend. α is the smoothing factor (0 < α ≤ 1), which controls the degree of the weighting decrease for the EWMA, and we can define α in terms of a span S via α = 2/(S + 1). We can further improve the signal by combining multiple MACD signals together. In such a case, each MACD signal has a different time-scale and a final position could be decided according to:

Ỹ_t = Σ_{k=1}^{3} Y_t(S_k, L_k),  (74)

where, for example, S_k ∈ {8, 16, 32} and L_k ∈ {24, 48, 96} days. Note that the long look-back is often chosen to be roughly three times the short look-back.
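The combined signal of Equation 74 can be sketched as follows. This is a simplified illustration: we take the short-minus-long EWMA difference as Y_t(S, L) and sum over the three time-scale pairs, whereas practical MACD implementations typically also normalize the signal, for example by a rolling standard deviation of prices:

```python
def ewma(prices, span):
    """Exponentially weighted moving average with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    avg = prices[0]
    for p in prices[1:]:
        avg = alpha * p + (1.0 - alpha) * avg
    return avg

def macd_signal(prices, short_span, long_span):
    """Y_t(S, L): short EWMA minus long EWMA at the last time step."""
    return ewma(prices, short_span) - ewma(prices, long_span)

def combined_macd(prices, pairs=((8, 24), (16, 48), (32, 96))):
    """Equation 74: sum of MACD signals over three time-scale pairs."""
    return sum(macd_signal(prices, s, l) for s, l in pairs)

# An upward-trending price series yields a positive combined signal.
prices = [100 + 0.5 * t for t in range(200)]
signal = combined_macd(prices)
```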
for which, going long Currency A and shorting Currency B, we will earn interest on Currency A and pay interest on Currency B. The net interest earned per day (I) on a notional amount of capital C can therefore be calculated as:

I = (i_A − i_B) × C × l / 365,  (76)

where l is the leverage that magnifies both potential profits and potential losses.
While the interest differential might be positive, there remains a risk that the
currency pair’s exchange rate moves against the position. If Currency A depre-
ciates against Currency B, it can negate the interest earnings or even lead to a
net loss. Accordingly, carry trading in FX markets involves not only a simple
interest rate arbitrage but also entails significant exchange rate risk. Traders
thus need to account for the possibility that currency movements could wipe
out the interest gains. Additionally, leverage, which is frequently employed in
carry trades, can magnify returns, but also heightens the potential for losses.
This makes it crucial to manage risk effectively in carry trading strategies.
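Equation 76 is straightforward to compute; a minimal sketch with our own function and variable names:

```python
def daily_carry_interest(i_a, i_b, capital, leverage):
    """Net interest earned per day, as in Equation 76:
    I = (i_A - i_B) * C * l / 365, with annualized rates i_A and i_B."""
    return (i_a - i_b) * capital * leverage / 365.0

# Long a currency yielding 4%, short one costing 1%,
# on 100,000 of notional capital at 5x leverage.
interest = daily_carry_interest(0.04, 0.01, 100_000, 5)
```

At these illustrative numbers the position earns roughly 41 per day in notional currency terms, before any exchange-rate moves.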
r_t^p = (1/N) Σ_{i=1}^{N} (σ_tgt / σ_t^i) r^i_{t,t+1},  (77)

where N denotes the total number of assets within the portfolio and r^i_{t,t+1} represents the return of asset i. The upcoming sections will outline traditional trading strategies and illustrate how deep learning models can be utilized to enhance these methodologies.
r_p = w_L · r_L − w_S · r_S,  (79)

where w_L and w_S are the weights of the long and short positions, respectively.
Another strategy for stock selection is the cross-sectional momentum strategy,
which capitalizes on the momentum factor across different stocks or sectors.
The underlying concept is that stocks that have outperformed their competi-
tors in the past are expected to sustain their strong performance in the short
to medium term, while those that have underperformed are likely to continue
struggling.
Specifically, this strategy involves ranking stocks based on their past returns
and taking long positions in those within the top percentile while shorting those
within the bottom percentile. Mathematically, the strategy first ranks stocks
based on r^i_{t−1}, which is the return in the previous period. It then goes long stocks with r^i_{t−1} in the top x% and short stocks in the bottom x%, with a typical value
for x% being 10%. To avoid sector biases and sector-specific exposure, the
strategy can be applied within sectors, buying the best performers and selling
the worst performers within each sector. Momentum strategies can exhibit con-
siderable variation in their effectiveness based on the chosen time frame for
measuring past returns, and often require back-testing to determine optimal
parameters. These strategies are staples in the quantitative trading world and
are widely applied in today’s trading markets.
where u_t^i are market features and θ are network parameters. In particular, we aim to optimize the average return and the annualized Sharpe ratio using the following loss functions:

L_returns(θ) = −µ_R = −(1/N) Σ_Ω R(i, t),

L_sharpe(θ) = − (µ_R × √252) / √( (Σ_Ω R(i, t)²)/N − µ_R² ),  (81)

R(i, t) = w_t^i (σ_tgt / σ_t^i) r^i_{t,t+1},

where µ_R represents the average return across the entire universe Ω of size N and R(i, t) denotes the return generated by the trading strategy for asset i at time t. We can employ different network architectures to model the relationship between the position w_t^i and the market features u_t^i. The entire computational
process is differentiable, which allows for the use of gradient ascent to max-
imize the objective functions. In practice, we multiply the loss functions by
minus one and use gradient descent to minimize them. The following code
snippet demonstrates how to construct a negative Sharpe ratio loss function in
PyTorch:

import torch
import torch.nn as nn

def Neg_Sharpe(portfolio):
    return -torch.mean(portfolio) / torch.std(portfolio)

class SharpeLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, returns):
        # Negative annualized Sharpe ratio of the strategy returns R(i, t),
        # so that minimizing this loss maximizes the Sharpe ratio.
        sharpe = torch.mean(returns) / torch.std(returns)
        return -sharpe * torch.sqrt(torch.tensor(252.0))
                    E(R)   Std(R)  DD     MDD    Sharpe  Sortino  Calmar  % of +Ret  Ave.P/Ave.L
Long Only           0.039  0.052   0.035  0.167  0.738   1.086    0.230   53.8%      0.970
Sgn(Returns)        0.054  0.046   0.032  0.083  1.192   1.708    0.653   54.8%      1.011
MACD                0.030  0.031   0.022  0.081  0.976   1.356    0.371   53.9%      1.015
Linear
  Sharpe            0.041  0.038   0.028  0.119  1.094   1.462    0.348   54.9%      0.997
  Ave. Returns      0.047  0.045   0.031  0.164  1.048   1.500    0.287   53.9%      1.022
MLP
  Sharpe            0.044  0.031   0.025  0.154  1.383   1.731    0.283   56.0%      1.024
  Ave. Returns      0.064  0.043   0.030  0.161  1.492   2.123    0.399   55.6%      1.031
WaveNet
  Sharpe            0.030  0.035   0.026  0.101  0.854   1.167    0.299   53.5%      1.008
  Ave. Returns      0.032  0.040   0.028  0.113  0.788   1.145    0.281   53.8%      0.980
LSTM
  Sharpe            0.045  0.016   0.011  0.021  2.804   3.993    2.177   59.6%      1.102
  Ave. Returns      0.054  0.046   0.033  0.164  1.165   1.645    0.326   54.8%      1.003
                    E(R)   Std(R)  DD     MDD    Sharpe  Sortino  Calmar  % of +Ret  Ave.P/Ave.L
Long Only           0.117  0.154   0.102  0.431  0.759   1.141    0.271   53.8%      0.973
Sgn(Returns)        0.215  0.154   0.102  0.264  1.392   2.108    0.815   54.8%      1.041
MACD                0.172  0.155   0.106  0.317  1.111   1.622    0.543   53.9%      1.031
Linear
  Sharpe            0.232  0.155   0.103  0.303  1.496   2.254    0.765   54.9%      1.056
  Ave. Returns      0.189  0.154   0.100  0.372  1.225   1.893    0.507   53.9%      1.047
MLP
  Sharpe            0.312  0.154   0.102  0.335  2.017   3.042    0.930   56.0%      1.104
  Ave. Returns      0.266  0.154   0.099  0.354  1.731   2.674    0.752   55.6%      1.065
WaveNet
  Sharpe            0.148  0.155   0.103  0.349  0.956   1.429    0.424   53.5%      1.018
  Ave. Returns      0.136  0.154   0.101  0.356  0.881   1.346    0.381   53.8%      0.993
LSTM
  Sharpe            0.451  0.155   0.105  0.209  2.907   4.290    2.159   59.6%      1.113
  Ave. Returns      0.208  0.154   0.102  0.365  1.349   2.045    0.568   54.8%      1.028
Table 6 (cont.)

                      E(R)    Std(R)  DD     MDD    Sharpe  Sortino  Calmar  % of +Ret  Ave.P/Ave.L
Decoder-Only Trans.   0.013   0.019   0.013  0.026  0.72    1.03     0.60    52.7%      1.012
Conv. Transformer     0.018   0.019   0.007  0.031  0.98    1.47     0.77    52.9%      1.056
Informer              0.016   0.011   0.008  0.017  1.51    2.30     1.44    54.3%      1.089
Decoder-Only TFT      0.019   0.012   0.006  0.017  1.71    2.61     2.06    55.7%      1.073
COVID-19
Long Only            −0.014   0.067   0.056  0.123  −0.19   −0.22    −0.12   57.2%      0.720
TSMOM                 0.009   0.047   0.031  0.041  0.21    0.32     0.22    50.0%      1.041
LSTM                 −0.041   0.028   0.025  0.053  −1.50   −1.67    −0.78   52.2%      0.643
Transformer           0.042   0.012   0.008  0.008  3.38    5.55     7.31    64.8%      1.066
Decoder-Only Trans.   0.080   0.025   0.014  0.010  3.01    5.55     8.56    58.8%      1.243
Conv. Transformer     0.031   0.019   0.014  0.016  1.81    2.74     3.17    57.4%      1.058
Informer              0.043   0.016   0.010  0.010  2.71    4.45     4.28    59.6%      1.137
Decoder-Only TFT      0.018   0.017   0.013  0.021  1.22    1.74     1.57    60.3%      0.831
2015–2020 COVID-19
Downloaded from [Link] IP address: [Link], on 03 Oct 2025 at [Link], subject to the Cambridge Core terms of
use, available at [Link] [Link]
96 Quantitative Finance
\[ r^{\mathrm{CSM}}_{t,t+1} = \frac{1}{N} \sum_{i=1}^{N} X^i_t \, \frac{\sigma_{\mathrm{tgt}}}{\sigma^i_t} \, r^i_{t,t+1}, \tag{82} \]
where Z^i_t ∈ {1, · · · , N} signifies the ranking position of asset i after the scores are sorted in ascending order using the operator R(·). The third step is the selection process and typically involves applying a threshold to retain a specific proportion of assets, which are then used to construct the corresponding long and short portfolios. Equation 85 assumes that the strategy uses standard decile-based portfolios, meaning that the top and bottom 10% of assets are selected:

\[ \text{Security Selection: } X^i_t = \begin{cases} -1 & Z^i_t \le 0.1\,N, \\ \phantom{-}1 & Z^i_t > 0.9\,N, \\ \phantom{-}0 & \text{otherwise.} \end{cases} \tag{85} \]
The last step is portfolio construction. For example, we might construct an equally weighted portfolio scaled by volatility targeting, as shown in Equation 82. Most cross-sectional momentum strategies conform to this framework and are generally consistent in the final three steps: ranking scores, selecting assets, and constructing the portfolio. However, strategies differ in the choice of prediction model f used to calculate the asset scores, ranging from simple heuristic methods to advanced models that incorporate a wide array of macroeconomic inputs. While there are numerous techniques available for score computation, we typically focus on three primary approaches: classical momentum strategies, Regress-then-Rank, and Learning to Rank.
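As an illustration, the ranking, decile-selection, and volatility-scaling steps of Equations 82 and 85 can be sketched in a few lines of numpy. The function names and the default 15% volatility target below are illustrative, not from the Element:

```python
import numpy as np

def csm_positions(scores, decile=0.1):
    """Assign positions per Equation 85: short the bottom decile,
    long the top decile, zero elsewhere."""
    n = len(scores)
    ranks = scores.argsort().argsort() + 1       # Z_t^i in {1, ..., n}, ascending
    pos = np.zeros(n)
    pos[ranks <= decile * n] = -1.0              # bottom 10%: short
    pos[ranks > (1 - decile) * n] = 1.0          # top 10%: long
    return pos

def csm_return(pos, next_returns, vols, vol_target=0.15):
    """Volatility-scaled cross-sectional momentum return (Equation 82)."""
    return np.mean(pos * (vol_target / vols) * next_returns)
```

With ten assets, exactly one lands in each extreme decile; if the long and short legs realize identical returns, the portfolio return in Equation 82 nets to zero.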
For classical momentum strategies, we calculate scores with time-series momentum factors or signals, such as MACD. Equation 86 illustrates how an asset could be scored based on its raw cumulative returns calculated over the preceding 12 months:

\[ \text{Score Calculation: } Y^i_t = r^i_{t-252,t}, \tag{86} \]
where r^i_{t-252,t} represents the unadjusted returns of asset i over the 252-day period ending at time t.

In contrast, the regress-then-rank approach first requires a predictive model, such as a standard regression or deep learning model. A score is then calculated as:

\[ \text{Score Calculation: } Y^i_t = f(u^i_t; \theta), \tag{87} \]

where f denotes a prediction model that receives an input vector u^i_t and is parameterized by θ. We then designate a target variable, such as volatility-normalized returns, and train the model by minimizing the MSE loss:

\[ \mathcal{L}(\theta) = \frac{1}{N} \sum_{\Omega} \Big( Y^i_t - \frac{r^i_{t,t+1}}{\sigma^i_t} \Big)^2, \tag{88} \]
where Ω denotes the collection of all N possible forecasts and target pairs across
the set of instruments and their corresponding time steps.
Learning to Rank (LTR) (T.-Y. Liu et al., 2009) is a research domain in Information Retrieval that emphasizes the use of machine learning techniques to develop models for ranking tasks. To introduce the framework of LTR, we borrow examples from document retrieval. For training purposes, we are provided with a collection of queries Q = {x^1, · · · , x^N}. Each query x^i is linked to a set of documents {x^i_1, · · · , x^i_m} that must be ranked according to their relevance to the respective query. An accompanying set of document labels y^i = {y^i_1, · · · , y^i_m} indicates the relevance scores of the documents. The goal of LTR is essentially to learn a ranking function f that takes as input a pair (x^i, x^i_j) and outputs a relevance score f(x^i, x^i_j) that can then be used to rank the j-th item for query i. There are several ways to train LTR algorithms, but we choose to introduce the framework here using the point-wise approach. We can treat each query-item pair (x^i, x^i_j) as an independent instance and train the model with the objective of minimizing the mean squared error between the estimated scores and the actual relevance scores, expressed formally as:

\[ \mathcal{L}_{\text{pointwise}} = \sum_{i,j} \big( f(x^i, x^i_j) - y^i_j \big)^2. \tag{89} \]
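Under the rebalance-as-query analogy, the point-wise objective of Equation 89 reduces to least squares over (rebalance, asset) pairs. The sketch below uses synthetic features and decile labels; all names and the data-generating process are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic (rebalance, asset) pairs: one feature vector per pair, with the
# asset's realised decile (0-9) as the relevance label y_j^i.
X = rng.normal(size=(400, 3))
beta = np.array([2.0, -1.0, 0.5])
labels = np.clip(np.floor(X @ beta + 5.0), 0, 9)

# Point-wise LTR (Equation 89): fit f by least squares on individual pairs.
f_hat, *_ = np.linalg.lstsq(X, labels, rcond=None)

scores = X @ f_hat
order = np.argsort(-scores)   # descending: best candidates for the long book
```

Assets ranked highest by the learned scores should, on average, carry higher decile labels than those ranked lowest, which is the property the selection step relies on.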
The studies by Poh et al. (2021a); Poh, Lim, Zohren, and Roberts (2021b, 2021c); Poh, Roberts, and Zohren (2022) adopt the concept of Learning to Rank and introduce a framework for integrating LTR models into cross-sectional trading strategies. To apply this framework to momentum strategies, we can equate each query to a portfolio rebalancing event. In this analogy, each associated document and its corresponding label can be viewed as an asset and its designated decile for the next rebalance. This decile is based on a performance metric, typically returns.
Figure 30 illustrates a schematic representation of this adaptation. Following this framework, for the training process, let B = {x^1, · · · , x^N} represent a sequence of monthly rebalancing events. At each rebalancing point x^i, there is a collection of equity instruments x^i = {x^i_1, · · · , x^i_m} along with their corresponding assigned deciles y^i = {y^i_1, · · · , y^i_m}. With all rebalance-asset pairs, we can form the training set {(x^i, x^i_j), y^i_j}_{i=1}^{N} to obtain a trained function g that produces scores. During testing, we inject out-of-sample data to obtain scores and then rank these scores to select securities. Accordingly, we construct portfolios that invest in the assets projected to deliver the highest returns and divest from those expected to generate the lowest.
As a concrete example, Poh et al. (2021a) applied this approach to actively trade companies listed on the NYSE from 1980 to 2019. At each rebalancing interval, 100 stocks – representing 10% of all tradable stocks – were selected and actively traded according to multiple different LTR algorithms. These include RankNet (RNet), LambdaMART (LM), ListNet (LNet), and ListMLE (LMLE). To verify the effectiveness of LTR, they include four benchmarks: a random selection of stocks (Rand), classical time-series momentum strategies that use past returns (TM) or MACD signals (MACD) to calculate scores, and a regress-then-rank technique that uses an MLP network (MLP).
The out-of-sample effectiveness of these different strategies can be evaluated by the results shown in Figure 31 and Table 8. Figure 31 displays the strategies' cumulative returns, while Table 8 presents the strategies' principal financial performance indicators. To enhance the comparability of each strategy's performance, the overall returns are standardized to an annualized 15% portfolio-level volatility target for all strategies. In this analysis, all returns are calculated without accounting for transaction costs, focusing on the models' inherent predictive capabilities. Both the graphical data and the statistical metrics clearly indicate that the LTR algorithms surpass the benchmark group across all performance criteria, with LambdaMART achieving the highest scores on the majority of the evaluated metrics.

Table 8 (benchmarks)

             E(R)   Std(R)  DD     MDD    Sharpe  Sortino  Calmar  % of +Ret  Ave.P/Ave.L
Rand         0.024  0.156   0.106  0.584  0.155   0.228    0.042   54.5%      0.947
TM           0.092  0.167   0.106  0.328  0.551   0.872    0.281   58.2%      1.114
MACD         0.112  0.161   0.097  0.337  0.696   1.157    0.333   59.1%      1.184
MLP          0.044  0.165   0.112  0.641  0.265   0.389    0.068   55.1%      1.001
More generally, the ranking algorithms notably enhance profitability, demonstrating both higher expected returns and higher percentages of positive returns. Even the least effective LTR model significantly surpasses the top reference benchmark across all evaluated metrics. Although all models have been adjusted to maintain similar levels of volatility, LTR-based strategies tend to experience fewer severe drawdowns and reduced downside risks. Moreover, the leading LTR model achieves substantial improvements across various performance indicators. This pronounced difference in performance highlights the value of learning cross-sectional rankings, as it can lead to better results for momentum strategies.
deviation and Value at Risk (VaR), which have long been employed to capture
both the volatility and potential downside of asset returns. These metrics are
foundational concepts for understanding market risk and inform a wide range
of decision-making processes in financial institutions. Next, we delve into classical models for volatility forecasting – covering established approaches such as the HAR (Heterogeneous Auto-Regressive) model – that provide financial practitioners with insights into how market fluctuations evolve over time.
While these methods remain useful, they may not always capture the complex
structures present in modern, high-frequency financial market data. Conse-
quently, we also introduce deep learning models for volatility forecasting,
emphasizing how neural networks can learn intricate, nonlinear dynamics from
large datasets in ways that traditional econometric tools often cannot.
Following this discussion of measuring and forecasting risk, we shift our
focus to portfolio optimization strategies. The essence of portfolio optimization
is to find an asset allocation that optimizes for some investment performance
criteria. For example, a portfolio manager might aim to minimize volatility
or maximize the Sharpe ratio. The main benefit of investing in a portfolio
is the diversification which decreases overall volatility and increases return
per unit risk. We continue by exploring the classic mean–variance framework
pioneered by Markowitz (1952), which remains a foundational element of
modern portfolio theory. This approach weighs expected returns against the
portfolio’s variance (risk), enabling investors to construct an efficient frontier
of optimal risk–return trade-offs. We then discuss maximum diversification,
a strategy designed to spread risk across diverse assets or factors, and con-
sequently achieve a more stable performance profile across varying market
conditions.
Moving beyond these traditional methods, we next demonstrate how deep learning algorithms can be applied to portfolio optimization. Based on two works, C. Zhang, Zhang, Cucuringu, and Zohren (2021) and Z. Zhang, Zohren, and Roberts (2020), we present an end-to-end approach that leverages deep learning models to optimize a portfolio directly. Instead of predicting returns
or constructing a covariance matrix of returns, the model directly optimizes
portfolio weights for a range of objective functions, such as minimizing vari-
ance or maximizing the Sharpe Ratio. Deep learning models are adaptable to
portfolios with distinct characteristics, allowing for short selling, cardinality,
maximum position, and several other constraints. All constraints can be encap-
sulated in specialized neural network layers, enabling the use of gradient-based
methods for optimization.
By bringing risk measurements, volatility forecasting, and portfolio optimi-
zation together in one section, we underscore the integral connection between
these topics. Accurately forecasting volatility is vital not only for effective risk
management but also for informing the dynamic allocation of assets in a port-
folio. When market volatility patterns are well understood, practitioners can
align their portfolio strategies in a way that accounts for fluctuating levels of
uncertainty. In other words, volatility forecasting is not merely an isolated exercise; rather, it provides a predictive lens through which portfolio decisions can be
refined. Combining these topics ensures a holistic perspective, from quanti-
fying and forecasting market risk to deploying those insights in a systematic
strategy that seeks to balance returns and risk.
confidence level. In other words, with probability (for example) 95%, losses
will not exceed the VaR figure. When VaR is used as a risk constraint, it
effectively places a limit (threshold) on the acceptable level of potential
loss. CVaR extends this by indicating the expected loss beyond the VaR.
Thus, CVaR focuses specifically on the distribution’s tail, capturing the
average magnitude of losses that surpass the VaR.
• Downside Risk: This measures the potential for loss in adverse scenarios,
focusing on negative returns. Metrics like the Sortino ratio, which is the
ratio of the asset’s return relative to its downside risk, are particularly useful
in this context.
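Historical VaR and CVaR can be estimated directly from an empirical sample of returns. The sketch below is a minimal numpy implementation; the function name is illustrative:

```python
import numpy as np

def var_cvar(returns, level=0.95):
    """Historical VaR and CVaR (expected shortfall) at the given level.

    VaR is the loss threshold exceeded only with probability 1 - level;
    CVaR is the average loss conditional on exceeding that threshold."""
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, level)
    cvar = losses[losses >= var].mean()
    return var, cvar
```

By construction CVaR is at least as large as VaR, since it averages only the losses in the tail beyond the VaR threshold.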
These risk metrics are popular indicators in both academia and industry.
Thus, a good understanding of these metrics provides us with the foundation for
managing our portfolio risks. Note that risk measurement is not a set-and-forget
process. Continuous monitoring is vital as market conditions, asset correlations,
and volatilities evolve. Consistent reviews are imperative to maintain align-
ment between the portfolio and an investor’s risk preferences and objectives.
By applying diverse risk metrics and regularly monitoring and adjusting their
holdings, investors can improve the likelihood of meeting their financial targets
while effectively managing their risk exposure.
where v_{d,t}, v_{w,t}, v_{m,t} represent the daily, weekly, and monthly volatilities, respectively, and the β coefficients are the parameters that need to be inferred. The linear HAR model is straightforward to estimate and interpret, which makes it a valuable tool for capturing the dynamics of financial market volatilities.
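Because the HAR model is linear, it can be fitted by ordinary least squares. The sketch below assumes the common 1/5/22-day daily, weekly, and monthly components; the helper name and alignment choices are illustrative:

```python
import numpy as np

def har_fit(rv):
    """Fit the linear HAR model v_{d,t+1} = b0 + b_d*v_d + b_w*v_w + b_m*v_m
    by OLS, where rv is a 1-D series of daily realised volatilities and the
    weekly/monthly components are trailing 5- and 22-day averages."""
    rv = np.asarray(rv, dtype=float)
    v_w = np.convolve(rv, np.ones(5) / 5, mode="valid")    # weekly averages
    v_m = np.convolve(rv, np.ones(22) / 22, mode="valid")  # monthly averages
    # At day t (t >= 21) the regressors are rv[t], the weekly average ending
    # at t (index t - 4) and the monthly average ending at t (index t - 21).
    t = np.arange(21, len(rv) - 1)
    X = np.column_stack([np.ones(len(t)), rv[t], v_w[t - 4], v_m[t - 21]])
    y = rv[t + 1]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta   # [b0, b_d, b_w, b_m]
```

For a constant volatility series, the fitted coefficients reproduce that constant exactly at the one-step-ahead prediction.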
While the HAR model is an effective method, the HEAVY model is designed
to forecast volatility using high-frequency data, which provides more granu-
lar insights into the market’s behavior compared to traditional low-frequency
data. HEAVY models are commonly used for modeling volatility from high-
frequency data like tick-by-tick or minute-by-minute price movements. In order
to estimate volatility from high-frequency data, we introduce the notion of
realized volatility (RV_t). A common estimate for realized volatility is:

\[ RV_t = \sqrt{\sum_{i=1}^{m} r_{t_i}^2}, \tag{91} \]

where r_{t_i} are the high-frequency returns and m represents the number of high-frequency intervals within a day (e.g., minutes). RV is used as a measure of the total variance in asset prices over a specific time interval, and the idea is that volatility can be obtained from the squared returns of the high-frequency price series. We can then express the HEAVY model as the following:

\[ \sigma_{t+1}^2 = \omega + \alpha \, RV_t + \beta \, \sigma_t^2, \tag{92} \]
where the realized volatility is used to capture the short-term volatility from
high-frequency data and the lagged volatility component is used for the long-
term trends. The benefits of leveraging high-frequency data allow for more
accurate volatility estimation, and HEAVY models are well-equipped to handle
the phenomenon of volatility clustering.
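The realized volatility estimator of Equation 91 takes only a few lines; the sketch below (function name illustrative) works from intraday log returns:

```python
import numpy as np

def realized_vol(intraday_prices):
    """Daily realised volatility (Equation 91): the square root of the sum
    of squared high-frequency log returns within the day."""
    p = np.asarray(intraday_prices, dtype=float)
    r = np.diff(np.log(p))          # high-frequency returns r_{t_i}
    return np.sqrt(np.sum(r ** 2))
```

As a sanity check, a price path with a constant log return c over m intervals gives RV = c√m.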
However, microstructure noise means the HEAVY model remains sensitive to certain market effects, such as bid-ask spreads and the choice of sampling frequency. This noise is nontrivial to eliminate and can affect the accuracy of volatility estimates and predictions.
\[ \max_{w} \; r_p = w^{T} r, \quad \text{s.t.} \quad \sigma_p^2 = w^{T} \Sigma w = \sigma_{p,0}^2, \quad w^{T} \mathbf{1} = 1, \tag{93} \]

where Σ is the covariance matrix of asset returns, with σ_{ij} representing the covariance between assets i and j. σ_{p,0}^2 denotes a target level of risk and 1 is a vector of ones ensuring that the weights sum to 1 (i.e., a fully invested portfolio). In solving the constrained maximization problem outlined earlier, one determines the optimal portfolio weights that maximize returns while keeping risk at a given level. An alternative formulation of the mean-variance problem focuses on minimizing risk for a target level of expected return:

\[ \min_{w} \; \sigma_p^2 = w^{T} \Sigma w, \quad \text{s.t.} \quad w^{T} r = r_{p,0}, \quad w^{T} \mathbf{1} = 1, \tag{94} \]
where r_{p,0} denotes a target expected return level. To solve this formulation of the constrained optimization problem, we introduce Lagrange multipliers λ and γ for the constraints. The Lagrangian L is:

\[ \mathcal{L}(w, \lambda, \gamma) = w^{T} \Sigma w - \lambda \, (w^{T} r - r_{p,0}) - \gamma \, (w^{T} \mathbf{1} - 1). \tag{95} \]

To find the optimal solution, we take the partial derivatives of L with respect to w, λ, and γ and set each of them to zero:

\[ \frac{\partial \mathcal{L}}{\partial w} = 2\Sigma w - \lambda r - \gamma \mathbf{1} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = r_{p,0} - w^{T} r = 0, \qquad \frac{\partial \mathcal{L}}{\partial \gamma} = 1 - w^{T} \mathbf{1} = 0, \tag{96} \]
where the solution obtained provides the optimal portfolio allocation that min-
imizes the portfolio risk for a given expected return. By setting the partial
derivatives equal to zero, we are essentially finding the point where the rate
of change of the objective function with respect to each asset weight is zero,
implying that the portfolio has reached an optimal balance between risk and
return. The Lagrange multiplier in this context represents the trade-off between
the expected return and the risk of the portfolio. It provides insight into how
much additional return can be achieved by increasing the overall level of risk in
the portfolio. The solution essentially tells us the proportion of wealth to allo-
cate to each asset in order to achieve the best risk-return trade-off, considering
both the covariance between asset returns and the constraints set.
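Since the first-order conditions in Equation 96 are linear in (w, λ, γ), the minimum-variance weights for a target return can be found with a single linear solve. A minimal numpy sketch, with illustrative names:

```python
import numpy as np

def min_var_weights(Sigma, r, r0):
    """Solve the KKT system of Equation 96:
       2*Sigma*w - lam*r - gam*1 = 0,  w'r = r0,  w'1 = 1."""
    n = len(r)
    ones = np.ones(n)
    K = np.zeros((n + 2, n + 2))
    K[:n, :n] = 2 * Sigma          # stationarity block
    K[:n, n] = -r
    K[:n, n + 1] = -ones
    K[n, :n] = r                   # return constraint
    K[n + 1, :n] = ones            # budget constraint
    rhs = np.concatenate([np.zeros(n), [r0, 1.0]])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]                 # optimal weights w
```

With two assets, the two constraints pin down the weights uniquely, which makes for an easy check of the solver.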
The strategy of maximum diversification is based on the premise that a
portfolio that diversifies across a wide range of assets will typically have a
lower risk than the sum of its individual components. Accordingly, the objec-
tive is to trade a selection of assets that effectively lowers unsystematic risk,
thereby minimizing the overall portfolio’s volatility. As a result, maximum
diversification considers the correlations between assets rather than just their
individual risks. By holding assets with low or negative correlations, the aggre-
gate risk of a portfolio can be meaningfully reduced. A central measure for
this approach is the diversification ratio (DR), defined as the ratio between the sum of the individually weighted asset volatilities and the total volatility of the portfolio:

\[ DR(w) = \frac{w^{T} \sigma}{\sqrt{w^{T} \Sigma w}}, \tag{97} \]

where σ is the vector of individual asset volatilities. The maximum diversification portfolio maximizes this ratio:

\[ \max_{w} \; \frac{w^{T} \sigma}{\sqrt{w^{T} \Sigma w}}, \quad \text{s.t.} \quad w^{T} \mathbf{1} = 1, \tag{98} \]
where 1 is a vector of ones, ensuring the weights sum to 1. The optimization problem is a fractional programming problem due to the ratio in the objective function. We can simplify this by maximizing the numerator while holding the denominator constant. This reformulation constrains the portfolio variance to be constant (usually set to 1) and focuses on maximizing the weighted average volatility. One approach to tackle this optimization problem is to again employ Lagrange multipliers:

\[ \mathcal{L}(w, \lambda, \gamma) = w^{T} \sigma - \lambda \, (w^{T} \Sigma w - 1) - \gamma \, (w^{T} \mathbf{1} - 1), \tag{99} \]
where λ and γ are Lagrange multipliers for the constraints. We optimize the portfolio weights by differentiating with respect to w, λ, and γ:

\[ \frac{\partial \mathcal{L}}{\partial w} = \sigma - 2\lambda \Sigma w - \gamma \mathbf{1} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = w^{T} \Sigma w - 1 = 0, \qquad \frac{\partial \mathcal{L}}{\partial \gamma} = w^{T} \mathbf{1} - 1 = 0. \tag{100} \]
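To make the diversification ratio concrete, a small numpy helper (name illustrative): for two uncorrelated assets with equal volatility and equal weights the ratio is √2, while perfectly correlated assets give 1.

```python
import numpy as np

def diversification_ratio(w, Sigma):
    """DR (Equation 98 objective): weighted sum of individual volatilities
    divided by total portfolio volatility."""
    sigma = np.sqrt(np.diag(Sigma))          # individual asset vols
    return (w @ sigma) / np.sqrt(w @ Sigma @ w)
```

The helper makes the intuition explicit: DR rises as correlations fall, since the denominator (portfolio volatility) shrinks while the numerator is unchanged.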
Intuitively, an investor allocates capital across a variety of assets that have
low or negative correlations with each other to achieve maximum diver-
sification (MD). Following the same logic, this strategy aims to minimize
unsystematic risk, capitalizing on the unique price movements of each asset.
The key advantage of MD is risk reduction without a proportional decrease in
potential returns, which is particularly appealing during turbulent market con-
ditions. This diversification can help protect against significant downturns in
any single investment or asset class as negatively correlated assets are unlikely
to all move in the same direction.
Although MPT and MD are popular, their underlying assumptions have
been widely questioned and frequently do not hold true in real financial mar-
kets. In particular, MPT presupposes normally distributed asset returns and
assumes that investors are rational, risk-averse, and chiefly focused on mean
and variance. Nevertheless, financial datasets frequently exhibit highly erratic behavior, making them prone to deviating from these assumptions, particularly during episodes of sharp market fluctuations (see, for instance, Cont & Nitions, 1999; Z. Zhang, Zohren, & Roberts, 2019b). Additionally, MPT assumes a
static view of risk and return, ignoring the dynamic nature of asset perfor-
mance and market conditions. The estimates of expected returns, variances, and
covariances are also very difficult to obtain, and small errors in these estimates
can lead to significant discrepancies in the model results and consequently over-
or under-allocation to certain assets.
where r_{p,t+1} = w_t^{T} r_{t+1} represents the portfolio return, while r_t = (r_{1,t}, · · · , r_{n,t})^{T} denotes the vector of returns for n assets at time t, with r_{i,t} referring to the return of asset i (i = 1, · · · , n). The index t can be any chosen interval, such as minutes, days, or months. λ is the risk aversion rate that controls the trade-off between returns and risks, and w_t = (w_{1,t}, · · · , w_{n,t})^{T} represents the portfolio weights that need to be optimized. In order to obtain w_t, we adopt a deep neural network f that outputs portfolio weights:

\[ w_t = f(X_t), \tag{102} \]

where X_t denotes the inputs to the network. Figure 32 depicts the proposed end-to-end framework, which contains two main components: the score block and the portfolio block.
The score block maps inputs to portfolio scores. Inputs can be any market information that might be useful for adjusting portfolio weights, for example, past returns up to lag p, (r_{t−p}, . . . , r_t), or momentum features such as MACD. More specifically, a neural network maps the input data to fitness scores for each asset. Higher fitness scores indicate a greater likelihood of receiving larger portfolio weights. We denote this network as f_scores and the resulting fitness scores as:

\[ s_t = f_{\text{scores}}(X_t), \tag{103} \]
The portfolio block then converts these scores into portfolio weights, and the network is trained against a chosen objective, for example, minimizing the variance of portfolio returns:

\[ \min_{w_t} \; \mathrm{Var}(r_{p,t+1}), \tag{105} \]
The matrix Λ^t used in the NeuralSort construction is defined as:

\[ \Lambda^{t}_{i,j} = (n + 1 - 2i)\, s_{j,t} - \sum_{m} |s_{j,t} - s_{m,t}|. \tag{110} \]
According to previous works (Blondel et al., 2020; Cuturi, Teboul, & Vert, 2019; Grover, Wang, Zweig, & Ermon, 2018; Ogryczak & Tamir, 2003), the permutation matrix Π(s_t) can be constructed as:

\[ \Pi(s_t)_{i,j} = \begin{cases} 1, & \text{if } j = \operatorname{argmax}(\Lambda^{t}_{i,:}), \\ 0, & \text{otherwise.} \end{cases} \tag{111} \]

Since the argmax function is not differentiable, Grover et al. (2018) introduce a NeuralSort layer that substitutes argmax with softmax, producing a differentiable approximation of Π(s_t):

\[ \widehat{\Pi}(s_t)_{i,:} = \operatorname{softmax}(\Lambda^{t}_{i,:}). \tag{112} \]
Thus Equation (109) becomes differentiable, allowing for the use of standard
gradient descent.
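A numpy sketch of the NeuralSort relaxation in Equations 110 to 112. The temperature parameter tau is standard in NeuralSort implementations but is an addition relative to Equation 112 as printed:

```python
import numpy as np

def neural_sort(s, tau=1.0):
    """Differentiable relaxation of the sorting permutation (Eqs. 110-112):
    build Lambda, then take a row-wise softmax in place of argmax."""
    s = np.asarray(s, dtype=float).reshape(-1)
    n = len(s)
    pair_dist = np.abs(s[:, None] - s[None, :]).sum(axis=1)  # sum_m |s_j - s_m|
    i = np.arange(1, n + 1)
    Lam = (n + 1 - 2 * i)[:, None] * s[None, :] - pair_dist[None, :]
    e = np.exp((Lam - Lam.max(axis=1, keepdims=True)) / tau)  # stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

Each row sums to one, and the row-wise argmax of the soft matrix recovers the descending sort order of the scores.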
Constraint (4) Leverage, i.e., ∥w_t∥_1 = L: In line with Equation (104), we scale the overall exposure of the positions by a factor of L:

\[ w_{i,t} = L \times \operatorname{sign}(s_{i,t}) \times \frac{e^{|s_{i,t}|}}{\sum_{j=1}^{n} e^{|s_{j,t}|}}. \tag{113} \]
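Equation 113 can be written as a small weight layer; with all scores nonzero, the absolute weights sum exactly to L:

```python
import numpy as np

def leverage_weights(s, L=1.0):
    """Equation 113: sign of the score times a softmax over absolute
    scores, scaled so total exposure ||w||_1 equals L (nonzero scores)."""
    s = np.asarray(s, dtype=float)
    e = np.exp(np.abs(s) - np.abs(s).max())   # numerically stable softmax
    return L * np.sign(s) * e / e.sum()
```

Because the softmax outputs are positive and sum to one, multiplying by sign and L enforces the leverage constraint by construction, keeping the layer differentiable almost everywhere.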
Baselines
S&P 500 0.061 0.196 0.140 0.568 0.402 0.563 1.000 54.1%
EWP 0.130 0.212 0.148 0.548 0.682 0.973 1.000 54.6%
MD 0.439 0.239 0.141 0.519 1.641 2.785 0.599 54.8%
GMVP 0.080 0.081 0.059 0.408 0.992 1.360 0.257 56.4%
MSRP
MPT-LM 0.004 0.015 0.011 0.062 0.290 0.414 0.009 50.4%
MPT-MLP 0.008 0.027 0.019 0.140 0.299 0.424 0.036 51.5%
MPT-LSTM 0.014 0.017 0.011 0.043 0.858 1.259 0.014 52.0%
MPT-CNN 0.007 0.017 0.012 0.093 0.426 0.609 0.014 51.3%
E2E-LM 0.049 0.044 0.030 0.168 1.116 1.649 0.011 54.6%
E2E-MLP 0.044 0.026 0.016 0.073 1.688 2.657 0.008 55.2%
Constraints
E2E-LSTM-MSRP-Long 0.368 0.197 0.125 0.253 1.691 2.666 0.767 56.6%
E2E-LSTM-MSRP-LEV 0.321 0.112 0.068 0.151 2.540 4.203 0.132 57.6%
E2E-LSTM-MSRP-CAR 0.032 0.056 0.039 0.167 0.588 0.844 -0.011 52.0%
E2E-LSTM-MSRP-MAX 0.057 0.021 0.012 0.026 2.683 4.459 0.021 57.8%
The third block (other objective functions) indicates the results for the application of deep learning to different objective functions, including the global minimum variance portfolio (GMVP) and the mean-variance problem in Equation 101. The final section (Constraints) explores the influence of multiple constraints by constructing a strictly long portfolio aimed at maximizing the Sharpe ratio (MSRP-LONG), a leveraged portfolio (LEV) with L = 5, a cardinality-constrained strategy (CAR) that selects 20% of the instruments, going long the top decile and shorting the bottom decile, and lastly a portfolio that imposes a 5% maximum position limit on each instrument (MAX).
In the second block of Table 9, the end-to-end (E2E) deep learning methods
outperform both the MPT and baseline models. The third block highlights how
varying objective functions influence model performance. Specifically, GMVP
not surprisingly provides the lowest variance. Additionally, adjusting the risk
aversion parameter λ in the mean-variance approach allows users to control
their preferred risk level – raising λ increases the penalty on risk, thereby
reducing variance. The final block presents results under different constraints,
demonstrating the framework’s flexibility. Users thus have the ability to cus-
tomize these constraints to align with their individual requirements and trading
conditions.
\[ v_{d,t+1} = \alpha + V_t \beta + H^{(l)} \gamma, \qquad H^{(l)} = \mathrm{GNN}(H^{(l-1)}, A) = \sigma(\tilde{A} H^{(l-1)} W^{(l)}), \tag{115} \]

where V_t = (v_{d,t}, v_{w,t}, v_{m,t}) ∈ R^{n×3} and H^{(0)} = V_t. We define W^{(l)} as the learnable weights, and σ denotes the ReLU activation function.
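A single graph-convolution step σ(ÃHW) from Equation 115 can be sketched as follows. The Element does not spell out the normalization Ã, so the symmetric normalization with self-loops used here, in the style of common GCN implementations, is an assumption:

```python
import numpy as np

def gnn_layer(H, A, W):
    """One graph convolution sigma(A_tilde @ H @ W) as in Equation 115,
    using A_tilde = D^{-1/2} (A + I) D^{-1/2} and a ReLU activation."""
    A_hat = A + np.eye(len(A))                  # add self-loops
    d = A_hat.sum(axis=1)                       # node degrees
    A_tilde = A_hat / np.sqrt(np.outer(d, d))   # D^{-1/2} A_hat D^{-1/2}
    return np.maximum(A_tilde @ H @ W, 0.0)     # ReLU
```

For two fully connected nodes with identity features and identity weights, each node's new embedding averages its own and its neighbor's features.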
GNNs possess several advantages for forecasting volatility. One key benefit
is that a GNN can model both the direct and indirect effects of asset inter-
actions. For example, when one asset experiences a large shock, it can cause
volatility to spill over to other assets in the network, even if those assets were
not directly affected by the initial event. This phenomenon is known as the
spillover effect and can reverberate between assets that are not directly related.
In other words, it describes the transmission of financial disturbances, such
as price movements, volatility shocks, or shifts in market sentiment, as they
propagate between a network of assets. GNNs are capable of modeling these
spillover effects, as they can incorporate a broader set of market dynamics that
traditional methods may miss.
In addition, GNNs can handle high-dimensional data efficiently. By leverag-
ing the graph structure, GNNs can learn from a vast array of asset interactions
without becoming overwhelmed by the dimensionality of the data. As a result,
GNNs can learn complex dependencies from historical data, making them more
adept at forecasting future volatility in a multivariate setting. Moreover, GNNs have the ability to adapt to the evolving relationships between assets, which allows them to respond to changing market conditions, an especially valuable trait in the fast-moving world of financial markets.
where H^{(l+1)} is the matrix of node embeddings at layer l + 1 with H^{(0)} = X, W^{(l)} is a trainable weight matrix, σ is the nonlinear activation function, and Ã is a normalized version of the adjacency matrix A. Depending on the purpose of the task, the final output embedding can vary.
Ekmekcioğlu & Pınar (2023) extend the framework introduced in Section 6.5
with graph layers to directly learn optimal asset allocations. By treating each
asset as a node and connections between assets as edges, they outline a frame-
work to capture intricate relationships that traditional models often overlook.
In this approach, GNNs are used as the primary tool for learning these relation-
ships and aggregate signals from each node’s neighbors to form more expres-
sive embeddings of each asset. The results indicate that GNN-based models
can provide better insights into how assets co-move and how certain market
events propagate through a network of financial assets. Moreover, graph-based
approaches allow the model to dynamically learn higher-order dependencies
among clusters of assets, rather than simply relying upon pairwise correlations
or static factor models.
Another interesting work by Korangi, Mues, and Bravo (2024) seeks to
capture the evolving relationships among hundreds of assets over extended
horizons. They elect to use Graph Attention Networks (GATs) to incorporate
dynamic information about how assets co-move and influence one another.
In such a framework, each asset is a node in a time-evolving graph, and the
adjacency matrix is periodically updated using rolling windows of returns or
other market signals. At each network snapshot, the GAT layer uses attention
mechanisms to assign weights to edges, so that connections with higher rele-
vance receive proportionally more information flow. The authors demonstrate
Figure 34 A graph built from a news network. Colors indicate that assets are
allocated to the same group.
not only forecasts future returns but also predicts the corresponding portfolio
weights:
where LMSE is the standard MSE loss that measures the discrepancy between
actual returns and predicted returns, and LDecision measures how inaccuracies
in predicted returns translate into suboptimal portfolio decisions. An input
embedding is then used to process data from multiple modalities, specifi-
cally time-series decomposition and LLM-enhanced semantic embeddings.
After that, several network layers are implemented to detect temporal patterns
with LLM-derived semantic embeddings and convert predictions into portfolio
weights. Finally, the hybrid loss function (Equation 118) is optimized to derive
forecasts and portfolio weights. The field of LLMs in finance is still in its early
stage with a limited number of published works. For a broad coverage of how
LLMs can be applied to quantitative finance, interested readers can refer to
Kong, Nie, et al. (2024).
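As a minimal sketch of such a hybrid objective, the two terms can be combined as follows. The softmax mapping from forecasts to portfolio weights and the weighting coefficient `lam` are illustrative assumptions, not the formulation of the cited work.

```python
import numpy as np

def softmax(x):
    # illustrative assumption: map raw return forecasts to long-only weights
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_loss(pred_returns, true_returns, lam=1.0):
    """Sketch of a hybrid loss L = L_MSE + lam * L_Decision."""
    # L_MSE: discrepancy between actual and predicted returns
    l_mse = float(np.mean((pred_returns - true_returns) ** 2))
    # L_Decision (illustrative): negative realized return of the
    # portfolio implied by the predictions
    weights = softmax(pred_returns)
    l_decision = float(-np.dot(weights, true_returns))
    return l_mse + lam * l_decision
```

With `lam = 0` the objective reduces to a pure forecasting loss; increasing `lam` penalizes forecasts whose errors lead to poor portfolio decisions even when their squared error is small.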
Traditional open-outcry pit trading had obvious limitations, such as restricted market access for remote participants and
slower dissemination of price information. These shortcomings eventually led
to the adoption of electronic trading systems.
The transition from traditional pit trading to electronic trading systems marked a pivotal transformation in financial markets, fundamentally altering the landscape of global finance and paving the way for modern trading
practices. This shift began in earnest during the late twentieth century as tech-
nological advancements made electronic trading feasible and attractive. One
of the earliest and most notable shifts occurred with the establishment of the
NASDAQ in the early 1970s. As the first electronic stock market, the NAS-
DAQ used computer and telecommunication technology to facilitate trading
without a physical trading floor. Around the same time, the New York Stock
Exchange (NYSE) introduced the Designated Order Turnaround (DOT) sys-
tem, which routed orders electronically to the trading floor, although they were
still executed via open outcry.
In the late 1980s and early 1990s, as computers became more powerful and
network technology more sophisticated, more exchanges began to explore elec-
tronic trading options. The London Stock Exchange (LSE) moved away from
face-to-face trading with the “Big Bang” deregulation of 1986, which included
the introduction of electronic screen-based trading. This shift was mirrored by
exchanges around the world, including the Toronto Stock Exchange (TSE) and
the Frankfurt Stock Exchange (FSE).
The development of Globex by the Chicago Mercantile Exchange (CME) in
1992 was another significant advancement. Globex was an electronic trading
platform intended for after-hours trading that would eventually become a 24-
hour worldwide digital trading environment. Similarly, EUREX, established
in 1998 as a result of the merger between the German and Swiss derivatives
exchanges, was among the first to go fully electronic, setting a precedent
for derivatives trading globally. The adoption of electronic trading and the
Limit Order Book (LOB) system revolutionized market dynamics. With the
ability to process high volumes of transactions at unprecedented speeds, trad-
ing became faster, more efficient, and more accessible. Moreover, electronic
trading reduced the costs associated with trading and increased transparency
by making market data widely available. It also democratized market access,
enabling more participants to engage from remote locations.
Today, nearly all major stock and derivatives exchanges operate electroni-
cally. The transition has not only altered how trades are executed but also how
markets are monitored and regulated. Advanced algorithms and high-frequency
trading strategies that rely on microsecond advantages in electronic trading
environments have become prevalent, prompting ongoing discussions about
market fairness and stability. In order to digitize trading, every verbal bid and
ask needs to be converted into digital orders that can be entered into the LOB.
Each trader’s shouts and hand signals become electronic messages that spec-
ify the quantity, price, and conditions of trades. The electronic LOB system
then aggregates these orders, organizing them by price level and quantity. The book continuously updates as new orders come in, orders are modified,
and trades are executed. This shift also enhances transparency by providing all
market participants with a detailed real-time view of market activity and depth,
something that was not previously possible in the chaotic environment of the
trading pit.
Modern exchanges can generate billions of such messages in a day. The
high resolution and volume of this data enable deep learning models to discern
intricate patterns and dependencies that might be invisible in lower-frequency
data. Next, we will give a detailed description of high-frequency microstruc-
ture data. This will include an exploration of the inner working mechanism of
exchanges and the aggregation of individual order messages into limit order
books, which reflect supply and demand at the microstructure level. By lever-
aging such large datasets, we have numerous opportunities across various
financial applications, such as generating predictive signals that drive algorith-
mic trading decisions, optimizing trade execution strategies, and even creating
advanced generative models that can simulate entire exchange markets. Such
simulations can be used with reinforcement learning algorithms to design better
trading strategies, accounting for market impact, fill rate, and market anoma-
lies. Consequently, high-frequency microstructure data is not just facilitating
more informed decisions but is also a key component of many innovations in
financial technologies.
Market-by-order (MBO) data records individual order messages, including order prices, order volumes (sizes), order directions (sides), order types (a market order, a
limit order, etc.), order IDs which serve as a unique and anonymous identifier
for each individual order, and actions that describe the specific instruction of
a trader (buying, selling, or canceling an order). Table 10 shows a snapshot of
sequences of MBO data that contains essential information. For simplicity, we
omit some nonessential auxiliary information.
Figure 35 presents a snapshot of the LOB at a given time t which illustrates
the collection of all currently active limit orders. When a trader places orders, a
market order is matched immediately with an existing, resting order, whereas a
limit order enables traders to specify the worst acceptable price and the quantity they wish to transact. These limit orders remain active until executed or canceled. Once an exchange has received a limit
order, it will place the order at the appropriate position within the existing LOB.
The incoming MBO data continuously alters the LOB and a new snapshot of
the LOB is formed whenever it gets updated.
A LOB consists of two primary types of orders: bids and asks. A bid order
signifies a willingness to purchase an asset at a specified price or lower, while
an ask order indicates an intention to sell an asset at a particular price or higher.
As shown in Figure 35, bids or asks have prices P(t) and sizes (volumes) V(t).
Each rectangle in the figure represents a single order, with its size indicated by the rectangle's height. Therefore, each level of a LOB is an ordered queue of
all limit orders at that specific price level.
Figure 36 illustrates how a limit order book evolves and demonstrates the
impact of an MBO message on the existing LOB. For instance, at the top of
Figure 36, a new limit order (ID=46280) is added to the ask side of the order
book with a price of 70.04 and a size of 7580. This order addition updates the
order book by placing the new order at the corresponding price level. Similarly,
the LOB is altered when there is a cancellation (as shown in the middle top
figure), a partial cancellation (middle bottom figure), or when a market buy
order is executed (bottom figure).
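The event types just described can be mimicked with a minimal order-book sketch. The class below is a toy: the field names and matching logic are simplified assumptions, not any exchange's actual protocol, but it shows how MBO messages keyed by order ID induce the aggregated Level-2 and Level-1 views discussed later.

```python
from collections import defaultdict

class MiniLOB:
    """Toy limit order book keyed by order ID (MBO-style), with an
    aggregated per-price-level (Level-2) view. Illustrative only."""

    def __init__(self):
        self.orders = {}  # order_id -> (side, price, size)

    def add(self, order_id, side, price, size):
        # new limit order rests in the book at its price level
        self.orders[order_id] = (side, price, size)

    def cancel(self, order_id, size=None):
        # full cancellation, or partial cancellation if `size` is given
        side, price, cur = self.orders[order_id]
        if size is None or size >= cur:
            del self.orders[order_id]
        else:
            self.orders[order_id] = (side, price, cur - size)

    def depth(self, side):
        # Level-2 view: total resting volume at each price level
        levels = defaultdict(int)
        for s, price, size in self.orders.values():
            if s == side:
                levels[price] += size
        return dict(levels)

    def best(self, side):
        # Level-1 view: best bid (highest) or best ask (lowest)
        prices = [p for s, p, _ in self.orders.values() if s == side]
        if not prices:
            return None
        return max(prices) if side == "bid" else min(prices)
```

For instance, adding order 46280 on the ask side at 70.04 for 7580 shares, as in Figure 36, updates the ask-side depth at that price level; a subsequent partial cancellation reduces it.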
In practice, we can obtain high-frequency microstructure data by subscrib-
ing to market exchanges. Exchanges typically offer data across three tiers:
Level 1, Level 2, and Level 3. Each tier provides progressively more detailed
information and capabilities, with corresponding subscription costs:
• Level 1 Data: This tier comprises the price and volume of the latest trade,
along with the current best bid and ask prices, which is commonly referred
to as quote data.
• Level 2 Data: This tier supplies LOB data, providing more comprehensive
information than Level 1 by displaying bid and ask prices along with their
respective volumes across multiple deeper levels of the order book.
• Level 3 Data: This tier goes beyond Level 2 by providing unaggregated
details of bids and asks placed by individual traders (MBO data), delivering
the most granular view of market activity.
The choice of which data source to use depends on the specific application
or analysis being conducted. Each tier of market data offers unique advantages
and levels of detail suitable for different purposes. LOB data, typically pro-
vided at Level 2, aggregates the total available quantities at each price level
in the market. This aggregated view gives insight into the overall demand and
supply dynamics at a microstructure level, helping analysts assess liquidity,
price stability, and potential market impact. However, LOB data lacks informa-
tion about individual orders, focusing instead on summarized market activity.
In contrast, MBO data, available at Level 3, provides granular details about
individual market participants’ behaviors. It includes unaggregated bids and
asks, along with unique order identifiers. This level of detail enables a deeper
understanding of queue positions, order prioritization, and the trading strate-
gies employed by participants. MBO data is especially valuable for applications
that require precise modeling of order flow dynamics, such as market impact
analysis, execution optimization, and algorithmic trading. By combining LOB
and MBO data, it is possible to gain both macro and micro views of the mar-
ket, allowing for more comprehensive analyses tailored to the needs of specific
trading strategies or research objectives.
A key advancement in this line of research is the application of CNNs to directly predict stock prices from LOB data. To do so, CNNs, traditionally successful in image processing, are adapted to handle the structured
time-series data of LOBs. By treating the LOB as a multidimensional array,
CNNs can learn spatial hierarchies and patterns within the order book that are
predictive of future price movements. This approach leverages the depth of data
available, capturing subtle yet critical shifts in market sentiment that might be
indicative of future trends. Other studies (Z. Zhang, Zohren, & Roberts, 2019a)
have shown that CNNs can outperform classical statistical models and other
machine learning methods in predicting short-term price changes, providing
traders with a powerful tool for making more informed decisions.
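Treating the LOB as a multidimensional array can be sketched as follows. The ten-level layout with price/volume channels is a common convention in this literature, but the exact ordering below is an assumption, and the snapshot format is a made-up stand-in.

```python
import numpy as np

def lob_to_tensor(snapshots, levels=10):
    """Stack LOB snapshots into a (time, levels, 4) array with
    channels [ask_price, ask_size, bid_price, bid_size], an
    image-like input that a CNN can convolve over."""
    T = len(snapshots)
    x = np.zeros((T, levels, 4))
    for t, snap in enumerate(snapshots):
        for lvl in range(levels):
            ap, av = snap["asks"][lvl]   # (price, size) at this ask level
            bp, bv = snap["bids"][lvl]   # (price, size) at this bid level
            x[t, lvl] = [ap, av, bp, bv]
    return x
```

Convolutional filters applied across the level and channel dimensions of such a tensor can then pick up spatial patterns in the book, while the time dimension carries the sequence for recurrent or temporal layers.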
Interestingly, the work of Sirignano and Cont (2018) has uncovered universal
features of price formation in limit order books. By analyzing vast amounts of
LOB data across different assets and markets, their models have identified com-
mon patterns and dynamics that govern price changes. These insights suggest
that despite the apparent complexity and noise within financial markets, there
are underlying principles and patterns that can be extracted through deep learn-
ing. The ability of deep learning models to distill these features from the data
not only enhances predictive accuracy but also provides a deeper understanding
of market mechanics.
In a more specialized context, Z. Zhang et al. (2019a) carefully designed a
deep network, termed DeepLOB, to predict price movements from LOB data
using an architecture that combines convolutional filters and LSTM modules.
Convolutional filters are utilized to capture the spatial patterns of the LOB,
while LSTM modules are employed to model longer-term temporal dependen-
cies. This proposed network continues to achieve state-of-the-art performance,
serving as a benchmark and inspiring a wide range of studies and applications
in financial modeling and trading. We implement DeepLOB for a regression
problem and attach the code script in Listing 3 in Appendix D.
including tick size, predictive horizon, and order book depths. Prata et al. (2024)
also carefully compare the predictive power of fifteen cutting-edge DL mod-
els based on LOB data. For more interesting works, readers can refer to Bao,
Yue, and Rao (2017); Chen, Chen, Huang, Huang, and Chen (2016); Di Persio
and Honchar (2016); Dixon (2018); Doering, Fairbank, and Markose (2017);
Fischer and Krauss (2017); Nelson, Pereira, and de Oliveira (2017); Selvin,
Vinayakumar, Gopalakrishnan, Menon, and Soman (2017); Tsantekidis et al.
(2017b, 2017a).
In a reinforcement learning (RL) setup, an agent observes a state (St ) from its environment and takes an action (At ) based on the observed information. This
action either leads to a reward (Rt ) or a penalty that indicates the goodness of the
chosen action. The agent then moves to the next state (St+1 ), and this procedure
continues until the environment concludes. Throughout, the agent's objective is to maximize the expected total rewards, E(Σt Rt ).
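The interaction loop just described can be sketched generically. The environment below is a made-up toy, not a specific library API; it only illustrates the state-action-reward cycle.

```python
import random

class ToyEnv:
    # hypothetical 5-step environment: reward +1 for action 1, else 0
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state S_0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 5
        return self.t, reward, done  # (S_{t+1}, R_t, terminal?)

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])   # a learned policy would choose here
    state, reward, done = env.step(action)
    total_reward += reward           # the agent maximizes E[sum of R_t]
```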
DRL combines the components of RL with deep neural networks to learn
complex state spaces and effective policies from high-dimensional inputs.
There are a range of DRL algorithms. Deep Q-Networks (DQNs) mark a
major advancement in reinforcement learning by integrating Q-learning princi-
ples with the robust function approximation abilities of deep neural networks.
Traditional Q-learning, which is a model-free reinforcement learning method,
depends on a Q-table to record and update Q-values for every state-action
combination. A Q-value estimates the expected discounted sum of future rewards obtained by taking a specific action in a given state and then following the current policy. However, in its traditional form,
this technique becomes unmanageable in environments with extensive or con-
tinuous state spaces because the memory and computational demands grow
exponentially.
DQNs address this challenge by using deep neural networks to approximate
the Q-value function, enabling them to process high-dimensional inputs such as
images or intricate market data. A key advancement in DQNs is the implementation of experience replay. This method stores past interactions in a replay buffer and randomly samples mini-batches of these experiences during training. By breaking the temporal correlations in the data, experience replay stabilizes the learning process and improves the algorithm's overall performance. Furthermore, DQNs incorporate a target network, which is
an intermittently updated replica of the Q-network. This target network pro-
vides consistent target values for training, thereby improving the stability and
The RL agent is trained with the objective of maximizing trading returns. The findings reveal that the agent formulates an
effective strategy for inventory management and order placement, surpassing a
heuristic benchmark trading strategy that employs the same signals. Figure 40
illustrates a 17-second segment from the testing period, comparing the baseline
strategy with the RL approach. The first two panels show the highest bid, low-
est ask, and mid-prices, alongside trading activities for buy orders (highlighted
in green) and sell orders (highlighted in red). Since the simulation encom-
passes the entire LOB, the influence of trading actions on bid and ask prices is
observable. The third panel depicts the progression of inventory positions for
both strategies, and the final panel displays the trading profits in USD over the
duration of the period.
The findings indicate that both strategies impact the prices within the LOB
by introducing new order flows into the market. These new orders interact
with existing ones, thereby influencing liquidity at the top bid and ask levels.
Throughout the examined timeframe, the baseline strategy experiences minor
losses attributed to frequent changes in its signals which alternate between
anticipating declining and rising future prices. This behavior results in aggres-
sive trading, causing the strategy to incur the spread cost with each transaction.
On the other hand, the RL strategy outperforms by employing a more subdued
approach. This minimizes the effects of market volatility while allowing the RL
strategy to effectively manage its positions. It trades prudently when exiting
long positions and makes strategic decisions when establishing new ones. In
Figure 41 The distribution of executed volume per time step, with the
horizontal axis representing the time step, vertical axis indicating the volume,
and columns corresponding to different execution strategies. The box plots
show the interquartile ranges, medians (marked by orange lines), means
(indicated by blue triangles), and the 10th and 90th percentiles (represented
by whiskers).
the latter part of the observed period, the RL strategy notably increases its pas-
sive buy orders (depicted as green circles in the second panel of Figure 40).
These orders are connected by green lines to their respective executions or
cancellations, with some actions occurring beyond the timeframe shown in the
figure.
To further present how different DRL algorithms affect execution paths,
we take an example from Schnaubelt (2022) that optimizes order place-
ments on cryptocurrency exchanges. Figure 41 illustrates the executed volume
across various time steps for four different strategies: submit-and-leave (S&L),
backwards-induction Q-learning (BQL), deep double Q-networks (DDQN),
and proximal policy optimization (PPO). Several consistent patterns are
observed in the average executed volume fractions. Firstly, a substantial portion
of the volume is typically executed in the final time step, which usually involves
completing any remaining volume through a market order. Secondly, when
analyzing the volume fractions within the first three time steps, the majority of
the execution generally occurs in the initial step. Thirdly, as the initial volume
v0 increases, the volume executed in the final time step also rises, while the
volume fraction executed in the earlier steps tends to decrease. These trends
can be attributed to the limited liquidity available during the initial time steps.
When comparing various execution strategies, it becomes apparent that
the S&L method handles a smaller portion of the volume within the first
three time steps compared to the deep reinforcement learning approaches PPO
and DDQN. Although the S&L strategy maintains a positive average volume
fraction, its median fraction is zero across all three initial time steps. In con-
trast, both DDQN and PPO agents exhibit similar execution patterns, with the
majority of the volume being carried out in the first time step.
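The submit-and-leave baseline can be sketched as follows. The fill model (a fixed per-step fill probability and cap) is a made-up assumption purely for illustration; real fills depend on queue position and incoming flow.

```python
import random

def submit_and_leave(v0, steps, fill_prob=0.3, max_fill=100, seed=0):
    """Executed volume per time step for an S&L-style schedule:
    passive fills may occur in the early steps, and a market order
    completes any remaining volume in the final step."""
    rng = random.Random(seed)
    remaining, schedule = v0, []
    for _ in range(steps - 1):
        # assumed fill model: partial passive fill with some probability
        fill = min(remaining, max_fill) if rng.random() < fill_prob else 0
        remaining -= fill
        schedule.append(fill)
    schedule.append(remaining)  # final market order clears the rest
    return schedule
```

Under this sketch, a larger initial volume v0 mechanically pushes more execution into the final market order, mirroring the pattern observed in Figure 41.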
they can simulate user preferences and generate tailored content suggestions,
improving user satisfaction and engagement. In gaming, generative models can
craft personalized environments and narratives suited to each player’s pref-
erences. By utilizing the power of generative models, developers can create
more customized and engaging experiences, boosting user satisfaction and
retention.
For high-frequency microstructure data, we can use generative models to
enhance simulations by generating realistic, high-fidelity data that is accu-
rately representative of complex financial markets. This is particularly useful
for modeling market impact as such interactions are difficult to simulate with
static historical data. Furthermore, we can use high-quality synthetic data to
study the problem of regime shifts, a notorious issue for financial time-series that often leads to overfitting and poor generalization. By improving the modeling of market dynamics, generative models enhance decision-making processes
and improve risk management.
The roots of generative modeling lie in traditional statistical methods, which
focus on modeling the underlying distributions of data. Some of the foundational approaches include Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs). GMMs represent data as a mixture of multiple Gaussian distributions, each capturing a different aspect of the data distribution.
GMMs are effective for clustering and density estimation but struggle with
high-dimensional data. HMMs are used to model sequential data, where the
data-generating process is assumed to follow a Markov process with hid-
den states. They are frequently applied in speech recognition and time-series
analysis, but they struggle to capture complex dependencies.
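Sampling from a GMM, the basic generative operation described above, takes only a few lines. The two-component "calm" and "volatile" regime parameters below are arbitrary illustrative values.

```python
import numpy as np

def sample_gmm(n, weights, means, stds, seed=0):
    """Draw n samples from a 1-D Gaussian mixture: pick a component
    according to its mixing weight, then sample from that Gaussian."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[comps], np.asarray(stds)[comps])

# e.g. a calm and a volatile return regime (illustrative parameters)
samples = sample_gmm(10_000, weights=[0.8, 0.2],
                     means=[0.0, 0.0], stds=[0.01, 0.05])
```

Even this toy mixture produces the fat-tailed marginal that a single Gaussian cannot, which is why GMMs are a natural baseline density model for returns.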
Advancements in deep learning algorithms have profoundly transformed the
generative modeling landscape over the past several years, shifting it from con-
ventional statistical approaches to advanced deep learning frameworks. This
progression has been fueled by the demand for models that are more pre-
cise, efficient, and capable of generating complex data. Neural networks, with
their proficiency in learning intricate representations, have been instrumental
in developing more robust and adaptable generative models.
There are several remarkable works that leverage the power of deep networks
to provide a new paradigm for generative modeling. Variational Autoencoders
(VAEs) introduced by Kingma and Welling (2013) combine principles from
Bayesian inference and neural networks. They use an encoder-decoder archi-
tecture to learn a probabilistic representation of data, enabling efficient gener-
ation of new samples. VAEs marked a significant step forward in generating
realistic data while providing a solid theoretical foundation.
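The core sampling step of a VAE, the reparameterization trick, can be sketched as follows; the encoder outputs `mu` and `log_var` are stand-ins here rather than the output of a trained network.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), which keeps the
    sampling step differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(q(z|x) || N(0, I)): the regularizer in the VAE objective
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The KL term is zero exactly when the encoder's posterior matches the standard normal prior, and grows as the latent codes drift away from it.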
The authors compare the return distributions of the generated data with those of the actual realized data over the span of 100
future messages. The findings demonstrate that the model effectively mirrors
the mid-price return distributions, even though these were not directly included
in the training loss function. The average returns exhibit no significant drift or
trend, and the shaded areas, representing the 95% confidence intervals of the
distributions, align closely.
To further test the authenticity of the generated data, returns are sampled
from the generative model, and the correlation is calculated between the generated returns r^g_{t+s} and the realized returns r_{t+s} for 100 future messages (s ∈ [1, . . . , 100]). As shown in the top of Figure 44, there exists a consistently positive correlation for both Google (ρ ≈ 0.1) and Intel (ρ ≈ 0.2). The lower
panel displays the corresponding p-values from t-tests evaluating the alterna-
tive hypothesis H1 : ρ > 0 against the null hypothesis H0 : ρ = 0. The dotted
line represents the 5% significance level. For the Google model, the p-values
remain at or near the 5% threshold for up to 80 future messages, whereas for
Intel, the correlations stay statistically significant for at least 100 messages. The
sustained positive correlation indicates directional forecasting power which
suggests new possibilities for alpha.
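The significance test described above can be sketched without a statistics package: the t-statistic for a sample correlation under H0: ρ = 0 follows directly from ρ and the sample size. The synthetic returns below are made up for illustration.

```python
import math
import numpy as np

def corr_t_stat(x, y):
    """Pearson correlation and the t-statistic used to test
    H1: rho > 0, via t = rho * sqrt(n - 2) / sqrt(1 - rho^2)."""
    rho = float(np.corrcoef(x, y)[0, 1])
    n = len(x)
    t = rho * math.sqrt(n - 2) / math.sqrt(1.0 - rho**2)
    return rho, t

# illustrative: realized returns weakly driven by the "generated" signal
rng = np.random.default_rng(0)
gen = rng.standard_normal(1000)
realized = 0.2 * gen + rng.standard_normal(1000)
rho, t = corr_t_stat(gen, realized)
```

A one-sided p-value then follows from the t distribution with n − 2 degrees of freedom; for large n, t above roughly 1.65 corresponds to significance at the 5% level.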
8 Conclusions
This final section concludes our exploration of the applications of deep learn-
ing to quantitative finance. It aims to summarize key insights from the Element
and discuss future opportunities and challenges in integrating these fields,
providing a foundation for future work.
The architectures covered range from simple fully connected layers to the more advanced attention mechanism, which is particularly effective in capturing long-range dependencies
within structured datasets.
Although deep learning has achieved significant advancements, deep net-
works frequently face issues such as overfitting, where models perform excep-
tionally well on training data but have difficulty generalizing to new, unseen
datasets. To mitigate this issue, this Element outlines a complete workflow
for implementing deep learning algorithms in quantitative trading. The work-
flow covers crucial stages, including data collection, exploratory data analysis,
and cross-validation methods specifically adapted for financial datasets. These
stages address key aspects like data distribution, stationarity, and the distinc-
tive characteristics of financial time-series. These considerations are critical
for creating models that achieve not only high accuracy but also robustness
and reliability for implementation in real-world trading environments.
The second part of the Element is dedicated to the application of deep learn-
ing algorithms to various financial contexts. It places a key focus on one of the
core tasks in quantitative trading: generating predictive signals. We explore a
range of deep learning architectures designed for this purpose, demonstrating
how these models can effectively forecast market movements. On top of this,
we delve into advanced applications, such as improving momentum trading
and cross-sectional momentum strategies. Additionally, we address portfolio
optimization by introducing methods that enable the direct optimization of port-
folio weights from market data. This end-to-end approach eliminates the need
for intermediate steps, such as estimating returns and working with covari-
ance matrices of returns, which are often difficult to implement in practical
scenarios.
We provide an in-depth examination of the operational dynamics of mod-
ern securities exchanges, illustrating the processes behind financial transactions
and the generation of high-frequency microstructure data, including order book
updates and trade executions. Furthermore, we analyze the unique attributes of
different asset classes, such as equities, bonds, commodities, and cryptocurren-
cies, highlighting the specific challenges and opportunities for applying deep
learning techniques effectively to each.
Looking ahead, one promising direction is the integration of alternative data sources, such as text, and techniques specific to those data types as potential sources of additional alpha. In Section 3, we look at how recent advances in NLP,
such as transformer-based models like BERT and GPT, have made it feasible
to extract nuanced information from unstructured textual data. Such methods
could be used to evaluate data from news articles, social media, and earnings
call transcripts to inform sentiment analysis and event prediction. Similarly,
computer vision models can be used to analyze visual patterns in images. Prac-
titioners could thus use satellite data, product shelves, or even weather imagery
to provide insights into supply chain activity or predict market trends.
Another interesting area of further research is the explainability of deep net-
works. As deep learning models become increasingly sophisticated, the lack of
interpretability poses challenges to understanding why a model makes specific
decisions. In quantitative trading, where financial stakes and regulatory scru-
tiny are high, explainable algorithms are essential for building trust in model
outputs and avoiding unintended biases. For trading strategies, explainability
should encompass not only technical factors but also ethical considerations. It
is important to ensure that algorithms do not exploit market inefficiencies in
ways that harm retail investors or contribute to systemic risks. For instance, on
May 6, 2010, the U.S. stock market underwent the Flash Crash, during which
the Dow Jones Industrial Average plummeted by nearly 1,000 points within
minutes before swiftly rebounding. This sudden decline was initiated by a sub-
stantial sell order executed by a mutual fund employing a trading algorithm
intended to reduce market impact. The algorithm indiscriminately offloaded a
large volume of E-mini S&P 500 futures contracts, ignoring prevailing prices
and market conditions. HFT algorithms quickly picked up on this activity,
starting a cascade of rapid-fire selling that spread across markets.
Interpretability has already been studied in academia, and methods like
SHAP (Shapley Additive Explanations), Integrated Gradients (IG), and LIME
(Local Interpretable Model-Agnostic Explanations) can be used to provide
insights into model behavior. SHAP assigns each feature a contribution score
for each prediction, indicating that feature's importance. In contrast, IG is an attribution-based method that assesses the impact of each input feature on the predicted output by summing the gradients along a path from a baseline to the input. LIME, in turn, takes an approximation approach, fitting a simpler surrogate model around each prediction to explain it locally. Despite their utility, these methods still
face significant challenges that limit their effectiveness in certain contexts. For
example, SHAP can be computationally expensive and LIME relies on local
approximations that may not accurately capture global model behavior. Addi-
tionally, these methods can struggle with capturing interactions among features
in time-series or nonlinear domains, leading to incomplete interpretations.
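For a model with only a few features, SHAP-style attributions can be computed exactly by enumerating feature coalitions, which makes the idea concrete. The tiny linear model below is a made-up stand-in, not a trading model.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values: each feature's weighted average marginal
    contribution when switched from its baseline value to its value in x."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # coalition weight |S|!(n-|S|-1)!/n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                z = list(baseline)
                for j in S:
                    z[j] = x[j]
                without_i = f(z)
                z[i] = x[i]
                with_i = f(z)
                phi[i] += w * (with_i - without_i)
    return phi

# toy linear model: attributions should recover each term's contribution
f = lambda z: 2 * z[0] + 3 * z[1] - z[2]
phi = shapley_values(f, x=[1, 1, 1], baseline=[0, 0, 0])
```

The exponential cost of this exact enumeration is exactly why practical SHAP implementations rely on sampling or model-specific approximations, which is one source of the computational expense noted above.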
Acronyms
ACF Autocorrelation Function.
AR Autoregressive Model.
ARMA Autoregressive Moving Average Model.
BERT Bidirectional Encoder Representations from Transformers.
BTC Bitcoin.
CAPM Capital Asset Pricing Model.
CBOE Chicago Board Options Exchange.
CDS Credit Default Swaps.
CME Chicago Mercantile Exchange.
CNNs Convolutional Neural Networks.
DDPG Deep Deterministic Policy Gradient.
DeFi Decentralized Finance.
DMNs Deep Momentum Networks.
DOT Designated Order Turnaround.
DQNs Deep Q-Networks.
DRL Deep Reinforcement Learning.
ETF Exchange-Traded Fund.
ETH Ethereum.
FCNs Fully Connected Networks.
FPR False Positive Rate.
FSE Frankfurt Stock Exchange.
FX Foreign Exchange Market.
GANs Generative Adversarial Networks.
GBM Geometric Brownian Motion.
GCNs Graph Convolutional Neural Networks.
GED Generalized Error Distribution.
GMMs Gaussian Mixture Models.
GNNs Graph Neural Networks.
GP Gaussian Process.
GRUs Gated Recurrent Units.
HAR Heterogeneous Autoregressive.
HEAVY High-Frequency-Based Volatility.
HL Huber Loss.
HMMs Hidden Markov Models.
IG Integrated Gradients.
IPOs Initial Public Offerings.
Appendix A
Different Asset Classes
We here introduce several key asset classes that are particularly relevant to the
topics discussed in this Element and are widely traded in financial markets.
These asset classes include equities, bonds, foreign exchange (FX), futures,
options, exchange-traded funds (ETFs), and cryptocurrencies, each of which
presents unique characteristics and potential applications of deep learning.
While our focus is on these prominent categories, it is important to note that
this list is by no means exhaustive. Financial markets encompass a broad range
of additional asset classes, such as real estate investment trusts (REITs) and
derivatives like spread-betting, each offering distinct challenges and appli-
cations. Our goal is to establish a basic comprehension of these key asset
classes, allowing readers to better understand the deep learning methods pre-
sented in the Element. Future exploration of other asset classes can further
enrich the contextual knowledge base and expand the practical scope of these
methodologies.
are removed from indices and databases, skewing performance analyses. Addi-
tionally, corporate actions such as dividends, stock splits, and mergers must
be properly accounted for in price series to avoid misinterpreting historical
data. Adjusting prices for these actions ensures that analyses and back-tests
accurately reflect the financial realities of investing in equities. Beyond single
stocks, indices such as the S&P 500 and Dow Jones Industrial Average aggre-
gate the prices of multiple equities to monitor the performance of specific areas
within equity markets. These indices are constructed and weighted in various
ways, for example price-weighted, equal-weighted, or market-cap-weighted,
depending upon the index-specific methodology. Indices serve multiple pur-
poses: They provide benchmarks for fund performance, offer insights into
market trends, and serve as tradable instruments themselves.
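As a sketch of why such adjustments matter, consider a hypothetical 2-for-1 split: multiplying pre-split prices by the cumulative adjustment factor removes the artificial jump from the return series (all prices below are invented for illustration):

```python
import numpy as np

# Hypothetical closes around a 2-for-1 split occurring after day 2.
raw_close = np.array([100.0, 102.0, 51.5, 52.0])
adj_factor = np.array([0.5, 0.5, 1.0, 1.0])    # cumulative split adjustment
adj_close = raw_close * adj_factor

raw_ret = np.diff(raw_close) / raw_close[:-1]  # contains a spurious ~-50% "return"
adj_ret = np.diff(adj_close) / adj_close[:-1]  # reflects the true price change
```

Dividend adjustments follow the same pattern, with the factor derived from the dividend amount instead of the split ratio.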
A.2 Bonds
Bonds are fixed-income instruments that represent loans provided by lenders
to borrowers, typically corporations or government entities. They offer predict-
able income and a range of maturities to suit diverse investment goals. Compa-
nies issue corporate bonds to obtain capital for purposes such as expansion,
operational needs, or refinancing existing debts. Corporate bonds typically
provide higher yields compared to government bonds to compensate for the
increased credit risk taken on by lenders. The risks of these bonds are evaluated
and classified by agencies like Moody’s and Standard & Poor’s. Conversely,
government bonds are issued by national, state, or municipal authorities to
finance public expenditures and infrastructure projects. Bonds from stable
governments, such as U.S. Treasuries, are considered some of the safest invest-
ments, while sovereign debt from emerging markets may carry higher yields but
also greater risk due to economic and political volatility.
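The inverse relationship between yields and bond prices can be illustrated with a simple discounted-cash-flow sketch (an annual-coupon bond with illustrative numbers, not a production pricing model):

```python
def bond_price(face, coupon_rate, ytm, years):
    """Price = present value of coupons plus face value, discounted at the yield."""
    coupon = face * coupon_rate
    pv_coupons = sum(coupon / (1 + ytm) ** t for t in range(1, years + 1))
    pv_face = face / (1 + ytm) ** years
    return pv_coupons + pv_face

par = bond_price(1000, 0.05, 0.05, 10)         # yield == coupon: trades at par
discounted = bond_price(1000, 0.05, 0.06, 10)  # yield above coupon: below par
```

When the required yield rises above the coupon rate, the price falls below face value, reflecting the extra compensation lenders demand for higher risk.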
The bond market operates primarily in an over-the-counter (OTC) format,
where trades are negotiated directly between buyers and sellers rather than on
centralized exchanges. This OTC structure allows for flexibility in terms of
transactions but often results in lower transparency compared to equity markets.
Despite this, the bond market is enormous, with an estimated global market size
exceeding 130 trillion USD. This valuation underscores its importance along-
side equities as a cornerstone of financial systems worldwide. Instead of trading
OTC, investors can gain exposure to bonds in several other ways. ETFs pro-
vide a simple and efficient method for individuals to access a diversified basket
of bonds. These ETFs track indices composed of corporate, government, or
municipal bonds and allow investors to trade bond exposure on stock exchanges
with ease. Bond futures provide an alternative means for investors to protect
price fluctuations that might affect international trade or the value of their
reserves. Financial institutions, hedge funds, and retail traders often engage
in Forex for speculative purposes, seeking to profit from changes in exchange
rates. Exchange rate movements are influenced by multiple elements such as
interest rate disparities, geopolitical incidents, economic indicators, and central
bank strategies. Leverage plays an important role in Forex trading by allowing
traders to hold positions substantially larger than the amount of capital they
commit as collateral. This magnifies both possible gains and associated risks.
Accordingly, the high levels of leverage available in Forex can lead to signifi-
cant losses, particularly for inexperienced traders. Moreover, the decentralized
and largely unregulated nature of the market means participants should choose
brokers carefully to ensure transparency and fair dealing.
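The double-edged effect of leverage can be sketched with a one-line P&L calculation (margin, leverage ratio, and price move are illustrative numbers):

```python
def leveraged_pnl(margin, leverage, pct_move):
    """P&L on a leveraged position: the notional is margin * leverage,
    so returns on the notional are amplified relative to the margin."""
    notional = margin * leverage
    return notional * pct_move

# A 1% adverse move at 50:1 leverage wipes out half of a 1,000 margin.
loss = leveraged_pnl(1_000, 50, -0.01)
```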
Access to the Forex market has been democratized significantly through
technology, allowing retail traders to participate via online platforms. These
platforms provide traders with exposure to Forex markets through spot trading,
forward contracts, and options. Beyond direct trading, investors can also gain
exposure to currency movements through ETFs that track the performance of
currency indices or specific currency pairs. Futures contracts on major curren-
cies offer yet another way to speculate or hedge currency exposure, providing
a regulated alternative to OTC Forex trading.
A.4 Futures
Futures are standardized financial agreements that require a buyer to purchase,
or a seller to sell, an underlying asset at a set price on a designated future
date. Futures are essential instruments in global financial markets, used both
for speculation and hedging against price movements. The total market for
futures is vast, spanning financial instruments, commodities, and more. The
Chicago Mercantile Exchange (CME) is the most prominent and liquid futures
exchange globally. Futures contracts are intrinsically tied to “future deliver-
ables,” meaning a contract specifies the terms for the delivery of the respective
underlying asset at expiry. However, in practice, most futures contracts are
either cash-settled or closed out prior to delivery, particularly for financial
futures where physical delivery is less common. The ability to settle contracts in
cash adds flexibility for traders and investors, reducing the logistical challenges
associated with taking physical delivery of assets, such as oil or agricultural
products.
Each futures contract has a specific expiry date, and traders often need to
“roll” contracts if they wish to maintain their position beyond that expiration
date. Rolling consists of closing the position in the near-expiration contract
and simultaneously opening one in a longer-dated contract.
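One common convention for stitching contracts at a roll is to back-adjust the expiring contract's prices by the price gap on the roll date, so the combined series contains no artificial jump; the prices below are invented for illustration, and other adjustment schemes (e.g., ratio adjustment) exist:

```python
import numpy as np

front = np.array([100.0, 101.0, 102.0])  # expiring contract, days 0-2
back = np.array([105.5, 106.0, 107.0])   # next contract, days 2-4
gap = back[0] - front[-1]                # price gap between contracts on the roll day
adjusted = np.concatenate([front + gap, back[1:]])  # continuous back-adjusted series
```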
A.5 Options
Options are financial derivatives that provide the holder with the right, but
not the obligation, to buy or sell an underlying asset at a predetermined price,
known as the strike price, on or before a specific expiration date. Similar to
futures, options are extensively utilized for hedging, speculative activities, and
income generation. They are traded on centralized exchanges such as the Chi-
cago Board Options Exchange (CBOE) and OTC markets. The options traded
on these exchanges are standardized to ensure greater liquidity and transpar-
ency. There are two primary types of options: call options and put options. A
call option grants the holder the right to purchase the underlying asset, while a
put option allows the holder to sell it. Each option contract requires the payment
of a premium, which is the cost the buyer pays to the seller for the rights the
option provides. The value of an option is influenced by various factors, includ-
ing the price and expected volatility of the underlying asset, the time remaining
until expiration, and prevailing interest rates.
Unlike futures, which impose mandatory obligations, options offer greater
flexibility. The purchaser of an option has the discretion to decide whether to
exercise the contract, whereas the seller (or writer) must adhere to the con-
tract terms if the buyer chooses to exercise it. Options are available on a wide
range of underlying assets, such as stocks, indices, commodities, currencies,
and even interest rates. For example, an investor with a stock portfolio might
buy put options to protect against a potential drop in stock prices. Similarly, a
business exposed to fluctuating commodity prices might purchase call options
to cap the cost of raw materials. Speculators use options to profit from
anticipated price movements, benefiting from the contracts' relatively low
upfront cost compared to that of the underlying asset, which provides leverage
to price movements.
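The asymmetry between the two contract types can be sketched via their payoff at expiry, net of the premium paid (strike, spot, and premium values are illustrative):

```python
def call_payoff(spot, strike, premium):
    """Long call P&L at expiry: upside above the strike, loss capped at the premium."""
    return max(spot - strike, 0.0) - premium

def put_payoff(spot, strike, premium):
    """Long put P&L at expiry: gains as the spot falls below the strike."""
    return max(strike - spot, 0.0) - premium

in_the_money = call_payoff(110, 100, 5)   # exercise is worthwhile
out_of_money = call_payoff(90, 100, 5)    # loss limited to the premium
```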
However, the flexibility of options comes with complexity. A key character-
istic of options is their expiration date, after which a contract expires worthless
if not exercised. This creates the need for strategic decision-making around
whether and when to exercise an option. Rolling options, that is, closing a
position in a near-expiry option and simultaneously opening a new position in a
longer-dated contract, is a common practice to maintain exposure beyond an
approaching expiration date. Another key feature of options is the leverage
they offer. A slight fluctuation in the price of the underlying asset can cause
large percentage changes in an option’s value. This leverage can amplify both
profits and losses, requiring careful position sizing and risk management.
Options trading has grown significantly in popularity, driven by technolog-
ical advancements and the rise of retail trading platforms. Exchange-traded
options, such as those on major indices like the S&P 500, are among the most
actively traded due to their high liquidity and broad appeal. Meanwhile, OTC
options allow for customized contracts tailored to specific needs, but they come
with less transparency and higher counterparty risk.
A.7 Cryptocurrency
Cryptocurrencies are digital or virtual currencies that are built upon decen-
tralized blockchain technologies and leverage cryptography to ensure their
security. They possess several features that distinguish them significantly
from traditional major asset classes. Most notably, cryptocurrencies are usually
not governed by any central authority, making them resistant to government
risks associated with rolling contracts. These financial products bridge the gap
between traditional investment frameworks and the digital asset space, making
it easier for institutional and retail investors to enter the market.
A.8 Others
Besides the major asset classes, there are numerous other products that pro-
vide investors with diverse opportunities to achieve their financial goals. We
list a few here. Commodities, although often traded via futures, also exist as
a standalone asset class. This category encompasses physical goods such as
gold, silver, crude oil, natural gas, and agricultural items like wheat and corn.
Investors can participate in the commodities market by owning these assets
directly, engaging in futures contracts, or investing in ETFs that track com-
modities. Commodities are particularly valued for their role as inflation hedges
and their historically low correlation with traditional financial assets, making
them useful for portfolio diversification.
While futures and options are the most frequently traded derivatives, other
types of derivatives also hold significant importance in financial markets. For
instance, swaps are extensively utilized in the interest rate and currency sec-
tors. Interest rate swaps enable parties to exchange fixed-rate payments for
floating-rate payments, or the other way around, allowing them to manage their
exposure to interest rate variability. Currency swaps involve the exchange of
principal and interest payments in different currencies, serving as essential tools
for multinational corporations and governments to manage foreign exchange
risks. Additionally, credit default swaps (CDS) function as a type of insurance
against a borrower’s default, playing a crucial role in credit markets and risk
management strategies.
The real estate market is another prominent market. Often, the cost of buy-
ing or selling a property is high and the process is time-consuming. However,
real estate investment trusts (REITs) offer investors a way to access the real
estate market without directly owning physical properties. By pooling funds
from multiple investors, these investment entities can purchase, oversee, and
finance income-producing real estate assets such as office buildings, retail cen-
ters, apartment complexes, and industrial facilities. Publicly traded REITs are
listed on stock exchanges, with liquidity and ease of access similar to that of
equities. Private non-traded REITs are also available to accredited investors and
often focus on niche markets. REITs attract income-oriented investors due to
legal requirements that compel them to pay out a significant share of their earn-
ings as dividends, typically offering higher returns than conventional equities.
Nonetheless, their success can be impacted by factors like interest rate changes,
trends in the property market, and economic cycles, rendering them sensitive
to macroeconomic shifts. Note that a portion of REIT distributions may be
classified as capital gains rather than ordinary income, which can receive more
favorable tax treatment.
Private equity and venture capital represent another distinct asset class, pro-
viding opportunities to invest in companies that do not trade on public markets.
Venture capital targets early-stage, high-growth startups, while private equity
focuses on mature companies, often involving buyouts or growth investments.
These investments usually involve committing capital over an extended period
and bearing higher risks, but they also provide the potential for considerable
returns.
Appendix B
Access to Market Data
B.1 Professional
For professionals, there are a variety of established third-party providers that
deliver high-quality market data, tailored to the needs of institutional investors,
traders, and financial analysts. Providers such as Bloomberg, Refinitiv, and
S&P Global offer comprehensive datasets spanning multiple asset classes,
along with advanced analytical tools and integration options. These plat-
forms have become industry staples, ensuring reliable and timely access to
financial information critical for decision-making. In addition to third-party
providers, many professional market participants access direct market feeds
from exchanges. These feeds deliver raw, real-time data, including order book
details, trade executions, and price updates, providing the low-latency access
required for high-frequency trading and algorithmic strategies. Beyond tra-
ditional market data, there is also a growing demand for alternative data,
non-conventional datasets that provide unique insights into market trends and
behavior. This can include information from annual reports, social media sen-
timent, credit card transactions, and forum discussions. Alternative data has
become a critical tool for gaining a competitive edge, offering perspectives not
available from standard financial datasets. Together, these resources constitute
the standard data sources accessed by professionals.
B.2 Academic
Academics often have access to subsidized data sources formatted specifically
for research and academic use. These resources can be tailored to meet the
needs of universities and researchers. Such data is often used to study
financial markets, corporate behavior, and economic trends.
One of the primary resources for academics is Wharton Research Data Ser-
vices (WRDS), a global data platform that hosts a huge amount of data (more
than 350 TB) aggregated from global data vendors. WRDS encompasses a
range of databases including Compustat, CRSP, TFN (THOMSON), TAQ and
many others. WRDS not only covers historical financial time-series data but
also provides access to corporate fundamentals, macroeconomic indicators and
more. Most academics can gain access to the WRDS platform through a sub-
scription provided by their universities. As a result, WRDS is widely used in
academic research.
information. These tools are especially popular among developers and quanti-
tative enthusiasts who want to integrate financial data into their own projects
or build custom trading algorithms.
Finally, for those interested in more niche or alternative datasets, open-
source repositories and public APIs from organizations like AlphaQuery and
Kaggle can offer unique insights and opportunities for experimentation. The
wide availability of these tools ensures that personal enthusiasts have plenty of
options to explore financial markets, regardless of their respective experience
levels or budgets.
Appendix C
Investment Performance Metrics
Here, we introduce various metrics that are used to gauge the performance of a
portfolio or a trading strategy. We denote the daily trade returns from a strategy
as Rt :
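As an illustrative sketch, assuming daily returns and 252 trading days per year (the exact definitions used in this Element may differ), two widely used metrics can be computed as follows:

```python
import numpy as np

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate taken as zero)."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative-return curve."""
    curve = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peaks = np.maximum.accumulate(curve)
    return (curve / peaks - 1.0).min()
```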
Appendix D
Code Scripts
def rolling_forward_cv(X_torch, y_torch, train_end_fractions=[0.7, 0.8, 0.9],
                       val_fraction=0.1):
    """Perform 3-fold rolling forward CV. Each validation set is 10% of the data."""
    N = len(X_torch)
    all_data = []
    for frac in train_end_fractions:
        train_end = int(frac * N)
        val_end = int((frac + val_fraction) * N)
        # Safety check if val_end exceeds dataset size
        if val_end > N:
            break  # no more folds possible if we run off the end
        # Create train/val splits
        X_train, y_train = X_torch[:train_end], y_torch[:train_end]
        X_val, y_val = X_torch[train_end:val_end], y_torch[train_end:val_end]
        all_data.append((X_train, y_train, X_val, y_val))
    return all_data
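The fold boundaries this scheme produces can be sketched directly. With, say, 1,000 observations, each fold trains on a growing prefix and validates on the following 10% of the data (round() is used here to sidestep floating-point edge effects in the index arithmetic):

```python
N = 1_000
train_end_fractions, val_fraction = (0.7, 0.8, 0.9), 0.1
folds = [(round(f * N), round((f + val_fraction) * N))
         for f in train_end_fractions]
# Each (train_end, val_end) pair: train on [:train_end],
# validate on [train_end:val_end].
```

Each later fold's training set absorbs the previous fold's validation window, mimicking how a deployed model would be periodically refit on all data available to date.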
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, seq_length, n_features, y_dim):
        super().__init__()

        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(seq_length * n_features, 4),
            nn.ReLU(),
            nn.Linear(4, y_dim))

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        x = self.fc(x)
        y = torch.softmax(x, dim=1)
        return y
import torch
import torch.nn as nn

class deeplob(nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = device

        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32,
                      kernel_size=(1, 2), stride=(1, 2)),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(32),
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(4, 1)),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(32),
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(4, 1)),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(32),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=32,
                      kernel_size=(1, 2), stride=(1, 2)),
            nn.Tanh(),
            nn.BatchNorm2d(32),
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(4, 1)),
            nn.Tanh(),
            nn.BatchNorm2d(32),
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(4, 1)),
            nn.Tanh(),
            nn.BatchNorm2d(32),
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(1, 10)),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(32),
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(4, 1)),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(32),
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(4, 1)),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(32),
        )

        self.inp1 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64,
                      kernel_size=(1, 1), padding='same'),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(64),
            nn.Conv2d(in_channels=64, out_channels=64,
                      kernel_size=(3, 1), padding='same'),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(64),
        )
        self.inp2 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64,
                      kernel_size=(1, 1), padding='same'),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(64),
            nn.Conv2d(in_channels=64, out_channels=64,
                      kernel_size=(5, 1), padding='same'),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(64),
        )
        self.inp3 = nn.Sequential(
            nn.MaxPool2d((3, 1), stride=(1, 1), padding=(1, 0)),
            nn.Conv2d(in_channels=32, out_channels=64,
                      kernel_size=(1, 1), padding='same'),
            nn.LeakyReLU(negative_slope=0.01),
            nn.BatchNorm2d(64),
        )

        # lstm layers
        self.lstm = nn.LSTM(input_size=192, hidden_size=64,
                            num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(64, 1)
References
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A
next-generation hyperparameter optimization framework. In Proceedings of
the 25th acm sigkdd international conference on knowledge discovery & data
mining (pp. 2623–2631).
Almgren, R., & Chriss, N. (2001). Optimal execution of portfolio transactions.
Journal of Risk, 3, 5–40.
Atkins, A., Niranjan, M., & Gerding, E. (2018). Financial news predicts stock
market volatility better than close price. The Journal of Finance and Data
Science, 4(2), 120–137.
Atsalakis, G. S., & Valavanis, K. P. (2009). Surveying stock market fore-
casting techniques–Part II: Soft computing methods. Expert Systems with
Applications, 36(3), 5932–5941.
Bachelier, L. (1900). Théorie de la spéculation. In Annales scientifiques de
l’école normale supérieure (Vol. 17, pp. 21–86). Elsevier
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bao, W., Yue, J., & Rao, Y. (2017). A deep learning framework for financial
time series using stacked autoencoders and long-short term memory. PloS
one, 12(7), e0180944.
Beck, M., Pöppel, K., Spanring, M., et al. (2024). xlstm: Extended long short-
term memory. arXiv preprint arXiv:2405.04517.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependen-
cies with gradient descent is difficult. IEEE transactions on neural networks,
5(2), 157–166.
Bertsimas, D., & Lo, A. W. (1998). Optimal control of execution costs. Journal
of financial markets, 1(1), 1–50.
Blondel, M., Teboul, O., Berthet, Q., & Djolonga, J. (2020). Fast differentia-
ble sorting and ranking. In Hal Daumé & Singh, Aarti (eds), International
conference on machine learning (pp. 950–959). PMLR
Borovykh, A., Bohte, S., & Oosterlee, C. W. (2017). Conditional time
series forecasting with convolutional neural networks. arXiv preprint
arXiv:1703.04691.
Boureau, Y., Ponce, J., & LeCun, Y. (2010). A theoretical analysis of feature
pooling in vision algorithms. In Proceedings of international conference on
machine learning (icml’10) (Vol. 28, p. 3).
Downloaded from [Link] IP address: [Link], on 03 Oct 2025 at [Link], subject to the Cambridge Core terms of
use, available at [Link] [Link]
168 References
Briola, A., Bartolucci, S., & Aste, T. (2024). Deep limit order book forecasting.
arXiv preprint arXiv:2403.09267.
Briola, A., Turiel, J., & Aste, T. (2020). Deep learning modeling of limit order
book: A comparative perspective. arXiv preprint arXiv:2007.07319.
Cesa, M. (2017). A brief history of quantitative finance. Probability, Uncer-
tainty and Quantitative Risk, 2(1), 1–16.
Chen, J.- F., Chen, W.- L., Huang, C.- P., Huang, S.- H., & Chen, A.- P.
(2016). Financial time-series data analysis using deep convolutional neural
networks. In Cloud computing and big data (ccbd), 2016 7th international
conference on (pp. 87–92).
Cho, K., Van Merriënboer, B., Gulcehre, C., et al. (2014). Learning phrase rep-
resentations using rnn encoder-decoder for statistical machine translation.
arXiv preprint arXiv:1406.1078.
Cont, R., Cucuringu, M., Kochems, J., & Prenzel, F. (2023). Limit order book
simulation with generative adversarial networks. SSRN 4512356.
Cuturi, M., Teboul, O., & Vert, J.- P. (2019). Differentiable ranking and sort-
ing using optimal transport. Advances in Neural Information Processing
Systems, 32.
Devlin, J., Chang, M.- W., Lee, K., & Toutanova, K. (2018). Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv
preprint arXiv:1810.04805.
Di Persio, L., & Honchar, O. (2016). Artificial neural networks architectures for
stock price prediction: Comparisons and applications. International Journal
of Circuits, Systems and Signal Processing, 10, 403–413.
Dixon, M. (2018). Sequence classification of the limit order book using
recurrent neural networks. Journal of Computational Science, 24, 277–286.
Doering, J., Fairbank, M., & Markose, S. (2017). Convolutional neural net-
works applied to high-frequency market microstructure forecasting. In
Computer science and electronic engineering (ceec), 2017 (pp. 31–36).
Du, K., Xing, F., Mao, R., & Cambria, E. (2024). Financial sentiment analysis:
Techniques and applications. ACM Computing Surveys, 56(9), 1–42.
Ekmekcioğlu, Ö., & Pınar, M. Ç. (2023). Graph neural networks for deep
portfolio optimization. Neural Computing and Applications, 35(28), 20663–
20674.
Fischer, T., & Krauss, C. (2017). Deep learning with long short-term memory
networks for financial market predictions. European Journal of Operational
Research, 270(2), 654–669.
Frazier, P. I. (2018). Bayesian optimization. In Recent advances in optimization
and modeling of contemporary problems (pp. 255–278). Informs.
Gatheral, J. (2010). No-dynamic-arbitrage and market impact. Quantitative
finance, 10(7), 749–759.
Downloaded from [Link] IP address: [Link], on 03 Oct 2025 at [Link], subject to the Cambridge Core terms of
use, available at [Link] [Link]
References 169
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
([Link])
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. (2014). Generative adver-
sarial nets. Advances in neural information processing systems, 27.
Grover, A., Wang, E., Zweig, A., & Ermon, S. (2018). Stochastic optimization
of sorting networks via continuous relaxations. In International conference
on learning representations.
Gu, A., & Dao, T. (2023). Mamba: Linear-time sequence modeling with
selective state spaces. arXiv preprint arXiv:2312.00752.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image
recognition. In Proceedings of the ieee conference on computer vision and
pattern recognition (pp. 770–778).
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural
computation, 9(8), 1735–1780.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward
networks are universal approximators. Neural networks, 2(5), 359–366.
Hwang, Y., Kong, Y., Lee, Y., & Zohren, S. (2025). Decision-informed neural
networks with large language model integration for portfolio optimization.
Jin, M., Wang, S., Ma, L., et al. (2023). Time-LLM: Time series forecasting by
reprogramming large language models. arXiv preprint arXiv:2310.01728.
Kalman, R. E. (1960). A new approach to linear filtering and prediction
problems. Journal of Basic Engineering, Transactions of the ASME, 82(1),
35–45.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv
preprint arXiv:1312.6114.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph
convolutional networks. arXiv preprint arXiv:1609.02907.
Kong, Y., Nie, Y., Dong, X., et al. (2024). Large language models for finan-
cial and investment management: Applications and benchmarks. Journal of
Portfolio Management, 51(2) 162–210.
Kong, Y., Wang, Z., Nie, Y., et al. (2024). Unlocking the power of lstm for long
term time series forecasting. arXiv preprint arXiv:2408.10006.
Korangi, K., Mues, C., & Bravo, C. (2024). Large-scale time-varying portfo-
lio optimisation using graph attention networks. arXiv preprint arXiv:2407
.15532.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification
with deep convolutional neural networks. Advances in neural information
processing systems, 25.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
Lim, B., Arık, S. Ö., Loeff, N., & Pfister, T. (2021). Temporal fusion trans-
formers for interpretable multi-horizon time series forecasting. International
Journal of Forecasting, 37(4), 1748–1764.
Lim, B., & Zohren, S. (2021). Time-series forecasting with deep learning:
A survey. Philosophical Transactions of the Royal Society A, 379(2194),
20200209.
Lim, B., Zohren, S., & Roberts, S. (2019). Enhancing time-series momen-
tum strategies using deep neural networks. The Journal of Financial Data
Science, 1(4), 19–38.
Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3), 225–331.
Liu, Y., Hu, T., Zhang, H., et al. (2023). iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625.
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Maas, A. L., Hannun, A. Y., Ng, A. Y., et al. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proceedings of ICML (Vol. 30, p. 3).
Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1), 77–91.
Mhaskar, H. N., & Micchelli, C. A. (1993). How to choose an activation function. Advances in Neural Information Processing Systems, 6.
Moreno-Pino, F., & Zohren, S. (2024). DeepVol: Volatility forecasting from high-frequency data with dilated causal convolutions. Quantitative Finance, 24(8), 1105–1127.
Moskowitz, T. J., Ooi, Y. H., & Pedersen, L. H. (2012). Time series momentum.
Journal of Financial Economics, 104(2), 228–250.
Nagy, P., Calliess, J.-P., & Zohren, S. (2023). Asynchronous deep double dueling Q-learning for trading-signal execution in limit order book markets. Frontiers in Artificial Intelligence, 6, 1151003.
Nagy, P., Frey, S., Sapora, S., et al. (2023). Generative AI for end-to-end limit
order book modelling: A token-level autoregressive generative model of
message flow using a deep state space network. In Proceedings of the fourth
ACM international conference on AI in finance (pp. 91–99).
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML (pp. 807–814).
Nelson, D. M., Pereira, A. C., & de Oliveira, R. A. (2017). Stock market's price movement prediction with LSTM neural networks. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 1419–1426).
Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2022). A time series
is worth 64 words: Long-term forecasting with transformers. arXiv preprint
arXiv:2211.14730.
Obizhaeva, A. A., & Wang, J. (2013). Optimal trading strategy and supply/demand dynamics. Journal of Financial Markets, 16(1), 1–32.
Ogryczak, W., & Tamir, A. (2003). Minimizing the sum of the k largest
functions in linear time. Information Processing Letters, 85(3), 117–122.
Poh, D., Lim, B., Zohren, S., & Roberts, S. (2021a). Building cross-sectional
systematic strategies by learning to rank. The Journal of Financial Data
Science, 3(2), 70–86.
Poh, D., Lim, B., Zohren, S., & Roberts, S. (2021b). Enhancing cross-sectional
currency strategies by context-aware learning to rank with self-attention.
arXiv preprint arXiv:2105.10019.
Poh, D., Lim, B., Zohren, S., & Roberts, S. (2021c). Enhancing cross-sectional
currency strategies by ranking refinement with transformer-based architec-
tures. arXiv preprint arXiv:2105.10019.
Poh, D., Roberts, S., & Zohren, S. (2022). Transfer ranking in finance: Applications to cross-sectional momentum with data scarcity. arXiv preprint arXiv:2208.09968.
Prata, M., Masi, G., Berti, L., et al. (2024). LOB-based deep learning models for stock price trend prediction: A benchmark study. Artificial Intelligence Review, 57(5), 1–45.
Pu, X. S., Roberts, S., Dong, X., & Zohren, S. (2023). Network momentum across asset classes (August 7, 2023).
Rahimikia, E., Zohren, S., & Poon, S.- H. (2021). Realised volatility fore-
casting: Machine learning via financial word embedding. arXiv preprint
arXiv:2108.00480.
Reisenhofer, R., Bayer, X., & Hautsch, N. (2022). HARNet: A convolutional neural network for realized volatility forecasting. arXiv preprint arXiv:2205.07719.
Schnaubelt, M. (2022). Deep reinforcement learning for the optimal placement
of cryptocurrency limit orders. European Journal of Operational Research,
296(3), 993–1006.
Selvin, S., Vinayakumar, R., Gopalakrishnan, E., Menon, V. K., & Soman, K. (2017). Stock price prediction using LSTM, RNN and CNN-sliding window model. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 1643–1647).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sirignano, J., & Cont, R. (2018). Universal features of price formation in financial markets: Perspectives from deep learning. arXiv preprint arXiv:1803.06917.
Soleymani, F., & Paquet, E. (2021). Deep graph convolutional reinforcement learning for financial portfolio management – DeepPocket. Expert Systems with Applications, 182, 115127.
Sun, Q., Wei, X., & Yang, X. (2024). GraphSAGE with deep reinforcement learning for financial portfolio optimization. Expert Systems with Applications, 238, 122027.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104–3112).
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Theron, L., & Van Vuuren, G. (2018). The maximum diversification invest-
ment strategy: A portfolio performance comparison. Cogent Economics &
Finance, 6(1), 1427533.
Tsantekidis, A., Passalis, N., Tefas, A., et al. (2017a). Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI) (Vol. 1, pp. 7–12). IEEE.
Tsantekidis, A., Passalis, N., Tefas, A., et al. (2017b). Using deep learning to detect price change indications in financial markets. In 2017 25th European Signal Processing Conference (EUSIPCO) (pp. 2511–2515).
Van Den Oord, A., Dieleman, S., Zen, H., et al. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vergara, J. R., & Estévez, P. A. (2014). A review of feature selection methods
based on mutual information. Neural Computing and Applications, 24, 175–
186.
Wan, X., Yang, J., Marinov, S., et al. (2021). Sentiment correlation in financial
news networks and associated market movements. Scientific Reports, 11(1),
3062.
Wang, J., Zhang, S., Xiao, Y., & Song, R. (2021). A review on graph neural
network methods in financial applications. arXiv preprint arXiv:2111.15367.
Wang, Y., Wu, H., Dong, J., et al. (2024). Deep time series models: A
comprehensive survey and benchmark. arXiv preprint arXiv:2407.13278.
Wood, K., Giegerich, S., Roberts, S., & Zohren, S. (2021). Trading with the
momentum transformer: An intelligent and interpretable architecture. arXiv
preprint arXiv:2112.08534.
Wood, K., Kessler, S., Roberts, S. J., & Zohren, S. (2023). Few-shot learning
patterns in financial time-series for trend-following strategies. arXiv preprint
arXiv:2310.10500.
Wood, K., Roberts, S., & Zohren, S. (2021). Slow momentum with fast rever-
sion: A trading strategy using deep learning and changepoint detection. arXiv
preprint arXiv:2105.13727.
Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposi-
tion transformers with auto-correlation for long-term series forecasting.
Advances in Neural Information Processing Systems, 34, 22419–22430.
Wu, Z., Pan, S., Chen, F., et al. (2020). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1), 4–24.
Zhang, C., Pu, X., Cucuringu, M., & Dong, X. (2023). Graph neural networks
for forecasting multivariate realized volatility with spillover effects. arXiv
preprint arXiv:2308.01419.
Zhang, C., Pu, X., Cucuringu, M., & Dong, X. (2024). Graph-based meth-
ods for forecasting realized covariances. Journal of Financial Econometrics,
nbae026.
Zhang, C., Zhang, Z., Cucuringu, M., & Zohren, S. (2021). A universal end-
to-end approach to portfolio optimization via deep learning. arXiv preprint
arXiv:2111.09170.
Zhang, X., Chowdhury, R. R., Gupta, R. K., & Shang, J. (2024). Large language
models for time series: A survey. arXiv preprint arXiv:2402.01801.
Zhang, Y., & Yan, J. (2023). Crossformer: Transformer utilizing cross-
dimension dependency for multivariate time series forecasting. In The
Eleventh International Conference on Learning Representations.
Zhang, Z., Lim, B., & Zohren, S. (2021). Deep learning for market by order
data. Applied Mathematical Finance, 28(1), 79–95.
Zhang, Z., & Zohren, S. (2021). Multi-horizon forecasting for limit order
books: Novel deep learning approaches and hardware acceleration using
intelligent processing units. arXiv preprint arXiv:2105.10430.
Zhang, Z., Zohren, S., & Roberts, S. (2019a). DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing, 67(11), 3001–3012.
Zhang, Z., Zohren, S., & Roberts, S. (2019b). Extending deep learning models for limit order books to quantile regression. In Proceedings of the Time Series Workshop of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97.
Zhang, Z., Zohren, S., & Roberts, S. (2020). Deep learning for portfolio
optimization. The Journal of Financial Data Science, 2(4), 8–20.
Zhou, H., Zhang, S., Peng, J., et al. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI (pp. 11106–11115).
Zhou, Y.-T., & Chellappa, R. (1988). Computation of optical flow using a neural network. In ICNN (pp. 71–78).
Acknowledgments
We owe our profound gratitude to everyone who contributed to the successful
completion of this book. To begin with, we wish to acknowledge our fami-
lies for their unfailing support. We also want to recognize the assistance of our
colleagues and friends, whose insightful conversations, constructive criticism,
and fresh perspectives helped us to shape the content of this book. In particular,
we would like to thank our senior colleagues, Steve Roberts, Xiaowen Dong,
Jan Calliess, Mihai Cucuringu, Alex Shestopaloff, Janet Pierrehumbert, Jakob
Foerster, Ani Calinescu, Nick Firoozye, Chao Ye, Xiaoqing Wu, and Yongjae
Lee, as well as research students and postdocs whose work was featured
here, Bryan Lim, Kieran Wood, Daniel Poh, Will Tan, Fernando Moreno-Pino,
Chao Zhang, Vincent Tan, Xingyue Pu, Yaxuan Kong, Yoontae Hwang, Felix
Drinkall, Dragos Gorduza, Peer Nagy, Xingchen Wan, Binxin Ru, and Sasha
Frey. Special thanks also to Samson Donick for proofreading the entire man-
uscript, as well as several of the above students for proofreading individual
sections. Additional thanks go to George Nigmatulin, Yaxuan Kong, and Yoontae Hwang for helping with didactic materials around the book. Moreover, we would like to thank Bank of America for hosting a short lecture series
based on the book attended by 200 quants. In particular, special thanks go to
Robert De Witt, Ilya Sheynzon and Shih-Hau Tan for organising the event, as
well as to Leif Andersen for carefully reading the manuscript and providing
additional comments.
Our thanks extend as well to the editorial and publishing team, in particular our editor Riccardo Rebonato for his insightful feedback and patience throughout the process. We are deeply thankful to the Machine Learning Research Group and the Oxford-Man Institute at the University of Oxford for providing us with a
supportive research environment. We would also like to thank Man Group for
sponsoring the institute and their engagement through their academic liaisons
Anthony Ledford and Slavi Marinov. Without all your support, this book would
never have come to fruition.
To our families
Quantitative Finance
Editor
Riccardo Rebonato
EDHEC Business School
Riccardo Rebonato is Professor of Finance at EDHEC Business School and holds
the PIMCO Research Chair for the EDHEC Risk Institute. He has previously held academic
positions at Imperial College, London, and Oxford University and has been Global Head
of Fixed Income and FX Analytics at PIMCO, and Head of Research, Risk Management
and Derivatives Trading at several major international banks. He has previously been on
the Board of Directors for ISDA and GARP, and he is currently on the Board of the Nine
Dot Prize. He is the author of several books and articles in finance and risk management,
including Bond Pricing and Yield Curve Modelling (2017, Cambridge University Press).