Algorithmic Trading Using Sentiment Analysis and Reinforcement Learning
2005][11] predicts the movements of stock prices based on the contents of the news stories. All of these studies attempt to classify the news more accurately and thereby achieve higher predictive accuracy.

In this project we follow a novel approach that combines both implementations: we gather information from sources that have a large impact on the stock market and supply this additional information to the agent so that it can learn a better policy/strategy. Thus the ML agent learns an optimal trading strategy not only from historical prices but also from additional information about the sentiment and trend of the market, allowing it to make an informed decision.
3. Dataset Used
Five years of daily historical closing prices (2011-2016) for various stocks were obtained from Yahoo Finance[12] to form the dataset. For this project, the results are restricted to a portfolio consisting of two stocks: Qualcomm (QCOM) and Microsoft (MSFT). Ten years of news articles were also obtained from the Reuters Key Development Corpus[13] for both Qualcomm and Microsoft. Five years of these news articles (2006-2010) are used as the training dataset and the remaining years (2011-2016) are used as the test dataset for Sentiment Analysis. The headlines were manually classified to obtain the ground-truth sentiment score: +1 if a headline carries a positive sentiment and -1 if it carries a negative sentiment.
4. Problem Formulation
This section describes how the problem of choosing when to BUY/HOLD/SELL a portfolio of stocks is formulated as a Markov Decision Process (MDP). It further elaborates on how the MDP is solved by learning an optimal policy using Q-learning with function approximation. Note that since the state space with real-valued stock prices is extremely large, the stock prices in the state have been discretized.
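The paper does not describe how the prices were discretized, so the sketch below only illustrates one simple possibility: mapping each closing price to a fixed-width bucket. The $10 bucket width and the function name are assumptions for illustration, not taken from the paper.

```python
def discretize_price(price: float, bucket_width: float = 10.0) -> int:
    """Map a real-valued closing price to a discrete bucket index.

    The $10 bucket width is an illustrative assumption; the paper only
    states that the stock prices in the state were discretized.
    """
    return int(price // bucket_width)

# e.g. QCOM at $52.37 -> bucket 5, MSFT at $48.90 -> bucket 4
discrete_prices = (discretize_price(52.37), discretize_price(48.90))
```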
MDP Formulation:
The MDP[16] is formulated by describing its States, Actions, Transition Probabilities, Rewards, and discount factor.
States: [(# of stocks for each asset), (current stock price of each asset), cash in hand]
The first part of the state is a tuple containing the number of stocks held for each asset. The second part is a tuple containing the daily closing stock price of each asset. Finally, the third part is the cash in hand, which is re-evaluated at every time step based on the action performed.
Initial State: [(0, 0...), (𝑆1, 𝑆2...), $10,000] i.e. the agent has 0
stocks for each asset and only $10,000 as an initial investment.
Actions: At any point the agent chooses from three actions: BUY, SELL, and HOLD. BUY purchases as many stocks of each asset as possible given the current stock prices and the cash in hand. SELL sells all the stocks in the portfolio and adds the generated cash to the cash in hand. HOLD does nothing, i.e. it neither buys nor sells any stock.
Transition Probability: The transition probability is always taken to be 1, since whenever the action is BUY/SELL we are certain to buy/sell the stocks of an asset. The randomness in the system instead comes from the fact that the stock prices change after every time step, i.e. just after a stock is bought or sold.
Rewards: The reward at any point is calculated as [current value of the portfolio - initial investment].
Discount Factor: In this project, the discount factor is always assumed to be 1.
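As a concrete illustration of the MDP above, here is a minimal sketch of the state, the three actions, and the reward as just defined. The class and method names are my own, and splitting the available cash evenly across assets on a BUY is an assumption, since the paper only says the agent buys as many stocks as possible.

```python
class PortfolioMDP:
    """Minimal sketch of the trading MDP described above (names are illustrative)."""

    ACTIONS = ("BUY", "SELL", "HOLD")

    def __init__(self, initial_cash=10_000.0, n_assets=2):
        self.initial_cash = initial_cash
        self.shares = [0] * n_assets      # (# of stocks for each asset)
        self.cash = initial_cash          # cash in hand

    def step(self, action, prices):
        """Apply an action at the current closing prices and return the reward."""
        if action == "BUY":
            # Buy as many shares of each asset as possible; cash is split
            # evenly across assets (assumption, not specified in the paper).
            budget = self.cash / len(prices)
            for i, p in enumerate(prices):
                qty = int(budget // p)
                self.shares[i] += qty
                self.cash -= qty * p
        elif action == "SELL":
            # Liquidate the whole portfolio into cash.
            for i, p in enumerate(prices):
                self.cash += self.shares[i] * p
                self.shares[i] = 0
        # HOLD: do nothing.
        portfolio_value = self.cash + sum(s * p for s, p in zip(self.shares, prices))
        return portfolio_value - self.initial_cash   # reward = portfolio value - initial investment

    def state(self, prices):
        return (tuple(self.shares), tuple(prices), self.cash)
```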
Solving the MDP:
The MDP described above was solved using the vanilla Q-learning algorithm with function approximation[17]. The algorithm can be described as follows.
On each (s, a, r, s'):
    Q_opt(s, a) ← (1 − η) Q_opt(s, a) + η (r + γ V_opt(s'))
where V_opt(s') = max_{a' ∈ Actions(s')} Q_opt(s', a'), s = current state, a = action taken, s' = next state, r = reward, γ = discount factor, and η = learning rate (step size).
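A minimal sketch of this update rule in tabular form, assuming a dictionary-backed Q table and a learning rate of η = 0.1 (the paper does not report the value of η it used; γ = 1 as stated above):

```python
from collections import defaultdict

ACTIONS = ("BUY", "SELL", "HOLD")
Q = defaultdict(float)      # Q_opt(s, a); unseen (state, action) pairs default to 0
ETA, GAMMA = 0.1, 1.0       # learning rate (assumed value) and discount factor (paper uses 1)

def q_update(s, a, r, s_next):
    """One step of  Q_opt(s,a) <- (1 - eta) Q_opt(s,a) + eta (r + gamma V_opt(s'))."""
    v_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)   # V_opt(s')
    Q[(s, a)] = (1 - ETA) * Q[(s, a)] + ETA * (r + GAMMA * v_next)
```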
Since tabular Q-learning does not generalize to unseen states/actions, function approximation is used, which parameterizes Q_opt by a weight vector w and a feature vector φ(s, a). The update can be described as:
On each (s, a, r, s'):
    w ← w − η (Q_opt(s, a; w) − (r + γ V_opt(s'))) φ(s, a)
where Q_opt(s, a; w) is the prediction and r + γ V_opt(s') is the target.
For our problem, we used the following features: (a) the number of stocks of each asset, (b) the current stock price of each asset, and (c) the cash in hand.
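A sketch of this linear approximation, assuming one weight vector per action. The paper lists only the state features, so the per-action weights, the feature ordering, and the learning rate value are assumptions for illustration.

```python
import numpy as np

ACTIONS = ("BUY", "SELL", "HOLD")
N_FEATURES = 5            # 2 share counts + 2 closing prices + cash in hand
weights = {a: np.zeros(N_FEATURES) for a in ACTIONS}   # one weight vector per action (assumption)
ETA, GAMMA = 0.01, 1.0    # learning rate (assumed value) and discount factor

def phi(state):
    """Feature vector: (# of stocks of each asset, current price of each asset, cash in hand)."""
    shares, prices, cash = state
    return np.array([*shares, *prices, cash], dtype=float)

def q_hat(state, action):
    """Prediction Q_opt(s, a; w)."""
    return float(weights[action] @ phi(state))

def sgd_update(s, a, r, s_next):
    """w <- w - eta * (prediction - target) * phi(s, a)."""
    target = r + GAMMA * max(q_hat(s_next, a2) for a2 in ACTIONS)
    weights[a] -= ETA * (q_hat(s, a) - target) * phi(s)
```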
In addition, to trade off exploration against exploitation, an epsilon-greedy algorithm is used, which explores with probability ε and exploits with probability 1 − ε. An exploration probability of ε = 0.2 has been chosen for this project.
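A minimal sketch of ε-greedy action selection with ε = 0.2, reusing ACTIONS and the q_hat predictor from the previous sketch:

```python
import random

EPSILON = 0.2   # exploration probability used in this project

def epsilon_greedy_action(state):
    """Explore with probability epsilon, otherwise act greedily with respect to Q."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                       # explore
    return max(ACTIONS, key=lambda a: q_hat(state, a))      # exploit
```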
Results:
This section discusses the performance of the Q-learning system implemented above. The Q-learning system was run on the dataset of 5 years of stock prices, with the number of trials set to 10,000.
The plot below shows how the Sharpe Ratio evolves as the historical time period increases. As can be observed, the ML agent slowly but steadily learns a strategy that improves the Sharpe Ratio as more historical data is provided, thus displaying incremental performance, and ultimately achieves a Sharpe Ratio of 0.85.
Figure 2: Sharpe Ratio vs historical time period using Q-Learning
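For reference, a Sharpe Ratio such as the 0.85 reported here can be computed from the series of daily portfolio values as sketched below. A zero risk-free rate and annualization over 252 trading days are assumptions; the paper does not state its exact convention.

```python
import numpy as np

def sharpe_ratio(portfolio_values, periods_per_year=252):
    """Annualized Sharpe ratio from a series of daily portfolio values,
    assuming a zero risk-free rate."""
    values = np.asarray(portfolio_values, dtype=float)
    daily_returns = np.diff(values) / values[:-1]
    return np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std()
```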
To validate that the algorithm was effective in learning the optimal strategy, a Monte-Carlo simulation was performed as a baseline. In this simulation, the agent is forced to choose an action at random at every step. As can be observed in the plot below, this random agent generates a negative Sharpe Ratio, thereby validating the Q-learning algorithm for this problem.
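A sketch of this random-action baseline, reusing the PortfolioMDP sketch from Section 4 (the loop structure is an assumption; the paper only states that actions are chosen at random):

```python
import random

def random_baseline(daily_prices):
    """Run one episode in which every action is chosen uniformly at random.

    `daily_prices` is a sequence of per-day closing-price tuples, one entry per asset.
    Returns the final reward, i.e. final portfolio value minus the initial investment.
    """
    env = PortfolioMDP(n_assets=len(daily_prices[0]))
    reward = 0.0
    for prices in daily_prices:
        action = random.choice(PortfolioMDP.ACTIONS)
        reward = env.step(action, prices)
    return reward
```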
4. Stochastic D
    D%(i) = (K%(i−2) + K%(i−1) + K%(i)) / 3
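A sketch of this %D computation, i.e. a 3-period simple moving average of the %K values (the %K series is assumed to have been computed already):

```python
def stochastic_d(k_values, i):
    """%D(i) = (%K(i-2) + %K(i-1) + %K(i)) / 3."""
    return (k_values[i - 2] + k_values[i - 1] + k_values[i]) / 3
```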