
Comparing Bitcoin's Prediction Model Using GRU, RNN, and LSTM by Hyperparameter Optimization Grid Search and Random Search

Nurhayati1, Imam Lutfi Rahmatullah2, Bayu Aji Setyawan3
Informatics Engineering
Universitas Negeri Syarif Hidayatullah Jakarta
Jakarta, Indonesia
Email: nurhayati@uinjkt.ac.id1, imam.lutfi18@mhs.uinjkt.ac.id2, bayu.ajisetyawan18@mhs.uinjkt.ac.id3

Abstract— Since its introduction in 2008, the rise in the price of Bitcoin and the popularity of other cryptocurrencies have made the energy used to produce them a growing topic of discussion. Bitcoin has become the most expensive and most popular cryptocurrency, and both industry and the research community have begun to study its development. Cryptocurrency is a blockchain-based technology that is often used as a digital currency, and Bitcoin is one type of cryptocurrency. However, because there is no government regulation, the price of Bitcoin is uncontrolled and fluctuates widely and frequently. GRU, RNN, and LSTM are methods that are frequently used for forecasting; all three make use of historical data over a certain period in the prediction process. GRU, RNN, and LSTM are considered among the best methods for obtaining a prediction, but the result also depends on the model that can be built. In this paper GRU, RNN, and LSTM are compared, and each is also tuned with Grid Search and Random Search to find out which combination performs better.

Keywords—GRU, RNN, LSTM, Random Search, Grid Search

I. INTRODUCTION

In recent years, digital currencies have developed rapidly because the public's need for transactions has grown just as quickly. This has led to many kinds of innovation in online transaction methods. One method that is currently popular in the digital world is transacting with crypto currency, or cryptocurrency. There are many kinds of cryptocurrency, each offering different advantages. Bitcoin is one of the most popular types of cryptocurrency today; it is the first cryptocurrency in the world and was introduced by Satoshi Nakamoto in 2008.

Bitcoin itself was first mentioned in August 2008, when the programmer Satoshi Nakamoto and his collaborator Martti Malmi set up a new domain called bitcoin.org. In October of the same year, Nakamoto published a paper titled "Bitcoin: A Peer-to-Peer Electronic Cash System". In the preceding months, Nakamoto and a group of researchers had already proposed other versions of the same concept in forums and e-mail threads, but in 2008 everything came together.

Cryptocurrency has several advantages and disadvantages when used as a currency. For now there is no clear and definite law regulating the circulation of digital currencies (cryptocurrency) such as Bitcoin, so if digital money is abused, for example through fraud, money laundering, or other criminal acts, there is no institution that can be held responsible. In addition, money that is to be used as a means of payment must meet the requirements of a means of payment and be recognized by the government. For now, cryptocurrency digital money does not meet those requirements and has no recognition from the government as a means of payment, because Bitcoin is still a new phenomenon for many people in Indonesia.

Forecasting is very important for planning in an effective and efficient way. There are two types of forecasting: subjective and objective [1]. The subjective forecasting approach uses a qualitative model, while the objective approach has two kinds of model, the causal model and the time series model. Qualitative models include subjective factors in the forecast; they are most useful when accurate quantitative data is hard to obtain.

A time series is data collected, recorded, or observed through time in sequence over certain periods such as quarters, months, weeks, and in some cases days or hours. Time series data are analysed to find patterns in past events that can be used to estimate values for the forecast, because by observing a time series, four components can be seen that affect past and present data patterns and tend to repeat themselves in the future [2].

In a neural network, a model can be tuned by adjusting hyperparameters to improve its quality. Hyperparameter optimization can be described as an optimization problem whose objective is to discover the values that maximize performance and yield the desired model [3]. The hyperparameter optimization techniques used in this experiment are Grid Search and Random Search, because both are frequently used to tune models; the goal is to find which one fits better for the GRU, RNN, and LSTM models.

II. RELATED WORK

The following is a summary of several similar studies that have been conducted before by other authors:
1. Yiying, Wang & Yeze, Zang, 2019 [4]. This research applies ANN and LSTM as algorithms to predict cryptocurrency prices, especially Bitcoin. The results indicate that both methods give fairly good predictions; however, analysis of the models' memory shows that ANN relies more on long-term history while LSTM tends to rely on short-term history.

2. Phaladisailoed, Thearasak & Numnonda, Thanisa, 2018 [5]. This study uses RNN, Huber Regression, Theil-Sen Regression, and LSTM to compare which model is best for predicting Bitcoin prices. Using predictive features such as Open, Close, High, and Low, the prediction results show that the GRU method gives better results than the regression methods. However, the resulting model also depends on the parameters chosen, because they greatly affect the prediction results.
3. Shewalkar, Apeksha, Nyavanandi, Deepika & Ludwig, Simone A., 2019 [6]. This work implements GRU, LSTM, and RNN in speech recognition, motivated by the fact that feedforward neural networks are no longer able to handle speech data properly. The study compares the performance of GRU, LSTM, and RNN on the reduced TED-LIUM speech dataset. The results indicate that GRU and LSTM give similar results, slightly higher for LSTM, but the training time for LSTM is longer for that accuracy. The paper therefore concludes that GRU is the better choice for speech recognition on the reduced TED-LIUM dataset.

III. PROBLEM FORMULATION

The research method uses a machine learning approach with TensorFlow Keras and directly chooses the variable "Close" as the forecasting target. It then follows the general stages of a data forecasting process: data collection, data preprocessing, splitting the data into training and validation data, creating the data models, and making predictions with the models.

A. Data Acquisition

The data we use is time series data obtained from a website that provides historical data from a crypto exchange (a provider of cryptocurrency buying and selling), namely https://www.cryptodatadownload.com. The downloaded data comes from the Gemini exchange and spans 17 August 2017 to 13 April 2021. In other words, we took the dataset from the original data in the field.

Because the forecasting application uses time series data, the authors use the entire dataset without trimming it first, since the results of the prediction model rely on historical data from the past.

The data that we acquired can be said to be neither defective nor dirty, because the time span and the contents of the data are complete: there are no null values or extreme outliers at any of the data indexes.

B. Data Preprocessing

Raw data ordinarily comes with numerous flaws such as inconsistencies, missing values, noise, and/or redundancies. The performance of subsequent learning algorithms would be undermined if they were trained on such low-quality data, so by applying an appropriate preprocessing procedure we can significantly improve the quality and reliability of the automatic discoveries and decisions that follow. Data preprocessing aims at turning raw input into high-quality input that properly matches the mining process to follow; it is considered an obligatory step and includes techniques such as integration, normalization, cleaning, and transformation [7]. Because the dataset we obtained can be categorized as quite good, with no dirty or defective data, the only preprocessing we perform is sorting the data by date in ascending order. The data as received is stacked so that every new record is placed at the top (row 1); if this were left unchanged, the time series plot would run from the most recent data back to the first record. Therefore the data is re-ordered ascending by date.

C. Splitting Data

The data is split by dividing the time span into two parts. The first 1000 records are used as training data, and the remaining records (about 356) are used as validation data.
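As an illustration, the loading, re-ordering, and splitting described above could be sketched as follows. The file name, the skipped header row, and the column names are assumptions based on the CSV layout typically provided by cryptodatadownload.com; they are not taken from the paper.

```python
import pandas as pd

# Load the daily Gemini BTC/USD history (file name and layout are assumptions).
df = pd.read_csv("Gemini_BTCUSD_d.csv", skiprows=1)

# The provider lists the newest row first, so re-order ascending by date.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").reset_index(drop=True)

# Keep only the forecasting target used in the paper.
close = df["close"].to_numpy()

# First 1000 records for training, the remaining ~356 for validation.
train_series = close[:1000]
valid_series = close[1000:]
print(len(train_series), len(valid_series))
```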
D. GRU

The Gated Recurrent Unit (GRU) is a one-step-further improvement on the Long Short-Term Memory (LSTM) model. The GRU's success comes mainly from its gating signals, which control how much of the current input and of the past memory is used to update the current activation and produce the current state [8]. The GRU cell makes a number of changes compared with the LSTM cell, combining the Input-Gate and Forget-Gate into an Update-Gate, so that the cell consists of two gates: the Update-Gate and the Reset-Gate.

Figure 1. Gated Recurrent Unit cell

The relationship between input and output can be described by

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W \cdot [r_t \times h_{t-1}, x_t])
h_t = (1 - z_t) \times h_{t-1} + z_t \times \tilde{h}_t

where z_t and r_t are the outputs of the Update-Gate and the Reset-Gate and \tilde{h}_t is the candidate state; W_z and W_r are the weights of the Update-Gate and the Reset-Gate; and \sigma(\cdot) and \tanh(\cdot) are the Sigmoid and Hyperbolic Tangent functions. The Reset-Gate captures short-term dependencies in the sequence data, while the Update-Gate correspondingly helps to obtain long-term dependencies.
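To make the gate equations above concrete, here is a minimal NumPy sketch of a single GRU step following this formulation (weights act on the concatenation [h_{t-1}, x_t]). The layer sizes and the omission of bias terms are assumptions made for brevity, not details taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU update: update gate z, reset gate r, candidate state, new state."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ concat)                                   # update gate
    r = sigmoid(W_r @ concat)                                   # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))    # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                     # new hidden state

# Tiny illustration: 1 input feature, 4 hidden units (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n_in, n_hid = 1, 4
W_z = rng.normal(size=(n_hid, n_hid + n_in))
W_r = rng.normal(size=(n_hid, n_hid + n_in))
W = rng.normal(size=(n_hid, n_hid + n_in))

h = np.zeros(n_hid)
for x in np.array([[0.1], [0.2], [0.15]]):   # a short input sequence
    h = gru_step(x, h, W_z, W_r, W)
print(h)
```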

E. RNN

In a conventional neural network, the inputs are independent of one another and each input is handled by its own unit in a single layer. In many cases, however, the ordering of the inputs needs to be taken into account. A Recurrent Neural Network (RNN), on the other hand, has only a single unit that processes each input in turn, with the current computation depending on the previous computations [9]. RNNs can therefore be considered to have a memory that captures almost all the information computed so far.

Figure 2. Illustration of a Recurrent Neural Network

The picture above shows an RNN being unfolded into a full network; by unfolding we mean that the network is written out for the complete sequence.

F. LSTM

Unlike a conventional RNN, an LSTM (Long Short-Term Memory) network is better at memorizing past data or computations, because additional gates are added to counter the vanishing or exploding gradients that occur in an RNN during backpropagation [10].

Figure 3. Long Short-Term Memory cell

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
c_t = f_t \times c_{t-1} + i_t \times g_t
h_t = o_t \times \tanh(c_t)

where i_t, o_t, f_t, and c_t represent the input gate, output gate, forget gate, and cell state; \sigma is the logistic sigmoid function; h_t and h_{t-1} are the hidden-layer output and the previous LSTM block output; W_{xi}, W_{hi}, W_{xo}, W_{ho}, W_{xf}, W_{hf}, W_{xc}, W_{hc} and b_i, b_o, b_f, b_c are the weights and biases that adjust the output of the respective gates. The weight and bias terms can be trained through an optimizer.

G. MAE

Mean Absolute Error (MAE) is a metric that measures how good a model is: it is the average of the absolute differences between the predicted and actual values,

mae = \frac{\sum_{i=1}^{n} |y_i - \lambda(x_i)|}{n}

where y_i - \lambda(x_i) is the difference between the actual value y_i and the predicted value \lambda(x_i). The absolute value is taken so that positive and negative errors do not cancel each other out; the process is repeated for all n instances and the sum is then divided by n [11].
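As a quick worked example of the MAE formula (the numbers below are made up purely for illustration):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the mean of |y_i - prediction_i|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Absolute errors are 0.02, 0.01, 0.03, so the MAE is their mean, 0.02.
print(mae([0.10, 0.20, 0.30], [0.12, 0.19, 0.27]))
```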

IV. PROBLEM SOLUTION

Now, to build a better prediction model, we apply Hyperparameter Optimization to our models. As previously stated, the two Hyperparameter Optimization techniques we use are Grid Search and Random Search. Each method being tested is combined with each Hyperparameter Optimization technique in turn; in other words, the GRU method is run with Grid Search, with Random Search, and without Hyperparameter Optimization, respectively, and likewise for the other methods. In this way we obtain nine comparisons. But first, we must explain what Grid Search and Random Search are.

A. Grid Search

Grid Search is a Hyperparameter Optimization technique that picks values one by one from the hyperparameter space and tries each of them on the targeted algorithm [12]. Grid Search is a tuning method that attempts to compute the ideal values of the hyperparameters; it is an exhaustive search performed over the specified parameter values of a model (the model is also known as an estimator).

B. Random Search

Unlike Grid Search, Random Search chooses random values in the hyperparameter space based on a probability distribution, for a fixed number of iterations [12]. Random Search is excellent for discovering hyperparameter combinations that one would not have guessed intuitively, although it often requires more time to execute, since it defines the search space as a bounded domain of hyperparameter values and samples points arbitrarily within that domain.

V. RESULT AND SIMULATION

This section experimentally evaluates the performance of three categories:
1. Using Hyperparameter Optimization Grid Search
2. Using Hyperparameter Optimization Random Search
3. Without using any Hyperparameter Optimization

Before creating the models and feeding the data to the neural networks, the first thing to do is normalize the dataset. To normalize it we use the min-max formula:

x' = \frac{x - x_{min}}{x_{max} - x_{min}}

Besides normalization, we also split the dataset into 1000 training records and 356 test records, which lets us see how well the model performs on data it has not seen before. The dataset is then mapped into windows of 40 time steps and batches of 50. Finally, we perform feature selection, choosing the "Close" feature of the Bitcoin dataset.
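A minimal sketch of this preparation step is given below, reusing the train_series and valid_series arrays assumed in the earlier loading sketch. Fitting the min and max on the training span and building the 40-step windows with tf.keras.utils.timeseries_dataset_from_array are both assumptions about how this could be done, not the authors' code.

```python
import tensorflow as tf

WINDOW, BATCH = 40, 50

# Min-max normalization, fitted on the training span (an assumption; the
# paper only states the formula, not which span defines x_min and x_max).
x_min, x_max = train_series.min(), train_series.max()
normalize = lambda s: (s - x_min) / (x_max - x_min)
train_norm, valid_norm = normalize(train_series), normalize(valid_series)

def make_windows(series):
    """Map a 1-D series into (40-step input window, next value) pairs."""
    return tf.keras.utils.timeseries_dataset_from_array(
        data=series[:-1],
        targets=series[WINDOW:],
        sequence_length=WINDOW,
        batch_size=BATCH,
    )

train_ds = make_windows(train_norm)
valid_ds = make_windows(valid_norm)
```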

In this experiment, every model is trained using the same layer configuration, listed below:
1. Reshape Layer
2. RNN/GRU/LSTM Layer
3. RNN/GRU/LSTM Layer
4. RNN/GRU/LSTM Layer
5. Dense Layer

Figure 4. Illustration of the model architecture used in this paper

These are the hyperparameters that will be used for optimization (a sketch of how one such configuration can be assembled follows the list):
• First hidden layer: [16, 32, 64, 128]
• Second hidden layer: [16, 32, 64, 128]
• Third hidden layer: [16, 32, 64, 128]
• Activation function on the Dense layer: Sigmoid, Linear, ReLu
• Optimizer model: ReLu, Adam, Sigmoid
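For illustration, here is a minimal Keras sketch of the five-layer architecture above, built with one arbitrary combination from the grids (64/64/32 units, linear output activation, Adam optimizer). The window length of 40 and the single "Close" feature follow the data preparation described earlier; the loss choice and the rest of this helper are assumptions rather than the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

WINDOW = 40  # input window length used when building the datasets

def build_model(cell=layers.GRU, units=(64, 64, 32),
                dense_activation="linear", optimizer="adam"):
    """Reshape -> three recurrent layers (RNN/GRU/LSTM) -> Dense, as in Figure 4."""
    model = models.Sequential([
        layers.Input(shape=(WINDOW,)),
        layers.Reshape((WINDOW, 1)),              # one feature per step: "Close"
        cell(units[0], return_sequences=True),
        cell(units[1], return_sequences=True),
        cell(units[2]),
        layers.Dense(1, activation=dense_activation),
    ])
    model.compile(optimizer=optimizer, loss="mae", metrics=["mae"])
    return model

model = build_model()               # swap cell=layers.SimpleRNN or layers.LSTM
model.summary()
# model.fit(train_ds, validation_data=valid_ds, epochs=150)
```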

All experiments in this research were conducted on a personal computer running TensorFlow GPU (CUDA), equipped with an Intel Core i9-10900KF at 3.7 GHz, 16 GB of memory, and an NVIDIA GeForce RTX 3080 GPU with 10 GB of memory.

A. Hyperparameter Optimization Grid Search

The RNN, GRU, and LSTM models trained with Grid Search take varying amounts of time: about 2 hours 5 minutes 10 seconds for RNN, about 4 hours 1 minute 53 seconds for GRU, and about 3 hours 31 minutes 7 seconds for LSTM, with 150 epochs and a total of 1537 Grid Search iterations.

Figure 5. Loss of RNN, GRU, and LSTM with Grid Search
Figure 6. Validation of RNN, GRU, and LSTM with Grid Search
Figure 7. Training of RNN, GRU, and LSTM with Grid Search
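A grid search over the hyperparameter lists above can be sketched as an exhaustive loop over every combination. This is a schematic illustration only, reusing the hypothetical build_model helper from the previous sketch; because Keras would not accept activation functions as optimizers, the optimizer grid here is restricted to Adam.

```python
import itertools

# Hyperparameter grids listed in the paper (optimizer list reduced; see above).
grid = {
    "first": [16, 32, 64, 128],
    "second": [16, 32, 64, 128],
    "third": [16, 32, 64, 128],
    "dense_activation": ["sigmoid", "linear", "relu"],
    "optimizer": ["adam"],
}

best_mae, best_cfg = float("inf"), None
for u1, u2, u3, act, opt in itertools.product(*grid.values()):
    model = build_model(units=(u1, u2, u3), dense_activation=act, optimizer=opt)
    model.fit(train_ds, validation_data=valid_ds, epochs=150, verbose=0)
    val_mae = model.evaluate(valid_ds, verbose=0)[1]   # [loss, mae]
    if val_mae < best_mae:
        best_mae, best_cfg = val_mae, (u1, u2, u3, act, opt)

print("best validation MAE:", best_mae, "with configuration:", best_cfg)
```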

B. Hyperparameter Optimization Random Search

Next, the RNN, GRU, and LSTM models trained with Random Search also take varying amounts of time: about 16 minutes 32 seconds for RNN, about 47 minutes 38 seconds for GRU, and about 16 minutes 32 seconds for LSTM, again with 150 epochs and with the number of Random Search iterations set to 100.

Figure 8. Loss of RNN, GRU, and LSTM with Random Search
Figure 9. Validation of RNN, GRU, and LSTM with Random Search
Figure 10. Training of RNN, GRU, and LSTM with Random Search
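By analogy with the grid search sketch, a random search with a fixed budget of 100 iterations might look as follows; this is again an assumption-level illustration that reuses the hypothetical build_model helper and the same reduced search space.

```python
import random

random.seed(42)
N_ITER = 100   # fixed number of Random Search iterations

space = {
    "units": [16, 32, 64, 128],
    "dense_activation": ["sigmoid", "linear", "relu"],
    "optimizer": ["adam"],   # see the note in the grid search sketch
}

best_mae, best_cfg = float("inf"), None
for _ in range(N_ITER):
    # Sample one random point from the bounded hyperparameter space.
    cfg = (
        random.choice(space["units"]),
        random.choice(space["units"]),
        random.choice(space["units"]),
        random.choice(space["dense_activation"]),
        random.choice(space["optimizer"]),
    )
    model = build_model(units=cfg[:3], dense_activation=cfg[3], optimizer=cfg[4])
    model.fit(train_ds, validation_data=valid_ds, epochs=150, verbose=0)
    val_mae = model.evaluate(valid_ds, verbose=0)[1]
    if val_mae < best_mae:
        best_mae, best_cfg = val_mae, cfg

print("best validation MAE:", best_mae, "with configuration:", best_cfg)
```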

C. Without Hyperparameter Optimization

Finally, the RNN, GRU, and LSTM models trained without Hyperparameter Optimization do not require the additional search time of Grid Search or Random Search, because we simply supply an arbitrary set of hyperparameters.

Figure 11. Loss of RNN, GRU, and LSTM without Hyperparameter Optimization
Figure 12. Validation of RNN, GRU, and LSTM without Hyperparameter Optimization
Figure 13. Training of RNN, GRU, and LSTM without Hyperparameter Optimization

If we calculate the MAE (Mean Absolute Error) for every run and put all the results in one table, it looks like this:

Hyperparameter Optimization    Method    Train MAE                 Validation MAE
Grid Search                    RNN       0.005312362858975959      0.07327416996645006
Grid Search                    GRU       0.004333221323413428      0.05936857883241456
Grid Search                    LSTM      0.004147684353620882      0.09947486805032896
Random Search                  RNN       0.004252993518323228      0.061953168649061204
Random Search                  GRU       0.0045455186871690995     0.0839854264687884
Random Search                  LSTM      0.004265444788501635      0.09666303902669338
Without Hyperparameter Opt.    RNN       0.07493549258039695       0.3936291193184623
Without Hyperparameter Opt.    GRU       0.00395552953871398       0.06961754791752854
Without Hyperparameter Opt.    LSTM      0.004269957587696327      0.08985289835946308

Table 1. MAE results for all models

As we can see from the results above, the smallest MAE on the validation set is 0.05936857883241456, obtained with Grid Search on GRU (Gated Recurrent Unit). This may happen because GRU is the most refined of the three methods, being developed from the LSTM model, and it also works better with a small dataset. Using Grid Search on these models costs much more training time, but it is worth it for the higher accuracy obtained above.

VI. CONCLUSION

Bitcoin is a very valuable commodity on the market today because its exchange rate is high. This makes the fluctuation of the Bitcoin price difficult to determine, so not everyone who trades Bitcoin, for example, will come out ahead. Of the methods compared, namely GRU, RNN, and LSTM with the Hyperparameter Optimization techniques Grid Search and Random Search, the best result on the test set is obtained by the GRU method with Grid Search applied. On the training data, GRU without Hyperparameter Optimization also gets quite good results, although its test accuracy is not as good as its training accuracy, which indicates that the model is overfitting. With the results obtained above, we recommend using GRU with Hyperparameter Optimization Grid Search but not Random Search, because the difference in MAE is consistently higher in the models that do not use Hyperparameter Optimization. We can also see in the table that there are some anomalies in the models (e.g. a higher MAE on the test set for LSTM with Grid Search than for LSTM without Hyperparameter Optimization). This is because there is some variability inherent in deep learning and TensorFlow: the training and optimization of deep learning models rely strongly on stochastic procedures, and TensorFlow uses the cuDNN framework [13].

REFERENCES

[1] Makridakis, Spyros, Wheelwright, Steven C., and McGee, Victor E., 1999, Methods and Applications of Forecasting, Jakarta: Erlangga Publisher.
[2] S. M. Idrees, M. A. Alam and P. Agarwal, "A Prediction Approach for Stock Market Volatility Based on Time Series Data," in IEEE Access, vol. 7, pp. 17287-17298, 2019, doi: 10.1109/ACCESS.2019.2895252.
[3] B. Nakisa, M. N. Rastgoo, A. Rakotonirainy, F. Maire and V. Chandran, "Long Short Term Memory Hyperparameter Optimization for a Neural Network Based Emotion Recognition Framework," in IEEE Access, vol. 6, pp. 49325-49338, 2018, doi: 10.1109/ACCESS.2018.2868361.
[4] W. Yiying and Z. Yeze, "Cryptocurrency Price Analysis with Artificial Intelligence," 2019 5th International Conference on Information

Management (ICIM), 2019, pp. 97-101, doi: 10.1109/INFOMAN.2019.8714700.
[5] T. Phaladisailoed and T. Numnonda, "Machine Learning Models Comparison for Bitcoin Price Prediction," 2018 10th International Conference on Information Technology and Electrical Engineering (ICITEE), 2018, pp. 506-511, doi: 10.1109/ICITEED.2018.8534911.
[6] Shewalkar, Apeksha, Nyavanandi, Deepika & Ludwig, Simone. (2019). Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU. Journal of Artificial Intelligence and Soft Computing Research, 9, 235-245. doi: 10.2478/jaiscr-2019-0006.
[7] Sergio Ramírez-Gallego, Bartosz Krawczyk, Salvador García, Michał Woźniak, Francisco Herrera, "A survey on data preprocessing for data stream mining: Current status and future directions," Neurocomputing, Volume 239, 2017, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2017.01.078.
[8] R. Li, J. Hu and S. Yang, "Deep Gated Recurrent Unit Convolution Network for Radio Signal Recognition," 2019 IEEE 19th International Conference on Communication Technology (ICCT), 2019, pp. 159-163, doi: 10.1109/ICCT46805.2019.8947225.
[9] X. Zhang and T. Luo, "A RNN Decoder for Channel Decoding under Correlated Noise," 2019 IEEE/CIC International Conference on Communications Workshops in China (ICCC Workshops), 2019, pp. 30-35, doi: 10.1109/ICCChinaW.2019.8849949.
[10] J. Li and Y. Shen, "Image describing based on bidirectional LSTM and improved sequence sampling," 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), 2017, pp. 735-739, doi: 10.1109/ICBDA.2017.8078733.
[11] Mean Absolute Error. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. 2011. https://doi.org/10.1007/978-0-387-30164-8_525
[12] Simon Chan, Philip Treleaven, Chapter 5 - Continuous Model Selection for Large-Scale Recommender Systems, Handbook of Statistics, Elsevier, Volume 33, 2015, ISSN 0169-7161, ISBN 9780444634924, https://doi.org/10.1016/B978-0-444-63492-4.00005-8.
[13] Kerautret, B., Colom, M., Lopresti, D., Monasse, P., & Talbot, H. (Eds.). (2019). Reproducible Research in Pattern Recognition. Lecture Notes in Computer Science. doi: 10.1007/978-3-030-23987-9.
