In recent decades, Long Short-Term Memory networks (LSTMs), an enhanced version of Recurrent Neural Networks (RNNs), have made significant contributions across various domains. Particularly in the study of time series data, they have offered promising capabilities in capturing temporal dependencies and patterns. This paper delves into the application of LSTMs in market forecasting, aiming to use historical price data to construct predictive models and optimize investment allocations for improved portfolio performance. The investigation includes a detailed examination of hyperparameters tailored for Invesco QQQ Trust (QQQ), SPDR Gold Trust (GLD), and Bitcoin (BTC) LSTM models, employing them for price prediction and the development of high-return trading strategies. Following this, an analysis is carried out on portfolio holdings, return rates, and risk enhancements for each investment asset within the testing set under this trading strategy.
In the realm of artificial intelligence and deep learning, Recurrent Neural Networks (RNNs) have emerged as indispensable tools for handling sequential data. RNNs maintain information states in time series data through recursive connections. However, they encounter challenges such as the vanishing or exploding gradient problem, making it difficult to learn long-term dependencies.
In 1997, Hochreiter and Schmidhuber [1] introduced Long Short-Term Memory networks (LSTMs), an improved version of RNNs specifically designed to address the problem of long-term dependencies. LSTMs enhance the ability to capture long-term information in sequences by introducing memory cells and gating mechanisms. This seminal work laid the foundation for the application of LSTMs in Natural Language Processing (NLP) tasks such as machine translation, sentiment analysis, and named entity recognition. Since then, LSTMs have been widely applied in fields such as speech recognition [2], medical image processing [3], intelligent transportation systems [4], time series forecasting [5], and financial markets [6, 7]. Inspired by these studies, this paper applies LSTMs to predict market trends for Invesco QQQ Trust (QQQ), SPDR Gold Trust (GLD), and Bitcoin (BTC), and discusses potential optimizations for investment portfolios. GLD is often considered a safe-haven asset; BTC is a global digital asset typically viewed as a higher-risk appreciating asset; and QQQ encompasses major companies in the technology sector, offering relatively stable long-term growth potential. The selection of these three investment assets therefore accounts for risk diversification, hedging and appreciation, and global market coverage.
It is well known that Recurrent Neural Networks (RNNs) are a type of deep learning model designed for handling sequential or time-based data. Their unique feature lies in the inclusion of cyclic connections, allowing information to be passed within the network and utilized for processing different time steps.
The basic structure of an RNN (Figure 1) consists of input, hidden, and output layers. The neurons in the hidden layer not only receive input from the current time step but also incorporate output from the hidden layer of the previous time step.
At each time step $t$, an RNN receives input and performs a forward pass. The input includes the data $x_t$ for the current time step and the hidden state $h_{t-1}$ from the previous time step. The RNN computes the output $y_t$ and the hidden state $h_t$ for each time step. By comparing the model's output $y_t$ with the actual target values, the loss function is computed. Next, the backward pass (backpropagation through time) is employed to calculate the gradients of the loss function with respect to the network parameters (weights); the gradients represent the rate of change of the loss with respect to the parameters. Using gradient descent or another optimization algorithm, the gradient information is applied to update the parameter values, so that the next forward pass produces predictions closer to the actual targets. These steps are iterated over multiple time steps; in each iteration, the model adjusts its parameters to minimize the loss function. The hidden state is passed from each time step to the next, creating a recurrent structure that enables the network to consider contextual information in the sequence. When handling long sequences with traditional RNNs, however, gradient information in the backward pass can become extremely small (vanishing) or exceptionally large (exploding). The former hinders effective learning of long-term dependencies, while the latter causes unstable network parameters and uncontrollable training, degrading overall model performance. To address these issues, an improved structure was introduced, known as Long Short-Term Memory (LSTM), whose memory cells and gate mechanisms effectively alleviate gradient vanishing and exploding and overcome the challenges of long-term dependencies.
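The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the weight names (`Wxh`, `Whh`, `Why`) and shapes are assumptions chosen for clarity.

```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why, bh, by):
    """One forward pass of a vanilla RNN over a sequence.

    xs : list of input vectors, one per time step
    h0 : initial hidden state
    Wxh, Whh, Why, bh, by : weights and biases (illustrative names)
    Returns the outputs y_t and hidden states h_t for every step.
    """
    h = h0
    hs, ys = [], []
    for x in xs:
        # The new hidden state mixes the current input with the
        # previous hidden state, as described in the text.
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        y = Why @ h + by
        hs.append(h)
        ys.append(y)
    return ys, hs
```

In training, the loss between each `y_t` and its target would be backpropagated through these recursive `tanh` steps, which is exactly where the vanishing/exploding gradient problem arises.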
In an LSTM, the memory cell is responsible for storing information and can retain or update this information over extended periods, enabling more effective handling of long sequential dependencies. The recurrent unit in LSTM consists of three main gates: the input gate, the forget gate, and the output gate(see Figure 2). These gates control the flow of information, allowing the LSTM to selectively store, forget, or output information at different time steps.
The forget gate decides which information will be removed from the previous memory cell state $C_{t-1}$. Let $h_{t-1}$ and $x_t$ be the previous hidden state and the input data at the current time step $t$, and let $f_t$ be the weight of information to be forgotten in $C_{t-1}$. The forget gate uses a sigmoid activation function $\sigma$ to decide how much information to retain or forget based on the input data $x_t$ and the previous time step's hidden state $h_{t-1}$:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$

where $W_f$ and $b_f$ are the weight matrix and bias parameter of the forget gate. The output $f_t$ ranges between 0 and 1, where 0 indicates a preference to forget and discard, and 1 indicates a preference to remember and retain information from the past.
The input gate computes the candidate memory cell state $\tilde{C}_t$ for the current time step and the input gate value $i_t$ between 0 and 1, which controls the proportion of input information to be stored in the current memory cell $C_t$; the cell is then updated using $f_t$ and $i_t$:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$

where $W_i$, $W_C$ and $b_i$, $b_C$ are the weight matrices and bias parameters of the input gate.
The output gate determines the output hidden state $h_t$ using the output gate value $o_t$:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$
$$h_t = o_t \odot \tanh(C_t),$$

where $W_o$ and $b_o$ are the weight matrix and bias parameter of the output gate. The output gate controls how much information is emitted, utilizing the sigmoid and hyperbolic tangent (tanh) activation functions: it produces a value between 0 and 1 based on the input data and the previous hidden state, and outputs a portion of the information processed by the tanh function applied to the memory cell. Through the combination of these gates, an LSTM can handle long sequences more effectively, helping the network learn and retain dependencies on past information. This capability makes it well suited for a wide variety of tasks.
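The three gate equations can be combined into a single LSTM time step. The sketch below is a minimal NumPy illustration under assumed shapes (the parameter dictionaries `W` and `b` stacking the four gates are a convenience of this example, not notation from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """Single LSTM time step implementing the three gates above.

    W and b are dicts holding (W_f, W_i, W_C, W_o) and (b_f, b_i, b_C, b_o);
    z concatenates the previous hidden state h_{t-1} with the input x_t.
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate memory cell state
    c_t = f_t * c_prev + i_t * c_tilde       # updated memory cell
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t
```

Note how `c_t` is a weighted blend of old memory (`f_t * c_prev`) and new candidate content (`i_t * c_tilde`), which is precisely what allows gradients to flow over long horizons.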
2.2. Critical Hyperparameters in LSTM

When building a Long Short-Term Memory (LSTM) network, a crucial aspect involves the selection of hyperparameters. Unlike parameters, which are internal variables adjusted during training, hyperparameters are external configuration settings that play a pivotal role in shaping the learning process. These higher-level structural choices significantly impact the model's performance and its ability to generalize to new data. In the context of LSTMs, careful consideration of hyperparameters is essential for achieving optimal results. This section delves into the key hyperparameters associated with LSTMs, exploring their roles, significance, and the considerations involved in selecting appropriate values for a given task. From the learning rate to the number of hidden units, each hyperparameter contributes to the overall architecture and behavior of the LSTM, making their understanding and fine-tuning crucial for successful model training.
Initial Learning Rate. The initial learning rate is the starting value for the step size at which the model's parameters are updated during training. A higher initial learning rate may lead to faster convergence but risks overshooting the optimal values; a lower one may result in slower convergence but provide more stable updates.
Input Size. The input size refers to the dimensionality of the input data fed into the LSTM network; in natural language processing, for example, it could be the length of a sequence or the number of features at each time step. A larger input size may require more computational resources but can capture more complex patterns; a smaller one may lead to faster training but can result in a loss of information.
Number of Hidden Units. The number of hidden units (also called hidden neurons) represents the dimensionality of the hidden state and cell state in the LSTM, and it determines the capacity of the LSTM to learn and represent complex relationships. A higher number of hidden units increases the model's capacity to capture intricate patterns but also increases the risk of overfitting, especially with limited data; a lower number may lead to underfitting, where the model struggles to capture important patterns.
Number of Epochs. The number of epochs is the count of times the entire training dataset is passed through the LSTM during training. Training for too few epochs may result in an underfit model, while training for too many may lead to overfitting; finding a balance is essential for the best generalization to unseen data.
In short, in an LSTM, the initial learning rate influences the speed and stability of convergence; the input size affects the model's ability to capture information from the input; the number of hidden units determines the model's capacity to learn complex patterns; and the number of epochs influences the balance between underfitting and overfitting. Choosing appropriate values for these hyperparameters involves a trade-off between model complexity, training speed, and generalization to new data, and hyperparameter tuning is often required to find the optimal configuration for a specific task. In the following sections, we develop price prediction models based on LSTM neural networks for QQQ, GLD, and BTC, with an emphasis on investigating more effective input sizes and numbers of hidden units.
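A tuning procedure of this kind amounts to enumerating candidate configurations and training a model for each. The sketch below shows the enumeration step only; the input sizes match those examined later in the paper (5, 10, 20, 60 days) and the 0.005 learning rate is used there as well, while the hidden-unit values are illustrative assumptions:

```python
from itertools import product

# Search grid: input sizes and the learning rate follow the paper;
# the hidden-unit values are assumed for illustration.
grid = {
    "initial_learning_rate": [0.005],
    "input_size": [5, 10, 20, 60],
    "num_hidden_units": [50, 100, 200],
}

def all_configs(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(all_configs(grid))
# 1 * 4 * 3 = 12 candidate configurations to train and compare
```

Each configuration would then be trained and scored (e.g., by MAE, MAPE, and RMSE on a validation split), keeping the best-performing combination.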
We collected the daily closing prices for QQQ, GLD, and BTC from January 2010 to July 2023. As the QQQ and GLD ETFs do not trade on weekends and holidays, we use the closing price from the previous trading day for those days. For the daily price data of these three investment assets, we selected the first 70% (from August 1, 2010, to September 8, 2019) as the training set and the remaining 30% (from September 9, 2019, to July 31, 2023) as the test set.
Given that neural network learning inherently involves capturing the distribution of the data, it becomes imperative to normalize the data to maintain consistency. Without normalization, each batch of training data may exhibit a different distribution. As the neural network strives to strike a balance among these multiple distributions, the input data for each layer undergoes constant changes, complicating the search for an optimal equilibrium and potentially hindering the convergence of the constructed neural network model. To expedite model convergence, mitigate the risk of gradient explosion, and enhance overall training speed and efficiency, data normalization is applied.
The normalization formula is expressed as

$$x'_t = \frac{x_t - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_t$ is the price on day $t$, $x_{\min}$ is the minimum value in the price sample, and $x_{\max}$ is the maximum value in the price sample.
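This min-max scaling maps every price into [0, 1]; a minimal sketch:

```python
def min_max_normalize(prices):
    """Scale a price series to [0, 1] using min-max normalization:
    x'_t = (x_t - min) / (max - min)."""
    lo, hi = min(prices), max(prices)
    return [(p - lo) / (hi - lo) for p in prices]
```

Note that in practice the minimum and maximum should be computed on the training set only and reused for the test set, so that no information leaks from future prices into training.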
Setting the initial learning rate to 0.005 and fixing the number of epochs, we use LSTM networks with different values for the input size and the number of hidden units to predict prices for QQQ, GLD, and BTC. We then assess model performance using the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE).
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$
where $n$ is the size of the sample, $\hat{y}_i$ is the model's predicted price, and $y_i$ is the real price. Smaller MAE, MAPE, and RMSE values indicate a closer alignment between the predicted and true values.
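The three evaluation metrics are straightforward to compute; a minimal sketch:

```python
import math

def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mape(pred, true):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((p - t) / t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))
```

RMSE squares the errors before averaging, so it penalizes large deviations more heavily than MAE; this difference matters later when comparing the BTC forecasts.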
We selected input sizes of 5, 10, 20, and 60 based on the common practice in daily stock trading, where market participants often rely on moving averages calculated over these specific periods (5 days, 10 days, 20 days, and 60 days) for making assessments. This choice aligns with established strategies in the financial industry, leveraging these specific timeframes to gauge trends and make informed decisions in stock trading.
In Table 1, it is evident that all three model evaluation metrics reach their minimum values on the testing set when the input size is 5 (corresponding to Forecast 1 in Figure 3); slightly higher values are observed for the second-best configuration (resulting in Forecast 2 in Figure 3). From this analysis we conclude that, for QQQ, utilizing the daily prices of the preceding five days as input to the LSTM network is more suitable for accurate price estimation.
In the case of LSTM networks for GLD, we observe similar outcomes. Specifically, for GLD, using the daily prices from the previous 20 days as input to the LSTM network appears more suitable for precise price estimation. This observation is based on the fact that all three model evaluation metrics reach their minimum values when the input size is 20, corresponding to Forecast 1 within the testing set shown in Figure 4. The second-best performing LSTM network, with the second-smallest values for all three evaluation metrics, generates Forecast 2 in Figure 4; it shares the same input size of 20 as the top-performing model but differs in its number of hidden units.
When comparing the BTC LSTM networks, the models with the minimum and second-minimum values for MAE and MAPE are the same, corresponding to Forecasts 1 and 2 in Figure 5; Forecast 1 uses an input size of 5. However, the networks with the minimum and second-minimum values for RMSE differ from Forecasts 1 and 2: they align with Forecasts 2 and 3 in Figure 5, where Forecast 3 represents the LSTM network with an input size of 10.
In this context, directly comparing forecast 1 (minimizing MAE and MAPE) and forecast 3 (minimizing RMSE) for predicting BTC prices is challenging due to their unique features. Forecast 1 is better suited for periods of lower BTC price volatility, as shown in Figure 6. Conversely, forecast 3 is considered more effective during times of increased BTC price fluctuations, as depicted in Figure 7. This difference in suitability stems from the distinct characteristics of the evaluation metrics and the underlying patterns captured by each forecast in response to different levels of BTC price volatility.
Summarizing the analysis and discussion in this subsection, we obtain the following intriguing findings: When employing LSTM for price evaluation, QQQ is best suited for using the prices of the previous five days; GLD is best suited for utilizing the prices of the previous 20 days. As for BTC, it is optimal to use the prices of the previous five days when its price volatility is low and the prices of the previous ten days when volatility is high.
Building upon the analysis in subsection 2.4, we select LSTM models with the following crucial hyperparameters for QQQ, GLD, and BTC to generate forecasts for the next day's prices.
Let $\hat{P}_t$ denote the predicted price on day $t$ and $P_{t-1}$ denote the actual price on day $t-1$; then the expected daily return rate is

$$\hat{R}_t = \frac{\hat{P}_t - P_{t-1}}{P_{t-1}}.$$

We compute and compare the daily $\hat{R}_t$ for QQQ, GLD, and BTC, utilizing the comparison results to inform our trading strategy. On regular trading days, we prioritize the investment target with the highest expected daily return among QQQ, GLD, and BTC: we sell the other two assets and use the proceeds to buy more of the highest-return asset. If all assets have negative expected daily returns, we sell everything and hold cash. On holidays (non-trading days), QQQ and GLD remain untouched, while BTC is tradable: if the expected daily return for BTC is negative, we sell it; if positive, we use all available cash to buy BTC.
Let $CR_t$ denote the cumulative return rate on day $t$; then

$$CR_t = \prod_{i=1}^{t}\left(1 + R_i\right) - 1,$$

where $R_i = (B_i - B_{i-1})/B_{i-1}$ is the real daily return rate and $B_i$ is the overall balance of all assets on day $i$. Applying the above trading strategy to our test set, covering September 2019 to July 2023, resulted in a remarkable return rate of 317.54%, with an annual average return rate of 81.91%.
In addition to the high return rate in this trading strategy, another noteworthy outcome is the variability in the number of days each distinct asset is held in the testing set. In 2019 and 2020, our trading strategy involved holding BTC for the longest duration, while the period of holding gold was the shortest, with no gold holdings in 2019. However, in the subsequent two years, gold emerged as the most frequently chosen investment among the three, dominating in 2021 and slightly decreasing thereafter. There were no days of holding BTC in 2021 and 2023, and the holding period in 2022 was also the shortest. In contrast, the days of holding gold constituted a significant proportion over these three years.
The investment strategy discussed above focuses on high returns without taking the associated risks into consideration. However, there are significant differences in the risk profiles of these three investment assets. Let $R_t$ represent the actual daily return rate of an asset, calculated as $R_t = (P_t - P_{t-1})/P_{t-1}$, where $P_t$ is the actual price of the asset on day $t$. The risk standard deviation (RSD) is given by

$$\mathrm{RSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(R_i - \bar{R}\right)^2},$$

where $N$ is the number of historical return rates and $\bar{R}$ denotes the average historical return value. Taking $N$ as 30 days, we generated the following Risk-Return graph based on the actual daily prices of QQQ, GLD, and BTC for the testing set.
In Figure 10, it is evident that among the three investment targets, GLD stands out as the most stable in terms of both returns and risks, followed by QQQ. Conversely, BTC exhibits the highest risk but also encompasses a wider range of returns. A natural follow-up question is how the risk-return profiles of these three investment targets evolve under the trading strategy outlined in this article. Let $r_t$ denote the daily trading return rate for an asset. We define $r_t = R_t$ if we choose to hold the asset on day $t$, i.e., buy or retain the asset on day $t$, and $r_t = 0$ if we choose not to hold the asset on day $t$, i.e., sell or refrain from buying it. The corresponding trading RSD is calculated by

$$\mathrm{RSD}_{\text{trading}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_i - \bar{r}\right)^2}.$$
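The trading return series is simply the actual return series masked by the holding decisions, after which the same standard-deviation formula applies. A minimal self-contained sketch:

```python
import math

def trading_returns(actual_returns, held):
    """r_t = R_t on days the asset is held, else 0."""
    return [r if h else 0.0 for r, h in zip(actual_returns, held)]

def trading_rsd(actual_returns, held):
    """Trading risk standard deviation of the masked return series,
    using the same population formula as the ordinary RSD."""
    r = trading_returns(actual_returns, held)
    n = len(r)
    mean = sum(r) / n
    return math.sqrt(sum((x - mean) ** 2 for x in r) / n)
```

Zeroing out the returns on non-holding days is what drives the risk reduction seen in Figure 11: days on which the strategy avoids an asset contribute no volatility.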
Upon comparing Figure 10 and Figure 11, it becomes evident that, when applying our trading strategy, both QQQ and BTC experience a notable reduction in risk along with a substantial increase in returns compared to their actual daily risk-return profiles. Although the risk range for GLD remains relatively stable, there is an enhancement in its return rate as well.
In this article, we delved into the application of LSTMs for predicting the prices of investment assets, facilitating the creation of a high-return trading strategy. Our focus revolved around the selection of two crucial hyperparameters for the LSTM: the input size and the number of hidden units. There is potential for further optimizing the model by incorporating techniques such as Dropout [8, 9] and batch normalization [10], among others. Moreover, when crafting trading strategies based on predicted prices, we can broaden our analysis to encompass factors such as transaction costs and the inflation rate of cash. In addition, our current trading strategy involves daily transactions; in the future, we may investigate strategies with weekly or monthly trading frequencies and compare the optimal trading frequency for various investment assets.
[1] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
[2] Graves, A., Mohamed, A., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013 – IEEE International Conference on Acoustics, Speech and Signal Processing.
[3] Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., & Nielsen, M. (2013). Deep Feature Learning for Knee Cartilage Segmentation Using a Triplanar Convolutional Neural Network. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, Lecture Notes in Computer Science, vol. 8150. Springer, Berlin, Heidelberg.
[4] Lv, Y., Duan, Y., Kang, W., Li, Z., & Wang, F.-Y. (2015). Traffic Flow Prediction With Big Data: A Deep Learning Approach. IEEE Transactions on Intelligent Transportation Systems, 16(2), 865–873.
[5] Hewamalage, H., Bergmeir, C., & Bandara, K. (2021). Recurrent Neural Networks for Time Series Forecasting: Current status and future directions. International Journal of Forecasting, 37(1), 388–427.
[6] Fischer, T., & Krauss, C. (2018). Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2), 654–669.
[7] Qiu, J., Wang, B., & Zhou, C. (2020). Forecasting stock prices with long-short term memory neural network based on attention mechanism. PLoS One, 15(1), e0227222.
[8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
[9] Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 1310–1318.
[10] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), vol. 37, 448–456.