
MultiCast: Zero-Shot Multivariate Time Series Forecasting Using LLMs

Georgios Chatzigeorgakidis
“Athena” Research Center, Greece
gchatzi@athenarc.gr

Konstantinos Lentzos
“Athena” Research Center, Greece
klentzos@athenarc.gr

Dimitrios Skoutas
“Athena” Research Center, Greece
dskoutas@athenarc.gr
Abstract

Predicting future values in multivariate time series is vital across various domains. This work explores the use of large language models (LLMs) for this task. However, LLMs typically handle one-dimensional data. We introduce MultiCast, a zero-shot LLM-based approach for multivariate time series forecasting. It allows LLMs to receive multivariate time series as input, through three novel token multiplexing solutions that effectively reduce dimensionality while preserving key repetitive patterns. Additionally, a quantization scheme helps LLMs to better learn these patterns, while significantly reducing token use for practical applications. We showcase the performance of our approach in terms of RMSE and execution time against state-of-the-art approaches on three real-world datasets.

Index Terms:
large language models, multivariate time series, forecasting

I Introduction

A time series is a sequence of data points, typically recorded at successive equally spaced intervals of time. These data points can represent various measurements, observations, or readings taken over time, such as temperature readings, stock prices, sales figures, or sensor readings. Time series analysis involves studying the patterns, trends, and relationships present in the data to understand its behavior over time [1]. Time series forecasting predicts future values of a time series based on its past observations.

Traditional time series forecasting methods have demonstrated considerable efficacy over the years and continue to maintain relevance and widespread adoption in contemporary practice [2]. In general, these methods can be categorized into linear [3], [4] and non-linear models [5], [6].

Arguably, the most popular traditional time series method is the AutoRegressive Integrated Moving Average (ARIMA) [7]. ARIMA consists of three components: (i) the AutoRegressive (AR) component assumes that the current value of a time series is a linear combination of its past values, with the addition of a white-noise term; (ii) the Moving Average (MA) component assumes that the current value is a linear combination of past white-noise terms, with no dependence on past values of the variable itself; (iii) the Integrated (I) component incorporates differencing to make the time series stationary, allowing the modeling of non-stationary data.
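For reference, combining these three components yields the standard ARIMA(p, d, q) form (textbook notation following [7]; not specific to our method):

% ARIMA(p,d,q): the d-times differenced series follows an ARMA(p,q) process
\begin{equation*}
  \Big(1 - \sum_{i=1}^{p} \phi_i L^i\Big)\,(1 - L)^d\, y_t
  \;=\; \Big(1 + \sum_{j=1}^{q} \theta_j L^j\Big)\, \varepsilon_t ,
\end{equation*}

where $L$ is the lag operator, $\phi_i$ the autoregressive coefficients, $\theta_j$ the moving-average coefficients, and $\varepsilon_t$ white noise.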

Machine learning and, in particular, deep learning have emerged as transformative approaches in the field of time series forecasting, offering new advances [8, 9, 10, 11]. Moreover, pre-training has been used in deep learning to significantly accelerate the training process and increase performance [12]. In domains such as computer vision and Natural Language Processing (NLP), pre-training allows performance to scale with the availability of data. However, in the context of time series modeling, access to sizable pre-training datasets is often limited.

Large Language Models (LLMs) have emerged as a popular tool for NLP tasks and have received considerable attention in recent years. LLMs are pre-trained models, trained on vast amounts of text data, and their ability to learn rich representations of language has drawn the attention of the scientific community. Specifically, LLMs are quite capable of capturing syntactic, semantic, and contextual information [13]. Another interesting aspect of LLMs is their emergent abilities [14], i.e., capabilities that are not explicitly programmed or designed but rather emerge spontaneously from the complex internal processes of the models. In the past few years, researchers have focused on leveraging the potential of LLMs to solve problems from fields other than NLP. In particular, in time series forecasting, by taking advantage of pre-learned representations of language, LLMs can potentially capture temporal relationships and time series dynamics [15]. However, most works have focused on univariate time series forecasting, requiring either fine-tuning [16] or a few-shot prompting approach [17] (i.e., providing a few examples via prompting to guide the model’s behavior for a specific task).

In this work, we examine the utility of LLMs for multivariate time series forecasting via zero-shot prompting (i.e., no additional examples are provided). To the best of our knowledge, ours is the first work that addresses this problem.

Our contributions are summarized as follows:

  • We introduce three dimensional multiplexing techniques that combine all dimensions of a multivariate series into a single token sequence, passed to an LLM as input.

  • We employ SAX quantization on the time series to facilitate inference by the model and to significantly reduce the computational cost and token usage.

  • We present an experimental evaluation against existing traditional, machine learning, and LLM-based methods for time series forecasting.

II Related Work

LLMs have been applied in many different domains and contexts, such as healthcare [18], [19], [20], financial modeling [21], [22], [23], and education and research [24], [25], as well as to time series data [26] for many different tasks and application domains [27].

The authors of TIME-LLM [28] introduce a reprogramming framework aimed at adapting LLMs for time series forecasting without altering their pre-trained structure. TIME-LLM reprograms input time series into text prototype representations that suit LLMs’ capabilities and introduces Prompt-as-Prefix (PaP), which enriches the input context with natural language instructions. The reprogrammed input is then processed by the frozen LLM, and the output is projected to generate time series forecasts, updating only lightweight input-transformation and output-projection parameters while the backbone language model remains frozen. Both short- and long-term scenarios are addressed, as well as few- and one-shot learning.

LLMTIME [15] is the first approach to apply zero-shot forecasting on time series using LLMs. The authors argue that the output of LLMs when predicting digit-by-digit follows a multimodal distribution, which fits the case of time series well. To apply forecasting, the time series values are rescaled to a predefined number of digits (to use fewer tokens) and tokenized; the values, separated by commas, are then passed to the model. Notably, the model’s output is constrained to produce only digits and commas (i.e., [0-9,]). At each time step, a predefined number of samples is drawn, and the final forecast is built using the median of all samples after descaling the output values.

Despite the potential of LLMs for time series forecasting, there are several limitations that need to be addressed.

  • No multivariate support: Most current approaches using LLMs for time series forecasting focus on univariate time series data. This limitation restricts the applicability of LLMs to certain types of time series data.

  • Fine-tuning requirement: Fine-tuning can be time-consuming and computationally expensive, particularly for large models. It also requires a substantial amount of training data, which may not always be available.

  • Number of tokens required: LLMs are extremely large models that require GPUs with large amounts of memory to run efficiently. Thus, their broad availability depends on services that host such models, which usually charge per token. Consequently, very large queries (e.g., a long time series in our context) can be rather expensive to run.

III MultiCast

In the following, we describe our approach to zero-shot multivariate time series forecasting using LLMs. First, we go through the three separate token multiplexing approaches that we propose. Then, we describe our approach to reducing complexity using the SAX representation.

III-A Dimensional Multiplexing

The dimensional multiplexing process takes place after each dimension has been rescaled to avoid decimals. Then, each digit is treated separately. An example of this process is illustrated in the top row of Figure 1. Depending on the LLM used, its tokenizer must be adapted accordingly, as discussed in [15]. After multiplexing, the tokens are replaced with their corresponding corpus ids before being passed on to the model for inference. When the model produces the output, this process is reversed to obtain the final result. We introduce three separate dimensional multiplexing techniques, namely (i) digit-interleaving, (ii) value-interleaving, and (iii) value-concatenation.


Figure 1: The three token multiplexing techniques.

III-A1 Digit-Interleaving

After each dimension has been rescaled, the Digit-Interleaving (DI) multiplexing technique places the digits of each dimension per timestamp interchangeably. This is exemplified in Figure 1a. Consider a 2-dimensional time series with dimensions $d_1 = [1.7, 2.6, \dots]$ and $d_2 = [2.3, 3.1, \dots]$ (only the first two timestamps are shown for brevity). After rescaling, the dimensions become $d_1 = [17, 26, \dots]$ and $d_2 = [23, 31, \dots]$, respectively. Then, as described previously, each digit is considered a separate token. Before being assigned the corresponding corpus id, the tokens are placed interchangeably per dimension for each timestamp, reducing the number of dimensions to one. The resulting series in the example is $d = [1273, 2361, \dots]$. Then, each digit (token) and comma is assigned the corresponding id. This technique exploits the fact that, in many multivariate time series, all values are correlated and similarly scaled. One such example are z-normalized series, which have zero mean and values that differ from it by a few standard deviations. In such a case, the leftmost digits of all dimensions are placed first; since the model produces the output token-by-token, this can help it infer the correct scale of the series. More formally, DI multiplexing can be formulated as follows.

I_d = \{ t_{111} \dots t_{d11} \quad t_{11b} \dots t_{d1b} \} \;\; t_c \;\; \{ t_{1n1} \dots t_{dn1} \quad t_{1nb} \dots t_{dnb} \}   (1)

where $d$ is the number of dimensions, $b$ the predefined number of digits per timestamp, and $n$ the time series length.
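To make this concrete, the following minimal Python sketch reproduces the DI example above on already-rescaled integer inputs (the function name is illustrative and not part of MultiCast's released code):

# Digit-interleaving (DI): interleave the digits of all dimensions per timestamp.
def digit_interleave(dims, b=2):
    # dims: list of d equally long integer series, already rescaled to b digits each
    out = []
    for t in range(len(dims[0])):
        digits = [f"{dim[t]:0{b}d}" for dim in dims]  # b-digit string per dimension
        # take the 1st digit of every dimension, then the 2nd, and so on
        out.append("".join(digits[k][i] for i in range(b) for k in range(len(dims))))
    return ",".join(out)

# Example from the text: d1 = [17, 26], d2 = [23, 31]  ->  "1273,2361"
print(digit_interleave([[17, 26], [23, 31]]))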

III-A2 Value-Interleaving

Figure 1b shows the Value-Interleaving (VI) dimensional multiplexing technique. This time, instead of interleaving the digits per timestamp and dimension, we place the whole values of each dimension per timestamp one after the other. Thus, in the example, the 1-dimensional result is $d = [1723, 2631, \dots]$. Intuitively, this technique is more suitable in cases where the dimensions of the series are on different scales. We expect the model to be able to distinguish between the different dimensions (especially when they differ in scale) and to internally demultiplex the input before inference. VI multiplexing can be formulated as follows.

I_{ts} = \{ t_{111} \dots t_{11b} \quad t_{d11} \dots t_{d1b} \} \;\; t_c \;\; \{ t_{1n1} \dots t_{1nb} \quad t_{dn1} \dots t_{dnb} \}   (2)

where $d$ is the number of dimensions, $b$ the predefined number of digits per timestamp, and $n$ the time series length.
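Under the same assumptions as the DI sketch above, VI multiplexing can be illustrated as follows (illustrative code, not the paper's implementation):

# Value-interleaving (VI): place the whole b-digit value of each dimension per timestamp.
def value_interleave(dims, b=2):
    out = []
    for t in range(len(dims[0])):
        out.append("".join(f"{dim[t]:0{b}d}" for dim in dims))
    return ",".join(out)

# Example from the text: d1 = [17, 26], d2 = [23, 31]  ->  "1723,2631"
print(value_interleave([[17, 26], [23, 31]]))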

III-A3 Value-Concatenation

Finally, Figure 1c shows the Value-Concatenation (VC) dimensional multiplexing technique, which is an extension of the value-interleaving technique; for each timestamp, we now place the values of each dimension separated by commas, thus treating them as different values (e.g., in the figure, the 1-dimensional result is $d = [17, 23, 26, 31, \dots]$). We expect this to further facilitate the internal demultiplexing by the model before detecting any patterns. VC multiplexing can be formulated as follows.

I_{td} = \{ t_{111} \dots t_{11b} \} \; t_c \; \{ t_{d11} \dots t_{d1b} \} \; t_c \; \{ t_{1n1} \dots t_{1nb} \} \; t_c \; \{ t_{dn1} \dots t_{dnb} \}   (3)

where $d$ is the number of dimensions, $b$ the predefined number of digits per timestamp, and $n$ the length of the time series. In all cases, upon receiving the multiplexed output from the model, the tokens must be properly decoded, demultiplexed, and brought back to their initial scale for each dimension, depending on the selected technique. A significant advantage of multiplexing over forecasting each dimension separately is that multivariate time series tend to have high interdimensional correlations (e.g., temperature and humidity in weather data). We expect that providing all dimensions to the model together can lead to the detection of such interdimensional patterns, yielding better results.
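A corresponding sketch of VC multiplexing, under the same assumptions as above, treats each dimension's value as its own comma-separated entry:

# Value-concatenation (VC): every dimension's value becomes a separate comma-separated entry.
def value_concatenate(dims, b=2):
    out = []
    for t in range(len(dims[0])):
        out.extend(f"{dim[t]:0{b}d}" for dim in dims)
    return ",".join(out)

# Example from the text: d1 = [17, 26], d2 = [23, 31]  ->  "17,23,26,31"
print(value_concatenate([[17, 26], [23, 31]]))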

III-B Quantization Using SAX

The Symbolic Aggregate approXimation (SAX) is a multi-resolution representation of a time series introduced in [29]. It can be derived from the Piecewise Aggregate Approximation (PAA) [30, 31] by quantizing the PAA segments along the value axis. A time series is first transformed into a PAA representation of $w$ segments with real-valued coefficients. To obtain a SAX word for a time series, these coefficients are discretized along the value axis using breakpoints, assuming a $\mathcal{N}(0,1)$ Gaussian distribution, which enables the generation of equiprobable symbols for a given cardinality. Although bitwise representations were used for these symbols in the original paper, other encoding types are also possible; two popular alternatives are using alphabetical characters or digits for each symbol.

Forecasting time series is an inherently difficult task due to the nature of the data. This is also the case for zero-shot forecasting using LLMs, since, as also described in [15], they have to infer a sequence of tokens for each timestamp, thus simulating a multimodal distribution. This becomes even harder when applying the above-mentioned dimensional multiplexing techniques. Also, for long time series, such a process becomes significantly more computationally intensive; moreover, it requires many tokens, which, depending on the application, can be rather expensive to infer under current LLM pricing policies. To alleviate these issues, we quantize the time series across all dimensions along both axes using the SAX representation, before applying tokenization. We support two different quantization types, using either an alphabetical or a digital SAX alphabet. Each value per timestamp now consists of only one token instead of several. For example, the time series in Figure 1 could become $d_1^{sax} = [a, b, \dots]$ and $d_2^{sax} = [b, c, \dots]$ after alphabetical quantization. We expect that it will be easier for the model to detect patterns when dealing with only one token per timestamp.
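To make the quantization step concrete, the following sketch computes an alphabetical SAX word for a single dimension (a simplified rendering of [29]; the segment length and alphabet size below are example values, not MultiCast defaults):

import numpy as np
from scipy.stats import norm

def sax_word(series, seg_len=3, alphabet="abcde"):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                                     # z-normalize
    n_seg = len(x) // seg_len
    paa = x[:n_seg * seg_len].reshape(n_seg, seg_len).mean(axis=1)   # PAA mean per segment
    # equiprobable breakpoints under N(0, 1) for the chosen alphabet size
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

print(sax_word([1.7, 2.6, 2.9, 3.4, 3.1, 2.2]))  # prints "bd" for this toy series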

IV Experimental Evaluation

This section presents the results of our experiments. We first describe our experimental setup and then evaluate the proposed methods.

IV-A Experimental Setup

IV-A1 System

We used Python and the Hugging Face API (https://huggingface.co/). The experiments were run on the CPU of a server with an AMD Ryzen Threadripper 3960X 24-core processor and 256 GB of memory.

IV-A2 Datasets

We employ three real-world multivariate time series datasets.

Gas Rate: This is a 2-dimensional dataset from a gas furnace, obtained from the darts library (https://unit8co.github.io/darts). The first dimension (GasRate) contains the input gas rate measurements (ft3/min) into the furnace, and the second (CO2) contains the output CO2 concentration (%). The two dimensions are correlated, which makes this dataset ideal for multivariate forecasting.

Electricity: This multivariate time series is part of the Electricity Transformer Dataset (ETDataset, https://github.com/zhouhaoyi/ETDataset). It contains hourly measurements of various metrics, which we resampled on a 3-day basis, for a total of 242 timestamps. From this dataset, we extracted 3 dimensions of electricity measurements, specifically the High UseFul Load (HUFL), the High UseLess Load (HULL), and the Oil Temperature (OT). Again, the dimensions are correlated; in particular, OT is commonly used as a target variable in regression problems.

Weather: The weather dataset was generated by the Max Planck Institute (https://www.bgc-jena.mpg.de/wetter/) and contains 21 weather-related metrics obtained from a weather station located in Germany. From the 21 variables, we extracted the air temperature (Tlog), measured in degrees Celsius, the water vapor concentration (H2OC), measured in mmol/mol, the saturation water vapor pressure (VPmax), measured in mbar, and the potential temperature (Tpot), measured in Kelvin. Again, being weather-related, all dimensions are correlated.

TABLE I: Datasets.
Dataset Dimensions Length
Gas Rate 2 296
Electricity 3 242
Weather 4 217
TABLE II: Parameters.
Parameter Range
Dimensions 2, 3, 4
Number of samples 5, 10, 20
SAX segment length 3, 6, 9
SAX alphabet size 5, 10, 20

IV-A3 Competitors

We evaluate the following methods:

  • MultiCast (DI): MultiCast using the digit-interleaving dimensional multiplexing method.

  • MultiCast (VI): MultiCast using the value-interleaving dimensional multiplexing method (whole values placed one after the other per timestamp).

  • MultiCast (VC): MultiCast using the value-concatenation dimensional multiplexing method (values placed as separate comma-separated entries).

  • LLMTIME: The state-of-the-art in LLM-based zero-shot time series forecasting (i.e., applied in each dimension separately).

  • ARIMA: Autoregressive Integrated Moving Average (ARIMA) is one of the most widely used univariate time series forecasting methods.

  • LSTM [32]: Long Short-Term Memory (LSTM) networks are Recurrent Neural Networks (RNNs) designed to handle the vanishing gradient problem. This ability allows LSTMs to learn and remember information over long time spans, making them well suited for time series forecasting. LSTMs have been used successfully for multivariate time series forecasting [33, 34].

IV-A4 Parameters

The parameters utilized in our experimental assessment are listed in Table II. For each parameter, we performed tuning tests to establish their ranges and default values. More specifically, the dimensions parameter corresponds to the dimensionality of each dataset; the number of samples applies only to the LLM-based models and is the number of inference samples drawn for each timestamp; the SAX segment length is the level of quantization on the x-axis, which determines the level of compression of a time series; the SAX alphabet size is the level of quantization on the y-axis, as performed by the SAX method. Regarding the LSTM parametrization, we performed a grid search, which yielded a network with one hidden layer of 128 units and a dropout rate of 0.2. It was trained for 30 epochs using the Adam [35] optimizer with the Mean Squared Error (MSE) as the loss function.

IV-A5 Metrics

In accordance with standard practices in time series forecasting, the Root Mean Squared Error (RMSE) metric was employed to evaluate our methods. RMSE is formulated as $\sqrt{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / n}$, where $y_i$ is the actual value, $\hat{y}_i$ the predicted value at timestamp $i$, and $n$ the number of timestamps on which forecasting was applied.
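For completeness, the metric can be computed in a few lines of NumPy (illustrative only):

import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error over all forecasted timestamps
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))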

IV-B LLM Model Selection

MultiCast can be used with any LLM to perform multivariate time series forecasting. In the following, we evaluate its accuracy using LLaMA2 (the 7B-parameter variant) and Phi-2 [36] (2.7B parameters) as back-end models. LLaMA2 is one of the most popular LLMs and achieves good performance with relatively few parameters. Phi-2 is a math-oriented LLM, tailored to solving math problems. Table III lists the forecasting RMSE on both dimensions of the Gas Rate dataset for the two LLMs. In both cases, the VI variant of MultiCast was used. LLaMA2 achieves better performance (roughly half the error) in all cases. This can be attributed to the smaller size of the Phi-2 model; while it is math-oriented and quite capable of solving complex problems described in textual prompts, it does not seem to properly detect the patterns in the series, leading to larger errors.

TABLE III: LLM model comparison (RMSE per dimension of the Gas Rate dataset).
Model                       GasRate   CO2
MultiCast (LLaMA2 / 7B)     1.154     2.71
MultiCast (Phi-2 / 2.7B)    2.106     4.676

Figures 2a and 2b depict two indicative examples of forecasting the first dimension of the Gas Rate dataset using the LLaMA2 and Phi-2 models, respectively. Clearly, the LLaMA2 model performs better, being able to properly follow the upward trend of the time series and even infer two local maxima of the original series. Phi-2, on the other hand, fails to accurately forecast this dimension; while it seems to successfully detect the upward trend, its entire output is shifted 1 to 2 units on the y-axis. Since LLaMA2 performs significantly better in all cases, for the rest of the experimental evaluation we use LLaMA2 as the back-end model for MultiCast.

Figure 2: Comparison of the two models: (a) LLaMA2 GasRate prediction; (b) Phi-2 GasRate prediction.

IV-C Forecasting Accuracy

Next, we compare the forecasting accuracy, in terms of RMSE, of all MultiCast variants against the rest of the competitor approaches. Table IV lists the results for the Gas Rate dataset. To better understand the results and the differences in forecasting ability between the LLM-based models and the rest of the competition, we point out, for each dimension, the best and second-best overall performers. Interestingly, for the GasRate dimension, the best overall approach was LLMTIME (0.703), followed by MultiCast (DI) with 0.781. The LLM-based approaches all cope well with detecting the underlying patterns in this dimension, thus producing good results. The case is different for the second dimension (CO2), where the conventional methods yield better overall performance, with ARIMA being the best (2.63). MultiCast (VI) was the second-best overall and the best LLM-based performer (2.71).

TABLE IV: Forecasting RMSE for the Gas Rate dataset.
Model            GasRate   CO2
MultiCast (DI)   0.781     4.639
MultiCast (VI)   1.154     2.71
MultiCast (VC)   0.965     3.626
LLMTIME          0.703     2.75
ARIMA            0.92      2.63
LSTM             1.122     3.89
TABLE V: Forecasting RMSE for the Electricity dataset.
Model            HUFL    HULL    OT
MultiCast (DI)   5.914   1.444   9.198
MultiCast (VI)   8.63    1.882   13.752
MultiCast (VC)   2.424   1.913   10.230
LLMTIME          4.299   1.432   7.543
ARIMA            7.063   1.572   4.181
LSTM             4.892   1.43    8.740
TABLE VI: Forecasting RMSE for the Weather dataset.
Model            Tlog    H2OC    VPmax   Tpot
MultiCast (DI)   3.711   2.43    3.025   6.888
MultiCast (VI)   3.26    2.122   2.387   11.352
MultiCast (VC)   4.983   3.819   5.776   5.993
LLMTIME          3.14    1.746   4.044   6.981
ARIMA            3.324   2.686   4.331   6.067
LSTM             3.524   1.796   2.708   5.559

Figure 3 depicts two indicative forecast outputs of the best MultiCast approach (DI) for the first dimension of the Gas Rate dataset, together with the corresponding ARIMA result. Both yield good results here: MultiCast properly detects the continuous upward trend of the time series, although its output has larger variance than the original series; ARIMA, on the other hand, does not clearly follow the upward trend, but its variance is on par with that of the original series.

Figure 3: MultiCast (DI) versus ARIMA for the GasRate dimension of the Gas Rate dataset: (a) MultiCast; (b) ARIMA.

Table V lists the results for the Electricity dataset. For the HUFL dimension, MultiCast (VC) yields a significantly better RMSE than the rest of the approaches; however, the other MultiCast variants do not cope as well. For the HULL dimension, all approaches produce good results, with LLMTIME achieving the best RMSE among the LLM-based models. Finally, for the OT dimension, ARIMA performs significantly better than the competition, the MultiCast approaches do not perform well, and LLMTIME is the best among the LLM-based models. This suggests a possible drop in the performance of MultiCast as the dimensionality of the time series increases, since the LLM must additionally infer how to demultiplex the input. However, the error of the best MultiCast variant (9.198) is still close to that of the LSTM model (8.740).

Figure 4 illustrates an indicative example of the MultiCast (VC) forecast output (Figure 4a) against the LSTM (Figure 4b) for the HUFL dimension of the electricity data set. Clearly, MultiCast manages to correctly infer both the trend and variance of the time series. On the other hand, the LSTM seems to perform rather poorly, falsely yielding a non-existent linear upward trend.

Figure 4: MultiCast (VC) versus LSTM for the HUFL dimension of the Electricity dataset: (a) MultiCast; (b) LSTM.

The RMSE results for the Weather dataset are listed in Table VI. LLMTIME achieves the best performance in the Tlog dimension, though all approaches except MultiCast (VC) are close; this is also the case for the H2OC dimension. For the VPmax dimension, the best overall approach was MultiCast (VI), with MultiCast (VC) again performing worse than the rest. This is reversed in the Tpot dimension, where MultiCast (VC) yields the best performance among all LLM-based approaches, and LSTM is the best performer overall. Notice that the degradation in forecasting accuracy for higher dimensionality is not present in this case; the MultiCast variants are all either close to or better than the rest across all dimensions. Another key takeaway is that the optimal multiplexing method differs from dimension to dimension and from dataset to dataset; a comprehensive analysis of which dataset characteristics cause this behavior is an interesting direction for future work.

As in the previous cases, an indicative example of MultiCast against a conventional method is illustrated in Figure 5. Clearly, the VI variant of MultiCast (Figure 5a) yields better results than ARIMA (Figure 5b) here, accurately estimating the upward trend and the fluctuation at the end of the time series.

Figure 5: MultiCast (VI) versus ARIMA for the Tlog dimension of the Weather dataset: (a) MultiCast; (b) ARIMA.

Overall, we notice a trade-off when using MultiCast for multivariate time series forecasting, as opposed to LLMTIME. Forecasting each dimension separately using LLMTIME completely ignores the interdimensional correlations, which is undesirable in such scenarios. On the other hand, MultiCast poses an additional challenge to the LLM, which now also has to infer the demultiplexing of the dimensions. Either choice can hinder the accuracy of the obtained result. Given the emergent abilities of larger models, we argue that using very large LLMs (e.g., GPT-4, Gemini) would further improve MultiCast’s performance.

IV-D Increasing Number of Samples

Table VII lists the accuracy, in terms of RMSE, of all LLM-based models for an increasing number of samples. As a reminder, all LLM-based models draw several samples of the values at each timestamp, and the final estimated value is derived by computing the median among all samples, as sketched below. LLMTIME produces better results for 5 and 10 samples; interestingly, its error worsens for 20 samples. This could be because the inherent variance of the produced series tends to be averaged out as more samples are drawn. This is not the case for MultiCast: all three variants produce better results with more samples, and the DI variant achieves the best performance for 20 samples. A drawback of drawing many samples is the deterioration in execution time (execution times are listed alongside each RMSE in Table VII). Notice that, in all cases, the execution time doubles when the number of samples is doubled, which is expected since the model must infer twice as many tokens. Interestingly, LLMTIME requires slightly less total time (i.e., the sum of the time needed per dimension) than its MultiCast counterparts, since the latter also need to infer the multiplexing/demultiplexing of the tokens.
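The aggregation can be sketched as follows; sample_forecast below stands in for whichever LLM decoding call produces one candidate forecast sequence and is not a real API of any library used here:

import numpy as np

def median_forecast(sample_forecast, n_samples=10):
    # draw n_samples candidate forecasts, each a sequence of horizon values
    samples = np.stack([np.asarray(sample_forecast(), dtype=float) for _ in range(n_samples)])
    return np.median(samples, axis=0)  # per-timestamp median as the point forecast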

TABLE VII: Performance (RMSE / execution time) for an increasing number of samples.
Method           5 samples          10 samples         20 samples
MultiCast (DI)   0.781 / 1036 sec   0.762 / 2050 sec   0.592 / 4159 sec
MultiCast (VI)   0.965 / 1041 sec   1.302 / 2068 sec   0.877 / 4131 sec
MultiCast (VC)   1.154 / 1168 sec   0.704 / 2468 sec   0.63 / 4981 sec
LLMTIME          0.703 / 1023 sec   0.606 / 1939 sec   0.842 / 3684 sec

IV-E Performance Boost Using SAX

Figure 6: Forecasting for various SAX segment lengths (CO2 %): (a) 3, (b) 6, and (c) 9 SAX segments.
Figure 7: Forecasting for different SAX alphabet sizes (CO2 %): (a) 5, (b) 10, and (c) 20 SAX symbols.
Figure 8: Forecasting using digits instead of letters as symbols.

Next, we show the results obtained when quantization is applied using the SAX method, as described in Section III-B. Specifically, we evaluate the effects of increasing the SAX segment length and the alphabet size on the performance of zero-shot time series forecasting using MultiCast.

IV-E1 Increasing SAX Segment length

Table VIII lists the results for an increasing SAX segment length for the CO2 (%) dimension of the Gas Rate dataset, in terms of RMSE and execution time. We also list results for the two SAX encodings, using either alphabetical characters or digits for the SAX words. Compressing the time series significantly facilitates the inference process, since the model now has to generate only one symbol per timestamp instead of three or more. This is reflected in the execution times shown in Table VIII: inference drops from 1168 seconds when no quantization is applied to 52 seconds in the best case (using 9 SAX segments), more than an order of magnitude faster. This large difference can have a big impact on forecasting tasks run on CPU, which may often be the case in scenarios where access to a GPU with enough memory to fit an LLM is not possible.

TABLE VIII: Increasing SAX segment length (RMSE / execution time).
Method                         3 segments        6 segments       9 segments
MultiCast SAX (alphabetical)   1.089 / 148 sec   0.983 / 77 sec   0.888 / 54 sec
MultiCast SAX (digital)        0.992 / 156 sec   0.99 / 71 sec    0.912 / 52 sec
MultiCast (no SAX)             0.781 / 1168 sec

As expected, quantizing the time series leads to a loss of information. Again, this is reflected in the RMSE scores for the SAX approaches in Table VIII, which are worse than when no quantization is applied. However, this may not always be the case; having to infer only one symbol per timestamp is easier for the LLM, and patterns, if they exist, will be easier to detect and guess. The higher RMSE scores in these cases can be attributed to the quantization that SAX applies on both axes. However, the final result, when plotted, can properly follow the initial time series. This effect is illustrated in Figures 6b and 6c for the CO2 (%) dimension of the Gas Rate dataset. On the other hand, in this case, MultiCast using 3 SAX segments managed to detect the initial upward trend (Figure 6a), but the result worsened afterwards.

Figure 8 shows an indicative example of the prediction result when applying SAX quantization using digits to encode symbols, for the CO2 (%) dimension of the Gas Rate dataset. It is easily noticeable that the resulting line (red in the figure) closely follows the initial time series in this dimension. This could suggest that it may be easier for LLMs to detect patterns in time series represented by numbers rather than alphabetical characters.

IV-E2 Increasing SAX Alphabet Size

In the following, we evaluate the performance of MultiCast, in terms of RMSE and execution time, when increasing the size of the SAX alphabet. Table IX lists the results. We should note that for digits we can only go up to an alphabet of size 10. Again, the non-quantized MultiCast yields better performance in terms of RMSE, but is significantly slower. Increasing the size of the alphabet does not seem to affect the execution time. Also, in terms of RMSE, larger alphabet sizes produced higher errors, possibly due to the increase in complexity that the use of more symbols induces.

TABLE IX: Increasing SAX alphabet size (RMSE / execution time).
Method                         5 symbols        10 symbols       20 symbols
MultiCast SAX (alphabetical)   0.983 / 77 sec   1.198 / 81 sec   1.273 / 83 sec
MultiCast SAX (digital)        0.99 / 71 sec    1.21 / 75 sec    N/A
MultiCast (no SAX)             0.781 / 1168 sec

Finally, Figure 7 shows an indicative forecasting example for 5, 10, and 20 SAX symbols for the CO2 (%) dimension of the Gas Rate dataset. The degradation in RMSE is also reflected here: only in the case of five symbols does the forecasted time series follow the trend of the original.

V Conclusions

In this paper, we presented MultiCast, an approach that leverages LLMs for zero-shot multivariate time series forecasting. To make LLMs work with multiple dimensions, we proposed three token multiplexing solutions that reduce the dimensionality of the time series to one. This makes the time series compatible with the input of an LLM, while preserving the model’s ability to detect repetitive patterns. Additionally, we presented a quantization solution that aims to facilitate the learning of existing patterns in the series by LLMs and significantly reduces the execution time. In our experimental analysis on three real-world datasets, we found that the use of LLMs for zero-shot multivariate time series forecasting shows promise and offers a significant advantage compared to similar methods in the literature: no expert knowledge or time- and resource-consuming parameter search is required. In the future, we plan to extend our research to employing LLMs for zero-shot solutions on other time series-related tasks, such as imputation, anomaly detection, and change point detection, as well as to evaluate MultiCast’s inference performance using more LLMs as back-end models and on datasets with more dimensions.

Acknowledgment

This work was partially funded by the EU Horizon Europe projects STELAR (101070122) and DT4GS (101056799).

References

  • [1] R. H. Shumway, D. S. Stoffer, and D. S. Stoffer, Time series analysis and its applications.   Springer, 2000, vol. 3.
  • [2] A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 11 121–11 128, Jun. 2023.
  • [3] Z. Liu, Z. Zhu, J. Gao, and C. Xu, “Forecast methods for time series data: A survey,” IEEE Access, vol. 9, pp. 91 896–91 912, 2021.
  • [4] J. G. De Gooijer and R. J. Hyndman, “25 years of time series forecasting,” International Journal of Forecasting, vol. 22, no. 3, pp. 443–473, 2006.
  • [5] J. Kuvulmaz, S. Usanmaz, and S. N. Engin, “Time-series forecasting by means of linear and nonlinear models,” in MICAI 2005: Advances in Artificial Intelligence, A. Gelbukh, Á. de Albornoz, and H. Terashima-Marín, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 504–513.
  • [6] C. Cheng, A. Sa-Ngasoongsong, O. Beyca, T. Le, H. Yang, Z. Kong, and S. Bukkapatnam, “Time series forecasting for nonlinear and non-stationary processes: a review and comparative study,” IIE Transactions, vol. 47, no. 10, pp. 1053–1071, 2015.
  • [7] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control.   John Wiley & Sons, 2015.
  • [8] K. Benidis et al., “Deep learning for time series forecasting: Tutorial and literature survey,” ACM Comput. Surv., vol. 55, no. 6, Dec. 2022. [Online]. Available: https://doi.org/10.1145/3533382
  • [9] J. F. Torres, D. Hadjout, A. Sebaa, F. Martínez-Álvarez, and A. Troncoso, “Deep learning for time series forecasting: A survey,” Big Data, vol. 9, no. 1, pp. 3–21, 2021.
  • [10] A. Mahmoud and A. Mohammed, A Survey on Deep Learning for Time-Series Forecasting.   Springer International Publishing, 2021, pp. 365–392.
  • [11] S. Du, T. Li, Y. Yang, and S.-J. Horng, “Multivariate time series forecasting via attention-based encoder–decoder framework,” Neurocomputing, vol. 388, pp. 269–279, 2020.
  • [12] J. Jiang, Y. Shu, J. Wang, and M. Long, “Transferability in deep learning: A survey,” 2022.
  • [13] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  • [14] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., “Emergent abilities of large language models,” arXiv preprint arXiv:2206.07682, 2022.
  • [15] N. Gruver, M. Finzi, S. Qiu, and A. G. Wilson, “Large language models are zero-shot time series forecasters,” in NeurIPS, 2023.
  • [16] C. Chang, W.-C. Peng, and T.-F. Chen, “Llm4ts: Two-stage fine-tuning for time-series forecasting with pre-trained llms,” arXiv preprint arXiv:2308.08469, 2023.
  • [17] H. Xue and F. D. Salim, “Promptcast: A new prompt-based learning paradigm for time series forecasting,” IEEE Transactions on Knowledge and Data Engineering, 2023.
  • [18] P. Schmiedmayer, A. Rao, P. Zagar, V. Ravi, A. Zahedivash, A. Fereydooni, and O. Aalami, “Llm on fhir–demystifying health records,” arXiv preprint arXiv:2402.01711, 2024.
  • [19] A. J. Nashwan, A. A. AbuJaber, and A. AbuJaber, “Harnessing the power of large language models (llms) for electronic health records (ehrs) optimization,” Cureus, vol. 15, no. 7, 2023.
  • [20] M. Wornow, A. Lozano, D. Dash, J. Jindal, K. W. Mahaffey, and N. H. Shah, “Zero-shot clinical trial patient matching with llms,” arXiv preprint arXiv:2402.05125, 2024.
  • [21] Y. Li, S. Wang, H. Ding, and H. Chen, “Large language models in finance: A survey,” in Proceedings of the Fourth ACM International Conference on AI in Finance, 2023, pp. 374–382.
  • [22] H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open-source financial large language models,” arXiv preprint arXiv:2306.06031, 2023.
  • [23] H. Zhao, Z. Liu, Z. Wu, Y. Li, T. Yang, P. Shu, S. Xu, H. Dai, L. Zhao, G. Mai et al., “Revolutionizing finance with llms: An overview of applications and insights,” arXiv preprint arXiv:2401.11641, 2024.
  • [24] M. Hosseini, C. A. Gao, D. M. Liebovitz, A. M. Carvalho, F. S. Ahmad, Y. Luo, N. MacDonald, K. L. Holmes, and A. Kho, “An exploratory survey about using chatgpt in education, healthcare, and research,” medRxiv, pp. 2023–03, 2023.
  • [25] S. Moore, R. Tong, A. Singh, Z. Liu, X. Hu, Y. Lu, J. Liang, C. Cao, H. Khosravi, P. Denny et al., “Empowering education with llms-the next-gen interface and content generation,” in International Conference on Artificial Intelligence in Education.   Springer, 2023, pp. 32–37.
  • [26] Y. Jiang, Z. Pan, X. Zhang, S. Garg, A. Schneider, Y. Nevmyvaka, and D. Song, “Empowering time series analysis with large language models: A survey,” 2024.
  • [27] X. Zhang, R. R. Chowdhury, R. K. Gupta, and J. Shang, “Large language models for time series: A survey,” 2024.
  • [28] M. Jin, S. Wang, L. Ma, Z. Chu, J. Y. Zhang, X. Shi, P.-Y. Chen, Y. Liang, Y.-F. Li, S. Pan, and Q. Wen, “Time-LLM: Time series forecasting by reprogramming large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=Unb5CVPtae
  • [29] J. Shieh and E. J. Keogh, “iSAX: indexing and mining terabyte sized time series,” in SIGKDD, 2008, pp. 623–631.
  • [30] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra, “Dimensionality reduction for fast similarity search in large time series databases,” Knowl. Inf. Syst., vol. 3, no. 3, pp. 263–286, 2001.
  • [31] B. Yi and C. Faloutsos, “Fast time sequence indexing for arbitrary Lp norms,” in VLDB, 2000, pp. 385–394.
  • [32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, pp. 1735–80, 12 1997.
  • [33] S. Alhirmizy and B. Qader, “Multivariate time series forecasting with lstm for madrid, spain pollution,” in 2019 international conference on computing and information science and technology and their applications (ICCISTA).   IEEE, 2019, pp. 1–5.
  • [34] J. Ju and F.-A. Liu, “Multivariate time series data prediction based on att-lstm network,” Applied sciences, vol. 11, no. 20, p. 9373, 2021.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [36] M. Javaheripi, S. Bubeck, M. Abdin, J. Aneja, S. Bubeck, C. C. T. Mendes, W. Chen, A. Del Giorno, R. Eldan, S. Gopi et al., “Phi-2: The surprising power of small language models,” 2023.