As someone who's worked in time series forecasting for a while, I haven't yet found a use case for these "time series" focused deep learning models.
On extremely high dimensional data (I worked at a credit card processor company doing fraud modeling), deep learning dominates, but there's simply no advantage in using a designated "time series" model that treats time differently than any other feature. We've tried most time series deep learning models that claim to be SoTA (N-BEATS, N-HiTS, every RNN variant that was popular pre-transformers), and they don't beat an MLP that just uses lagged values as features. I've talked to several others in the forecasting space and they've found the same result.
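For concreteness, here's a minimal sketch of what "an MLP that just uses lagged values as features" can look like. The synthetic series, lag count, and architecture are all illustrative, not anything from the comment above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lagged(series, n_lags):
    """Turn a 1-D series into (X, y): each row of X holds the n_lags
    previous values, y is the next value. Time becomes ordinary columns."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

# Synthetic series with trend + seasonality, just for illustration.
rng = np.random.default_rng(0)
t = np.arange(500)
series = 0.01 * t + np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(500)

X, y = make_lagged(series, n_lags=24)
X_train, X_test = X[:-50], X[-50:]   # hold out the last 50 steps
y_train, y_test = y[:-50], y[-50:]

# A plain MLP over the lag window -- no recurrence, no attention.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

The point is that the lag window turns forecasting into an ordinary tabular regression problem, which is why generic models compete so well here.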
On mid-dimensional data, LightGBM/XGBoost is by far the best and generally performs as well as or better than any deep learning model, while requiring much less fine-tuning and a tiny fraction of the computation time.
And on low-dimensional data, (V)ARIMA/ETS/Factor models are still king, since without adequate data, the model needs to be structured with human intuition.
As a result I'm extremely skeptical of any of these claims about a generally high performing "time series" model. Training on time series gives a model very limited understanding of the fundamental structure of how the world works, unlike a language model, so the amount of generalization ability a model will gain is very limited.
Great write-up, thank you. Do you have rough measures for what constitutes high/mid/low-dimensional data? And how do you use XGBoost et al. for multi-step forecasting, i.e. in scenarios where you want to predict multiple time steps into the future?
The added benefit is that you optimize each regressor towards its own target timestep t+1 ... t+n. A single loss on the aggregate of all timesteps is often problematic.
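A sketch of that "direct" multi-step strategy, with one regressor per horizon, each trained on its own target. I'm using sklearn's GradientBoostingRegressor as a stand-in for LightGBM/XGBoost; the helper names and the sine-wave data are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def direct_multistep_fit(series, n_lags, horizons):
    """Fit one regressor per horizon h. Each model sees the same lag
    windows X but its own target (the value h steps ahead), so each
    optimizes its own loss rather than an aggregate over all steps."""
    max_h = max(horizons)
    X = np.array([series[i:i + n_lags]
                  for i in range(len(series) - n_lags - max_h + 1)])
    models = {}
    for h in horizons:
        y = series[n_lags + h - 1 : len(series) - max_h + h]
        models[h] = GradientBoostingRegressor(random_state=0).fit(X, y)
    return models

def direct_multistep_predict(models, last_window):
    """Forecast each horizon independently from the latest lag window."""
    return {h: m.predict(last_window[None, :])[0] for h, m in models.items()}

series = np.sin(np.arange(300) * 0.1)
models = direct_multistep_fit(series, n_lags=12, horizons=[1, 2, 3])
forecast = direct_multistep_predict(models, series[-12:])
```

Unlike recursive forecasting, errors don't compound across steps here, at the cost of training n separate models.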
I've found that it works well to add the prediction horizon as a numerical feature (e.g. # of days), and then replicate each row for many such horizons, while ensuring that all such rows go to the same training fold.
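Here's one way that row-replication scheme could look. This is a sketch under my own assumptions (function name, toy data, and the use of sklearn's GradientBoostingRegressor are all illustrative); the group ids are what you'd feed to something like GroupKFold to keep replicated rows in the same fold:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_horizon_rows(series, n_lags, horizons):
    """Replicate each lag-window row once per horizon, appending the
    horizon as a numeric feature. The target is the value h steps ahead.
    Rows sharing a window get the same group id so they can be kept in
    one CV fold (avoiding leakage across folds)."""
    rows, targets, groups = [], [], []
    max_h = max(horizons)
    for i in range(len(series) - n_lags - max_h + 1):
        window = series[i:i + n_lags]
        for h in horizons:
            rows.append(np.append(window, h))            # horizon as a feature
            targets.append(series[i + n_lags + h - 1])   # value h steps ahead
            groups.append(i)                             # fold-grouping key
    return np.array(rows), np.array(targets), np.array(groups)

series = np.sin(np.arange(300) * 0.1)
X, y, groups = build_horizon_rows(series, n_lags=12, horizons=[1, 7, 14])
model = GradientBoostingRegressor(random_state=0).fit(X, y)
```

The appeal is that a single model shares statistical strength across horizons, instead of training one regressor per step.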
Thanks for this write up. Your comment clears up a lot of the confusion I've had around these time series transformers.
How does lagged features for an MLP compare to longer sequence lengths for attention in Transformers? Are you able to lag 128 time steps in a feed forward network and get good results?
I agree that conventional (numeric) forecasting can hardly benefit from the newest approaches like transformers and LLMs. I came to this conclusion while working on the intelligent trading bot [0], experimenting with many ML algorithms. Yet there are some cases where transformers might provide significant advantages. They could be useful where (numeric) forecasting is augmented with discrete event analysis and where sequences of events are important. Another use case is where certain patterns matter, like those detected in technical analysis. For these cases, though, much more data is needed.
Foundational models can work where, so far, "needs human intuition" was the state of things. I can picture a time series model with a large enough training corpus being able to deal quite well with the typical quirks of seasonality, shocks, outliers, etc.
I fully agree regarding how things have been so far, but I’m excited to see practitioners try out models such as the one presented here — it might just work.
Reminds me a bit of how in psychology you have ANOVA, MANOVA, ANCOVA, MANCOVA, etc., but really, in the end, we are just running regressions: variables are just variables.
My read on this was that you can just dump the lagged values in as inputs and let the network figure it out just as well as the time-series-specific models do, not that time doesn't matter.
I assume the time series modelling is used to predict normal non-fraud behaviour. And then simpler algorithms are able to highlight deviations from the norm?