Interesting, but I'm very skeptical. There have been over a dozen transformer-based foundation time series models released in the past year, and without fail, every one of them claims to be at or near SOTA. For example:

- Time-LLM (https://arxiv.org/abs/2310.01728)
- Lag-Llama (https://arxiv.org/abs/2310.08278)
- UniTime (https://arxiv.org/abs/2310.09751)
- TEMPO (https://arxiv.org/abs/2310.04948)
- TimeGPT (https://arxiv.org/abs/2310.03589)
- TimesFM (https://arxiv.org/html/2310.10688v2)
- GPT4TS (https://arxiv.org/pdf/2308.08469.pdf)
Yet not a SINGLE transformer-based model I've managed to successfully run has beaten gradient boosted tree models on my use case (economic forecasting). To be honest, I believe these foundation models are all vastly overfit. There are basically only two benchmark sets that are ever used in time series (the Monash set and the M-competition set), so it'd be easy to overtune a model just to perform well on those.
I would love to see someone make a broader set of varied benchmarks and have an independent third party do these evaluations like with LLM leaderboards. Otherwise I assume all published benchmarks are 100% meaningless and gamed.
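For what it's worth, the gradient-boosted baseline I keep comparing against is nothing exotic -- roughly this kind of lagged-feature setup (synthetic data, purely illustrative):

```python
# Toy GBT forecasting baseline: lagged values + a crude calendar feature
# as tabular inputs. Everything here is synthetic and illustrative.
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.1, 1.0, 240))  # toy "economic" series, 20 years monthly

def make_features(series, n_lags=12):
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        lags = series[t - n_lags:t]
        month = t % 12  # crude stand-in for a calendar feature
        rows.append(np.concatenate([lags, [month]]))
        targets.append(series[t])
    return np.array(rows), np.array(targets)

X, target = make_features(y)
model = LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X[:-12], target[:-12])              # hold out the last year
preds = model.predict(X[-12:])                # one-step-ahead forecasts
print(np.mean(np.abs(preds - target[-12:])))  # MAE on the holdout
```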
Pretty much any real-world time series prediction task is going to involve more data than just the time series itself, and some of this data will probably be tabular, so it's no surprise gradient boosted trees perform better.
Neural nets are known to struggle with tabular data. Have you tried fine-tuning, or attaching a decoder somewhere that you train on your task? Zero-shot inference might be asking for too much.
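Something like this sketch is what I have in mind -- the encoder class below is a dummy placeholder for whatever foundation model you're testing, and the embedding shape is assumed:

```python
# Minimal sketch: freeze a pretrained encoder, train only a small head.
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    # Placeholder -- swap in the real pretrained model here.
    def forward(self, x):                 # x: (batch, lookback)
        return x.mean(-1, keepdim=True).expand(-1, 512)  # fake (batch, 512) embedding

class ForecastHead(nn.Module):
    def __init__(self, encoder, d_model=512, horizon=12):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False       # keep the foundation model frozen
        self.head = nn.Linear(d_model, horizon)  # the only part you train

    def forward(self, x):
        with torch.no_grad():
            h = self.encoder(x)           # assumes (batch, d_model) embeddings
        return self.head(h)

model = ForecastHead(DummyEncoder())
out = model(torch.randn(8, 96))           # (8, 12) forecasts
```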
"Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well."
Very cool that the dataset and model weights are open right away! This paper also doesn't have a bunch of weird architectural choices pulled out of nowhere like the other TS foundation models recently. Looks like it will actually be useful, thank you! Maybe I will actually get to do representation learning for TS during my PhD.
As a sidenote/rant, it would be nice if all supervised TS benchmarks included "DLinear + RevIN" as the standard baseline, as in my experiments it tends to get about the same performance as all the new SOTA forecasting models. Most papers compare to the linear model without RevIN while they themselves use it, and only beat it because of that :) And in any case, supervised training of transformers from scratch on datasets with fewer than 1M points is just stupid (that's less raw data than a single image). Less than 1B is still at least mildly stupid.
Here, of course, the angle is zero-shot, so it's somewhat excused from this, but it would still be interesting to see whether it can beat that supervised model combination.
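For reference, the whole baseline fits in a few lines. This is my own minimal reading of the DLinear and RevIN papers, not anyone's official code:

```python
# DLinear + RevIN in ~15 lines: instance-normalize, split into
# moving-average trend + seasonal residual, one linear layer each, denormalize.
import torch
import torch.nn as nn

class DLinearRevIN(nn.Module):
    def __init__(self, lookback, horizon, kernel=25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                count_include_pad=False)
        self.trend = nn.Linear(lookback, horizon)
        self.season = nn.Linear(lookback, horizon)

    def forward(self, x):                          # x: (batch, lookback)
        mu = x.mean(-1, keepdim=True)
        sigma = x.std(-1, keepdim=True) + 1e-5
        x = (x - mu) / sigma                       # RevIN: normalize per instance
        trend = self.avg(x.unsqueeze(1)).squeeze(1)  # moving-average trend
        season = x - trend                         # residual = seasonal part
        out = self.trend(trend) + self.season(season)
        return out * sigma + mu                    # RevIN: denormalize

model = DLinearRevIN(lookback=96, horizon=12)
print(model(torch.randn(8, 96)).shape)             # (8, 12)
```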
I'm curious where universal forecasting models are most useful. It is technically fascinating but forecasting specifically seems like a domain where you'd want interpretable modeling - you use it for big-value problems and it significantly affects your action/policy. So, the tradeoff between performance and model simplicity should lean towards the latter?
Same for my shop - we manage a large pool of cost driven by partially forecastable factors; we've repeatedly rejected methods purely on explainability grounds. Our accountability requirements do not allow us to point the finger at an LLM if we get it wrong.
Choosing beggars and all that, but the LOTSA dataset could really benefit from a README on HuggingFace. Even just a citation back to the original paper would be good.
Makridakis and Hibon reached the sad conclusion that "statistically sophisticated and complex methods do not necessarily provide more accurate forecasts than simpler ones."
That was true in the first Makridakis competition ("M1") in 1982, and possibly until M4 in 2018, but both M5 and M6 were won by what would generally be considered relatively sophisticated methods (e.g. LightGBM).
The Wikipedia article doesn't have that much detail on M5 or M6, but the M5 papers are in the International Journal of Forecasting [1] and M6 should be published later this year (there's already a preprint on arXiv [2]).
I recently spent some time looking into the history and results of the M competitions and had a chance to speak to Professor Makridakis about them, as well as the winners of each of the M6 competition tracks [3]. While the methods have become more sophisticated, some conclusions from M1 still seem to hold: in particular, that there is no overall "best" method, and that the winning method tends to be different for different types of data, time horizons, and evaluation metrics.
A recent thread on Amazon’s new Chronos forecasting model showed that an ensemble of simple models outperformed it (a highly parameterized transformer model) on the M-competition datasets.
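That ensemble was roughly in the spirit of this sketch -- naive, seasonal naive, and drift, averaged together. This is my own toy version, not the actual ensemble from that thread:

```python
# Ensemble of three classic simple forecasters, plain average.
import numpy as np

def simple_ensemble(y, horizon, season=12):
    naive = np.repeat(y[-1], horizon)                    # last value carried forward
    seasonal = np.tile(y[-season:], horizon // season + 1)[:horizon]  # repeat last season
    drift = y[-1] + (y[-1] - y[0]) / (len(y) - 1) * np.arange(1, horizon + 1)
    return (naive + seasonal + drift) / 3

y = np.sin(np.arange(120) * 2 * np.pi / 12) + np.arange(120) * 0.05
print(simple_ensemble(y, horizon=12))
```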
Looks super interesting. Definitely going to play with this, though it took me way too long to figure out what Salesforce AI Research was... maybe that's a sign I should log off for the day.
References this paper on Time Series transformers. First I’ve seen someone apply transformers to time series specifically. Very curious how well this might work for low-frequency events.
What does 'any-variate' forecasting mean? Can you use this pre-trained model to produce forecasts when there exist useful covariates/features/predictors?
Is this something the other TS foundation models can/cannot do?
When we deal with many different multivariate time series, each time series can have a different number of variates. So "any-variate" means that the model is able to take as input multivariate time series with an arbitrary number of variates, and model the interactions with the Transformer's attention mechanism. This is something that many other TS foundation models do not consider yet - they convert all multivariate time series into multiple univariate time series.
Whether or not the forecasts improve as a result of the additional covariates is still an open question which needs to be studied more -- we need to build better evaluations and benchmarks for this.
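A rough illustration of what that flattening looks like (schematic, my own sketch -- not the actual model code): variates are concatenated into one token sequence, and each token carries a (time index, variate index) pair so attention can model cross-variate interactions regardless of how many variates a series has.

```python
# Flatten a (num_variates, num_timesteps) series into one token sequence
# annotated with time and variate IDs -- works for any number of variates.
import numpy as np

def flatten_any_variate(series):             # series: (num_variates, num_timesteps)
    v, t = series.shape
    values = series.reshape(-1)               # one long token sequence, length v * t
    time_id = np.tile(np.arange(t), v)        # which timestep each token came from
    variate_id = np.repeat(np.arange(v), t)   # which variate each token came from
    return values, time_id, variate_id

vals, tid, vid = flatten_any_variate(np.random.randn(3, 5))
print(vals.shape, tid[:7], vid[:7])           # (15,) [0 1 2 3 4 0 1] [0 0 0 0 0 1 1]
```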
This looks very interesting! I'm trying to understand if the flattening technique might work for my ts.
It's structured as follows:
At each time step t, I have an m by n data matrix. The value of m (rows) varies per time step; n stays constant and represents the features. And I want to predict one of the n values.
(In this case, t represents a single day, m (rows) represent the people that entered a store on that day, and n (cols) represent various features of the people. I want to predict one of those features, given the others.)
The fact that it's a time series matters, because I expect the relationship to change over time. For instance, some feature n[x] (person wears a yellow shirt) might be correlated with my target feature n[y] (person steals), but only in the summer.
Would it be possible to flatten this too? What would that look like?
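To make my structure concrete, here's roughly the data I have, plus the only flattening I could think of myself (per-day feature summaries -- no idea if that's the right approach for this model):

```python
# Each day t is a ragged (m_t, n) matrix: m_t people, n fixed features.
# One naive flattening: summarize each day into a fixed-width vector.
import numpy as np

rng = np.random.default_rng(0)
n = 4                                                     # features per person
days = [rng.normal(size=(rng.integers(5, 50), n))         # m varies per day
        for _ in range(30)]

flat = np.stack([day.mean(axis=0) for day in days])        # (30 days, n features)
print(flat.shape)
```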
Understood, thank you. There are certainly applications in demand sensing/demand forecasting where things like recent order information, recent sales, CRM inputs are quite predictive of near-term outcomes, but become useless for longer horizon forecasts.
In my experience, when information like this is available, no time-series technique that is unable to leverage this information would beat even simple regressions for short term horizon forecasts.
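Concretely, something as simple as this tends to be hard to beat at short horizons when such covariates are available (toy numbers, obviously -- the variable names are just stand-ins):

```python
# Short-horizon regression: recent orders as a covariate plus a lagged target.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
orders = rng.poisson(100, 200).astype(float)       # stand-in for recent order info
sales = 0.8 * orders + rng.normal(0, 5, 200)       # next-period sales to predict

X = np.column_stack([orders[:-1], sales[:-1]])     # covariate + lagged target
y = sales[1:]
model = LinearRegression().fit(X[:-20], y[:-20])
print(model.score(X[-20:], y[-20:]))               # held-out R^2
```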
One detail I don’t really understand is the low-variance normal component of the target mixture. Would be curious to see from the weights how often that was used.
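My rough mental model of that component, in case it helps (a guess at the intent, not the paper's actual parameterization): one mixture component with tiny variance lets the model put sharp mass on near-deterministic values.

```python
# Two-component normal mixture where the second component is low-variance.
import torch
from torch.distributions import Normal, Categorical, MixtureSameFamily

weights = Categorical(probs=torch.tensor([0.9, 0.1]))       # mixture weights
components = Normal(loc=torch.tensor([0.0, 0.0]),
                    scale=torch.tensor([1.0, 1e-3]))        # 2nd = low-variance
mix = MixtureSameFamily(weights, components)
print(mix.log_prob(torch.tensor(0.0)))   # sharp component dominates near 0
```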