Hacker News
Moirai: A time series foundation model for universal forecasting (salesforceairesearch.com)
196 points by throwaway888abc 33 days ago | 37 comments



Interesting, but I'm very skeptical. Over a dozen transformer-based foundation time series models have been released in the past year, and without fail, every one of them claims to be at or near SOTA. For example:

- Time-LLM (https://arxiv.org/abs/2310.01728)

- Lag-Llama (https://arxiv.org/abs/2310.08278)

- UniTime (https://arxiv.org/abs/2310.09751)

- TEMPO (https://arxiv.org/abs/2310.04948)

- TimeGPT (https://arxiv.org/abs/2310.03589)

- TimesFM (https://arxiv.org/html/2310.10688v2)

- GPT4TS (https://arxiv.org/pdf/2308.08469.pdf)

Yet not a SINGLE transformer-based model I've managed to successfully run has beaten gradient boosted tree models on my use case (economic forecasting). To be honest, I believe these foundation models are all vastly overfit. There are basically only two benchmark sets ever used in time series (the Monash set and the M-competition set), so it would be easy to overtune a model just to perform well on those.
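For what it's worth, the usual reason trees do well here is that the forecasting problem gets recast as tabular supervised learning, where extra exogenous columns slot in for free. A minimal sketch of that feature construction (function name and window sizes are mine, purely illustrative):

```python
# Illustrative sketch: turn a univariate series into a tabular
# supervised-learning problem, the standard setup for gradient
# boosted trees.

def make_lag_features(series, n_lags=3, horizon=1):
    """Build (features, target) rows: each row holds the last n_lags
    values, and the target is the value `horizon` steps ahead."""
    rows = []
    for t in range(n_lags, len(series) - horizon + 1):
        features = series[t - n_lags:t]   # lagged values (extend with exogenous columns here)
        target = series[t + horizon - 1]  # value to predict
        rows.append((features, target))
    return rows

rows = make_lag_features([1, 2, 3, 4, 5, 6], n_lags=3, horizon=1)
# first row: features [1, 2, 3], target 4
```

Once the problem is in this shape, any tabular covariate (calendar flags, macro indicators, etc.) is just another column, which a GBT handles natively.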

I would love to see someone make a broader set of varied benchmarks and have an independent third party do these evaluations like with LLM leaderboards. Otherwise I assume all published benchmarks are 100% meaningless and gamed.


Why would you expect anything to work well for economic forecasting :p


Jamie, pull up the article that proves none of the published models work well for economic forecasting


There is always Gary Stevenson's economics model. Works without fail.


I'm so sad. This hilarious comment is languishing in the doldrums.


Not reddit.


Pretty much any real-world time series prediction task involves more data than just the time series itself, and some of that data will probably be tabular, so it's no surprise gradient boosted trees perform better.


Neural nets are known to struggle with tabular data. Have you tried fine tuning or attaching a decoder somewhere that you train on your task? Zero-shot inference might be asking for too much.
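To make the "attach a decoder" suggestion concrete, here's a toy sketch: keep a pretrained encoder frozen and fit only a small linear readout on its outputs. `frozen_encoder` here is my stand-in for any pretrained model, and the closed-form fit is just illustrative, not anyone's actual fine-tuning recipe:

```python
# Toy sketch: frozen pretrained encoder + trainable linear head.

def frozen_encoder(window):
    # placeholder embedding: mean and last value of the context window
    return [sum(window) / len(window), window[-1]]

def fit_linear_head(X, y):
    """Closed-form least squares for a 2-feature linear readout
    (solves the 2x2 normal equations A w = b)."""
    a11 = sum(x[0] * x[0] for x in X)
    a12 = sum(x[0] * x[1] for x in X)
    a22 = sum(x[1] * x[1] for x in X)
    b1 = sum(x[0] * yi for x, yi in zip(X, y))
    b2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a11 * a22 - a12 * a12
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

series = list(range(20))                       # toy data: y_t = t
X = [frozen_encoder(series[t - 3:t]) for t in range(3, 20)]
y = [series[t] for t in range(3, 20)]
w = fit_linear_head(X, y)                      # only the head is "trained"
pred = sum(wi * xi for wi, xi in zip(w, frozen_encoder([17, 18, 19])))
# pred continues the toy trend: 20.0
```

The point is that the task-specific part can be tiny; the encoder's representations do the heavy lifting, which is usually a much easier ask than zero-shot inference.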


>> Neural nets are known to struggle with tabular data.

Not disagreeing with you, and I'm not a specialist, but it's funny that a lot of papers seem to claim exactly the opposite.


What paper says the opposite? This is what I can find:

https://arxiv.org/abs/2207.08815

https://arxiv.org/abs/2305.02997


Honestly the best part of this paper is they've put together a large new set of time series for benchmarking.


https://facebook.github.io/prophet/

"Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well."


?


Very cool that the dataset and model weights are open right away! This paper also doesn't have a bunch of weird architectural choices pulled out of nowhere like the other TS foundation models recently. Looks like it will actually be useful, thank you! Maybe I will actually get to do representation learning for TS during my PhD.

As a sidenote/rant, it would be nice if every supervised TS benchmark included "DLinear + RevIN" as the standard baseline, since in my experiments it gets about the same performance as all the other new SOTA forecasting models. Most papers compare against the linear model without RevIN while using RevIN themselves, and only beat it because of that :) And in any case, supervised training of transformers from scratch on datasets with fewer than 1M points is just stupid (that's less raw data than a single image). Less than 1B is still at least mildly stupid.

Here, of course, the angle is zero-shot, so it's somewhat excused, but it would still be interesting to see whether it can beat that supervised combination.
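For anyone unfamiliar, the RevIN half of that baseline is tiny: normalize each context window by its own statistics, forecast in normalized space, then invert. Rough sketch (my own toy version, with a last-value model standing in for the DLinear part):

```python
# Toy sketch of reversible instance normalization around any forecaster.

def revin_forecast(window, base_forecaster, horizon):
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = var ** 0.5 or 1.0                  # guard against constant windows
    normed = [(x - mean) / std for x in window]
    out = base_forecaster(normed, horizon)   # forecast in normalized space
    return [y * std + mean for y in out]     # invert the normalization

def repeat_last(window, horizon):            # trivial stand-in for DLinear
    return [window[-1]] * horizon

forecast = revin_forecast([10.0, 20.0, 30.0], repeat_last, horizon=2)
```

The reason it matters for benchmarking is exactly the rant above: much of the apparent gain of new architectures comes from this per-window renormalization, not the architecture itself, so baselines should get it too.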


I'm curious where universal forecasting models are most useful. It is technically fascinating but forecasting specifically seems like a domain where you'd want interpretable modeling - you use it for big-value problems and it significantly affects your action/policy. So, the tradeoff between performance and model simplicity should lean towards the latter?


So I am not alone! There seem so few people who hold this view these days.


Same for my shop - we manage a large pool of cost driven by partially forecastable factors, and we've repeatedly rejected methods purely on explainability grounds. Our accountability requirements do not allow us to point the finger at an LLM if we get it wrong.


I know. Here I am modeling my data generating process like a chump.



And the documentation makes me think they did a great job making this easy to use. Looking forward to playing around with it.

Edit: oh you’re one of the authors — thank you, and congratulations!


Choosing beggars and all that, but the LOTSA dataset could really benefit from a README on HuggingFace. Even just a citation back to the original paper would be good.


That's actually a great suggestion, thanks! We're also still working on improving the readability/usability of the codebase.


They should sign up for the next Makridakis forecasting competition.

https://en.wikipedia.org/wiki/Makridakis_Competitions

Makridakis and Hibon reached the sad conclusion that "statistically sophisticated and complex methods do not necessarily provide more accurate forecasts than simpler ones."


That was true in the first Makridakis competition ("M1") in 1982, and possibly until M4 in 2018, but both M5 and M6 were won by what would generally be considered relatively sophisticated methods (e.g. LightGBM).

The Wikipedia article doesn't have that much detail on M5 or M6, but the M5 papers are in the International Journal of Forecasting[1] and M6 should be published later this year (there's already a preprint on arxiv [2]).

I recently spent some time looking into the history and results of the M competitions and had a chance to speak to Professor Makridakis about them, as well as the winners of each of the M6 competition tracks [3]. While the methods have become more sophisticated, some conclusions from M1 still seem to hold: in particular, that there is no overall "best" method, and that the winning method tends to be different for different types of data, time horizons, and evaluation metrics.

[1]: https://www.sciencedirect.com/science/article/pii/S016920702... [2]: https://arxiv.org/abs/2310.13357 [3]: https://mlcontests.com/state-of-competitive-machine-learning...


Our basic low-dimensional parametric model landed No1 at the SKU level at the M5, see my lecture https://www.lokad.com/tv/2022/1/5/no1-at-the-sku-level-in-th... (more references at the bottom)


Interesting, thanks for sharing!


A recent thread on Amazon's new Chronos forecasting model showed that an ensemble of simple models outperformed it (a heavily parametrized transformer model) on the M-competition datasets.

https://github.com/Nixtla/nixtla/tree/main/experiments/amazo...
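For flavor, a "simple ensemble" baseline of that kind can be as basic as averaging a few classical forecasts (the linked Nixtla experiment uses its own set of statistical models; this is just an illustrative sketch of the idea):

```python
# Three classical baselines plus a simple average-of-forecasts ensemble.

def naive(y, h):
    """Repeat the last observed value."""
    return [y[-1]] * h

def seasonal_naive(y, h, season=12):
    """Repeat the last full season."""
    return [y[-season + (i % season)] for i in range(h)]

def drift(y, h):
    """Extend the average historical slope."""
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [y[-1] + slope * (i + 1) for i in range(h)]

def ensemble(y, h, season=12):
    """Pointwise mean of the three base forecasts."""
    parts = [naive(y, h), seasonal_naive(y, h, season), drift(y, h)]
    return [sum(vals) / len(parts) for vals in zip(*parts)]

history = list(range(24))          # two "years" of monthly toy data
forecast = ensemble(history, 3)    # 3-step-ahead forecast
```

Methods like these are nearly free to run, which is what makes the comparison against large pretrained transformers so pointed.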


Show us how it performs against other models in the M3, M4, and M5 competitions.

Those are the gold standard for forecasting tools.

Moirai stands for the Fates [https://en.wikipedia.org/wiki/Moirai] in Greek mythology.


Looks super interesting. Definitely going to play with this, though it took me way too long to figure out what "Salesforce Air Search" was... maybe that's a sign I should log off for the day.


References this paper on Time Series transformers. First I’ve seen someone apply transformers to time series specifically. Very curious how well this might work for low-frequency events.

https://arxiv.org/abs/2402.02592


What does 'any-variate' forecasting mean? Can you use this pre-trained model to produce forecasts when there exists useful covariates/features/predictors? Is this something the other TS foundation models can/cannot do?


When we deal with many different multivariate time series, each time series can have a different number of variates. So "any-variate" means that the model can take multivariate time series with an arbitrary number of variates as input, and model the interactions between variates with the Transformer's attention mechanism. This is something many other TS foundation models do not consider yet - they convert all multivariate time series into multiple univariate time series.

Whether or not the forecasts improve as a result of the additional covariates is still an open question that needs to be studied more -- we need to build better evaluations and benchmarks for this.
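In pseudocode, the flattening is roughly this (an illustrative sketch of the idea, not the actual implementation -- see the paper/repo for the real tokenization, which works on patches rather than single values):

```python
# Illustrative "any-variate" flattening: a (time x variate) series
# becomes one token sequence, with each token tagged by its time step
# and variate id so attention can model cross-variate interactions.

def flatten_any_variate(series):
    """series: list of per-timestep lists, each of length n_variates."""
    tokens = []
    for t, values in enumerate(series):
        for v, value in enumerate(values):
            tokens.append({"value": value, "time": t, "variate": v})
    return tokens

# a 3-step, 2-variate series yields 6 tokens; a univariate series yields
# one token per step, so both fit the same model
tokens = flatten_any_variate([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

Because the sequence length is just time x variates, the same model ingests univariate and multivariate series alike, which is the point of "any-variate".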


This looks very interesting! I'm trying to understand if the flattening technique might work for my TS. It's structured as follows: at each time step t, I have an m by n data matrix. The value of m (rows) varies per time step; n stays constant and represents the features, and I want to predict one of the n values. (In this case, t represents a single day, the m rows represent the people that entered a store on that day, and the n columns represent various features of those people. I want to predict one of those features, given the others.) The fact that it's a time series matters, because I expect the relationships to change over time. For instance, some feature n[x] (person wears a yellow shirt) might be correlated with my target feature n[y] (person steals), but only in the summer. Would it be possible to flatten this too? What would that look like?


Understood, thank you. There are certainly applications in demand sensing/demand forecasting where things like recent order information, recent sales, CRM inputs are quite predictive of near-term outcomes, but become useless for longer horizon forecasts. In my experience, when information like this is available, no time-series technique that is unable to leverage this information would beat even simple regressions for short term horizon forecasts.


They flatten the time and variate dimensions into a single 1D vector. So it can handle arbitrary numbers of features.


One detail I don't really understand is the low-variance normal component of the target mixture. I'd be curious to see from the weights how often it actually gets used.


Anyone tried this for Prometheus metrics?



