Comment by sheepscreek
2 days ago
> Our dataset consisted of Kubernetes pod metrics collected from a production retail checkout application.
That sums it up, and it’s no surprise that Datadog’s Toto model performed exceptionally well.
The results would have been much more useful had they opted for a heterogeneous mix of datasets. I am thinking of census data and statistics, or financial forecasting (GDP, interest rates), or clinical trial drop-out rates, etc. So many interesting problems out there.
At the moment our focus is on observability, hence the narrow scope of our dataset. A pretty good benchmark for observability seems to be Datadog's BOOM: https://huggingface.co/datasets/Datadog/BOOM
But for general-purpose time-series forecasting, benchmarks mentioned in other comments, like GIFT or M4, might come in handy. We might include them in the follow-up experiment.
The GIFT Eval benchmark would be a good place to start: https://huggingface.co/spaces/Salesforce/GIFT-Eval