Comment by lysecret
1 day ago
I'm pretty deep into this topic, and what might be interesting to an outsider is that the leading models (NeuralGCM and WeatherNext 1 before, as well as this model now) are all trained with a "CRPS" objective, which I haven't seen at all outside of ML weather prediction.
Essentially you add random noise to the inputs and train by minimizing the regular loss (like L1) while at the same time maximizing the difference between two members with different random noise initializations. I wonder if this will be applied to more traditional GenAI at some point.
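To make that concrete, here is a rough sketch of the two-member version of such a loss. This is my own PyTorch-style illustration, not any lab's actual code; the noise scale, where the noise is injected, and the exact CRPS estimator all vary between models:

```python
import torch

def two_member_crps_loss(model, x, y, noise_std=0.1):
    # Run the same model twice on the same input, each time with an
    # independent noise perturbation (here on the inputs; some models
    # perturb the parameters instead).
    pred_a = model(x + noise_std * torch.randn_like(x))
    pred_b = model(x + noise_std * torch.randn_like(x))

    # Skill term: plain L1 distance of each member to the target.
    skill = 0.5 * ((pred_a - y).abs() + (pred_b - y).abs())

    # Spread term: subtracting the disagreement between the two members
    # rewards diversity instead of letting them collapse onto one output.
    spread = 0.5 * (pred_a - pred_b).abs()

    # Together this is the fair CRPS estimator for a two-member ensemble.
    return (skill - spread).mean()
```

The skill term pulls both members toward the observation while the spread term is subtracted, which is exactly the "minimize the regular loss while maximizing the difference between members" trade-off described above.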
Isn't the random noise added to the model parameters, not the inputs?
This reminds me of variational noise (https://www.cs.toronto.edu/~graves/nips_2011.pdf).
If it is random noise on the input, it would be like many of the SSL methods, e.g. DINO (https://arxiv.org/abs/2104.14294), right?
Yes, you're right, it's applied to the parameters, but other models (like NeuralGCM) applied it to the inputs. IMO it shouldn't make a huge difference; the main point is that you maximize the difference between the members.
> Essentially you add random noise to the inputs and train by minimizing the regular loss (like L1) while at the same time maximizing the difference between two members with different random noise initializations. I wonder if this will be applied to more traditional GenAI at some point.
We recently had a situation where we specifically wanted to generate 2 "different" outputs from an optimization task and struggled to come up with a good heuristic for doing so. Not at all a GenAI task, but this technique probably would have helped us.
This idea is often used for self-supervised learning (SSL). E.g. see DINO (https://arxiv.org/abs/2104.14294).
That’s pretty neat. It reminds me of how VAEs work: https://en.wikipedia.org/wiki/Variational_autoencoder
What is the goal of doing that vs using L2 loss?
To add to the existing answers: L2 losses induce a "blurring" effect when you autoregressively roll out these models. That means you not only lose important spatial features, you also truncate the extrema of the predictions; in other words, you can't forecast high-impact extreme weather with these models at moderate lead times.
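A hypothetical toy example of that regression-to-the-mean effect (numbers made up purely for illustration): if the outcome is bimodal, the L2-optimal point forecast is the blurry average, while a two-member CRPS score prefers two sharp members.

```python
import numpy as np

# Made-up outcome: either +5 or -5 (say, storm hits or misses), 50/50.
outcomes = np.array([5.0, -5.0])

# The forecast minimizing expected L2 loss is the mean: a blurry 0
# that never matches either real outcome and has no extremes.
l2_optimal = outcomes.mean()  # 0.0

def crps_two_member(m1, m2, y):
    # Fair CRPS estimator for a two-member ensemble {m1, m2} vs. outcome y.
    return 0.5 * (abs(m1 - y) + abs(m2 - y)) - 0.5 * abs(m1 - m2)

# Collapsed ensemble (both members at the blurry mean) vs. a sharp one.
blurry = np.mean([crps_two_member(0.0, 0.0, y) for y in outcomes])   # 5.0
sharp  = np.mean([crps_two_member(5.0, -5.0, y) for y in outcomes])  # 0.0
```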
Yes, very good point. To me this is one of the most magical elements of this loss: how it suddenly stops the model from "collapsing" onto one averaged output and the predictions become sharp.
To encourage diversity between the different members of an ensemble. I think people are doing very similar things for MoE networks, but I'm not that deep into that topic.
The goal of using CRPS is to produce an ensemble that is a good probabilistic forecast without needing calibration/post-processing.
[edit: "without", not "with"]
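For reference, the same idea generalizes to an m-member ensemble. A small NumPy sketch of the fair CRPS estimator (my own illustration, with assumed array shapes) looks roughly like this:

```python
import numpy as np

def fair_crps(members, obs):
    # members: (m, ...) ensemble forecasts; obs: (...) verifying observation.
    m = members.shape[0]

    # Skill: mean absolute error of the members against the observation.
    skill = np.abs(members - obs).mean(axis=0)

    # Spread: pairwise distances between members, with the 1/(2m(m-1))
    # factor that makes the estimator "fair" (it corrects for the
    # finite ensemble size).
    spread = np.abs(members[:, None] - members[None, :]).sum(axis=(0, 1)) / (2 * m * (m - 1))

    # A low value means the ensemble is both accurate and well spread,
    # i.e. already a usable probabilistic forecast.
    return skill - spread
```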
> I'm pretty deep into this topic, and what might be interesting to an outsider is that the leading models (NeuralGCM and WeatherNext 1 before, as well as this model now) are all trained with a "CRPS" objective, which I haven't seen at all outside of ML weather prediction.
You're being a bit misleading here. The model is trained on historical data, but each run off of new instrument readings is generated a few times to form an ensemble.