Comment by lysecret
1 day ago
I'm pretty deep into this topic, and what might be interesting to an outsider is that the leading models (NeuralGCM and WeatherNext 1 before, as well as this model now) are all trained with a "CRPS" objective, which I haven't seen at all outside of ML weather prediction.
Essentially you add random noise to the inputs and train by minimizing the regular loss (like L1) while at the same time maximizing the difference between two members with different random noise initializations. I wonder if this will be applied to more traditional GenAI at some point.
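To make that concrete, here is a rough sketch of the two-member version of such a loss. This is my own PyTorch-style illustration, not any lab's actual code; the noise scale, where the noise is injected, and the exact CRPS estimator all vary between models:

```python
import torch

def two_member_crps_loss(model, x, y, noise_std=0.1):
    # Run the same model twice on the same input, each time with an
    # independent noise perturbation (here on the inputs; some models
    # perturb the parameters instead).
    pred_a = model(x + noise_std * torch.randn_like(x))
    pred_b = model(x + noise_std * torch.randn_like(x))

    # Skill term: plain L1 distance of each member to the target.
    skill = 0.5 * ((pred_a - y).abs() + (pred_b - y).abs())

    # Spread term: subtracting the disagreement between the two members
    # rewards diversity instead of letting them collapse onto one output.
    spread = 0.5 * (pred_a - pred_b).abs()

    # Together this is the fair CRPS estimator for a two-member ensemble.
    return (skill - spread).mean()
```

The skill term pulls both members toward the observation while the spread term is subtracted, which is exactly the "minimize the regular loss while maximizing the difference between members" trade-off described above.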
Isn't the random noise added to the model parameters, not the inputs?
This reminds me of variational noise (https://www.cs.toronto.edu/~graves/nips_2011.pdf).
If it is random noise on the input, it would be like many of the SSL methods, e.g. DINO (https://arxiv.org/abs/2104.14294), right?
Yes, you're right, it's applied to the parameters, but other models (like NeuralGCM) applied it to the inputs. IMO it shouldn't make a huge difference; the main point is that you maximize the difference between the members.
> Essentially you add random noise to the inputs and train by minimizing the regular loss (like L1) while at the same time maximizing the difference between two members with different random noise initializations. I wonder if this will be applied to more traditional GenAI at some point.
We recently had a situation where we specifically wanted to generate 2 "different" outputs from an optimization task and struggled to come up with a good heuristic for doing so. Not at all a GenAI task, but this technique probably would have helped us.
This idea is often used for self-supervised learning (SSL). E.g. see DINO (https://arxiv.org/abs/2104.14294).
That’s pretty neat. It reminds me of how VAEs work: https://en.wikipedia.org/wiki/Variational_autoencoder
What is the goal of doing that vs using L2 loss?
To add to the existing answers: L2 losses induce a "blurring" effect when you autoregressively roll out these models. That means you not only lose important spatial features, you also truncate the extrema of the predictions; in other words, you can't forecast high-impact extreme weather with these models at moderate lead times.
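A hypothetical toy example of that regression-to-the-mean effect (numbers made up purely for illustration): if the outcome is bimodal, the L2-optimal point forecast is the blurry average, while a two-member CRPS score prefers two sharp members.

```python
import numpy as np

# Made-up outcome: either +5 or -5 (say, storm hits or misses), 50/50.
outcomes = np.array([5.0, -5.0])

# The forecast minimizing expected L2 loss is the mean: a blurry 0
# that never matches either real outcome and has no extremes.
l2_optimal = outcomes.mean()  # 0.0

def crps_two_member(m1, m2, y):
    # Fair CRPS estimator for a two-member ensemble {m1, m2} vs. outcome y.
    return 0.5 * (abs(m1 - y) + abs(m2 - y)) - 0.5 * abs(m1 - m2)

# Collapsed ensemble (both members at the blurry mean) vs. a sharp one.
blurry = np.mean([crps_two_member(0.0, 0.0, y) for y in outcomes])   # 5.0
sharp  = np.mean([crps_two_member(5.0, -5.0, y) for y in outcomes])  # 0.0
```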
Yes, very good point. To me this is one of the most magical elements of this loss: how it suddenly stops the model from "collapsing" onto one averaged output and the predictions become sharp.
To encourage diversity between the different members of an ensemble. I think people are doing very similar things for MoE networks, but I'm not that deep into that topic.
The goal of using CRPS is to produce an ensemble that is a good probabilistic forecast without needing calibration/post-processing.
[edit: "without", not "with"]
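For reference, the same idea generalizes to an m-member ensemble. A small NumPy sketch of the fair CRPS estimator (my own illustration, with assumed array shapes) looks roughly like this:

```python
import numpy as np

def fair_crps(members, obs):
    # members: (m, ...) ensemble forecasts; obs: (...) verifying observation.
    m = members.shape[0]

    # Skill: mean absolute error of the members against the observation.
    skill = np.abs(members - obs).mean(axis=0)

    # Spread: pairwise distances between members, with the 1/(2m(m-1))
    # factor that makes the estimator "fair" (it corrects for the
    # finite ensemble size).
    spread = np.abs(members[:, None] - members[None, :]).sum(axis=(0, 1)) / (2 * m * (m - 1))

    # A low value means the ensemble is both accurate and well spread,
    # i.e. already a usable probabilistic forecast.
    return skill - spread
```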
> I'm pretty deep into this topic, and what might be interesting to an outsider is that the leading models (NeuralGCM and WeatherNext 1 before, as well as this model now) are all trained with a "CRPS" objective, which I haven't seen at all outside of ML weather prediction.
You're being a bit misleading here. The model is trained on historical data, but each run off of new instrument readings is generated a few times to form an ensemble.