Comment by vjerancrnjak

18 days ago

I would say the IMO results demonstrated that. The silver-medal model was a tiny 3B model.

None of our theorem provers could come close to silver-medal performance, despite decades of algorithmic leaps.

The learning stage of transformers demonstrated a while ago that it can make insanely good distributed jumps into good regions of combinatorial structures. And inference is just much faster than inference in algorithms that aren't heavily informed by data.

It's just a fully different, distributed algorithm, one where we probably can't even extract a single working piece without breaking the performance of the whole.

A world/word model is just not what's going on there. Gradient descent evidently landed on a distributed representation of an algorithm that does search.