Comment by softwaredoug

18 days ago

The problem will always be the training data. We can have LLMs because we have the web.

Can we get to another level without a corresponding massive training set that demonstrates those abilities?

I would say the IMO results demonstrated that. The silver-medal result came from a tiny 3B model.

None of our theorem provers could approach silver-medal performance, despite decades of algorithmic leaps.

The training stage for transformers demonstrated a while ago that it can make insanely good distributed jumps into promising regions of combinatorial structures. Their inference is simply much faster than the inference of algorithms that aren't heavily informed by data.
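To make the speed gap concrete, here is a toy sketch (my own illustration, not anything from the thread): uninformed breadth-first search versus best-first search guided by a score function. The heuristic here is hand-written and stands in for a data-trained scorer; the point is only how much less of the space an informed search has to touch.

```python
import heapq
from collections import deque

# Toy combinatorial search: reach `target` from 1 using ops x -> x+1 and x -> x*2.
# Compares uninformed BFS with best-first search guided by a score function
# (a hand-written stand-in for a learned model that ranks promising states).

def bfs(target):
    """Uninformed breadth-first search; returns the number of states expanded."""
    seen = {1}
    queue = deque([1])
    expanded = 0
    while queue:
        x = queue.popleft()
        expanded += 1
        if x == target:
            return expanded
        for nxt in (x + 1, x * 2):
            if nxt <= target and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return expanded

def guided(target):
    """Best-first search ordered by a heuristic (distance to target);
    returns the number of states expanded."""
    seen = {1}
    heap = [(target - 1, 1)]
    expanded = 0
    while heap:
        _, x = heapq.heappop(heap)
        expanded += 1
        if x == target:
            return expanded
        for nxt in (x + 1, x * 2):
            if nxt <= target and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (target - nxt, nxt))
    return expanded

# For a power-of-two target the guided search expands roughly log2(target)
# states, while BFS expands a large fraction of everything below the target.
print(bfs(2**17), guided(2**17))
```

The analogy is loose, of course: a real prover's state space is astronomically larger, and the "heuristic" there is the learned distributed representation, not a one-line formula.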

It’s just a fully different distributed algorithm, one where we probably can’t even extract a single working piece without breaking the performance of the whole.

A world/word model just isn’t what’s going on there. Gradient descent evidently landed on a distributed representation of an algorithm that does search.