Comment by pama
1 year ago
Thanks. I don’t think we disagree on any major points; maybe there is a communication barrier, and it may be on me. I came to ML from a computational math/science/statistics background. These next-token prediction algorithms are of course learned mappings, and I’m not sure one needs anything else once the mappings involve reasonably powerful abilities. If you come from a pure CS background and think in terms of search, then yes, one could simply explore a sequence of A’:B’ -> A’’:B’’ -> … before finding A:B, and use the conditional probability of the sequence as the guide for a best-first search or MCTS expansion (assuming the training data had a similar structure). Are there other ways to learn that type of search? Probably.

But what I meant above by “algorithm” is what you correctly understood as the mapping itself: the transformer computes useful intermediate quantities, distributed throughout its weights and sometimes centered at different depths, so that it can eventually produce the step mapping A’:B’ -> A:B. We don’t yet have a clean disassembler to probe this trained “algorithm,” so there are only rare efforts where we can map this mapping back to conventional pseudocode, and not in the general case (and I wouldn’t even know how easy it would be to work with a somewhat shorter, but still huge, functional form that translates English to a different language, or to computer code).

Part of why o1-like efforts didn’t start before we had reasonably powerful architectures and the required compute is that these kinds of “algorithm” developments require large enough models (though we have had those for a couple of years now) and relevant training data (which are easier to procure, build, and clean up with the aid of the early tools).
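To make the search framing above concrete, here is a minimal sketch of a best-first search over partial chains of steps, scored by cumulative sequence log-probability. The helpers step_logprob, propose_steps, and is_goal are hypothetical stubs introduced only for illustration; in practice they would query the trained next-token model for candidate steps and their token log-probabilities.

```python
import heapq

def step_logprob(prefix, step):
    # Hypothetical stand-in for the sum of token log-probs of `step`
    # given the partial chain so far, as scored by the trained model.
    return -0.1 * (len(prefix) + 1)

def propose_steps(prefix):
    # Hypothetical stand-in for sampling a few candidate next steps
    # (A':B' -> A'':B'') from the model, conditioned on the chain so far.
    return [f"step-{len(prefix)}-{i}" for i in range(3)]

def is_goal(prefix):
    # Hypothetical stand-in for checking whether the chain has reached
    # the target mapping A:B.
    return len(prefix) >= 4

def best_first_search(max_expansions=1000):
    # heapq is a min-heap, so we store the negated cumulative log-probability
    # and always expand the currently most probable partial chain first.
    frontier = [(0.0, ())]
    for _ in range(max_expansions):
        if not frontier:
            return None
        neg_logp, prefix = heapq.heappop(frontier)
        if is_goal(prefix):
            return prefix
        for step in propose_steps(prefix):
            new_logp = -neg_logp + step_logprob(prefix, step)
            heapq.heappush(frontier, (-new_logp, prefix + (step,)))
    return None

if __name__ == "__main__":
    print(best_first_search())
```

Replacing the priority-queue selection with a UCT-style rollout policy would turn the same skeleton into the MCTS expansion mentioned above; the scoring signal (sequence log-probability) stays the same.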