Comment by bobsondugnut
2 days ago
Not a fan of these kinds of arguments. The 'correct' token is entirely dependent on the dataset. An LLM could have perfect training loss on a given dataset, but that has no predictive power for its ability to 'answer' arbitrary prompts.
In natural language, many strings are equally valid; there are many ways to chain tokens together to get the 'correct' answer to an in-sample prompt. A model with perfect loss will then, for an ambiguous sequence of tokens, assign each candidate next token a likelihood that corresponds to the number of valid token paths in the corpus that continue with it.
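To make the 'perfect loss' point concrete, here's a minimal sketch in Python (the toy corpus and helper name are made up for illustration): a model at zero cross-entropy just reproduces the empirical next-token distribution of its training corpus, so an ambiguous prefix gets probability mass split across its valid continuations rather than a single 'correct' answer.

```python
# Minimal sketch: the cross-entropy-minimizing ("perfect loss") model
# reproduces the empirical next-token distribution of the corpus.
# Toy corpus and function name are hypothetical, for illustration only.
from collections import Counter, defaultdict

# Toy corpus: several equally valid phrasings of the same idea.
corpus = [
    ("the", "cat", "sat", "on", "the", "mat"),
    ("the", "cat", "sat", "on", "the", "rug"),
    ("the", "cat", "sat", "on", "the", "mat"),
]

# Count next-token occurrences for every prefix seen in training.
next_counts: dict[tuple, Counter] = defaultdict(Counter)
for seq in corpus:
    for i in range(len(seq) - 1):
        next_counts[seq[: i + 1]][seq[i + 1]] += 1

def perfect_loss_model(prefix: tuple) -> dict[str, float]:
    """Empirical conditional distribution P(next token | prefix)."""
    counts = next_counts[prefix]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# The ambiguous prefix has two valid continuations; the zero-loss model
# splits mass in proportion to their corpus counts.
print(perfect_loss_model(("the", "cat", "sat", "on", "the")))
# -> {'mat': 0.666..., 'rug': 0.333...}
```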
Compounding errors can certainly happen, but for many of the tokens upstream of the key ones it's irrelevant. There are so many ways to phrase things that are equally correct; that is how language evolved (and continues to evolve). Getting back to my first point: even if you assume an LLM with perfect loss on the training dataset, you can still get garbage back at test time, so I'm not sure thinking about 'compounding errors' is useful.
I suspect errors in LLM reasoning are more closely related to noisy training data, or an overabundance of low-quality training data. I've observed this in how the reasoning LLMs behave: given tasks that are less common in the corpus (the internet and digital assets) and that require higher-order reasoning, they tend to fail. Advanced math or programming problems, by contrast, tend to go a bit better, and there the input data is likely much cleaner.
But for something like "how do I change the fixture on this light?", I'll get back some kind of garbage from the SEO-verse. IMO the next step for LLMs is figuring out how to curate an extremely high-quality dataset at scale.