Comment by moffkalast
1 year ago
If you read the paper again, they deal with pre-training data and fine-tuning data specifically. Their test is about information being pulled out zero-shot, which suggests the associations attention forms between tokens during training are one-directional. It's also just testing recall, so my example is as apples-to-apples as you can get when comparing systems with such a large disparity in complexity.
In-context reasoning tends to work a lot more reliably for these examples: if you put any of the test statements into the prompt directly before asking the question, practically any LLM can answer correctly. That's why very small models are still useful for RAG use cases.
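To make the distinction concrete, here's a minimal sketch of the two setups. The llm(prompt) helper is hypothetical (a stand-in for whatever model you're testing), and the Tom Cruise fact is just an illustrative reversal-style example, not necessarily the paper's exact test item:

```python
# Hypothetical llm(prompt) -> str helper wrapping the model under test.

FACT = "Mary Lee Pfeiffer is Tom Cruise's mother."  # illustrative reversal-style fact

def zero_shot(llm):
    # Zero-shot recall: the fact has to come out of the model's weights,
    # and the reversed direction of a trained statement often fails.
    return llm("Who is Mary Lee Pfeiffer's son?")

def in_context(llm):
    # In-context recall (the RAG pattern): the statement sits in the prompt
    # right before the question, so even very small models usually get it.
    return llm(f"{FACT}\nQuestion: Who is Mary Lee Pfeiffer's son?\nAnswer:")
```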