Comment by nsagent
1 year ago
Sorry, but that does not seem to be the case. A friend of mine who runs a long-context benchmark on novel understanding [1] just ran an eval, and o1 seemed to improve by only 2.9% over GPT-4o (the result isn't on the website yet). It's great that there is an improvement, but it isn't drastic by any stretch. Additionally, since we can't see the raw reasoning the answers are based on, it's hard to attribute this increase to their complicated approach rather than simply cleaner, higher-quality training data.
EDIT: Note this was run over a dataset of short stories rather than full novels, since the API errors out on very long contexts like novel-length texts.
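For what it's worth, the workaround described in the edit is easy to replicate: count tokens before submitting and skip anything over a budget. A minimal sketch below, assuming the OpenAI Python SDK and tiktoken; the model name, token budget, and prompt format are my assumptions, not the benchmark's actual setup.

```python
# Sketch: skip documents that exceed a token budget before calling the API,
# so novel-length inputs (which can trigger API errors) are filtered out
# and only short stories are evaluated.
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("o200k_base")
MAX_CONTEXT_TOKENS = 100_000  # assumed budget; pick one below the model's context limit

def ask_about_story(story: str, question: str, model: str = "o1-preview") -> str | None:
    """Return the model's answer, or None if the story is too long to submit."""
    if len(enc.encode(story)) > MAX_CONTEXT_TOKENS:
        return None  # skip novel-length inputs rather than risk an API error
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{story}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```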
It's a good rebranding. The version naming was getting ridiculous: 3.5, 4, 4.5, …