Comment by boxed
21 hours ago
> SOTA LLMs have been shown again and again to solve problems unseen in their data set
We have no idea what the training data is though, so you can't say that.
> and despite their shortcomings they have become extremely useful for a wide variety of tasks.
That seems like a separate question.
I have applied O3 Pro to abandoned research of mine that was never published and lives in an intersection of fields as novel as it is uninteresting.
O3 Pro (but not O3) was able to apply reasoning and math to this domain in interesting ways, much as an expert researcher in these areas would.
Again, the field and the problem are, with 100% certainty, OOD of the training data.
However, the techniques and reasoning methods are of course learned from data. But that's the point, right?
The paper is evaluating how well an LLM can handle novelty, and on the paper's terms you need to calculate or otherwise deduce the degree or type of novelty, rather than simply asserting that your never-published research is novel.
I don't even know whether that is possible without seeing the training data, hence the difficulty of saying how good O3 Pro is at "reasoning".
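To make "calculating the degree of novelty" concrete, here is a minimal sketch of one crude way it could be operationalized, assuming you had some reference corpus standing in for the training data (which, as above, you don't). The corpus, the `novelty_score` function, and the n-gram size are all hypothetical illustration, not anything the paper or OpenAI actually does.

```python
# Toy illustration only: score how "novel" a piece of text is relative to a
# reference corpus by measuring character n-gram overlap. The real training
# set is inaccessible, which is the whole problem; this just shows that
# "degree of novelty" has to be defined against *some* corpus.

def char_ngrams(text: str, n: int = 5) -> set:
    """All character n-grams of the text, lowercased."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def novelty_score(problem: str, corpus: list, n: int = 5) -> float:
    """Fraction of the problem's n-grams never seen in the corpus.
    0.0 = everything overlaps, 1.0 = nothing overlaps."""
    seen = set()
    for doc in corpus:
        seen |= char_ngrams(doc, n)
    grams = char_ngrams(problem, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in seen)
    return unseen / len(grams)

# Hypothetical usage: a preschool question scores near 0 against a corpus
# that contains it; an obscure research question scores much higher.
corpus = ["what color is the sky? the sky is blue.",
          "two plus two equals four."]
print(novelty_score("what color is the sky", corpus))
print(novelty_score("asymptotics of quasi-frobenius lattices over tropical semirings", corpus))
```

Even this toy version shows why the question is hard: the score is only meaningful relative to the corpus you pick, and the actual training data is exactly the corpus we can't see.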
The most novel problem would presumably be something only a Martian could understand, written in an alien language; the least novel would be a basic question taught in preschool, like what color the sky is.
Your research falls somewhere between those extremes.
LLMs don't learn reasoning. At all. They are statistical language models, nothing else. If they get math right, it's because correct math is more statistically probable given the training data; they can't actually do math. This should be pretty clear from all the "how many Rs are in strawberry" type examples.
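For what it's worth, "correct math is more statistically probable" does have a concrete mechanical reading. Below is a deliberately simplistic count-based model, nothing like a real transformer, with a made-up corpus and a hypothetical `predict` function, just to show what "returns 4 without ever doing arithmetic" looks like. Whether that's all a frontier LLM is doing is exactly the point being argued above.

```python
# Toy count-based "language model": it never computes 2 + 2, it just returns
# whichever token most often followed the prefix in its (made-up) corpus.
from collections import Counter, defaultdict

corpus = ["2 + 2 = 4", "2 + 2 = 4", "2 + 2 = 5", "7 + 1 = 8"]

# Count next-token frequencies for every prefix seen in the corpus.
next_token = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for i in range(1, len(tokens)):
        next_token[tuple(tokens[:i])][tokens[i]] += 1

def predict(prompt: str) -> str:
    """Most frequent continuation of the prompt; no arithmetic involved."""
    counts = next_token.get(tuple(prompt.split()))
    return counts.most_common(1)[0][0] if counts else "<unknown>"

print(predict("2 + 2 ="))  # "4" -- only because "4" outnumbered "5" in the corpus
print(predict("3 + 3 ="))  # "<unknown>" -- this prefix never appeared
```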