
Comment by gettingoverit

6 days ago

One piece of evidence I have is that the most powerful reasoning models can answer at most 10% of my PhD-student-level computer science questions, and are unable to produce correct implementations of basic algorithms even when given direct references to existing implementations and to materials that describe them. Damn, o3 can't even draw an SVG arrow. The recent advancement in counting the "r"s in "strawberry" is basically as far as it goes.
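(For context on how low that bar is: a complete arrow is just a line with a triangular marker attached to its end, something like the sketch below. The id and coordinates are arbitrary, picked only for illustration.)

    <svg xmlns="http://www.w3.org/2000/svg" width="120" height="40">
      <defs>
        <!-- triangular arrowhead; orient="auto" rotates it to follow the line -->
        <marker id="head" markerWidth="8" markerHeight="8" refX="6" refY="3" orient="auto">
          <path d="M0,0 L6,3 L0,6 z"/>
        </marker>
      </defs>
      <!-- the shaft, with the arrowhead attached at its end -->
      <line x1="10" y1="20" x2="100" y2="20" stroke="black"
            stroke-width="2" marker-end="url(#head)"/>
    </svg>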

I don't know what exactly is at play here, or how exactly OpenAI's models can produce those "exceptionally good" results on benchmarks while being utterly unable to do even a quarter of that in the private evaluations of pretty much everyone I know. I'd expect them to use some kind of RAG technique that makes the question "what was in the training set at a given model checkpoint" irrelevant.

If you consider that several billion dollars of investment and national security are at stake, "weird conspiracy" becomes a regular Tuesday.

Unfortunately I can't see beyond the first message of that primary source.