Comment by overfeed

7 months ago

The level of proof for the latter is much higher, and IMO, OpenAI hasn't met the bar yet.

Something really funky is going on with newer AI models and benchmarks, versus how they perform subjectively when I use them for my use-cases. I say this across the board[1], not just regarding OpenAI. I don't know if frontier labs have run into Goodhart's law vis-à-vis benchmarks, or if my use-cases are atypical.