Comment by overfeed
6 days ago
The level of proof for the latter is much higher, and IMO, OpenAI hasn't met the bar yet.
Something really funky is going on with newer AI models and benchmarks, versus how they perform subjectively when I use them for my use-cases. I say this across the board[1], not just regarding OpenAI. I don't know if frontier labs have run into Goodhart's law with respect to benchmarks, or if my use-cases are atypical.
1. I first noticed this with Claude 3.5 vs Claude 3.7