Comment by uh_uh

5 days ago

But don't you think this might be a case where there is both self-congratulation and actual progress?

The level of proof for the latter is much higher, and IMO, OpenAI hasn't met the bar yet.

Something really funky is going on with newer AI models and benchmarks, versus how they perform subjectively when I use them for my use cases. I say this across the board[1], not just regarding OpenAI. I don't know if frontier labs have run into Goodhart's law vis-à-vis benchmarks, or if my use cases are atypical.

1. I first noticed this with Claude 3.5 vs Claude 3.7

That's a fair question, and I agree. I just find it odd how we shout across the aisle, whether in favor or against. It's a case of thinking the tech is neat, while cringing at all the money-people and their ideations.