
Comment by marcellus23

5 months ago

I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.

When talking about the long-term capabilities of a class of tools, it makes sense to be general. I think deriving conclusions at all is pretty difficult given how fast everything is moving, but there are some realities we do actually know about how LLMs work, and we can talk about those.

Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.

  • > Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.

    Isn't that exactly what is going to help us understand the value these tools bring to end users, and how to optimize them for better future use? None of these models are copy+pastes of each other; they tend to do things slightly differently under the hood. How those differences affect results seems like exactly the data we would want here.

    • I guess I disagree that the main concern is the differences between individual models rather than LLM technology in general. Given how fast it's all changing, I would personally rather focus on the broader conversation. I don't really care whether GPT-5 is better at benchmarks; I care whether LLMs are actually capable of the kind of reasoning and productive output the world currently thinks they are.


Yes, I’d be curious about his experience with the GPT-5 Thinking model. So far I haven’t seen any blunders from it.

  • I've seen plenty of blunders, but in general it's better than their previous models.

    Well, it depends a bit on what you mean by blunders. But, e.g., I've seen it confidently assert mathematically wrong statements with nonsense proofs instead of admitting that it doesn't know.