Comment by embedding-shape

12 hours ago

> I think Evans is completely wrong. There are only 2 truly frontier models. (at least for now). And Anthropic seems to be leaving OpenAI behind so there might be only 1 in the near future. (which is scary/dangerous)

Truly fascinating ecosystem and community in general, as experiences differ so wildly. Anthropic's models seems far behind OpenAI to me, especially when you get into "Pro" territory, and there doesn't seem to be any worthy competition to Pro Mode available at all.

And this is said with someone who use both platforms, and spend a lot of my day interacting with agents and LLMs in various ways. The interesting part is that probably so do you too, and probably your experience and what you share lines up with what you experience! Yet we come away with basically opposite takeaways :) I don't think either of us are wrong either, somehow.

I agree with what you're saying. I have a Claude plan for work and I prefer using Claude more than any other LLM I've tried. Having recently tried the Codex 100€ plan with GPT-5.5 in high/xhigh, I don't think it's worse that the Opus models, just different.

I've noticed that depending on how you talk to it, you get wildly different outputs. This seems to happen less with Opus: it mostly understand what I want. GPT is often a bit too literal.

Just my two cents.

  • > I've noticed that depending on how you talk to it, you get wildly different outputs. This seems to happen less with Opus: it mostly understand what I want. GPT is often a bit too literal.

    Yeah, exact prompting matters a lot, seemingly more than people think. There is definitely tradeoffs between how literal the models takes the prompts, on one hand it's useful for the model to ignore their own instinct when you know better, so they don't go chasing geese randomly, but on the other hand it's useful sometimes when they self-direct, when you misworded something and it's obvious you meant something different because of the context, and similar things. They're basically good at different things.

    Really agree every model isn't equal and they aren't as interchangeable without adjusting how you prompt them as people seem to think.

People use a model as their daily driver, get very familiar with it and it's behavior, and then go and use another model and have a hard time. It's very difficult to separate "the model is bad" from "the model works differently".

  • > It's very difficult to separate "the model is bad" from "the model works differently"

    At which point it’s fair to reject the commoditization label.

    Also missing from these discussions are e.g. Qwen, which is at least as good as one back from OpenAI or Anthropic’s frontiers.

    • > Also missing from these discussions are e.g. Qwen, which is at least as good as one back from OpenAI or Anthropic’s frontiers.

      They're missing in the discussion because the ones you can run locally, aren't actually "one step away from other closed-source labs" in practice when you use them. They might benchmark as such, but they're sadly far away from measuring up to those scores except for very specific use cases, even when you have say 96GB of VRAM available to run the bigger models even most (at home) consumers won't be able to run.

      3 replies →

For HPC/ai work opus blows gpt away, it’s no competition.

  • As someone who just spent the last three days (tried using both, ended up using mostly Codex) implementing DiffusionGemma in Rust, I think they're more or less equal when it comes to machine learning and AI. They get stuck at different points, but wouldn't say one is a clear winner over the other. HPC I have no idea so I'll take your word for it :)

When you say "Pro" territory, do you include Fable?

  • You mean the model that was available for a whole of three days? No, I had played around with it a tiny bit, but not much than that. I guess time will tell if it gets close.