
Comment by jjani · 4 days ago

It feels similar to Llama4 - rushed. Sonnet had been king for at least 6 months, then Gemini 2.5 Pro recently raised the bar, and they felt they had to respond. Ghibli memes are great, but not at the cost of losing the whole enterprise market.

Currently for B2C, there's almost no lock-in: users can switch to a better app/model at very little cost. B2B is different; a product built on Sonnet generally isn't going to switch to an OA model overnight unless there are huge benefits. OA will want a piece of that lock-in pie, which it had been losing at a very rapid pace. Whether their new models solve that remains to be seen.

As someone actually building products on top of these models, I still don't see much reason to use any of theirs. From all the testing I've done over the last 2 days, they don't seem particularly competitive. Potentially 4.1 or o4-mini for certain tasks, but whether they beat e.g. Deepseek v3 currently isn't clear-cut.

Yeah. God knows. I was really surprised to see Fchollet's benchmark being aced months ago, but whatever internal QA they had was perhaps lacking. I asked it for some fairly simple code, in Python no less, using scikit-learn, for which I presume there's plenty of training data. For some reason it changed the casing of the columns and didn't follow my instructions: the function kept being rewritten to "reduce bloat", along with other random changes I didn't ask for.
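For illustration, a minimal sketch of the kind of task being described; the column names and the function here are my own assumptions, not from the original exchange. The point is that silently changing column casing in code like this breaks it outright:

```python
# Hypothetical example: a simple scikit-learn function where column
# casing matters. If a model rewrites "SqFt"/"Bedrooms"/"Price" as
# "sqft"/"bedrooms"/"price", the lookup raises a KeyError.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_price_model(df: pd.DataFrame) -> LinearRegression:
    # Select features and target by their exact column names.
    X = df[["SqFt", "Bedrooms"]]
    y = df["Price"]
    # Hold out a test split and fit a plain linear regression.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
    return model
```

So an unrequested "cleanup" rewrite of a function like this isn't cosmetic; it changes behavior.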

  • Everyone games the benchmarks, but a lot points towards both Meta and OpenAI going to even greater lengths than the others.

    • I am wondering, however, whether this is o3-preview or o3. I had wildly fluctuating experiences with the preview models previously, esp. the GPT4-Turbo previews, though GPT4-Turbo/V/o were a lot more stable.