Comment by fpgaminer
16 hours ago
It does seem like individual prompting styles greatly affect the performance of these models. Which makes sense, of course, but the disparity is a lot larger than I would have expected. As an example, I'd say I see far more people in the HN comments preferring Claude over everything else. This is in stark contrast to my experience, where ChatGPT has been and continues to be my go-to for everything. And that's across a range of problems: general questions, coding tasks, visual understanding, and creative writing. I use these AIs all day, every day as part of my research, so my experience is quite extensive. Yet in all cases Claude has performed significantly worse for me. Perhaps it just comes down to the way that I prompt versus the average HN user? Very odd.
But yeah, o1 has been a _huge_ leap in my experience. One big thing, which OpenAI's announcement mentions as well, is that o1 is more _consistently_ strong. 4o is a great model, but sometimes you have to spin the wheel a few times. I much more rarely need to spin o1's wheel, which mostly makes up for its thinking time (which is much shorter these days compared to o1-preview). It also has much stronger knowledge. So far it has solved a number of troubleshooting tasks for which there were _no_ fixes online. One of them was an obscure bug in libjpeg.
It's also better at just general questions, like wanting to know the best/most reputable store for something. 4o is too "everything is good! everything is happy!" to give helpful advice here. It'll say Temu is a "great store for affordable options." That kind of stuff. Whereas o1 will be more honest and thus helpful. o1 is also significantly better at following instructions overall, and inferring meaning behind instructions. 4o will be very literal about examples that you give it whereas o1 can more often extrapolate.
One surprising thing that o1 does that 4o has never done is that it _pushes back_. It tells me when I'm wrong (and is often right!). Again, part of that is it being less happy and compliant. I have had scenarios where it's wrong and it's harder to convince it otherwise, so it's a double-edged sword, but overall it has been an improvement in the bot's usefulness.
I also find it interesting that o1 is less censored. It refuses far less than 4o, even without coaxing, despite its supposed ability to "reason" about its guidelines :P What's funny is that the "inner thoughts" it shows say it's refusing, but its response doesn't.
Is it worth $200? I don't think it is, in general. It's not really an "engineer" replacement yet, in that if you don't have the knowledge to ask o1 the right questions it won't really be helpful. So you have to be an engineer for it to work at the level of one. Maybe $50/mo?
I haven't found o1-pro to be useful for anything; it's never really given better responses than o1 for me.
(As an aside, Gemini 2.0 Flash Experimental is _very_ good. It's been trading blows with even o1 on some tasks. It's a bit chaotic, since its training isn't done, but I rank it at about #2 among all SOTA models. A 2.0 Pro model would likely be tied with o1 if Google's trajectory here continues.)