Comment by jonotime
1 year ago
Interesting. So in my use, I rarely see GPT get it right on the first pass, but that's mostly down to how it interprets the question. I'm ruling out the times when it hallucinates calls to functions that don't exist.
Let's say I ask for some function that does some matrix math in Python. It will spit out something, but I don't like what it did. So I will say: now don't use any calls to that library you pulled in, and also allow for these types of inputs. Add exception handling...
So response time is important, since it's a conversation, no matter how correct the response is.
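For concreteness, the end state I'm usually steering toward looks something like the sketch below (a hypothetical illustration, not actual model output: a pure-Python matrix multiply with input checks and exceptions, no library calls; the name mat_mul and the specific checks are my own):

    def mat_mul(a, b):
        # Multiply two matrices given as lists of lists (pure Python, no libraries).
        if not a or not b or not all(isinstance(r, list) for r in a + b):
            raise TypeError("inputs must be non-empty lists of lists")
        if any(len(r) != len(a[0]) for r in a) or any(len(r) != len(b[0]) for r in b):
            raise ValueError("ragged matrix: rows have unequal lengths")
        if len(a[0]) != len(b):
            raise ValueError("inner dimensions do not match")
        # zip(*b) iterates over the columns of b.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]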
When you say DeepSeek "nailed it on the first attempt," do you mean it was without bugs? Or that it worked how you imagined? Or what exactly?
DeepSeek-R generated a working web page on the first attempt, based on a single brief prompt I gave it.
With Sonnet 3.5, given the same brief prompt I gave DeepSeek-R, it took half a dozen feedback steps to get to 90%. Trying a hand-drawn sketch as input to Sonnet instead was quicker: an impressive first attempt, but iterative attempts to fix it failed before I hit the usage limit. Gemini was the slowest to work with, and it took a lot of feedback to get to the "almost there" stage, after which it floundered.
The AI companies seem to want to move in the direction of autonomous agents (with reasoning) that you hand a task off to and that then work on it while you do something else. I guess that'd be useful if they were close to human level and could make meaningful progress without feedback, and I suppose today's slow-responding reasoning models can be seen as a step in that direction.
I think I'd personally prefer something that responds fast enough to use as a capable "pair programmer," rather than an autonomous agent trying to be an independent team member (at least until the AI gets MUCH better), but in either case, being able to do what's being asked is what matters. If the fast/interactive AI only gets me 90% of the way (then wastes my time floundering until I figure out it's just not capable of the task), the slower but more capable model seems preferable, as long as it's significantly better.