Comment by anotherpaulg
1 year ago
OpenAI just released GPT-4 Turbo with Vision and it performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models.
Thanks again for running all these benchmarks with each model release. They are really helpful for tracking progress!
Really appreciate the thoroughness you apply to evaluating models for use with Aider. Did you adjust the prompt at all for the newer models?
I've definitely run into this personally. But even when I explicitly tell it not to skip implementation and to generate fully functional code, it says it understands and then goes right back to omitting things.
It was honestly shocking. We're so used to it following our instructions that such blatant disregard made me seriously wonder what kind of laziness layer they added.
I suspect they might be worried it could reproduce copyrighted code in certain circumstances, so their solution was to condition the model to never produce large continuous chunks of code. It was a very noticeable change across the board.
I thought it would be for performance: if it doesn't output all of the code, each reply is shorter and quicker. You can still ask it to generate the rest of the code, but that introduces latency, so the overall load ends up lower.
People hypothesized that OpenAI added laziness in order to save money on token generation, since they are burning through GPU time.
This has been my conclusion too. Given it's a product I'm paying for monthly, it seems super regressive to have to trick it into doing what it used to do just fine.
I'd probably pay triple to go back to the pre-"Dev Day" product at this point
They should offer different models at this point.
This laziness happens over and over, so what's the point of all that omniscience?
The laziness layer seems intended to make it an assistant rather than a replacement that actually does the task.