Comment by anotherpaulg

1 year ago

FWIW, I agree with you that each model has its own personality and that models may do better or worse on different kinds of coding tasks. Aider leans into both of these concepts.

The GPT-4 Turbo models have a lazy coding personality, and I spent significant effort figuring out how to both measure and reduce that laziness. This resulted in aider supporting a "unified diffs" code editing format, which reduced laziness by 3X [0], and in the aider refactoring benchmark, which quantifies those benefits [1].
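
For anyone unfamiliar with the format: a unified diff expresses an edit as a few context lines plus -/+ lines, so the model writes out only the changed lines instead of regurgitating (or lazily eliding) the whole file. Here's a quick illustrative sketch using Python's stdlib difflib; the file name and edit are made up, and aider's exact flavor of the format is described in [0]:

    import difflib

    # A hypothetical one-line edit, expressed as before/after line lists.
    before = ["def greet(name):\n", '    print("Hello " + name)\n']
    after  = ["def greet(name):\n", '    print(f"Hello, {name}!")\n']

    # difflib emits standard unified diff output: file headers, a hunk
    # header, then context lines plus -/+ lines for the edit.
    print("".join(difflib.unified_diff(before, after,
                                       "a/greeting.py", "b/greeting.py")))
    # --- a/greeting.py
    # +++ b/greeting.py
    # @@ -1,2 +1,2 @@
    #  def greet(name):
    # -    print("Hello " + name)
    # +    print(f"Hello, {name}!")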

The benchmark results I just shared for GPT-4 Turbo with Vision cover both smaller, toy coding problems [2] and larger edits to larger source files [3]. The new model slightly underperforms on the smaller coding tasks, and significantly underperforms on the larger edits, where laziness is often a culprit.

[0] https://aider.chat/2023/12/21/unified-diffs.html

[1] https://github.com/paul-gauthier/refactor-benchmark

[2] https://aider.chat/2024/04/09/gpt-4-turbo.html#code-editing-...

[3] https://aider.chat/2024/04/09/gpt-4-turbo.html#lazy-coding