GPT-4 Turbo with Vision is a step backwards for coding

1 year ago (aider.chat)

Interestingly, GPT-4 Turbo with Vision is at the top of the LiveCodeBench Leaderboard: https://livecodebench.github.io/leaderboard.html

(GPT-4 Turbo with Vision has a knowledge cutoff of Dec 2023, so filter to Jan 2024+ to minimize the chance of contamination.)

In general, my take is that each model has its own personality, which can cause it to do better or worse on different sorts of tasks. From evaluating many LLMs, I've found that it's almost never the case that one model is better than another at everything. When an eval only has a certain type of problem (e.g., only edits to long codebases, or only short self-contained competition problems), it's not clear how well its performance rankings will generalize to other coding tasks. Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.

(I work at OpenAI, so feel free to discount my opinions as much as you like.)

  • As a user, I basically just care about a minimum baseline of competence... which most models do well enough on. But then I want the model to "just give me the code". I switched to Claude, and canceled my ChatGPT subscription because the amount of placeholders and just general "laziness" in ChatGPT was terrible.

    Using Claude was a breath of fresh air. I asked for some code, I got the entire code.

    • I've been using Claude 3 Opus for a while now and was fairly happy with the results. Wouldn't say they were better than GPT-4, but considerably less verbose, which I really appreciated. Recently though I ran into two questions that Claude answered incorrectly and incompletely until I prompted it further. One was a Java GC question where it forgot Epsilon and then hallucinated that it wasn't experimental anymore. The other was a coding question where I knew there wouldn't be a good answer, but Claude kept repeating a previous answer even though I had twice told it that it wasn't what I was looking for.

      So I've switched back to GPT-4 for the time being to see if I'm happier with the results. I never felt that Claude 3 Opus was measurably better than GPT-4 to begin with.

      3 replies →

    • Claude is a bit more expensive though, no? I felt like I burned through $5 worth of credit in one evening, but perhaps that was also because I was using the big-AGI UI and it was producing diagrams for me, often in quintuplicate for some reason. Still, I really like Claude and much prefer it over the others.

      1 reply →

    • What were the placeholders and laziness? I just ended my prompts with something akin to "give me the full code and nothing else" and ChatGPT does exactly that. How does Claude do any better?

      3 replies →

  • FWIW, I agree with you that each model has its own personality and that models may do better or worse on different kinds of coding tasks. Aider leans into both of these concepts.

    The GPT-4 Turbo models have a lazy coding personality, and I spent a significant effort figuring out how to both measure and reduce that laziness. This resulted in aider supporting a "unified diffs" code editing format to reduce such laziness by 3X [0] and the aider refactoring benchmark as a way to quantify these benefits [1].

    The benchmark results I just shared about GPT-4 Turbo with Vision cover both smaller, toy coding problems [2] as well as larger edits to larger source files [3]. The new model slightly underperforms on smaller coding tasks, and significantly underperforms on the larger edits where laziness is often a culprit.

    [0] https://aider.chat/2023/12/21/unified-diffs.html

    [1] https://github.com/paul-gauthier/refactor-benchmark

    [2] https://aider.chat/2024/04/09/gpt-4-turbo.html#code-editing-...

    [3] https://aider.chat/2024/04/09/gpt-4-turbo.html#lazy-coding
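
    (For readers unfamiliar with the unified diff format mentioned above, an edit looks roughly like the example below. The file and function names are made up for illustration, and this isn't the exact prompt wording aider uses; see [0] for the real details.)

      --- a/calculator.py
      +++ b/calculator.py
      @@ ... @@
       def add(a, b):
      -    # TODO: implement
      -    pass
      +    return a + b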

  • Hi Ted, since I have been using GPT-4 pretty much every day, I have a few questions about its performance. We had been using 1106-preview for several months to generate SQL queries for a project, but one fine day in February it stopped cooperating and began responding with things like "As a language model, I do not have the ability to generate queries, etc." This lasted for a few hours. Anyway, switching to 0125-preview immediately resolved the problem, and we have been using that for code generation tasks ever since, unless we are doing FAQ stuff (where GPT-3.5 Turbo was good enough).

    However, of late I am noticing some really inconsistent behaviour from 0125-preview, where it responds inconsistently to certain problems, i.e., one time it works with a detailed prompt and another time it doesn't. I know these models are predicting the next most likely token, which is not always deterministic.

    So I was hoping for the ability to fine-tune GPT-4 Turbo some time soon. Is that on OpenAI's roadmap?
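
    (One partial mitigation until then: pinning temperature to 0 and passing the seed parameter, which the chat completions API now accepts. A rough sketch, with the model string and prompts purely illustrative:)

      from openai import OpenAI

      client = OpenAI()
      resp = client.chat.completions.create(
          model="gpt-4-0125-preview",
          messages=[
              {"role": "system", "content": "You translate questions into SQL."},
              {"role": "user", "content": "Monthly revenue per region for 2023."},
          ],
          temperature=0,  # remove sampling randomness
          seed=42,        # best-effort determinism across calls
      )
      print(resp.choices[0].message.content)
      print(resp.system_fingerprint)  # changes when the backend configuration changes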

    • I don’t work for OpenAI but I do remember them saying that a select few customers would be invited to test out fine tuning GPT-4, and that was several months ago now. They said they would prioritise those who had previously fine tuned GPT-3.5 Turbo.

  • The ongoing model anchoring/grounding issue likely affects all GPT-4 checkpoints/variants, but it is most prominent with the latest "gpt-4-turbo-2024-04-09" variant due to its more recent cutoff date. It might imply deeper issues with the current model architecture, or at least with how it's been updated:

    See the issue: https://github.com/openai/openai-python/issues/1310

    See also the original thread on OpenAI's developer forums (https://community.openai.com/t/gpt-4-turbo-2024-04-09-will-t...) for multiple confirmations on this issue.

    Basically, without a separate declaration of the model variant in the system message, even the latest gpt-4-turbo-2024-04-09 variant over the API might hallucinate that it is GPT-3 and that its cutoff date is in 2021.

    A test code snippet is included in the GitHub issue to A/B test the problem yourself with a reference question.
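
    For reference, a minimal sketch of that kind of A/B test with the OpenAI Python SDK (the snippet in the issue may differ; the question and settings here are just for reproducibility):

      from openai import OpenAI

      client = OpenAI()

      def ask_cutoff(system_prompt=None):
          messages = [{"role": "system", "content": system_prompt}] if system_prompt else []
          messages.append({"role": "user", "content": "What is your knowledge cutoff date?"})
          resp = client.chat.completions.create(
              model="gpt-4-turbo-2024-04-09",
              messages=messages,
              temperature=0,
          )
          return resp.choices[0].message.content

      print("A (no system prompt):", ask_cutoff())
      print("B (variant declared):", ask_cutoff("You are gpt-4-turbo-2024-04-09"))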

  • I think there's a bigger underlying problem with the current GPT-4 model(s) atm:

    Go to the API Playground and ask the model what its current cutoff date is. For example, in its chat, if you're not instructing it with anything else, it will tell you that its cutoff date is in 2021. Even if you explicitly tell the model via the system prompt "you are gpt-4-turbo-2024-04-09", in some cases it still thinks its cutoff is April 2023.

    The fact that the model (variants of GPT-4 including gpt-4-turbo-2024-04-09) hallucinates its cutoff date being in 2021 unless specifically instructed with its model type is a major factor in this equation.

    Here are the steps to reproduce the problem:

    Try an A/B comparison at: https://platform.openai.com/playground/chat?model=gpt-4-turb...

    A) Make sure "gpt-4-turbo-2024-04-09" is indeed selected. Don't tell it anything specific via the system prompt, and in the worst case it'll think its cutoff date is in 2021. It also can't answer questions about more recent events.

    * Reload the web page between prompts! *

    B) Tell it via the system prompt: "You are gpt-4-turbo-2024-04-09" => you'll get answers to recent events. Ask it anything about what's been going on in the world after April 2023 to verify against A.

    I've tried this multiple times now, and have always gotten the same results. IMHO this implies a deeper issue in the model where the priming goes way off if the model number isn't mentioned in its system message. This might explain the bad initial benchmarks as well.

    The problem seems pretty bad at the moment. Basically, if you omit the priming message ("You are gpt-4-turbo-2024-04-09"), it will in worst cases revert to hallucinating 2021 cutoff dates and doesn't get grounded into what should be its most current cutoff date.

    If you do work at OpenAI, I suggest you look into it. :-)

  • >I work at OpenAI

    I know there's a lot you can't talk about. I'm not going to ask for a leak or anything like that. I'd just like to know, what do you think programming will look like by 2025? What do you think will happen to junior software developers in the near future? Just your personal opinion.

  • Hey Ted, I had a question about working at OpenAI, if you don't mind talking with me. If so, email address is in my profile. Thank you!

  • Pretty sweet site, thx for sharing. Hope y'all will start bringing token count up at some point. Will be testing this newer version too.

  • Appreciate that OpenAI popped in to say the new release is probably better at something else, but it would have been nice to acknowledge that this suggestion...

    > “Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.”

    ...is exactly what is done by the author of these benchmark suites:

    "It performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models."

    • Agreed! Kudos to Paul for creating the evals, running them quickly, and sharing results. My comment (not on behalf of OpenAI, but just me as an individual) was meant as a "yes and", not a "no but".

I feel like degrading the discipline of programming/development to "coding" is a bigger step backwards. Coding is used in programming, but if you're just churning out code then you're not developing well-architected, maintainable and safe software.

It's like saying that accounting is just adding. I think come the end of the tax year you'd be avoiding an accountant who says they've got experience in adding.

  • I lean in the opposite direction. When someone random like a new neighbor initially asks me what I do, I say "I'm a coder at an insurance company". I teach fifth graders Python in an "Advanced Coding Club". When people ask me what I do for hobbies, one of them is "learning new languages to code with". I will only go into more detail if they are also technical and want more details about what it is I code.

    I don't think of it as degrading the thing that I do. I think of it as boiling it down to the simplest description, and I find it more refreshing than "software developer" or "computer programmer" or f"{word} engineer".

    • I use what impact and benefit my work has to answer what I do.

      In your insurance case, I would say something like "I build tools to shield businesses from unexpected disasters like earthquakes or floods" or "I help people worry less about expenses during an emergency"

      If someone asks me more, then I might add that I work on software to automate the claims process, or similar.

  • I think the field has long had an issue describing the differences between design and implementation, which has only grown worse as more levels of designing and implementing have appeared. It is a bit like explaining the difference, back in the day, between the person who works out a formula and the person assigned to computing it. Neither is trivial work, and the outsider who doesn't like math will view both of them as doing math, but there is still a gap in the mathematical skill and insight involved.

    You mention taxes, which makes me think of how many tax preparers are basically helping their customer input data into software and not providing any tax-specific advice. That might still be a value add for someone who struggles with computer UIs, but that isn't the same as the person helping move money between accounts to reduce tax liability.

    I've seen similar when it comes to doing science in a lab.

    How can any discipline protect the inner distinction against a much larger world which has a very simplified understanding and will mix the inner groups together?

  • I've never come across someone that was more enlightened by what I do when using the word "engineer" vs using the word "coder". If anything I would assume coder elicits a more accurate mental image than something a bit more overloaded like engineer.

OpenAI just released GPT-4 Turbo with Vision and it performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models.

  • Thanks again for running all these benchmarks with model releases. They are really helpful to track progress!

  • Really appreciate the thoroughness you apply to evaluating models for use with Aider. Did you adjust the prompt at all for the newer models?

  • I've definitely run into this personally. But even when I explicitly tell it not to skip implementation and to generate fully functional code, it says that it understands but then continues right on omitting things again.

    It was honestly shocking, because we're so used to it understanding our commands that a blatant disregard like that made me seriously wonder what kind of laziness layer they added.

    • I suspect they might be worried it could reproduce copyrighted code in certain circumstances, so their solution was to condition the model to never produce large continuous chunks of code. It was a very noticeable change across the board.

      1 reply →

    • They should offer different models at this point.

      This laziness occurs over and over, so why bother with omniscience.

    • The laziness layer seems to be meant to make it an assistant, not a replacement that does the task for you.

A big limitation with GPT-4 Turbo (and Claude 3) for coding is the output token size. The only way to overcome the 4k limit is by generating one file (if it fits), feeding it back to generate the second, and so on.

For this reason, GPT-4-32k is my preferred model for codegen. I wish there were cheaper options.
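
A rough sketch of that feed-it-back loop with the OpenAI Python SDK, for anyone who hasn't tried it (model string and prompts are illustrative; the finish_reason check is what drives the continuation):

    from openai import OpenAI

    client = OpenAI()
    messages = [{"role": "user", "content": "Generate the full contents of app.py for ..."}]
    parts = []

    while True:
        resp = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09",
            messages=messages,
            max_tokens=4096,
        )
        choice = resp.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":  # the model finished on its own
            break
        # Output hit the completion cap: feed it back and ask for the continuation.
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})

    full_output = "".join(parts)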

  • Can you use 32k with Chat?

    • Chat is a fairly terrible interface for real work, since you can’t modify anything, meaning the context can easily get poisoned. I much prefer the API playgrounds, and third-party interfaces, that allow editing both my input and the responses.

      4 replies →

I posted this before, I'll post this again: GPT getting lazier is not an objectively bad thing. I don't copy the code it generates; I more often ask it about higher-level concepts, and I have to instruct it not to generate imports and other boilerplate code. In most cases, this “lazy” generation saves time and tokens and is exactly what I need.

  • Yes, that's true, but if you specifically ask it to give the full code, it should do so.

  • Interesting, I only trust them for boilerplates that I hate writing 100 times...

    • It can give you ideas and lead you to new paths when problem solving; you just have to be aware that its knowledge is planet-wide but an inch deep. I've lost count of how many times it conceptually gets "above the target" rather nicely and then its implementation is like a blind person throwing darts. I've also lost count of how many times it describes the code it wrote and it's like it has two brains, one which writes the description and the other, ahem, a little bit "slower", that writes the code.

      Classic debate looks like this:

      - Hey how do you implement X in lang Y using Z?

      - Certainly! Blablablah this code adds 1 and the return is 3!

      - Your code returns 5 and it seems to add 2, fix it.

      - I apologize for the oversight, here's the fixed version (replies with the same, maybe slightly altered, but still broken, code)

      Well I guess ultimately one can't expect miracles from a statistics based token generation machine.

      Sometimes I do wonder if the entire gen AI craze of the last few years is just one massive bubble and we're actually nowhere near AGI.

      All the evidence I see when interacting with these models points towards them "knowing" things, but not "understanding" things, a context aware planet-scale Wikipedia. (Don't get me wrong, I still think LLMs are life changing for language specific tasks like translation etc., but they're just not in any way new forms of intelligent beings, which is what a lot of mainstream population and even some investors seem to think).

      1 reply →

Good thing Claude's a massive step forward.

  • I had my Anthropic account banned (presumably) because I was testing out the vision capabilities and took a photo of a Japanese kitchen knife and asked it to "translate the characters on the knife into English". This wasn't a Claude Pro account, but an API account, so it's extra weird because what if I had some product based off the API, and an end user asked/searched for something taboo..does my entire business get taken offline? Good thing this was just a test account with like $10 in credit on it. They haven't responded to my "account suspension appeal" which is just a google form to enter your email address, not even a box to enter any details.

    Anyways, Claude 3 Opus is pretty great for coding (I think better in most cases than the GPT-4 Turbo previews), but I'm a bit wary of Anthropic now.

    • I just tried to make an account

      1. Asks me to enter my phone number and sends me a code

      2. Enter code

      3. Asks me to enter email and get code

      4. Enter code

      5. Redirects to asking me to enter phone number, but my number is already used now

      6. My account is automatically banned

      4 replies →

    • > They haven't responded to my "account suspension appeal" which is just a google form to enter your email address, not even a box to enter any details.

      The complete lack of customer service is going to get more and more dystopian as these AI companies become more interwoven with everyday life.

      3 replies →

    • Were you still on the very first test account, e.g. before even adding any money?

      I know indirectly that Anthropic was the #1 target for a lot of ERP denizens for a while now, so they're probably extremely trigger-happy until you clear a hurdle or two.

    • I guess you can always use AI to detect inappropriate content from users... oh wait.

      Seriously though, I understand that these mostly play to the enterprise market where even a hint of anything remotely "unsafe" needs to be shut down and deleted but why can't they allow us to turn off the strict filtering like Google does? Why can Google offer "unsafe" content (in a limited fashion but it's FINE) but LLM providers can't?

      Lack of competition?

      7 replies →

  • Well, our team has been using Claude Opus for the past month and we are now switching back to GPT-4. While the code is better, it is hard to make it do further modifications to the given code. It scores low on the reasoning end, in our experience.

  • And yet the UI for their consumer offering is hot garbage. I really don’t feel like it’s better than ChatGPT in capabilities and the UI is not as good. Not to mention there is no app to use on mobile.

  • It's worthless until they open up the api for private use.

    • I’ve been using the Claude 3 API since the models were announced. I believe it’s generally available (though capacity constrained & rate limited at present).

      3 replies →

Another thing I have noticed is that if you use ChatGPT and it at some point uses Bing to look something up, it becomes super lazy afterwards, going from page-long responses on average to a single paragraph.

  • So the more advanced the AI, the more human-like it becomes. Senior Programmer level AI will spend all computing resources browsing memes.

  • It probably has to do with the extended context window. Keeping websites in there is kind of a hassle. But I actually consider that a feature, not a bug. If I have ChatGPT use the internet, I don't want a full page answer - especially not on the relatively slow GPT4. It's also a hassle if you're unsure about the validity of the output. In that case I might as well browse myself. Just give me a short preview so I can either start searching on my own or ask more questions.

  • You can/should make a custom GPT that isn't allowed to use Bing. Works much better that way

  • If the answer is too lazy, you can tell it to elaborate. However, repairing a lazy context is sometimes slow and unreliable.

    To avoid that, use backtracking and up the pressure for detailed answers. Then consider taking the least lazy of 2 or 3 samples.

    A good prompt for detailed answers is Critique of Thought, an enhanced chain-of-thought technique. You ask for a search and a detailed response with simple sections including analysis, critique and key assumptions.

    It will expend more tokens, get more ideas out, and achieve higher accuracy. It will also be less lazy and more likely to recover from laziness or mistakes.

    TL;DR: if GPT-4 is being lazy, backtrack and request a detailed, multi-section critical analysis.
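
    Roughly the kind of instruction I mean (wording is just an example, not a magic formula):

      Answer in the following sections, each fully written out:
      1. Analysis - restate the problem and the relevant facts.
      2. Key assumptions - list everything you are assuming.
      3. Proposed answer - the full, detailed answer, no omissions.
      4. Critique - point out weaknesses or errors in your own answer.
      5. Revised answer - correct anything the critique uncovered.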

GPT-3.5 performance on basic programming tasks used to be just fine, but over the past few weeks the output quality has dropped dramatically. All of this tweaking definitely has its downsides.

  • If you prefer using GPT-3.5 due to its lower price or speed, wouldn't it be better to switch to Haiku? People were even able to match the performance of Opus when they added a couple of examples to the prompt.
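
    A minimal few-shot sketch with the Anthropic Python SDK, in case it helps (the Haiku model string is the one current at the time of writing; the example pairs are placeholders for your own):

      import anthropic

      client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
      resp = client.messages.create(
          model="claude-3-haiku-20240307",
          max_tokens=1024,
          system="You write concise, complete Python functions. No placeholders.",
          messages=[
              # A couple of worked examples passed as prior turns (the few-shot part).
              {"role": "user", "content": "Write a function that reverses a string."},
              {"role": "assistant", "content": "def reverse(s: str) -> str:\n    return s[::-1]"},
              # The actual request.
              {"role": "user", "content": "Write a function that chunks a list into batches of size n."},
          ],
      )
      print(resp.content[0].text)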

    • "Unfortunately, Claude.ai is only available in certain regions right now."

      GPT 3.5 used to be good enough, so I never bothered getting a paid account. I also heard some reports about 3.5 actually being better for the type of coding tasks I usually offload.

      1 reply →

  • Why would someone still use GPT-3.5 in 2024? There are tens of fully open models available which beat GPT-3.5 at every possible skill, and you can run them locally.

    • I tried all I can run on an RTX 3080 Ti, but none got close for the kind of basic tasks I like to outsource to an LLM. Which would you recommend for mostly node/react/python/php work?

      I do have a 4090 available at work, if the extra 8GB vRAM makes a big difference. The task I used as a test case was converting existing PHP & JS code (views and controllers) with static texts to files with dynamic translation references.

If it were possible to hook into the token selection process (kind of like a JSON-restricted grammar, but using custom scripts), then it would be possible to detect that GPT-4 is about to add "# implement code here" and force it to select a different set of tokens, which would make GPT-4 generate a proper method body.
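
(With open-weight models the decoding loop is yours, so something along these lines is already possible; the hosted GPT-4 API only exposes a static per-request logit_bias rather than a per-step hook. A rough sketch with Hugging Face transformers, model name and banned phrases illustrative:)

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "codellama/CodeLlama-7b-hf"  # any local code model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Token sequences we never want the sampler to emit.
    lazy_phrases = ["# implement code here", "# rest of code unchanged", "TODO: implement"]
    bad_words_ids = [tokenizer(p, add_special_tokens=False).input_ids for p in lazy_phrases]

    prompt = "Write a complete Python function that parses a CSV file into a list of dicts:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        bad_words_ids=bad_words_ids,  # blocks these exact token sequences at decode time
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))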

  • That's called guidance and the problem is that it has to be done carefully or else you'll just get rephrasings that work around the block.

    I think a better approach is multi-pass coding along with fine-tuning or prompting to use a particular form of TODO comment. Aider can already do a form of fake "fill in the middle" by making it emit diffs. If it notices that some code has been filled out lazily, it could go back and ask it to do the next chunk of work. Given that large tasks are normally split up into small tasks by programmers anyway, this seems like a natural approach that is required for scaling up regardless.

It's not just for coding; the base “gpt-4” model seems better than the latest preview model:

https://platform.openai.com/docs/models/continuous-model-upg...

> In particular, it seems much more prone to “lazy coding” than the existing GPT-4 Turbo “preview” models.

The previous model (without vision) was already "lazy". It will omit large portions of code and wants you to merge its changes into previous answers yourself. Then you try hard to force it to give the full code, no omissions.

That's why I reach for Claude 3 more and more. Its context window is larger, and it gives me full, detailed answers, no omissions. But it hallucinates more, in my impression, mentioning packages / functions that are not available. All in all, though, a superb choice in addition to ChatGPT-4.

I would be curious to see if the results improve by using DSPy to improve your prompts (and also reevaluate which prompts work better on the newest model).

How hard could it be to let ChatGPT Plus users choose model versions? (especially when older versions are accessible through the API)

We're missing the elephant in the room. Who's going to maintain the code?

You think GPT-5 and Llama 4 aren't going to be opinionated and change your code going forward?

I am a bit lost looking at the models

Can the following be assumed:

- The gpt-4-preview models are history now

- gpt-4-turbo-2024-* are the now released models

- There will be no more 'preview' models released in the '4' branch

?

The only thing I learned in the last year is that you can't really benchmark LLMs at all. Above a certain level it's just edge case against edge case, or script kiddies and multi-billion-dollar corps optimizing their fine-tunes against the test.