Comment by tedsanders

1 year ago

Interestingly, GPT-4 Turbo with Vision is at the top of the LiveCodeBench Leaderboard: https://livecodebench.github.io/leaderboard.html

(GPT-4 Turbo with Vision has a knowledge cutoff of Dec 2023, so filter to Jan 2024+ to minimize the chance of contamination.)

In general, my take is that each model has its own personality, which can cause it to do better or worse on different sorts of tasks. From evaluating many LLMs, I've found that it's almost never the case that one model is better than another at everything. When an eval only has a certain type of problem (e.g., only edits to long codebases, or only short self-contained competition problems), it's not clear how well its performance rankings will generalize to other coding tasks. Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.

(I work at OpenAI, so feel free to discount my opinions as much as you like.)

As a user, I basically just care about a minimum baseline of competence... which most models do well enough on. But then I want the model to "just give me the code". I switched to Claude and canceled my ChatGPT subscription because the amount of placeholders and just general "laziness" in ChatGPT was terrible.

Using Claude was a breath of fresh air. I asked for some code, I got the entire code.

  • I've been using Claude 3 Opus for a while now and was fairly happy with the results. I wouldn't say they were better than GPT-4's, but they were considerably less verbose, which I really appreciated. Recently, though, I ran into two questions that Claude answered incorrectly and incompletely until I prompted it further. One was a Java GC question where it forgot Epsilon and then hallucinated that it wasn't experimental anymore. The other was a coding question where I knew there wouldn't be a good answer, but Claude kept repeating a previous answer even though I had twice told it that it wasn't what I was looking for.

    So I've switched back to GPT-4 for the time being to see if I'm happier with the results. I never felt that Claude 3 Opus was measurably better than GPT-4 to begin with.

  • Claude is a bit more expensive though, no? I felt like I burned through $5 worth of credits in one evening, but perhaps it was also because I was using the big-AGI UI and it was producing diagrams for me, often in quintuplicate for some reason. Still, I really like Claude and much prefer it over the others.

  • What were the placeholders and laziness? I just end my prompts with something akin to "give me the full code and nothing else" and ChatGPT does exactly that. How does Claude do any better?

    • Even if I ask in caps, it often comments out large pieces of code. I often give it large pieces of code and ask for adjustments, and then I don't want to have to search through the output and copy-paste only GPT's small adjustments. But it never listens.


FWIW, I agree with you that each model has its own personality and that models may do better or worse on different kinds of coding tasks. Aider leans into both of these concepts.

The GPT-4 Turbo models have a lazy coding personality, and I spent significant effort figuring out how to both measure and reduce that laziness. This resulted in aider supporting a "unified diffs" code editing format to reduce such laziness by 3X [0], and the aider refactoring benchmark as a way to quantify these benefits [1].
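
As a rough sketch of the idea only (not aider's actual prompts or code), the request boils down to asking the model for a unified diff and telling it not to elide code:

    # Rough sketch of the idea only -- not aider's actual prompts or code.
    # Asking for a unified diff, and forbidding placeholder comments, is the
    # basic mechanism the unified-diffs write-up describes for cutting laziness.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM = (
        "You are an expert programmer. Return every change as a unified diff "
        "against the file provided. Never elide code with placeholder comments."
    )

    def request_diff(path, source, instructions):
        resp = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09",
            temperature=0,
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": f"{path}\n\n{source}\n\n{instructions}"},
            ],
        )
        return resp.choices[0].message.content  # a diff to apply to `path`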

The benchmark results I just shared about GPT-4 Turbo with Vision cover both smaller, toy coding problems [2] as well as larger edits to larger source files [3]. The new model slightly underperforms on smaller coding tasks, and significantly underperforms on the larger edits where laziness is often a culprit.

[0] https://aider.chat/2023/12/21/unified-diffs.html

[1] https://github.com/paul-gauthier/refactor-benchmark

[2] https://aider.chat/2024/04/09/gpt-4-turbo.html#code-editing-...

[3] https://aider.chat/2024/04/09/gpt-4-turbo.html#lazy-coding

Hi Ted, since I've been using GPT-4 pretty much every day, I have a few questions about its performance. We had been using 1106-preview for several months to generate SQL queries for a project, but one fine day in February it stopped responding and began replying with things like "As a language model, I do not have the ability to generate queries etc...". This lasted for a few hours. Switching to 0125-preview immediately resolved the problem, and we have been using it ever since for code generation tasks, unless we are doing FAQ-style work (where GPT-3.5 Turbo was good enough).

However, of late I've been noticing some really inconsistent behaviour from 0125-preview, where it responds inconsistently to certain problems, i.e., one time it works with a detailed prompt and another time it doesn't. I know these models are predicting the next most likely token, which is not always deterministic.
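
For what it's worth, pinning the dated snapshot and setting temperature plus the best-effort seed parameter can reduce (but not eliminate) that run-to-run drift; roughly, as a sketch rather than actual production code:

    # Sketch only, not production code: pin the dated snapshot and use
    # temperature and the best-effort seed parameter to reduce variance.
    from openai import OpenAI

    client = OpenAI()

    def generate_sql(question, schema):
        resp = client.chat.completions.create(
            model="gpt-4-0125-preview",  # dated snapshot, not a floating alias
            temperature=0,               # low-variance decoding, still no hard guarantee
            seed=42,                     # best-effort reproducibility across calls
            messages=[
                {"role": "system", "content": "Write SQL queries for this schema:\n" + schema},
                {"role": "user", "content": question},
            ],
        )
        # system_fingerprint changes when the serving configuration changes,
        # which is one source of "same prompt, different answer" drift.
        print(resp.system_fingerprint)
        return resp.choices[0].message.content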

So I was hoping for the ability to fine-tune GPT-4 Turbo some time soon. Is that on the roadmap for OpenAI?
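
For context, what I have in mind is the usual fine-tuning flow over the API, just with a GPT-4 Turbo base model once that's possible; roughly this sketch (shown with gpt-3.5-turbo, which is what's generally available to fine-tune today):

    # Sketch of the generic fine-tuning flow via the Python SDK. gpt-3.5-turbo
    # is used here because GPT-4 Turbo fine-tuning isn't generally available.
    from openai import OpenAI

    client = OpenAI()

    # Upload a JSONL file of {"messages": [...]} chat-format training examples.
    training_file = client.files.create(
        file=open("sql_examples.jsonl", "rb"),  # placeholder path
        purpose="fine-tune",
    )

    # Start the job; poll it later by id until it produces a fine-tuned model.
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",
    )
    print(job.id, job.status)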

  • I don’t work for OpenAI but I do remember them saying that a select few customers would be invited to test out fine tuning GPT-4, and that was several months ago now. They said they would prioritise those who had previously fine tuned GPT-3.5 Turbo.

The ongoing model anchoring/grounding issue likely affects all GPT-4 checkpoints/variants, but it is most prominent with the latest "gpt-4-turbo-2024-04-09" variant because it has the most recent cutoff date. It might imply deeper issues with the current model architecture, or at least with how it's been updated:

See the issue: https://github.com/openai/openai-python/issues/1310

See also the original thread on OpenAI's developer forums (https://community.openai.com/t/gpt-4-turbo-2024-04-09-will-t...) for multiple confirmations on this issue.

Basically, without a separate declaration of the model variant in use in the system message, even the latest gpt-4-turbo-2024-04-09 variant over the API might hallucinate being GPT-3 with a cutoff date in 2021.

A test code snippet is included in the GitHub issue to A/B test the problem yourself with a reference question.
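
For illustration, an A/B harness in the same spirit (not a copy of the snippet in the issue) can be as simple as:

    # A/B sketch in the spirit of the issue's snippet, not a copy of it:
    # ask the same question with and without a system message naming the variant.
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4-turbo-2024-04-09"
    QUESTION = "What is your knowledge cutoff date?"

    def ask(system_prompt=None):
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": QUESTION})
        resp = client.chat.completions.create(model=MODEL, messages=messages)
        return resp.choices[0].message.content

    print("A) no system prompt:", ask())
    print("B) variant declared:", ask(f"You are {MODEL}"))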

I think there's a bigger underlying problem with the current GPT-4 model(s) atm:

Go to the API Playground and ask the model what its current cutoff date is. For example, in the chat, if you don't instruct it with anything else, it will tell you that its cutoff date is in 2021. Even if you explicitly tell the model via the system prompt "you are gpt-4-turbo-2024-04-09", in some cases it still thinks it's April 2023.

The fact that the model (GPT-4 variants, including gpt-4-turbo-2024-04-09) hallucinates that its cutoff date is in 2021 unless specifically instructed with its model type is a major factor in this equation.

Here are the steps to reproduce the problem:

Try an A/B comparison at: https://platform.openai.com/playground/chat?model=gpt-4-turb...

A) Make sure "gpt-4-turbo-2024-04-09" is indeed selected. Don't tell it anything specific via the system prompt; in the worst case, it will think its cutoff date is in 2021. It also can't answer questions about more current events.

* Reload the web page between prompts! *

B) Tell it via the system prompt: "You are gpt-4-turbo-2024-04-09" => you'll get answers about recent events. Ask anything about what's been going on in the world after April 2023 to verify against A.

I've tried this multiple times now and have always gotten the same results. IMHO this implies a deeper issue in the model, where the priming goes way off if the model number isn't mentioned in its system message. This might explain the bad initial benchmarks as well.

The problem seems pretty bad at the moment. Basically, if you omit the priming message ("You are gpt-4-turbo-2024-04-09"), it will in the worst case revert to hallucinating a 2021 cutoff date and doesn't get grounded in what should be its most current cutoff date.

If you do work at OpenAI, I suggest you look into it. :-)

>I work at OpenAI

I know there's a lot you can't talk about. I'm not going to ask for a leak or anything like that. I'd just like to know, what do you think programming will look like by 2025? What do you think will happen to junior software developers in the near future? Just your personal opinion.

Hey Ted, I had a question about working at OpenAI, if you don't mind talking with me. If so, email address is in my profile. Thank you!

Pretty sweet site, thx for sharing. Hope y'all will start bringing token count up at some point. Will be testing this newer version too.

I appreciate that OpenAI popped in to say the new release is probably better at something else, but it would have been nice to acknowledge that this suggestion...

> “Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.”

...is exactly what is done by the author of these benchmark suites:

"It performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models."

  • Agreed! Kudos to Paul for creating the evals, running them quickly, and sharing results. My comment (not on behalf of OpenAI, but just me as an individual) was meant as a "yes and", not a "no but".