
Comment by bambax

1 year ago

Near the end, the quote from OpenAI researcher Jason Wei seems damning to me:

> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Results are "strong" but can't be felt by the user? What does that even mean?

But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.

"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."

But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?

He's speaking about his objective of making ever-stronger LLMs; for that, a secondary objective is measuring their real performance.

Human preference is not that good a proxy measurement: for instance, it can be gamed by making the model more assertive, which sharply reduces humans' ability to spot errors [0].

So what he's really saying is that non-rigorous human vibe checks (like those LMSys Chatbot Arena is built on, much as I love it) won't cut it anymore for evaluating models, because models are now past that point. Just as you can't tell how smart a smart person really is from a two-minute casual conversation.

[0]: https://openreview.net/pdf?id=7W3GLNImfS

  • It's trivial to come up with prompts that 4o fails on. If it's hard to come up with prompts that o1 succeeds on but 4o fails on, that implies the delta is not that great.

    • Or the delta depends on the nature of the problem/prompt, we haven't yet figured out which, the range of prompts with a large delta is relatively narrow, and so finding those examples is a work in progress?

  • i.e., when you can't beat them, make new metrics.

    And you can absolutely evaluate how smart someone is in a two-minute casual conversation. You won't be able to tell how well-versed they are in some niche topic, but %insert something about different flavors of intelligence and how they do not equate to subject matter expertise%

    • It’s a common pattern that AI benchmarks get too easy, so they make new ones that are harder.

  • As models improve, human preference will become a worse proxy measurement (e.g., as model capabilities surpass the human's ability to judge correctness at a glance). This can be due to more raw capability, or to more persuasion/charisma.

> Results are "strong" but can't be felt by the user? What does that even mean?

Not every conversation you have with a PhD will make it obvious that that person has a PhD. Someone can be really smart, but if you don't see them in a setting where they can express it, you'll have no way of fully assessing their intelligence. Similarly, if you only use OAI models with low-demand prompts, you may not be able to tell the difference between a good model and a great one.

> What does that even mean?

It explicitly says "Results on AIME and GPQA are really strong". So I would assume it means it gets a (statistically significantly, I assume) better score on the AIME and GPQA benchmarks than 4o does.

I think they are saying they have invented the screwdriver. We have all been using hammers to sink screws, but if you try this new tool it may be better. However, you will still encounter a lot of nails.

  • It's more like they're saying they have invented the screwdriver, but they haven't invented screws yet.

    But it doesn't feel right. It's unlikely the screwdriver would come first, and then people would go around looking for things to use it with, no?

    • It's more like they have invented a computer, an extremely versatile and powerful tool that can be used in many ways, but is not a solution to every problem.

      Now they need people to write software that uses this capability to perform useful tasks, such as text processing, working with spreadsheets and providing new ways of communication.


> But why? Why would we do that?

Because OpenAI needs a steady influx of money, big money. In order to do so, they have to convince the people who are giving them money that they are the best. An objective way to achieve this is by benchmarking. But once you enter this game, you start optimizing for benchmarks.

At the same time, in the real world, Anthropic is catching up to them in huge leaps, and for many users Claude 3.5 is already the default tool for daily work.

  • Agree completely.

    From a user perspective too: I was a subscriber from the first day of GPT-4 until about a month ago. I thought about subscribing for a month to check this out, but I am tired of the OpenAI experience.

    Where is Sora? Where is the version of ChatGPT that responds in real time to your voice? Remember the GPT-4 demo where you'd draw a website on a napkin?

    How about Q* lol. Strawberry/Q*/o1, "it is super dangerous, be very careful!"

    Quietly, Anthropic has just kicked their ass without all the hype, and I am about to go work in Sonnet instead of even bothering to check o1 out.

> Results are "strong" but can't be felt by the user? What does that even mean?

This means it often doesn't provide the answer the user is looking for. In my opinion, it's an alignment problem: people are very presumptuous and leave out a lot of detail in their requests. Take the "which is bigger, 9.8 or 9.11?" question: if you ask "numerically, which is bigger, 9.8 or 9.11?" it gets the correct answer; basically, it prioritizes a different meaning of "bigger".
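To make the ambiguity concrete, here is a minimal sketch in plain Python (illustrative only, nothing to do with how the model actually computes its answer): as decimal numbers 9.8 is larger, but under version- or chapter-style ordering 9.11 comes after 9.8.

```python
# Two readings of "bigger" for 9.8 vs 9.11 (illustrative sketch only).

def version_greater(x: str, y: str) -> bool:
    # Version/chapter-style ordering: split on "." and compare the parts as integers.
    return [int(p) for p in x.split(".")] > [int(p) for p in y.split(".")]

print(float("9.8") > float("9.11"))    # True: as decimal numbers, 9.8 > 9.11
print(version_greater("9.8", "9.11"))  # False: as versions/chapters, 9.11 comes after 9.8
```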

> But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine. But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?

Without better questions we can't test and prove whether it is getting more intelligent or is just wrong. If it is more intelligent than us, it might provide answers that don't make sense to us but are actually clever, 4D chess as they say. Again, an alignment problem; better questions help solve it.

  • The irony here is that Jason is speaking in the context of LLM development, which he lives and breathes all day.

    Reading his comments without framing them in that context makes them come off pretty badly: humans failing to understand what is being said because they don't have the context.

> we all need to find harder prompts

"One of the biggest traps for engineers is optimizing a thing that shouldn't exist." (from Musk I believe)

This is something we've been grappling with on my team. Many of the researchers in the org want to try all these reasoning techniques to increase performance, and my team keeps pushing back that we don't actually need that extra performance; we just want to decrease latency and cost.

  • So make the requirement to use a cheaper, lower-latency model and try to raise its performance to a satisfactory level. Assuming you are not already using the cheapest/lowest-latency model.

The stupidest thing about AI and automation is that they are being targeted at large corporations looking to cut jobs or 10x productivity, when all anyone actually wants is a robot to do their laundry and dishes.

  • Because a robot that does everyone's laundry is much closer to AGI than ChatGPT is. I'm dead serious.

    • Not really. You don't need to move wet clothes from the first machine to a second machine if you get one machine that does both jobs. That's very much not AGI. The second job, of taking dry crumpled clothes and folding them, also doesn't need an artificial general intelligence. It's very computationally expensive (as evidenced by the speed of https://pantor.github.io/speedfolding/, out of UC Berkeley) and a hard robotics question, but it's also very fixed function.

      When taking the clothes out of the combined washer-dryer, my laundry-folding robot isn't suddenly going to need to come up with a creative answer to a question I have about politics, or a new way to organize my board game collection, or a plan for refactoring some code, in order to fold the laundry. There are no logical leaps of reasoning or deep thinking required. My laundry-folding robot doesn't need to be creative to fold laundry, just to apply some very complex algorithms, some of which have yet to be discovered.

  • You're describing a dishwasher and a washing machine.

    • The GP is almost certainly describing a robot that can move dirty stuff into the machines, run them, and put away the clean stuff afterwards.

Don't you know by now?

Speaking with AI maxis, it's easy:

The AI is always right

You are always wrong

If AI might enable something dangerous, it was already possible by hand; scale is irrelevant

But also AI enables many amazing things not previously possible, at scale

If you don’t get the answers you want, you’re prompting it wrong. You need to work harder to show how much better the AI is. But definitely, it cannot make things worse at scale in any way. And anyone who wants regulations to even require attribution and labeling, is a dangerous luddite depriving humanity of innovations.