
Comment by ttoinou

1 month ago

> the excellent performance demonstrated by the models fully proves the crucial role of reinforcement learning in the optimization process

What if this reinforcement is just gaming the benchmarks (Goodhart's law) without providing better answers elsewhere? How would we notice?

A large amount of work in the last few years has gone into building benchmarks, because models have been going through and beating them at a fairly astonishing rate. It's generally accepted that passing any one of them does not constitute fully general intelligence, but the difficult part has been finding things that the models cannot do, so they are being given more and more difficult tasks. The ARC prize in particular was designed to focus on reasoning more than knowledge. The 87.5% score achieved in such a short time by throwing lots of resources at conventional methods was quite a surprise.

You can at least have a degree of confidence that they will perform well in the areas covered by the benchmarks (as long as the benchmarks weren't contaminated), and with enough benchmarks you get fairly broad coverage.

  • > It's generally accepted as true that passing any one of them does not constitute fully general intelligence but the difficult part has been finding things that they cannot do.

    It's pretty easy to find things they can't do. They lack a level of abstraction that even small mammals have, which is why you see them constantly failing when it comes to things like spatial awareness.

    The difficult part is creating an intelligence test that they score badly on. But that's more of an issue with treating intelligence tests as if they're representative of general intelligence.

    It's like having difficulty finding a math problem that Wolfram Alpha would do poorly on. If a human were able to solve all of these problems as well as Wolfram Alpha, they would be considered a genius. But Wolfram Alpha being able to solve those questions doesn't show that it has general intelligence, and coming up with more and more complicated math problems to test it with doesn't help us answer that question either.

    • Yeah, like asking them to use tailwindcss.

      Most LLMs actually fail that task, even in agent modes, and there is a really simple reason for that: tailwindcss changed their packages / syntax.

      And this is basically the kind of test that should be focused on: change things and see if the LLM can find a solution on its own. (...it can't)

      7 replies →

  • > does not constitute fully general intelligence but the difficult part has been finding things that they cannot do

    I am very surprised when people say things like this. For example, the best ChatGPT model continues to lie to me on a daily basis about even basic things. E.g. when I ask it to explain what code is contained on a certain line on GitHub, it just makes up the code, and the code it's "explaining" isn't found anywhere in the repo.

    From my experience, every model is untrustworthy and full of hallucinations. I have a big disconnect when people say things like this. Why?

    • Well, language models don't measure the state of the world - they turn your input text into a state of text dynamics, and then basically hit 'play' on a best guess of what the rest of the text from that state would contain. Part of why you're getting 'lies' is that you're asking questions whose answers couldn't really be said to be contained anywhere inside the envelope/hull of some mixture of thousands of existing texts.

      Like, suppose as a thought experiment that you got ten thousand random GitHub users, collected every documented instance of a time they referred to a line number of a file in any repo, and then tried to use those related answers to come up with a mean prediction for the contents of a wholly different repo. Odds are, you would get something like the LLM answer.

      My opinion is that it is worth it to get a sense, through trial and error (checking answers), of when a question you have may or may not be in a blindspot of the wisdom of the crowd.
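      A minimal sketch of that "hit 'play' on the text" framing, assuming the Hugging Face transformers library and the small gpt2 checkpoint (the prompt, repo name, and file path below are made up for illustration; the point is only that the model continues the text rather than looking anything up):

      ```python
      # The model continues text from the prompt's state; it never fetches the
      # repo, so the "code" it produces is a guess over training-data patterns.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")

      prompt = "Line 42 of src/utils.py in the acme/widgets repo contains:"
      inputs = tokenizer(prompt, return_tensors="pt")
      output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
      print(tokenizer.decode(output[0], skip_special_tokens=True))
      ```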

      1 reply →

    • I am not an expert, but I suspect the disconnect concerns the number of data sources. LLMs are good at generalising over many points of data, but not good at recapitulating a single data point like the one in your example.

    • I’m splitting hairs a little bit, but I feel like there should be a difference between how we think about current “hard(er)” limitations of the models and limits in general intelligence and reasoning. I.e., I think the grandparent comment is talking about overall advancement in reasoning and logic, and in that sense about finding things AI “cannot do”, whereas you’re referring to what I’d more classify as a “known issue”. Of course it’s an important issue that needs to get fixed, and yes, technically, until we don’t have that kind of issue we can’t call it “general intelligence”, but I do think the original comment is about something different than a few known limitations that probably a lot of models have (and that frankly you’d have thought wouldn’t be that difficult to solve!?)

      1 reply →

    • For clarity, could you say exactly what model you are using? The very best ChatGPT model would be a very expensive way to perform that sort of task.

    • Is this a version of ChatGPT that can actually go and check the web? If not, it is kind of forced to make things up.

The trick is that the benchmarks must have a wide enough distribution that a well-scoring model is potentially useful for the widest span of users.

There would also need to be a guarantee (or some way of checking the model) that providers don't just train on the benchmarks. Possible solutions are dynamic components (random names, numbers, etc.) or keeping parts of the benchmarks private.
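A rough sketch of the "dynamic components" idea (the item template, names, and scoring below are made up for illustration): the surface form of each question is re-randomized on every evaluation run, so a provider can't simply memorize question/answer pairs.

```python
import random

# Toy "dynamic" benchmark item: names and numbers change per run, so the
# exact question/answer string can't have been memorized from training data.
NAMES = ["Alice", "Bob", "Priya", "Chen", "Fatima"]

def make_item(rng: random.Random) -> dict:
    name = rng.choice(NAMES)
    apples = rng.randint(20, 90)
    eaten = rng.randint(1, 19)
    question = f"{name} has {apples} apples and eats {eaten}. How many are left?"
    return {"question": question, "answer": str(apples - eaten)}

def score(model_answer: str, item: dict) -> bool:
    # Exact-match scoring; a real benchmark would normalize the model output.
    return model_answer.strip() == item["answer"]

if __name__ == "__main__":
    rng = random.Random()  # fresh seed each evaluation run
    for item in (make_item(rng) for _ in range(3)):
        print(item["question"], "->", item["answer"])
```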

  • A common pattern is for benchmark owners to hold back X% of their set so they can independently validate that models perform similarly on the held-back set. See: the FrontierMath / OpenAI brouhaha.

Typically you train on one set and test on another. If the differences between the two sets are significant enough and the model still performs well on the test set, you can claim it has learned something useful [alongside gaming the benchmark that is the train set]. That "side effect" is always the useful part in any ML process.

If the test set is extremely similar to the train set then yes, it's Goodhart's law all around. For modern LLMs, it's hard to make a test set that is different from what they have trained on, because of the sheer expanse of the training data used. Note that the two sets are different only if they are statistically different; it is not enough that they simply don't repeat verbatim.
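A minimal sketch of that train/test idea with a naive verbatim-overlap check (the corpus and threshold are made up for illustration). Passing a check like this only rules out near-verbatim repeats; it says nothing about whether the two sets are statistically different:

```python
import random

# Flag a test item as "contaminated" if it shares too many word 5-grams
# with any training item. This catches near-verbatim repeats only.
def ngrams(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_item: str, train_items: list, threshold: float = 0.2) -> bool:
    test_grams = ngrams(test_item)
    if not test_grams:
        return False
    return any(
        len(test_grams & ngrams(t)) / len(test_grams) >= threshold
        for t in train_items
    )

if __name__ == "__main__":
    corpus = [
        "how do i sort a list of integers in python without using the built in sort",
        "how do i reverse a linked list in place using constant extra memory",
        "how do i sort a list of integers in python without using the built in sorted function",
        "what is the time complexity of binary search on a sorted array",
    ]
    random.shuffle(corpus)
    train, test = corpus[:2], corpus[2:]
    clean = [q for q in test if not is_contaminated(q, train)]
    print(f"{len(clean)}/{len(test)} test items pass the naive check")
```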

We've been able to pass the Turing Test on text, audio, and short-form video (think AIs passing coding tests over video). I think there's an important distinction now with AI streamers, where people eventually notice they are AIs. Soon there might pop up AI streamers where you can't tell they're an AI. However, there's a ceiling on how far digital interactions can take the Turing Test. The next big hurdle towards AGI is physical interaction, like entering a room.

I mean, all optimization algorithms do is game a benchmark. That’s the whole point.

The hard part is making the benchmark meaningful in the first place.