Comment by Lerc
1 month ago
A large amount of work in the last few years has gone into building benchmarks, because models have been going through and beating them at a fairly astonishing rate. It's generally accepted that passing any one of them does not constitute fully general intelligence, but the difficult part has been finding things that they cannot do, so benchmark designers keep setting more and more difficult tasks. The ARC prize in particular was designed to focus on reasoning rather than knowledge. The 87.5% score achieved in such a short time, by throwing lots of resources at conventional methods, was quite a surprise.
You can at least have a degree of confidence that they will perform well in the areas covered by the benchmarks (as long as they weren't contaminated), and with enough benchmarks you get fairly broad coverage.
> It's generally accepted as true that passing any one of them does not constitute fully general intelligence but the difficult part has been finding things that they cannot do.
It's pretty easy to find things they can't do. They lack a level of abstraction that even small mammals have, which is why you see them constantly failing at things like spatial awareness.
The difficult part is creating an intelligence test that they score badly on. But that's more of an issue with treating intelligence tests as if they're representative of general intelligence.
It's like having difficulty finding a math problem that Wolfram Alpha would do poorly on. If a human were able to solve all of those problems as well as Wolfram Alpha, they would be considered a genius. But Wolfram Alpha being able to solve those questions doesn't show that it has general intelligence, and coming up with more and more complicated math problems to test it with doesn't help us answer that question either.
Yeah, like ask them to use tailwindcss.
Most LLMs actually fail that task, even in agent modes, and there is a really simple reason for that: tailwindcss changed its packages/syntax.
This is basically the kind of test that should be focused on: change things and see if the LLM can find a solution on its own. (...it can't)
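To make that concrete, the breaking change looks roughly like this. This is a sketch from memory of the v3 -> v4 migration, so check the official upgrade guide for the exact details:

```ts
// postcss.config sketch -- illustrating the kind of breaking change,
// not an authoritative migration guide.

// tailwindcss v3 era: the `tailwindcss` package was itself a PostCSS plugin,
// and your CSS entry point used `@tailwind base/components/utilities;`.
export const v3PostcssConfig = {
  plugins: { tailwindcss: {}, autoprefixer: {} },
};

// tailwindcss v4: the PostCSS plugin moved to a separate `@tailwindcss/postcss`
// package, and the CSS entry point became a single `@import "tailwindcss";`.
export const v4PostcssConfig = {
  plugins: { "@tailwindcss/postcss": {} },
};
```

A model trained mostly on v3-era code keeps emitting the old setup unless you hand it the new docs.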
And if I take my regular ordinary commuter car off the paved road and onto the dirt, I get stuck in the mud. That doesn't mean the whole concept of cars is worthless; instead we paved roads all over the world. But for some reason with LLMs, the attitude is that their being unable to go offroad means everyone's totally deluded and we should give up on the whole idea.
How do they do if you include the updated docs in the context?
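Something like this sketch is what I have in mind (OpenAI's Node SDK; the model name, docs URL, and prompts are placeholders, and a real version would extract text from the HTML rather than pasting the page wholesale):

```ts
import OpenAI from "openai";

// Hypothetical sketch: pull the current docs into the prompt so the model
// isn't relying on stale training data.
const client = new OpenAI();

const docs = await (await fetch("https://tailwindcss.com/docs")).text();

const completion = await client.chat.completions.create({
  model: "gpt-4o", // placeholder model name
  messages: [
    { role: "system", content: `Answer using only these docs:\n\n${docs}` },
    { role: "user", content: "Set up tailwindcss with PostCSS in a new project." },
  ],
});

console.log(completion.choices[0].message.content);
```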
Can it solve the prime number maze?
> does not constitute fully general intelligence but the difficult part has been finding things that they cannot do
I am very surprised when people say things like this. For example, the best ChatGPT model continues to lie to me on a daily basis, even about basic things. E.g., when I ask it to explain what code is contained on a certain line on github, it just makes up the code, and the code it's "explaining" isn't found anywhere in the repo.
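I've started sanity-checking those answers by fetching the file myself, e.g.:

```ts
// Sanity-check a model's claim about "the code on line N" by fetching the
// real file. Owner, repo, path, and line number are made-up placeholders.
const [owner, repo, branch, path] = ["someuser", "somerepo", "main", "src/main.py"];
const lineNumber = 42;

const res = await fetch(
  `https://raw.githubusercontent.com/${owner}/${repo}/${branch}/${path}`
);
const lines = (await res.text()).split("\n");
console.log(`Actual line ${lineNumber}:`, lines[lineNumber - 1]);
// ...which, in my experience, rarely matches what the model "explained".
```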
From my experience, every model is untrustworthy and full of hallucinations. I have a big disconnect when people say things like this. Why?
Well, language models don't measure the state of the world - they turn your input text into a state of text dynamics, and then basically hit 'play' on a best guess of what the rest of the text from that state would contain. Part of why you get 'lies' is that you're asking questions whose answers can't really be said to be contained anywhere inside the envelope/hull of some mixture of thousands of existing texts.
Like, suppose for a thought experiment, that you got ten thousand random github users, collected every documented instance of a time that they had referred to a line number of a file in any repo, and then tried to use those related answers to come up with a mean prediction for the contents of a wholly different repo. Odds are, you would get something like the LLM answer.
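A toy version of what 'hitting play' means (the continuation table below is invented purely for illustration; a real model computes it with a neural net over its whole context):

```ts
// A language model answers by sampling continuations from a distribution
// learned over existing text -- it never looks at the actual repo.
const continuations: Array<[string, number]> = [
  ["line 42 is `return result;`", 0.4], // plausible given training data
  ["line 42 is `import os`", 0.35],     // also plausible, equally unchecked
  ["line 42 is a blank line", 0.25],
];

function sampleNext(dist: Array<[string, number]>): string {
  let r = Math.random();
  for (const [text, p] of dist) {
    r -= p;
    if (r <= 0) return text;
  }
  return dist[dist.length - 1][0];
}

console.log(sampleNext(continuations)); // a fluent guess, not a lookup
```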
My opinion is that it is worth it to get a sense, through trial and error (checking answers), of when a question you have may or may not be in a blindspot of the wisdom of the crowd.
I am not an expert, but I suspect the disconnect concerns the number of data sources. LLMs are good at generalising over many points of data, but not good at recapitulating a single data point, as in your example.
I’m splitting hairs a little bit, but I feel like there should be a difference in how we think about the current “hard(er)” limitations of the models versus limits in general intelligence and reasoning. I.e., I think the grandparent comment is talking about overall advancement in reasoning and logic, and about finding things AI “cannot do” in that sense, whereas you’re referring to what I’d classify as a “known issue”. Of course it’s an important issue that needs to get fixed, and yes, technically until that kind of issue is gone we can’t call it “general intelligence”, but I do think the original comment is about something different from a few known limitations that probably a lot of models share (and that, frankly, you’d have thought wouldn’t be that difficult to solve!?)
Yes, but I am just giving a recent example; I could also point to pure logic errors if I go back and search my discussions.
Maybe you are on to something with "classifying" issues; the types of problems LLMs have are hard to categorize, and hence hard to benchmark around. Maybe it is just a long tail of many different categories of problems.
It does this even if you give it instructions to make sure the code is truly in the code base? You never told it that it can't lie.
Telling an LLM 'do not hallucinate' doesn't make it stop hallucinating. Anyone who has used an LLM even moderately seriously can tell you that. They're very useful tools, but right now they're mostly good for writing boilerplate that you'll be reviewing anyhow.
For clarity, could you say exactly what model you are using? The very best ChatGPT model would be a very expensive way to perform that sort of task.
Is this a version of ChatGPT that can actually go and check on the web? If not, it is kind of forced to make things up.