Comment by tombert

4 days ago

Ok, but it's not AGI. People five years ago would have been wrong. People who don't have all the information are often wrong about things.

ETA:

You updated your comment, which is fine but I wanted to reply to your points.

> I would argue that LLMs are actually smarter than the majority of humans right now. LLMs do not have quite the agency that humans have, but their intelligence is pretty decent.

I would actually argue that they are decidedly not smarter than even dumb humans right now. They're useful but they are glorified text predictors. Yes, they have more individual facts memorized than the average person, but that's not the same thing; Wikipedia, even before LLMs, also had many more facts than the average person, but you wouldn't say that Wikipedia is "smarter" than a human, because that doesn't make sense.

Intelligence isn't just about memorizing facts, it's about reasoning. The recent Esolang benchmarks indicate that these LLMs are actually pretty bad at that.

> We don't have clear ASI yet, but we definitely are in an AGI era.

Nah, not really.

> They're useful but they are glorified text predictors.

There is a long history of people arguing that intelligence is actually the ability to predict accurately.

https://www.explainablestartup.com/2017/06/why-prediction-is...

> Intelligence isn't just about memorizing facts, it's about reasoning.

Initially, LLMs were basically intuitive predictors, but with chain of thought and, more recently, agentic experimentation, we do have reasoning in our LLMs that is quite human-like.

That said, there is definitely a bias toward training-set material, but that is also the case with the large majority of humans.

For the Esolang benchmarks, I would be curious how much adding a SKILLS.md file for each language would boost performance.

I am pretty confident that we are in the AGI era. It is unsettling, and I think it gives people cognitive dissonance, so we want to deny it and nitpick it, etc.

  • > There is a long history of people arguing that intelligence is actually the ability to predict accurately.

    That page describes a few recent CS people in AI arguing that intelligence is the ability to predict accurately, which is like carpenters declaring that all problems can be solved with a hammer.

    AI "reasoning" is human-like in the sense that it is similar to how humans communicate reasoning, but that's not how humans mentally reason.

    • Like my father before me, I seem to have absorbed an ability to predict what comes next in movies and books. It's sometimes a fun parlor trick to annoy people who actually get genuine surprise out of these nearly deterministic plot twists. But, a bit like with LLMs, it is a superficial ability to follow the limited context that the writers' group is seemingly forced by contract to maintain.

      Like my father before me, I've also gotten old enough to realize that some subset of people out there also behave like they are scripted by the same writers' group and production rules. I fear for the future where LLMs are on an equal footing because we choose to mimic them.

  • > There is a long history of people arguing that intelligence is actually the ability to predict accurately.

    There sure is, and in psychological circles it appears there's an argument that that is not the case.

    https://gwern.net/doc/psychology/linguistics/2024-fedorenko....

    > Initially, LLMs were basically intuitive predictors, but with chain of thought and, more recently, agentic experimentation, we do have reasoning in our LLMs that is quite human-like.

    If you handwave the details away, then sure, it's very human-like, though the reasoning models just kind of feed the dialog back to itself to get something more accurate. I use Claude Code like everyone else, and it will get stuck on the strangest details that humans actively wouldn't.

    > For the Esolang benchmarks, I would be curious how much adding a SKILLS.md file for each language would boost performance.

    Tough to say since I haven't done it, though I suspect it wouldn't help much, since there's still basically no training data for advanced programs in these languages.

    > I am pretty confident that we are in the AGI era. It is unsettling, and I think it gives people cognitive dissonance, so we want to deny it and nitpick it, etc.

    Even if you're right about this being the AGI era, that doesn't mean that current models are AGI, at least not yet. It feels like you're actively trying to handwave away details.

    • > though the reasoning models just kind of feed the dialog back to itself to get something more accurate.

      Much of our reasoning is based on stimulating our sensory organs, either via imagination (self-stimulation of our visual system) or via subvocalization (self-stimulation of our auditory system), etc.

      > it will get stuck on the strangest details that humans actively wouldn't.

      It isn't a human. It is AGI, not HGI.

      > It feels like you're actively trying to handwave away details.

      Maybe. I don't think so though.

What does AGI look like in your opinion?

Personally, I've used LLMs to debug hard-to-track code issues and AWS issues among other things.

Regardless of whether that was done via next-token prediction or not, it definitely looked like AGI, or at least very close to it.

Is it infallible? Not by a long shot. I always have to double-check everything, but at least it gave me solid starting points to figure out said issues.

It would've taken me probably weeks to figure that out without LLMs, instead of the 1 or 2 hours it did.

In that context, I have a hard time imagining what a "real" AGI system would look like, if not the current one.

Not saying current LLMs are unequivocally AGI, but they are darn close for sure IMO.

  • > What does AGI look like in your opinion?

    Being able to actually reason about things without exabytes of training data would be one thing. Hell, even with exabytes of training data, doing actual reasoning for novel things that aren't just regurgitating things from Github would be cool.

    Being able to learn new things would be another. LLMs don't learn; they're a pretrained model (it's in the name: GPT) into which you send inputs and get outputs. RAGs are cool, but they're not really "learning"; they're just eating a bit more context in order to give a kind of facsimile of learning.
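    To make the "eating a bit more context" point concrete, here is a toy sketch of retrieval-augmented prompting (the overlap-based retriever and prompt format are illustrative inventions, not any real library's API). The retrieved text just rides along in the prompt; the model's weights never change:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def rag_prompt(query: str, docs: list[str]) -> str:
    """The "learning" is just prompt construction: retrieved text is
    pasted in as extra context while the pretrained model stays frozen."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

    Everything here happens at inference time; nothing about the model is updated, which is the sense in which it's a facsimile of learning.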

    Taking what you're saying to its extreme, `grep` would be "darn close to AGI". If I couldn't grep through logs, it might have taken me years to go through and find my errors or understand a problem.

    I think they're very neat, but ultimately pretty straightforward input-output functions.

    • Why should implementation matter at all? You should be able to classify a black box as AGI or not.

      Well, I guess you lose artificial if there’s a human brain hidden in the box.

    If we had AGI, we wouldn't need to keep spending more and more money to train these models; they could just solve arbitrary problems through logic and deduction like any human. Instead, the only way to make them good at something is to encode millions of examples into text or find some other technique to tune them automatically (e.g. verifiable reward modeling with computer systems).

    Why is it that LLMs could ace nearly every written test known to man, but need specialized training in order to do things like reliably type commands into a terminal or competently navigate a computer? A truly intelligent system should be able to 0-shot those types of tasks, or in the absolute worst case 1-shot them.

    • To add to this, previously one could argue that LLMs were on par with somewhat less intelligent humans, and it was (at least I found) difficult to dispute. But now the frontier models can custom-tailor explanations of technical subjects in the advanced undergraduate to graduate range. Simultaneously, I regularly catch them making what for a human of that level would be considered very odd errors in reasoning.

      When questioned about these inconsistencies, they either display a hopeless lack of awareness or appear to attempt to deflect. They're also entirely incapable of learning from such an interaction. It feels like interacting with an empty vessel that presents an illusion of intelligence and produces genuinely useful output, yet there's nothing behind the curtain, so to speak.

> The recent Esolang benchmarks indicate that these LLMs are actually pretty bad at that.

I’m really not sure how well a typical human would do writing brainfuck. It’d take me a long time to write some pretty basic things in a bunch of those languages, and I’m an SE.

  • Yes, but you also wouldn't need a corpus of hundreds of thousands of projects to crib from. If it were truly able to "reason", then conceivably it could look at a language spec and learn how to express things in terms of Brainfuck.

    • They did for some problems. If you gave me five iterations at a problem like this in brainfuck:

      > "Read a string S and produce its run-length encoding: for each maximal block of identical characters, output the character followed immediately by the length of the block as a decimal integer. Concatenate all blocks and output the resulting string."

      I'd do absolutely awfully at it.
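      For reference, here is what that task asks for, sketched in Python rather than brainfuck (the function name is my own):

```python
def run_length_encode(s: str) -> str:
    """For each maximal block of identical characters, emit the
    character followed by the block length as a decimal integer."""
    out = []
    i = 0
    while i < len(s):
        # Advance j to the end of the current block of identical chars.
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{s[i]}{j - i}")
        i = j
    return "".join(out)

print(run_length_encode("aaabccd"))  # → a3b1c2d1
```

      Trivial in a high-level language; the difficulty of the benchmark is doing this under brainfuck's eight-instruction tape model, which is exactly where training data runs out.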

      And to be clear that's not "five runs from scratch repeatedly trying it" it's five iterations so at most five attempts at writing the solution and seeing the results.

      I'd also note that when they can iterate with feedback from the output, they get it right much more often than across n zero-shot attempts. That doesn't seem to correlate well with a lack of reasoning to me.

      Give them new frameworks or libraries and they can absolutely build things in them with some instructions or docs. So they're not just outputting previously seen things; the generalization is at least pattern-based rather than word-for-word.

      edit -

      I play Clues by Sam, a logical-reasoning puzzle. The solutions are unlikely to be available online, and in this benchmark the training cutoff date seems to be before the puzzle launched at all:

      https://www.nicksypteras.com/blog/cbs-benchmark.html

      Frankly, just watching them debug something makes it hard for me to say there's no reasoning happening at all.