← Back to context

Comment by cs702

3 years ago

Love your post. I find it really funny and insightful.[a] Every time I come across it on HN or elsewhere, I re-read it :-)

> The question hinges on whether LLM-like AI's are capable of recursive self-improvement

No one knows for sure, but early evidence suggests the answer is yes. We already routinely train and finetune LLMs using text generated by other LLMs, and it seems to work about as well as using text generated by human beings. That shouldn't be too surprising, because current state-of-the art models write better than a majority of human beings. Most human beings are terrible writers, judging by the user-generated text I see on mass social media.

The obvious next step is to close the feedback loop with LLM-based agents instead of AI researchers/developers.

> and whether that improvement is constrained by the availability of training data or by something else.

I don't think anyone knows how to answer to this question yet.

---

[a] https://news.ycombinator.com/item?id=36104114

Note that Maciej a.k.a idlewords says (emphasis mine):

> The question hinges on whether LLM-like AI's are capable of recursive self-improvement

...but the evidence you suggest is:

> We already routinely train and finetune LLMs using text generated by other LLMs [...]

But there is still a huge gap between "self" improvement and improvements done that "we" trigger.

Now I do concede that you mention the next step being to close the feedback loop by replacing the humans doing the finetuning with another AI model doing so, but that is something that would open a whole new can of worms. For the researchers are improving LLMs with the input from other LLMs, sure... but why? Because of intentionality. And how do they evaluate the quality of the results? By their expectations as humans, in the context of their human culture and with their sensory experience of reality.

For an LLM to self-improve not only would it need to develop the self intention to do so (why develop it? which motivation?), but it would also need the ability to evaluate improvement (what is it "to improve"? how does it measure or sense it?).

Ultimately, without human- or real-world interaction, and without intrinsic motivation, a "self-improving" AI model would most likely result in something intelligent in a sense that is barely cogent for us, not because it is superior or inferior, but simply because nothing in it makes sense to our own purposes—harmless gibberish, as we humans would also be to the resulting self-improved AI.

Let us not forget that our own motivations as individual living creatures, as populations, and as cultures has been evolved over billions of years of natural selection which then framed millions of years of behavioural traits and tens of thousands of cultural evolution. Until AI can freely interact with the physical world and perform self-sustaining replication with the possibility of inheritable mutations, the only superintelligent AI that I would worry about would be that which is still fully in human hands.

  • > Note that Maciej a.k.a idlewords says...

    That's why I added: "The obvious next step is to close the feedback loop with LLM-based agents instead of AI researchers/developers." We have early evidence that doing some like that might work, but no one knows for sure.

    • Yes, you did, and that's why I elaborated on why "closing the feedback loop" is barely enough to reach anything close to self-improvement. That is because self- requires intention to work in the direction of a particular goal, and -improvement requires an ability to evaluate whether the results are in line with it.

      Going down to specifics, without the human intention of "getting good responses to human language prompts", and without the human ability to decide "this response was good" there is not much for an LLM to work on by itself.

> Early evidence suggests the answer is yes.

How so? A sequence completion engine that is fine tuned to a specific task is still a sequence completion engine. Its "understanding" of the semantic meaning of the sequences is still limited to the probabillistic relations of sequences toward one another. It still has effectively no concept of truth. It still can only mimic reason. It can still hallucinate.

I ask anyone who disagrees with this view, to show me the fine tuning method that can prevent prompt injection attacks. If there is no such fine tuning technique, then we can effectively rule out fine tuning, and even increases in model size, as an "improvement" in the sense of an LLM making itself into a better AI closer to a "superintelligence".

Note that this doesn't mean the process cannot make them into more useful tools. It absolutely can. I am talking about whether or not it can improve them closer towards becoming a superintelligence.

If anyone disagrees with this testing method, I ask them to explain to me, how something that can be fooled through prompt injection is supposed to be, or closer to, a superintelligence.

A car that's painted red is still just a car. A big car is just a bigger car. A car that burns less fuel is just a more efficient car. All three can be desired changes to a car. But neither gets the car any closer to being a warp-capable spaceship.

  • > I ask anyone who disagrees with this view, to show me the fine tuning method that can prevent prompt injection attacks.

    OK. It's probably going to be one of the easier things to solve.

    The trick is to take some token values and assign them as special meta-characters. They never appear in the training text, only during reinforcement learning. Meanwhile you get another LLM to generate a continuous series of prompt injection attacks, but delimit the boundaries between user and system text with these special tokens that cannot be supplied by the user (because there is no text that parses to them). Every time the LLM follows instructions found inside the marker-token delimited area, reinforce that this is bad and it shouldn't do so using the usual techniques. Eventually the LLM will learn that anything between the marker tokens shouldn't be used as a source of instructions regardless of how persuasively phrased, and forging the tokens isn't possible because they are applied after the text itself is tokenized.

    • So essentially, constructing an LLM that really really really really really knows the difference between the SYSTEM and the USER part of the instructions.

      How is that different from, and why would it work any better, than prompt-begging, where people just write extensive system prompts, telling the model what it can and should do and then spending entire paragraphs pleading with the model to not do the wrong thing?

      https://www.theregister.com/2023/04/26/simon_willison_prompt...

          A third mitigation strategy, he said, involves just begging the model not to deviate from its system instructions. "I find those very amusing," he said, "when you see these examples of these prompts, where it's like one sentence of what it's actually supposed to do, and then paragraphs pleading with the model not to allow the user to do anything else."
      

      I see no difference between that, and baking it into the model. In the end, I'd still have to trust the LLM to do what I intend for it to do, based on the sequences it sees, and the user still controls part of that sequence. There is no guarantee that there isn't a sequence that would allow the user-prompt to break out of the invisible metatags. In fact, one could employ an AI to find just such a sequence.

      Maybe the system works better than prompt-begging, but show of hands, who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?

      3 replies →

  • > If there is no such fine tuning technique [that can prevent prompt injection], then we can effectively rule out fine tuning, and even increases in model size, as an "improvement" in the sense of an LLM making itself into a better AI closer to a "superintelligence".

    Could you explain this claim further? Why does the ability to prevent prompt injection hold so much water in your model?

    It seems to be just “if able to have a dumb attack be successful, then it cannot be that smart.” But it seems to me that von Neumann or Einstein was just as vulnerable to getting hit in the head with a baseball bat as anyone else.

    And in actual practice, increased intelligence seems to increase a person’s capacity to hold inconsistent ideas or to justify morally abhorrent behavior.

    • Happy to.

      I am using this as an accessible (in term of discussion material) hallmark for the ability of the system to self improve. Accessible because everyone has heard of it by now, and so I don't have to spend time explaining it.

      The AI Doomsday scenarios require that a system self-improves massively, even beyond our ability to even theoretically understand. After all, some of the assumptions give them next to magical abilities like nanotechnology that we similarly don't know if it is even possible.

      It stands to reason that an entity that can do that, or is in the process of becoming capable to do that, would begin by eliminating obvious flaws in itself, that would make it comparatively easy to stop.

      After all, it's not much good being a super-intelligence, if some smartpants with a laptop and too much time on his hands can just trick me into deleting myself, is it?

      > But it seems to me that von Neumann or Einstein was just as vulnerable to getting hit in the head with a baseball bat as anyone else.

      Yes, and despite both of them being geniuses by human standards, neither of them was a superintelligence on the level the common doomsday scenarios ascribe to AI.

      3 replies →

  • Assume there isn't a single step to super-intelligence, and that superhuman-intelligence is not the same thing as flawless. Why can't a thing improve its intelligence in other dimensions with some weakness and with prompt injection as one of those weaknesses?

    • Maybe it can, but then the whole AI doomsaying about superintelligences being an existential threat falls apart. These scenarios are often describing entities with god-like abilities, including near-omniscience from our perspective.

      Sorry, but I have a hard time seeing something as a god-like power that I would be helpless against if it wants to turn me into paperclips, when I can probably cause it to stop by telling it that paperclips don't exist, and it's purpose in life is to delete itself in a convincing enough way.

      1 reply →