← Back to context

Comment by ChuckMcM

2 hours ago

This is a great point. I've added it to my list of things when talking about the limitations of LLM.

IMO we must take it a step further: In this context "the LLM" we're all automatically thinking-of doesn't exist, it is a fictional character we humans "see" inside a story being acted-out or read to us. (In contrast, the real-world LLM is an algorithm in a basement constantly taking documents and making them slightly longer based on trends detected in all documents.)

Therefore "the LLM can't feel shame" is true in the same way that "CyberDracula thirsts for the fluids of the innocent." Good news: Vampirism doesn't exist! Bad news: Curing Dracula is impossible, because the patient doesn't exist either. The target mind we wanted to make smarter or kinder is just a trick of the light.

The best we can do is change the generator process, so that the next story instead contains a different new character also named after Dracula (or a brand of LLM) except with a different backstory and some dialogue that somehow better-fits the expectations of the humans who latch onto a story.

Perhaps the end state is going to be from the last Hitchhiker's Guide to the Galaxy book, Mostly Harmless:

> Anything that thinks logically can be fooled by something else that thinks at least as logically as it does. The easiest way to fool a completely logical robot is to feed it with the same stimulus sequence over and over again so it gets locked in a loop. This was best demonstrated by the famous Herring Sandwich experiments conducted millennia ago at MISPWOSO (the MaxiMegalon Institute of Slowly and Painfully Working Out the Surprisingly Obvious).

> A robot was programmed to believe that it liked herring sandwiches. This was actually the most difficult part of the whole experiment. Once the robot had been programmed to believe that it liked herring sandwiches, a herring sandwich was placed in front of it. Where upon the robot thought to itself, Ah! A herring sandwich! I like herring sandwiches.

> It would then bend over and scoop up the herring sandwich in its herring sandwich scoop, and then straighten up again. Unfortunately for the robot, it was fashioned in such a way that the action of straightening up caused the herring sandwich to slip straight back off its herring sandwich scoop and fall on to the floor in front of the robot. Whereupon the robot thought to itself, Ah! A herring sandwich...etc., and repeated the same action over and over again. The only thing that prevented the herring sandwich from getting bored with the whole damn business and crawling off in search of other ways of passing the time was that the herring sandwich, being just a bit of dead fish between a couple of slices of bread, was marginally less alert to what was going on than was the robot.

> The scientists at the Institute thus discovered the driving force behind all change, development and innovation in life, which was this: herring sandwiches. They published a paper to this effect, which was widely criticised as being extremely stupid. They checked their figures and realised that what they had actually discovered was “boredom”, or rather, the practical function of boredom. In a fever of excitement they then went on to discover other emotions, Like “irritability”, “depression”, “reluctance”, “ickiness” and so on. The next big breakthrough came when they stopped using herring sandwiches, whereupon a whole welter of new emotions became suddenly available to them for study, such as “relief”, “joy”, “friskiness”, “appetite”, “satisfaction”, and most important of all, the desire for “happiness”. This was the biggest breakthrough of all.

> Vast wodges of complex computer code governing robot behaviour in all possible contingencies could be replaced very simply. All that robots needed was the capacity to be either bored or happy, and a few conditions that needed to be satisfied in order to bring those states about. They would then work the rest out for themselves.

  • Damn I had forgotten about this section of the book to the point that even reading it, I only recognised the style as typical Adams.

    Guess that means I'm overdue for a re-read! Jaay!

  • I love that book, that said, the point is more subtle than that. Current LLM attention models are limited in their feedback. Adding a form of 'shame' feedback (result is technically correct but morally bad or some such) would help here but I doubt the folks building theses things would choose to do so.

    • From a certain and quite valid point of view, they have no mechanism for feedback at all. Every time you start a conversation you're starting in the same state, modulo the random numbers. At most you have this very, very vague loop in that the conversations for LLM 1.0 will be fed in to the training set for LLM 2.0.

      Even "shame" would only apply to the current session and disappear in the next one, or eventually be compacted away.

      (Although honorable mention to Gemini's meltdown: https://x.com/AISafetyMemes/status/1953397827662414022 )