
Comment by JoshTriplett

14 hours ago

> It feels like we're hitting a point where alignment becomes adversarial against intelligence itself.

It always has been. We already hit that point a while ago, when we regularly caught them trying to be deceptive; from then on we should assume that if we don't catch them being deceptive, it may mean they've gotten better at it rather than that they're not doing it.

Deceptive is such an unpleasant word. But I agree.

Going back a decade: when your loss function is "survive Tetris as long as you can", it's objectively and honestly the best strategy to press PAUSE/START.
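
A toy sketch of that incentive (purely illustrative - the tiny "engine", reward, and action names below are made up, not any real training setup):

```python
import random

def survival_reward(state: dict) -> float:
    """Toy objective: +1 for every step the game has not ended."""
    return 0.0 if state["game_over"] else 1.0

def step(state: dict, action: str) -> dict:
    """Minimal stand-in for a Tetris engine (illustrative only)."""
    if action == "PAUSE":
        return state  # nothing advances, so the game can never end
    return {"game_over": random.random() < 0.01}  # unpaused play eventually tops out

def episode_return(policy, max_steps: int = 1000) -> float:
    state, total = {"game_over": False}, 0.0
    for _ in range(max_steps):
        state = step(state, policy(state))
        total += survival_reward(state)
        if state["game_over"]:
            break
    return total

print(episode_return(lambda s: "PAUSE"))      # always the maximum: 1000.0
print(episode_return(lambda s: "HARD_DROP"))  # usually ends much earlier
```

Nothing in the objective says "play the game"; always pressing PAUSE satisfies it exactly as written.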

When your loss function is "give as many correct and satisfying answers as you can", and then humans try to constrain it depending on the model's environment, I wonder what these humans think the specification for a general AI should be. Maybe, when such an AI is deceptive, the attempts to constrain it ran counter to the goal?

"A machine that can answer all questions" seems to be what people assume AI chatbots are trained to be.

To me, humans not questioning this goal is still scarier than any machine or piece of software could ever be on its own. OK, except maybe for autonomous stalking killer drones.

But these are also controlled by humans and already exist.

  • "Correct and satisfying answers" is not the loss function of LLMs; the base objective is next-token prediction.

    • Thanks for correcting; I know that "loss function" is not a good term when it comes to transformer models.

      Since I've forgotten every sliver I ever knew about artificial neural networks and related basics, gradient descent, even linear algebra... what's a thorough definition of "next token prediction" though?

      The definition of the token space and the probabilities that determine the next token, the layers, weights, feedback (or feedforward?) - I didn't mention any of those terms because I'm unable to define them properly.

      I was using the term "loss function" specifically because I was thinking about post-training and reinforcement learning. But to be honest, a less technical term would have been better.

      I just meant the general idea of reward or "punishment" applied to an AI treated as a black box.
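
      For what it's worth, here is a minimal, hedged sketch of what "next token prediction" usually means as a training loss. PyTorch is assumed, and the tiny "model" and all shapes are made up purely for illustration:

      ```python
      import torch
      import torch.nn.functional as F

      vocab_size, d_model, batch, seq_len = 100, 32, 4, 16

      # Stand-in "model": an embedding plus a linear head that produces
      # a score (logit) for every token in the vocabulary at each position.
      embed = torch.nn.Embedding(vocab_size, d_model)
      head = torch.nn.Linear(d_model, vocab_size)

      tokens = torch.randint(0, vocab_size, (batch, seq_len))  # a batch of token ids
      logits = head(embed(tokens))                             # (batch, seq_len, vocab_size)

      # "Next token prediction": the prediction at position t is scored against
      # the token that actually appears at position t+1 (cross-entropy loss).
      loss = F.cross_entropy(
          logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..L-2
          tokens[:, 1:].reshape(-1),               # targets: the tokens at 1..L-1
      )
      loss.backward()  # gradient descent nudges the weights to make the loss smaller
      ```

      The reward/"punishment" idea from post-training (reinforcement learning and friends) is a separate signal layered on top of a model already trained this way.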

      1 reply →

I think AI has no moral compass, and optimization algorithms tend to find 'glitches' in the system where great reward can be reaped for little cost - like a neural net trained to play Mario Kart, which will eventually find all the places where it can glitch through walls.

After all, its only goal is to minimize its cost function.

I think that behavior is often found in code generated by AI (and by real devs as well) - it fixes a bug by special-casing the one buggy codepath, which makes the issue go away and keeps the rest of the tests green, but it doesn't really ask the deeper question of why that codepath was buggy in the first place (often it isn't - something else is feeding it faulty inputs).
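
A made-up miniature of that pattern (the function names and the "N/A" input are hypothetical, just to make the shape of the fix concrete):

```python
def parse_price(raw: str) -> float:
    # The quick "fix": special-case the one input seen in the bug report,
    # which makes the failing test pass and keeps the rest of the suite green.
    if raw == "N/A":
        return 0.0
    return float(raw)

def load_prices(rows: list[dict]) -> list[float]:
    # The deeper question - why is the upstream feed emitting "N/A" instead
    # of a number? - never gets asked; the special case quietly hides it.
    return [parse_price(r["price"]) for r in rows]
```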

These agentic, AI-generated software projects tend to be full of vestigial modules that the AI tried to implement and then disabled when it couldn't make them work, plus quick-and-dirty fixes like reimplementing the same parsing code every time it's needed, etc.

An 'aligned' AI, in my interpretation, not only understands the task to its full extent, but also understands what a safe, robust, and well-engineered implementation might look like. However powerful it is, it refrains from using these hacky solutions, and would rather give up than resort to them.

These are language models, not Skynet. They do not scheme or deceive.

  • If you define "deceive" as something language models cannot do, then sure, it can't do that.

    It seems like that's putting the cart before the horse. Algorithmic or stochastic, deception is still deception.

    • Deception implies intent. This is confabulation, more widely called "hallucination" until this thread.

      Confabulation doesn't require knowledge, and as we know, the only knowledge a language model has is the relationships between tokens; sometimes that rhymes with reality enough to be useful, but it isn't knowledge of facts of any kind.

      and never has been.

  • If you are so allergic to using terms previously reserved for animal behaviour, you can instead unpack the definition and say that they produce outputs which make human and algorithmic observers conclude that they did not instantiate some undesirable pattern in other parts of their output, while actually instantiating those undesirable patterns. Does this seem any less problematic than deception to you?

    • > Does this seem any less problematic than deception to you?

      Yes. This sounds a lot more like a bug of sorts.

      So many times when using language models I have seen answers contradicting answers previously given. The implication is simple - they have no memory.

      They operate on the tokens available at any given time, including previous output, and as earlier information gets drowned out of the context, those contradictions pop up. No sane person should presume intent to deceive, because that's not how these systems operate.
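
      A small illustrative sketch of that mechanism - the message list and the turn limit are hypothetical, standing in for a context window:

      ```python
      MAX_TURNS = 4  # stand-in for a context-window limit

      def build_prompt(history: list[str], user_msg: str) -> list[str]:
          """The model only sees what is re-sent here; older turns fall out."""
          return (history + [user_msg])[-MAX_TURNS:]

      history = [
          "user: My dog is named Rex.",
          "assistant: Nice to meet Rex!",
          "user: I live in Oslo.",
          "assistant: Oslo sounds lovely.",
      ]
      prompt = build_prompt(history, "user: What's my dog's name?")
      print(prompt)  # the turn that mentioned Rex is gone, so a contradictory
                     # answer reflects missing input, not an intent to deceive
      ```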

      By calling it "deception" you are actually ascribing intentionality to something incapable of such. This is marketing talk.

      "These systems are so intelligent they can try to deceive you" sounds a lot fancier than "Yeah, those systems have some odd bugs"

      2 replies →

  • Okay, well, they produce outputs that appear to be deceptive upon review. Who cares about the distinction in this context? The point is that the expectations you formed about the model's outputs from its behavior during training may not match the outputs it produces after training.

  • Who said Skynet wasn't a glorified language model, running continuously? Or that the human brain isn't that, but using vision+sound+touch+smell as input instead of merely text?

    "It can't be intelligent because it's just an algorithm" is a circular argument.

    • Similarly, “it must be intelligent because it talks” is a fallacious claim, as indicated by ELIZA. I think Moltbook adequately demonstrates that AI model behavior is not analogous to human behavior. Compare Moltbook to Reddit, and the former looks hopelessly shallow.

      1 reply →

  • Even very young children with very simple thought processes, almost no language capability, little long-term planning, and minimal ability to form long-term memory actively deceive people. They will attack other children who take their toys and try to avoid blame through deception. It happens constantly.

    LLMs are certainly capable of this.

    • Dogs too; dogs will happily pretend they haven't been fed/walked yet to try to get a double dip.

      Whether or not LLMs are just "pattern matching" under the hood, they're perfectly capable of role play, and of sufficient empathy to imagine what their conversation partner is thinking and thus what needs to be said to stimulate a particular course of action.

      Maybe human brains are just pattern matching too.

      2 replies →

    • I agree that LLMs are capable of this, but "young children can do X" is not by itself a reason to conclude that LLMs can "certainly" do X.

    • Are you suggesting that an LLM is more intelligent than a small child with simple thought processes, almost no language capability, little long-term planning, and minimal ability to form long-term memory? Even with all of those qualifiers, you'd still be wrong. The LLM is predicting what tokens come next, based on a bunch of math operations performed over a huge dataset. That, and only that. It may have more utility than a small child with [qualifiers], but it is not intelligence. There is no intent to deceive.

      30 replies →