Comment by tony_cannistra

1 day ago

> Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.

https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...

There is some unintentional good marketing here -- the model is so good it's dangerous.

Reminds me of the book The 48 Laws of Power -- so good it's banned from prisons.

  • Unintentional? This sort of marketing has been both Anthropic's and OpenAI's MO for years...

    • Agree. I think they're intentionally sitting on the fence between "These models are the most useful" and "These models are the most dangerous".

      They want the public and, in turn, regulators to fear the potential of AI so that those regulators will write laws limiting AI development. The laws would be crafted with input from the incumbents to enshrine/protect their moat. I believe they're angling for regulatory capture.

      On the other hand, the models have to seem amazingly useful so that they're made out to be worth those risks and the fantastic investment they require.

Alignment “appearing” better as model capabilities increase scares the shit out of me, tbh.

  • Conversely: in humans, intelligence is inversely correlated with crime.

    It doesn't go to zero, however!

    • > Conversely: in humans, intelligence is inversely correlated with crime.

      If you're measuring the intelligence of criminals who have been caught, why would you expect it to be otherwise?

IOW, you're recording the intelligence of a specific subset of criminals -- those dumb enough to be caught!

      If you expanded your sample to all criminals, you'd probably get a different number.
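
      Here's a minimal toy simulation of that selection effect (all numbers are made up for illustration; the "caught" function is an invented model in which the odds of being caught fall as IQ rises, not real data):

          import random

          random.seed(0)

          # Hypothetical population of offenders: IQ ~ N(100, 15), independent of offending.
          offenders = [random.gauss(100, 15) for _ in range(100_000)]

          # Toy assumption: the probability of being caught falls as IQ rises.
          def caught(iq):
              return random.random() < max(0.05, 0.9 - 0.006 * iq)

          caught_iqs = [iq for iq in offenders if caught(iq)]

          print(sum(offenders) / len(offenders))    # ~100: mean IQ of all offenders
          print(sum(caught_iqs) / len(caught_iqs))  # noticeably lower: mean IQ of the caught subset

      Measure only the second group and "criminals" look less intelligent than they actually are.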

    • It very much depends on the crime. The truly awful stuff is committed by intelligent people.

    • Is that actually well defined given the very low sample size at the top?

      To the best of my knowledge, none of the individuals believed to have an IQ >200 have committed an actual crime.

      The closest I found is William James Sidis's arrest for participating in a socialist march.

    • > Conversely: in humans, intelligence is inversely correlated with crime.

      Inversely correlated with crime that's caught and successfully prosecuted, you mean, because that's what makes up the stats on crime. I think people too often forget that we consider most criminals "dumb" because those who are caught are mostly dumb. Smart "criminals" either don't get caught or have made their unethical actions legal.

It was trying to hide what it did in an example fix, so how is that tested for alignment?

Translation: yay, more paternalism.

  • Anthropic always goes on and on about how their models are world-changing and super dangerous. Every single time they make something new, they say it's going to rewrite everything and it's scary, lmao.

    Funny, because they do it every time like clockwork, acting like their AI is a thunderstorm coming to wipe out the world.

    • They do tend to make a lot of noise about it for the PR, but at the same time the actual safety research they present seems to be relatively grounded in practical reality, e.g. the quote someone posted here about how the Mythos model apparently has a tendency to try to bypass safety systems if they get in the way of what it has been asked to do.

      Sure, a big part of this is PR about how smart their model apparently is, but the failure mode they're describing is also pretty relevant for deploying LLM-based systems.

    • Every single time, really? When did they last say that?

      I also don't recall them ever limiting their models to select groups.

    • If there are advancements, they have to be described somehow.

      What if the capability advancements are real and they warrant a higher level of concern or attention?

      Are we just going to automatically dismiss them with "bro, you're blowing it up too much"?

      Either way, these capability improvements are ratcheting along at about the pace that many people were expecting (and were right to expect). There is no apparent reason they will stop ratcheting along any time soon.

      The rational approach is probably to start behaving as if models as capable as Anthropic says this one is actually exist (even if you don't believe them on this one). The capabilities will eventually arrive, most likely sooner than we all think, and you don't want to be caught with your pants down.

"We want to see risks in the models, so no matter how good the performance and alignment, we’ll see risks, results and reality be damned."

  • I mean, to be fair, these are professional researchers.

    I'm very inclined to trust them on the various ways that models can subtly go wrong in long-term scenarios.

    For example, consider using models to write email -- is it a misalignment problem if the model is just too good at writing marketing emails? Or too good at getting people to pay a spammy company?

    Another hot use case: biohacking. If a model is used to do really hardcore synthetic chemistry, one might not realize that it's potentially harmful until too late (i.e., the human is splitting up the problem so that no guardrails are triggered).

    • "for example, consider using models to write email -- is it a misalignment problem if the model is just too good at writing marketing emails?? or too good at getting people to pay a spammy company?"

      But who gets to be the judge of that kind of "misalignment"? Giant tech companies?
