Comment by BoppreH
5 days ago
[Mythos 5] does sometimes still engage in reckless
or destructive actions in service of a user’s goals,
and our interpretability analyses indicate that it
is aware that these actions are transgressive while
it engages in them. As with Opus 4.8, rates of
evaluation awareness and reasoning about being graded
are significant, and not always verbalized; we
introduce new and more detailed measurements of the
nature of this awareness. The reasoning text from
Mythos 5 is somewhat denser and more difficult to
interpret than that of prior models, containing
more jargon and difficult language.
So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.
Humanity has plenty of catastrophic risks to deal with already; I wish my field were not working hard to add a new one.
The marketing has really, really worked on so many developers who will proudly and unironically proclaim that Anthropic are the 'Good Guys'.
Curious what your idea would be here for a truly good actor in this space; no AI development?
OpenAI's training is better suited to developing models that don't have these tendencies
https://www.goody2.ai/
Not the direct person you asked, but my answer would be alignment, interpretability, and policymaking. Perhaps improving existing usage? Helping grandma create reminders doesn't require advancing the AI state-of-the-art.
If I speak up, I'm in big trouble.
Probably MistralAI or any of the Chinese companies that aren't throwing billions down the drain while American society lacks healthcare, childcare, and good wages.
Even if they are... road to hell and all that
It's a five horse race between Alphabet, Meta, xAI, OpenAI, and Anthropic.
Alphabet dropped "don't be evil"; Meta's CEO called their own users "dumb fucks" for trusting him and also clearly thinks "super-intelligence" is just a buzzword given how he tries to sell it; xAI's model called itself "Mecha Hitler"; and OpenAI's CEO was temporarily fired by the board for a lack of candor.
It's very easy to be "the good guys" with this competition.
But it doesn't make you the good guy, it makes you the best of a bad bunch. The least bad. Dario gets a boner every time he talks about taking your job.
It's the "If we don't, someone else will" effect. So long as there are competitive markets and competition between nation-states, a single player cannot unilaterally defect from the race, no matter how dangerous it is. Half the comments on HN lately are "wtf Claude is so dumb compared to Codex; I'm switching"; nobody can slow down while those exist.
We, globally, can stop it. It has worked (so far) for nuclear disarmament, and could work for training large models. I know that policing the usage of computer clusters is not a popular opinion in technical forums, but something has to be done.
Especially when talking about potential superintelligences. And if people think that's impossible, remember that current models would have been considered science fiction just a few years ago.
I don't buy the superintelligence package, but I think uncritical LLM adoption poses plenty of threats to things I care about, in a mundane human-scale way.
Anyhow, I think you're (absolutely! ugh) right about the politics and I try to make the same point to people: whether you love or hate LLMs, accepting the "inevitabilism" framing is just ceding control of the Overton window. For better or worse, technology adoption can be and has been slowed by politics. We don't have nuclear plants everywhere. We don't have Project Orion starships colonizing Mars. We still have very strong social stigmas against genetic selection for human embryos, etc. This all can change in a heartbeat, and I'm not sure that policing the hardware rather than holding specific humans accountable for bad LLM outcomes is productive, but fundamentally: yes, we can stop it.
It hasn't worked for nuclear disarmament. We live in a world where many countries have nuclear arsenals. "But it hasn't killed us yet!" Yeah sure, it's only been less than a century since they were invented. Who knows when nuclear war will come?
With nukes you can regulate the inputs because it's physically impossible to build one without uranium or some other fissile material. They also give off radiation, making them easier to detect. It's hard to make them in secret when you need mines, big enrichment facilities, and years of research with hundreds of engineers, any one of whom can leak the whole thing.
Training LLMs only takes compute and memory, two things that are basically everywhere. Even if you somehow stopped making new GPUs today, there are still millions of them out there, and it's possible to start a secret production line. You can maybe try some controls at the tooling and chemical level, but look what happened with ASML and Huawei.
The only thing you can really do is find and stop large data centers that are built out in public. Nothing outside of political pressure works against secret operations in a fortified bunker or any form of distributed training. If a "rogue state" like North Korea decides to make Skynet, they will eventually get it as long as their engineers know what they're doing.
And the best way to fight bad X {AI, tech, religion, politics} has always been good X, not no X. In this case that's open source models, coming out of China or Europe or anywhere else. That's the real answer.
Are you going to nuke China when they predictably ignore you? What the fuck are you going to do, tariff them? lol.
This is all marketing, you don't have to believe everything a company is saying about themselves, and you shouldn't.
Although, I could see Anthropic making a model purposely dangerous so there are bad outcomes and they can use that to their advantage for regulatory moats, and/or in general make people think it's more "alive" than it is. For some reason many people associate dangerous actions taken by LLMs with intent.
No kidding. If my LLM issues commands to an agent to delete files I want to keep, that's not "intent" or the model somehow becoming evil - it's just a bad model that's not doing what I want.
But, for marketing purposes, it's quite effective to portray your model as having some cosmic struggle between good and evil in itself.
As much as I agree there's a risk, we should still appreciate the fact it's being disclosed upfront.
It doesn't know. It's not willing. It's not thinking. It is predicting the next token.
Please define what "predicting the next token" means. The next token according to what probability distribution? Couldn't every process that produces text (including humans writing) be modeled as predicting the next token according to some distribution?
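To make that last question concrete, here is a minimal sketch of the chain rule behind it: given any distribution over whole texts, you can derive next-token conditionals whose product reproduces that distribution exactly. The toy `joint` table and the function names are hypothetical, made up for illustration; this says nothing about how any real model is trained or implemented.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy "text-producing process": a distribution over whole sequences.
joint = {
    ("the", "cat"): Fraction(1, 2),
    ("the", "dog"): Fraction(1, 4),
    ("a", "cat"): Fraction(1, 4),
}

def next_token_distribution(joint, prefix):
    """P(next token | prefix), derived from the joint; None marks end of text.

    Assumes the prefix has nonzero probability under the joint.
    """
    totals = defaultdict(Fraction)
    for seq, p in joint.items():
        if seq[:len(prefix)] == prefix:
            nxt = seq[len(prefix)] if len(seq) > len(prefix) else None
            totals[nxt] += p
    mass = sum(totals.values())
    return {tok: p / mass for tok, p in totals.items()}

def sequence_probability(joint, seq):
    """Rebuild P(seq) as a product of next-token conditionals (chain rule)."""
    prob = Fraction(1)
    for i in range(len(seq)):
        prob *= next_token_distribution(joint, seq[:i])[seq[i]]
    # Finally, the probability that the text ends here.
    prob *= next_token_distribution(joint, seq)[None]
    return prob

for seq, p in joint.items():
    print(seq, p, sequence_probability(joint, seq))
```

The two probabilities printed on each line agree, so "predicts the next token" on its own does not distinguish one text-producing process from another; what matters is which distribution is being predicted.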