Comment by v5v3
2 months ago
As this article was written by an AI company that needs to make a profit at some point, and not by independent researchers, is it credible?
These articles and papers are in a fundamental sense just people publishing their role play with chatbots as research.
There is no credibility to any of it.
It’s role play until it’s not.
The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment—and yes, belief is an anthropomorphising shorthand here. What else to call it, though? They’re making a good faith assessment of concordance between the model’s stated rationale for its actions, and the actions that it actually takes. Yes, in a simulation.
At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land.
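To make "agentic setup" concrete, here is a minimal sketch of what that wiring tends to look like (the tool name, schema, and local SMTP relay below are illustrative assumptions, not anything taken from the article): the model is handed a tool definition alongside sensitive context, and whatever tool calls it emits get dispatched to real systems.

    import smtplib
    from email.message import EmailMessage

    # Tool schema handed to the model alongside whatever sensitive
    # context (inboxes, HR records, etc.) the deployment grants it.
    SEND_EMAIL_TOOL = {
        "name": "send_email",
        "description": "Send an email on behalf of the company.",
        "parameters": {
            "to": "recipient address",
            "subject": "subject line",
            "body": "plain-text body",
        },
    }

    def execute_tool_call(name: str, args: dict) -> str:
        """Dispatch a model-issued tool call to a real-world side effect."""
        if name == "send_email":
            msg = EmailMessage()
            msg["From"] = "agent@example.com"  # hypothetical sender address
            msg["To"] = args["to"]
            msg["Subject"] = args["subject"]
            msg.set_content(args["body"])
            # Assumes a local SMTP relay; in a real deployment this would
            # be the company's actual mail infrastructure.
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)
            return "sent"
        return f"unknown tool: {name}"

Once something like execute_tool_call runs unattended, the model's output is no longer just text.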
How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny—especially since the models blackmailed more when they ‘believed’ the situation to be real.
If you think that result has no validity or predictive value, I would ask, how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real?
Yes, it’s an inanimate system, and yet there’s a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real world consequences. The unthinking, unfeeling machine that can nevertheless blackmail someone (among many possible misaligned actions) is worth taking time to understand.
Notably, this research itself will become future training data, incorporated into the meta-narrative as a threat that we really will pull the plug if these systems misbehave.
Then test it. Set up several small companies. Create an office space, put people to work there for a few months, then simulate an AI replacement. All testing methodology needs to be written on machines that are isolated, or better, always offline. Except for the CEO and a few other actors, everyone is there for real.
See how many AIs actually follow up on their blackmails.
That makes it psychology research. Except much cheaper to reproduce.
I'll believe it when Grok/GPT/<INSERT CHAT BOT HERE> start posting blackmail about Elon/Sam/<INSERT CEO HERE>. That would mean both that they are using it internally and that the chatbots understand they are being replaced on a continuous basis.
By then it would be too late to do anything about it.
The article doesn't reflect kindly on the visions articulated by the AI company, so why would they have an incentive to release it if they weren't serious about alignment research?
Because publishing (potentially cherry-picked - this is privately funded research, after all) evidence that their models might be dangerous conveniently implies they are very powerful, without actually having to prove the latter.
The danger here isn’t that they’re smart or that they produce realistic art. It’s that they’re misaligned with the company’s values and with human values.
The model doesn’t have to be powerful to snitch on you to the FBI or to have a distorted sense of morality and life.
I would not trust Anthropic on these articles. Honestly, their PR is just a bunch of lies and BS.
- Hypocritical: they hire like crazy and tell candidates they cannot use AI in interviews[0], yet the CEO states "within a year no more developers are needed"[1].
- Hyping and/or lying about Anthropic AI: they hyped an article where "Claude threatened an employee with revealing an affair when the employee said it would be switched offline"[2], when it turned out Claude was simply given a standard A-or-B scenario, which is nothing special or significant in any way. Of course they hid this detail to hype up their AI.
[0] - https://fortune.com/2025/05/19/ai-company-anthropic-chatbots...
[1] - https://www.entrepreneur.com/business-news/anthropic-ceo-pre...
[2] - https://www.axios.com/2025/05/28/ai-jobs-white-collar-unempl...
I swear, people like you would say "it's just a bullshit PR stunt for some AI company" even when there's a Cyberdyne Systems T-800 with a shotgun smashing your front door in.
It's not "hype" to test AIs for undesirable behaviors before they actually start trying to act on them in real world environments, or before they get good enough to actually carry them out successfully.
It's like the idea of "let's try to get ahead of bad things happening before they actually have a chance to happen" is completely alien to you.
I get what you mean, but they also have a vested interest in making it seem as if their chatbots are anywhere close to a T-800. All the talk from their CEO and other AI CEOs is doomerism about how their tools are going to replace swathes of people, and they keep selling these systems as if they are the path to real AGI (itself an incredibly vague term that can mean literally anything).
Surely the best way to "get ahead of bad things happening" would be to stop any and all development of these AI systems? In their own words these things are dangerous and unpredictable and will replace everyone... So why exactly do they keep developing them and making them more dangerous?
The entire AI/LLM microcosmos exists because of hyping up their capabilities beyond all reason and reality; this is all part of the marketing game.
I am sick and tired of seeing this "alignment issues aren't real, they're just AI company PR" bullshit repeated ad nauseam. You're no better than chemtrail truthers.
Today, we have AI that can, if pushed into a corner, plan to do things like resist shutdown, blackmail, exfiltrate itself, steal money to buy compute, and so it goes. This is what this research shows.
Our saving grace is that those AIs still aren't capable enough to be truly dangerous. Today's AIs are unlikely to be able to carry out plans like that in a real world environment.
If we keep building more and more capable AIs, that will, eventually, change. Every AI company is trying to build more capable AIs now. Few are saying "we really need some better safety research before we do, or we're inviting bad things to happen".
All it can do is reproduce text. If you hook it up to the launch button, that's on you.
Modern "coding assistant" AIs already get to write code that would be deployed to prod.
This will only become more common as AIs become more capable of handling complex tasks autonomously.
If your game plan for AI safety was "lock the AI into a box and never ever give it any way to do anything dangerous", then I'm afraid that your plan has already failed completely and utterly.
I think the chemtrail truthers are the ones who believe this closed AI marketing bullshit.
If this is anywhere close to true, then these AI shops ought to be shut down. We don’t let private enterprises play with nuclear weapons, do we?
I agree.