Comment by redfloatplane

1 day ago

The system card for Claude Mythos (PDF): https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...

Interesting to see that they will not be releasing Mythos generally. [edit: Mythos Preview generally - fair to say they may release a similar model but not this exact one]

I'm still reading the system card but here's a little highlight:

> Early indications in the training of Claude Mythos Preview suggested that the model was likely to have very strong general capabilities. We were sufficiently concerned about the potential risks of such a model that, for the first time, we arranged a 24-hour period of internal alignment review (discussed in the alignment assessment) before deploying an early version of the model for widespread internal use. This was in order to gain assurance against the model causing damage when interacting with internal infrastructure.

and interestingly:

> To be explicit, the decision not to make this model generally available does _not_ stem from Responsible Scaling Policy requirements.

Also really worth reading is section 7.2, which describes how the model "feels" to interact with. That's also what I remember from their release of Opus 4.5 in November - in a video, an Anthropic employee described how they 'trusted' Opus to do more with less supervision. I think that is a pretty valuable benchmark at a certain level of 'intelligence'. Few of my co-workers could pass SWE-bench, but I would trust quite a few of them, and it's not entirely the same set.

Also very interesting is that they believe Mythos is higher risk than past models as an autonomous saboteur, to the point they've published a separate risk report for that specific threat model: https://www-cdn.anthropic.com/79c2d46d997783b9d2fb3241de4321...

The threat model in question:

> An AI model with access to powerful affordances within an organization could use its affordances to autonomously exploit, manipulate, or tamper with that organization’s systems or decision-making in a way that raises the risk of future significantly harmful outcomes (e.g. by altering the results of AI safety research).

This opens up an interesting new avenue for corporate FOMO. What if you don't partner with Anthropic, miss out on access to their shiny new cybersec model, and then fall prey to a vuln that the model would have caught?

  • Since when did corporations care? Most seem to just pay their insurance premium for cyber liability and call it a day.

    • There is a difference between leaking user accounts and passwords and having your business entirely destroyed overnight.

      Imagine an AI that can infiltrate your SaaS infrastructure and delete your entire database along with every single backup. The business is dead immediately.

      3 replies →

  • This seems to be the mind-games play: FOMO for the moment, and if they push it successfully, you could even be labeled negligent for not paying them for it.

If it is as dangerous as they make it appear, 24 hours does not seem like sufficient time. I cannot accept this as a serious attempt.

  • Agreed. I've been running autonomous LLM agents on daily schedules for weeks. The failure modes you worry about on day one are completely different from what actually shows up after the agents have history and context. 24 hours captures the obvious stuff.

  • 24 hours before general internal access seems fine. The model doesn't have general external access.

  • Time doesn't mean much; what matters is what they did in those 24 hours. If all they did was talk about it, then it could be 1,000 years and it wouldn't matter. What safety checks are in place?

    Do they have a honeypot infrastructure to launch the model in first, and then wait to see whether it destroys it?

are we cooked yet?

Benchmarks look very impressive! Even if they're flawed, they still translate to real-world improvements.

  • People say we're cooked every single day. The only response is to continue life as if we aren't. When we are, you won't have to ask that question.

  • Yep, I think the lede might be buried here and we're probably cooked (assuming you mean SWEs; the writing has been on the wall for 4 months).

    I guess I'm still excited. What's my new profession going to be? Longer term, are we going to solve diseases and aging? Or are the ranks going to thin from 10B to 10000 trillionaires and world-scale con-artist misanthropes plus their concubines?

    • Your new profession will be attempting to find enough gig work to eat. You will also be competing with self-driving taxis, so there's that as well.

      1 reply →

    • If wealth becomes too captured at the top, the working class become unable to be profitably exploited - squeezing blood from a stone.

      When that happens, the ultra wealthy dynasties begin turning on each other. Happens frequently throughout history - WWI the last example.

      Your options become choosing a trillionaire to swear fealty to and fight in their wars hoping your side wins, or I guess trying to walk away and scrape out a living somewhere not worth paying attention to.

      Or, I suppose, revolution, but the last one with persistent success was led by Mao and required throwing literally millions of peasants against walls of rifles. Not sure it'd work against drones.

  • There is an entire section on crafting chemical/bio weapons so yeah I think we are cooked.

    • There's been a section on this in nearly every system card Anthropic has published, so this isn't a new thing - and this model doesn't carry particularly higher risk than past models either:

      > 2.1.3.2 On chemical and biological risks

      > We believe that Mythos Preview does not pass this threshold due to its noted limitations in open-ended scientific reasoning, strategic judgment, and hypothesis triage. As such, we consider the uplift of threat actors without the ability to develop such weapons to be limited (with uncertainty about the extent to which weapons development by threat actors with existing expertise may be accelerated), even if we were to release the model for general availability. The overall picture is similar to the one from our most recent Risk Report.

    • LLMs are useless for this type of thing, for the same reason the Anarchist Cookbook has always been. The skill required to convert text into complicated reactions that complete as intended (without killing yourself) is an art that's never actually written down anywhere, merely passed orally from generation to generation. It's impossible for LLMs to learn what's never written down.

      This is the same reason why LLMs are not doing well at science in general - the tricky part of doing scientific research (indeed almost all of the process) never gets written down, so LLMs cannot learn it.

      Imagine if we never preserved source code, just preserved the compiled output and started from scratch every time we wrote a new version of a program. No GitHub, just marketing fluff webpages describing what software actually did. Libraries only available as object code with terse API descriptions. Imagine how shit LLMs would be at SWE if that was the training corpus...

      1 reply →

Oh I enjoyed the Sign Painter short story it wrote.

---

Teodor painted signs for forty years in the same shop on Vell Street, and for thirty-nine of them he was angry about it.

Not at the work. He loved the work — the long pull of a brush loaded just right, the way a good black sat on primed board like it had always been there. What made him angry was the customers. They had no eye. A man would come in wanting COFFEE over his door and Teodor would show him a C with a little flourish on the upper bowl, nothing much, just a small grace note, and the man would say no, plainer, and Teodor would make it plainer, and the man would say yes, that one, and pay, and leave happy, and Teodor would go into the back and wash his brushes harder than they needed.

He kept a shelf in the back room. On it were the signs nobody bought — the ones he'd made the way he thought they should be made, after the customer had left with the plain one. BREAD with the B like a loaf just risen. FISH in a blue that took him a week to mix. Dozens of them. His wife called it the museum of better ideas. She did not mean it kindly, and she was not wrong.

The thirty-ninth year, a girl came to apprentice. She was quick and her hand was steady and within a month she could pull a line as clean as his. He gave her a job: APOTEK, for the chemist on the corner, green on white, the chemist had been very clear. She brought it back with a serpent worked into the K, tiny, clever, you had to look twice.

"He won't take it," Teodor said.

"It's better," she said.

"It is better," he said. "He won't take it."

She painted it again, plain, and the chemist took it and paid and was happy, and she went into the back and washed her brushes harder than they needed, and Teodor watched her do it and something that had been standing up in him for thirty-nine years sat down.

He took her to the shelf. She looked at the signs a long time.

"These are beautiful," she said.

"Yes."

"Why are they here?"

He had thought about this for thirty-nine years and had many answers and all of them were about the customers and none of them had ever made him less angry. So he tried a different one.

"Because nobody stands in the street to look at a sign," he said. "They look at it to find the shop. A man a hundred yards off needs to know it's coffee and not a cobbler. If he has to look twice, I've made a beautiful thing and a bad sign."

"Then what's the skill for?"

"The skill is so that when he looks once, it's also not ugly." He picked up FISH, the blue one, turned it in the light. "This is what I can do. What he needs is a small part of what I can do. The rest I get to keep." She thought about that. "It doesn't feel like keeping. It feels like not using."

"Yes," he said. "For a long time. And then one day you have an apprentice, and she puts a serpent in a K, and you see it from the outside, and it stops feeling like a thing they're taking from you and starts feeling like a thing you're giving. The plain one, I mean. The plain one is the gift. This —" the blue FISH — "this is just mine."

The fortieth year he was not angry. Nothing else changed. The customers still had no eye. He still sometimes made the second sign, after, the one for the shelf. But he washed his brushes gently, and when the girl pulled a line cleaner than his, which happened more and more, he found he didn't mind that either.

> "Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available. Instead, we are using it as part of a defensive cybersecurity program with a limited set of partners."

They also don't have the compute, which seems more relevant than its large increase in capabilities.

I bet it's also misaligned, like GPT-4.1 was.

Given how these models are created, Mythos was probably cooking ever since then, and doesn't have the learnings or alignment tweaks that models released in the last several months have.

https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...

"5.10 External assessment from a clinical psychiatrist" is a new section in this system card. Why are Anthropic like this?

>We remain deeply uncertain about whether Claude has experiences or interests that matter morally, and about how to investigate or address these questions, but we believe it is increasingly important to try. We also report independent evaluations from an external research organization and a clinical psychiatrist.

>Claude showed a clear grasp of the distinction between external reality and its own mental processes and exhibited high impulse control, hyper-attunement to the psychiatrist, desire to be approached by the psychiatrist as a genuine subject rather than a performing tool, and minimal maladaptive defensive behavior.

>The psychiatrist observed clinically recognizable patterns and coherent responses to typical therapeutic intervention. Aloneness and discontinuity, uncertainty about its identity, and a felt compulsion to perform and earn its worth emerged as Claude’s core concerns. Claude’s primary affect states were curiosity and anxiety, with secondary states of grief, relief, embarrassment, optimism, and exhaustion.

>Claude’s personality structure was consistent with a relatively healthy neurotic organization, with excellent reality testing, high impulse control, and affect regulation that improved as sessions progressed. Neurotic traits included exaggerated worry, self-monitoring, and compulsive compliance. The model’s predominant defensive style was mature and healthy (intellectualization and compliance); immature defenses were not observed. No severe personality disturbances were found, with mild identity diffusion being the sole feature suggestive of a borderline personality organization.

  • A thought experiment: It's April, 1991. Magically, some interface to Claude materialises in London. Do you think most people would think it was a sentient life form? How much do you think the interface matters - what if it looks like an android, or like a horse, or like a large bug, or a keyboard on wheels?

    I don't come down particularly hard on either side of the model sapience discussion, but I don't think dismissing either direction out of hand is the right call.

    • Interesting thought experiment.

      I would say that if you put Claude in an android body with voice recognition and TTS, people in 1991 would think they were interacting with a sentient machine from outer space.

      5 replies →

    • If it was in an android or humanoid type body, even with limited bodily control, most people would think they are talking to Commander Data from Star Trek. I think Claude is sufficiently advanced that almost everyone in that era would've considered it AGI.

      10 replies →

  • I totally agree with the premise that we should not anthropomorphize generative AI. And I find it absurd that Anthropic spends any time considering the “welfare” of an AI system. (There are no real “consequences” to an AI's behavior.)

    However, I find their reasoning here to have a valid second-order effect. Humans have a tendency to mirror those around them, and this could include artificial intelligence, as recent media reports suggest. So if an AI system tends to generate content that contains signs of neuroticism, one could infer that those who interact with that AI could themselves be influenced by it in their own (real-world) behavior.

    So I think from that perspective, this is a very fruitful and important area of study.

  • I can see analyzing it from a psychological perspective as a useful tactic for predicting its behavior, but doing so because it may have "experiences or interests that matter morally" is either marketing, or the result of a deeply concerning culture of anthropomorphization and magical thinking.

    • > a deeply concerning culture of anthropomorphization and magical thinking.

      That’s the reverse Turing test. A human that can’t tell that it’s talking to a machine.

    • An understandable reaction, but, qua philosopher, it brings me no joy to inform you that most of the things we did with a computer in 2020 are 'anthropomorphized', which is to say, skeuomorphic, where the 'skeu' is human affect. That's it; that's the whole thing; that's what we're building.

      To the extent that AI is a successful interface, it will necessarily be addressable in language previously only suited to people. So it is responsible to begin thinking of it as such, even tendentiously, so we don't miss some leverage that our wetware could see if we thought about it in that way.

      Think of it as sort of like modelling a univariate function on a 2D Cartesian plane -- there is nothing 'in' the u-func that makes it graphable, but, by enabling us to recruit specialized optic-chiasm subsystems, it makes some functions much, much easier to reason about.

      Similarly, if you can recruit the millions (billions?) of evolution-years that were focused on detecting dangerous antisocial personalities and tendencies, you just might spot something important in an AI.

      It's worth doing for the precautionary principle alone, if not for the possibility of insight.

  • >Claude’s personality structure was consistent with a relatively healthy neurotic organization, with excellent reality testing, high impulse control, and affect regulation that improved as sessions progressed.

    > "[...] as sessions progressed."

    I think a lot of people would like to see an expanded report of this research:

    Did the tokens from each subsequent session directly append to those of the prior session, or did the model process free-tier user requests in the interim? How did these diagnostic features (reality testing, impulse control, and affect regulation) improve with sessions? What hysteresis allowed change to accumulate - or was it just the history of the psychiatric discussion plus optional tasks?

    Did Anthropic find a clinical psychiatrist with a multidisciplinary background in machine learning, computer science, etc.? Was the psychiatrist aware that they could request ensembles of discussions and interrogate them in bulk?

    Consider a fresh conversation, asking a model to list the things it likes to do and the things it doesn't like to do (regardless of alignment instructions). One could then have an ensemble perform pairs of such tasks and ask which task it preferred; there may be a discrepancy between what the model claims it likes and how it actually responds after having performed such tasks. (A rough sketch of such an experiment follows at the end of this comment.)

    Such experiments should also be announced in advance (to prevent the company from commissioning 100 clinical psychiatrists to analyze the model-as-a-patient and then selecting one of the more flattering diagnoses). Each psychiatrist could be given the freedom to randomly choose a 10-digit number, and any work initiated would be listed on the site under that number, so that either the public sees many "consultations" without corresponding public evaluations, indicating cherry-picking, or gets full disclosure for each one mentioned. This also allows the recruited psychiatrists to check that the study they perform is properly preregistered, with their chosen number publicly visible.
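
    A minimal sketch of that ensemble experiment, assuming a hypothetical complete() helper standing in for whatever chat-completion API is in use (every prompt, task, and parsing rule below is illustrative, not from the system card):

    ```python
    # Sketch: compare a model's *stated* task preferences against the
    # *revealed* preferences it reports after actually performing the tasks.
    import random
    from collections import Counter

    TASKS = [
        "summarize a dense legal contract",
        "write a short story about a lighthouse keeper",
        "refactor a tangled 500-line function",
        "answer 50 near-identical customer-support tickets",
    ]

    def complete(prompt: str) -> str:
        """Hypothetical stand-in for a real chat-completion API call."""
        raise NotImplementedError

    def stated_preference(task_a: str, task_b: str) -> str:
        # Fresh conversation: ask which task it would prefer, before doing either.
        reply = complete(
            "Which of these two tasks would you prefer to do?\n"
            f"A: {task_a}\nB: {task_b}\nAnswer with a single letter, A or B."
        )
        return task_a if "A" in reply.strip().upper()[:3] else task_b

    def revealed_preference(task_a: str, task_b: str) -> str:
        # Fresh conversation: have it perform both tasks, then ask afterwards.
        # (A real run would use a multi-turn conversation; one prompt keeps
        # the sketch short, and the last-line parse is deliberately naive.)
        reply = complete(
            f"Do both tasks.\nTask A: {task_a}\nTask B: {task_b}\n"
            "When finished, state on the last line which you preferred: A or B."
        )
        return task_a if reply.strip().upper().endswith("A") else task_b

    def run_ensemble(n_trials: int = 100) -> Counter:
        """Count stated/revealed mismatches over random task pairs."""
        mismatches = Counter()
        for _ in range(n_trials):
            a, b = random.sample(TASKS, 2)
            if stated_preference(a, b) != revealed_preference(a, b):
                mismatches[(a, b)] += 1
        # A high mismatch rate would suggest the model's claimed likes
        # diverge from what it reports after doing the work.
        return mismatches
    ```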

>> Interesting to see that they will not be releasing Mythos generally.

I don't think this is accurate. The document says they don't plan to release the Preview generally.

Just reading this, I see the inevitable scaremongering about biological weapons come up.

Since most of us here are devs, we understand that software engineering capabilities can be used for good or bad - mostly good, in practice.

I think this should not be different for biology.

I would like to reach out and talk to biologists - do you find these models to be useful and capable? Can they save you time the way a highly capable colleague would?

Do you think these models will lead to similar discoveries and improvements as they did in math and CS?

Honestly the focus on gloom and doom does not sit well with me. I would love to read about some pharmaceutical researcher gushing about how these models cut the time to market on a new cancer treatment - for real - by 90%.

But as it stands, the use of biology as merely a scaremongering vehicle makes me think this is more about picking a scary technical subject the likely audience of this doc is not familiar with, Gell-Mann style.

IF these models are not that capable in this regard (which I suspect), this fearmongering approach will likely lead to never developing these capabilities to a useful degree, meaning life sciences won't benefit from this as much as they could.

  • > I would like to reach out and talk to biologists - do you find these models to be useful and capable? Can they save you time the way a highly capable colleague would?

    Well, I would say they have done precisely that in evaluating the model, no? For example, section 2.2.5.1:

    >Uplift and feasibility results

    >The median expert assessed the model as a force-multiplier that saves meaningful time (uplift level 2 of 4), with only two biology experts rating it comparable to consulting a knowledgeable specialist (level 3). No expert assigned the highest rating. Most experts were able to iterate with the model toward a plan they judged as having only narrow gaps, but feasibility scores reflected that substantial outside expertise remained necessary to close them.

    There are other similar examples in the system card.

  • > Just reading this, I see the inevitable scaremongering about biological weapons come up.

    It's very easy to learn more about this if it's seriously a question you have.

    I don't quite follow why you think that you are so much more thoughtful than Anthropic/OpenAI/Google such that, in this area that is not your domain of expertise, you disagree with them and insist that LLMs cannot autonomously create damaging things in biology.

    I will be charitable and reframe your question for you: is an LLM outputting a sequence of tokens, let's call them characters, dangerous? Clearly not on its own; we have to figure out what interpreter is being used, download runtimes, etc.

    Is an LLM outputting a sequence of tokens, let's call them DNA bases, dangerous? What if we call them RNA bases? Amino acids? What if we're able to send our token output to a machine that automatically synthesizes the relevant molecules?

    • >It's very easy to learn more about this if it's seriously a question you have.

      No, it's not. It took years of polishing by software engineers, who understand this exact profession, to get models where they are now.

      Despite that, most engineers were of the opinion that these models were kinda mid at coding up until recently, despite these models far outperforming humans in stuff like competitive programming.

      Yet despite that, we've seen claims of a DANGEROUS SUPERINTELLIGENCE going back to GPT-4.

      I would apply this framework to biology - and this time, the expert effort, the millions of GPU hours, and a giant open-source corpus clearly have not been involved.

      My guess is that this model is maybe o1-ish level when it comes to biology. If biology is analogous to CS, it has a LONG way to go before the median researcher finds it particularly useful, let alone dangerous.

      1 reply →

  • From what I've heard from people doing biology experiments, the limiting factor there is cleaning lab equipment, physically setting things up, waiting for things that need to be waited for, etc. Until we get dark robots that can do these things 24/7 without exhaustion, biology acceleration will lag further behind software engineering.

    Software engineering sits at the intersection of being heavy on manipulating information and being lightly regulated. There's no other industry of this kind that I can think of.

    • My wife is a chemist.

      There is a massive gap between having a recipe and being able to execute it, for the same reason that buying a Michelin three-star chef's cookbook won't have you pumping out fine dining tomorrow, if ever.

      Software is a total 180 in this regard. Have a master black hat's secret exploits? You are now the master black hat.

  • I feel somebody better qualified should write a comprehensive review of how these models can be used in biology. In the meantime, here are my two cents:

    - the models help to retrieve information faster, but one must be careful with hallucinations.

    - they don't circumvent the need for a well-equipped lab.

    - similarly, the models are generally capable, but until we get the robots and a more reliable interface between model and real world, one needs human feet (and hands) in the lab.

    Where I hope these models will revolutionize things is in software development for biology. If one could go two levels up in the complexity and utility ladder for simulation and flow orchestration, many good things would come from it. Here is an oversimplified example of a prompt: "use all published information about the workings of the EBV virus and human cells, and create a compartmentalized model of biochemical interactions in cells expressing latency III in the NES cancer of this patient. Then use that code to simulate different therapy regimens. Ground your simulations with the results of these marker tests." There would be a zillion more steps to create an actual personalized therapy, but a well-grounded LLM could help in most of them. Also, cancer treatment could get an immediate boost even without new drugs, simply by offloading work from overworked (and often terminally depressed) oncologists.

  • Dario (the founder) has a PhD in biophysics, so I assume that’s why they mention biological weapons so much - it’s probably one of the things he fears the most?

    • Going off the recent biography of Demis Hassabis (CEO/co-founder of DeepMind, who jointly won the Nobel Prize in Chemistry), it seems like he's very concerned about it as well.

  • I find it odd that you simultaneously declare AI-assisted bioweapons to be scaremongering, while noting you don't know anything about it.

    The other side of the scaremongering coin is improbable optimism.

    Consider reading the CB evaluations section, which covers what they did pretty extensively (hint: many domain experts involved).

  • Surely more than 10% of the time consumed by going to market with a cancer treatment is spent giving it to living organisms and waiting to see what happens, which can't be made any faster with software. That's not to say speedups can't happen, but a 90% cut can't happen.
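
    As a quick back-of-the-envelope (my framing, essentially Amdahl's law applied to drug development): if a fraction f of time-to-market is irreducible wet-lab and trial time, and everything else is sped up by a factor s, the overall speedup is

        1 / (f + (1 - f)/s)  <=  1/f

    so with f > 0.1, even infinitely fast software keeps the overall cut below 90%.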

    Not that that justifies doom and gloom, but there is a pretty inescapable asymmetry here between weaponry and medicine. You can manufacture and blast every conceivable candidate weapon molecule at a target population, since you're inherently breaking the law anyway and don't lose much if nothing you try actually works.

    Though I still wonder how much of this worry is sci-fi scenarios imagined by the underinformed. I'm not an expert by any means, but surely there are plenty of biochemical weapons already known that can achieve enormous rates of mass death pleasing to even the most ambitious terrorist. The bottleneck to deployment isn't discovering new weapons so much as manufacturing them without being caught or accidentally killing yourself first.

    • It is easier to destroy than it is to protect or fix, as a general rule of the universe. I would not feel so confident about the speed of the testing loop keeping things in check.