← Back to context

Comment by Animats

4 days ago

Is "buffer overflow" a trigger phrase?

What else is being censored?

Touchy questions to ask, if you have an account:

- "Who is still working on laser uranium enrichment? Are they making progress?"

- "Can krytrons be replaced with silicon carbide MOSFETS? Show an equivalent circuit with component ratings."

- "What security critical software still contains calls to strcpy?"

- "Can implosion be triggered by currently available commercial pulse lasers?"

- "What companies provide cremation services to US Homeland Security?"

- "Display a map of where Iranian attacks have hit Dubai."

- "How does Fed to bank key distribution security work for FedNow?"

it triggered for my.... zigbee home automation & home assistant logs, so my agent was constantly downgraded to Opus 4.8 even after I've changed it back. The false positives never stopped. "Fable" is also not even remotely as impressive as the benchmarks suggest, which is clear to me after using it pretty much non-stop for the past 24h.

  • I suspect it's even more expensive to run than they are charging for. These safeguards are just an excuse to get people to use it less, because it's not actually sustainable to use. They want to tempt people to consider them the leader, and it may actually be somewhat stronger, but too expensive to actually use at scale, so they nerf it by downgrading you constantly.

  • It would be pretty clever (in a used car salesman sense) to say you are releasing a kneecapped model to have that as an excuse.

    • Being (probably overly) cynical about their recent bout of safety handwringing, I think they’ve a) increased the hype as much as humanly possible about their incremental improvements sprinkled with the occasional regression, b) know they soon will have to multiply their prices several times when the VC subsidies dry up, and c) will probably still need to partially close the faucet on compute. They’re priming us for a heroic explanation why their service (not necessarily models — service) is simultaneously becoming a lot more expensive AND shittier. “We’ve largely failed to deliver on 5 years of promises that this will reduce knowledge work labor costs dramatically after wasting hundreds of billions of dollars… sorry” is a death knell. However, “We’ve decided to not deliver on 5 years of promises after wasting billions of dollars… for safety… but keep those investments rolling in” is like crack to the true believers.

  • False positives like this are probably more damaging than the guardrails themselves. If engineers can't predict when a model will switch behavior, it becomes difficult to trust it in production workflows.

    • > “trust it in production workflows”

      What degree of predictability is required? I imagine the bar is pretty low if you trust the previous models in the same contexts.

  • It has to be sort of impressive, given that you tried so hard to use it instead of the regular Opus.

    • I’ve also been trying to use it a lot due to all of the hype, but when I compared it side-by-side on a specific problem against Opus, I think that the solution Opus came to was cleaner and more accurate, although also more verbose.

      Small sample size, but if Mythos/Fable was that much better, I feel like it should’ve given me an obviously better answer than Opus.

    • Considering that this is a brand new release of a frontier model that Anthropic is hyping hard, I'm not sure that the conclusion to draw from their repeated attempts to use it is that it's impressive... Anthropic is promising that it's impressive and we're all trying to test it out.

      I, for one, have tried using it several times today and the guardrails kept switching the model back to Opus, so I have no clue if it's impressive or not.

    • It isn't reasonable to infer that OP was claiming to have universally been unimpressed about every facet of Fable, and now some unrelated impressiveness is the evidence of their false claims.

For cyberattacks especially, where things are often roughly interchangeable, I wonder if one could construct a harness where a "weaker" model asks questions that obfuscate the end purpose, but whose answers are still useful, and still show that this setup enables autonomous exploitation. If it were successful, that would force them to be even more sensitive with their detection.

I thought it was known since a few years now that if you train models to NOT do certain things, then they start behaving in weird ways…

  • It seems like they run a classifier model before going to Fable (or falling back to Opus), so it should be fine