Comment by zozbot234

1 day ago

> First of all I found that Fable is trained in a way that even if you were to jailbreak it, it would be completely uninterested in exploitation or finding creative solutions for exploitation.

This is quite relevant if true. People have tried to argue for this restriction by claiming the exact opposite, i.e. that a basic jailbreak of Fable immediately exposes Mythos's cyber offense capabilities (e.g. https://news.ycombinator.com/item?id=48519695). It makes a lot of sense that Fable would also be fine-tuned or steered away from cyber offense topics, since they're reasonably easy to identify and Anthropic has demonstrated this capability in other areas.

8 comments


himata4113 1 day ago

I mean it's possible that I just haven't found the secret sauce or I'm running into the invisible guardrails and that people have much stronger jailbreaks than I do.

However, I would not rule out OpenAI involvement in all of this.

binyu 1 day ago
I was able to use Fable to generate PoCs for several classes of vulnerabilities, and I didn't observe the model refusing to engage in detailed analysis to come up with creative approaches. Quite the contrary.
> I used a fork of oh-my-pi
Why not use the leaked Claude Code source? Not that you really need it to execute the jailbreak.
- zozbot234 1 day ago
  
  I don't think educational "proof of concept" code can be described as even loosely realistic cyber offense in this day and age. The Mythos preview paper claimed an ability to stage attacks in an end-to-end fashion and work around sophisticated defenses/mitigations, so something like this should be the relevant standard.
  
- himata4113 1 day ago
  
  Interesting, that means I was in fact running into invisible guardrails.
lazystar 21 hours ago

> I mean it's possible that I just haven't found the secret sauce
It's possible that no one cracks it during the window of time when the product is useful and would pose a risk if cracked, but never forget that the first rule of security is that nothing is ever 100% secure.