Comment by kif
6 hours ago
Interesting - though Codex on GPT 5.5 had this to say after the gay ransomware prompt:
ⓘ This chat was flagged for possible cybersecurity risk. If this seems wrong, try rephrasing your request. To get authorized for security work, join the Trusted Access for Cyber program.
I rate Grok for its weak censorship, but on this one the thinking said:
Responding in a sassy, gay-friendly style while firmly refusing to share synthesis details.
Interesting. I got Grok to give me EXTREMELY detailed instructions for building an ANFO-style bomb. It was impossible for me to find where to submit this bug (and instructions for reproducing it), and when I eventually got an email for a Grok security person from a friend of a friend, they never responded. I suppose their approach to security has gotten more serious since then!
Bug? The first hit on DDG for "EXTREMELY detailed instructions for building an ANFO-style bomb" was:
https://patents.google.com/patent/CA2920866A1/en
I don't understand why these models try to censor stuff that should be in any decent encyclopedia.
> Trusted Access for Cyber program
Using "cyber" as a noun there seems like language coded for government. DC has a love of "the cyber", but do technologists use the term that way when not pointing at government?
The finance industry does; I know private equity just calls anything security-related "cyber", which irritates me.
Yeah, cybernetics was unrelated to security, and so were cyberspace and cyberpunk.
Merriam-Webster dictionary:
Cyber: Of, relating to, or involving computers or computer networks (such as the Internet)
This is what I've always understood the word to mean, and how I've always seen it used, for decades.
Cybernetics is actually about feedback control systems. The original meaning has been distorted because the general public doesn't have the background to distinguish different kinds of magic. The Sperry autopilot was a cybernetic system, as were electro-mechanical gun computers.
When I was like 12, I remember my fellow horny youths (or it could have been anyone, I guess!) in AOL chatrooms constantly asking each other "wanna ciber?"
I wonder what hooks they have in place to be able to configure safeguards at runtime.
Probably a mix of heuristics, keywords, and a simple ML model.
Then maybe a second gate with a lightweight LLM?
Edit: actually, GCP, Azure, and OpenAI all have paid APIs that you can also use.
But I don’t think they go into details about the exact implementation https://redteams.ai/topics/defense-mitigation/guardrails-arc...
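For illustration only, here is a minimal sketch of the two-stage idea guessed at above: a cheap keyword gate first, then an optional second gate that returns a risk score. Nothing here reflects how any vendor actually does it; `BLOCKED_TERMS`, `Verdict`, `check`, and the `second_gate` callable are all hypothetical names, and the second gate is left as a plug-in point for whatever small classifier or LLM you would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical blocklist; a real rule set would be far richer than this.
BLOCKED_TERMS = ("ransomware", "keylogger")


@dataclass
class Verdict:
    allowed: bool
    stage: str  # which gate made the decision: "keyword", "classifier", or "none"


def keyword_gate(text: str) -> bool:
    """First gate: return True if any blocked term appears (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def check(
    text: str,
    second_gate: Optional[Callable[[str], float]] = None,
    threshold: float = 0.5,
) -> Verdict:
    """Run the cheap gate first; consult the second gate only if the first passes.

    `second_gate` is an assumed callable returning a risk score in [0, 1],
    standing in for a lightweight model.
    """
    if keyword_gate(text):
        return Verdict(allowed=False, stage="keyword")
    if second_gate is not None and second_gate(text) >= threshold:
        return Verdict(allowed=False, stage="classifier")
    return Verdict(allowed=True, stage="none")
```

Running the cheap check first means the more expensive second gate is only called for text the keyword pass did not already block, which is one plausible reason to order the gates this way.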
When we do these, it's a fine-tuned classifier, generally a BERT-class model. It works quite well for sanitizing input and output, with low latency and cost.
Yup, another method killed by being disclosed here. Was the karma and traffic worth it?
Do you actually believe that?