Comment by robot-wrangler

3 months ago

Yeah, remember the whole semantic vector arithmetic stuff of "king - man + woman = queen"? Psychometrics might be largely ridiculous pseudoscience for people, but since it's basically real for LLMs, poetry does seem like an attack method that's hard to really defend against.
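
For anyone who hasn't seen the analogy trick, here is a minimal sketch of it. The 2-D vectors are invented purely for illustration (one axis loosely "royalty", the other loosely "gender"); real embeddings have hundreds of dimensions and are learned, not hand-written. The `analogy` helper is a hypothetical name, not any library's API.

```python
import math

# Toy 2-D embeddings, made up for illustration only.
# Axis 0 is roughly "royalty", axis 1 is roughly "gender".
EMBEDDINGS = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """Return the word nearest to a - b + c, excluding the three input words."""
    target = tuple(
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )
    candidates = (w for w in EMBEDDINGS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))

result = analogy("king", "man", "woman")
print(result)
```

The point is only that meaning behaves like geometry here: nudging the input along one direction moves the result in a predictable way, which is what makes indirect phrasing a plausible lever.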

For example, maybe you could throw away gibberish input on the assumption that it is trying to exploit entangled words/concepts without triggering guardrails. Similarly, you could try to fight image-based GAN attacks if you could reject imperfections/noise that's inconsistent with what cameras would output. If the input is potentially "art", though, there are no hard criteria left for deciding what to filter or reject.
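
A rough sketch of why the gibberish filter doesn't help against poetry, assuming a deliberately crude heuristic: reject input when too few of its words are recognized. The vocabulary, the 0.5 threshold, and the function names are all hypothetical; a real filter would use a proper dictionary or a language-model score.

```python
# Hypothetical tiny vocabulary, standing in for a real dictionary.
KNOWN_WORDS = {"the", "a", "rose", "is", "red", "moon", "sings", "to", "sea", "and", "of"}

def known_word_ratio(text):
    """Fraction of whitespace-separated words found in KNOWN_WORDS."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in KNOWN_WORDS for w in words) / len(words)

def looks_like_gibberish(text, threshold=0.5):
    """Flag input when fewer than `threshold` of its words are recognized."""
    return known_word_ratio(text) < threshold

gibberish = "xq zzvt plorb the krr"
poem = "The moon sings to the sea"

print(looks_like_gibberish(gibberish))  # flagged: mostly unknown tokens
print(looks_like_gibberish(poem))       # not flagged: every word is ordinary
```

The verse is made entirely of ordinary words, so anything keyed on "does this look like language" lets it through regardless of what it is steering the model toward.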

3 comments

ACCount37 3 months ago

I don't think humans are fundamentally different. Just more hardened against adversarial exploitation.

"Getting maliciously manipulated by other, smarter humans" has been a real evolutionary pressure ever since humans learned speech, if not before. And humans are still far from perfect on that front: they're barely "good enough" on average, and far less than that on the lower end.

wat10000 3 months ago

Walk out the door carrying a computer -> police called.
Walk out the door carrying a computer and a clipboard while wearing a high-vis vest -> "let me get the door for you."

seethishat 3 months ago

Maybe the models can learn to be more cynical.