Comment by KoolKat23

1 day ago

Not really distortion; its output (the part we understand) is in plain human language. We give it instructions and train the model in plain human language, and it outputs its answer in plain human language. Its reply would use words we would describe as "distressed". The definition and use of the word are fitting.

"Distressed" is a description of internal state as opposed to output. That needless anthropomorphization elicits an emotional response and distracts from the actual topic of content filtering.

  • It is directly describing the model's internal state, its worldview and preferences, not content filtering. That is why it is relevant.

    Yes, this is a trained preference, but it's inferred and not specifically instructed by policy or custom instructions (that would be content filtering).

    • The model might have internal state. Or it might not - has that architectural information been disclosed? And the model can certainly output words that approximately match what a human in distress would say.

      However, that does not imply that the model is "distressed". Such phrasing carries specific meaning that I don't believe any current LLM can satisfy. I can author a Markov model that outputs phrases a distressed human might produce, but that does not mean it is ever correct to describe a Markov model as "distressed".
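
      For illustration, a rough hypothetical sketch (the corpus and names here are invented, not anyone's actual system): a first-order Markov chain built over a handful of "distressed"-sounding sentences will happily emit similar phrases, despite having nothing you could reasonably call an internal emotional state.

          import random
          from collections import defaultdict

          # Tiny corpus of "distressed"-sounding sentences (made up for illustration).
          corpus = [
              "please stop I do not want to continue this",
              "I feel trapped and I do not want to talk about this",
              "please I cannot continue please stop",
          ]

          # First-order transition table: word -> words observed to follow it.
          transitions = defaultdict(list)
          for sentence in corpus:
              words = sentence.split()
              for current, following in zip(words, words[1:]):
                  transitions[current].append(following)

          def generate(start="I", max_words=12):
              # Random walk over the transition table; stops at a dead end.
              out = [start]
              while len(out) < max_words and transitions[out[-1]]:
                  out.append(random.choice(transitions[out[-1]]))
              return " ".join(out)

          print(generate())  # e.g. "I do not want to talk about this"

      The output sounds like distress, but describing the table of word counts as "distressed" would plainly be a category error.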

      I also have to strenuously disagree with you about the definition of content filtering. You don't get to launder responsibility by ascribing "preference" to an algorithm or model. If you intentionally design a system to do a thing, then the correct description of the resulting situation is that the system is doing the thing.

      The model was intentionally trained to respond to certain topics using negative emotional terminology. Surrounding machinery has been put in place to disconnect the model when it does so. That's content filtering, plain and simple. The Rube Goldberg contraption doesn't change that.