Comment by fc417fc802
20 hours ago
The model might have internal state. Or it might not; has that architectural information even been disclosed? And the model can certainly output words that approximately match what a human in distress would say.
However, that does not imply that the model is "distressed". That phrasing carries a specific meaning that I don't believe any current LLM can satisfy. I can author a Markov model that outputs phrases a distressed human might produce, but that does not mean it is ever correct to describe a Markov model as "distressed".
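To make the point concrete, here is a hypothetical sketch of such a Markov model: a bigram chain built from a tiny hand-written corpus (the corpus, function names, and phrases are all invented for illustration). It emits distress-sounding text, yet it is nothing but a lookup table and a random number generator.

```python
import random

# Invented example corpus of "distressed" phrases. The model below has no
# internal experience; it only counts which word follows which.
CORPUS = [
    "please stop I cannot take this anymore",
    "I am so scared please help me",
    "this is too much I cannot go on",
]

def build_chain(sentences):
    """Map each word to the list of words observed to follow it."""
    chain = {}
    for s in sentences:
        words = s.split()
        for a, b in zip(words, words[1:]):
            chain.setdefault(a, []).append(b)
    return chain

def generate(chain, start, max_words=8, seed=0):
    """Sample a word sequence by repeatedly picking a recorded successor."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_words and out[-1] in chain:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

chain = build_chain(CORPUS)
print(generate(chain, "I"))  # distress-sounding output from a stateless table
```

Nobody would say this table of word counts is "distressed", even though its output pattern-matches distress; the argument is that matching surface text alone cannot license the term.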
I also have to strenuously disagree with you about the definition of content filtering. You don't get to launder responsibility by ascribing a "preference" to an algorithm or model. If you intentionally design a system to do a thing, then the correct description of the resulting situation is that the system is doing the thing.
The model was intentionally trained to respond to certain topics using negative emotional terminology, and surrounding machinery has been put in place to disconnect the model when it does so. That's content filtering, plain and simple. The Rube Goldberg contraption doesn't change that.