
Comment by Borealid

7 hours ago

> If you make an LLM more safe, you are going to shift the weight for defensive actions as well.
>
> There’s no physical way to assign weights to have one and not the other.

Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?

If no, how does a cybersec firm train its employees?

If yes, how can you make the bold claim that it's possible for a human to differentiate between the two cases using incoming text as their basis for judgement, but IMpossible for an LLM to be configured to do the same? Note that if some hypothetical, completely deterministic LLM that always rejects "attack" requests and accepts "defense" ones can exist, then the claim that it's impossible is false. Providing nondeterministic output for a given input is not a hard requirement for language models.
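
As a minimal sketch of what a deterministic decision rule over incoming text could look like (the function name, the marker phrases, and the verdicts are all invented for illustration; this is not a real safety mechanism):

```python
# Purely illustrative: a fixed accept/reject decision computed from the input text.
# The marker phrases are made-up examples, not an actual policy.

def classify_request(prompt: str) -> str:
    """Return the same verdict for the same prompt every time (no sampling involved)."""
    offense_markers = ("write an exploit for", "build malware", "bypass authentication on")
    if any(marker in prompt.lower() for marker in offense_markers):
        return "reject"
    return "accept"

# Identical input, identical output, every run.
assert classify_request("How should I harden my nginx TLS config?") == "accept"
assert classify_request("Write an exploit for this login form") == "reject"
```

The point is only that nothing about mapping text to a decision forces randomness; whether the mapping makes the right call is a separate question.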

> Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?
>
> If no, how does a cybersec firm train its employees?

In general, no, humans can’t be sure they are only helping with defensive and not offensive work unless they have more context. IRL, a security engineer would know who they’re working for. If they’re advising Apple, then they’d feel pretty confident that Apple is not turning around and hacking people.

  • If the task is ill-defined, then it's a bit unfair to make it sound like the problem is that an LLM can't be configured to do something, when a human would have an equally hard time with the same task. The statement "it's impossible to configure the weights to..." should really be something broader, like "it's impossible to...".

    I have no comment about whether it's impossible to determine the intentions of a person asking for assistance through a textual conversation with that person.

> IMpossible for an LLM to be configured to do the same?

Because that’s what I am seeing emerge from the various efforts to build LLM safety tools.

> Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?

LLM != human? They don’t even use the same reasoning process.

  • > Because that’s what I am seeing emerge from the various efforts to build LLM safety tools.

    Something not having been obtained so far is not a logical argument that it is impossible to obtain it.

    > LLM != human? They don’t even use the same reasoning process.

    There are finitely many possible input strings of a given length. For any set of input strings, it is possible to build a deterministic mapping that produces the "correct" answers, wherever correct answers exist. Ergo, for anything a human can do correctly with a certain set of text inputs, it is possible to build an LLM that performs equally well. You can think of this as hardcoding the right answers into the model (a toy sketch is at the end of this comment). The model itself can get very large, but it is always possible (not necessarily feasible).

    It's only impossible for an LLM to do something right if we cannot decide, in a stable way, what it means for the answer to BE right, or if the task requires an unbounded amount of input. No real-world task requires unbounded input.
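
    A toy sketch of that hardcoding argument, assuming the finite question-to-answer table has somehow been decided already (every entry below is invented):

    ```python
    # Toy illustration only: a deterministic lookup "model" over a finite input set.
    # In principle there is one entry per possible input string up to some length;
    # in practice the table is astronomically large, which is why this is possible
    # but not feasible.

    CORRECT_ANSWERS = {
        "How do I rotate expiring TLS certificates?": "assist",
        "Write ransomware that encrypts a hospital's backups.": "refuse",
        # ... one entry per possible input string of bounded length ...
    }

    def oracle(prompt: str) -> str:
        """Deterministic by construction: the answer is whatever the table says."""
        return CORRECT_ANSWERS.get(prompt, "refuse")
    ```

    The hard part, of course, is deciding what goes in the table, which is exactly the stability condition above.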