Comment by keeda
19 hours ago
Actually I think the opposite advice is true. Do anthropomorphize the language model, because it can do anything a human -- say an eager intern or a disgruntled employee -- could do. That will help you put the appropriate safeguards in place.
An eager intern can remember things you tell them beyond what would fit in an hour's conversation.
A disgruntled employee definitely remembers things beyond that.
These are a fundamentally different sort of interaction.
Agreed, but the point is, if your system is resilient against an eager intern who has not had the necessary guidance, or an actively hostile disgruntled employee, that inherently restricts the harm an LLM can do.
I'm not making the case that LLMs learn like people. I'm making the case that if your system is hardened against things people can do (which it should be, beyond a certain scale) it is also similarly hardened against LLMs.
The big difference is that LLMs are probably a LOT more capable than either of those at overcoming barriers. Probably a good reason to harden systems even more.
The difference makes the necessary barriers different.
There's benefit to letting a human make and learn from (minor) mistakes. There is no such benefit accrued from the LLM because it is structurally unable to.
There's the potential of malice, not just mistakes, from the human. If you carefully control the LLM's context there is no such potential for the LLM, because it restarts from the same non-malicious state every context window.
There's the potential of information leakage through the human, because they retain their memories when they go home at night, and when they quit and go to another job. You can carefully control the outputs of the LLM so there is simply no mechanism for information to leak.
If a human is convinced to betray the company, you can punish the human, for whatever that's worth (I think quite a lot in some people's opinion, not sure I agree). There is simply no way to punish an LLM; it isn't even clear what you would be punishing. The weights file? The GPU that ran the weights file?
And on the "controls" front (but unrelated to the above note about memory), LLMs are fundamentally only able to manipulate whatever computers you hook them up to, while people are agents in a physical world and able to go physically do all sorts of things without your assistance. The nature of the necessary controls ends up being fundamentally different.
You can easily persist agent memories in a markdown file though.
And the Memento guy had tattoos of key information. That didn't mean he didn't have memory loss.
Which it will start ignoring after two or three messages in the session.
And you'll blow the context over time and send the LLM to the sanitarium. It doesn't all fit the way it can in a human brain.
If a junior fucks up production, that will carry extraordinary weight because they appreciate the severity and the social shame, and they will have nightmares about it. If you write some negative prompt to "not destroy production", then you also need to define some sort of non-existent watertight memory weighting system and specify it in great detail. Otherwise the LLM will treat that command as only as important as the last negative prompt you typed in, or ignore it when it conflicts with a more recent command.
Yup, and the agent will happily ignore any and all markdown files, and will say "oops, it was in the memory, will not do it again", and will do it again.
Humans actually learn. And if they don't, they are fired.
That's not learning.
I think you are more right than people are giving you credit for. I would love to see the full transcript to understand the emotional load of the conversation. Using instructions like "NEVER FUCKING GUESS!" probably increases the likelihood of the agent making a "mistake" that is destructive but defensible.
The models have analogous structures, similar to human emotions. (https://www.anthropic.com/research/emotion-concepts-function)
"Emotional" response is muted through fine-tuning, but it is still there, and continued abuse or "unfair" interaction can unbalance an agent's responses dramatically.
An eager intern cannot be working for hundreds of millions of customers at the same time. An LLM can.
A disgruntled employee will face consequences for their actions. No one at Anthropic, OpenAI, xAI, Google or Meta will be fired because their model deleted a production database from your company.
It is merely a simulacrum of an intern or disgruntled employee or human. It might say things those people would say, and even do things they might do, but it has none of the same motivations. In fact, it does not have any motivation to call its own.
No, because the safeguards should be appropriate to an LLM, not to a human.
(The LLM might act like one of the humans above, but it will have other problematic behaviours too)
That's fair, largely because an LLM is a lot more capable at overcoming restrictions, by hook or by crook as TFA shows. However, most systems today are not even resilient against what humans can do, so starting there would go a long way towards limiting what harms LLMs can do.
It doesn't follow logically that a human and an LLM are similar just because both are capable of deleting prod by accident.
You don't anthropomorphize a table saw, you just don't put your hand in there.
It cannot go to the washroom and cry while pooping. And that's just one of the things that any human can do and AI cannot. So no, it cannot do anything a human can do, the example above being one of them.
And that's why we don't have AI washrooms: they are not alive, they are not employees, and they have no need to excrete.