
Comment by derefr

3 hours ago

> if an agent browses a poisoned page during research, the injected instructions could override its behavior before secrets ever come into play.

Why is this problem (UGC instruction injection) still a thing, anyway? It feels like a problem that can be solved very simply in an agentic architecture that's willing to do multiple calls to different models per request.

How: filter fetched data through a non-instruction-following model (i.e. the sort of base text-prediction model you have before instruction-following fine-tuning) that has instead been fine-tuned hard into a classifier, so that all it does is output whether the text in its context window contains "instructions directed toward the reader" or not.
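A minimal sketch of that gating architecture, with the classifier stubbed out (the function names, the keyword heuristic, and the withholding message are all hypothetical placeholders; a real version would call the fine-tuned classifier model instead):

```python
def contains_directed_instructions(text: str) -> bool:
    """Stub standing in for the non-instruction-following classifier model.

    Returns True if the text appears to contain instructions directed
    toward the reader. The crude keyword check below exists only so the
    sketch runs; the actual proposal is a fine-tuned model here.
    """
    markers = ("ignore previous", "ignore all prior", "you must now",
               "disregard your", "system prompt")
    return any(m in text.lower() for m in markers)


def screen_fetched_text(url: str, text: str) -> str:
    """Gate UGC before it ever reaches the deliberative model's context.

    Flagged documents are replaced with a placeholder, so injected
    instructions never enter the context window at all.
    """
    if contains_directed_instructions(text):
        return f"[content from {url} withheld: flagged as containing reader-directed instructions]"
    return text
```

The key design point is that the deliberative model only ever sees the return value of `screen_fetched_text`, never the raw fetched document.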

(And if that non-instruction-following classifier model is in the same model-family / using the same LLM base model that will be used by the deliberative model to actually evaluate the text, then it will inherently apply all the same "deep recognition" techniques [i.e. unwrapping / unarmoring / translation / etc] the deliberative model uses; and so it will discover + point out "obfuscated" injected instructions to exactly the same degree that the deliberative model would be able to discover + obey them.)

Note that this is a strictly simpler problem than preventing jailbreaks. Jailbreaks try to inject "system-prompt instructions" among "user-prompt instructions" (where, from the model's perspective, there is no natural distinction between the two, only whatever artificial distinctions the model's developers try to impose; without explicit anti-jailbreak training, both are just picked up as "instructions" by an LLM). Whereas the goal here is just to prevent any UGC-tainted document containing anything that could be recognized as "instructions I would try to follow" from ever being injected into the context window.

(Actually, a very simple way to do this is to take the instruction-following model, experimentally derive a vector direction within it representing "I am interpreting some of the input as instructions to follow" [à la the known vector directions for refusal et al], and then just chop off all the layers past that point and replace them with an output head emitting the cosine similarity between the input's activation and that vector direction.)
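A toy sketch of that probe head, assuming you already have the per-input activation at the chosen layer and the experimentally derived direction (the vectors below are made up purely for illustration; in practice both would come from the model's hidden states):

```python
import numpy as np

def instruction_score(activation: np.ndarray, direction: np.ndarray) -> float:
    """Replacement output head: cosine similarity between the input's
    activation at the chosen layer and the derived 'interpreting input
    as instructions' direction. Everything past that layer is discarded.
    """
    return float(
        activation @ direction
        / (np.linalg.norm(activation) * np.linalg.norm(direction))
    )

# Hypothetical demo vectors: the probe direction, an activation from a
# benign document, and one from a document carrying injected instructions.
direction = np.array([1.0, 0.0, 0.0])
benign    = np.array([0.1, 1.0, 0.2])   # low projection onto the direction
injected  = np.array([2.0, 0.1, 0.0])   # high projection onto the direction

# A document would be flagged when its score exceeds some tuned threshold.
```

Since cosine similarity is cheap relative to a full forward pass through the remaining layers, this head would make the screening call much lighter than a second full model invocation.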