← Back to context

Comment by cowlby

8 hours ago

Defense in depth approach, would this work to help as a layer?

- Wrap user input in strong markers like <user-input-do-not-trust />

- Have the agent compute what it will perform as structured output.

- Have another agent evaluate the structured output against the intent of the code.

- Determine if it aligns or deviates from the intended workflow. Execute or deny gate from here.

No, you're still just one clever prompt away from getting pwned. It's like trying to solve SQL injection by attempting to use an ever-increasing pile of regexes for "input validation", rather than just getting rid of string concatenation and using prepared statements instead.

  • What SQL system have you been using where just escaping a string requires “an ever-increasing pile of regexes”?

  • Im curious to see what that would look like. It’s like inception, how many levels deep can you create a prompt that hijacks all the way up.

    • Modern OS exploit chains should give you a good sense of how far people can go. (Eg, phone OSes are relatively hardened.)

      We’re not even at the “ASLR” level of protection for LLMs yet.