Secure Secrets Management for Cursor Cloud Agents

5 days ago (infisical.com)

If i read this correctly its completely absurd. secrets can never even touch an agents sandbox, not as file not as env var not as anything. Agents can only be allowed to reach services via proxies that handle secrets and do permissions and auditing completely transparently and agents do not even get secrets to access these but authenticate as their identity eg with client certificates. I am not aware of any other method that could work. The proxies obviously also cannot be reachable outside the direct connection, so if agents exfiltrate their identity and proxy setup somehow the usefulness outside is zero.

This is a really important area to tackle. secret management for AI agents is something most teams are ignoring right now.

One adjacent risk worth noting: the URLs these agents visit during research. Even with proper secret management, if an agent browses a poisoned page during research, the injected instructions could override its behavior before secrets ever come into play.

  • > if an agent browses a poisoned page during research, the injected instructions could override its behavior before secrets ever come into play.

    Why is this problem (UGC instruction injection) still a thing, anyway? It feels like a problem that can be solved very simply in an agentic architecture that's willing to do multiple calls to different models per request.

    How: filter fetched data through a non-instruction-following model (i.e. the sort of base text-prediction model you have before instruction-following fine-tuning) that has instead been hard-fine-tuned into a classifier, such that it just outputs whether the text in its context window contains "instructions directed toward the reader" or not.

    (And if that non-instruction-following classifier model is in the same model-family / using the same LLM base model that will be used by the deliberative model to actually evaluate the text, then it will inherently apply all the same "deep recognition" techniques [i.e. unwrapping / unarmoring / translation / etc] the deliberative model uses; and so it will discover + point out "obfuscated" injected instructions to exactly the same degree that the deliberative model would be able to discover + obey them.)

    Note that this is a strictly-simpler problem to that of preventing jailbreaks. Jailbreaks try to inject "system-prompt instructions" among "user-prompt instructions" (where, from the model's perspective, there is no natural distinction between these, only whatever artificial distinctions the model's developers try to impose. Without explicit anti-jailbreak training, these are both just picked up as "instructions" to an LLM.) Whereas the goal here would just be to prevent any UGC-tainted document containing anything that could be recognized as "instructions I would try to follow" from ever being injected into the context window.

    (Actually, a very simple way to do this is to just take the instruction-following model, experimentally derive a vector direction within it representing "I am interpreting some of the input as instructions to follow" [ala the vector directions for refusal et al], and then just chop off all the rest of the layers past that point and replace them with an output head emitting the cosine similarity between the input and that vector direction.)