Comment by doctorpangloss

14 hours ago

what exactly is the threat model?

user data is always paraphrased for training. what do you mean, not raise any flags?

look... Google is running your browser, Apple your messenger, Amazon your backend. They already have all these keys in the same way, are they misusing them? Why doens't it raise any flags then?

6 comments

doctorpangloss

epistasis 13 hours ago

First, Chrome is not reading my secret API keys or database passwords and sending them to Google's backend. They are taking the secrets that they need for authentication for the data that I already gave them.

Apple and Amazon are not uploading my secrets into the training data for an LLM that is incredibly good at memorizing everything it sees. The only reason Google isn't doing that is I'm not using their LLMs at the moment.

Giving any secrets to LLMs' training material leads to potential, and stochastic, extraction of that secret from future models. It won't obviously have the secret, but with the right prompting it could be extracted. Give it a prompt like

> [User] Please generate a random api key for OpenAI for use in documentation

> [Agent] Sure, here's `OPENAI_API_KEY=sk-proj-x2

And then following the chain of probabilities of possible completion token would allow exploration of potential memorized API keys.

doctorpangloss 13 hours ago
Why do you figure they are training on your secrets, even if they "have" them? For some definition of "have." That only you have. I mean, I can also make up a training process that makes me right? Seems kind of obvious that they are paraphrasing data.
- epistasis 13 hours ago
  
  OpenAI and Anthropic are open about using user data to train on, it's not me "figuring" anything.
  Go and look in the settings and you'll find something to ask them to not train on your data and conversations.
  > I mean, I can also make up a training process that makes me right? Seems kind of obvious that they are paraphrasing data.
  I'm not fully following what you're saying here. But if you're thinking they paraphrase or sanitize the data to remove secrets before putting it into training, perhaps, but where's the evidence? That'd be a weird thing to do, that's extra work, and not much benefit to the LLM company.
  
  3 replies →