Comment by meowkit
4 days ago
We are not talking about inference.
The prompts and responses are used as training data. Even if your provider allows you to opt out they are still tracking your usage telemetry and using that to gauge performance. If you don’t own the storage and compute then you are training the tools which will be used to oppress you.
Incredibly naive comment.
> The prompts and responses are used as training data.
They show a clear pop-up where you choose whether or not to allow your data to be used for training. If you don't choose to share it, it's not used.
I mean I guess if someone blindly clicks through everything and clicks "Accept" without clicking the very obvious slider to turn it off, they could be caught off guard.
Assuming everyone who uses Claude is training their LLMs is just wrong, though.
Telemetry data isn't going to extract your codebase.
"If you don't choose to share it, it's not used"
I am curious where your confidence that this is true is coming from?
Besides lots of GPUs, training data seems to be the most valuable asset AI companies have. That sounds like a strong incentive to secretly use it anyway. Who would really know, if the pipelines are set up so that only very few people are aware of it?
And if it comes out: "oh gosh, one of our employees made a mistake."
And they already admitted to training on pirated content. So maybe they learned their lesson... maybe not, as they are still making money and want to continue to lead the field.
My confidence comes from the following:
1. There are good, ethical people working at these companies. If you were going to train on customer data that you had promised not to train on, there would be plenty of potential whistleblowers.
2. The risk involved in training on customer data that you are contractually obliged not to train on is higher than the value you can get from that training data.
3. Every AI lab knows that the second it comes out that they trained on paying customer data after saying they wouldn't, those paying customers will leave for their competitors (and sue them in the bargain).
4. Customer data isn't actually that valuable for training! Great models come from carefully curated training data, not from just pasting in anything you can get your hands on.
Fundamentally I don't think AI labs are stupid, and training on paid customer data that they've agreed not to train on is a stupid thing to do.
> I am curious where your confidence that this is true is coming from?
My confidence comes from working in big startups and big companies with legal teams. There's no way the entire company is going to gather all of the engineers and everyone around, have them code up a secret system to consume customer data into a secret part of the training set, and then have everyone involved keep quiet about it forever.
The whistleblowing and leaking would happen immediately. We've already seen LLM teams leak and have people try to whistleblow over things that aren't even real, like the Google engineer who thought they had invented AGI a few years ago (lol). OpenAI had a public meltdown when the employees disagreed with Sam Altman's management style.
So my question to you is: What makes you think they would do this? How do you think they'd coordinate the teams to keep it all a secret and only hire people who would take this secret to their grave?
> I am curious where your confidence that this is true is coming from?
We have a legally binding contract with Anthropic, checked and vetted by our lawyers, who are annoying because they actually READ the contracts and won't let us use services with suspicious clauses in them - unless we can make amendments.
If they're found to be in breach of said contract (which is what every paid user of Claude signs), Anthropic is going to be the target of SO FUCKING MANY lawsuits even the infinite money hack of AI won't save them.
> Besides lots of GPUs, training data seems to be the most valuable asset AI companies have. That sounds like a strong incentive to secretly use it anyway. Who would really know, if the pipelines are set up so that only very few people are aware of it?
Could be, but it's a huge risk the moment any lawsuit happens and the "discovery" process starts. Or whistleblowers.
They may well take that risk, they're clearly risk-takers. But it is a risk.