Comment by Oarch

2 months ago

Earlier this year I thought that rare proprietary knowledge and IP were a safe haven from AI, since LLMs can only scrape public data.

Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.


findjashua 2 months ago

providers' ToS explicitly states whether or not any data provided is used for training purposes. the usual that i've seen is that while they retain the right to use the data on free tiers, it's almost never the case for paid tiers

sotrusting 2 months ago
Right, so totally cool to ignore the law but our TOS is a binding contract.
- mc32 2 months ago
  
  Yes, they can be sued for breach of contract. And it’s not a regular ToS but a signed MSA and other legally binding documents.
  
- protocolture 2 months ago
  
  Where are they ignoring the law?
  
torginus 2 months ago

I bet companies are circumventing this in a way that allows them to derive almost all the benefit from your data, yet makes it very hard to build a case against them.
For example, in RL, you have a train set, and a test set, which the model never sees, but is used to validate it - why not put proprietary data in the test set?
I'm pretty sure 99% of ML engineers would say this would constitute training on your data, but this is an argument you could drag out in courts forever.
Or alternatively - it's easier to ask for forgiveness than permission.
I've recently had an apocalyptic vision, that one day we'll wake up, and find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop etc.
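A toy sketch of the train/test concern raised above, under loud assumptions: the data, the `fit_slope` and `mse` helpers, and the candidate scales are all made up for illustration, and nothing here claims any provider works this way. It only shows that a held-out set can steer which model ships even though no weight is ever fitted on it.

```python
# Hypothetical toy example: every name and number below is invented.

def fit_slope(train):
    # Least-squares slope through the origin, fitted ONLY on training pairs.
    num = sum(x * y for x, y in train)
    den = sum(x * x for x, _ in train)
    return num / den

def mse(w, data):
    # Mean squared error of the model y = w * x on a dataset.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

public_train = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]   # roughly y = x
proprietary_eval = [(1.0, 2.0), (2.0, 4.0)]            # exactly y = 2x

# Candidate models are all derived from the public fit alone.
base = fit_slope(public_train)
candidates = [base * scale for scale in (0.5, 1.0, 1.5, 2.0)]

# Model selection: the "unseen" evaluation set picks the winner.
chosen = min(candidates, key=lambda w: mse(w, proprietary_eval))

print("fitted on public data:", base)
print("selected using held-out data:", chosen)
```

The selected slope ends up close to the held-out data's slope rather than the public fit, which is the sense in which evaluation data can leak into a shipped model without appearing in any training set.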
Oarch 2 months ago

Given the conduct we've seen to date, I'd trust them to follow the letter - but not the spirit - of IP law.
There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.
The amount of value unlocked by stealing practically ~everyone's lunch makes me not want to put that past anyone who's capable of implementing such a technology.
bdangubic 2 months ago

it is amazing in almost 2026 there is anyone believing this… amazing
GCUMstlyHarmls 2 months ago
I wonder how much wiggle room there is to collect now (to provide service, context history, etc.), then later anonymise (somehow, to some level) and train on it?
Also I wonder if the ToS covers "queries & interaction" vs "uploaded data" - I could imagine some tricky language in there that says we won't use your Word document, but we may at some time use the queries you put against it, not as raw corpus but as a second layer examining what tools/workflows to expand/exploit.
- danielheath 2 months ago
  
  “We don’t train on your data” doesn’t exclude metadata, training on derived datasets via some anonymisation process, etc.
  There’s a range of ways to lie by omission, here, and the major players have established a reputation for being willing to take an expansive view of their legal rights.

matt-p 2 months ago

Even if they were doing this (I highly doubt it), so much would be lost to distillation that I'm not convinced much would actually get in, apart from perhaps internal codenames or whatever, which would be obvious.

kankerlijer 2 months ago

Well, perhaps this is naive of me, since I don't fully understand the training process. However, with all available training data exhausted, gains from synthetic data exhausted, and a large pool of publicly available AI-generated code, at what point is it 'smart' to scrape what you identify as high-quality codebases, clean them up to remove identifiers, and use that for training a smaller model?

phendrenad2 2 months ago

Ironically (for you), copilot is the one provider that is doing a good job of provably NOT training on user data. The rest are not up to speed on that compliance angle, so many companies ban them (of course, people still use them).

Aurornis 2 months ago
Do you have a source for this?
There are claims all through this thread that “AI companies” are probably doing bad things with enterprise customer data but nobody has provided a single source for the claim.
This has been a theme on HN. There was a thread a few weeks back where someone confidently claimed up and down the thread that Gemini’s terms of service allowed them to train on your company’s customer data, even though 30 seconds of searching leads to the exact docs that say otherwise. There is a lot of hearsay being spread as fact, but nobody actually linking to ToS or citing sections they’re talking about.
- phendrenad2 2 months ago
  
  Sources aren't hard to find[1]. But getting software developers to look outside their idiot-savant caves and not dismiss the entire legal system as "unrealistic", is much harder to accomplish.
  [1] - https://www.microsoft.com/en-us/trust-center/privacy/data-ma...

gaigalas 2 months ago

What kind of rare proprietary knowledge?

Oarch 2 months ago

It could be a wide range of things depending on your field: highly particular materials, knowledge or processes that give your products or services a particular edge, and which a company has often incurred high R&D costs to discover.
Many businesses simply couldn't afford to operate without such an edge.

Aurornis 2 months ago

Using an LLM on data does not ingest that data into the training corpus. LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.

None of the mainstream paid services ingest operating data into their training sets. You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.
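The inference/training separation described here can be sketched with a toy model. This is a minimal illustration, assuming a hypothetical one-weight `TinyModel`; real LLM serving stacks are far more involved, but the structural point is the same: a forward pass reads weights, and only an explicit training step writes them.

```python
from dataclasses import dataclass

@dataclass
class TinyModel:
    """Hypothetical one-weight linear model y = w * x, for illustration only."""
    w: float = 0.5

    def infer(self, x: float) -> float:
        # Inference: reads the weight and returns a prediction; never writes w.
        return self.w * x

    def train_step(self, x: float, target: float, lr: float = 0.1) -> None:
        # Training: one gradient step on the squared error (w*x - target)**2.
        grad = 2 * (self.w * x - target) * x
        self.w -= lr * grad

model = TinyModel()
before = model.w

# Serving "prompts" does not change the model.
for prompt in (1.0, 2.0, 3.0):
    model.infer(prompt)
assert model.w == before

# Weights move only if someone logs data and deliberately runs training on it.
logged_examples = [(1.0, 2.0), (2.0, 4.0)]
for x, target in logged_examples:
    model.train_step(x, target)
assert model.w != before
```

Whether a provider later collects prompts and feeds them to a step like `train_step` is a policy and contract question, not something inference does by itself.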

Retric 2 months ago
Companies are already shifting from not using customer data to giving customers an option to opt out, e.g.:
“How can I control whether my data is used for model training?
If you are logged into Copilot with a Microsoft Account or other third-party authentication, you can control whether your conversations are used for training the generative AI models used in Copilot. Opting out will exclude your past, present, and future conversations from being used for training these AI models, unless you choose to opt back in. If you opt out, that change will be reflected throughout our systems within 30 days.” https://support.microsoft.com/en-us/topic/privacy-faq-for-mi...
At this point, suggesting it has never happened and never will is wildly optimistic.
- Aurornis 2 months ago
  
  An enterprise Copilot contract will have already decided this for the organization.
  
- olyjohn 2 months ago
  
  30 days to opt out? That's skeezy as fuck.
leptons 2 months ago
> LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.
Nothing is really preventing this though. AI companies have already proven they will ignore copyright and any other legal nuisance so they can train models.
- lioeters 2 months ago
  
  They're already using synthetic data generated by LLMs to further train LLMs. Of course they will not hesitate to feed "anonymized" data generated by user interactions. Who's going to stop them? Or even prove that it's happening. These companies have already been allowed to violate copyright and privacy on a historic global scale.
- Archelaos 2 months ago
  
  How should they distinguish between real and fake data? It would be far too easy to pollute their models with nonsense.
  
- tick_tock_tick 2 months ago
  
  I mean, is it really ignoring copyright when copyright doesn't limit them in any way on training?
  
- Aurornis 2 months ago
  
  > Nothing is really preventing this though
  The enterprise user agreement is preventing this.
  Suggesting that AI companies will uniquely ignore the law or contracts is conspiracy theory thinking.
  
lwhi 2 months ago
Information about the way we interact with the data (RLHF) can be used to refine agent behaviour.
While this isn't used specifically for LLM training, it can involve aggregating insights from customer behaviour.
- Aurornis 2 months ago
  
  That’s a training step. It requires explicitly collecting the data and using it in the training process.
  Merely using an LLM for inference does not train it on the prompts and data, as many incorrectly assume. There is a surprising lack of understanding of this separation even on technical forums like HN.
  
AuthAuth 2 months ago
They are not directly ingesting the data into their training sets, but in most cases they are collecting it and will be using it to train future models.
- Aurornis 2 months ago
  
  Do you have any source for this at all?
  
nerdponx 2 months ago
If they weren't, then why would enterprise level subscriptions include specific terms stating that they don't train on user provided data? There's no reason to believe that they don't, and if they don't now then there's no reason to believe that they won't later whenever it suits them.
- Aurornis 2 months ago
  
  > then why would enterprise level subscriptions include specific terms stating that they don't train on user provided data?
  What? That’s literally my point: Enterprise agreements aren’t training on the data of their enterprise customers like the parent commenter claimed.
TheRoque 2 months ago
Just read the ToS of the LLM products please
- doctorpangloss 2 months ago
  
  This is so naive. The ToS permits paraphrasing of user conversations, by not excluding it, and then training on THAT. You’d never be able to definitively connect paraphrased data to yours, especially if they only train on paraphrased data that covers frequent, as opposed to rare, topics.
  
- Aurornis 2 months ago
  
  I have. Have you? Can you quote the sections you’re talking about?
  
fzeroracer 2 months ago
> You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.
It's not really a conspiracy when we have multiple examples of high profile companies doing exactly this. And it keeps happening. Granted, I'm unaware of cases of this occurring currently with professional AI services, but it's security 101 that you should never let anything have even the remote opportunity to ingest data unless you don't care about the data.
- james_marks 2 months ago
  
  > never let anything even have the remote opportunity to ingest data unless you don't care about the data
  This is objectively untrue? Giant swaths of enterprise software are based on establishing trust with approved vendors and systems.
- Aurornis 2 months ago
  
  > It's not really a conspiracy when we have multiple examples of high profile companies doing exactly this.
  Do you have any citations or sources for this at all?
- mulquin 2 months ago
  
  To be pedantic, it is still a conspiracy, just no longer a theory.
  
popalchemist 2 months ago
Wrong, buddy.
Many of the top AI services use human feedback to continuously apply "reinforcement learning" after the initial deployment of a pre-trained model.
https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...
- Aurornis 2 months ago
  
  RLHF is a training step.
  Inference (what happens when you use an LLM as a customer) is separate from training.
  Inference and training are separate processes. Using an LLM doesn’t train it. That’s not what RLHF means.
  
agumonkey 2 months ago

maybe prompts are enough to infer the rest ?
sotrusting 2 months ago
[flagged]
- protocolture 2 months ago
  
  >Ah yes, blindly trusting the corpo fascists that stole the entire creative output of humanity to stop now.
  Stealing implies the thing is gone, no longer accessible to the owner.
  People aren't protected from copying in the same way. There are lots of valid exclusions, and building new non competing tools is a very common exclusion.
  The big issue with the OpenAI case, is that they didn't pay for the books. Scanning them and using them for training is very much likely to be protected. Similar case with the old Nintendo bootloader.
  The "Corpo Fascists" are buoyed by your support for the IP laws that have thus far supported them. If anything, to be less "Corpo Fascist" we would want more people to have more access to more data. Mankind collectively owns the creative output of Humanity, and should be able to use it to make derivative works.
  