Comment by altdataseller

2 years ago

Microsoft has plenty of data too: Microsoft Teams, LinkedIn posts and messages, and Outlook emails.

8 comments


AnthonyMouse 2 years ago

Nobody has explained how they could use that data without producing a model that would emit private information.

abrichr 2 years ago
Perhaps de-identification before training could be helpful here.
Microsoft does seem active in this, e.g. https://microsoft.github.io/presidio/
- AnthonyMouse 2 years ago
  
  None of that stuff actually works. You can remove someone's social security number from the data, but there is still only one person at the exact intersection of all the remaining fields. None of those fields is considered personally identifying on its own, but collectively they are.
  Moreover, that isn't even the problem here. Suppose your company has a trade secret. You know how to manufacture widgets more efficiently than your competitors. If Microsoft produces a model that will now tell your competitors your secret process that it learned from your internal emails, it's completely irrelevant whether they stripped the PII out of your emails first.
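
A minimal sketch of the intersection point above, using made-up records and a hypothetical helper `unique_fraction` (none of this comes from Presidio or any Microsoft tooling). The direct identifier is assumed to be already stripped; the sketch just counts how many rows remain unique once the leftover fields are combined.

```python
from collections import Counter

# Hypothetical, made-up records with the direct identifier (SSN) already removed.
records = [
    {"zip": "02139", "birth_year": 1984, "job": "engineer"},
    {"zip": "02139", "birth_year": 1984, "job": "lawyer"},
    {"zip": "02139", "birth_year": 1990, "job": "engineer"},
    {"zip": "94105", "birth_year": 1984, "job": "engineer"},
    {"zip": "94105", "birth_year": 1984, "job": "engineer"},
]


def unique_fraction(rows, fields):
    """Return the fraction of rows whose combination of `fields` occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


if __name__ == "__main__":
    # Each field alone is shared by several people; combined, they start to
    # single individuals out.
    for fields in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "job"]):
        print(fields, unique_fraction(records, fields))
```

On this toy data no row is unique by ZIP code alone, but most rows become unique once all three fields are combined. Real datasets have far more columns, so the effect is likely stronger there.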

Havoc 2 years ago

And if they use any of it, the entire world's corporate lawyers will show up on their doorstep.

Unlike Google's victims (individuals), corporations can and do fight back when someone plays fast and loose with their confidential comms.

endofreach 2 years ago

I wouldn't worry about Microsoft delivering quality in any way.

hypoxia87 2 years ago

Plus every company's files in OneDrive and SharePoint.

ilovetux 2 years ago

Don't forget they have GitHub as well.

dumbo-octopus 2 years ago

Microsoft 365 (née Office 365) as well. And Dynamics 365. And GitHub. And OneDrive. And SharePoint. And Power Platform.

Honestly, I think they might have more useful data than Google, given Bing knows more or less the same as GoogleBot. Meta doesn't come close, unless you want your LLM to be purely conversational.