Comment by altdataseller

10 months ago

Microsoft has plenty of data too: Microsoft Teams, LinkedIn posts and messages, and Outlook emails.

Nobody has explained how they could use that data without producing a model that would emit private information.

  • Perhaps de-identification before training could be helpful here.

    Microsoft does seem active in this, e.g. https://microsoft.github.io/presidio/

    • None of that stuff actually works. You can remove someone's Social Security number from the data, but there is still only one person at the exact intersection of all the remaining fields. None of those fields is considered personally identifying on its own, but collectively they are.

      Moreover, that isn't even the problem here. Suppose your company has a trade secret. You know how to manufacture widgets more efficiently than your competitors. If Microsoft produces a model that will now tell your competitors your secret process that it learned from your internal emails, it's completely irrelevant whether they stripped the PII out of your emails first.
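To make the re-identification point above concrete, here is a minimal sketch in Python. The records, field names, and the `unique_fraction` helper are all hypothetical and invented for illustration; nothing here reflects real data or any Microsoft tooling. It only shows that fields which rarely single anyone out on their own can do so once combined.

```python
from collections import Counter

# Hypothetical records with direct identifiers (name, SSN) already removed.
records = [
    {"zip": "98052", "birth_year": 1984, "job": "engineer"},
    {"zip": "98052", "birth_year": 1984, "job": "engineer"},
    {"zip": "98052", "birth_year": 1984, "job": "lawyer"},
    {"zip": "98052", "birth_year": 1990, "job": "engineer"},
    {"zip": "10001", "birth_year": 1984, "job": "engineer"},
]


def unique_fraction(rows, fields):
    """Return the fraction of rows whose combination of `fields` occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each field alone isolates only one record out of five.
single = {f: unique_fraction(records, [f]) for f in ("zip", "birth_year", "job")}

# All three fields together isolate three records out of five.
combined = unique_fraction(records, ["zip", "birth_year", "job"])

print(single)
print(combined)
```

In this toy dataset each field alone pins down 20% of the records, while the three together pin down 60%. Real datasets have far more columns, so the effect is usually stronger.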

And if they use any of it, the entire world's corporate lawyers will show up on their doorstep.

Unlike Google's victims (individuals), corporations can and do fight back when someone plays fast and loose with their confidential comms.

Microsoft 365 (née Office 365) as well. And Dynamics 365. And GitHub. And OneDrive. And SharePoint. And Power Platform.

Honestly, I think they might have more useful data than Google, given that Bing knows more or less the same as Googlebot. Meta doesn't come close, unless you want your LLM to be purely conversational.