Comment by eqvinox

1 day ago

The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.

Germans (because of course) have a word for this: "Datensparsamkeit". Being frugal with your data.

> Germans (because of course)

I don't know if it's for the reason you imply. In the 1970s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reflection (Bewältigung) on WW2.

  • I took the "because of course" to be about having a word for everything - a stereotypical idea about the German language.

    • My understanding was that it was more about how words can be concatenated into new words in German, which is less a stereotype than a misunderstanding of fact. I.e., you wouldn't think much about something like enjoyable-comeuppance, but schadenfreude looks more impressive without the hyphen.

    • There's also the other implication that the (East) Germans were Soviet just 35 years ago.

      But yes. We Americans know Germans more for their silly big words. But statements like that can be misinterpreted, as the Germans' view of themselves doesn't quite match the American stereotypes.

    • That's like saying that English (because of course) is able to describe the concept by a combination of words.

  • The Stasi would be the obvious cultural context.

    In the US of course the government buys this sort of information legally from corporations.

    • > The Stasi would be the obvious cultural context.

      There is also the rather famous example of how earlier census data was used in the 1940s.

      Once the government has your data, they have it. The next generation of representatives may not follow all the same rules and norms.

    • The West-German debate in the 70s came from the realization that the sheer size of the Holocaust/Shoah was in no small degree due to bureaucratic record keeping. Storing someone's ethnicity is potentially dangerous for that person.

  • Germany resisted Google Street View until 2023, which was something I thought was very impressive.

  • Love it, also love how Datenschatten can also imply that it disappears when someone shines light on it

    • If only the data of our 20-year-old selves could be so ephemeral…

      Who doesn’t want that old post, from when they were shit-faced outside a bar in Nashville, to go extinct forever now that they are in mid-life and a “respectable” member of society?

  • Yeah, so Germany had a ton of secret police files and of course learned very well what happens when a bunch of people start collecting dossiers.

    So yeah, of course they've developed that type of distrust. Americans should have too, after the 1950s-60s paranoia about the Red Scare, Black people, etc. Instead they just spent a few decades building an anti-social state.

I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”

  • 10+ years ago companies were hoovering up data for ML, trying to find correlations in high-dimensional data. Mostly the results were garbage, but occasionally you hit on a real, unexpected phenomenon.

    Nowadays you just throw all the data into a black box and believe whatever it says blindly.

  • Were you not around for the Big Data heyday a decade ago?

    • Hell you mean a decade ago? I still see businesses running losses left, right, and center saying that they're gonna monetize user data, any day now.

      Relatedly, "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.

  • Data hoarding predates LLMs. There were other machine learning methods which also needed data for training.

    • “Before LLMs there was _____”

      I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.

      Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.

Data can never be stolen, because it is not a physical thing. Data can be copied, and it can be erased; sometimes both happen at the same time. Data can be lost, which is what happens when its last existing copy is erased.

Or you could put it in a box with no connection to the internet.

Introducing… The Hooli Box!

Data that is publicly available also can't be stolen or leaked. Nobody can steal Mozilla's common voice dataset.

> The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.

Except no company is learning this lesson.

The enterprise threat model includes "our own users", and the modus operandi is to maintain as much information on that threat as possible.

Seems a bit like blaming the victim? Your voice (like DNA) is kind of ambient data that's hard to hide.