Comment by gorgoiler

5 hours ago

Three things matter when it comes to eating my breakfast sandwich:

1/ Was the pork in my sausage reared on a farm that meets agricultural standards?

2/ Was the food handled safely by the kitchen that cooked my food?

3/ Does the owner of the diner pay kitchen wages in accordance with labor law?

By contrast, I have no idea what went into the models I use, what system prompts have prejudiced them, or whose IP has been exploited in pursuit of my answer.

That’s being charitable, really. In practice the open secret of the AI industry is that the vast majority of training data is, for want of a better word (though it is likely the most precise one), stolen data.

6 comments

amelius 5 hours ago

Probably, yes, but the burden of proof is on us, not them.

I'm already glad some companies have the guts to open their models because proving it for open models is probably a lot easier than for a model behind a service.

wartywhoa23 4 hours ago

The proof is the $stupid-billion infrastructure built and kept up to host mousetraps armed with free cheese made of virtue signalling about doing the right thing and sharing the code with the world for free.

tngranados 4 hours ago

That's a matter of changing a law; it's all up to the people and their representatives. We talk as if everything is set in stone, but if there really is a will, there is a way.

ap99 4 hours ago

What's an example of data that might have been stolen?

devsda 5 hours ago

The media industry loves to quote ridiculous numbers on lost revenue due to piracy etc. Maybe some rough ballpark numbers will get them to do something about this theft.

Can someone put a rough estimate on the potential revenue loss (direct and incidental) from training AI, with an industry-wise breakdown?

gorgoiler 5 hours ago

It’s wrong to stop progress. I just want to know what data went into my model and have access to the same data, the same way we have national libraries of books, with the caveat that I don’t really know how one is supposed to browse petabytes of OpenAI .zips like I browse old books.

If the data is proprietary (e.g. Meta’s stash of FB comments) then I am satisfied to be told it’s private and I can’t see it. If, however, the works were public, then give me a URL if it’s live or a cached copy if it isn’t.