Once your video is out in the wild, there is currently no reliable way to discern whether it was AI-generated. All content posted to public forums will have this problem.
Training future models without experiencing model collapse will thus require one of the following:

1) Paying for novel content to be created (they will never do this, as they aren't even licensing the content they currently train on).
2) Using something like mTurk to have humans flag AI content in datasets prior to training (probably won't scale; a rough sketch follows this list).
3) Going after private sources of data via automated infiltration of private forums such as Discord servers, WhatsApp groups, and eventually private conversations.
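A minimal sketch of what option 2 might look like, assuming a majority-vote scheme; `get_human_labels` is a hypothetical stand-in for a real crowdsourcing pipeline (it returns random votes here purely so the sketch runs), and the vote threshold is my assumption, not anything any lab has described:

```python
# Sketch: filter a training corpus by crowdworker majority vote.
# get_human_labels() is a hypothetical stand-in for an mTurk-style
# pipeline; it returns random votes here so the example is runnable.

import random
from collections import Counter

def get_human_labels(sample: str, num_workers: int) -> list[str]:
    # Hypothetical: a real version would dispatch `sample` to crowd
    # workers and collect their "human" / "ai" judgments.
    return [random.choice(["human", "ai"]) for _ in range(num_workers)]

def filter_corpus(samples: list[str], num_workers: int = 5) -> list[str]:
    """Keep only samples a strict majority of annotators judged human-written."""
    kept = []
    for sample in samples:
        votes = Counter(get_human_labels(sample, num_workers))
        if votes["human"] > num_workers / 2:
            kept.append(sample)
    return kept

print(filter_corpus(["post one", "post two", "post three"]))
```

The scaling problem is visible in the arithmetic: at a few cents per judgment and five judgments per sample, filtering a web-scale corpus of billions of documents runs to hundreds of millions of dollars.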
There is the web of trust. If you really trust a person who says their stuff isn't AI, that's probably the most reliable way of knowing. For example, I have a few friends whose stuff I know isn't AI-edited, because they hate it too. Of course there's no 100% certainty, but it's as certain as knowing that they're your friend, at least.
But the question is about whether AI can continue to be trained on these datasets. How are scrapers going to quantify trust? (A sketch of one conceivable approach follows.)
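One conceivable answer (entirely hypothetical; this is not something scrapers are known to do) is PGP-style trust propagation: seed a handful of directly trusted sources, let trust decay by a fixed factor per endorsement hop, and weight each source's content accordingly. The endorsement graph, seed scores, and decay factor below are all invented for illustration:

```python
# Sketch: propagate trust through an endorsement graph, PGP-style.
# Sources the scraper trusts directly seed the scores; each hop along
# an endorsement multiplies trust by `decay`. All inputs are invented.

from collections import deque

def propagate_trust(endorsements: dict[str, list[str]],
                    seeds: dict[str, float],
                    decay: float = 0.5) -> dict[str, float]:
    """Breadth-first propagation; a source keeps the highest trust
    reachable from any seed."""
    trust = dict(seeds)
    queue = deque(seeds.items())
    while queue:
        source, score = queue.popleft()
        for endorsed in endorsements.get(source, []):
            candidate = score * decay
            if candidate > trust.get(endorsed, 0.0):
                trust[endorsed] = candidate
                queue.append((endorsed, candidate))
    return trust

# Example: the scraper fully trusts "alice"; alice vouches for bob,
# and bob vouches for carol.
graph = {"alice": ["bob"], "bob": ["carol"]}
print(propagate_trust(graph, seeds={"alice": 1.0}))
# -> {'alice': 1.0, 'bob': 0.5, 'carol': 0.25}
```

Whether anyone would build or maintain such a graph at web scale is exactly the open question.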
E: Never mind, I didn’t read the OP. I had assumed it was to do with identifying sources of uncontaminated content for the purposes of training models.