Comment by b1085436
2 days ago
I genuinely don't understand the "permissioned data" assumption. Presumably, the current models that were trained by illegitimately scraping vastly larger sources will always have the upper hand in raw power (admittedly at the cost of regurgitating evil stuff too), because they have simply absorbed far more diverse data during training. So models trained only on ethical datasets will not be able to compete, unless they too rest on a common base of "foundational sin" data and merely add the ethical datasets as a layer to cover the rotten roots.
Is it really possible to start training from scratch at this stage and compete with the existing models, using only ethical datasets? Hasn't it been established that without the stolen data, those models could not exist or compete?
Personally I would rather use a 'bad' AI that's trained ethically and runs locally than a 'good' AI trained on stolen data that requires me to surrender my thoughts to the cloud.
Whether or not it's possible to compete, I guess we'll see, but I'm hopeful and appreciative that Mozilla is trying. I'm getting tired of big tech pushing everyone to hand over even more unhinged amounts of data than they're already taking from us.
I strongly suspect that it is absolutely impossible to have an even remotely usable or useful "AI" trained on tiny datasets, and that instead of training only on ethical data, companies that want to sound ethical will apply an extra post-training step to make dirty foundation models behave as if they'd only learned from ethical sources. I'd hate for this to become the norm, but I fear it is logically what announcements like this one really mean. The difference in scale is simply too vast: taking whatever you want from the entire internet, versus hand-curated datasets with explicit authorisation that are free to use. It's like trying to make a grain of sand orbit a marble in the playground to mimic the Moon around the Earth; it won't work.
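To be concrete about what that "extra post-training step" usually amounts to: a supervised fine-tuning pass over a small permissioned corpus, sitting on top of a base model that already absorbed the scraped web. Here is a minimal sketch, assuming a Hugging Face transformers stack; the model name ("gpt2" as a stand-in for any pre-trained foundation model), the example texts, and the hyperparameters are all placeholders of mine, not anything Mozilla has announced:

```python
# Minimal sketch of "ethical post-training": fine-tune an existing
# (possibly web-scraped) foundation model on a small permissioned corpus.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for whatever dirty base model a company starts from
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical permissioned texts; in practice, a hand-curated opt-in dataset.
permissioned_texts = [
    "Example document contributed with the author's explicit permission.",
    "Another opt-in text, licensed for training.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):  # a few passes over the tiny curated set
    for text in permissioned_texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        # Standard causal-LM objective: the labels are the input ids themselves,
        # so the model keeps doing next-token prediction, now on permissioned data.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Nothing in that loop removes what the base model already learned from the scraped data; it only nudges the model's surface behaviour. Which is exactly why it works as an "ethical layer", and exactly why it doesn't make the roots any less rotten.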