Comment by andai

2 years ago

I'm confused. Are you saying that removing low quality inputs from training data doesn't improve a model? (Or conversely, adding high quality inputs.) Or are you saying that we don't yet have the technology to reliably do this at scale?

Again, I can't see how this could possibly be ambiguous from my comment, but: the second one.

We don't (and by all accounts, no one does) have a way to create this kind of dataset at scale in this kind of complex, user-contributed content environment (specifically npm and other places like it).

  • Microsoft's curation techniques for the Phi models remain proprietary. So we can't really criticize or praise their methods, because we don't know what they are. It might be GPT-4. It might be Artificial Artificial Intelligence (a warehouse in Pakistan). But the results speak for themselves.

    The models are a bit janky in my testing (especially prone to leaking test materials, and highly specialized on a narrow domain), but fantastic for their size.

    Intentional "under-generalization" seems like a fairly self-evident approach to making optimal (and economical, on the training side) use of smaller models.

    As for whether it works for a general purpose model, my intuition says that it does (i.e. cutting off the "long tail of knowledge" in favour of better handling of the mainstream, given the limited neurons available).

    As for whether that tech exists, I reckon a simple tf-idf would get you 80% of those wins, but that might be ignorance/arrogance on my part.
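To make the tf-idf suggestion concrete, here's one way it could be applied to curation: score each document by the mean inverse-document-frequency of its vocabulary, then keep the most "mainstream" documents and drop the long tail. This is a minimal stdlib-only sketch of that idea, not any production curation pipeline; the toy corpus, tokenizer, and cutoff are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus: two "mainstream" docs and one with long-tail vocabulary.
docs = [
    "install the package and run the tests",
    "run the package tests after install",
    "zxq obscure frobnicator quux gizmo",
]

def tokenize(text):
    # Deliberately naive tokenizer for the sketch.
    return text.lower().split()

# Document frequency: in how many docs each term appears.
df = Counter()
for d in docs:
    df.update(set(tokenize(d)))

n = len(docs)

def tail_score(doc):
    """Mean idf of the doc's terms.

    Low score = common, mainstream vocabulary;
    high score = rare, long-tail vocabulary.
    """
    terms = tokenize(doc)
    idfs = [math.log(n / df[t]) for t in terms]
    return sum(idfs) / len(idfs)

# Rank by score and keep the most mainstream documents,
# cutting off the long tail (here: dropping the worst one).
ranked = sorted(docs, key=tail_score)
kept = ranked[: n - 1]
```

In this toy run the nonsense document gets the highest score and is dropped, while the two ordinary ones survive. A real filter would obviously need a much better tokenizer and a threshold tuned on held-out data, but it illustrates why a crude frequency statistic might capture a decent chunk of the "drop the weird stuff" wins.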