Comment by andai

2 years ago

I'm confused. Are you saying that removing low quality inputs from training data doesn't improve a model? (Or conversely, adding high quality inputs.) Or are you saying that we don't yet have the technology to reliably do this at scale?

Again, I can't see how this could possibly be ambiguous from my comment, but: the second one.

We don't (and by all accounts, no one does) have a way to create this kind of dataset at scale in this kind of complex, user-contributed content environment (specifically npm and other places like it).

  • Microsoft's curation techniques for the Phi models remain proprietary. So we can't really criticize or praise their methods, because we don't know what they are. It might be GPT-4. It might be Artificial Artificial Intelligence (a warehouse in Pakistan). But the results speak for themselves.

    The models are a bit janky in my testing (especially prone to leaking test materials, and highly specialized on a narrow domain), but fantastic for their size.

    Intentional "under-generalization" seems like a fairly self-evident approach to making optimal (and economical, on the training side) use of smaller models.

    As for whether it works for a general purpose model, my intuition says that it does (i.e. cutting off the "long tail of knowledge" in favour of better handling of the mainstream, given the limited neurons available).

    As for whether that tech exists, I reckon a simple tf-idf would get you 80% of those wins, but that might be ignorance/arrogance on my part.
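To make the tf-idf suggestion concrete, here's one way it could be applied to curation: score each document by the mean inverse-document-frequency of its vocabulary, then keep the most "mainstream" documents and drop the long tail. This is a minimal stdlib-only sketch of that idea, not any production curation pipeline; the toy corpus, tokenizer, and cutoff are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus: two "mainstream" docs and one with long-tail vocabulary.
docs = [
    "install the package and run the tests",
    "run the package tests after install",
    "zxq obscure frobnicator quux gizmo",
]

def tokenize(text):
    # Deliberately naive tokenizer for the sketch.
    return text.lower().split()

# Document frequency: in how many docs each term appears.
df = Counter()
for d in docs:
    df.update(set(tokenize(d)))

n = len(docs)

def tail_score(doc):
    """Mean idf of the doc's terms.

    Low score = common, mainstream vocabulary;
    high score = rare, long-tail vocabulary.
    """
    terms = tokenize(doc)
    idfs = [math.log(n / df[t]) for t in terms]
    return sum(idfs) / len(idfs)

# Rank by score and keep the most mainstream documents,
# cutting off the long tail (here: dropping the worst one).
ranked = sorted(docs, key=tail_score)
kept = ranked[: n - 1]
```

In this toy run the nonsense document gets the highest score and is dropped, while the two ordinary ones survive. A real filter would obviously need a much better tokenizer and a threshold tuned on held-out data, but it illustrates why a crude frequency statistic might capture a decent chunk of the "drop the weird stuff" wins.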