Comment by waldarbeiter

16 hours ago

There is actually a whole lot of research around "use less data", called data pruning. The goal in many cases is to achieve the same performance with less data. For example, [1] received quite some attention in the past.

[1] https://arxiv.org/abs/2206.14486

I clarified my comment: "perhaps researchers have not tried 'use less data'" suggested I might be unaware of this concept, so I changed it to "as if". In fact, "less data" was tried for decades before the first image classifiers actually started working in 2012. My understanding of the paper you are linking to is that it is not a new research paradigm; it is about filtering/pruning out less relevant data that is not needed to improve a particular capability in a deep learning model, and that is absolutely one approach likely to yield smaller, better models on many tasks.
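The filtering/pruning idea can be sketched roughly like this: score each training example by some usefulness metric and keep only the top fraction. This is a toy illustration of the general score-and-keep pattern, not the specific method of the linked paper; the score function here is a made-up stand-in for something like per-example loss under a proxy model.

```python
def prune_dataset(examples, score_fn, keep_fraction=0.5):
    """Keep the highest-scoring fraction of examples (score-and-keep pruning)."""
    scored = sorted(examples, key=score_fn, reverse=True)
    keep_n = max(1, int(len(scored) * keep_fraction))
    return scored[:keep_n]

# Toy data: (input, label) pairs; the score is a placeholder "difficulty" proxy.
data = [(x, x % 3) for x in range(100)]
pruned = prune_dataset(data, score_fn=lambda ex: ex[0] % 7, keep_fraction=0.3)
print(len(pruned))  # 30 examples remain out of 100
```

The interesting research question is the score function itself; a trivial one like this just shrinks the dataset, while a good one shrinks it without losing the capability you care about.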

That does not change the fact that a coding model has to learn a vast number of foundational capabilities that will not be present in a dataset as small as all the Python code ever written. It means much less Python than all the Python ever written will be needed, but many other things will be needed too, in representative quantities.