Comment by wokwokwok

2 years ago

> why would anyone think that this would affect anything, like AI? Why would anyone train on noname package, that noone uses?

…almost certainly for the same reason that any “train AI using only good data, reduce hallucinations!” suggestion is in the “daydream” rather than “great idea” category.

Creating high quality filtered datasets is enormously more time consuming and expensive than just dumping in everything you can get your hands on.

It seems obvious to just ignore packages that are obviously unused spam, but tl;dr: no one pours spam into npm unless there’s some kind of benefit from it; people accidentally installing it, it getting mixed into the dependency tree of legit packages, etc.
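To make the point concrete: the “obvious” filter is some usage heuristic, and it’s exactly the kind of gate a motivated spammer games. A minimal sketch (hypothetical field names like `weekly_downloads` and `dependents`, not any real registry API):

```python
# Naive, illustrative heuristic for dropping "unused spam" packages from a
# training corpus. The metadata fields here are assumptions for the sketch.

def looks_legit(pkg: dict) -> bool:
    """Keep a package if it appears to actually be used."""
    return pkg.get("weekly_downloads", 0) > 1000 or pkg.get("dependents", 0) > 5

packages = [
    {"name": "popular-pkg", "weekly_downloads": 50_000, "dependents": 120},
    {"name": "obvious-spam", "weekly_downloads": 3, "dependents": 0},
    # A spammer can script downloads of their own package, which is why raw
    # download counts aren't a trustworthy signal on their own.
    {"name": "inflated-spam", "weekly_downloads": 9_000, "dependents": 0},
]

kept = [p["name"] for p in packages if looks_legit(p)]
print(kept)  # the download-inflated spam package slips through
```

The filter drops the obvious junk but happily keeps the package whose metrics were inflated, which is the whole problem with cheap signals against malicious actors.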

It’s more likely that the successful folk doing this aren’t being caught, and the ones being caught are “me too” idiots. Or the spam is working, and people are (for whatever incomprehensible reason) actually using at least some of the packages.

Tl;dr: if dependency auditing and supply chain attacks were trivial to solve, they wouldn’t be a problem.

…but given that we continue to see these issues endlessly, you can assume it’s probably more difficult to solve than it trivially appears.

Daydream? It worked for Phi.

  • This is such a low effort, insincere comment I can barely be bothered to respond to it… but tl;dr: no, it didn’t.

    If it were easy, people would have done it. It’s not easy. Phi is not a state of the art model. It does not perform significantly better than, or even on par with, larger models.

    Yes, I’ve read the tech reports and used it. No, I don’t believe it has any meaningful bearing on the problem explicitly in question here, which I posit, again, is basically unsolvable:

    Given a large user-contributed repository of code (npm), it’s very hard to distinguish “good” from “bad” quality at scale when you have malicious actors.

    …I mean, it’s not impossible with enough time and effort, I suppose, but if Microsoft, who own npm, have a good way of filtering out bad content for their language models, you’ve really got to ask why the duck they’re only using it for their language models and, you know, not to unduck npm…

    • I'm confused. Are you saying that removing low quality inputs from training data doesn't improve a model? (Or conversely, adding high quality inputs.) Or are you saying that we don't yet have the technology to reliably do this at scale?
