Comment by renegat0x0
2 years ago
I am really curious whether that matters.
Package managers often come with a rating system. npmjs has weekly downloads, pull requests, and other popularity scores.
I am a layman in AI, but why would anyone think that this would affect anything, like AI? Why would anyone train on a no-name package that no one uses?
Spam packages can have higher-than-zero stats, but that also makes them vulnerable to a sweeping removal of all potential spam packages, since they are connected to each other, etc.
No credible company will use a no-name spam package; they will verify its contents. That is at least what happened at all the companies I have worked for.
> why would anyone think that this would affect anything, like AI? Why would anyone train on a no-name package that no one uses?
…almost certainly for the same reason that any “train AI using only good data, reduce hallucinations!” suggestion is in the “daydream” rather than “great idea” category.
Creating high-quality filtered datasets is enormously more time-consuming and expensive than just dumping in everything you can get your hands on.
It seems obvious to ignore packages that are obviously unused and spam, but TL;DR: no idiot is going to be pouring spam into npm unless there’s some kind of benefit from it: people accidentally using it, mixing it into the dependency tree of legit packages, etc.
It’s more likely that the successful folk doing this aren’t being caught, and the ones being caught are “me too” idiots. Or the spam is working and people are (for whatever incomprehensible reason) actually using at least some of the packages.
TL;DR: if dependency auditing and supply chain attacks were trivial to solve, they wouldn’t be a problem.
…but based on the fact that we continue endlessly to see these issues, you can assume that it’s probably more difficult to solve than it appears at first glance.
Daydream? It worked for Phi.
This is such a low-effort, insincere comment I can barely be bothered to respond to it… but TL;DR: no, it didn’t.
If it was easy, people would have done it. It’s not easy. Phi is not a state-of-the-art model. It does not perform significantly better than, or even on par with, larger models.
Yes, I’ve read the tech reports and used it. No, I don’t believe it has any kind of meaningful bearing on the problem explicitly in question here, which I posit, again, is basically unsolvable:
Given a large user-contributed repository of code (npm), it’s very hard to determine “good” from “bad” in terms of quality at scale, when you have malicious actors.
…I mean, it’s not impossible with enough time and effort I suppose, but if Microsoft, who own npm, have a good way of filtering out bad content on it for their language models, you’ve really got to ask why the duck they’re using it for their language models, and you know, not to unduck npm…
If you look at the purpose of this Tea protocol, it is exactly to provide a chain of credibility. However, by connecting ranking with monetization, tea has created perverse incentives, leading spammers to pump up their tea ranking by linking and starring packages in circles. Their goal is to make a package look like it’s highly used.
Luckily, nobody thinks that tea ranking matters, except for the spammers themselves.
They are no doubt attempting to poke at other, more established metrics as well. This could eventually fool an AI or even humans.
> Why would anyone train on a no-name package that no one uses?
Not that I disagree, but in the same line of thinking: Why would anyone train an LLM on some random blog written in broken English? Why would you train an LLM on the absolute dumpster fire that is Reddit comments? Or why are my GitHub repos, with half-finished projects and horribly insecure coding practises, being used as input to Copilot? Yet here we are, LLMs writing broken, insecure code (just like a real person) and telling people to eat rocks.
Agreed! And not only in companies: I have never seen anyone download a package without looking at its GitHub stars.
The real fun would happen if the next incentive is to publish a package and get GitHub stars for that repo :-)