
Comment by pbalau

1 day ago

> keep the commons clean [from the second link]

A glance at r/python will show that almost every week there is a new PyPI package generated by AI, with dubious utility.

I did some quick research using bigquery-public-data.pypi.distribution_metadata: out of 844,719 packages, 126,527 have only one release, almost 15%.
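
For anyone who wants to reproduce that count, here is a rough sketch of the kind of query involved, assuming the table's standard `name` and `version` columns; this is illustrative, not necessarily the exact query used:

```python
# Illustrative sketch: count PyPI packages with only a single release,
# using the public BigQuery dataset mentioned above. Assumes the table's
# standard `name` and `version` columns; credentials/project come from
# your local gcloud setup.
from google.cloud import bigquery  # pip install google-cloud-bigquery

QUERY = """
SELECT
  COUNTIF(n_releases = 1) AS single_release_packages,
  COUNT(*)                AS total_packages
FROM (
  SELECT name, COUNT(DISTINCT version) AS n_releases
  FROM `bigquery-public-data.pypi.distribution_metadata`
  GROUP BY name
) AS per_package
"""

client = bigquery.Client()                      # default credentials and project
row = next(iter(client.query(QUERY).result()))  # single summary row
print(f"{row.single_release_packages} of {row.total_packages} packages "
      f"({row.single_release_packages / row.total_packages:.1%}) have only one release")
```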

While it is not unfathomable that a chunk of those single-release packages genuinely only needed one release and/or were written by hand, the number is too high. And PyPI is struggling for resources.

I wonder how much crap there is on GitHub, and I think that is an even larger issue, with new versions of LLMs being trained on crap generated by older versions.

Storage is relatively cheap. Packages with only one release and little usage in the wild are a rounding error in cost. A few years ago, PyPI was already going through the equivalent of over a million dollars in CDN traffic per month. Storing a million small dead packages is not worth the concern.

  • While my research was very shallow, the issue is the practice itself. And I didn't look at how large those packages are.

    It might not be a storage problem right now, but the practice of publishing crap is dangerous, because it can easily be abused. I think it is very easy to publish a lot of very heavy packages via PyPI.

Same on r/rust. Post after post with a new project that does something groundbreaking.

Until you look at the source code and notice it's all held together with duct tape.