Comment by scrollaway
5 years ago
Holy crap
https://github.com/MattIPv4/hacktoberfest-data#diving-in-pul...
> "Of the 483,127 PRs submitted during Hacktoberfest, only 23,299 (4.82%) were identified as spam"
That is insanely high noise for hacktoberfest, especially when tagging spam "correctly" takes a non-insignificant amount of effort from the maintainers.
I was ready to rant about this post but … no, wow, this is very much warranted.
That's only if the maintainers knew to even flag them as spam or invalid, and didn't just close them. I'd fully expect more of the latter than tagging.
Here you go, you're an open source maintainer already in an often thankless role, take some extra work, with a side dollop of extra work labelling the crap extra work you're getting.
I saw one small unknown repo which had four pages full of PRs which just add "awesome" to the README, and more of that noise. The repo's owner hasn't been active on GitHub in a while, so those are four pages that will likely never get reported. And this is just the first day.
An important missing datapoint (as noted elsewhere on this thread) could be how many PRs were merged as a result of the 2019 Hacktoberfest; that'd help get a sense of the positive value contributed. I'm surprised it's not mentioned in the repository you link to.
Anecdotally, I'm getting a lot of obvious spam on my repos that are >4 years old and less spam on other repos. Also, my "org account" (from before orgs were a thing / fully featured) got more spam than my personal account (which I don't use often these days but does at least book a commit or two most months).
So, the spammers are probably intentionally targeting repos where folks aren't likely to bother marking as spam.
On a separate note, I do not understand why people care so much about mid-quality t-shirts...
> I do not understand why people care so much about mid-quality t-shirts...
Most likely a mix of "it's free", "it's easy", and "it looks nice on a CV". Perfect storm for a lot of students and juniors to spend an hour of their time on, without bothering to spend the extra two to actually make the contributions useful.
I found the T-Shirt really nice and comfortable. Sure, it's not the highest grade, but it's a damn nice thing to get for free!
Looks like that repo was made pretty much when the last hacktoberfest ended, which would be too early to tell IMO. But yes indeed, good datapoint to look at a year after (especially since PRs that take a longer time to merge are, I would think, likely to be higher quality PRs with a higher rate of dev conversion into contributor)
Ah, thanks, yep - that makes sense :)
> especially when tagging spam "correctly" takes a non-insignificant amount of effort from the maintainers.
No kidding. GitHub requires you to wait several minutes (they don't say how much time exactly, but in my experience it's definitely > 2 min) between spam reports. So you can't just go through your spam PRs in the morning and report them easily; you need to leave the browser tabs open and come back from time to time to submit spam reports. Not reporting is much easier, so the real figures are certainly higher.
EDIT: Ah, they don't even mean reporting spam to GitHub. Maintainers need to "opt in" to Hacktoberfest's own rules and change their own PR labeling system according to Hacktoberfest's wishes. What a pile of nonsense.
Include the full sentence.
> Of the 483,127 PRs submitted during Hacktoberfest, only 23,299 (4.82%) were identified as spam, with 19,587 (84.07%) of those being in a repository that the Hacktoberfest team excluded from the competition for not following the shared values and 3,712 (15.93%) being labeled as "invalid" by project maintainers.
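For anyone who wants to check the arithmetic in that sentence, here is a quick sketch in Python. The counts are the ones quoted above; the variable names and the `pct` helper are my own, not anything from the linked repo. Note that the 4.82% is a share of all PRs, while the 84.07% and 15.93% are shares of the spam PRs only.

```python
# Counts as quoted from the hacktoberfest-data README above.
total_prs = 483_127
spam_prs = 23_299
spam_in_excluded_repos = 19_587   # repos excluded by the Hacktoberfest team
spam_labeled_invalid = 3_712      # labeled "invalid" by project maintainers


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals (hypothetical helper)."""
    return round(100 * part / whole, 2)


# Share of all PRs that were identified as spam.
spam_share = pct(spam_prs, total_prs)

# Shares of the spam PRs (not of all PRs) by how they were identified.
excluded_share = pct(spam_in_excluded_repos, spam_prs)
invalid_share = pct(spam_labeled_invalid, spam_prs)

# Not stated in the quote: maintainer-labeled PRs as a share of all PRs.
invalid_share_of_all = pct(spam_labeled_invalid, total_prs)

print(spam_share, excluded_share, invalid_share, invalid_share_of_all)
```

The last value is the one that matters for the undercounting argument: it is the only part of the spam figure that depends on maintainers actually applying the label.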
Spam submissions were sent to spam repositories; that has been known for years.
From the article, they act like it's their burden alone:
> Their solution, per their FAQ, is to put the burden solely on the shoulders of maintainers.
But, as a contributor, I see plenty of links for me all over Hacktoberfest to report repositories that are also trying to skirt the system.
One of the repos that I manage actively participates in Hacktoberfest and I'm finding out about this "invalid" label counting now, via this article.
No, I didn't read the FAQ, but I imagine neither did a lot of maintainers, especially those that don't participate. The undercounting must be massive.
I mean this is subjective of course, but I think we can agree most reasonable people would say less than 5% is not "insanely high"?