Comment by harporoeder
5 years ago
For some context, it's worth quoting directly from the published statistics available at (1). Although, if this is based on maintainers manually tagging PRs as spam, it is probably an underestimate.
Of the 483,127 PRs submitted during Hacktoberfest, only 23,299 (4.82%) were identified as spam, with 19,587 (84.07%) of those being in a repository that the Hacktoberfest team excluded from the competition for not following the shared values and 3,712 (15.93%) being labeled as "invalid" by project maintainers.
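As a quick sanity check, the quoted percentages can be recomputed from the four raw counts. This is only a sketch: the variable names and the `pct` helper are made up for illustration, and only the numbers themselves come from the quote.

```javascript
// Counts copied from the quoted statistics; names are illustrative only.
const totalPRs = 483127;
const spamPRs = 23299;
const excludedRepoPRs = 19587;
const invalidLabelPRs = 3712;

// Hypothetical helper: percentage of `whole`, rounded to two decimal places.
function pct(part, whole) {
  return Math.round((part / whole) * 10000) / 100;
}

const spamShare = pct(spamPRs, totalPRs);            // share of all PRs
const excludedShare = pct(excludedRepoPRs, spamPRs); // share of spam PRs
const invalidShare = pct(invalidLabelPRs, spamPRs);  // share of spam PRs

// The two sub-counts should add up to the spam total.
const partsAddUp = excludedRepoPRs + invalidLabelPRs === spamPRs;

console.log({ spamShare, excludedShare, invalidShare, partsAddUp });
```

Note the different denominators: the 4.82% is taken over all PRs, while the 84.07% and 15.93% are taken over the spam PRs only.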
They literally checked for a label with text "invalid" and that's it. The OP, for example, used the label "spam" so it doesn't count. Simply closing the PR without merging or commenting doesn't count. Any other text label doesn't count.
So yeah, I suspect it's massively undercounting.
Their FAQ (linked from the submitted article) says:
>[...] please give them an `invalid` or `spam` label and close them. Pull requests that contain a label with the word `invalid` or `spam` won’t be counted toward Hacktoberfest.
Since project maintainers don't have to opt in to Hacktoberfest, there's no reason for them to know that the FAQ exists. Most maintainers are unaware of what's going on and will just close the spammy PRs without tagging them.
The code in the linked repo with the stats is literally:
>const totalInvalidLabelPRs = await db.collection('pull_requests').find({'labels.name': 'invalid'}).count();
They also mention the label "invalid" multiple times and never the label "spam." So even if a "spam" label disqualifies a PR from counting toward the reward, their stats do not seem to take it into account.
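To make the gap concrete, here is a rough in-memory sketch of the difference between matching only "invalid" and matching either label. The sample records and function names are hypothetical; only the `labels.name` shape is taken from the quoted query, and exact label matching is an assumption (the FAQ wording, "contain a label with the word", might mean something looser).

```javascript
// Hypothetical sample records; the `labels.name` shape follows the quoted query.
const pullRequests = [
  { id: 1, labels: [{ name: 'invalid' }] },
  { id: 2, labels: [{ name: 'spam' }] },
  { id: 3, labels: [{ name: 'bug' }] },
  { id: 4, labels: [] }, // closed without any label
];

// Mirrors the quoted filter: only an exact "invalid" label counts.
function countInvalidOnly(prs) {
  return prs.filter((pr) => pr.labels.some((l) => l.name === 'invalid')).length;
}

// Variant that also counts an exact "spam" label (an assumption, see above).
function countInvalidOrSpam(prs) {
  const flagged = new Set(['invalid', 'spam']);
  return prs.filter((pr) => pr.labels.some((l) => flagged.has(l.name))).length;
}

console.log(countInvalidOnly(pullRequests), countInvalidOrSpam(pullRequests));
```

Neither count picks up PR 4, the one that was closed with no label at all, which is the case the comments above expect to be most common.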
It would seem like a much better idea to say that only PRs that are explicitly confirmed by the project maintainers as being valid will be counted.
Right, that relies on the maintainers knowing that they are "expected" to do the extra work to tag those with special tags, otherwise closed PRs count as "good".
"Only" 24k spam PRs. Face palm
Is "only 4.8% spam PRs" better? The absolute number means nothing.
In a sea of 5B PRs, 24k would look impossibly good.
Does "only 24k MRs had maintainers that bothered to properly mark them as spam per hacktoberfest's guide" sound so "impossibly good" as well?
The absolute number definitely does mean something! It's work created for maintainers. Checking and tagging a PR takes non-zero time. Creating 24,000 invalid PRs is a massive burden on the open source community.
How about data on what % were merged?
Given that it's been a year since last year's event, this should be a fairly clean piece of data at this point. Even longer or more involved valid PRs should have been merged by now.
That's an insane amount of spam. The percentage is irrelevant; look at the real number: we're talking tens of thousands of instances of undue burden on the maintainers of open source packages.
And the number of "nice PRs" is essentially irrelevant here: this is not a zero-sum game, and a thousand good PRs don't cancel out a project getting flooded with bad PRs.
If your event can't prevent substantial abuse of the community you pretend to do this for, you should stop your event and figure out how to do better.
Wow, such an apples-to-oranges comparison. I scanned the page but didn't see anything better, so we should assume that 84.07% of all pull requests submitted during Hacktoberfest are to repositories excluded for not following the shared values. That implies that of the 483,127 PRs, 76,962 are to qualifying repositories, and in fact that 15.93% of ALL valid PRs are spam.
Why would they exclude a repo and then not exclude all PRs from that repo?
This is exactly my point. They listed the top-line, total number of all PRs without any filtering, then when they calculated the "spam ratio" they filtered out "invalid repos" as well as "spam PRs". It's a misleading statistic.