Comment by gojomo
15 years ago
Some weaknesses of this algorithm are:
(1) Wall-clock hours penalize an article even if no one is reading (overnight, for example). A time denominated in ticks of actual activity (such as views of the 'new' page, or even upvotes-to-all-submissions) might address this.
(2) An article that misses its audience first time through -- perhaps due to (1) or a bad headline -- may never recover, even with a later flurry of votes far beyond what new submissions are getting.
Without checking the exact numbers, consider a contrived example: Article A is submitted at midnight and 3 votes trickle in until 8am. Then at 8am article B is submitted. Over the next hour, B gets 6 votes and A gets 9 votes. (Perhaps many of those are duplicate-submissions that get turned into upvotes.) A has double the total votes, and 50% more votes even in the shared hour, but still may never rank above B, because of the drag of its first 8 hours.
(I think you'd need to timestamp each vote for an improved decay function.)
Wall-clock hours penalize an article even if no one is reading (overnight, for example)
I'd be interested to know what the hourly fluctuation for HN is actually like, on account of having readers all over the world.
I'm in Australia, so your example "submitted at midnight" California time[1] means submitted at 6pm my time. Also 8am London time, 11am Moscow time. :).
[1] I'm going to go ahead and assume you're in California. ;)
I'm in California, usually, but have often observed HN through the California night -- either because of my own odd online hours, or trips to distant time-zones.
It's true there's never total quiescence, but the pace of actions changes by a noticeable factor. (Without going to the data, I'd guess 5X from trough to peak over a day's cycle, and a somewhat smaller weekend-to-weekday difference. Holidays and nice bay area weather also play a factor.)
I've found optimal submission time to generally be midday london time, you catch the european lunch-time traffic and the US wake-up/get-into-work traffic. I don't think traffic from anywhere else is heavy enough to matter.
I'm in Iowa, so it's central time. But it's not uncommon to see 2 or 3 hour old stories on the front page by 10:30 or so.
I believe reddit has historically done something similar. Rather than having a given article's ranking decay over time, they simply hand out better ranks to newer articles, leading to continual inflation. A naive example of this sort of ranking would be (timestamp in hours + upvotes - downvotes).
This would make your (excellent) idea easy to implement, because you could just use an autoincrement key as the timestamp, ignoring any sort of decay calculations.
This inflation based approach also plays nicely with database indexes.
(1) "overnight" not exist in the Internet era. (from Spain)
You are right of course, but I suspect the vast majority of HN users are not only in the USA but specifically in California. So even though people are using the site around the clock, there is probably a lot more traffic during the "awake hours" of California.
> I suspect the vast majority of HN users are not only in the USA but specifically in California
I suspect that you have a nonstandard definition of "vast majority." I'd be surprised if even 1/4 of HN users are in California. There's a whole wide world out there! :)
4 replies →
The thing I like about HN's ranking algorithm is how simple it is. Implementing something more sophisticated would require more work and probably much better servers.
Recently, I have been implementing a ranking algorithm and I started in the wrong direction by taking all kinds of things into consideration. This is dangerous because you spend a lot of time on something that is mostly irrelevant.
The bottom line is that HN's algorithm works pretty well for the majority of cases. There are some edges, but solving them would not be trivial and would require a lot more work.
I'd agree simple is better.
Of the two suggestions, replacing hours with artificial ticks adds very little complexity: it's the same formula, with one value replaced, and the accompanying factor adjusted. (The ticks might be submission-count, upvote-count, visit-count, or anything else trivially tallied -- it may not make much difference, except that over time greater activity could require adjusting the tick-deflator-factor.)
visit count is not trivial. A db write for every page view is heavy.
2 replies →
I do agree that that the Wall-clock hours argument is somewhat weak due to the international crowd behind HN, but it would be interesting to see how weekends or really more general traffic patterns such as what may happen on holidays attribute to scores. I know that in these cases, I generally diverge from my normal HN-checking patterns.
(I think you'd need to timestamp each vote for an improved decay function.)
Well, somewhat. You'd need a timestamped sufficient statistic, but not necessarily each vote.
Consider ACO. Given +1:
HN does't really experience a night, since people from all over the planet or on the site. But maybe it would be better to divide by a number of impressions wighted with something.
Well, you can fix (1) by simply incrementing T not by actual time but just be the total # of votes/submissions since the beginning.