Comment by gojomo

15 years ago

Some weaknesses of this algorithm are:

(1) Wall-clock hours penalize an article even if no one is reading (overnight, for example). A time denominated in ticks of actual activity (such as views of the 'new' page, or even upvotes-to-all-submissions) might address this.

(2) An article that misses its audience first time through -- perhaps due to (1) or a bad headline -- may never recover, even with a later flurry of votes far beyond what new submissions are getting.

Without checking the exact numbers, consider a contrived example: Article A is submitted at midnight and 3 votes trickle in until 8am. Then at 8am article B is submitted. Over the next hour, B gets 6 votes and A gets 9 votes. (Perhaps many of those are duplicate-submissions that get turned into upvotes.) A has double the total votes, and 50% more votes even in the shared hour, but still may never rank above B, because of the drag of its first 8 hours.

(I think you'd need to timestamp each vote for an improved decay function.)
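To put rough numbers on the contrived example, here is a minimal sketch. It assumes the commonly cited form of the ranking formula, score = (points - 1) / (age_hours + 2)^1.8. The formula, the 1.8 exponent, and the name `rank_score` are assumptions for illustration; none of them is stated in the comment itself.

```python
GRAVITY = 1.8  # assumed exponent; not stated in the comment above


def rank_score(points, age_hours, gravity=GRAVITY):
    """Assumed HN-style score: (points - 1) / (age_hours + 2) ** gravity."""
    return (points - 1) / (age_hours + 2) ** gravity


# Snapshot at 9am in the contrived example.
score_a = rank_score(points=3 + 9, age_hours=9)  # A: submitted at midnight
score_b = rank_score(points=6, age_hours=1)      # B: submitted at 8am

print(f"A: {score_a:.3f}  B: {score_b:.3f}")
```

Under these assumptions A scores roughly 0.15 and B roughly 0.69 at 9am, so B ranks higher even though A has twice the votes.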

22 comments


angusgr 15 years ago

> Wall-clock hours penalize an article even if no one is reading (overnight, for example)

I'd be interested to know what the hourly fluctuation for HN is actually like, on account of having readers all over the world.

I'm in Australia, so your example "submitted at midnight" California time[1] means submitted at 6pm my time. Also 8am London time, 11am Moscow time. :).

[1] I'm going to go ahead and assume you're in California. ;)

gojomo 15 years ago
I'm in California, usually, but have often observed HN through the California night -- either because of my own odd online hours, or trips to distant time-zones.
It's true there's never total quiescence, but the pace of actions changes by a noticeable factor. (Without going to the data, I'd guess 5X from trough to peak over a day's cycle, and a somewhat smaller weekend-to-weekday difference. Holidays and nice Bay Area weather also play a part.)
- ig1 15 years ago
  
  I've found the optimal submission time is generally midday London time: you catch the European lunch-time traffic and the US wake-up/get-into-work traffic. I don't think traffic from anywhere else is heavy enough to matter.
noahc 15 years ago

I'm in Iowa, so it's central time. But it's not uncommon to see 2 or 3 hour old stories on the front page by 10:30 or so.

fizx 15 years ago

I believe reddit has historically done something similar. Rather than having a given article's ranking decay over time, they simply hand out better ranks to newer articles, leading to continual inflation. A naive example of this sort of ranking would be (timestamp in hours + upvotes - downvotes).

This would make your (excellent) idea easy to implement, because you could just use an autoincrement key as the timestamp, ignoring any sort of decay calculations.
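A minimal sketch of that naive inflation ranking, using an autoincrement key in place of the timestamp. The `Story` class, its field names, and the sample values are hypothetical; this illustrates the idea only and is not reddit's actual code.

```python
from dataclasses import dataclass


@dataclass
class Story:
    seq: int            # hypothetical autoincrement key, standing in for the timestamp
    upvotes: int = 0
    downvotes: int = 0


def inflation_score(story):
    """Naive inflation rank: newer stories start higher, votes add on top."""
    return story.seq + story.upvotes - story.downvotes


stories = [
    Story(seq=100, upvotes=9),
    Story(seq=105, upvotes=2),
    Story(seq=107, downvotes=1),
]
ranked = sorted(stories, key=inflation_score, reverse=True)
print([s.seq for s in ranked])
```

Because nothing decays, a story's stored score only changes when it receives a vote; older stories sink simply because newer ones start from a higher base.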

jasonwatkinspdx 15 years ago

This inflation based approach also plays nicely with database indexes.

pmarin 15 years ago

(1) "Overnight" does not exist in the Internet era. (From Spain.)

cryptoz 15 years ago
You are right of course, but I suspect the vast majority of HN users are not only in the USA but specifically in California. So even though people are using the site around the clock, there is probably a lot more traffic during the "awake hours" of California.
- BrandonM 15 years ago
  
  > I suspect the vast majority of HN users are not only in the USA but specifically in California
  I suspect that you have a nonstandard definition of "vast majority." I'd be surprised if even 1/4 of HN users are in California. There's a whole wide world out there! :)
  

amix 15 years ago

The thing I like about HN's ranking algorithm is how simple it is. Implementing something more sophisticated would require more work and probably much better servers.

Recently, I have been implementing a ranking algorithm and I started in the wrong direction by taking all kinds of things into consideration. This is dangerous because you spend a lot of time on something that is mostly irrelevant.

The bottom line is that HN's algorithm works pretty well for the majority of cases. There are some edges, but solving them would not be trivial and would require a lot more work.

gojomo 15 years ago
I'd agree simple is better.
Of the two suggestions, replacing hours with artificial ticks adds very little complexity: it's the same formula, with one value replaced, and the accompanying factor adjusted. (The ticks might be submission-count, upvote-count, visit-count, or anything else trivially tallied -- it may not make much difference, except that over time greater activity could require adjusting the tick-deflator-factor.)
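A minimal sketch of the tick substitution, assuming the same shape as the commonly cited hours-based formula, (points - 1) / (age + 2)^1.8. The formula, the function name `tick_score`, and the `ticks_per_hour` deflator are all illustrative assumptions.

```python
def tick_score(points, age_ticks, ticks_per_hour, gravity=1.8):
    """Assumed hours-style formula with age measured in activity ticks.

    ticks_per_hour is a hypothetical deflator: roughly how many ticks
    (submissions, upvotes, ...) a typical busy hour produces.
    """
    effective_age = age_ticks / ticks_per_hour
    return (points - 1) / (effective_age + 2) ** gravity


# The same story eight wall-clock hours after submission:
quiet_night = tick_score(points=4, age_ticks=80, ticks_per_hour=100)
busy_day = tick_score(points=4, age_ticks=800, ticks_per_hour=100)
print(quiet_night > busy_day)
```

With few ticks elapsed, the overnight story keeps most of its score, which is the intended effect. As overall activity grows, `ticks_per_hour` would need retuning.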
- jshen 15 years ago
  
  Visit count is not trivial. A DB write for every page view is heavy.
  

eculver 15 years ago

I do agree that the wall-clock hours argument is somewhat weak due to the international crowd behind HN, but it would be interesting to see how weekends, or more general traffic patterns such as those on holidays, contribute to scores. I know that in these cases, I generally diverge from my normal HN-checking patterns.

sesqu 15 years ago

> (I think you'd need to timestamp each vote for an improved decay function.)

Well, somewhat. You'd need a timestamped sufficient statistic, but not necessarily each vote.

Consider ACO. Given +1:

  t0 = last_change_timestamp
  s0 = score_after_last_change
  t1 = timestamp_of_new_vote
  s1 = 1 + s0 / exp(t1 - t0)
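The update above can be sketched as a small class that stores only the last-change timestamp and the score at that moment. The class and method names are hypothetical, and time is assumed to be in whatever unit gives a decay constant of 1.

```python
import math


class DecayedScore:
    """Exponentially decayed vote score kept as (last_change, score)."""

    def __init__(self, now):
        self.last_change = now  # t0
        self.score = 0.0        # s0

    def value_at(self, now):
        """Score decayed from the last change up to `now`."""
        return self.score / math.exp(now - self.last_change)

    def upvote(self, now):
        """Apply s1 = 1 + s0 / exp(t1 - t0) and record t1."""
        self.score = 1.0 + self.value_at(now)
        self.last_change = now


s = DecayedScore(now=0.0)
s.upvote(now=0.0)  # score becomes 1
s.upvote(now=1.0)  # score becomes 1 + 1/e
print(round(s.value_at(2.0), 4))
```

Two numbers per story are enough to rank at any later time, so individual votes need not be timestamped.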

ma2rten 15 years ago

HN doesn't really experience a night, since people from all over the planet are on the site. But maybe it would be better to divide by a number of impressions weighted with something.

EGreg 15 years ago

Well, you can fix (1) by simply incrementing T not by actual time but just by the total # of votes/submissions since the beginning.