Comment by dang
10 years ago
We worked on something like this for a while last year (code name "the archivist") with the intention of making Readability-style versions of stories with plain text, major images and no cruft. The purpose of the experiment was to see if it would speed up moderation. If we kept it, we hoped to share it with everybody (where by "hoped" I mean "would have unless we couldn't"). In the end, we didn't keep it because it didn't speed up moderation and it is one of those problems that turns out to be increasingly nontrivial the closer you get.
If we did anything like it again, I'd still hope to share it with everybody, but perhaps not by adding a third link. I already feel bad for adding two.
Sorry to see you've not kept up with this.
I access HN on a few devices, including some whose rendering of "modern" (that is: broken) site designs is at best poor. Frequently no content is visible, either due to text not appearing at all, or being completely obscured by other elements. A Readability view, stripped of cruft, would be excellent for this. I'm aware of issues with site referrals, copyright, etc., but really, it would be helpful.
Otherwise: Internet Archive and Coral Cache are both existing systems which can and do cache some content, on request. IA seems to like having hot stuff fed to them; CC have been quite spotty in reliability over the past year or two (both not properly caching content, and simply not responding).
I'm open to working on it again. Qua user, I would love to be able to view Readability-style versions of stories quickly. And think of all the analytics people could do on a near-complete archive of all HN stories.
But it's a matter of priorities. Had it sped up moderation it would have both paid for itself and made certain campers happier. But it didn't turn out that way. Beyond that, technically it's a nontrivial problem to get working on the full range of content, and then there are the nontechnical obstacles. We wouldn't do it without being sure we could release it.
Sending requests to Internet Archive might be an option if they'd be ok with it, but that of course would only help with caching, not decrufting.
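For concreteness, here is a minimal sketch of what "sending requests to Internet Archive" could look like. It only builds the URLs and makes no network calls. It assumes the Wayback Machine's public "Save Page Now" path (`https://web.archive.org/save/<url>`) and its availability endpoint (`https://archive.org/wayback/available`); the function names are made up for illustration, and whether IA would be ok with this kind of automated use is exactly the open question above.

```python
from urllib.parse import urlencode


def wayback_save_url(story_url):
    """URL that asks the Wayback Machine to capture story_url (assumed usage)."""
    return "https://web.archive.org/save/" + story_url


def wayback_availability_url(story_url):
    """URL that asks whether a capture of story_url already exists."""
    return "https://archive.org/wayback/available?" + urlencode({"url": story_url})


if __name__ == "__main__":
    story = "https://example.com/a?b=1"
    print(wayback_save_url(story))
    print(wayback_availability_url(story))
```

As noted, this would only cover caching; it does nothing for decrufting.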
Totally understand on all points.
Caches on their own would be totally worthwhile, even without de-crufting. If IA are up for it, HN as signal for relevance would likely be worthwhile. Talk to Brewster.
As I said, decrufting/readability would be a really nice value-add, personally. Readability themselves have an API for this which might be one way to approach the concept, and they've done much of the heavy lifting in terms of sorting out sites' various CSS/HTML cruft and sanitizing it. I do my own pretty significant CSS restructuring locally (we've chatted about this before w/ HN), and with some 1800+ individual sites' CSS modified to some extent or another, I've got a really good idea of just how effed up the stuff can be.
I totally agree with Nic Bevacqua's "Stop Breaking the Web" posted yesterday.
But on an effort/reward basis as a greenfield project, likely not worth it. Going with Readability (or Instapaper, or Pocket) themselves could well be worth investigating.
As a suggestion: another consideration would be to simply reject submissions which aren't accessible via some putative minimal client. If enough aggregators started penalising sites for inaccessible content, they might start wising up.
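A rough sketch of what such a check might look like, assuming the "putative minimal client" is one that ignores JavaScript and CSS entirely and only sees text present in the raw HTML. The class and function names and the 50-word threshold are all hypothetical, picked for illustration rather than tuned against real sites.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that is present in the markup itself, outside script/style."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html):
    """Count words a no-JS, no-CSS client would have available to display."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


def passes_minimal_client(html, min_words=50):
    """Hypothetical acceptance rule: enough readable text without running scripts."""
    return visible_word_count(html) >= min_words
```

A page that renders everything client-side (an empty `<div id="app">` plus a script) scores zero here, while a plain article passes. It would not catch the "text obscured by other elements" case, since that depends on layout rather than markup.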