Comment by spyrja

7 days ago

Well exactly. The only variability would be on a per-resource basis, so the server-side calculations would likely be quite manageable. The RESOURCE_ID could be a simple concatenation of the name, size, and last-modification-date of the resource, the ITERATIONS parameter would obviously be tuned by experimentation, and the MEMORY_COST chosen based on some sort of heuristic.
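
Something along these lines is what I have in mind for the server side (every name and heuristic here is just made up for illustration):

```javascript
// Rough sketch of how the server might hand out the challenge.
// All names and heuristics are placeholders, not a real API.
const crypto = require("crypto");

function makeChallenge(resource) {
  // e.g. resource = { name: "/reports/q3.pdf", size: 1048576, mtime: "2024-05-01T12:00:00Z" }
  const resourceId = `${resource.name}:${resource.size}:${resource.mtime}`;

  // Placeholder heuristic: bigger (i.e. more expensive to serve) resources get a
  // harder challenge, capped so legitimate users aren't punished too badly.
  const iterations = Math.min(4096, 256 + Math.floor(resource.size / 65536));
  const memoryCost = 25 * 1024 * 1024; // bytes; the 25 MB figure used below

  // A per-request nonce keeps a single solved challenge from being replayed.
  const nonce = crypto.randomBytes(16).toString("hex");

  return { resourceId, iterations, memoryCost, nonce };
}
```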

The real question is whether or not it would really be enough to discourage indiscriminate/unrestrained scraping. The disparity between the computing resources of your average user and a GPU-accelerated bot with tons of memory is, after all, so lop-sided that such an approach may not even be sufficient. For a user to compute a hash that requires 1024 iterations of an expensive function demanding 25 MB of memory might seem like a promising scraping deterrent at first glance. On the other hand, to a company with numerous cores per processor running in separate threads and several terabytes of RAM at its disposal (multiplied by scores of computer racks), it might just be a drop in the bucket. In any case, it would definitely require some tuning/testing to see whether it is even viable.
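
To put some (entirely made-up) numbers on that lop-sidedness:

```javascript
// Back-of-envelope sketch of the asymmetry. Every figure below is an
// assumption chosen purely to illustrate the argument, not a measurement.
const userSecondsPerHash = 0.5;   // say ~0.5 s per challenge in a browser
const farmMachines      = 100;    // "scores of computer racks"
const coresPerMachine   = 64;     // "numerous cores per processor"
const farmSpeedup       = 2;      // assume tuned native code is ~2x faster per core

// One visitor pays half a second per page view, which is barely noticeable...
console.log(`user: ${userSecondsPerHash} s per request`);

// ...while the farm grinds through challenges on every core in parallel.
const farmRequestsPerSecond =
  (farmMachines * coresPerMachine * farmSpeedup) / userSecondsPerHash;
console.log(`farm: ~${farmRequestsPerSecond} requests/s sustained`); // 25600 here

// The 25 MB memory cost is no obstacle either: with, say, 4 TB of RAM per machine,
// roughly 4 * 1024 * 1024 / 25 ≈ 167,000 solves could be held in flight at once.
const concurrentPerMachine = Math.floor((4 * 1024 * 1024) / 25);
console.log(`~${concurrentPerMachine} concurrent solves per machine`);
```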

I have actually implemented this very kind of hash function in the past and can attest that the implementation is fairly trivial. With just a bit of number theory and some sponge-construction tricks you can achieve a highly robust implementation in just a few dozen lines of JavaScript code. Maybe when I have the time I should put something up on GitHub as a proof-of-concept for people to play with. =)
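
To give a rough idea, here is a toy sketch of that sort of construction (placeholder mixer, illustrative parameters, and definitely not a substitute for a vetted primitive like Argon2):

```javascript
// Toy memory-hard, sponge-style hash. The mixer is a simple xorshift stand-in
// for a real permutation; everything here is illustrative only.
function memoryHardHash(input, iterations, memoryCostBytes) {
  const words = memoryCostBytes >>> 2;   // 32-bit words in the memory buffer
  const mem = new Uint32Array(words);

  // Cheap 32-bit mixer standing in for the sponge permutation.
  const mix = (x) => {
    x ^= x << 13; x >>>= 0;
    x ^= x >>> 17;
    x ^= x << 5;  x >>>= 0;
    return x >>> 0;
  };

  // "Absorb" the input (e.g. RESOURCE_ID + nonce) into a small state.
  let state = 0x9e3779b9;
  for (let i = 0; i < input.length; i++) {
    state = mix(state ^ input.charCodeAt(i));
  }

  // Fill the large buffer so the full memory cost actually has to be paid.
  for (let i = 0; i < words; i++) {
    state = mix(state + i);
    mem[i] = state;
  }

  // Data-dependent passes: each step reads an index derived from the running
  // state, which frustrates cheap time/memory trade-offs.
  for (let i = 0; i < iterations; i++) {
    for (let j = 0; j < words; j++) {
      const k = state % words;
      state = mix(state ^ mem[k] ^ j);
      mem[j] = (mem[j] + state) >>> 0;
    }
  }

  // "Squeeze" a short hex digest out of the buffer.
  let digest = "";
  for (let i = 0; i < 8; i++) {
    state = mix(state ^ mem[state % words]);
    digest += state.toString(16).padStart(8, "0");
  }
  return digest;
}

// The client would compute this and send the digest back with the request.
// Small demo parameters here (4 passes over 1 MB) so it finishes quickly:
console.log(memoryHardHash("report.pdf:1048576:2024-05-01", 4, 1 << 20));
```

The data-dependent indexing is the important part: it is what forces the solver to keep the whole buffer resident instead of recomputing values on the fly.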