Comment by throwawayqqq11

7 days ago

Since there is varying but requester-independent input into the hash function, doesn't this mean that the server has to calculate the entire value space too, and that these resource hashes can be reused across different requesters?

Binding a challenge-response to a specific resource doesn't sound like such a bad idea, though.

Well, exactly. The only variability would be on a per-resource basis, so the server-side calculations would likely be quite manageable. The RESOURCE_ID could be a simple concatenation of the name, size, and last-modification date of the resource, the ITERATIONS parameter would be tuned by experimentation, and the MEMORY_COST set based on some sort of heuristic.

The real question is whether it would really be enough to discourage indiscriminate, unrestrained scraping. The disparity between the computing resources of your average user and a GPU-accelerated bot farm with tons of memory is so lopsided that such an approach may not be sufficient. For a user, computing a hash that requires 1024 iterations of an expensive function demanding 25 MB of memory might seem like a promising scraping deterrent at first glance. To a company with numerous cores per processor running in separate threads and several terabytes of RAM at its disposal (multiplied across scores of server racks), it might be a drop in the bucket. In any case, it would definitely require a fair amount of tuning and testing to see whether it is even viable.
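A quick back-of-envelope calculation shows the scale of that disparity. Every number below is an assumption for illustration, not a measurement:

```javascript
// Assumed: a browser solves one 25 MB / 1024-iteration challenge per second.
const userChallengesPerSec = 1;

// Assumed bot farm: 64 cores per node, 40 nodes per rack, 50 racks.
const coresPerRack = 64 * 40;
const racks = 50;
const botChallengesPerSec = userChallengesPerSec * coresPerRack * racks;

console.log(botChallengesPerSec); // 128000: the farm pays the same per-request
                                  // cost as a user, just 128,000 times in parallel
```

The per-request cost is identical for both sides; the deterrent only bites if that cost is large relative to the value of each scraped page.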

I have actually implemented this very kind of hash function in the past and can attest that the implementation is fairly trivial. With just a bit of number theory and some sponge-construction tricks you can achieve a fairly robust implementation in just a few dozen lines of JavaScript. Maybe when I have the time I should put something up on GitHub as a proof-of-concept for people to play with. =)