Comment by DCoder
7 years ago
Many years ago, I was asked to look at why all the content had vanished from a site (not built by me). After digging in a bit, I found that:
1) the original developer's idea of handling an unauthorized /admin request was just to set a redirect header and continue processing the current request.
2) the /admin page had a grid of all the content on the site, with handy 'Delete' links that ran over GET without confirmation.
You can probably guess where this is going – some search bot hit the overview page, ignored the redirect header, saw the content, and dutifully crawled every single link on it…
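To make the two flaws concrete, here is a minimal Python sketch of what such a handler might have looked like, next to a safer version. It is not the original site's code: the `Response` class, the in-memory store, the `/admin/delete/<id>` path, and the `/login` target are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def make_store():
    # Hypothetical in-memory stand-in for the site's content table.
    return {1: "About us", 2: "Contact"}


def buggy_admin(method, path, is_admin, store):
    """Reproduces both flaws from the story (illustrative only)."""
    headers = {}
    if not is_admin:
        # Flaw 1: the redirect header is set, but processing continues.
        headers["Location"] = "/login"
    if path.startswith("/admin/delete/"):
        # Flaw 2: the delete runs for any method, including a plain GET.
        store.pop(int(path.rsplit("/", 1)[1]), None)
    status = 302 if "Location" in headers else 200
    return Response(status, headers, body=f"{len(store)} items")


def fixed_admin(method, path, is_admin, store):
    if not is_admin:
        # Return right away so nothing below runs for unauthorized callers.
        return Response(302, {"Location": "/login"})
    if path.startswith("/admin/delete/"):
        if method != "POST":
            # GET should stay safe, so refuse to delete on it.
            return Response(405, {"Allow": "POST"})
        store.pop(int(path.rsplit("/", 1)[1]), None)
    return Response(200, body=f"{len(store)} items")


def crawl(handler, store):
    """A bot that ignores redirects and GETs every delete link it sees."""
    for item_id in list(store):
        handler("GET", f"/admin/delete/{item_id}", False, store)
    return store
```

Running `crawl(buggy_admin, make_store())` empties the store even though every response is a 302, while `crawl(fixed_admin, make_store())` leaves it untouched. Either fix alone would have stopped the bot in this sketch, but having both also covers a logged-in user whose browser prefetches links.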
There were at least two browser extensions which also discovered that poor design was widespread and had to disable prefetching for similar reasons:
http://fasterfox.mozdev.org/index.html
https://signalvnoise.com/archives2/google_web_accelerator_he...
I think the state of the web has improved slightly over the last decade, but this is a great example of why browser vendors are so conservative. You can prefetch now, but only as an opt-in.
Was it blekko? We had a website owner email us about that issue when blekko's ScoutJet crawler was new... although I don't recall the bit about ignored redirect headers.
I'm pretty sure everyone with a crawler has hit this sort of problem before. The first startup I was at did, with someone's wiki that had "delete" links everywhere with no auth.
Now that I've hit it once, I watch out for websites with this problem. I was surprised to notice that a Fortune 50 tech company's internal employee-personal-webpages-maker-thingie had that issue. And then a week later they asked me if I could crawl their internal web. Uh, no, who knows what other internal systems had that problem?