
Comment by psyc

5 years ago

Pardon my ignorance, as I have only a few years of web dev experience. What exactly does it mean to store data on a domain? Does he mean serve data via a domain URL? And if so, how does Google discover that data?

Author here. Yes, "serve" is the correct interpretation. It is not clear how Google gets ahold of offending URLs within blacklisted domains (like the article says, there were no offending URLs provided to us).

Theories:

* Obtained from users of Google Chrome that load specific URLs in their browsers

* Obtained from scanning GMail emails that contain links to URLs

* Obtained from third parties that report these URLs

  • The main way is via the Googlebot crawler.

    They also use user reports from Chrome, and links in "mark phishing" emails from Gmail. In those latter two cases the URL is considered private data, so it won't be reported in Webmaster Tools.

We’re pretty sure they get reports from Chrome. A security researcher at my workplace was running an exploit against a dev instance as part of their secops role and got the domain flagged, despite the site being an isolated and firewalled instance not accessible to the internet.

  • Yes, I have noticed that when I create a brand new dev domain with a crawler-blocking robots.txt file, it is not found in any Google search until I open the dev URL in Chrome. Then, bam! Watch as their crawler starts trying to crawl the site just from my opening the URL in Chrome.

    This is why I never use Chrome. They scrape the Google Safe Browsing data sent from Chrome browsers and just do not care about privacy.

    • Maybe it's from the search suggestion API? Anyway, I turn that off as soon as I create a new browser profile, along with the Safe Browsing list and the automatic search when I type an unrecognized URL. When I want to search, I use the browser's search input (Ctrl+K). The URL bar is for URLs only.

    • You realize that robots.txt is an "on your honor" system, and that anyone can write a script that ignores robots.txt and posts anything it finds to the internet, so other sites could find your site via third-party data.

      Chrome does not do what you claim it does.


  • But that means they can't verify it, right? Couldn't a malicious actor use this to attack their competitors?

    Add an internal DNS entry for your competitor's domain, spin up an internal server hosting some malware, and open it in Chrome.
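
The "on your honor" point about robots.txt above can be made concrete with a short sketch using Python's standard-library parser (the domain and paths are made up for illustration). Nothing in the protocol enforces compliance; a crawler that never runs this check can still fetch everything:

```python
from urllib.robotparser import RobotFileParser

# A typical robots.txt that asks all crawlers to stay out of a dev site.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks can_fetch() before requesting a URL...
print(parser.can_fetch("Googlebot", "https://dev.example.com/secret"))
# ...but the check is purely voluntary: a scraper that skips it can
# fetch any URL it knows about and republish what it finds.
```

So a robots.txt-blocked dev site is hidden only from crawlers that choose to honor it, not from anyone who learns the URL by other means.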

We use a fair number of Google products, and you can turn on a lot of enhanced protection, as many businesses do. This means even password-protected / private URLs may generate scans, from what I've seen. I'm not sure how they actually fingerprint files (maybe locally), but it seems pretty broad.

This seems to work across a lot of Google products (Gmail, Drive, Chrome, etc.), so it scoops up a ton.

More here:

https://security.googleblog.com/2020/05/enhanced-safe-browsi...

Not sure if this is related to Safe Browsing. We can also turn on more scanning and other features for all email users.

The key though, if you allow users to PUT files onto your S3 (even private / signed in) then google may scan them. That means if your user uploads a suspicious file to a trouble ticket system, if there IS a virus in there and google sees it, wham. Obviously most folks will segregate those uploads off into their own s3 bucket by user/account to avoid contamination, but you really have to be careful not to hose viruses AT ALL on your key domains.