Bucket Stream: Finding S3 Buckets by watching certificate transparency logs

8 years ago (github.com)

>Randomise your bucket names! There is no need to use company-backup.s3.amazonaws.com

This is really poor advice. It offers no real benefit, especially since any asset you access will betray your bucket name, because the name is part of the DNS resolution. Bucket names are emphatically public, every bit as public as a DNS name.

  • True.

    It can also create more problems. If you name something like companyname-production vs companyname-qa, you pretty much know right off the bat which environment you are about to mess up. Not so with random names or UUIDs.

    This is also security by obscurity. If all one needs to know is the bucket name, you have already lost.

    EDIT: As an exception to this, I randomize a portion of the bucket name when it is created by automation. But this is solely to avoid name clashes across separate clusters. The prefix will still be the same.

    • > This is also security by obscurity.

      I see this being claimed a lot, but isn’t all security by obscurity at the end of the day?

      A simplistic example: compare (A) with (B).

      A) I run telnet with no password on a random port. The chances of an attacker guessing my port are 1 in 65k.

      B) I run telnet on port 25, with the password being a random number from 1 to 65k.

      How do A and B differ in security?


  • I can think of one advantage: it makes it difficult for somebody to hit you with a typo attack. If all your buckets have a very strict, consistent naming scheme, then somebody else could create a bucket whose name is a likely typo of one of yours, and your data starts going to them.

  • I wouldn't call it poor advice. It isn't a control, just security by obscurity, but it doesn't exactly hurt anything either. I saw a situation recently where a bucket was accidentally opened to the world, but the name was a UUID, and in the entire history of the bucket no request was logged other than from the intended clients.

    • > but it doesn't exactly hurt anything either.

      It hurts me if I'm trying to remember the bucket I'm after.

      Is fc20d856-2a7e-41ab-b072-9bb9a68c6bda production or 193565ac-9121-4071-8aeb-62f3111c4c97 or is that the dev setup or the staging data for the other service or...

      To me the big question here is why these names have to be global. Why can't I have a UUID externally but a name and an account internally? Honest question, I assume there may be a significant issue as smarter people than me decided not to do it that way.


    • > in the entire history of the bucket no request was logged other than from the intended clients

      This sounds sort of like dumb luck. It just means no one was looking for it; that doesn't mean it's secure. This all reminds me of the xkcd about making passwords that are easy for computers to guess and hard for people to remember[0].

      Your security on buckets should come from the bucket policy/permissions themselves, not from their arbitrary naming (a sketch follows below). Security by obscurity is rarely secure and is more about the illusion of security.

      [0] https://xkcd.com/936/
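
      For instance, a minimal sketch of the policy-first approach with boto3; "example-bucket" is a placeholder name, and S3 Block Public Access is just one of several controls you could reach for:

          import boto3

          s3 = boto3.client("s3")
          # Deny public ACLs and public policies outright, regardless of
          # how the bucket happens to be named.
          s3.put_public_access_block(
              Bucket="example-bucket",  # placeholder name
              PublicAccessBlockConfiguration={
                  "BlockPublicAcls": True,
                  "IgnorePublicAcls": True,
                  "BlockPublicPolicy": True,
                  "RestrictPublicBuckets": True,
              },
          )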


Could be more general: finding subdomains by watching CT logs.

So what is the problem here?

How to "hide" private subdomains?

How to "securely" configure S3 buckets?

IMO, the problem is in the use of the CA system, where control over "names" (e.g. subdomains) is shared with third parties (certificate issuers) instead of being solely with the user who wants to reserve names.

It is possible to have a non-CA PKI system where the user controls both the issuance of the public key and the associated name she will use. In such a system, no third party has control over names. People learn the user's name and the user's key from the same source: the user.

Thus there is no issue of trusting third parties, and no need to monitor which names those third parties are issuing, e.g. via "certificate transparency" logs. CT logs would not need to exist.
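
For illustration, a minimal sketch of the idea using Python's cryptography package: the user generates her own key and signs her own name-to-key binding, so there is no issuer to monitor. (This is just the self-signed flavor of the idea, not any particular deployed system.)

    import datetime
    from cryptography import x509
    from cryptography.x509.oid import NameOID
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import ec

    key = ec.generate_private_key(ec.SECP256R1())
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "alice.example")])
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)           # the name the user chose
        .issuer_name(name)            # issuer == subject: no CA in the loop
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(datetime.datetime.utcnow())
        .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=365))
        .sign(key, hashes.SHA256())   # the user signs her own binding
    )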

This is not a new idea and it has been proven to work. I can prepare a post with examples if anyone is interested.

  • > Could be more general: finding subdomains by watching CT logs.

    Yep. You can use crt.sh for this at a per-domain level; a sketch follows below. I also wrote ausdomainledger.net as an experiment to index all subdomains in the .au TLD, querying the CT logs directly, which was a bunch of fun.
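
    A minimal sketch, assuming crt.sh's unofficial JSON output (which may change without notice):

        import requests

        def ct_subdomains(domain):
            resp = requests.get(
                "https://crt.sh/",
                params={"q": f"%.{domain}", "output": "json"},
                timeout=30,
            )
            resp.raise_for_status()
            names = set()
            for entry in resp.json():
                # name_value can hold several newline-separated names
                for name in entry["name_value"].splitlines():
                    names.add(name.lstrip("*."))
            return sorted(names)

        print(ct_subdomains("example.com"))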

    > How to "hide" private subdomains?

    Symantec provides the option of label redaction (using the '?' symbol) for CT precerts with the certificates they issue. For example: https://crt.sh/?q=?.amazon.com.au . However, I'm pretty sure it's not supported by the CT RFC ...

    Otherwise, I'd say wildcards.

    Replacing the CA PKI with something else is very drastic and, if it is possible at all, will probably take a very long time ...

> Randomise your bucket names! There is no need to use company-backup.s3.amazonaws.com.

I don't think this is a globally true statement. Random bucket names are hard to remember; not everyone uses S3 from code or configuration, so a memorable bucket name actually matters.

Passive DNS might be another good way to get S3 bucket names.

There doesn't seem to be a Wikipedia article on Passive DNS, but this article explains it quite well: https://help.passivetotal.org/passive_dns.html

Basically, some resolvers submit all (some?) of their DNS query responses to a central database so that they can be searched later. It seems you can also install a passive "sensor" in your network that (presumably) passively observes DNS traffic and sends off the responses.
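
A rough sketch of what such a sensor does conceptually, assuming the scapy package and capture privileges (a real sensor would batch and upload records rather than print them):

    from scapy.all import sniff, DNS

    def record(pkt):
        # Only packets carrying DNS answers are interesting.
        if pkt.haslayer(DNS) and pkt[DNS].ancount > 0:
            for i in range(pkt[DNS].ancount):
                rr = pkt[DNS].an[i]
                print(rr.rrname.decode(), rr.rdata)  # observed name -> answer

    # Watch responses coming back from resolvers; store=False avoids
    # buffering every packet in memory.
    sniff(filter="udp and src port 53", prn=record, store=False)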

I don't know how hard it is to get access to the data, but:

> programs like RiskIQ's DNSIQ allow organizations to install a sensor on their network that reports back to RiskIQ and in exchange, the organization gains access to all the passive DNS traffic inside the central repository.

EDIT: VirusTotal has some passive DNS data publicly available: e.g. look in "observed subdomains" https://www.virustotal.com/en/domain/s3-us-west-2.amazonaws....

EDIT2: And a bunch of them appear to be unprotected...

I did some analysis a few months ago and collected the names of approximately 100,000 buckets in the wild. Rough numbers: about 5% are open to the public for anonymous read, and about 5% of those are open for anonymous write.
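
For reference, a hedged sketch of how "open for anonymous read" can be tested, using unsigned boto3 requests ("candidate-bucket" is a placeholder; an anonymous-write test would attempt a put_object the same way):

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    from botocore.exceptions import ClientError

    # An unsigned client acts as an anonymous caller.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    def anonymous_read(bucket):
        try:
            s3.list_objects_v2(Bucket=bucket, MaxKeys=1)
            return True
        except ClientError:
            return False  # AccessDenied, NoSuchBucket, etc.

    print(anonymous_read("candidate-bucket"))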

I'm convinced that Chris Vickery, the guy behind a good many of the open bucket finds this year, has access to enterprise firewall/proxy logs. Not because the buckets would have been hard to find, but because you could spend a lifetime looking through thousands upon thousands of open buckets before you find anything interesting.

This is concerning because a number of high-profile data breaches have occurred due to over-reliance on S3 bucket obscurity, where buckets were left with minimal or misconfigured permissions and GBs of data sitting there for the downloading.

  • How is this concerning? This is very good: it makes such discovery easy, which makes it much harder to dismiss as "something that will never happen".

    • Concerning in the sense of "if you aren't sure why this is a story on HN": you may be unaware that many large and generally technically competent firms are screwing this up, and this repo/tool is yet one more reason to take it seriously.

  • At some point an organization living in the cloud needs to properly secure their cloud resources. This makes it easier to justify that effort up front.

  • Correct me if I'm wrong, but the last time I tried to make a new bucket's contents public it was a real PITA. The default configuration is very locked down. So I think it's never a case of minimal configuration and always a case of misconfiguration.

I was curious, so I tried to see whether I could find anything compromising with it. It's mostly just public buckets of images used for websites, so nothing strange. Maybe the README is a bit too dramatic.

I'm confused. Aren't S3 buckets secured by pre-existing wildcard certs?

  • Ignore any direct connection between S3 buckets themselves and particular certificates, and just think of the stream of domain names you get from CT as the seed for a dictionary to grind against S3.

  • The code takes the CT hostname and tries to access a bunch of different buckets that might exist related to that hostname. So if you get a cert for foo.example.com, it will ask S3 whether foo.example.com.s3.amazonaws.com and www-foo.example.com.s3.amazonaws.com exist.
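
    Roughly like the sketch below; the prefix list is an assumption, not the tool's actual wordlist, and path-style URLs sidestep TLS name mismatches for dotted bucket names:

        import requests

        PREFIXES = ["", "www-", "backup-", "dev-"]  # assumed permutation list

        def probe(hostname):
            for prefix in PREFIXES:
                bucket = prefix + hostname
                # Path-style probe: 404 = no such bucket; anything else
                # (200 listable, 403 private, 307 other region) means it exists.
                r = requests.head(f"https://s3.amazonaws.com/{bucket}", timeout=10)
                if r.status_code != 404:
                    print(bucket, r.status_code)

        probe("foo.example.com")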