
Comment by masto

1 year ago

One of the things it took a little time to wrap my head around when I started working at Google was trading off reliability or correctness for scaling.

I had previously built things like billing systems and small-scale OLTP web applications, where it had never even occurred to me to ask whether any data loss, or a nonzero error rate, could be acceptable. What was eye-opening wasn't so much that if you're doing millions of qps, some requests are going to fail, but the difference in engineering attitude. Right this instant, there are probably thousands of people who opened their Gmail and it didn't load right or gave them a 500 error. Nobody is chasing down why that happened, because those users will just hit reload and go on with their day. Or from another perspective: if your storage has an impressive 99.99999% durability over a year, then with two billion customers, 200 people had a really miserable day.
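That last figure is easy to check with back-of-envelope arithmetic (the customer count here is the two billion from the comment, used purely for illustration):

```python
# Seven nines of durability, per customer per year.
durability = 0.9999999
customers = 2_000_000_000  # two billion, as in the example above

# Expected number of customers who lose data in a year.
expected_losses = customers * (1 - durability)
print(round(expected_losses))  # ~200 miserable days
```

The point being that a number with an impressive count of nines still turns into a concrete number of unhappy people once you multiply by the user base.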

It was a jarring transition from investigating every error in the logs to getting used to everything being a little bit broken all the time and always considering the cost before trying to do something about it.

Durability levels that poor aren’t state of the art any more.

The rule of thumb I’ve seen at most places (running at similar scale) is to target one data loss, fleet wide, per century.

That usually increases costs by well under 10%, but you have to have someone who understands combinatorics design your data placement algorithms.

The copyset paper is a good place to start if you need to understand that stuff.
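The core copyset idea can be sketched in a few lines. The cluster size, replication factor, and scatter width below are made-up illustrative numbers, not anything from the comment; the point is only the order-of-magnitude difference in loss probability:

```python
from math import comb

N = 1000  # nodes in the cluster (illustrative)
R = 3     # replication factor

# All possible ways R nodes can fail simultaneously.
total_r_subsets = comb(N, R)

# Random replication: with enough chunks, nearly every R-node subset
# ends up holding all replicas of some chunk, so almost any
# simultaneous failure of R nodes loses data.
p_loss_random = 1.0

# Copyset replication: restrict placement to a small fixed set of
# groups ("copysets"). With scatter width S, there are roughly
# N * S / (R * (R - 1)) distinct copysets in the whole cluster.
S = 10
copyset_count = N * S // (R * (R - 1))

# A simultaneous failure of R random nodes only loses data if those
# R nodes happen to form one of the copysets.
p_loss_copyset = copyset_count / total_r_subsets
print(p_loss_random, p_loss_copyset)  # copyset loss probability ~1e-5
```

The trade-off is that when a copyset *does* fail, you lose more data at once; copyset placement trades frequency of loss events for their size, which is usually the right trade when your target is "one loss event per century."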

  • That sounds a lot better than some number of nines standing by itself.

    99.(9)x percent durability is almost meaningless without a description of what the unit of data is and what a loss looks like. There are too many orders of magnitude between a chunky file having an error, a transaction having an error, a block having an error, a bit having an error...

This all makes a lot of sense, but I have seen this a lot in the opposite direction. Specifically, people from social media companies with very squishy ideas around failures coming into financial applications. Generally does not go well.

While I was at Google I was loaned out to the Android team to work on contact sync. I was running into a problem, but it was an extremely rare situation. I pulled aggregated data from production, and it looked like it would affect 0.01% of users. When I presented the solution and mentioned this error rate, I was asked how those 200,000 Android users could remediate the situation. They couldn't; their contact sync would just be broken, so I was told to go back to the drawing board.

But yeah, it was the number that was humbling.

There are definitely areas where 4 nines are good enough, but there are just as many areas where they aren't.