
Comment by continuations

7 years ago

> The world’s most robust computer systems... used special hardware that could tolerate single bit-flips. But Google... bought cheaper computers that lacked that feature.

> Together, Jeff and Sanjay wrote code to compensate for the offending machines.

I'm guessing the article is referring to Google buying servers with non-ECC RAM.

How do you write code to compensate for non-ECC RAM?

> When a car goes around a turn, more ground must be covered by the outside wheels; likewise, the outer edge of a spinning hard disk moves faster than the inner one. Google had moved the most frequently accessed data to the outside, so that bits could flow faster under the read-head, but had left the inner half empty; Jeff and Sanjay used the space to store preprocessed data for common search queries
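The geometry in that quote comes down to $v = \omega r$: at a fixed spindle speed, the linear speed under the head grows with radius. Here is a rough worked example. It assumes a 7200 RPM drive with usable radii of about 20 mm (inner) and 47 mm (outer). These are illustrative figures, not numbers from the article.

$$
\omega = 7200 \cdot \frac{2\pi}{60}\ \text{rad/s} = 240\pi \approx 754\ \text{rad/s}
$$

$$
v_{\text{in}} \approx 754 \times 0.020 \approx 15.1\ \text{m/s}, \qquad
v_{\text{out}} \approx 754 \times 0.047 \approx 35.4\ \text{m/s}, \qquad
\frac{v_{\text{out}}}{v_{\text{in}}} = \frac{r_{\text{out}}}{r_{\text{in}}} \approx 2.35
$$

At the same linear bit density, the outer tracks would therefore pass roughly 2.35 times as many bits per second under the head as the inner ones.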

So Google was a company that was too cheap to buy ECC RAM, but willing to let half of their hard disks go to waste. I'm struggling to understand the logic behind this; can anyone explain?

By pushing this complexity up the stack and keeping redundant copies in storage. Data that fails a correctness check (e.g. a checksum) is surfaced as an error, and the client's retry can be routed to a replica that still holds a good copy.
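A minimal sketch of that idea, assuming a hypothetical in-memory `Replica` class, CRC32 checksums recorded at write time, and a client that simply tries replicas in order. This is not Google's actual code. It only shows how a flipped bit becomes a detectable error plus a retry elsewhere. Note that it protects stored data, not the computation that produced it.

```python
import zlib
from dataclasses import dataclass, field


class CorruptBlockError(Exception):
    """Raised when stored bytes no longer match their recorded checksum."""


@dataclass
class Replica:
    # Hypothetical storage node: block_id -> (mutable data, CRC32 taken at write time)
    name: str
    blocks: dict = field(default_factory=dict)

    def write(self, block_id: str, data: bytes) -> None:
        # Store a private copy of the data together with its checksum.
        self.blocks[block_id] = (bytearray(data), zlib.crc32(data))

    def read(self, block_id: str) -> bytes:
        data, expected = self.blocks[block_id]
        # Verify on every read; CRC32 detects any single flipped bit.
        if zlib.crc32(bytes(data)) != expected:
            raise CorruptBlockError(f"{self.name}: checksum mismatch for {block_id}")
        return bytes(data)


def replicated_write(replicas: list, block_id: str, data: bytes) -> None:
    # Redundancy: every replica gets its own checksummed copy.
    for r in replicas:
        r.write(block_id, data)


def read_with_retry(replicas: list, block_id: str) -> bytes:
    """Try replicas in order; a corrupt copy turns into an error and a retry elsewhere."""
    errors = []
    for r in replicas:
        try:
            return r.read(block_id)
        except CorruptBlockError as e:
            errors.append(str(e))
    raise CorruptBlockError("all replicas failed: " + "; ".join(errors))


if __name__ == "__main__":
    replicas = [Replica("a"), Replica("b"), Replica("c")]
    replicated_write(replicas, "doc-1", b"hello world")
    replicas[0].blocks["doc-1"][0][0] ^= 0x01  # simulate a single bit-flip on replica "a"
    print(read_with_retry(replicas, "doc-1"))  # served by replica "b"
```

A real system would also report the bad copy and re-replicate it from a good one. The sketch leaves that out to keep the retry path visible.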