
Comment by faizshah

3 years ago

In the paper on Twitter’s “Who to Follow” service they mention that they designed the service around storing the entire twitter graph in the memory of a single node:

> An interesting design decision we made early in the Wtf project was to assume in-memory processing on a single server. At first, this may seem like an odd choice, running counter to the prevailing wisdom of “scaling out” on cheap, commodity clusters instead of “scaling up” with more cores and more memory. This decision was driven by two rationales: first, because the alternative (a partitioned, distributed graph processing engine) is significantly more complex and difficult to build, and, second, because we could! We elaborate on these two arguments below.

> Requiring the Twitter graph to reside completely in memory is in line with the design of other high-performance web services that have high-throughput, low-latency requirements. For example, it is well-known that Google’s web indexes are served from memory; database-backed services such as Twitter and Facebook require prodigious amounts of cache servers to operate smoothly, routinely achieving cache hit rates well above 99% and thus only occasionally require disk access to perform common operations. However, the additional limitation that the graph fits in memory on a single machine might seem excessively restrictive.

I always wondered if they still do this, and whether it influenced architectures at other companies.

Paper: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69...
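For a rough sense of what “the whole graph in memory on one box” implies, here is a back-of-envelope sketch; the edge count and bytes-per-edge are assumptions for illustration, not figures from the paper:

```python
# Back-of-envelope: can a whole follow graph fit in one server's RAM?
# The edge count and bytes-per-edge below are assumptions for illustration,
# not numbers from the WTF paper.

edges = 20_000_000_000        # assume ~20B follow edges
bytes_per_edge = 8            # assume one 64-bit user id per adjacency entry

graph_gb = edges * bytes_per_edge / 1e9
print(f"adjacency lists alone: ~{graph_gb:.0f} GB")
# -> ~160 GB, which a single large-memory server can hold, and id
#    compression (delta/varint encoding) would shrink it further.
```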

Yeah, I think single-machine setups have their place; I once sped up a program by 10,000x just by converting it to Cython so that everything fit in the CPU cache. But the cloud still has a place too, even for non-bursty loads, and even for loads that could theoretically fit on a single big server.
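The kind of win I mean comes from keeping the hot data contiguous and moving the inner loop into compiled code; here is a toy illustration (plain NumPy rather than Cython, and not the actual program):

```python
import time
import numpy as np

N = 10_000_000
data = np.random.rand(N)

# Hot loop in interpreted Python: every element gets boxed into a PyFloat.
t0 = time.perf_counter()
total = 0.0
for x in data:
    total += x
t1 = time.perf_counter()

# Same reduction as one compiled pass over a single contiguous buffer.
t2 = time.perf_counter()
total_vec = data.sum()
t3 = time.perf_counter()

print(f"python loop: {t1 - t0:.2f}s   compiled pass: {t3 - t2:.4f}s")
```

That said, the reasons I still reach for the cloud are operational.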

Uptime.

Or are you going to go down while you wait for all your workers to finish? What about long-lived connections? Etc.

It is way easier to gradually hand over traffic across multiple API servers as you do an upgrade than it is to figure out what to do with a single beefy machine.

I'm not saying it is always worth it, but I don't even think about the API servers when a deploy happens anymore.
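Here is a minimal sketch of that gradual handover; the drain/upgrade/health-check hooks are hypothetical stand-ins for whatever your load balancer and deploy tooling actually provide:

```python
import time

API_SERVERS = ["api-1", "api-2", "api-3", "api-4"]   # hypothetical fleet

def drain(server):
    """Hypothetical hook: stop routing new requests to `server` and wait
    for in-flight and long-lived connections to finish or migrate."""
    print(f"draining {server}")

def upgrade(server):
    """Hypothetical hook: ship the new build and restart the service."""
    print(f"upgrading {server}")

def healthy(server):
    """Hypothetical hook: health-check the new build before re-adding it."""
    return True

def rolling_deploy(servers):
    # One server at a time; the rest keep serving, so callers never notice.
    for server in servers:
        drain(server)
        upgrade(server)
        while not healthy(server):
            time.sleep(1)
        print(f"{server} back in rotation")

rolling_deploy(API_SERVERS)
```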

Furthermore, if you build your whole stack this way, the code will be non-distributed by default. That is easy to transition for some things and hell for others: some access patterns or algorithms are fine when everything is in a CPU cache or in memory, but would fall over completely across multiple machines. Part of the appeal of starting cloud-first is that it is generally easier to scale to billions of people afterwards.
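A concrete example of what I mean: a two-hop “friends of friends” walk. In memory each hop is a hash lookup; with the graph partitioned across machines, each hop can turn into a network round trip. The latency numbers here are rough assumptions, just to show the gap:

```python
# Toy follower graph, held entirely in one process's memory.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["carol", "dave"],
    "carol": ["dave"],
    "dave":  ["alice"],
}

def two_hop_candidates(user):
    """Friends-of-friends, a 'who to follow'-style query.
    Each hop is just a dict lookup while the graph is local."""
    lookups = 1
    candidates = set()
    for friend in follows.get(user, []):
        lookups += 1
        for fof in follows.get(friend, []):
            candidates.add(fof)
    candidates.discard(user)
    return candidates, lookups

cands, lookups = two_hop_candidates("alice")

# Assumed costs, for illustration only.
LOCAL_LOOKUP_S = 100e-9   # ~100 ns in-memory hash lookup
REMOTE_RPC_S   = 500e-6   # ~500 µs cross-machine round trip

print(cands)
print(f"all-local: ~{lookups * LOCAL_LOOKUP_S * 1e6:.1f} µs, "
      f"one RPC per hop: ~{lookups * REMOTE_RPC_S * 1e3:.1f} ms")
```

On a realistic graph with thousands of edges per user, that per-hop gap is the difference between microseconds and seconds.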

That said, I think the original article makes a nuanced case with several great points, and your highlighting of the Twitter example is a good showcase for where a single machine makes sense.