Comment by Jtsummers

3 days ago

Designing or intending a system to be used at massive scale is not the same as building and deploying it so that its very first run is at that massive scale.

That's just a recipe for disaster: "We don't even know if we can handle 100 users, so let's now force 1 million people to use the system simultaneously." Even WhatsApp couldn't handle hundreds of millions of users on the day it was first released, nor did it attempt to. You build out slowly and make sure things work, at least if you're competent and sane.

Sure, but if you did a good job, the gradual deployment can go relatively quickly and smoothly, which is how $FAANG roll out new features and products to very large audiences. The actual rollout is usually a bit of an implementation detail of what first needed to be architected to handle that larger scale.

  • The issue with FAANG is that they already have the infrastructure to make these large-scale deployments. So any new system - by necessity - needs to conform to that large-scale architecture.

    • The other nice thing about FAANG is that almost nothing they do is actually necessary. If Facebook rolls out a new feature and breaks something for a few hours, it doesn't actually matter. It's harder to move fast and break things if you're, say, a bank, and every minute of downtime is a minute where your customers can't access their money. Enough minutes go by and you may have a very, very expensive crisis on your hands.


  • You may get certain big pieces correct, but you'd be surprised how many mistakes get made. For example, I designed the billing system for a large distributed product, but the engineer ended up implementing it differently from the spec, and it fell down fairly quickly with even a modicum of growth.

  • Well, Google got good at large-scale rollouts because they are doing large-scale rollouts all the time. _And_ most of the time, the system they are rolling out is a small iteration on the last system they rolled out: the new Gmail servers look almost exactly like the last Gmail servers, but they have one extra feature flag you can turn on (and which is disabled by default) or have one bug fixed.

    That's a very different challenge from rolling out a brand new system once.

  • FAANG tests first on test beds, and on subsets of their user base.

    • Also, see what happened last week when Cloudflare pushed out a bad configuration without trying it on a subset.
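
To make the "feature flag, disabled by default, tried on a subset first" idea from the replies above concrete, here is a minimal sketch. It is not how any particular company does it; the names `bucket`, `is_enabled`, `enabled_users`, and the flag name `new_inbox` are hypothetical, and hashing the user ID into 100 buckets is just one common way to pick a stable subset.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag.

    Hashing flag + user ID (an assumption of this sketch) keeps the
    assignment deterministic and independent across different flags.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, flag: str, percent: int = 0) -> bool:
    """Return True if the flag is on for this user.

    percent defaults to 0, so the flag is off for everyone until
    someone deliberately raises it.
    """
    return bucket(user_id, flag) < percent


def enabled_users(user_ids, flag, percent):
    """Return the users who would see the feature at this rollout percent."""
    return [u for u in user_ids if is_enabled(u, flag, percent)]
```

Because each user's bucket never changes, raising the percentage from, say, 5 to 25 only adds users; nobody who already had the feature loses it, and dropping the percentage back to 0 turns it off for everyone.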

No, but WhatsApp was built by two guys who had previously worked at Yahoo, and they picked a very strong tech for the backend: Erlang.

So while they probably didn't bother scaling the service to millions in the first version, they 1) knew what it would take, 2) chose from the ground up a good technology for a smoother transition to your "X million users". The step from "X million" to "XYZ million" and then billions required other things too.

At least they didn't have to write a PHP-to-C++ compiler like Facebook did, given Mark Zuckerberg's initial design choice, which shows exactly what it means to begin something with the right tool and ideas already in mind.

But this takes skill.

  • > No, but WhatsApp was built by two guys who had previously worked at Yahoo, and they picked a very strong tech for the backend: Erlang.

    https://news.ycombinator.com/item?id=44911553

    Started as PHP, not as Erlang.

    > 1) knew what it would take, 2) chose from the ground up a good technology for a smoother transition to your "X million users".

    No, as above, that was a pivot. They did not start from the ground up with Erlang or ejabberd; they adopted that later.

  • They took an already existing protocol (XMPP) and an already existing implementation in Erlang (ejabberd). There weren't many alternatives at the time, really.

  • Did they succeed because of Erlang or in spite of Erlang? We can't draw any reliable conclusions from a single data point. Maybe a different platform would have worked even better?

    • Erlang is uniquely suited to chat systems out of the box in a way that most other ecosystems aren't. It has lightweight green threads via the BEAM VM, a process scheduler that makes it concurrent out of the box, immutable data structures, and message passing as the means of communication between processes.


    • Yeah - the technology used is a separate concern from their abilities as users (developers) of that technology and their effectiveness at handling the scale.

      I, for example, have always said that I am more than capable of writing code in C that is several orders of magnitude SLOWER than what I could write in, say, Python.

      My skillset would never be used as an example of the value of C for whatever