Comment by CaptainOfCoit

I've seen startups killed because of one or two "influential" programmers deciding they need to start architecting the project for 1000 TPS and 10K daily users, as "that's the proper way to build scalable software", while the project itself hasn't even found product-market fit yet and barely has users. Inevitably, the project needs to make a drastic change, which is now painful to do because it no longer fits the perfect vision the lead(s) had.

Cue the programmers blaming the product team for "always changing their mind" as they discover what users actually need, and the product team blaming the developers for being hesitant to make changes; and when the programmers do agree, it takes a long time to undo the perfect architecture they spent weeks fine-tuning against some imaginary future user base.

I was part of a small team that built a $300M company on Ruby and MySQL that made every scaling mistake you can possibly make. This was also the right decision because it forced us to stay lean and focus on what we needed right now, as opposed to getting starry-eyed about what it was going to be like when we had 10 million users. At every order of magnitude, we had sudden emergencies where some new part of the system had become a bottleneck, and we scrambled like crazy to rearchitect things to accommodate. It was hard, and it was fun. And it was frugal. We eventually hit over 10 million users before I left, and I can’t say I regret the painful approach one bit.

  • I also imagine you were pretty agile by not having tons of complexity to grapple with every time you wanted to add a new feature.

In my opinion, if those influential programmers actually architected around some concrete metrics like 1,000 TPS and 10K daily users, they would end up with much simpler systems.

The problem I see is much more about extremely vague notions of scalability, trends, best practices, clean code, and so on. For example: we need Kafka, because Kafka is for the big boys like us, not because the alternatives couldn't handle the actual numbers.

CV-driven development is a much bigger issue than people picking overly ambitious target numbers.

> 1000TPS and 10K daily users

I absolutely agree with your point, but I want to point out, like other commenters here, that the numbers should be much larger. We think that, because 10k daily users is a big deal for a product, they're also a big deal for a small server, but they really aren't.

It's fantastic that our servers nowadays can easily handle multiple tens of thousands of daily users on $100/mo.

  • Users/TPS aren't the right metric in the first place. I have a webhook glue side project that I didn't even realize had ~8k daily users/~300 TPS until I set up Cloudflare analytics. As a Go program doing trivial work, its load is dwarfed by the CPU/memory usage of all my seedbox-related software (which has one user, and not even every day).

    • > Users/TPS aren't the right metric in the first place.

      This was my initial point :) Don't focus on trying to achieve some metrics, focus on making sure to build the right thing.

  • > We think that, because 10k daily users is a big deal for a product, they're also a big deal for a small server, but they really aren't.

    Yeah, we seem to forget just how fast computers are nowadays. It obviously varies with the complexity of the app and what other tech you're using, but for simpler things, 10k daily users could be handled by a reasonably powerful desktop sitting under my desk without breaking a sweat.
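
    A rough back-of-envelope shows why. Assuming, say, ~50 requests per user per day (an illustrative number, not something from this thread), 10k daily users is about 500k requests/day, which averages out to roughly 6 requests/second, with peaks maybe an order of magnitude higher. That is well within what a single modest box can serve.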

I've seen senior engineers get fired and the business suffer a setback because they didn't have any way to scale beyond a single low-spec VPS from a budget provider, and their system crashed when a hall full of students tried to sign up together during a demo and each signup triggered 200ms of bcrypt CPU activity.

  • This seems weird. I have a lot of experience with Rails, which is considered super slow, but the scenario you describe is trivial. Just get a bigger VPS and change a single env var. Even if you fucked up everything else, like file storage etc., you can still do that. If you build your whole application in a way where you can't scale anything, you should be fired. That's not even easy to do.

    • People screw up the bcrypt thing all the time. Pick a single-threaded server stack (and run it on one core, because Kubernetes), then configure bcrypt so that brute-forcing 8-character passwords is slow on an A100. Configure Kubernetes to run it on a mid-range CPU because you have no load. Finally, leave your cloud provider's HTTP proxy timeout set to the default.

      The result is that 100% of auth requests time out once the login queue depth gets above a hundred or so. At that point, users retry their login attempts, so you need to scale out fast. If you haven't tested scale-out, then it's time to implement a bcrypt thread pool, or reimplement your application.

      But at least the architecture I described "scales".
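
      One way to bound the bcrypt work, in the spirit of the thread-pool fix mentioned above, is a small semaphore that caps concurrent hashes and fails fast when the wait would outlive the proxy timeout. A minimal Go sketch (the slot count, timeout, and names are illustrative assumptions, not from any particular stack):

        package auth

        import (
            "errors"
            "time"

            "golang.org/x/crypto/bcrypt"
        )

        // One slot per core we're willing to spend on password hashing.
        var bcryptSlots = make(chan struct{}, 4)

        var errBusy = errors.New("auth temporarily overloaded")

        // checkPassword runs the ~200ms bcrypt compare only if a slot frees
        // up well inside the proxy timeout; otherwise it fails fast, giving
        // the client a clean error to retry instead of joining a queue that
        // will time out anyway.
        func checkPassword(hash []byte, password string) error {
            select {
            case bcryptSlots <- struct{}{}:
                defer func() { <-bcryptSlots }()
                return bcrypt.CompareHashAndPassword(hash, []byte(password))
            case <-time.After(2 * time.Second):
                return errBusy
            }
        }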

    • Of course you should be fired for doing that! I meant the example as an illustration of how "you don't need to scale" thinking turns into A-grade bullshit.

      You do, in fact, need to scale to trivial numbers of users. You may even need to scale to a small number of users in the near future.

  • I will never forget the time my university's home-grown web-based registration system crashed at the beginning of the semester, and the entire student body had to form a line to have their registrations entered manually. I waited a whole day, and they still hadn't gotten round to me by nightfall, so I had to wait the next day too.

    • “Knowing what’s reasonable” matters.

      If you have a product that’s being deployed for a new school year, yeah you should be prepared for any one-time load for that time period.

      Many products don’t have the “school year just started” spikes. But some do.

      It requires careful thought, pragmatism, and business sense to balance everything and achieve the most with the available resources.

  • I wonder which one happens more often? Personally, I haven't worked in the kind of "find the person to blame" culture that would lead to something like that, so I haven't witnessed what you're describing, but I believe you that it happens in some places.

  • That’s a skill issue, not an indictment of the architecture's limitations. You can spin up N servers and load-balance them, as TFA points out. If the server is a snowflake and nothing is captured in IaC, that is, again, not an architectural issue but a personnel/knowledge issue.

    • The architecture in TFA is fine, and sounds preferable to microservices for most use cases.

      I am worried by the talk of 10k daily users and a peak of 1000 TPS being premature optimisation. Those numbers are quite low. You should know your expected traffic patterns, add a margin of error, and stress-test your system to make sure it can handle the traffic.

      I disagree that self-inflicted architectural issues and personnel issues are different.

  • I frankly don't believe that anyone was fired overnight in a workplace where the user base can be characterized as "a hall full of students". That doesn't happen at these places. Reprimanded, maybe.

    • More frequently, anyone who sounded the alarm about this was let go months earlier, so the person who ought to be fired is the one in charge of the firing.

      Instead, they celebrate "learning from running at scale" or some nonsense.

Something that does not scale to 10k users is likely so badly architected that it would actually be faster to iterate on if it were more scalable, i.e. better architected and more maintainable.

  • For reference, in 1999, 10K was still considered a (doable) challenge (the "C10K" problem), but that was 10K simultaneous connections, not 10K per day.

    The modern equivalent challenge is 10 million simultaneous users per machine.

1000TPS isn't that much? Engineer for low latency, and with a 10ms budget that'd be 10 cores if it were CPU-bound; less in practice, since part of the time is usually spent in I/O wait.
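
Spelling out the arithmetic: 1000 requests/second × 10 ms of CPU per request is 10 CPU-seconds of work arriving every second, i.e. about 10 cores kept fully busy; whatever fraction of that 10 ms is actually spent waiting on I/O rather than on the CPU comes straight off that core count.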

  • > 1000TPS isn't that much?

    Why does that matter? My argument is: Engineer for what you know, leave the rest for when you know better, which isn't before you have lots of users.

    • What I'm saying is that "building for 1000TPS" is not what gets you an overengineered 5-layer microservice architecture. If you build for a good user experience (which includes low latency), you end up handling that not-that-big scale without sharding.

  • I doubt much time would be spent in I/O wait if this were really a scale-up architecture. Even ignoring the hundreds of GB of page cache, it should be sitting on NVMe drives, where a write is just a PCIe round trip and a read is < 1ms.

  • And with CPUs now being shipped with 100+ cores, you can brute force that sucker a long way.

Clearly this project failed either because

  1. it was built to scale for a very specific use case, or because
  2. it hadn't even found product-market fit.

Blaming the failure on designing for scale seems misplaced; you can scale while remaining agile and open to change.

> 1000TPS and 10K daily users

That is not a lot. You can host that on a Raspberry Pi.

  • That entirely depends on what these transactions are meant to do.

    I always find these debates weird. How can you compare one app’s TPS with another's?

  • Not if you’re going to be “web scale” (tm) you can’t.

    • You can host it on 8 Raspberry Pis: three for etcd, three for MinIO/Ceph, and two for Kubernetes workers.

      (16 if you need geo replication.)

On the flip side, I've seen a project fail because it was built on the unvalidated assumption that the naive architecture would scale to real-world loads, only to find that a modest real-world workload exceeded its design targets by a factor of 100x. You really do need technical leadership with good judgment and experience; we can't substitute it with facile "assume low scale" or "assume large scale" axioms.

You simply can't get the software or support for a lot of smaller-scale solutions. It can sometimes be easier to do the seemingly more difficult thing, partly because all the money goes to those more difficult-seeming technical problems and solutions.