Comment by CaptainOfCoit

I've seen startups killed because of one or two "influential" programmers deciding they need to start architecting the project for 1000 TPS and 10K daily users, as "that's the proper way to build scalable software", while the project itself hasn't even found product-market fit yet and barely has users. Inevitably, the project needs to make a drastic change, which is now painful to do because it no longer fits the perfect vision the lead(s) had.

Cue the programmers blaming the product team for "always changing their mind" as they discover what users actually need, and the product team blaming the developers for being hesitant to make changes; and when the programmers do agree, it takes a long time to undo the perfect architecture they spent weeks fine-tuning against some imaginary future user base.

I was part of a small team that built a $300M company on Ruby and MySQL that made every scaling mistake you can possibly make. This was also the right decision because it forced us to stay lean and focus on what we needed right now, as opposed to getting starry-eyed about what it was going to be like when we had 10 million users. At every order of magnitude, we had sudden emergencies where some new part of the system had become a bottleneck, and we scrambled like crazy to rearchitect things to accommodate. It was hard, and it was fun. And it was frugal. We eventually hit over 10 million users before I left, and I can’t say I regret the painful approach one bit.

  • I also imagine you were pretty agile by not having tons of complexity to grapple with every time you wanted to add a new feature.

In my opinion, if those influential programmers actually architected around some concrete metrics like 1,000 TPS and 10K daily users, they would end up with much simpler systems.

The problem I see is much more about extremely vague notions of scalability, trends, best practices, clean code, and so on. For example: we need Kafka, because Kafka is for the big boys like us, not because the alternatives couldn't handle the actual numbers.

CV-driven development is a much bigger issue than people picking overly ambitious target numbers.

> 1000TPS and 10K daily users

I absolutely agree with your point, but I want to point out, like other commenters here, that the numbers should be much larger. We think that, because 10k daily users is a big deal for a product, they're also a big deal for a small server, but they really aren't.

It's fantastic that our servers nowadays can easily handle multiple tens of thousands of daily users on $100/mo.

  • Users/TPS aren't the right metric in the first place. I have a webhook glue side project that I didn't even realize had ~8k daily users/~300 TPS until I set up Cloudflare analytics. As a Go program doing trivial work, its load is dwarfed by the CPU/memory usage of all my seedbox-related software (which has one user, and not even every day).

    • > Users/TPS aren't the right metric in the first place.

      This was my initial point :) Don't focus on trying to achieve some metrics, focus on making sure to build the right thing.

  • > We think that, because 10k daily users is a big deal for a product, they're also a big deal for a small server, but they really aren't.

    Yeah, we seem to forget just how fast computers are nowadays. It obviously varies with the complexity of the app and what other tech you're using, but for simpler things, 10k daily users could be handled by a reasonably powerful desktop sitting under my desk without breaking a sweat.
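
    A rough back-of-envelope shows why. Assuming, say, ~50 requests per user per day (an illustrative number, not something from this thread), 10k daily users is about 500k requests/day, which averages out to roughly 6 requests/second, with peaks maybe an order of magnitude higher. That is well within what a single modest box can serve.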

I've seen senior engineers get fired and the business suffer a setback because they didn't have any way to scale beyond a single low-spec VPS from a budget provider, and their system crashed when a hall full of students tried to sign up together during a demo and each signup triggered 200ms of bcrypt CPU activity.

  • This seems weird. I have a lot of experience with Rails, which is considered super slow, but the scenario you describe is trivial. Just get a bigger VPS and change a single env var. Even if you fucked up everything else, like file storage etc., you can still do that. If you build your whole application in a way where you can't scale anything, you should be fired. That's not even easy to do.

    • People screw up the bcrypt thing all the time. Pick a single-threaded server stack (and run it on one core, because Kubernetes), then configure bcrypt so that brute-forcing 8-character passwords is slow on an A100. Configure Kubernetes to run it on a mid-range CPU because you have no load. Finally, leave your cloud provider's HTTP proxy timeout set to the default.

      The result is that 100% of auth requests time out once the login queue depth gets above a hundred or so. At that point, users retry their login attempts, so you need to scale out fast. If you haven't tested scale-out, then it's time to implement a bcrypt thread pool, or reimplement your application.

      But at least the architecture I described "scales".
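
      One way to bound the bcrypt work, in the spirit of the thread-pool fix mentioned above, is a small semaphore that caps concurrent hashes and fails fast when the wait would outlive the proxy timeout. A minimal Go sketch (the slot count, timeout, and names are illustrative assumptions, not from any particular stack):

        package auth

        import (
            "errors"
            "time"

            "golang.org/x/crypto/bcrypt"
        )

        // One slot per core we're willing to spend on password hashing.
        var bcryptSlots = make(chan struct{}, 4)

        var errBusy = errors.New("auth temporarily overloaded")

        // checkPassword runs the ~200ms bcrypt compare only if a slot frees
        // up well inside the proxy timeout; otherwise it fails fast, giving
        // the client a clean error to retry instead of joining a queue that
        // will time out anyway.
        func checkPassword(hash []byte, password string) error {
            select {
            case bcryptSlots <- struct{}{}:
                defer func() { <-bcryptSlots }()
                return bcrypt.CompareHashAndPassword(hash, []byte(password))
            case <-time.After(2 * time.Second):
                return errBusy
            }
        }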

    • Of course you should be fired for doing that! I meant the example as an illustration of how "you don't need to scale" thinking turns into A-grade bullshit.

      You do, in fact, need to scale to trivial numbers of users. You may even need to scale to a small number of users in the near future.

  • I will never forget the time my university's home-grown web-based registration system crashed at the beginning of the semester, and the entire student body had to form a line to have their registrations entered manually. I waited a whole day, and they still hadn't gotten round to me by nightfall, so I had to wait the next day too.

    • “Knowing what’s reasonable” matters.

      If you have a product that’s being deployed for a new school year, yeah you should be prepared for any one-time load for that time period.

      Many products don’t have the “school year just started” spikes. But some do.

      It requires careful thought, pragmatism, and business sense to balance everything and achieve the most with the available resources.

  • I wonder which one happens more often? Personally, I haven't worked in the kind of "find the person to blame" culture that would lead to something like that, so I haven't witnessed what you're describing, but I believe you that it happens in some places.

  • That’s a skill issue, not an indictment of the architecture's limitations. You can spin up N servers and load-balance them, as TFA points out. If the server is a snowflake and nothing is captured in IaC, that is, again, not an architectural issue but a personnel/knowledge issue.

    • The architecture in TFA is fine, and sounds preferable to microservices for most use cases.

      I am worried by the talk of 10k daily users and a peak of 1000 TPS being premature optimisation. Those numbers are quite low. You should know your expected traffic patterns, add a margin of error, and stress-test your system to make sure it can handle the traffic.

      I disagree that self-inflicted architectural issues and personnel issues are different.

  • I frankly don't believe that anyone was fired overnight in a workplace where the user base can be characterized as "a hall full of students". That doesn't happen at these places. Reprimanded, maybe.

    • More frequently, anyone who sounded the alarm about this was let go months earlier, so the person who ought to be fired is the one in charge of the firing.

      Instead, they celebrate "learning from running at scale" or some nonsense.

Something that does not scale to 10k users is likely so badly architected that it would actually be faster to iterate on if it were more scalable, i.e. better architected and more maintainable.

  • For reference, in 1999, 10K was still considered a (doable) challenge (the "C10K" problem), but that was 10K simultaneous connections, not 10K per day.

    The modern equivalent challenge is 10 million simultaneous users per machine.

1000TPS isn't that much? Engineer for low latency, and with a 10ms budget that'd be 10 cores if it were CPU-bound; less in practice, since part of the time is usually spent in I/O wait.
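
Spelling out the arithmetic: 1000 requests/second × 10 ms of CPU per request is 10 CPU-seconds of work arriving every second, i.e. about 10 cores kept fully busy; whatever fraction of that 10 ms is actually spent waiting on I/O rather than on the CPU comes straight off that core count.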

  • > 1000TPS isn't that much?

    Why does that matter? My argument is: Engineer for what you know, leave the rest for when you know better, which isn't before you have lots of users.

    • What I'm saying is that "building for 1000TPS" is not what gets you an overengineered 5-layer microservice architecture. If you build for a good user experience (which includes low latency), you end up handling that not-that-big scale without sharding.

  • I doubt much time would be spent in I/O wait if this were really a scale-up architecture. Even ignoring the hundreds of GB of page cache, it should be sitting on NVMe drives, where a write is just a PCIe round trip and a read is < 1ms.

  • And with CPUs now being shipped with 100+ cores, you can brute force that sucker a long way.

Clearly this project failed either because

  1. it was built to scale for a very specific use case, or because
  2. it hadn't even found product-market fit.

Blaming the failure on designing for scale seems misplaced; you can scale while remaining agile and open to change.

> 1000TPS and 10K daily users

That is not a lot. You can host that on a Raspberry Pi.

  • That entirely depends on what these transactions are meant to do.

    I always find these debates weird. How can you compare one app’s TPS with another's?

  • Not if you’re going to be “web scale” (tm) you can’t.

    • You can host it on 8 Raspberry Pis: three for etcd, three for MinIO/Ceph, and two for Kubernetes workers.

      (16 if you need geo replication.)

On the flip side, I've seen a project fail because it was built on the unvalidated assumption that the naive architecture would scale to real-world loads, only to find that a modest real-world workload exceeded its design targets by a factor of 100x. You really do need technical leadership with good judgment and experience; we can't substitute it with facile "assume low scale" or "assume large scale" axioms.

You simply can't get the software or support for a lot of smaller-scale solutions. It can sometimes be easier to do the seemingly more difficult thing, partly because all the money goes to those more difficult-seeming technical problems and solutions.