Comment by strken

1 day ago

I've seen senior engineers get fired and the business suffer a setback because they had no way to scale beyond a single low-spec VPS from a budget provider, and their system crashed when a hall full of students tried to sign up together during a demo, each triggering 200ms of bcrypt CPU time.

This seems weird. I have a lot of experience with Rails, which is considered super slow, and the scenario you describe is trivial: just get a bigger VPS and change a single env var. Even if you fucked up everything else, like file storage, you can still do that. If you build your whole application in a way where you can't scale anything, you should be fired. That is not even easy to do.
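
For what it's worth, on a stock Rails + Puma setup that "single env var" is roughly the following (a sketch from memory; exact defaults vary by Rails version):

    # config/puma.rb – approximately what the Rails-generated file contains
    threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
    threads threads_count, threads_count

    # On the bigger VPS, raise this to run one worker process per core.
    workers ENV.fetch("WEB_CONCURRENCY") { 2 }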

  • People screw up the bcrypt thing all the time. Pick a single-threaded server stack (and run it on one core, because Kubernetes), then configure bcrypt so that brute-forcing 8-character passwords is slow on an A100. Configure Kubernetes to run on a mid-range CPU because you have no load. Finally, leave your cloud provider's HTTP proxy timeout at its default.

    The result is that 100% of auth requests time out once the login queue depth gets above a hundred or so. At that point, users retry their login attempts, so you need to scale out fast. If you haven't tested scale-out, then it's time to implement a bcrypt thread pool (rough sketch below) or reimplement your application.

    But at least the architecture I described "scales".
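
    To make the failure mode concrete, here's a rough Ruby sketch. The ~200ms hash time, 30s proxy timeout, and pool size are illustrative assumptions, and the pool part presumes a bcrypt binding that releases the GVL while hashing; if yours doesn't, the pool only bounds concurrency rather than adding parallelism:

        require 'bcrypt'
        require 'concurrent'

        # Why the queue melts down: with one core and ~200ms per hash, the Nth
        # queued login waits roughly N * 0.2s, so once about 30s / 0.2s = 150
        # logins are queued, every new attempt blows the proxy timeout and the
        # retries pile on top.

        # A bounded worker pool keeps password hashing from starving the request thread.
        HASH_POOL = Concurrent::FixedThreadPool.new(4)

        def hash_password_async(plaintext)
          Concurrent::Promises.future_on(HASH_POOL) do
            BCrypt::Password.create(plaintext, cost: 12)
          end
        end

        # hash_password_async("hunter2").value!  # block only where you must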

    • Fond memories of a job circa 2013 on a very large Rails app, where CI times sped up by a factor of 10 once someone realized bcrypt was misconfigured for the test environment and was slowing things down every time a user was created through a factory.
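
      The usual fix (a sketch of the standard approach, not necessarily what that app did) is to drop the bcrypt cost to its minimum in the test environment:

          # e.g. in spec/spec_helper.rb or a test-only initializer
          require 'bcrypt'

          # MIN_COST makes each hash take microseconds instead of hundreds of
          # milliseconds, so user factories stop dominating CI time.
          BCrypt::Engine.cost = BCrypt::Engine::MIN_COST if Rails.env.test?

          # With Devise, the equivalent knob lives in config/initializers/devise.rb:
          #   config.stretches = Rails.env.test? ? 1 : 12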

    • "because Kubernetes"? Is this assuming that you're running your server inside of a Kubernetes instance (and if so, is Kubernetes going to have problems with more than one thread?), or is there some other reason why it comes into this?

  • Of course you should be fired for doing that! I meant the example as an illustration of how "you don't need to scale" thinking turns into A-grade bullshit.

    You do, in fact, need to scale to trivial numbers of users. You may even need to scale to a small number of users in the near future.

    • I'm not seeing how your example proves that a beefy-server, cloud-free architecture cannot handle the workload most companies will encounter. The under-specified VPS you describe is not what is being discussed in the article.

I will never forget the time my university's home-grown web-based registration system crashed at the beginning of the semester, and the entire student body had to form a line to have their registrations entered manually. I waited a whole day, they still hadn't got round to me by nightfall, and I had to come back the next day too.

  • “Knowing what’s reasonable” matters.

    If you have a product that’s being deployed for a new school year, then yeah, you should be prepared for the one-time load spike at that time of year.

    Many products don’t have the “school year just started” spikes. But some do.

    It requires careful thought, pragmatism, and business sense to balance everything and achieve the most with the available resources.

Wonder which one happens more often? Personally I haven't worked in the kind of "find the person to blame" culture that would lead to something like that, so I haven't witnessed what you're talking about, but I believe you that it does happen in some places.

That’s a skill issue, not an indictment of the architecture's limitations. You can spin up N servers and load-balance them, as TFA points out. If the server is a snowflake with nothing in IaC, that again is not an architectural issue but a personnel/knowledge issue.

  • The architecture in TFA is fine, and sounds preferable to microservices for most use cases.

    I am worried by the suggestion that planning for 10k daily users and a peak of 1000 TPS is premature optimisation. Those numbers are quite low. You should know your expected traffic patterns, add a margin of error, and stress-test your system to make sure it can handle the traffic (rough arithmetic below).

    I disagree that self-inflicted architectural issues and personnel issues are different.
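
    Rough arithmetic, with assumed per-request CPU costs (illustrative numbers, not from TFA):

        peak_tps         = 1000
        cpu_per_req_ms   = 5    # typical CRUD request CPU time (assumption)
        cpu_per_login_ms = 200  # a request that does a bcrypt hash (assumption)

        puts peak_tps * cpu_per_req_ms / 1000.0    # => 5.0 cores of CPU: fine on one big box
        puts peak_tps * cpu_per_login_ms / 1000.0  # => 200.0 cores of CPU: very much not fine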

I frankly don't believe that anyone was fired overnight at a workplace whose userbase can be characterized as a "hall full of students". That doesn't happen at these places. Reprimanded, maybe.

  • More frequently, anyone who sounded the alarm about this was let go months earlier, so the person who would deserve to be fired is the one in charge of the firing.

    Instead, they celebrate "learning from running at scale" or some nonsense.