
Comment by torginus

18 hours ago

I think the issue with this formulation is that what drives the cost at cloud providers isn't necessarily that their hardware is too expensive (which it is), but that they push you toward overcomplicated and inefficient architectures that cost too much to run.

At the core of this are all the 'managed' services - if you have a server box, it's in your financial interest to squeeze as much performance out of it as possible. If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work would only result in fewer billed infrastructure hours.

This 'microservices' push usually means that instead of having an on-server session where you can serve things from a temporary cache, all the data that persists between requests needs to be stored in a DB somewhere, all the auth logic needs to re-check your credentials, and something needs to direct the traffic and load-balance these endpoints - and all of this costs money.
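
To make the contrast concrete, here's a minimal sketch (plain Python, hypothetical names) of the in-process TTL cache a long-lived server can serve from - exactly the thing a stateless serverless function can't keep between invocations, which is what forces every request through an external store:

```python
import time

class TTLCache:
    """Tiny in-process cache: only useful when requests keep landing on a
    long-lived server process that retains memory between requests."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry_timestamp, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.time() > expiry:
            del self.store[key]  # expired: drop and miss
            return None
        return value

    def put(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)

cache = TTLCache(ttl_seconds=300)

def handle_request(user_id, load_profile_from_db):
    # On a stateful box this DB hit happens once per TTL window; on
    # stateless serverless, every invocation instead pays for a round
    # trip to an external store (Redis, DynamoDB, ...) - billed per call.
    profile = cache.get(user_id)
    if profile is None:
        profile = load_profile_from_db(user_id)
        cache.put(user_id, profile)
    return profile
```

`load_profile_from_db` is a stand-in for whatever actually fetches the data; the point is only that the second request within the TTL never leaves the process.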

I think if you have 4 Java boxes as servers with a redundant DB with read replicas on EC2, your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.

These crazy AWS bills usually come from using every service under the sun.

The complexity is what gets you. One of AWS's favorite situations is

1) Senior engineer starts on AWS

2) Senior engineer leaves because our industry does not value longevity or loyalty at all whatsoever (not saying it should, just observing that it doesn't)

3) New engineer comes in and panics

4) Ends up using a "managed service" to relieve the panic

5) New engineer leaves

6) Second new engineer comes in and not only panics but outright needs help

7) Paired with some "certified AWS partner" who claims to help "reduce cost" but who actually gets a kickback from the extra spend they induce (usually 10% if I'm not mistaken)

Calling it ransomware is obviously hyperbolic, but there are definitely some parallels one could draw.

On top of it all, AWS pricing is about to go up massively due to the RAM price increase. There's no way it won't, since AWS is over half of Amazon's profit while only around 15% of its revenue.

  • One of the biggest problems with the self-hosted situations I’ve seen is when the senior engineers who set it up leave and the next generation has to figure out how to run it all.

    In theory with perfect documentation they’d have a good head start to learn it, but there is always a lot of unwritten knowledge involved in managing an inherited setup.

    With AWS the knowledge is at least transferable and you can find people who have worked with that exact thing before.

    Engineers also leave for a lot of reasons. Even highly paid engineers go off and retire, change to a job for more novelty, or decide to try starting their own business.

    • >With AWS the knowledge is at least transferable

Unfortunately, a lot of things in AWS can also be messed up, so it might be really hard to research what is going on. For example, you could have hundreds of Lambdas running with no idea where the original sources are or how they're connected to each other, or complex VPC network routing where rules and security groups are shared haphazardly between services - so a small change can degrade a completely different service (you were hired to help with service X, but after your change some service Y went down, and you weren't even aware it existed).


    • In my experience, a back end on the cloud isn't necessarily any less complex than something self hosted.

There are many great developers who are not also SREs. Building and operating/maintaining require different mindsets.

The end result of all this is that the percentage of people who know how to implement systems without AWS/Azure will be in the single digits. From that point on, this will be the only "economic" way, no matter what the prices are.

It already is like that, but not because of the cloud. Those of us who began with computers in the era of the command line were forced to learn the internals of operating systems, and many ended up turning this hobby into a job.

      Youngsters nowadays start with very polished interfaces and smartphones, so even if the cloud wasn't there it would take them a decade to learn systems design on-the-job, which means it wouldn't happen anyway for most. The cloud nowadays mostly exists because of that dearth of system internals knowledge.

      While there still are around people who are able to design from scratch and operate outside a cloud, these people tend to be quite expensive and many (most?) tend to work for the cloud companies themselves or SaaS businesses, which means there's a great mismatch between demand and supply of experienced system engineers, at least for the salaries that lower tier companies are willing to pay. And this is only going to get worse. Every year, many more experienced engineers are retiring than the noobs starting on the path of systems engineering.

That's not a factual statement about reality, but more of a normative judgement used to justify resignation. Yes, professionals who know how to actually do these things are not abundantly available, but they are available enough to achieve the transition. The talent exists and is absolutely passionate about software freedom, and hence highly intrinsically motivated to work on it. The only thing lacking so far is demand - the available talent will skyrocket when the market starts demanding it.


It’s all anecdotal, but in my experience it’s usually the opposite. A bored senior engineer wants to use something new and picks a bespoke AWS service for a new project.

    I am sure it happens a multitude of ways but I have never seen the case you are describing.

I’ve seen your case more often than the ransom scenario too. But even more often: an early-to-mid-career dev saw a cloud pattern trending online, heard it was a new “best practice,” and so needed to find a way to move their company to using it.

    • Is that what I should be doing? I'm just encouraging the devs on my team to read designing data intensive apps and setting up time for group discussions. Aside from coding and meetings that is.

  • > 7) Paired with some "certified AWS partner"

    What do you think RedHat support contracts are? This situation exists in every technology stack in existence.

  • > 3) New engineer comes in and panics

    > 4) Ends up using a "managed service" to relieve the panic

    It's not as though this is unique to cloud.

    I've seen multiple managers come in and introduce some SaaS because it fills a gap in their own understanding and abilities. Then when they leave, everyone stops using it and the account is cancelled.

    The difference with cloud is that it tends to be more central to the operation, so can't just be canceled when an advocate leaves.

  • > One of AWS's favorite situations

    I'll give you an alternative scenario, which IME is more realistic.

    I'm a software developer, and I've worked at several companies, big and small and in-between, with poor to abysmal IT/operations. I've introduced and/or advocated cloud at all of them.

    The idea that it's "more expensive" is nonsense in these situations. Calculate the cost of the IT/operations incompetence, and the cost of the slowness of getting anything done, and cloud is cheap.

    Extremely cheap.

    Not only that, it can increase shipping velocity, and enable all kinds of important capabilities that the business otherwise just wouldn't have, or would struggle to implement.

    Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers.

    • > Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers

      This has been my experience as well. There are legitimate points of criticism but every time I’ve seen someone try to make that argument it’s been comparing significantly different levels of service (e.g. a storage comparison equating S3 with tape) or leaving out entire categories of cost like the time someone tried to say their bare metal costs for a two server database cluster was comparable to RDS despite not even having things like power or backups.

Just this week a friend of mine was spinning up some AWS managed service, complaining about the complexity and how any reconfiguration took 45 minutes to reload. It's a service you can just install with apt, and the default configuration is fine. Not only are many services no longer cheaper in the cloud, the management overhead also exceeds that of on-prem.

  • I'd gladly use (and maybe even pay for!) an open-source reimplementation of AWS RDS Aurora. All the bells and whistles with failover, clustering, volume-based snaps, cross-region replication, metrics etc.

As far as I know, nothing comes close to Aurora functionality. Even in the vibecoding world. No, 'apt-get install postgres' is not enough.

Serverless v2 is one of the products I was skeptical about, but it's genuinely one of the most robust solutions out there in that space. It has its warts, but I usually default to it for fresh installs because you get so much out of the box with it.

    • Nitpick (I blame Amazon for their horrible naming): Aurora and RDS are separate products.

      What you’re asking for can mostly be pieced together, but no, it doesn’t exist as-is.

      Failover: this has been a thing for a long time. Set up a synchronous standby, then add a monitoring job that checks heartbeats and promotes the standby when needed. Optionally use something like heartbeat to have a floating IP that gets swapped on failover, or handle routing with pgbouncer / pgcat etc. instead. Alternatively, use pg_auto_failover, which does all of this for you.

      Clustering: you mean read replicas?

      Volume-based snaps: assuming you mean CoW snapshots, that’s a filesystem implementation detail. Use ZFS (or btrfs, but I wouldn’t, personally). Or Ceph if you need a distributed storage solution, but I would definitely not try to run Ceph in prod unless you really, really know what you’re doing. Lightbits is another solution, but it isn’t free (as in beer).

      Cross-region replication: this is just replication? It doesn’t matter where the other node[s] are, as long as they’re reachable, and you’ve accepted the tradeoffs of latency (synchronous standbys) or potential data loss (async standbys).

      Metrics: Percona Monitoring & Management if you want a dedicated DB-first, all-in-one monitoring solution, otherwise set up your own scrapers and dashboards in whatever you’d like.

      What you will not get from this is Aurora’s shared cluster volume. I personally think that’s a good thing, because I think separating compute from storage is a terrible tradeoff for performance, but YMMV. What that means is you need to manage disk utilization and capacity, as well as properly designing your failure domain. For example, if you have a synchronous standby, you may decide that you don’t care if a disk dies, so no messing with any kind of RAID (though you’d then miss out on ZFS’ auto-repair from bad checksums). As long as this aligns with your failure domain model, it’s fine - you might have separate physical disks, but co-locate the Postgres instances in a single physical server (…don’t), or you might require separate servers, or separate racks, or separate data centers, etc.

tl;dr: you can fairly closely replicate the Aurora experience, but you’ll need to know what you’re doing. And frankly, if you don’t, then even if someone built an OSS product that does all of this, you shouldn’t be running it in prod - how will you fix issues when they crop up?
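
      To make the failover piece concrete, here's a minimal Python sketch of the heartbeat-check-then-promote loop described above - the part that pg_auto_failover automates for real. The `check_primary` and `promote_standby` callables are hypothetical stand-ins for an actual health check (e.g. a `SELECT 1` over the wire) and an actual promotion command (e.g. `pg_ctl promote`):

```python
import time

def monitor_and_failover(check_primary, promote_standby,
                         interval=5, max_missed=3):
    """Sketch of a failover monitor: poll the primary's heartbeat and
    promote the synchronous standby once enough checks fail in a row.
    Both callables are injected so the sketch stays self-contained."""
    missed = 0
    while True:
        if check_primary():
            missed = 0  # primary is healthy; reset the counter
        else:
            missed += 1
            if missed >= max_missed:
                # Primary declared dead: promote the standby. A real
                # setup would also fence the old primary and repoint
                # clients (floating IP, pgbouncer config reload, ...).
                promote_standby()
                return "promoted"
        time.sleep(interval)
```

      The `max_missed` threshold is what keeps a single dropped heartbeat from triggering a spurious (and expensive) failover.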


What managed service? Curious - I don’t use the full suite of AWS services, but I wonder what would take 45 minutes. Maybe it was a large cluster of some sort that needed rolling changes?

My observation is that all these services are exploding in complexity, and the justification is that there are more features now, so everyone needs to accept spending more and more time and effort for the same results.


> If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in less billed infrastructure hours.

If ECS is faster, then you're more satisfied with AWS and less likely to migrate. You're also open to additional services that might bring up the spend (e.g. ECS Container Insights or X-Ray)

Source: Former Amazon employee

  • We did some benchmarks and ECS was definitely quite a bit more expensive for a given capacity than just running docker on our own EC2 instances. It also bears pointing out that a lot of applications (either in-house or off-the-shelf) expect a persistent mutable config directory or sqlite database.

We used EFS to solve that issue, but it was very awkward, expensive, and slow - it's certainly not meant for that.

Also don’t forget the layers of REST services, ORMs, and other lunacy.

It runs slower than a bloated pig, especially on a shared hosting node, so now needs Kubernetes and cloud orchestration to make it “scalable” - beyond a few requests per second.

I don’t understand why most cloud backend designs seem to strive for maximizing the number of services used.

My biggest gripe with this is async tasks, where the app performs numerous hijinks to avoid the 10-minute Lambda processing timeout. Rather than structuring the work into many small, independent batches - or simply using a modest container to do the job in a single shot - a myriad of intermediate steps is introduced to write data to Dynamo/S3/Kinesis + SQS and coordinate them.

A dynamically provisioned, serverless container with 24 cores and 64 GB of memory can happily process GBs of data transformations.
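
To illustrate the simpler shape, here's a minimal sketch (plain Python, hypothetical helper names) of running a large transformation as small, independently retryable chunks inside one long-running container - no intermediate Dynamo/S3/SQS hop per step, no function timeout to dodge:

```python
def process_in_chunks(records, transform, chunk_size=10_000, retries=2):
    """Run a big transformation as independent, individually retryable
    chunks inside a single process. A failed chunk is retried on its
    own instead of restarting (or choreographing) the whole job."""
    results = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        for attempt in range(retries + 1):
            try:
                # Build the whole chunk before committing it, so a
                # mid-chunk failure never leaves partial output behind.
                transformed = [transform(r) for r in chunk]
            except Exception:
                if attempt == retries:
                    raise  # chunk failed even after retries: surface it
                continue   # transient failure: retry this chunk only
            results.extend(transformed)
            break
    return results
```

`transform` here stands in for whatever per-record work the pipeline does; in the serverless version, each chunk boundary would instead be an S3 write, an SQS message, and another cold start.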

Fully agree with this. I find the cost of cloud providers is mostly driven by architecture. If you're cost conscious, cloud architectures need to be designed up front with this in mind.

Microservices are a cost killer. For each microservice pod you're often running a bunch of sidecars - Datadog, auth, ingress - and you pay a massive workload-separation overhead in orchestration, management, monitoring, and of course complexity.

I am just flabbergasted that this is how we operate as a norm in our industry.

It's about fitting your utilization to the model that best serves you.

If you can keep 4 "Java boxes" fed with work 80%+ of the time, then sure EC2 is a good fit.

We do a lot of batch processing and save money over having EC2 boxes always on. Sure we could probably pinch some more pennies if we managed the EC2 box uptime and figured out mechanisms for load balancing the batches... But that's engineering time we just don't really care to spend when ECS nets us most of the savings advantage and is simple to reason about and use.

Agreed. There is a wide price difference between running a managed AWS or Azure MySQL service and running MySQL on a VM that you spin up in AWS or Azure.

> your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.

You don’t need colocation to save 4x, though. Bandwidth pricing is 10x. EC2 is 2-4x, especially outside the US. EBS, given its IOPS pricing, is just bad.