> The markup cost of using RDS (or any managed database) is worth it.
Every so often I price out RDS to replace our colocated SQL Server cluster and it's so unrealistically expensive that I just have to laugh. It's absurdly far beyond what I'd be willing to pay. The markup is enough to pay for the colocation rack, the AWS Direct Connects, the servers, the SAN, the SQL Server licenses, the maintenance contracts, and a full-time in-house DBA.
Once you get past the point where the markup can pay for one or more full-time employees, I think you should consider doing that instead of blindly paying more and more to scale RDS up. You're REALLY paying for it with RDS. At least re-evaluate the choices you made as a fledgling startup once you reach the scale where you're paying AWS "full time engineer" amounts of money.
Some orgs are looking at moving back to on-prem because they're figuring this out. For a while it was in vogue to shift capex to opex, and C-suite people were incentivized to do that via comp structures, hence "digital transformation", i.e. migration to public cloud infrastructure. Now those same orgs are realizing that renting computers actually costs more than owning them when you're utilizing them to a significant degree.
I was once part of an acquisition from a much larger corporate entity. The new parent company was in the middle of a huge cloud migration, and as part of our integration into their org, we were required to migrate our services to the cloud.
Our calculations said it would cost 3x as much to run our infra on the cloud.
We pushed back, and were greenlit on creating a hybrid architecture that allowed us to launch machines both on-prem and in the cloud (via a direct link to the cloud datacenter). This gave us the benefit of autoscaling our volatile services, while maintaining our predictable services on the cheap.
After I left, apparently my former team was strong-armed into migrating everything to the cloud.
A few years go by, and guess who reaches out on LinkedIn?
The parent org was curious how we built the hybrid infra, and wanted us to come back to do it again.
Context: I build internal tools and platforms. Traffic on them varies, but some of them are quite active.
My nasty little secret is for single server databases I have zero fear of over provisioning disk iops and running it on SQLite or making a single RDBMS server in a container. I've never actually run into an issue with this. It surprises me the number of internal tools I see that depend on large RDS installations that have piddly requirements.
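For what it's worth, the SQLite version of that setup really is just a few lines. A minimal sketch (the table, pragmas, and filename are all made up for illustration):

```python
import sqlite3

# Illustrative only: a tiny persistence layer for a low-traffic internal tool.
# WAL mode lets readers and a single writer coexist, which is usually plenty
# for dashboards and back-office apps.
conn = sqlite3.connect("internal_tool.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute("""
    CREATE TABLE IF NOT EXISTS audit_log (
        id INTEGER PRIMARY KEY,
        actor TEXT NOT NULL,
        action TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("INSERT INTO audit_log (actor, action) VALUES (?, ?)", ("alice", "login"))
conn.commit()
print(conn.execute("SELECT count(*) FROM audit_log").fetchone()[0])
```

Back it up with a cron job that copies the file (or the sqlite3 CLI's `.backup` command) and you've covered most internal-tool needs.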
That’s made possible by orchestration platforms such as Kubernetes becoming standardized; you can get pretty close to a cloud experience while keeping all your infrastructure on-premise.
Same experience here. As a small organization, the quotes we got from cloud providers have always been prohibitively expensive compared to running things locally, even when we accounted for geographical redundancy, generous labor costs, etc. Plus, we get to keep know how and avoid lock-in, which are extremely important things in the long term.
Besides, running things locally can be refreshingly simple if you are just starting something and you don't need tons of extra stuff, which becomes accidental complexity between you, the problem, and a solution. This old post described that point quite well by comparing Unix to Taco Bell: https://news.ycombinator.com/item?id=10829512.
I am sure for some use-cases cloud services might be worth it, especially if you are a large organization and you get huge discounts. But I see lots of business types blindly advocating for clouds, without understanding costs and technical tradeoffs. Fortunately, the trend seems to be plateauing. I see an increasing demand for people with HPC, DB administration, and sysadmin skills.
It's not an either/or. Many businesses both own and rent things.
If price is the only factor, your business model (or executives' decision-making) is questionable. Buy only the cheapest shit, spend your time building your own office chair rather than talking to a customer, you aren't making a premium product, and that means you're not differentiated.
RDS pricing is deranged at the scales I've seen too.
$60k/year for something I could run on just a slice of one of my on-prem $20k servers. This is something we would have run 10s of. $600k/year operational against sub-$100k capital cost pays DBAs, backups, etc with money to spare.
Sure, maybe if you are some sort of SaaS with a need for a small single DB, that also needs to be resilient, backed up, rock solid bulletproof.. it makes sense? But how many cases are there of this? If it's so fundamental to your product and needs such uptime & redundancy, what are the odds it's also reasonably small?
> Sure, maybe if you are some sort of SaaS with a need for a small single DB, that also needs to be resilient, backed up, rock solid bulletproof.. it makes sense? But how many cases are there of this?
Most software startups these days? The blog post is about work done at a startup, after all. By the time your DB is big enough to cost an unreasonable amount on RDS, you're likely a big enough team to have options. If you're a small startup, saving a couple hundred bucks a month by self-managing your database is rarely a good choice. There are more valuable things to work on.
I have a small MySQL database that’s rather important, and RDS was a complete failure.
It would have cost a negligible amount. But the sheer amount of time I wasted before I gave up was honestly quite surprising. Let’s see:
- I wanted one simple extension. I could have compromised on this, but getting it to work on RDS was a nonstarter.
- I wanted RDS to _import the data_. Nope, RDS isn’t “SUPER,” so it rejects a bunch of stuff that mysqldump emits. Hacking around it with sed (something like the filter sketched below) was not confidence-inspiring.
- The database uses GTIDs and needed to maintain replication to a non-AWS system. RDS nominally supports GTID, but the documented way to enable it at import time strongly suggests that whoever wrote the docs doesn’t actually understand the purpose of GTID, and it wasn’t clear that RDS could do it right. At least Azure’s docs suggested that I could have written code to target some strange APIs to program the thing correctly.
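For context, the sed hack in question is usually about stripping the statements a SUPER-less RDS user can't run. A rough Python equivalent (the patterns are the usual suspects, not necessarily exhaustive, and, as noted above, dropping GTID_PURGED is exactly the part that makes GTID-based replication setup feel dubious):

```python
import re
import sys

# Hypothetical stand-in for the sed hack: filter a mysqldump stream so a
# SUPER-less RDS user can import it. Exact patterns depend on dump options.
GTID_OR_BINLOG = re.compile(r"SET\s+@@(GLOBAL\.GTID_PURGED|SESSION\.SQL_LOG_BIN)", re.I)
DEFINER = re.compile(r"DEFINER=`[^`]+`@`[^`]+`\s*")

for line in sys.stdin:
    if GTID_OR_BINLOG.search(line):
        continue                                  # drop the whole statement
    sys.stdout.write(DEFINER.sub("", line))       # strip DEFINER clauses in place
```

Used roughly as `mysqldump --single-transaction appdb | python filter_dump.py | mysql -h <rds-endpoint> appdb`.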
Time wasted: a surprising number of hours. I’d rather give someone a bit of money to manage the thing, but it’s still on a combination of plain cloud servers and bare metal. Oh well.
> Sure, maybe if you are some sort of SaaS with a need for a small single DB, that also needs to be resilient, backed up, rock solid bulletproof.. it makes sense? But how many cases are there of this?
Very small businesses with phone apps or web apps are often using it. There are cheaper options of course, but when there is no "prem" and there are 1-5 employees then it doesn't make much sense to hire for infra. You outsource all digital work to an agency who sets you up a cloud account so you have ownership, but they do all software dev and infra work.
> If it's so fundamental to your product and needs such uptime & redundancy, what are the odds it's also reasonably small?
Small businesses again. Some of my clients could probably run off a Pentium 4 from 2008, but due to the nature of the org and the agency engagement it often needs to live in the cloud somewhere.
I am constantly beating the drum to reduce costs and use as little infra as needed though, so in a sense I agree, but the engagement is what it is.
Additionally, everyone wants to believe they will need to hyperscale, so even medium-scale businesses over-provision, and some agencies are happy to do that for them as they profit off the margin.
Lots of cases. It doesn't even have to be a tiny database. Within <1TB range there's a huge number of online companies that don't need to do more than hundreds of queries per second, but need the reliability and quick failover that RDS gives them. The $600k cost is absurd indeed, but it's not the range of what those companies spend.
Also, Aurora gives you the block level cluster that you can't deploy on your own - it's way easier to work with than the usual replication.
RDS is not as bulletproof as advertised, and the support is first arrogant, then (maybe) helpful.
People pay for RDS because they want to believe in a fairy tale that it will keep potential problems away and that it worked well for other customers. But those mythical other customers also paid based on such belief. Plus, no one wants to admit that they pay money in such an irrational way.
It's a bubble
> $600k/year operational against sub-$100k capital cost pays DBAs, backups, etc with money to spare.
One of these is not like the others (DBAs are not capex.)
Have you ever considered that if a company can get the same result for the same price ($100K opex for RDS vs same for human DBA), it actually makes much more sense to go the route that takes the human out of the loop?
The human shows up hungover, goes crazy, gropes Stacy from HR, etc.
That's a huge instance with an enterprise license on top. Most large SaaS companies can run off of $5k / m or cheaper RDS deployments which isn't enough to pay someone. The amount of people running half a million a year RDS bills might not be that large. For most people RDS is worth it as soon as you have backup requirements and would have to implement them yourself.
> Most large SaaS companies can run off of $5k / m or cheaper RDS
Hard disagree. An r6i.12xl Multi-AZ with 7500 IOPS / 500 GiB io1 books at $10K/month on its own. Add a read replica, even Single-AZ at a smaller size, and you’re adding half that again. And this is without the infra required to run a load balancer / connection pooler.
I don’t know what your definition of “large” is, but the described would be adequate at best at the ~100K QPS level.
RDS is expensive as hell, because they know most people don’t want to take the time to read docs and understand how to implement a solid backup strategy. That, and they’ve somehow convinced everyone that you don’t have to tune RDS.
Definitely--I recommend this after you've reached the point where you're writing huge checks to AWS. Maybe this is just assumed but I've never seen anyone else add that nuance to the "just use RDS" advice. It's always just "RDS is worth it" full stop, as in this article.
>Most large SaaS companies can run off of $5k / m or cheaper RDS deployments which isn't enough to pay someone.
After initial setup, managing the equivalent of a $5k/month RDS deployment is not a full-time job. If you add to this that wages differ a lot around the world, $5k can take you very, very far in terms of paying someone.
Discount rates are actually much better on the bigger instances too, so the "sticker price" that people compare on the public site is nowhere close to a fair comparison.
We technically aren't supposed to talk about pricing publicly, but I'm just going to say that we run a few 8XL and 12XL RDS instances and we pay ~40% off the sticker price.
If you switch to the Aurora engine the pricing is absurdly complex (it's basically impossible to determine without a simulation calculator), but AWS is even more aggressive with discounting on Aurora, not to mention there are some legitimately amazing feature benefits from switching.
I'm still in agreement that you could do it cheaper yourself at a data center. But there are some serious tradeoffs in doing it that way. One is complexity, and it certainly requires several new hiring decisions. Those have their own tangible costs, but there are a huge number of intangible costs as well: pure inconvenience, more people management, more hiring, split expertise, complexity to network systems, reduced elasticity of decisions, longer commitments, etc. It's harder to put a price on that.
When you account for the discounts at this scale, I think the cost gap between the two solutions is much smaller and these inconveniences and complexities by rolling it yourself are sometimes worth bridging that smaller gap in cost in order to gain those efficiencies.
This is because you are using SQL Server. Microsoft has intentionally made cloud pricing for SQL server prohibitively expensive for non-Azure cloud workloads by requiring per-core licensing that is extremely punitive for the way EC2 and RDS is architected. This has the effect of making RDS vastly more expensive than running the same workload on bare metal or Azure.
Frankly, this is anti-competitive, and the FTC should look into it, however, Microsoft has been anti-competitive and customer hostile for decades, so if you're still using their products, you must have accepted the abuse already.
You don't get the higher end machines on AWS unless you're a big guy. We have Epyc 9684X on-prem. Cannot match that at the price on AWS. That's just about making the choices. Most companies are not DB-primary.
I think most people who’ve never experienced native NVMe for a DB are also unaware of just how blindingly fast it is. Even io2 Block Express isn’t the same.
Elsewhere today I recommended RDS, but was thinking of small startup cases that may lack infrastructure chops.
But you are totally right that it can be expensive. I worked with a startup that had some inefficient queries; normally that wouldn't matter much, but with RDS it cost $3,000 a month for a tiny user base and not that much data (millions of rows at most).
Also, it is often overlooked that you still need skilled people to run RDS. It's certainly not "2-clicks and forget" and "you don't need to pay anyone running your DB".
I haven't run a Postgres instance with proper backup and restore, but it doesn't seem like rocket science using barman or pgbackrest.
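Agreed. Even without barman/pgbackrest, the basic "take a backup and prove it restores" loop is small. A hypothetical sketch with plain pg_dump/pg_restore (hosts, users, and paths are placeholders; a real setup would add retention, off-site copies, and ideally physical backups with PITR via one of those tools):

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical nightly job: logical backup plus a restore test into a scratch
# database, so the backup is known to be usable. All connection details are
# placeholders.
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
dump_path = f"/backups/appdb-{stamp}.dump"
scratch_db = f"restore_test_{stamp}"

# Custom-format dump, suitable for pg_restore.
subprocess.run(["pg_dump", "-Fc", "-h", "db.internal", "-U", "backup",
                "-f", dump_path, "appdb"], check=True)

# Restore into a throwaway database to verify the dump, then clean up.
subprocess.run(["createdb", "-h", "db.internal", "-U", "backup", scratch_db], check=True)
subprocess.run(["pg_restore", "-h", "db.internal", "-U", "backup",
                "-d", scratch_db, "--no-owner", dump_path], check=True)
subprocess.run(["dropdb", "-h", "db.internal", "-U", "backup", scratch_db], check=True)
```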
Data isn't cheap and never was. Paying licensing fees on top makes it more expensive. It really depends on the circumstance: a managed database usually has extended support from the company providing it. You have to weigh your team's expertise to manage a solution on your own and ensure you spend ample time making it resilient. The other half is the cost of upgrading hardware; sometimes it is better to just pay a cloud provider if your business does not have enough income to buy hardware outright. There is always an upfront cost.
For small databases or test-environment databases you can also leverage Kubernetes and host an operator for that tiny DB. When it comes to serious data that needs a beeline recovery strategy: RDS.
Really it should be a mix: self-hosted for the things you aren't afraid to break, hosted for the things you put at high risk.
> Data is the most critical part of your infrastructure. You lose your network: that’s downtime. You lose your data: that’s a company ending event. The markup cost of using RDS (or any managed database) is worth it.
You need well-run, regularly tested, air gapped or otherwise immutable backups of your DB (and other critical biz data). Even if RDS was perfect, it still doesn't protect you from the things that backups protect you from.
After you have backups, the idea of paying enormous amounts for RDS in order to keep your company from ending is more far fetched.
I agree that RDS is stupidly expensive and not worth it provided that the company actually hires at least 2x full-time database owners who monitor, configure, scale and back up databases. Most startups will just save the money and let developers "own" their own databases or "be responsible for" uptime and backups.
Even for small workloads it's a difficult choice. I ran a small but vital db, and RDS was costing us like 60 bucks a month per env. That's 240/month/app.
DynamoDB as a replacement, pay per request, was essentially free.
I found Dynamo foreign and rather ugly to code for initially, but am happy with the performance and especially price at the end.
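The pay-per-request part is a one-line setting at table creation. A hedged boto3 sketch (table and key names are made up):

```python
import boto3

# On-demand (pay-per-request) billing: no provisioned capacity to pay for
# while the table sits idle, which is what makes small per-env workloads
# nearly free. Names below are illustrative.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")
dynamodb.create_table(
    TableName="app-events-dev",
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```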
For big companies such as banks this cost comparison is not as straight forward. They have whole data centres just sitting there for disaster recovery. They periodically do switchovers to test DR. All of this expense goes away when they migrate to cloud.
From what I’ve read, a common model for MMORPG companies is to use on-prem or colocated servers as their primary and then provision a cloud service for backup or overage.
Seems like a solid cost effective approach for when a company reaches a certain scale.
Lots of companies, like Grinding Gear Games and Square Enix, just rent whole servers for a tiny fraction of the price compared to what the price-gouging cloud providers would charge for the same resources. They get the best of both worlds. They can scale up their infrastructure in hours or even minutes, and they can move to any other commodity hardware in any other datacenter at the drop of a hat if they get screwed on pricing. Migrating from one server provider (such as IBM) to another (such as Hetzner) can take an experienced team 1-2 weeks at most. Given that pricing updates are usually given 1-3 quarters ahead at a minimum, they have massive leverage over their providers because they can so easily switch. Meanwhile, if AWS decides to jack up their prices, well, you're pretty much screwed in the short term if you designed around their cloud services.
I know this is an unpopular opinion but I think google cloud is amazing compared to AWS. I use google cloud run and it works like a dream. I have never found an easier way to get a docker container running in the cloud. The services all have sensible names, there are fewer more important services compared to the mess of AWS services, and the UI is more intuitive. The only downside I have found is the lack of community resulting in fewer tutorials, difficulty finding experienced hires, and fewer third party tools. I recommend trying it. I'd love to get the user base to an even dozen.
The reasoning the author cites is that AWS has more responsive customer service, and maybe I am missing out, but it would never even occur to me to speak to someone from a cloud provider. They mention having "regular cadence meetings with our AWS account manager" and I am not sure what could be discussed. I must be doing simpler stuff.
> "regular cadence meetings with our AWS account manager" and I am not sure what could be discusse.
Having been on a number of those calls: it's just a bunch of crap where they talk like a scripted bot reading from a corporate buzzword bingo card over a slideshow. Their real intention is twofold: to sell you even more AWS complexity/services, and to provide "value" to their point of contact (which is a person working in your company).
We're paying north of 500K per year in AWS support (which is highway robbery), and in return you get a "team" of people supposedly dedicated to you, which sounds good in theory, but in reality you get a labyrinth of irresponsibility, stalling and frustration.
So even when you want to reach out to that team, you first have to go through L1 support, which I'm sure will be replaced by bots soon (and no value will be lost) and which is useful in 1 out of 10 cases. Then, if you're not satisfied with L1's answer(s), you try to escalate to your "dedicated" support team, and they schedule a call three days out, or if that falls around Friday, that means Monday, etc.
Their goal is to stall so that you figure out and fix stuff on your own, shielding their better-quality teams. No wonder our top engineers just abandoned all AWS communication, and in cases where it's unavoidable they delegate it to junior people who still think they are getting something in return.
> We're paying north of 500K per year in AWS support (which is highway robbery), and in return you get a "team" of people supposedly dedicated to you, which sounds good in theory, but in reality you get a labyrinth of irresponsibility, stalling and frustration.
I’ve found a lot of the time the issues we run into are self-inflicted. When we call support for these, they have to reverse-engineer everything which takes time.
However when we can pinpoint the issue to AWS services, it has been really helpful to have them on the horn to confirm & help us come up with a fix/workaround. These issues come up more rarely, but are extremely frustrating. Support is almost mandated in these cases.
It’s worth mentioning that we operate at a scale where the support cost is a non-issue compared to overall engineering costs. There’s a balance, and we have an internal structure that catches most of the first type of issue nowadays.
In my experience all questions I've had for AWS were:
1. Their bugs, which won't be fixed in near future anyway.
2. Their transient failures, that will be fixed anyway soon.
So there's zero value in ever contacting AWS support.
We are a reasonably large AWS customer and our account manager sends out regular emails with NDA information on what's coming up, we have regular meetings with them about things as wide ranging as database tuning and code development/deployment governance.
They often provide that consulting for free, and we know their biases. There's nothing hidden about the fact that they will push us to use AWS services.
On the other hand, they will also help us optimize those services and save money that is directly measurable.
GCP might have a better API and better "naming" of their services, but the breadth of AWS services, the incorporation of IAM across their services, governance and automation all make it worthwhile.
Cloud has come a long way from "it's so easy to spin up a VM/container/lambda".
> There's nothing hidden about the fact that they will push us to use AWS services.
Our account team don't even do that. We use a lot of AWS anyway and they know it, so they're happy to help with competitor offerings and integrating with our existing stack. Their main push on us has been to not waste money.
In a previous role I got all of these things from GCP – they ran training for us, gave us early access to some alpha/beta stage products (under NDA), we got direct onboarding from engineers on those, they gave us consulting level support on some things and offered much more of it than we took up.
I don’t have as much experience with AWS, but I do hate GCP. The UI is slow and buggy. The way they want things to authenticate is half-baked and only implemented in some libraries, and it isn’t always clear which library supports it. The gcloud command line tool regularly just doesn’t work; it hangs and never times out, forcing you to kill it manually, wondering if it did anything and whether you’ll mess something up running it again. The way they update client libraries by running code generation means there are tons of commits that aren’t relevant to the library you’re actually using. Features are not available across all client libraries. Documentation contradicts itself or contradicts support recommendations. Core services like BigQuery lack any emulator or Docker image to facilitate CI or testing without having to set up a separate project you have to pay for.
Oh, friend, you have not known UI pain until you've used portal.azure.com. That piece of junk requires actual page reloads to make any changes show up. That Refresh button is just like the close-door elevator button: it's there for you to blow off steam, but it for damn sure does not DO anything. I have boundless screenshots showing when their own UI actually pops up a dialog saying "ok, I did what you asked but it's not going to show up in the console for 10 minutes so check back later". If you forget to always reload the page, and accidentally click on something that it says exists but doesn't, you get the world's ugliest error message and only by squinting at it do you realize it's just the 404 page rendered as if the world has fallen over
I suspect the team that manages it was OKR-ed into using AJAX but come from a classic ASP background, so don't understand what all this "single page app" fad is all about and hope it blows over one day
Totally agree. GCP is far easier to work with and get things up and running on, for how my brain works, compared to AWS. Also, GCP names stuff in a way that tells me what it does; AWS names things like a teenage boy trying to be cool.
That's completely opposite to my experience. Do you have any examples of AWS naming that you think is "teenage boy trying to be cool"? I am genuinely curious.
I have had the experience of an AWS account manager helping me by getting something fixed (working at a big client). But more commonly, I think the account manager’s job at AWS or any cloud or SAAS is to create a reality distortion field and distract you from how much they are charging you.
> I think the account manager’s job at AWS or any cloud or SAAS is to create a reality distortion field and distract you from how much they are charging you.
Maybe your TAM is different, but ours regularly does presentations about cost breakdowns, future planning and possible reservations. There's nothing distracting there.
AWS enterprise support (basically first-line support that you pay for) is actually really, really good. They will look at your metrics/logs and share solid insights with you. For anything more, you can talk to a TAM, who can then reach out to the relevant engineering teams.
Heartily seconded. Also don't forget the docs: Google Cloud docs are generally fairly sane and often even useful, whereas my stomach churns whenever I have to dive into AWS's labyrinth of semi-outdated, nigh-unreadable crap.
To be fair, there are lots of GCP docs, but I cannot say they are as good as AWS's. Everything is CLI-based, and some things are broken or hello-world-useless. It takes time to go through multiple duplicate articles to find anything decent. I have never had this issue with AWS.
GCP SDK docs must be mentioned separately, as they're bizarre auto-generated nonsense. Have you seen them? How can you even say that GCP docs are good after that?
We're relatively small GCP users (low six figures) and have monthly cadence meetings with our Google account manager. They're very accommodating, and will help with contacts, events and marketing.
Oh I disagree - we migrated from azure to AWS, and running a container on Fargate is significantly more work than Azure Container Apps [0]. Container Apps was basically "here's a container, now go".
GCP support is atrocious. I've worked at one of their largest clients and we literally had to get executives into the loop (on both sides) to get things done sometimes. Multiple times they broke some functionality we depended on (one time they fixed it weeks later except it was still broken) or gave us bad advice that cost a lot of money (which they at least refunded if we did all the paperwork to document it). It was so bad that my team viewed even contacting GCP as an impediment and distraction to actually solving a problem they caused.
I also worked at a smaller company using GCP. GCP refused to do a small quota increase (which AWS just does via a web form) unless I got on a call with my sales representative and listened to a 30 minute upsell pitch.
> I’ve had technicians at both GCP and Azure debug code and spend hours on developing services.
Almost every time Google pulled in a specialist engineer working on a service/product we had issues with it was very very clear the engineer had no desire to be on that call or to help us. In other words they'd get no benefit from helping us and it was taking away from things that would help their career at Google. Sometimes they didn't even show up to the first call and only did to the second after an escalation up the management chain.
GCP's SDK and documentation is a mess compared to AWS. And looking at the source code I don't see how it can get better any time soon. AWS seems to have proper design in mind and uses less abstractions giving you freedom to build what you need. AWS CDK is great for IAC.
The only weird part I experienced with AWS is their SNS API. Maybe due to legacy reasons, but what a bizarre mess when you try doing it cross-account. This one is odd.
I have been trying GCP for a while and the DevX was horrible. The only part that more-or-less works is the CLI, but the naming there is inconsistent and not as well done as in AWS. But it's relative and subjective, so I guess someone likes it. I have experienced GCP official guides that are broken, untested, or utterly braindead hello-world-useless. And they are numerous and spread out, so it takes time to find anything decent.
No dark mode is an extra punch. Seriously. Tried to make it myself with an extension, but their page is an Angular hell of millions of nested divs. No thank you.
And since you mentioned Cloud Run -- it takes 3 seconds to deploy a Lambda version in AWS and a minute or more for a GCP Cloud Function.
The author leads infrastructure at Cresta. Cresta is a customer service automation company. His first point is about how happy he is to have picked AWS and their human-based customer service, versus Google's robot-based customer service.
I'm not saying there's anything wrong, and I'm oversimplifying a bit, but I still find this amusing.
Haha very good catch. I prefer GCP but I will admit any day of the week that their support is bad. Makes sense that they would value good support highly.
We used to use AWS and GCP at my previous company. GCP support was fine, and I never saw anything from AWS support that GCP didn't also do. I've heard horror stories about both, including some security support horror stories from AWS that are quite troubling.
Utter insanity. So much cost and complexity, and for what? Startups don’t think about costs or runway anymore, all they care about is “modern infrastructure”.
The argument for RDS seems to be “we can’t automate backups”. What on earth?
I see this argument a lot. Then most startups use that time to create rushed half-assed features instead of spending a week on their db that'll end up saving hundreds of thousands of dollars. Forever.
All that infra doesn’t integrate itself. Everywhere I’ve worked that had this kind of stack employed at least one, if not a team of, DevOps people to maintain it all, full time, year round. Automating a database backup and testing that it works takes half a day unless you’re doing something weird.
> The argument for RDS seems to be “we can’t automate backups”. What on earth?
I can automate backups, and I'm extremely happy that, for some extra cost in RDS, I don't have to do that.
Also, at some size automating the database backup becomes non-trivial. I mean, I can manage a replica (which needs to be updated at specific times after the writer), then regularly stop replication for a snapshot, which is then encrypted, shipped to storage, then manage the lifecycle of that storage, then setup monitoring for all of that, then... Or I can set one parameter on the Aurora cluster and have all of that happen automatically.
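To make the "one parameter" point concrete, one plausible reading of it: automated, point-in-time-restorable backups on an Aurora cluster hang off the backup retention setting. A hedged boto3 sketch (the cluster name is made up):

```python
import boto3

# Hedged sketch: turn the retention knob on an existing Aurora cluster and
# let automated backups / point-in-time restore happen on their own.
rds = boto3.client("rds", region_name="us-east-1")
rds.modify_db_cluster(
    DBClusterIdentifier="prod-aurora-cluster",   # placeholder cluster name
    BackupRetentionPeriod=14,                    # days of automated backups
    ApplyImmediately=True,
)
```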
The argument for RDS (and other services along those lines) is "we can't do it as good, for less".
And, when factoring in all costs and considering all things the service takes care of, it seems like a reasonable assumption that in a free market a team that specializes in optimizing this entire operation will sell you a db service at a better net rate than you would be able to achieve on your own.
Which might still turn out to be false, but I don't think it's obvious why.
I agree but also I'm not entirely sure how much of this is avoidable. Even the most simple web applications are full of what feels like needless complexity, but I think actually a lot of it is surprisingly essential. That said, there is definitely a huge amount of "I'm using this because I'm told that we should" over "I'm using this because we actually need it"
Everyone who says they can run a database better than Amazon is probably lying, or has a story about how they had to miss a family event because of an outage.
The point isn’t that you can’t do it; the point is that it’s less work for extremely high standards. It is not easy to configure multi-region failover without an entire network team and database team, unless you don’t give a shit about it actually working. Oh yeah, and wait until you see how much SOC 2 costs if you roll your own database.
One doesn’t necessarily need to run a DB better than Amazon, just sufficiently well for the product/service you’re working on. And depending on specifics it may cost much less (but your mileage may vary).
My contrarian view is that EC2 + ASG is so pleasant to use. It’s just conceptually simple: I launch an image into an ASG, and configure my autoscale policies. There are very few things to worry about. On the other hand, using k8s has always been a big deal. We built a whole team to manage k8s. We introduce dozens of concepts of k8s or spend person-years on “platform engineering” to hide k8s concepts. We publish guidelines and sdks and all kinds of validators so people can use k8s “properly”. And we still write 10s of thousands lines of YAML plus 10s of thousands of code to implement an operator. Sometimes I wonder if k8s is too intrusive.
K8S is a disastrous complexity bomb. You need millions upon millions of lines of code just to build a usable platform. Securing Kubernetes is a nightmare. And lock-in never really went away because it's all coupled with cloud specific stuff anyway.
Many of the core concepts of Kubernetes should be taken to build a new alternative without all the footguns. Security should be baked in, not an afterthought when you need ISO/PCI/whatever.
> K8S is a disastrous complexity bomb. You need millions upon millions of lines of code just to build a usable platform.
I don't know what you have been doing with Kubernetes, but I run a few web apps out of my own Kubernetes cluster and the full extent of my lines of code are the two dozen or so LoC kustomize scripts I use to run each app.
kubeadm + fabric + helm got me 99% of the way there. My direct report, a junior engineer, wrote the entire helm chart from our docker-compose. It will not entirely replace our remote environment but it is nice to have something in between our SDK and remote deployed infra. Not sure what you meant by security; could you elaborate? I just needed to expose one port to the public internet.
To me, it sounds like your company went through a complex re-architecting exercise at the same time you moved to Kubernetes, and your problems have more to do with your (probably flawed) migration strategy than with the tool.
Lifting and shifting an "EC2 + ASG" set-up to Kubernetes is a straightforward process unless your app is doing something very non-standard. It maps to a Deployment in most cases.
The fact that you even implemented an operator (a very advanced use-case in Kubernetes) strongly suggests to me that you're doing way more than just lifting and shifting your existing set-up. Is it a surprise then that you're seeing so much more complexity?
> My contrarian view is that EC2 + ASG is so pleasant to use.
Sometimes I think that managed kubernetes services like EKS are the epitome of "give the customers what they want", even when it makes absolutely no sense at all.
Kubernetes is about stitching together COTS hardware to turn it into a cluster where you can deploy applications. If you do not need to stitch together COTS hardware, you have far better tools available to get your app running. You don't need to know or care which node your app is supposed to run on (or not run on), what your ingress controller is, whether you need to evict nodes, etc. You have container images, you want to run containers out of them, you want them to scale a certain way, and so on.
I tend to agree that for most things on AWS, EC2 + ASG is superior. It's very polished. EKS is very bare bones. I would probably go so far as to just run Kubernetes on EC2 if I had to go that route.
But in general k8s provides incredibly solid abstractions for building portable, rigorously available services. Nothing quite compares. It's felt very stable over the past few years.
Sure, EC2 is incredibly stable, but I don't always do business on Amazon.
At first I thought your "in general" statement was contradicting your preference for EC2 + ASG. I guess AWS is such a large part of my world that "in general" includes AWS instead of meaning everything but AWS.
So by and large I agree with the things in this article. It's interesting that the points I disagree with the author on are all SaaS products:
> Moving off JIRA onto linear
I don't get the hype. Linear is fine and all but I constantly find things I either can't or don't know how to do. How do I make different ticket types with different sets of fields? No clue.
> Not using Terraform Cloud No Regrets
I generally recommend Terraform Cloud - otherwise it's easy to grow your own in-house system that works fine for a few years and gradually ends up costing you in the long run.
> GitHub actions for CI/CD Endorse-ish
Use Gitlab
> Datadog Regret
Strong disagree - it's easily the best monitoring/observability tool on the market by a wide margin.
Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
> Pagerduty Endorse
Pagerduty charges like 10x what Opsgenie does and offers no better functionality.
When I had a contract renewal with Pagerduty I asked the sales rep what features they had that Opsgenie didn't.
He told me they're positioning themselves as the high end brand in the market.
Cool so I'm okay going generic brand for my incident reporting.
Every CFO should use this as a litmus test to understand if their CTO is financially prudent IMO.
> Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
I loved Datadog 10 years ago when I joined a company that already used it where I never once had to think about pricing. It was at the top of my list when evaluating monitoring tools for my company last year, until I got to the costs. The pricing page itself made my head swim. I just couldn’t get behind subscribing to something with pricing that felt designed to be impossible to reason about, even if the software is best in class.
> Datadog makes it far too easy to misconfigure things and blow up costs
I'll give you a fun example. It's fresh in my mind because i just got reamed out about it this week.
In our last contract with DataDog, they convinced us to try out the CloudSIEM product, so we put in a small $600/mo commitment to try it out. Well, we never really set it up and it sat on autopilot for many months. We fell under our contract rate for it for almost a year.
Then last month we had some crazy stuff happen and we were spamming logs into DataDog for a variety of reasons. I knew I didn't want to pay for these billions of logs to be indexed, so I made an exclusion filter to keep them out of our log indexes so we didn't have a crazy bill for log indexing.
So our rep emailed me last week and said "Hey, just a heads up, you have $6,500 in on-demand costs for CloudSIEM, I hope that was expected". No, it was NOT expected. Turns out excluding logs from indexing does not exclude them from CloudSIEM. Fun fact: you will not find any documented way to exclude logs from CloudSIEM ingestion. It is technically possible, but only through their API, and it isn't documented. Anyway, I didn't do or know this, so now I had $6,500 of on-demand costs plus $400-500 of miscellaneous on-demand costs that I had to explain to the CTO.
I should mention my annual review/pay raise is also next week (I report to the CTO), so this will now be fresh in their mind for that experience.
Their pricing setup is evil. Breaking out by SKUs and having 10+ SKUs is fine, trialing services with “spot” prices before committing to reserved capacity is also fine.
But (for some SKUs, at least) they make it really difficult to be confident that the reserved capacity you’re purchasing will cover your spot use cases. Then, they make you contact a sales rep to lower your reserved capacity.
It all feels designed to get you to pay the “spot” rate for as long as possible, and it’s not a good look.
I understand the pressures on their billing and sales teams that lead to these patterns, but they don’t align with their customers in the long term. I hope they clean up their act, because I agree they’re losing some set of customers over it.
Linear has a lot going for it. It doesn't support custom fields, so if that's a critical feature for you, I can see it falling short. In my experience though, custom fields just end up being a mess anytime a manager changes and decides to do things differently, things get moved around teams, etc.
- It's fast. It's wild that this is a selling point, but it's actually a huge deal. JIRA and so many other tools like it are as slow as molasses. Speed is honestly the biggest feature.
- It looks pretty. If your team is going to spend time there, this will end up affecting productivity.
- It has a decent degree of customization and an API. We've automated tickets moving across columns whenever something gets started, a PR is up for review, when a change is merged, when it's deployed to beta, and when it's deployed to prod. We've even built our own CLI tools so you can act on Linear without leaving your shell (a rough sketch of this kind of automation follows this list).
- It has a lot of keyboard shortcuts for power users.
- It's well featured. You get teams, triaging, sprints (cycles), backlog, project management, custom views that are shareable, roadmaps, etc...
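For the curious, the automation above boils down to small GraphQL calls. A rough sketch of moving an issue to another workflow state (the mutation shape and auth header are from memory and the IDs are placeholders -- check Linear's API docs before relying on this):

```python
import requests

# Hedged sketch: update a Linear issue's workflow state via the GraphQL API.
LINEAR_API = "https://api.linear.app/graphql"
API_KEY = "lin_api_..."            # placeholder personal API key
ISSUE_ID = "issue-uuid"            # placeholder issue id
IN_REVIEW_STATE_ID = "state-uuid"  # placeholder workflow state id

mutation = """
mutation MoveIssue($id: String!, $stateId: String!) {
  issueUpdate(id: $id, input: { stateId: $stateId }) { success }
}
"""
resp = requests.post(
    LINEAR_API,
    json={"query": mutation, "variables": {"id": ISSUE_ID, "stateId": IN_REVIEW_STATE_ID}},
    headers={"Authorization": API_KEY},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```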
OpsGenie’s cheapest plan is $9 per user per month but arbitrarily crippled; the plan anybody would want to use is $19 per user per month.
So instead of a factor of ten, it’s ten percent cheaper. And I just kind of expect Atlassian to suck.
Datadog is ridiculously expensive and on several occasions I’ve run into problems where an obvious cause for an incident was hidden by bad behavior of datadog.
Grafana OnCall can be self hosted for free or you can pay $20 a month, and still always have the option to migrate to self hosting if you want to save money
I just started building out on-call rotation scheduling to fit teams that already have an alerting solution and need simple automated scheduling. I’d love to get some feedback: https://majorpager.com
Datadog is a freaking beast. My wife works in Workday (a huge employee management system) and they have a very large number of tutorials, videos, "working hours" and other tools to ensure their customers are making the best use of it.
Datadog, on the other hand... their "DD University" is a shame, and we as paying customers are overwhelmed and without real guidance. DD should assign some time to integration for new customers, even if it is proportional to what you pay annually. (I think I pay around 6000 USD annually.)
In terms of Datadog - the per-host pricing on infrastructure in a k8s/microservices world is perhaps the most egregious pricing model across all Datadog services. Triply true if you use spot instances for short-lived workloads.
For folks running k8s at any sort of scale, I generally recommend aggregating metrics BEFORE sending them to Datadog, either at a per-deployment or per-cluster level. Individual host metrics tend to matter less once you have a large fleet.
You can use opensource tools like veneur (https://github.com/stripe/veneur) to do this. And if you don't want to set this up yourself, third party services like Nimbus (https://nimbus.dev/) can do this for you automatically (note that this is currently a preview feature). Disclaimer also that I'm the founder of Nimbus (we help companies cut datadog costs by over 60%) and have a dog in this fight.
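To make the aggregation point concrete: the application side barely changes. You keep emitting DogStatsD-format metrics, just to a local aggregator and tagged at the deployment level rather than per host. A minimal sketch (metric name, tags, and port assume the conventional local statsd setup):

```python
import socket

# Illustrative only: emit a DogStatsD-format counter over UDP to a local
# aggregator (e.g. veneur or the Datadog agent) on the conventional port.
# Tagging by service/deployment instead of host keeps cardinality down.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = "checkout.requests:1|c|#service:checkout,deployment:web,env:prod"
sock.sendto(payload.encode("utf-8"), ("127.0.0.1", 8125))
```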
I mostly agreed with OP's article, but you basically nailed all of the points of disagreement I did have.
Jira: It's overhyped and overpriced. Most people HATE Jira. I guess I don't care enough. I've never met a ticket system that I loved. Jira is fine. It's overly complex, sure. But once you set it up, you don't need to change it very often. I don't love it, I don't hate it. No one ever got fired for choosing Jira, so it gets chosen. Welcome to the tech industry.
Terraform Cloud: The gains for Terraform Cloud are minimal. We just use Gitlab for running Terraform pipelines and have a super nice custom solution that we enjoy. It wasn't that hard to do either. We maintain state files remotely in S3 with versioning for the rare cases when we need to restore a foobar'd statefile. Honestly I like having Terraform pipelines in the same place as the code and pipelines for other things.
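A hedged sketch of the S3 side of that setup -- versioning is what makes the "restore a foobar'd statefile" part a non-event (bucket name and key are placeholders; the Terraform backend itself is plain HCL pointed at the same bucket):

```python
import boto3

# Hypothetical: enable versioning on the bucket that holds Terraform state,
# then list old versions of a state file when something goes sideways.
s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="example-terraform-state",             # placeholder bucket
    VersioningConfiguration={"Status": "Enabled"},
)
versions = s3.list_object_versions(
    Bucket="example-terraform-state",
    Prefix="envs/prod/terraform.tfstate",         # placeholder state key
)
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["IsLatest"])
```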
GitHub Actions: Yeah switch to GitLab. I used to like Github Actions until I moved to a company with Gitlab and it is best in class, full stop. I could rave about Gitlab for hours. I will evangelize for Gitlab anywhere I go that is using anything else.
DataDog: As mentioned, DataDog is the best monitoring and observability solution out there. The only reason NOT to use it is the cost. It is absurdly expensive. Yes, truly expensive. I really hate how expensive it is. But luckily I work somewhere that lets us have it, and it's amazing.
Pagerduty: Agree, switch to OpsGenie. OpsGenie is considerably cheaper and does all the pager stuff of PagerDuty. All the stuff that PagerDuty tries to tack on top to justify its cost is stuff you don't need. OpsGenie does all the stuff you need. It's fine. Similar to Jira, it's not something anyone wants anyway. No one's going to love it; no one loves being on call. So just save money with OpsGenie. If you're going to fight for the "brand name" of something, fight for DataDog instead, not a cooler pager system.
I'm right there with you on Jira. The haters are wrong - it's a decent enough ticket system, no worse than anything else I've used. You can definitely torture Jira into something horrible, but that's not Jira's fault. Bad managers will ruin any ticket system if they have the customization tools to do so.
I'll be dead in the ground before I use TFC. 10 cents per resource per month my ass. We have around 100k~ resources at an early-stage startup I'm at, our AWS bill is $50~/mo and TFC wants to charge me $10k/mo for that? We can hire a senior dev to maintain an in-house tool full time for that much.
Agreed on PagerDuty
It doesn't really do a lot, administering it is fairly finicky, and most shops barely use half the functionality it has anyway.
To me its whole schedule interface is atrocious for its price, given that from an SRE/dev perspective that's literally its purpose - scheduled escalations.
Why GitLab? GitHub Actions are a mess, but GitLab's online CI/CD is not much better at all, and self-hosting it opens a whole different can of worms. At least with GitHub Actions you have a plugin ecosystem that makes the super janky underlying platform a bit more bearable.
I've found GitLab CI's "DAG of jobs" model has made maintenance and, crucially for us, optimisation relatively easy. Then I look into GitHub Actions and... where are the abstraction tools? How do I cache just part of my "workflow"? Plugins be damned. GitLab CI is so good that I'm willing to overlook vendor lock-in and YAML, and use it for our GitHub project even without proper integration. (Frankly the rest of GitLab seems to always be a couple features ahead, but no-one's willing to migrate.)
> Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
Datadog's cheapest pricing is $15/host/month. I believe that is based on the largest sustained peak usage you have.
We run spot instances on AWS for machine learning workflows. A lot of them if we're training, and none otherwise. Usually we're using zero. Using DataDog at its lowest price would basically double the cost of those instances.
That would totally be my preference if business users didn't want access.
Getting them to use Github/Gitlab is an argument I've never won. Typically it goes the other way and I end up needing to maintain a Monday or Airtable instance in addition to my ticketing system.
Interesting. Atlassian also just launched an integration with OpsGenie. I have the same opinion of JIRA. I've tried many competitors (not Linear so far) and regretted it every time.
I'm not sure they just launched anything. OpsGenie has been an Atlassian product for 5 or more years now. I've been using it for 3-4 myself and its been integrated with Jira the whole time.
In fact, OpsGenie has mostly been on Auto-pilot for a few years now.
I agree. I’m afraid I’m one of those 00s developers and can relate. Back then many startups were being launched on super simple stacks.
With all of that complexity/word salad from TFA, where’s the value delivered? Presumably there’s a product somewhere under all that infrastructure, but damn, what’s left to spend on it after all the infrastructure variable costs?
I get it’s a list of preferences, but still once you’ve got your selection that’s still a ton of crap to pay for and deal with.
Do we ever seek simplicity in software engineering products?
I think that far too many companies get sold on the vision of "it just works, you don't need to hire ops people to run the tools you need for your business". And that is true! And while you're starting, it may be that you can't afford to hire an ops guy and can't take the time to do it yourself. But it doesn't take that much scale before you get to the point it would be cheaper to just manage your own tools.
Cloud and SaaS tools are very seductive, but I think they're ultimately a trap. Keep your tools simple and just run them yourselves, it's not that hard.
Look, the thing is - most of infra decisions are made by devops/devs that have a vested interest in this.
Either because they only know how to manage AWS instances (it was the hotness and that's what all the blogs and YT videos were about) and are now terrified of losing their jobs if the companies switch stacks. Or because they needed to put the new thing on their CV so they remain employable. Or maybe because they had to get that promotion and bonus for doing hard things and migrating things. Or because they were pressured into it by bean counters, who were pressured by the geniuses of Wall Street to move capex to opex.
In any case, this isn't by necessity these days. This is because, for a massive amount of engineers, that's the only way they know how to do things and after the gold rush of high pay, there's not many engineers around that are in it to learn or do things better. It's for the paycheck.
It is what it is. The actual reality of engineering the products well doesn't come close to the work being done by the people carrying that fancy superstar engineer title.
You know the old adage "fast, cheap, good: pick two"? With startups, you're forced to pick fast. You're still probably not gonna make it, but if you don't build fast, you definitely won't.
For simplicity, software must be well built. Unfortunately, the software development practice is perpetually underskilled so we release buggy crap which we compensate for in infrastructure.
> Do we ever seek simplicity in software engineering products?
Doubtfully. Simplicity of work breakdown structure - maybe. Legibility for management layers, possibly. Structural integrity of your CYA armor? 100%.
The half-life of a software project is what now, a few years at most these days? Months, in webdev? Why build something that is robust, durable, efficient, make all the correct engineering choices, where you can instead race ahead with a series of "nobody ever got fired for using ${current hot cloud thing}" choices, not worrying at all about rapidly expanding pile of tech and organizational debt? If you push the repayment time far back enough, your project will likely be dead by then anyway (win), or acquired by a greater fool (BIG WIN) - either way, you're not cleaning up anything.
Nobody wants to stay attached to a project these days anyway.
There's an easy bent towards designing everything for scale. It's optimistic. It feels good. It's safe, defendable, and sound to argue that this complexity, cost, and deep dependency is warranted when your product is surely on the verge of changing the course of humanity.
The reality is your SaaS platform for ethically sourced, vegan dog food is below inconsequential, and the few users that you do have (and may positively affect) absolutely do not need this tower of abstraction to run.
We had FB up to 6 figures in servers and a billion MAUs (conservatively) before even tinkering with containers.
The “control plane” was ZooKeeper. Everything had bindings to it, Thrift/Protobuf goes in a znode fine. List of servers for FooService? znode.
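For readers who haven't seen znode-style service discovery, it really is about this simple on the client side. A hedged sketch with kazoo (a common Python ZooKeeper client; the paths, hosts, and payload format are invented for illustration, not what FB actually used):

```python
from kazoo.client import KazooClient

# Illustrative znode-based discovery: "List of servers for FooService? znode."
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")   # placeholder ensemble
zk.start()

for child in zk.get_children("/services/FooService"):
    data, _stat = zk.get(f"/services/FooService/{child}")
    print(child, data.decode("utf-8"))                 # e.g. "host:port" or a serialized blob

zk.stop()
```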
The packaging system was a little more complicated than a tarball, but it was spiritually a tarball.
Static link everything. Dependency hell: gone. Docker: redundant.
The deployment pipeline used hypershell to drop the packages and kick the processes over.
There were hundreds of services and dozens of clusters of them, but every single one was a service because it needed a different SKU (read: instance type), or needed to be in Java or C++, or some engineering reason. If it didn’t have a real reason, it goes in the monolith.
This was dramatically less painful than any of the two dozen server type shops I’ve consulted for using kube and shit. It’s not that I can’t use Kubernetes, I know the k9s shortcuts blindfolded. But it’s no fun. And pros built these deployments and did it well, serious Kubernetes people can do everything right and it’s complicated.
After 4 years of hundreds of elite SWEs and PEs (SRE) building a Borg-alike, we’d hit parity with the bash and ZK stuff. And it ultimately got to be a clear win.
But we had an engineering reason to use containers: we were on bare metal, containers can make a lot of sense on bare metal.
In a hyperscaler that has a zillion SKUs on-demand? Kubernetes/Docker/OCI/runc/blah is the friggin Bezos tax. You’re already virtualized!
Some of the new stuff is hot shit, I’m glad I don’t ssh into prod boxes anymore, let alone run a command on 10k at the same time. I’m glad there are good UIs for fleet management in the browser and TUI/CLI, and stuff like TailScale where mortals can do some network stuff without a guaranteed zero day. I’m glad there are layers on top of lock servers for service discovery now. There’s a lot to keep from the last ten years.
But this yo dawg I heard you like virtual containers in your virtual machines so you can virtualize while you virtualize shit is overdue for its CORBA/XML/microservice/many-many-many repos moment.
You want reproducibility. Statically link. Save Docker for a CI/CD SaaS or something.
You want pros handing the datacenter because pets are for petting: pay the EC2 markup.
You can’t take risks with customer data: RDS is a very sane place to splurge.
Half this stuff is awesome, let’s keep it. The other half is job security and AWS profits.
The funny thing is a lot of smaller startups are seeing just how absurdly expensive these services are, and are just switching back to basic bare-metal server hosting.
For 99% of businesses it's a wasteful, massive overkill expense. You don't NEED all the shiny tools they offer; they don't add anything to your business but cost. Unless you're a Netflix or an Apple who needs massive global content distribution and processing services, there's a good chance you're throwing money away.
I am a '10s developer/systems engineer and my eyes kept getting wider with each new technology on the list. I don't know if it's overkill or just the state of things right now.
There is no way one person can thoroughly understand so many complex pieces of technology. I have worked for 10 years, more or less, at this point, and I would only call myself confident on 5 technical products, maybe 10 if I am being generous to myself.
Not really, it's just like counting: awk, grep, sed, uniq, tail, etc.
"CloudOS" is in it's early days right now.
You need to be careful on what tool or library you pick.
No, not at all. Maybe baffled by the use of expensive cloud services instead of running on your own bare metal where the cost is in datacenter space and bandwidth. The loss of control coupled with the cost is baffling.
Reading this I couldn’t help but think: yeah all of these points make sense in isolation, but if you look at the big picture, this is an absurd level of complexity.
Why do we need entire teams making 1000s of micro decisions to deploy our app?
I’m hungry for a simpler way, and I doubt I’m alone in this.
You’re not alone. There is a constant undercurrent of pushback against this craziness. You see it all the time here on hacker news and with people I talk to irl.
That does not mean each of these things doesn't solve a problem. The issue, as always, is the complexity-utility tradeoff. Some of these things have too much complexity for too little utility. I'm not qualified to judge here, but if the suspects have Turing-complete YAML templates on their hands, it probably ties them to the crime scene.
The problem was: too much money, too few consequences for burning it.
The existence of the uber-wealthy means that markets can no longer function efficiently. Every market remains irrational longer than anyone who's not uber-wealthy can remain solvent.
I've climbed the mountain of learning the basics of Kubernetes / EKS, and I'm thinking we're going to switch to ECS. Kubernetes is way too complicated for our needs. It wants to be in control and is hard to direct with e.g. CloudFormation. Load balancers are provisioned from the add-on, making it hard to reference them outside Kubernetes. Logging on EKS Fargate to CloudWatch appears broken, despite following the docs. CPU/memory metrics don't work like they do on EKS EC2; it appears to require ADOT.
I recreated the environment in ECS in 1/10th the time and everything just worked.
I've been running ECS for about 5 years now. It has come a long way from a "lightweight" orchestration tool into something that's actually pretty impressive. The recent changes to the GUI are also helpful for people that don't have a ton of experience with orchestration.
We have moved off of it though; you may eventually need more features than it provides. Of course that journey always ends up in Kubernetes land, so you will eventually find your way back there.
Logging to Cloudwatch from kubernetes is good for one thing... audit logs. Cloudwatch in general is a shit product compared to even open source alternatives. For logging you really need to look at Fluentd or Kibana or DataDog or something along those lines. Trying to use Cloudwatch for logs is only going to end in sadness and pain.
GKE is a much better product to me still than EKS but at least in the last two years or so EKS has become a usable product. Back in like 2018 though? Hell no, avoid avoid avoid.
I started with ECS (because I wanted to avoid the complexity of K8s) and regret it. I feel I wasted a lot of time there.
In ECS, service updates would take 15 min or more (vs basically instant in K8s).
ECS has weird limits on how many containers you can run on one instance [0]. And in the network mode that lets you run more containers per host, the DNS is a mess (you need to look up SRV records to find out the port).
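For anyone who hasn't hit this: with dynamic host ports, the port lives in an SRV record rather than an A record, so clients have to resolve it themselves. A rough sketch in Go (the service name is made up):

    package main

    import (
        "fmt"
        "log"
        "net"
    )

    func main() {
        // When the container port is dynamic, service discovery publishes SRV
        // records: each record carries both the target host and the port to dial.
        // "api.internal.local" is an illustrative name, not a real namespace.
        _, addrs, err := net.LookupSRV("", "", "api.internal.local")
        if err != nil {
            log.Fatal(err)
        }
        for _, srv := range addrs {
            fmt.Printf("dial %s:%d\n", srv.Target, srv.Port)
        }
    }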
Using ECS with CDK/CloudFormation is very painful. They don't support everything (especially regarding blue/green deployments), and sometimes they can't apply changes you make to a service. When initially setting everything up, I had to recreate the whole cluster from scratch several times. You can argue that's because I didn't know enough, but if that ever happened to me in prod I'd be screwed.
I haven't used EKS (I switched to Azure), so maybe EKS has their own complex painful points. I'm trying to keep my K8s as vanilla as possible to avoid the cloud lock-in.
Interesting that you say you worry about re-creating the cluster from scratch because I've experienced exactly the opposite. Our EKS cluster required so many operations outside CloudFormation to configure access control, add-ons, metrics server, ENABLE_PREFIX_DELEGATION, ENABLE_POD_ENI... It would be a huge risk to rebuild the EKS cluster. And applications hosted there are not independent because of these factors. It makes me very anxious working on the EKS cluster. Yes you can pay an extra $70/month to have a dev cluster, but it will never be equal to prod.
On the other hand, I was able to spin up an entire ECS cluster in a few minutes time with no manual operations and entirely within CloudFormation. ECS costs nothing extra, so creating multiple clusters is very reasonable, though separate clusters would impact packing efficiency. The applications can be fully independent.
> ECS has weird limits on how many containers you can run on one instance
Interesting. With ECS it says for c5.large the task limit is 2 without trunking, 10 with.
Why not dump your application server and dependencies into a rented data center (or EC2 if you must) and set up a coarse DR? Maybe start with a monolith in PHP or Rails.
None of that word salad sounds like a startup to me, but then again everyone loves to refer to themselves as a startup (must be a recruiting tool?), so perhaps muh dude is spot on.
I don't want to be negative, but this post reads like a list of things that I want to avoid in my career. I did a brief stint in cloud stuff at a FAANG and I don't care to go back to it.
Right now I'm engineer No. 1 at a startup, just doing DDD with a Django monolith. I'm still pretty Jr. and I'm wondering if there's a way to scale without needing to get into all of the things the author of this article mentions. Is it possible to get to a $100M valuation without needing all of this extra stuff? I realize it varies from business to business, but if anyone has examples of successes where people just used simple architectures I'd appreciate it.
You can scale to any valuation with any architecture. Whether or not you need sophisticated scaling solutions depends on the characteristics of your product, mostly how pure of a software play it is. Pure software means you will run into scaling challenges quicker, since likely part of your value add is in fact managing the complexity of scaling.
If you are running a marketplace app and collect fees, you're going to be able to go much further on simpler architectures than if you're trying to generate 10,000 AI images per second.
Don't need any of it. Start simple. Some may be useful though. The list makes good points. Keep it around and if you find yourself suffering from the lack of something, look through the list and see if anything there would be good ROI. But don't adopt something just because this list says you should.
One thing though, I'd start with Go. It's no more complex than Python, more efficient, and most importantly IMO, since it compiles down to a binary it's easier to build, deploy, share, etc. And there's less divergence in the ecosystem; generally one simple way to do things like building and packaging. I've not had to deal with versions or tooling or environment stuff nearly as much since switching.
You don't need this many tools, especially really early. It also depends on the particulars of your business. E.g. if you are B2B SaaS, then you need a ton of stuff automatically to get SOC2 and generally appease the security requirements of your customers.
That said, anything that's set-and-forget is great to start with. Anything that requires its own care and feeding can wait unless it's really critical. I think we have a project each quarter to optimize our Datadog costs and renegotiate our contract.
Also if you make microservices, you are going to need a ton of tools.
I'm currently early in my career and "the software guy" in a non-software team and role, but I'm looking to move into a more engineering direction. You've pretty much got my dream next job at the moment — if you don't mind me asking, how did you manage to find your role, especially being "still pretty Jr."?
Currently working at a $100M valuation tech company that fundamentally is built on a Django monolith with some other fluffy stuff lying around it. You can go far with a Django monolith and some load balancing.
I work at a startup and most of the stuff in the article covers things we use and solve real world problems.
If you're looking for successful businesses, indie hackers like levelsio show you how far you can get with very simple architectures. But that's solo dev work - once you have a team and are dealing with larger-scale data, things like infrastructure as code, orchestration, and observability become important. Kubernetes may or may not be essential depending on what you're building; it seems good for AI companies, though.
I would like to know what you’re being downvoted for. It’s not bad advice, necessarily… this was the way 20 years ago. I mean isn’t hacker news running kind of like this as a monolith on a single server? People might be surprised how far you can get with a simple setup.
The kitchen sink database used by everybody is such a common problem, yet it is repeated over and over again. If you grow it becomes significant tech debt and a performance bottleneck.
Fortunately, with managed DBs like RDS it is really easy to run individual DB clusters per major app.
Management problem masquerading as a tech problem.
Being shared between applications is literally what databases were invented to do. That’s why you learn a special dsl to query and update them instead of just doing it in the same language as your application.
The problem is that data is a shared resource. The database is where multiple groups in an organization come together to get something they all need. So it needs to be managed. It could be a dictator DBA or a set of rules designed in meetings and administered by ops, or whatever.
But imagine it was money. Different divisions produce and consume money just like data. Would anyone imagine suggesting either every team has their own bank account or total unfettered access to the corporate treasury? Of course not. You would make a system. Everyone would at least mildly hate it. That’s how databases should generally be managed once the company is any real size.
Why would you make it a shared resource if you don’t have to?
Decades of experience have shown us the massive costs of doing so: the crippled velocity and soul-crushing agony of DBA change-control teams, the overhead salary of database priests, the arcane performance nightmares, the nuclear blast radius, the fundamental organizational counter-incentives of a shared resource.
Why on earth would we choose to pay those terrible prices in this day and age, when infrastructure is code, managed databases are everywhere, and every team can have their own thing? You didn’t have a choice previously; now you do.
...I worked at a large software organization where larger teams had their own bank account, and there was a lot of internal billing, etc, mixed with plenty of funny-money to go along with it. That's not a contradiction, though, it perfectly illustrated your point for me.
The moment you have two databases is the moment you need to deal with data consistency problems.
If you can't do something like determine whether data can be deleted, as the article mentions, you won't be able to answer how to deal with those consistency problems.
The downside is then you have many, many DBs to fight with, to monitor, to tune, etc.
This is rarely a problem when things are small, but as they grow, the bad schema decisions made by empowering DBA-less teams to run their own infra become glaringly obvious.
Not a downside to me. Each team maintains their own DB and pays for their own choices.
In the kitchen sink model all teams are tied together for performance and scalability, and some bad apple applications can ruin the party for everyone.
Seen this countless times doing due diligence on startups. The universal kitchen sink DB is almost always one of the major tech debt items.
It's because I hate databases and programming separately. I would rather have slow code than have to dig into some database procedure. It's just another level of separation that's too mentally hard to manage. It's like... my queries go into a VM and now I have to worry about how the VM is performing.
I wish there were a programming language with first-class database support. I mean really first class, not just "let me run queries" but embedded into the language in a primal way, where I can deal with my database programming fanciness and my general development together.
Sincerely someone who inherited a project from a DBA.
Lots of interesting comments on this one. Anyone have any good resources for learning how not to fuck up schema/db design for those of us who will probably never have a DBA on the team?
Good question. We don't have a DBA either. I've learned SQL as needed and while I'm not terrible, it's still daunting when making the schema for a new module that might require 10-20 tables or more.
One thing that has worked well for us is to always include the top-most parent key in all child tables down the hierarchy. This way we can load all the data for, say, an order without joins/exists.
Oh and never use natural keys. Each time I thought finally I had a good use-case, it has bitten me in some way.
Apart from that we just try to think about the required data access and the queries needed. Main thing is that all queries should go against indexes in our case, so we make sure the schema supports that easily. Requires some educated guesses at times but mostly it's predictable IME.
Anyway would love to see a proper resource. We've made some mistakes but I'm sure there's more to learn.
Because I can go from main.go to a load balanced, autoscaling app with rolling deploys, segregated environments, logging & monitoring in about 30 minutes, and never need to touch _any_ of that again. Plus, if I leave, the guy who comes after me can look at a helm chart, terraform module + pipeline.yml and figure out how it works. Meanwhile, our janky shell-script-based task scheduler craps out on something new every month. What started as 15 lines of "docker run X, sleep 30, docker kill X" is now a polyglot monster to handle all sorts of edge cases.
I have spent vanishingly close to 0 hours on maintaining our (managed) kubernetes clusters in work over the past 3 years, and if I didn't show up tomorrow my replacement would be fine.
If you can do all that in 30 minutes (or even a few hours), I would love to read an article/post about your setup, or any resources you might recommend.
Why wouldn't you use Kubernetes? There are basically 3 classes of deployments:
1) We don't have any software, so we don't have a prod environment.
2) We have 1 team that makes 1 thing, so we just launch it out of systemd.
3) We have between 2 and 1000 teams that make things and want to self-manage when stuff gets rolled out.
Kubernetes is case 3. Like it or not, teams that don't coordinate with each other is how startups scale, just like big companies. You will never find a director of engineering that says "nah, let's just have one giant team and one giant codebase".
On AWS, at least, there are alternatives such as ECS and even plain old EC2 auto scaling groups. Teams can have the autonomy to run their infrastructure however they like (subject to whatever corporate policy and compliance regime requirements they might have to adhere to).
Kubernetes is appealing to many, but it is not 100% frictionless. There are upgrades to manage, control plane limits, leaky abstractions, different APIs from your cloud provider, different RBAC, and other things you might prefer to avoid. It's its own little world on top of whatever world you happen to be running your foundational infrastructure on.
One giant codebase is fine. Monorepo is better than lots of scattered repos linked together with git hashes. And it doesn't really get in the way of each team managing when stuff gets rolled out.
This is my case. I’m a one-man show at the moment, so no DBA. I’m still using Kubernetes. Many things can be automated as simply as a helm upgrade. Plus you get the benefit of not having a hot mess of systemd services, ad hoc tools you don’t remember how you configured, a plethora of bash scripts for common tasks, and so on.
I see Kubernetes as a one-time (mental and time) investment that buys me somewhat smoother sailing plus some other benefits.
Of course it is not all rainbows and unicorns. Having a single nginx server for a single /static directory would be my dream instead of MinIO and such.
Because it works, the infra folks you hired already know how to use it, the API is slightly less awful than working with AWS directly, and your manifests are kinda sorta portable in case you need to switch hosting providers for some reason.
Helm is the only infrastructure package manager I've ever used where you could reliably get random third party things running without a ton of hassle. It's a huge advantage.
We adopted TFC at the start of 2023 and it was problematic right from the start; stability issues, unforeseen limitations, and general jankiness. I have no regrets about moving us away from local execution, but Terraform Cloud was a terrible provider.
When they announced their pricing changes, the bill for our team of 5 engineers would have been roughly 20x, and more than hiring an engineer to literally sit there all day just running it manually. No idea what they’re thinking, apart from hoping their move away from open source would lock people in?
We ended up moving to Scalr, and although it hasn’t been a long time, I can’t speak highly enough of them so far. Support was amazing throughout our evaluation and migration, and where we’ve hit limits or blockers, they’ve worked with us to clear them very quickly.
I would love to see this type of thing from multiple sources. This reflects a lot of my own experience.
I think the format of this is great. I suppose it would take a motivated individual to go around and ask people to essentially fill out a form like this to get that.
One suggestion if we're gonna standardize around this format: avoid the double negatives. In some cases the author says "avoided XYZ" and then the judgment was "no regrets". Too many layers for me to parse there. Instead, I suggest each section being the product that was used. If you regret that product, the details are where you mention the product you should have used. Or you have another section for product ABC and you provide the context by saying "we adopted ABC after we abandoned XYZ".
I don't recommend trying to categorize into general areas like logging, postmortems, etc. Just do a top-level section for each product.
For people who enjoyed this post but want to see the other side of the spectrum where self hosted is the norm I'll point to the now classic series of posts on how Stack Overflow runs its infra: https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...
If anyone has newer posts like the above, please reply with links as I would love to read them.
Disagree on the point and reasoning about the single database.
Sounds like they experienced a badly managed and badly constrained database. The described FKs and relations: that's what key constraints and other guard rails and cascades are for - so that you are able to manage a schema. That's exactly how you do it: add in new tables that reference old data.
I think the regret is actually not managing the database, and not so much about having a single database.
"database is used by everyone, it becomes cared for by no one". How about "database is used by everyone, it becomes cared for by everyone".
Can you explain?
Having a tool to detect changes and create a migration doesn’t sound bad?
In a nutshell, that's how Django migrations work as well, and they work really well.
> How about "database is used by everyone, it becomes cared for by everyone".
So everyone needs to know every use case of that database? Seems very unlikely if there are multiple teams using the same DB.
FKs? Unique constraints? Not-null columns? If not added at the creation of the table they will never be added - the moment the DB is part of a public API you cannot do a lot of things safely.
The only moment when you want to share a DB is when you really need to squeeze out every last bit of performance - and even then, you want to have one owner and severely limited user accounts (with a whitelist of accessible views and stored procedures).
The database should never ever become part of a public API.
You don’t share a DB for performance reasons (rather the opposite), you do it to ensure data integrity and consistency.
And no, not everyone needs to know every use case. But every team needs to have someone who coordinates any overlapping schema concerns with the other teams. This needs to be managed, but it’s also not rocket science.
This is fabulous. I keep lists like this in my notebook(s). The critical thing here is that you shouldn't dwell on your "wrong" choices, instead document the choice, what you thought you were getting, what you got, and what information would have been helpful to know at the time of decision (or which information you should have given more weight at the time of the decision.) If you do this, you will consistently get better and better.
And by far "automate all the things" is probably my number one suggestion for DevOps folks. Something that saves you 10 minutes a day pays for itself in a month when you have a couple of hours available to diagnose and fix a bug that just showed up. (5 days a week X 4 weeks X 10 minutes = 200 minutes) The exponential effect of not having to do something is much larger than most people internalize (they will say, "This just takes me a couple of minutes to do." when in fact it takes 20 to 30 minutes to do and they have to do it repeatedly.)
As a machine learning platform engineer, these sound like technology choices as opposed to infrastructure decisions. I would love to read this post again but really focused on the infrastructure trade-offs that were made. But thanks for the post.
Side note: there is a small typo repeated twice, "Kuberentes".
Awesome writeup! Just had a couple comments/questions.
> Not adopting an identity platform early on
The reason for not adopting an IDP early is because almost every vendor price gouges for SAML SSO integration. Would you say it's worth the cost even when you're a 3-5 person startup?
> Datadog
What would you recommend as an alternative? Cloudwatch? I love everything about Datadog, except for their pricing....
> Nginx load balancer for EKS ingress
Any reason for doing this instead of an Application Load Balancer? Or even HA Proxy?
For Datadog, unfortunately there's no obvious alternative despite many companies trying to take market share. This is to say, Datadog has both second-to-none DX and a wide breadth of services.
Grafana Labs comes closest in terms of breadth but their DX is abysmal (I say this as a heavy grafana/prometheus user)
Same comments about New Relic, though they have better DX than Grafana.
Chronosphere has some nice DX around prometheus based metrics but lack the full product suite.
I could go on but essentially, all vendors either lack breadth, DX, or both.
Almost every time I read someone's insights who works in an environment with IaaS buy-in, my takeaway is the same: oh boy, what an alphabet soup.
The initial promise of "we'll take care of this for you, no in-house knowledge needed" has not materialized. For any non-trivial use case, all you do is replace transferrable, tailored knowledge with vendor-specific voodoo.
People who are serious about selling software-based services should do their own infrastructure.
Even if others disagree with your endorsements or regrets, this record shows you're actually aware of the important decisions you made over the past four years and tracked outcomes. Did you record the decisions when you made them and revisit later?
> Code is of course powerful, but I’ve found the restrictive nature of Terraform’s HCL to be a benefit with reduced complexity.
No way. We used Terraform before and the code just got unreadable. Simple things like looping can get so complex. Abstraction via modules is really tedious and decreases visibility.
CDKTF allowed us to reduce complexity drastically while keeping all the abstracted parts really visible. Best choice we ever made!
Great post. I do wonder - what are the simplest K8s alternatives?
Many say in the database world, "use Postgres", or "use sqlite." Similarly there are those databases that are robust that no one has heard of, but are very limited like FoundationDB. Or things that are specialized and generally respected like Clickhouse.
It’s mainly running your own control plane that is complex. Managed k8s (EKS, AKS, GKE) is not difficult at all. Don’t listen to all the haters. It’s the same crowd who think they can replace systemd with self hacked init scripts written in bash, because they don’t trust abstractions and need to see everything the computer does step-by-step.
I also stayed away for a long time due to all the fear spread here, after taking the leap, I’m not looking back.
The lightweight “simpler” alternative is docker-compose. I put simpler in quotes because once you factor in all the auxiliary software needed to operate the compose files in a professional way (IaC, Ansible, monitoring, auth, VM provisioning, ...), you will accumulate the same complexity yourself; the only difference is you are doing it with tools that may be more familiar to you. Kubernetes gives you a single control plane for all this. Does it come with a learning curve? Yes, but once you get over it there is nothing inherent about it that makes it unnecessarily complex. You don’t need the autoscaler, replica sets, and those more advanced features just because you are on k8s.
If you want to go even simpler, the clouds have offerings to just run a container, serverless, no fuss. I have to warn everyone though that using ACI on Azure was the biggest mistake of my career. Conceptually it sounds like a good idea but Azure’s execution of it is just a joke: updating a very small container image takes upwards of 20-30 minutes, no logs on startup crashes, it randomly stops serving traffic, bad integration with storage.
It's just that you should start with a handful of backed-up pet servers. Then automate their deployment when you need it. And only then go for a tool that abstracts the automated deployment, when you need it.
But I fear the simplest option in the Kubernetes area is Kubernetes.
I shunned k8s for a long time because of the complexity, but the managed options are so much easier to use and deploy than pet servers that I can’t justify it any more. For anything other than truly trivial cases, IMO Kubernetes (or similar, like Nomad) is easier than any alternative.
The stack I use is hosted Postgres and VKS from Vultr. It’s been rock solid for me, and the entire infrastructure can be stored in code.
This is good advice, if you haven't experienced the pain of doing it yourself, you won't know what the framework does for you. There are limits to this reasoning of course, we don't reimplement everything on the stack just for the learning experience. But starting with just docker might be a good idea.
> Multiple applications sharing a database [regret]
The industry has known this to be a stereotypically bad idea for generations now. It led to things like the enterprise service bus, service-oriented architectures, and finally "microservices". Recently I've seen "microservices" that share the same database, so we've come full circle.
Yet, every place I've worked was either laboring under a project to decouple two or more applications that were conjoined at the DB, or were still at the "this sucks but no one wants to fix it" stage.
How do we keep making this same mistake in industry?
Something I’ve noticed with PaaS services like RDS or Azure SQL is that people arguing against it are assuming that the alternative is “competence”.
Even in a startup, it’s difficult to hire an expert in every platform that can maintain a robust, secure system. It’s possible, but not guaranteed, and may require a high pay to retain the right staff.
Many government agencies on the other hand are legally banned from offering a competitive wage, so they can literally never hire anyone that competent.
This cap on skill level means that if they do need reliable platforms, the only way they can get one is by paying 10x the real market rate for an over-priced cloud service.
These are the “whales” that are keeping the cloud vendors fat and happy.
Props to the author for writing up the results from his exercise. But I think he should have focused on a few controversial ones, and not the rote ones.
Many of the decisions presented are not disagreeable (choosing Slack) and some lack framing that clarifies the associated loss (not adopting an identity platform early on). I think they're all good choices worth mentioning; I would have preferred a deeper look into the few that seemed easy and turned out to be hard, or the ones that were hard and got even harder.
The Bazel one made me chuckle - I worked at a company with an scm & build setup clearly inspired by Google’s setup. As a non-ex-Googler, I found it obviously insane, but there was just no way to get traction on that argument. I love that the rest of this list is pretty cut and dry, but Bazel is the one thing that the author can’t bring themself to say “don’t regret” even though they clearly don’t regret not using it.
I've seen Bazel reduce competent engineers to tears. There was a famous blog post a half-decade ago called something like "Bazel is the worst build system, except for all the others" and this still seems to ring true for me today.
There are some teams I work with that we'll never bother moving to Bazel, because we know in advance that it would cripple them.
Having led a successful Bazel migration, I'd still recommend many projects to stick to the native or standard supported toolchain until there's a good reason to migrate to a build system (And I don't consider GitHub actions to be a build system).
I’m curious, what do you find insane about Bazel? In my experience it makes plenty of sense. And after using it for some months, I find more insane how build systems like CMake depend on you having some stuff preinstalled in your system and produce a different result depending on which environment they’re run on.
> Discourage private messages and encourage public channels.
I wish my current company did this. It's infuriating. The other day, I asked a question about how to set something up, and a manager linked me to a channel where they'd discussed that very topic - but it was private, and apparently I don't warrant an invite, so instead I have to go bother some other engineers (one of whom is on vacation.)
Private channels should be for sensitive topics (legal, finance, etc) or for "cozy spaces" - a team should have a private channel that feels like their own area, but for things like projects and anything that should be searchable, please keep things public.
I think Kubernetes was a mistake and he should have gone with AWS ECS (using Fargate or backed by autoscaling EC2); with that single change he wouldn't need to even think about a bunch of other topics on his list. Something to think about: AWS Lambda first, then fall back to AWS ECS for everything else that really needs to be on 100% of the time.
I love this write-up and the way it's presented. I disagree with some of the decisions and recommendations, but it's great to read through the reasoning even in those cases.
It'd be amazing if more people published similar articles and there was a way to cross-compare them. At the very least, I'm inspired to write a similar article.
> There are no great FaaS options for running GPU workloads, which is why we could never go fully FaaS.
I keep wondering when this is going to show up. We have a lot of service providers, but even more frameworks, and every vendor seems to have their own bespoke API.
Right, I still find it faster to manually provision a specific instance type, install PyTorch on it, and deploy a little flask app for an inference server.
I am doing that. I am part of a research group, and don’t have the $$ or ability to pay so much for all these services.
So we got a $90k server with 184TB of raw storage (SAS SSD), 64 cores, and 1TB of memory. Put it on a 10Gb line at our university and it is rock solid. We probably have less downtime than GitHub, even with reboots every few months.
Have some large (multi-TB) databases on it and web APIs for accessing the data. It would be hugely expensive in the cloud, especially with egress costs.
You have to be comfortable sys-admining though. Fortunately I am.
I didn't understand this section. Ubuntu servers as dev environment, what do you mean? As in an environment to deploy things onto, or a way for developers to write code like with VSCode Remote?
seems like the latter given "Originally I tried making the dev servers the same base OS that our Kubernetes nodes ran on, thinking this would make the development environment closer to prod"
But I thought the whole point of the container ecosystem was to abstract away the OS layer. Given that the kernel is backwards compatible to a fault, shouldn't it be enough to have a kernel that is at least as recent as the one on your k8s platform (provided that you're running the default kernel or something close to it)?
I'm using Pulumi in production pretty heavily for a bunch of different app types (ECS, EKS, CloudFront, CloudFlare, Vault, Datadog monitors, Lambdas of all types, EC2s with ASGs, etc.), it's reasonably mature enough.
As mentioned in the other comment, the most commonly used Terraform providers are "bridged" to Pulumi, so the maturity is nearly identical to Terraform. I don't really use Pulumi's pre-built components (Crosswalk), but I don't find I've ever missed them.
I really like both Pulumi and Terraform (which I also used in production for hundreds of modules over a few years), which it seems isn't always a popular opinion on HN, but I have run both and you absolutely can run either tool in production just fine.
My slight preference is for Pulumi because I get slightly more willing assistance from devs on our team to reach in and change something in infra-land if they need to while working on app code.
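To make that concrete, a Pulumi program in Go reads like ordinary application code. This is only a rough sketch (the resource name is made up, and the exact SDK import versions may differ from what any given team uses):

    package main

    import (
        "github.com/pulumi/pulumi-aws/sdk/v6/go/aws/s3"
        "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
    )

    func main() {
        pulumi.Run(func(ctx *pulumi.Context) error {
            // To an app developer this is just Go: a constructor call,
            // error handling, and an exported output value.
            bucket, err := s3.NewBucket(ctx, "uploads", nil)
            if err != nil {
                return err
            }
            ctx.Export("bucketName", bucket.ID())
            return nil
        })
    }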
Say what you want. The tool then builds that, or changes what's there to match.
I've tried Pulumi, and understanding the bit that runs before it tries to do stuff, the bit that runs after, and working out where the bugs are is a PITA. It lulls you into a false sense of security that you can refer to your own variables in code, but that doesn't carry over to when it is actually running the plan on the cloud service (i.e. actually creating the infrastructure), because you can only refer to the outputs of other infrastructure.
CFN is too far in the other direction, primarily because it's completely invisible and hard to debug.
Terraform has enough programmability (eg for_each, for-expressions etc) that you can write "here is what I want and how the things link together" and terraform will work out how to do it.
The language is... sometimes painful, but it works.
The provider support is unmatched and the modules are of reasonable quality.
I understand, but I think they don’t have the luxury of not having a DBA. Data is important; it’s arguably more important than code. Someone needs to own thinking about data, whether it is stored in a hierarchical, navigation-based database such as a filesystem, a key-value store like S3 (which, sure, can emulate a filesystem), or in a relational database. Or, for that matter, in vendor systems such as Google Workspace email accounts or Office365 OneDrive.
Early on, depending on what you're building, you don't need a full-fledged DBA and can get away with at least one person who knows DB fundamentals.
But if you only want to hire React developers (or swap for the framework of the week) then you'll likely end up with zero understanding of the DB. Down the line you have a mess with inconsistent or corrupted data that'll come back with a vengeance.
We have a dotnet webapp deployed on Ubuntu and it leaves a lot to be desired. The package for .NET 6 from the default repo didn't recognise other dotnet components installed, and .NET 8 is not even coming to 22.04 - you have to install it from the MS repo. But that is not compatible with the default repo's package for .NET 6, so you have to remove that first and faff around with exact versions to get them installed side by side...
At least I don't have to deal with RHEL. Why is renewing a dev subscription so clunky?!
I don't get why all startups don't just start with a PaaS like Render, Fly.io or Heroku. Why spend time setting up your own infra and potentially have to hire dedicated staff to manage it when you can do away with all that and get on with trying to move your business forward?
If and when you start experiencing scaling problems (great!), that's the time to think about migrating to setting up infra.
Because like every service-oriented offering, each platform differentiates as hard as it can to lock you in to their way of doing things.
Things largely look the same on the surface; the differences bite at the implementation-detail level, where adjusting and counter-correcting down the track is fiddly and demands an adrenally draining level of attention - right when you're at the point where you're scaling and no longer have the time to deal with implementation-detail-level stuff.
You're on <platform> and you're doing things their way and pivoting the architecture will only be prioritised if the alternative would be bankruptcy.
> Very intuitive to configure and has worked well with no issues. Highly recommend using it to create your Let’s Encrypt certificates for Kubernetes.
> The only downside is we sometimes have ANCIENT (SaaS problems am I right?) tech stack customers that don’t trust Let’s Encrypt, and you need to go get a paid cert for those.
Cert-manager allows you to use any CA you like including paid ones without automation.
It is a shame karpenter is AWS only. I was thinking about how our k8s autoscaler could be better and landed on the same kind of design as karpenter where you work from unschedulable pods backwards. Right now we have an autoscaler which looks at resource utilization of a node pool but that doesn’t take into account things like topology spread constraints and resource fragmentation.
Terraform is great but it's so frustrating sometimes. You just pray that the provider supports the specific configuration of whatever resources you're working with, because otherwise, once those resources are up in multiple environments, you'll have to edit those configs somehow.
I've seen a lot of comments about how bad Datadog is because of cost, but surprisingly I haven't seen open-source alternatives like OpenTelemetry/Prometheus/Grafana/Tempo mentioned.
Is it because most people are willing to pay someone else to manage monitoring infrastructure or other reasons?
the way I think of datadog is that it provides second-to-none DX combined with a wide suite of product offerings that is good enough for most companies most of the time. does it have opaque pricing that can be 100x more expensive than alternatives? absolutely! will people continue to use it? yes!
something to keep in mind is that most companies are not like the folks in this thread. they might not have the expertise, time or bandwidth to invest in building out observability.
the vast majority of companies just want something that basically works and doesn’t take a lot of training to use.
I think of Datadog as the Apple of observability vendors - it doesn’t offer everything and there are real limitations (and price tags) for more precise use cases but in the general case, it just works (especially if you stay within its ecosystem)
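To be fair to the open-source route mentioned a couple of comments up, the application-side half is genuinely small - here's a rough sketch with the Prometheus Go client (the metric name and port are made up). The part people pay Datadog to avoid is running, scaling, and retaining the backend, not this bit:

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A counter the app increments; a Prometheus server scrapes it from /metrics.
    var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_requests_total",
        Help: "Total HTTP requests handled.",
    })

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            requestsTotal.Inc()
            w.Write([]byte("hello"))
        })
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }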
> There are no great FaaS options for running GPU workloads
This hits hard. Someone please take my (client's) money and provide sane GPU FaaS. Banana.dev is cool but not really enterprise-ready. I wish there were an AWS/GCP/Azure analogue that the penny pinchers and MBAs in charge of procurement could get behind.
Definitely. But the sad reality is that in some corporate environments (incumbent finance, government) if it's not a button click in portal.azure.com away, you can spend 6-12 months in meetings with low energy gloomboys to get your access approved.
This guy gets it, I agree with it all. The exception being, use Fargate without K8s and lean on Terraform and AWS services rather than the K8s alternatives. When you have no choice left and you have to use K8s, then I would pick it up. No sense going down into the mines if you don't have to.
As someone who isn't a developer, reading this was eye-opening. It's interesting just how unbundled the state of running a software company is. And this is only your selection of the tools and options, to say nothing of the entire landscape.
Interesting read, I agree with adopting an identity platform but this can definitely be contentious if you want to own your data.
I wonder how much one should pay attention to future problems at the start of a startup versus "move fast and break things." Some of this stuff might just put you off finishing.
I was hoping there would be a section for Search Engines. It's one of those things you tend to get locked in to, and it's hard to clearly know your requirements well enough early on.
Any references to something like this with a Search slant would be greatly appreciated.
After reading through this entire post, I'm pleasantly surprised that there isn't one item where I don't mirror the same endorse/regret as the author. I'm not sure if this is coincidence or popular opinion.
What’s the right way to manage npm installs and deploy it to an AWS ec2 instance from github? Kubernetes? GitOps? EKS? I roll my own solution now with cron and bash because everything seems so bloated.
Using k8s over ECS and raw-dogging Terraform instead of using the CDK? It's no wonder you end up needing to hire entire teams of people just to manage infra
I struggle with the type system in both, but today I was going through obscure go code and wishing interfaces were explicitly implemented. Lack of sum types is making me sad
"Since the database is used by everyone, it becomes cared for by no one. Startups don’t have the luxury of a DBA, and everything owned by no one is owned by infrastructure eventually"
I think adding a DBA, or hiring one to help you lay out your database, should not be considered a 'luxury'...
Yeah I mean, hiring one person to own that for 5-10 teams is pretty cheap... Cheaper than each team constantly solving the same problems and relearning the same gotchas/operational stuff that doesn't add much value when writing your application code.
VPNs can be wonderful, and you can use Tailscale or AWS VPN or OpenVPN or IPsec, and you can authenticate using Okta or GSuite or Auth0 or Keycloak or Authelia.
But since when is this Zero Trust? It takes a somewhat unusual firewall scheme to make a VPN do anything that I would seriously construe as Zero Trust, and getting authz on top of that is a real PITA.
No, just no. I see this cropping up now and then. Homebrew is unsafe for Linux, and is only recommended by Mac users that don't want to bother to learn about existing package management.
> The markup cost of using RDS (or any managed database) is worth it.
Every so often I price out RDS to replace our colocated SQL Server cluster and it's so unrealistically expensive that I just have to laugh. It's absurdly far beyond what I'd be willing to pay. The markup is enough to pay for the colocation rack, the AWS Direct Connects, the servers, the SAN, the SQL Server licenses, the maintenance contracts, and a full-time in-house DBA.
https://calculator.aws/#/estimate?id=48b0bab00fe90c5e6de68d0...
Total 12 months cost: 547,441.85 USD
Once you get past the point where the markup can pay for one or more full-time employees, I think you should consider doing that instead of blindly paying more and more to scale RDS up. You're REALLY paying for it with RDS. At least re-evaluate the choices you made as a fledgling startup once you reach the scale where you're paying AWS "full time engineer" amounts of money.
Some orgs are looking at moving back to on prem because they're figuring this out. For a while it was vogue to go from capex to opex costs, and C suite people were incentivized to do that via comp structures, hence "digital transformation" ie: migration to public cloud infrastructure. Now, those same orgs are realizing that renting computers actually costs more than owning them, when you're utilizing them to a significant degree.
Just like any other asset.
Funny story time.
I was once part of an acquisition from a much larger corporate entity. The new parent company was in the middle of a huge cloud migration, and as part of our integration into their org, we were required to migrate our services to the cloud.
Our calculations said it would cost 3x as much to run our infra on the cloud.
We pushed back, and were greenlit on creating a hybrid architecture that allowed us to launch machines both on-prem and in the cloud (via a direct link to the cloud datacenter). This gave us the benefit of autoscaling our volatile services, while maintaining our predictable services on the cheap.
After I left, apparently my former team was strong-armed into migrating everything to the cloud.
A few years go by, and guess who reaches out on LinkedIn?
The parent org was curious how we built the hybrid infra, and wanted us to come back to do it again.
I didn't go back.
Context: I build internal tools and platforms. Traffic on them varies, but some of them are quite active.
My nasty little secret is for single server databases I have zero fear of over provisioning disk iops and running it on SQLite or making a single RDBMS server in a container. I've never actually run into an issue with this. It surprises me the number of internal tools I see that depend on large RDS installations that have piddly requirements.
That’s made possible because of all the orchestration platforms such as Kubernetes being standardized, and as such you can get pretty close to a cloud experience while having all your infrastructure on-premise.
Same experience here. As a small organization, the quotes we got from cloud providers have always been prohibitively expensive compared to running things locally, even when we accounted for geographical redundancy, generous labor costs, etc. Plus, we get to keep know how and avoid lock-in, which are extremely important things in the long term.
Besides, running things locally can be refreshingly simple if you are just starting something and you don't need tons of extra stuff, which becomes accidental complexity between you, the problem, and a solution. This old post described that point quite well by comparing Unix to Taco Bell: https://news.ycombinator.com/item?id=10829512.
I am sure for some use-cases cloud services might be worth it, especially if you are a large organization and you get huge discounts. But I see lots of business types blindly advocating for clouds, without understanding costs and technical tradeoffs. Fortunately, the trend seems to be plateauing. I see an increasing demand for people with HPC, DB administration, and sysadmin skills.
Keep in mind, there is an in-between...
I would have a hard time running servers as cheaply as Hetzner does, for example, including the routing and everything.
It's not an either/or. Many business both own and rent things.
If price is the only factor, your business model (or your executives' decision-making) is questionable. If you buy only the cheapest shit and spend your time building your own office chair rather than talking to a customer, you aren't making a premium product, and that means you're not differentiated.
Yep. This.
I would imagine that cloud infrastructure has the ability to scale up fast, unlike self-owned infrastructure.
For example, how long does it take to rent another rack that you didn't plan for?
And the cloud management platforms that you have to deploy to manage these owned assets are not free either.
I mean, how come even large consumers of electricity don't buy and own their own infrastructure to generate it?
RDS pricing is deranged at the scales I've seen too. $60k/year for something I could run on just a slice of one of my on-prem $20k servers. This is something we would have run tens of. $600k/year operational against sub-$100k capital cost pays for DBAs, backups, etc. with money to spare.
Sure, maybe if you are some sort of SaaS with a need for a small single DB, that also needs to be resilient, backed up, rock solid bulletproof.. it makes sense? But how many cases are there of this? If its so fundamental to your product and needs such uptime & redundancy, what are the odds its also reasonably small?
> Sure, maybe if you are some sort of SaaS with a need for a small single DB, that also needs to be resilient, backed up, rock solid bulletproof.. it makes sense? But how many cases are there of this?
Most software startups these days? The blog post is about work done at a startup after all. By the time your db is big enough to cost an unreasonable amount on RDS, you’re likely a big enough team to have options. If you’re a small startup, saving a couple hundred bucks a month by self managing your database is rarely a good choice. There’re more valuable things to work on.
I have a small MySQL database that’s rather important, and RDS was a complete failure.
It would have cost a negligible amount. But the sheer amount of time I wasted before I gave up was honestly quite surprising. Let’s see:
- I wanted one simple extension. I could have compromised on this, but getting it to work on RDS was a nonstarter.
- I wanted RDS to _import the data_. Nope, RDS isn’t “SUPER,” so it rejects a bunch of stuff that mysqldump emits. Hacking around it with sed was not confidence-inspiring.
- The database uses GTIDs and needed to maintain replication to a non-AWS system. RDS nominally supports GTID, but the documented way to enable it at import time strongly suggests that whoever wrote the docs doesn’t actually understand the purpose of GTID, and it wasn’t clear that RDS could do it right. At least Azure’s docs suggested that I could have written code to target some strange APIs to program the thing correctly.
Time wasted: a surprising number of hours. I’d rather give someone a bit of money to manage the thing, but it’s still on a combination of plain cloud servers and bare metal. Oh well.
> Sure, maybe if you are some sort of SaaS with a need for a small single DB, that also needs to be resilient, backed up, rock solid bulletproof.. it makes sense? But how many cases are there of this?
Very small businesses with phone apps or web apps are often using it. There are cheaper options of course, but when there is no "prem" and there are 1-5 employees then it doesn't make much sense to hire for infra. You outsource all digital work to an agency who sets you up a cloud account so you have ownership, but they do all software dev and infra work.
> If its so fundamental to your product and needs such uptime & redundancy, what are the odds its also reasonably small?
Small businesses again, some of my clients could probably run off a Pentium 4 from 2008, but due to nature of the org and agency engagement it often needs to live in the cloud somewhere.
I am constantly beating the drum to reduce costs and use as little infra as needed though, so in a sense I agree, but the engagement is what it is.
Additionally, everyone wants to believe they will need to hyperscale, so even medium-scale businesses over-provision, and some agencies are happy to do that for them as they profit off the margin.
Lots of cases. It doesn't even have to be a tiny database. Within <1TB range there's a huge number of online companies that don't need to do more than hundreds of queries per second, but need the reliability and quick failover that RDS gives them. The $600k cost is absurd indeed, but it's not the range of what those companies spend.
Also, Aurora gives you the block level cluster that you can't deploy on your own - it's way easier to work with than the usual replication.
RDS is not as bulletproof as advertised, and the support is first arrogant, then (maybe) helpful.
People pay for RDS because they want to believe in a fairy tale that it will keep potential problems away and that it worked well for other customers. But those mythical other customers also paid based on the same belief. Plus, no one wants to admit that they pay money in such an irrational way. It's a bubble.
> $600k/year operational against sub-$100k capital cost pays DBAs, backups, etc with money to spare.
One of these is not like the others (DBAs are not capex.)
Have you ever considered that if a company can get the same result for the same price ($100K opex for RDS vs same for human DBA), it actually makes much more sense to go the route that takes the human out of the loop?
The human shows up hungover, goes crazy, gropes Stacy from HR, etc.
RDS just hums along without all the liabilities.
The US DoD for sure.
Out of curiosity, who is your onprem provider?
That's a huge instance with an enterprise license on top. Most large SaaS companies can run off of $5k/month or cheaper RDS deployments, which isn't enough to pay someone. The number of people running half-a-million-a-year RDS bills might not be that large. For most people RDS is worth it as soon as you have backup requirements and would otherwise have to implement them yourself.
> Most large SaaS companies can run off of $5k / m or cheaper RDS
Hard disagree. An r6i.12xl Multi-AZ with 7500 IOPS / 500 GiB io1 books at $10K/month on its own. Add a read replica, even Single-AZ at a smaller size, and you’re half that again. And this is without the infra required to run a load balancer / connection pooler.
I don’t know what your definition of “large” is, but the described would be adequate at best at the ~100K QPS level.
RDS is expensive as hell, because they know most people don’t want to take the time to read docs and understand how to implement a solid backup strategy. That, and they’ve somehow convinced everyone that you don’t have to tune RDS.
Definitely--I recommend this after you've reached the point where you're writing huge checks to AWS. Maybe this is just assumed but I've never seen anyone else add that nuance to the "just use RDS" advice. It's always just "RDS is worth it" full stop, as in this article.
>Most large SaaS companies can run off of $5k / m or cheaper RDS deployments which isn't enough to pay someone.
After initial setup, managing the equivalent of a $5k/month RDS deployment is not a full-time job. Add to that the fact that wages differ a lot around the world, and $5k can take you very, very far in terms of paying someone.
The problem you have here is by the time you reach the size of this DB, you are on a special discount rate within AWS.
Discount rates are actually much better on the bigger instances too. Therefore the "sticker price" that people compare on the public site is nowhere close to a fair comparison.
We technically aren't supposed to talk about pricing publicly, but I'm just going to say that we run a few 8XL and 12XL RDS instances and we pay ~40% off the sticker price.
If you switch to Aurora engine the pricing is absurdly complex (its basically impossible to determine without a simulation calculator) but AWS is even more aggressive with discounting on Aurora, not to mention there are some legit amazing feature benefits by switching.
I'm still in agreement that you could do it cheaper yourself at a data center. But there are some serious tradeoffs made by doing it that way. One is complexity, and it certainly requires several new hiring decisions. Those have their own tangible costs, but there is a huge amount of intangible cost as well: pure inconvenience, more people management, more hiring, split expertise, complexity in networking systems, reduced elasticity of decisions, longer commitments, etc. It's harder to put a price on that.
When you account for the discounts at this scale, I think the cost gap between the two solutions is much smaller and these inconveniences and complexities by rolling it yourself are sometimes worth bridging that smaller gap in cost in order to gain those efficiencies.
This is because you are using SQL Server. Microsoft has intentionally made cloud pricing for SQL server prohibitively expensive for non-Azure cloud workloads by requiring per-core licensing that is extremely punitive for the way EC2 and RDS is architected. This has the effect of making RDS vastly more expensive than running the same workload on bare metal or Azure.
Frankly, this is anti-competitive, and the FTC should look into it, however, Microsoft has been anti-competitive and customer hostile for decades, so if you're still using their products, you must have accepted the abuse already.
Totally agree. It's cherry-picking some weird case that's not even close to typical for a startup.
Cloud was supposed to be a commodity. Instead it is priced like a burger at the ski hill.
If it is such a golden goose, then other competitors will come in and compete the price down.
You don't get the higher end machines on AWS unless you're a big guy. We have Epyc 9684X on-prem. Cannot match that at the price on AWS. That's just about making the choices. Most companies are not DB-primary.
I think most people who’ve never experienced native NVMe for a DB are also unaware of just how blindingly fast it is. Even io2 Block Express isn’t the same.
9 replies →
Elsewhere today I recommended RDS, but was thinking of small startup cases that may lack infrastructure chops.
But you are totally right that it can be expensive. I worked with a startup that had some inefficient queries; normally that wouldn't matter much, but with RDS it cost $3,000 a month for a tiny user base and not that much data (millions of rows at most).
That sounds like the app needs some serious surgery.
Also, it is often overlooked that you still need skilled people to run RDS. It's certainly not "2-clicks and forget" and "you don't need to pay anyone running your DB".
I haven't run a Postgres instance with proper backup and restore, but it doesn't seem like rocket science using barman or pgbackrest.
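For a sense of scale, here is a minimal Python sketch of the nightly-dump half of that job (pg_dump plus an S3 upload). barman and pgbackrest do considerably more than this (physical backups, WAL archiving, point-in-time recovery), and the connection string and bucket name below are placeholders:

```python
#!/usr/bin/env python3
"""Minimal nightly logical-backup sketch: pg_dump to a file, then ship to S3."""
import datetime
import subprocess

import boto3

DB_URL = "postgresql://backup_user@db.internal:5432/appdb"  # placeholder
BUCKET = "example-db-backups"                               # placeholder

def nightly_backup() -> str:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/appdb-{stamp}.dump"
    # Custom-format dump so pg_restore can do selective/parallel restores.
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={dump_path}", DB_URL],
        check=True,
    )
    # Ship it off-box; S3 versioning and lifecycle rules handle retention.
    boto3.client("s3").upload_file(dump_path, BUCKET, f"appdb/{stamp}.dump")
    return dump_path

if __name__ == "__main__":
    nightly_backup()
```

The part the dedicated tools really earn their keep on is restore testing, which a sketch like this does nothing about.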
People who use MSFT SQL Server in 2024 should suffer. For everybody else there's always Neon.
Data isn't cheap and never was. Paying licensing fees on top makes it more expensive. It really depends on the circumstance: a managed database usually has extended support from the company providing it. You have to weigh your team's expertise to manage a solution on your own and ensure you spend ample time making it resilient. The other half is the cost of upgrading hardware; sometimes it is better to just pay a cloud provider if your business does not have enough income to buy hardware outright. There is always an upfront cost.
For small databases or test-environment databases, you can also leverage Kubernetes to host an operator for that tiny DB. When it comes to serious data that needs a fast, dependable recovery strategy: RDS.
Really it should be a mix: self-hosted for the things you aren't afraid to break, managed for the things you'd be putting at high risk.
I'd add another criticism to the whole quote:
> Data is the most critical part of your infrastructure. You lose your network: that’s downtime. You lose your data: that’s a company ending event. The markup cost of using RDS (or any managed database) is worth it.
You need well-run, regularly tested, air gapped or otherwise immutable backups of your DB (and other critical biz data). Even if RDS was perfect, it still doesn't protect you from the things that backups protect you from.
After you have backups, the idea of paying enormous amounts for RDS in order to keep your company from ending is more far-fetched.
In another section, they mentioned they don't have a DBA, no app team owns the database, and the infra team is overwhelmed.
RDS makes perfect sense for them.
I agree that RDS is stupidly expensive and not worth it provided that the company actually hires at least 2x full-time database owners who monitor, configure, scale and back up databases. Most startups will just save the money and let developers "own" their own databases or "be responsible for" uptime and backups.
For a couple hundred grand you can get a team of 20 fully trained people working full time in most parts of the world.
Even for small workloads it's a difficult choice. I ran a small but vital DB, and RDS was costing us like 60 bucks a month per env. That's $240/month/app.
DynamoDB as a replacement, pay per request, was essentially free.
I found Dynamo foreign and rather ugly to code for initially, but am happy with the performance, and especially the price, in the end.
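For illustration, creating an on-demand DynamoDB table is a one-call affair with boto3; the table and key names below are made up, and with PAY_PER_REQUEST there is no capacity to plan, so a mostly idle internal tool costs close to nothing:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# On-demand (pay-per-request) billing: no provisioned capacity to size or pay for.
dynamodb.create_table(
    TableName="app-config",  # placeholder name
    AttributeDefinitions=[
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "pk", "KeyType": "HASH"},
        {"AttributeName": "sk", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```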
For big companies such as banks this cost comparison is not as straightforward. They have whole data centres just sitting there for disaster recovery. They periodically do switchovers to test DR. All of this expense goes away when they migrate to cloud.
> All of this expense goes away when they migrate to cloud.
They need to replicate everything in multiple availability zones, which is going to be more expensive than replicating data centres.
They still need to test that their cloud infrastructure works.
> All of this expense goes away when they migrate to cloud.
Just to pay someone else enough money to provide the same service and make a profit while doing it.
4 replies →
From what I’ve read, a common model for mmorpg companies is to use on-prem or colocated as their primary and then provision a cloud service for backup or overage.
Seems like a solid cost effective approach for when a company reaches a certain scale.
Lots of companies, like Grinding Gear Games and Square Enix, just rent whole servers for a tiny fraction of the price compared to what the price-gouging cloud providers would charge for the same resources. They get the best of both worlds. They can scale up their infrastructure in hours or even minutes, and they can move to any other commodity hardware in any other datacenter at the drop of a hat if they get screwed on pricing. Migrating from one server provider (such as IBM) to another (such as Hetzner) can take an experienced team 1-2 weeks at most. Given that pricing updates are usually given 1-3 quarters ahead at a minimum, they have massive leverage over their providers because they can so easily switch. Meanwhile, if AWS decides to jack up their prices, you're pretty much screwed in the short term if you designed around their cloud services.
While I agree that RDS is expensive, you're making two false claims here:
1. Hiring someone full time to work on the database means migrating off RDS
2. Database work is only about spend reduction
In your case it sounds more viable to move to VMs instead of RDS, which some cloud providers also recommend.
That's the cost of two people.
> Picking AWS over Google Cloud
I know this is an unpopular opinion but I think Google Cloud is amazing compared to AWS. I use Google Cloud Run and it works like a dream. I have never found an easier way to get a docker container running in the cloud. The services all have sensible names, there are fewer, more important services compared to the mess of AWS services, and the UI is more intuitive. The only downside I have found is the lack of community, resulting in fewer tutorials, difficulty finding experienced hires, and fewer third-party tools. I recommend trying it. I'd love to get the user base to an even dozen.
The reasoning the author cites is that AWS has more responsive customer service, and maybe I am missing out, but it would never even occur to me to speak to someone from a cloud provider. They mention having "regular cadence meetings with our AWS account manager" and I am not sure what could be discussed. I must be doing simpler stuff.
> "regular cadence meetings with our AWS account manager" and I am not sure what could be discusse.
As being on a number of those calls, its just a bunch of crap where they talk like a scripted bot reading from corporate buzzword bingo card over a slideshow. Their real intention is two fold. To sell you even more AWS complexity/services, and to provide "value" to their person of contact (which is person working in your company).
We're paying north of 500K per year in AWS support (which is a highway robbery), and in return you get a "team" of people supposedly dedicated to you, which sounds good in theory but you get a labirinth of irresponsiblity, stalling and frustration in reality.
So even when you want to reach out to that team you have to first to through L1 support which I'm sure will be replaced by bots soon (and no value will be lost) which is useful in 1 out of 10 cases. Then if you're not satisfied with L1's answer(s), then you try to escalate to your "dedicated" support team, then they schedule a call in three days time, or if that is around Friday, that means Monday etc.
Their goal is to stall so you figure and fix stuff on your own so they shield their own better quality teams. No wonder our top engineers just left all AWS communication and in cases where unavoidable they delegate this to junior people who still think they are getting something in return.
> We're paying north of 500K per year in AWS support (which is highway robbery), and in return you get a "team" of people supposedly dedicated to you, which sounds good in theory but is a labyrinth of irresponsibility, stalling, and frustration in reality.
I’ve found a lot of the time the issues we run into are self-inflicted. When we call support for these, they have to reverse-engineer everything which takes time.
However when we can pinpoint the issue to AWS services, it has been really helpful to have them on the horn to confirm & help us come up with a fix/workaround. These issues come up more rarely, but are extremely frustrating. Support is almost mandated in these cases.
It’s worth mentioning that we operate at a scale where the support cost is a non-issue compared to overall engineering costs. There’s a balance, and we have an internal structure that catches most of the first type of issue nowadays.
What questions do you even ask?
In my experience, all the questions I've had for AWS were about: 1. Their bugs, which won't be fixed in the near future anyway. 2. Their transient failures, which will be fixed soon anyway.
So there's zero value in ever contacting AWS support.
This rings so true from experience it hurts.
This. This is the reality.
I am so tired of the support team having all the real metrics, especially on IO and throttling, and not surfacing them to us somehow.
And cadence is really an opportunity for them to sell to you, the parent is completely right.
We are a reasonably large AWS customer and our account manager sends out regular emails with NDA information on what's coming up, we have regular meetings with them about things as wide ranging as database tuning and code development/deployment governance.
They often provide that consulting for free, and we know their biases. There's nothing hidden about the fact that they will push us to use AWS services.
On the other hand, they will also help us optimize those services and save money that is directly measurable.
GCP might have a better API and better "naming" of their services, but the breadth of AWS services, the incorporation of IAM across their services, governance, and automation all make it worthwhile.
Cloud has come a long way from "it's so easy to spin up a VM/container/lambda".
> There's nothing hidden about the fact that they will push us to use AWS services.
Our account team don't even do that. We use a lot of AWS anyway and they know it, so they're happy to help with competitor offerings and integrating with our existing stack. Their main push on us has been to not waste money.
3 replies →
In a previous role I got all of these things from GCP – they ran training for us, gave us early access to some alpha/beta stage products (under NDA), we got direct onboarding from engineers on those, they gave us consulting level support on some things and offered much more of it than we took up.
Fwiw, as a medium-spend (250k, eventually growing into 1M+/year) GCP customer we had the same deal with product roadmaps shared up-front under NDA, etc.
And never did I miss something in GCP that I could find in AWS. Not sure the breadth is adding much compared to a simpler product suite in GCP.
I don’t have as much experience with AWS, but I do hate GCP. The UI is slow and buggy. The way they want things to authenticate is half-baked and only implemented in some libraries, and it isn’t always clear which library supports it. The gcloud command line tool regularly just doesn’t work; it hangs and never times out, forcing you to kill it manually, wondering if it did anything and whether you’ll mess something up by running it again. The way they update client libraries by running code generation means there are tons of commits that aren’t relevant to the library you’re actually using. Features are not available across all client libraries. Documentation contradicts itself or contradicts support recommendations. Core services like BigQuery lack any emulator or Docker image to facilitate CI or testing without having to set up a separate project you have to pay for.
Oh, friend, you have not known UI pain until you've used portal.azure.com. That piece of junk requires actual page reloads to make any changes show up. That Refresh button is just like the close-door elevator button: it's there for you to blow off steam, but it for damn sure does not DO anything. I have boundless screenshots showing when their own UI actually pops up a dialog saying "ok, I did what you asked but it's not going to show up in the console for 10 minutes so check back later". If you forget to always reload the page, and accidentally click on something that it says exists but doesn't, you get the world's ugliest error message and only by squinting at it do you realize it's just the 404 page rendered as if the world has fallen over
I suspect the team that manages it was OKR-ed into using AJAX but come from a classic ASP background, so don't understand what all this "single page app" fad is all about and hope it blows over one day
3 replies →
aws is even worse yet somehow people love them, maybe because they get to talk to a support "human" to hand-hold them through all the badness
Totally agree, GCP is far easier to work with and get things up and running on, for how my brain works, compared to AWS. Also, GCP names stuff in a way that tells me what it does; AWS names things like a teenage boy trying to be cool.
That's completely opposite to my experience. Do you have any examples of AWS naming that you think is "teenage boy trying to be cool"? I am genuinely curious.
17 replies →
I have had the experience of an AWS account manager helping me by getting something fixed (working at a big client). But more commonly, I think the account manager’s job at AWS or any cloud or SAAS is to create a reality distortion field and distract you from how much they are charging you.
> I think the account manager’s job at AWS or any cloud or SAAS is to create a reality distortion field and distract you from how much they are charging you.
How do they do this jedi mind trick?
2 replies →
Maybe your TAM is different, but ours regularly does presentations about cost breakdown, future planning, and possible reservations. There's nothing distracting there.
AWS enterprise support (basically first-line support that you pay for) is actually really, really good. They will look at your metrics/logs and share solid insights with you. For anything more, you can talk to a TAM who can then reach out to the relevant engineering teams.
I share your thoughts. Honestly, it reads like an entire article endorsing AWS.
Heartily seconded. Also don't forget the docs: Google Cloud docs are generally fairly sane and often even useful, whereas my stomach churns whenever I have to dive into AWS's labyrinth of semi-outdated, nigh-unreadable crap.
To be fair there are lots of GCP docs, but I cannot say they are as good as AWS's. Everything is CLI-based, and some things are broken or hello-world-useless. It takes time to go through multiple duplicate articles to find anything decent. I have never had this issue with AWS.
GCP SDK docs must be mentioned separately, as they're bizarre auto-generated nonsense. Have you seen them? How can you even say that GCP docs are good after that?
2 replies →
We're relatively small GCP users (low six figures) and have monthly cadence meetings with our Google account manager. They're very accommodating, and will help with contacts, events and marketing.
> I have never found an easier way to get a docker container running in the cloud
I don't have a ton of Azure or cloud experience but I run an Unraid server locally which has a decent Docker gui.
Getting a docker container running in Azure is so complicated. I gave up after an hour of poking around.
Azure is a complete disaster, deserves its own garbage-category, and gives people PTSD. I don't think AWS/CGP should ever be compared to it at all.
3 replies →
Oh I disagree - we migrated from azure to AWS, and running a container on Fargate is significantly more work than Azure Container Apps [0]. Container Apps was basically "here's a container, now go".
[0] https://azure.microsoft.com/en-gb/products/container-apps
2 replies →
GCP support is atrocious. I've worked at one of their largest clients and we literally had to get executives into the loop (on both sides) to get things done sometimes. Multiple times they broke some functionality we depended on (one time they fixed it weeks later except it was still broken) or gave us bad advice that cost a lot of money (which they at least refunded if we did all the paperwork to document it). It was so bad that my team viewed even contacting GCP as an impediment and distraction to actually solving a problem they caused.
I also worked at a smaller company using GCP. GCP refused to do a small quota increase (which AWS just does via a web form) unless I got on a call with my sales representative and listened to a 30 minute upsell pitch.
If you are big enough to have regular meetings with AWS you are big enough to have meetings with GCP.
I’ve had technicians at both GCP and Azure debug code and spend hours on developing services.
> I’ve had technicians at both GCP and Azure debug code and spend hours on developing services.
Almost every time Google pulled in a specialist engineer working on a service/product we had issues with it was very very clear the engineer had no desire to be on that call or to help us. In other words they'd get no benefit from helping us and it was taking away from things that would help their career at Google. Sometimes they didn't even show up to the first call and only did to the second after an escalation up the management chain.
> I have never found an easier way to get a docker container running in the cloud
We started using Azure Container Apps (ACA) and it seems simple enough.
Create ACA, point to GitHub repo, it runs.
Push an update to GitHub and it redeploys.
Azure Container Apps (ACA) and AWS AppRunner are also heavily "inspired" by Google Cloud Run.
1 reply →
Also much prefer GCP but gotta say their support is hot steaming **. I wasted so much time for absolutely nothing with them.
GCP's SDK and documentation are a mess compared to AWS's. And looking at the source code I don't see how it can get better any time soon. AWS seems to have proper design in mind and uses fewer abstractions, giving you freedom to build what you need. AWS CDK is great for IaC.
The only weird part I experienced with AWS is their SNS API. Maybe due to legacy reasons, but what a bizarre mess when you try doing it cross-account. This one is odd.
I have been trying GCP for a while and the DevX was horrible. The only part that more-or-less works is the CLI, but the naming there is inconsistent and not as well done as in AWS. But it's relative and subjective, so I guess someone likes it. I have experienced GCP official guides that are broken, untested, or utterly braindead hello-world-useless. They are also numerous and scattered, so it takes time to find anything decent.
No dark mode is an extra punch. Seriously. Tried to make it myself with an extension, but their page is an Angular hell of a million nested divs. No thank you.
And since you mentioned Cloud Run -- it takes 3 seconds to deploy a Lambda version in AWS and a minute or more for a GCP Cloud Function.
The author leads infrastructure at Cresta. Cresta is a customer service automation company. His first point is about how happy he is to have picked AWS and their human-based customer service, versus Google's robot-based customer service.
I'm not saying there's anything wrong, and I'm oversimplifying a bit, but I still find this amusing.
Haha very good catch. I prefer GCP but I will admit any day of the week that their support is bad. Makes sense that they would value good support highly.
We used to use AWS and GCP at my previous company. GCP support was fine, and I never saw anything from AWS support that GCP didn't also do. I've heard horror stories about both, including some security support horror stories from AWS that are quite troubling.
Utter insanity. So much cost and complexity, and for what? Startups don’t think about costs or runway anymore, all they care about is “modern infrastructure”.
The argument for RDS seems to be “we can’t automate backups”. What on earth?
Is spending time to make it reliable worth it vs working on your actual product? Databases are THE most critical things your company has.
I see this argument a lot. Then most startups use that time to create rushed half-assed features instead of spending a week on their db that'll end up saving hundreds of thousands of dollars. Forever.
For me that's short-sighted.
4 replies →
All that infra doesn’t integrate itself. Everywhere I’ve worked that had this kind of stack employed at least one if not a team of DevOps people to maintain it all, full time, the year round. Automating a database backup and testing it works takes half a day unless you’re doing something weird
16 replies →
So investing in a critical part of my business is the bad thing to do?
> The argument for RDS seems to be “we can’t automate backups”. What on earth?
I can automate backups, and I'm extremely happy that, for some extra cost in RDS, I don't have to do that.
Also, at some size automating the database backup becomes non-trivial. I mean, I can manage a replica (which needs to be updated at specific times after the writer), then regularly stop replication for a snapshot, which is then encrypted, shipped to storage, then manage the lifecycle of that storage, then set up monitoring for all of that, then... Or I can set one parameter on the Aurora cluster and have all of that happen automatically.
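As a rough sketch of what "one parameter" looks like in practice, something like the following boto3 call (the cluster identifier and windows are placeholders) turns on retained, automated backups for an Aurora cluster:

```python
import boto3

rds = boto3.client("rds")

# Enable automated backups with a 14-day retention window; snapshot
# scheduling, encryption, and lifecycle are then handled by the service.
rds.modify_db_cluster(
    DBClusterIdentifier="prod-aurora-cluster",  # placeholder
    BackupRetentionPeriod=14,
    PreferredBackupWindow="03:00-04:00",
    ApplyImmediately=True,
)
```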
The argument for RDS (and other services along those lines) is "we can't do it as good, for less".
And, when factoring in all costs and considering all things the service takes care of, it seems like a reasonable assumption that in a free market a team that specializes in optimizing this entire operation will sell you a db service at a better net rate than you would be able to achieve on your own.
Which might still turn out to be false, but I don't think it's obvious why.
I agree but also I'm not entirely sure how much of this is avoidable. Even the most simple web applications are full of what feels like needless complexity, but I think actually a lot of it is surprisingly essential. That said, there is definitely a huge amount of "I'm using this because I'm told that we should" over "I'm using this because we actually need it"
As the famous quote goes, "If I'd had more time, I would've written a shorter letter".
Also does primary / secondary global clusters with automated failover. Saves a ton of time not to manage that manually
Everyone who says they can run a database better than Amazon is probably lying or has a story about how they had to miss a family event because of an outage.
The point isn’t that you can’t do it, the point is that it’s less work for extremely high standards. It is not easy to configure multi region failover without an entire network team and database team unless you don’t give a shit about it actually working. Oh yea, and wait until you see how much SOC2 costs if you roll your own database.
One doesn't necessarily need to run a DB better than Amazon, just well enough for the product/service you're working on. And depending on the specifics it may cost much less (but your mileage may vary).
There are other providers with better value for service within AWS or GCP, like Crunchy.
> EKS
My contrarian view is that EC2 + ASG is so pleasant to use. It’s just conceptually simple: I launch an image into an ASG and configure my autoscaling policies. There are very few things to worry about. On the other hand, using k8s has always been a big deal. We built a whole team to manage k8s. We introduce dozens of k8s concepts or spend person-years on “platform engineering” to hide k8s concepts. We publish guidelines and SDKs and all kinds of validators so people can use k8s “properly”. And we still write tens of thousands of lines of YAML plus tens of thousands of lines of code to implement an operator. Sometimes I wonder if k8s is too intrusive.
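For comparison, a minimal boto3 sketch of that EC2 + ASG workflow really is just two calls; the launch template name, subnet IDs, and target values below are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch an image (via a launch template) into an Auto Scaling group.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholder subnets
)

# Keep average CPU around 50% by scaling out/in automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web",
    PolicyName="target-cpu-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```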
K8S is a disastrous complexity bomb. You need millions upon millions of lines of code just to build a usable platform. Securing Kubernetes is a nightmare. And lock-in never really went away because it's all coupled with cloud specific stuff anyway.
Many of the core concepts of Kubernetes should be taken to build a new alternative without all the footguns. Security should be baked in, not an afterthought when you need ISO/PCI/whatever.
> K8S is a disastrous complexity bomb. You need millions upon millions of lines of code just to build a usable platform.
I don't know what you have been doing with Kubernetes, but I run a few web apps out of my own Kubernetes cluster and the full extent of my lines of code are the two dozen or so LoC kustomize scripts I use to run each app.
9 replies →
This isn't my experience at all. Maybe three or four years ago?
Who exactly needs millions of lines of code?
5 replies →
kubeadm + fabric + helm got me 99% of the way there. My direct report, a junior engineer, wrote the entire helm chart from our docker-compose. It will not entirely replace our remote environment but it is nice to have something in between our SDK and remote deployed infra. Not sure what you meant by security; could you elaborate? I just needed to expose one port to the public internet.
Millions upon millions of lines of code?! What? Can you specify what you were trying to do with it?
2 replies →
kinda like openshift?
To me, it sounds like your company went through a complex re-architecting exercise at the same time you moved to Kubernetes, and your problems have more to do with your (probably flawed) migration strategy than the tool.
Lifting and shifting an "EC2 + ASG" set-up to Kubernetes is a straightforward process unless your app is doing something very non-standard. It maps to a Deployment in most cases.
The fact that you even implemented an operator (a very advanced use-case in Kubernetes) strongly suggests to me that you're doing way more than just lifting and shifting your existing set-up. Is it a surprise then that you're seeing so much more complexity?
Not familiar with the OP but this may have been the pitch for migration: "K8S will allow us better automation".
> My contrarian view is that EC2 + ASG is so pleasant to use.
Sometimes I think that managed kubernetes services like EKS are the epitome of "give the customers what they want", even when it makes absolutely no sense at all.
Kubernetes is about stitching together COTS hardware to turn it into a cluster where you can deploy applications. If you do not need to stitch together COTS hardware, you already have far better tools available to get your app running. You don't need to know or care which node your app is supposed to run on, what your ingress controller is, whether you need to evict nodes, etc. You have container images, you want to run containers out of them, you want them to scale a certain way, etc.
I tend to agree that for most things on AWS, EC2 + ASG is superior. It's very polished. EKS is very bare bones. I would probably go so far as to just run Kubernetes on EC2 if I had to go that route.
But in general k8s provides incredibly solid abstractions for building portable, rigorously available services. Nothing quite compares. It's felt very stable over the past few years.
Sure, EC2 is incredibly stable, but I don't always do business on Amazon.
At first I thought your "in general" statement was contradicting your preference for EC2 + ASG. I guess AWS is such a large part of my world that "in general" includes AWS instead of meaning everything but AWS.
So by and large I agree with the things in this article. It's interesting that the points I disagree with the author on are all SaaS products:
> Moving off JIRA onto linear
I don't get the hype. Linear is fine and all but I constantly find things I either can't or don't know how to do. How do I make different ticket types with different sets of fields? No clue.
> Not using Terraform Cloud No Regrets
I generally recommend Terraform Cloud - it's easy for you to grow your own in house system that works fine for a few years and gradually ends up costing you in the long run if you don't.
> GitHub actions for CI/CD Endorse-ish
Use Gitlab
> Datadog Regret
Strong disagree - it's easily the best monitoring/observability tool on the market by a wide margin.
Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
> Pagerduty Endorse
Pagerduty charges like 10x what Opsgenie does and offers no better functionality.
When I had a contract renewal with Pagerduty I asked the sales rep what features they had that Opsgenie didn't.
He told me they're positioning themselves as the high end brand in the market.
Cool so I'm okay going generic brand for my incident reporting.
Every CFO should use this as a litmus test to understand if their CTO is financially prudent IMO.
> Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
I loved Datadog 10 years ago when I joined a company that already used it where I never once had to think about pricing. It was at the top of my list when evaluating monitoring tools for my company last year, until I got to the costs. The pricing page itself made my head swim. I just couldn’t get behind subscribing to something with pricing that felt designed to be impossible to reason about, even if the software is best in class.
> Datadog makes it far too easy to misconfigure things and blow up costs
I'll give you a fun example. It's fresh in my mind because I just got reamed out about it this week.
In our last contract with DataDog, they convinced us to try out the CloudSIEM product, so we put in a small $600/mo commitment to try it out. Well, we never really set it up and it sat on autopilot for many months. We fell under our contract rate for it for almost a year.
Then last month we had some crazy stuff happen and we were spamming logs into DataDog for a variety of reasons. I knew I didn't want to pay for these billions of logs to be indexed, so I made an exclusion filter to keep them out of our log indexes so we didn't have a crazy bill for log indexing.
So our rep emailed me last week and said "Hey, just a heads up, you have $6,500 in on-demand costs for CloudSIEM, I hope that was expected". No, it was NOT expected. Turns out excluding logs from indexing does not exclude them from CloudSIEM. Fun fact: you will not find any documented way to exclude logs from CloudSIEM ingestion. It is technically possible, but only through their API, and it isn't documented. Anyway, I didn't do or know this, so now I had $6,500 of on-demand costs plus $400-500 of miscellaneous on-demand costs that I had to explain to the CTO.
I should mention my annual review/pay raise is also next week (I report to the CTO), so this will now be fresh in their mind for that experience.
1 reply →
I’m a big fan of Datadog from multiple angles.
Their pricing setup is evil. Breaking out by SKUs and having 10+ SKUs is fine, trialing services with “spot” prices before committing to reserved capacity is also fine.
But (for some SKUs, at least) they make it really difficult to be confident that the reserved capacity you’re purchasing will cover your spot use cases. Then, they make you contact a sales rep to lower your reserved capacity.
It all feels designed to get you to pay the “spot” rate for as long as possible, and it’s not a good look.
I understand the pressures on their billing and sales teams that lead to these patterns, but they don’t align with their customers in the long term. I hope they clean up their act, because I agree they’re losing some set of customers over it.
4 replies →
Linear has a lot going for it. It doesn't support custom fields, so if that's a critical feature for you, I can see it falling short. In my experience though, custom fields just end up being a mess anytime a manager changes and decides to do things differently, things get moved around teams, etc.
- It's fast. It's wild that this is a selling point, but it's actually a huge deal. JIRA and so many other tools like it are as slow as molasses. Speed is honestly the biggest feature.
- It looks pretty. If your team is going to spend time there, this will end up affecting productivity.
- It has a decent degree of customization and an API. We've automated tickets moving across columns whenever something gets started, a PR is up for review, when a change is merged, when it's deployed to beta, and when it's deployed to prod (a sketch of that kind of automation follows this list). We've even built our own CLI tools for being able to action on Linear without leaving your shell.
- It has a lot of keyboard shortcuts for power users.
- It's well featured. You get teams, triaging, sprints (cycles), backlog, project management, custom views that are shareable, roadmaps, etc...
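As a rough idea of what that kind of automation looks like against Linear's GraphQL API, here is a hedged Python sketch; the endpoint is real, but the mutation and field names (issueUpdate, stateId) are from memory and should be verified against the current schema before relying on them:

```python
import requests

LINEAR_API = "https://api.linear.app/graphql"
API_KEY = "lin_api_..."  # placeholder personal API key

# Assumed mutation/field names -- check Linear's GraphQL schema.
MUTATION = """
mutation MoveIssue($id: String!, $stateId: String!) {
  issueUpdate(id: $id, input: { stateId: $stateId }) {
    success
  }
}
"""

def move_issue(issue_id: str, state_id: str) -> bool:
    resp = requests.post(
        LINEAR_API,
        json={"query": MUTATION, "variables": {"id": issue_id, "stateId": state_id}},
        headers={"Authorization": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["issueUpdate"]["success"]

# e.g. call move_issue(linked_issue_id, IN_REVIEW_STATE_ID) from a CI job
# that fires when a pull request is opened.
```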
PagerDuty’s cheapest plan is $21 per user per month.
OpsGenie’s cheapest is $9 per user per month but arbitrarily crippled; the plan anybody would actually want to use is $19 per user per month.
So instead of a factor of ten it’s ten percent cheaper. And I just kind of expect Atlassian to suck.
Datadog is ridiculously expensive and on several occasions I’ve run into problems where an obvious cause for an incident was hidden by bad behavior of datadog.
Heii On-Call is $32 per month total for your team — not per user. https://heiioncall.com/ (Full disclosure: part of the team building it)
7 replies →
Grafana OnCall can be self hosted for free or you can pay $20 a month, and still always have the option to migrate to self hosting if you want to save money
I just started building out on-call rotation scheduling to fit teams that already have an alerting solution and need simple automated scheduling. I’d love to get some feedback: https://majorpager.com
We moved from Trello to Linear and it's been fantastic. I hope to never work at an organisation large enough for JIRA to be a good idea.
To be fair Linear does strike me as everything everyone always hoped Trello would be.
So if that's the upgrade path you're going down I'd expect it to be fantastic.
Newer (aka next gen aka Team-managed) Jira projects are pretty solid.
7 replies →
Datadog is a freaking beast. My wife works at Workday (a huge employee management system) and they have a very large number of tutorials, videos, "working hours" and other tools to ensure their customers are making the best use of it.
Datadog on the other side... their "DD University" is a shame, and we as paying customers are overwhelmed, with no real guidance. DD should assign some time to onboarding for new customers, even if it is proportional to what you pay annually. (I think I pay around 6,000 USD annually.)
In terms of Datadog, the per-host pricing on infrastructure in a k8s/microservices world is perhaps the most egregious of the pricing models across all Datadog services. Triply true if you use spot instances for short-lived workloads.
For folks running k8s at any sort of scale, I generally recommend aggregating metrics BEFORE sending them to Datadog, either at a per-deployment or per-cluster level. Individual host metrics also tend to matter less once you have a large fleet.
You can use opensource tools like veneur (https://github.com/stripe/veneur) to do this. And if you don't want to set this up yourself, third party services like Nimbus (https://nimbus.dev/) can do this for you automatically (note that this is currently a preview feature). Disclaimer also that I'm the founder of Nimbus (we help companies cut datadog costs by over 60%) and have a dog in this fight.
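To make the idea concrete, here is a toy in-process version of that pre-aggregation in Python (veneur and the services mentioned above do this out of process and far more robustly); the metric and tag names are invented, and the only assumption is a local agent listening on the default DogStatsD UDP port:

```python
import socket
from collections import defaultdict

AGENT = ("127.0.0.1", 8125)  # default DogStatsD UDP port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Accumulate counts per deployment instead of emitting per-pod/per-host series.
counts: dict[str, int] = defaultdict(int)

def record(deployment: str, n: int = 1) -> None:
    counts[deployment] += n

def flush() -> None:
    for deployment, n in counts.items():
        # DogStatsD wire format: <name>:<value>|c|#tag:value
        payload = f"app.requests:{n}|c|#deployment:{deployment}"
        sock.sendto(payload.encode(), AGENT)
    counts.clear()
```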
I mostly agreed with OP's article, but you basically nailed all of the points of disagreement I did have.
Jira: It's overhyped and overpriced. Most people HATE Jira. I guess I don't care enough. I've never met a ticket system that I loved. Jira is fine. It's overly complex, sure. But once you set it up, you don't need to change it very often. I don't love it, I don't hate it. No one ever got fired for choosing Jira, so it gets chosen. Welcome to the tech industry.
Terraform Cloud: The gains for Terraform Cloud are minimal. We just use Gitlab for running Terraform pipelines and have a super nice custom solution that we enjoy. It wasn't that hard to do either. We maintain state files remotely in S3 with versioning for the rare cases when we need to restore a foobar'd statefile. Honestly I like having Terraform pipelines in the same place as the code and pipelines for other things.
GitHub Actions: Yeah switch to GitLab. I used to like Github Actions until I moved to a company with Gitlab and it is best in class, full stop. I could rave about Gitlab for hours. I will evangelize for Gitlab anywhere I go that is using anything else.
DataDog: As mentioned, DataDog is the best monitoring and observability solution out there. The only reason NOT to use it is the cost. It is absurdly expensive. Yes, truly expensive. I really hate how expensive it is. But luckily I work somewhere that lets us have it, and it's amazing.
Pagerduty: Agree, switch to OpsGenie. OpsGenie is considerably cheaper and does all the pager stuff of PagerDuty. All the stuff that PagerDuty tries to tack on top to justify its cost is stuff you don't need. OpsGenie does all the stuff you need. It's fine. Similar to Jira, it's not something anyone wants anyway. No one's going to love it; no one loves being on call. So just save money with OpsGenie. If you're going to fight for the "brand name" of something, fight for DataDog instead, not a cooler pager system.
I'm right there with you on Jira. The haters are wrong - it's a decent enough ticket system, no worse than anything else I've used. You can definitely torture Jira into something horrible, but that's not Jira's fault. Bad managers will ruin any ticket system if they have the customization tools to do so.
6 replies →
> I generally recommend Terraform Cloud
I'll be dead in the ground before I use TFC. 10 cents per resource per month my ass. We have around 100k~ resources at an early-stage startup I'm at, our AWS bill is $50~/mo and TFC wants to charge me $10k/mo for that? We can hire a senior dev to maintain an in-house tool full time for that much.
Agreed on PagerDuty. It doesn't really do a lot, administrating it is fairly finicky, and most shops barely use half the functionality it has anyway.
To me, its whole schedule interface is atrocious for its price, given that from an SRE/dev perspective scheduled escalations are literally its purpose.
Why GitLab? GitHub Actions are a mess, but GitLab's online CI/CD is not much better at all, and for self-hosted it opens a whole different can of worms. At least with GitHub Actions you have a plugin ecosystem that makes the super janky underlying platform a bit more bearable.
I've found GitLab CI's "DAG of jobs" model has made maintenance and, crucially for us, optimisation relatively easy. Then I look into GitHub Actions and... where are the abstraction tools? How do I cache just part of my "workflow"? Plugins be damned. GitLab CI is so good that I'm willing to overlook vendor lock-in and YAML, and use it for our GitHub project even without proper integration. (Frankly the rest of GitLab seems to always be a couple features ahead, but no-one's willing to migrate.)
1 reply →
> Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
Datadog's cheapest pricing is $15/host/month. I believe that is based on the largest sustained peak usage you have.
We run spot instances on AWS for machine learning workflows. A lot of them if we're training and none otherwise. Usually we're using zero. Using DataDog at its lowest price would basically double the cost of those instances.
After their ridiculous outage, I wouldn’t touch OpsGenie with a 10ft pole.
This may be a noob question, but why not use GitHub Projects instead of Linear or Jira?
You're staying within an ecosystem you know and it seems to offer almost all of the necessary functionality
That would totally be my preference if business users didn't want access.
Getting them to use Github/Gitlab is an argument I've never won. Typically it goes the other way and I end up needing to maintain a Monday or Airtable instance in addition to my ticketing system.
Interesting. Atlassian also just launched an integration with OpsGenie. I have the same opinion of JIRA. I've tried many competitors (not Linear so far) and regretted it every time.
I'm not sure they just launched anything. OpsGenie has been an Atlassian product for 5 or more years now. I've been using it for 3-4 myself and its been integrated with Jira the whole time.
In fact, OpsGenie has mostly been on Auto-pilot for a few years now.
> Atlassian also just launched an integration with OpsGenie.
Given Atlassian bought OpsGenie in 2018, this is somewhere between quite late and unsurprising.
2 replies →
I’m imagining a developer in the 90s/00s reading this list and being baffled by the complexity/terminology
I agree. I’m afraid I’m one of those 00s developers and can relate. Back then many startups were being launched on super simple stacks.
With all of that complexity/word salad from TFA, where’s the value delivered? Presumably there’s a product somewhere under all that infrastructure, but damn, what’s left to spend on it after all the infrastructure variable costs?
I get it’s a list of preferences, but still once you’ve got your selection that’s still a ton of crap to pay for and deal with.
Do we ever seek simplicity in software engineering products?
I think that far too many companies get sold on the vision of "it just works, you don't need to hire ops people to run the tools you need for your business". And that is true! And while you're starting, it may be that you can't afford to hire an ops guy and can't take the time to do it yourself. But it doesn't take that much scale before you get to the point it would be cheaper to just manage your own tools.
Cloud and SaaS tools are very seductive, but I think they're ultimately a trap. Keep your tools simple and just run them yourselves, it's not that hard.
Look, the thing is - most of infra decisions are made by devops/devs that have a vested interest in this.
Either because they only know how to manage AWS instances (it was the hotness and that's what all the blogs and YT videos were about) and are now terrified of losing their jobs if the companies switch stacks. Or because they needed to put the new thing on their CV so they remain employable. Also maybe because they had to get that promotion and bonus for doing hard things and migrating things. Or because they were pressured into it by bean counters who were pressured by the geniuses of Wall Street to move capex to opex.
In any case, this isn't by necessity these days. This is because, for a massive amount of engineers, that's the only way they know how to do things and after the gold rush of high pay, there's not many engineers around that are in it to learn or do things better. It's for the paycheck.
It is what it is. The actual reality of engineering the products well doesn't come close to the work being done by the people carrying that fancy superstar engineer title.
That's for slower projects.
You know the old adage "fast, cheap, good: pick two"? With startups, you're forced to pick fast. You're still probably not gonna make it, but if you don't build fast, you definitely won't.
1 reply →
For simplicity, software must be well built. Unfortunately, the software development practice is perpetually underskilled so we release buggy crap which we compensate for in infrastructure.
> Do we ever seek simplicity in software engineering products?
Doubtfully. Simplicity of work breakdown structure - maybe. Legibility for management layers, possibly. Structural integrity of your CYA armor? 100%.
The half-life of a software project is what now, a few years at most these days? Months, in webdev? Why build something that is robust, durable, efficient, make all the correct engineering choices, where you can instead race ahead with a series of "nobody ever got fired for using ${current hot cloud thing}" choices, not worrying at all about rapidly expanding pile of tech and organizational debt? If you push the repayment time far back enough, your project will likely be dead by then anyway (win), or acquired by a greater fool (BIG WIN) - either way, you're not cleaning up anything.
Nobody wants to stay attached to a project these days anyway.
/s
Maybe.
1 reply →
I’ve used most of these technologies and the sum value add over a way simpler monolith on a single server setup is negligible. It’s pure insanity
It's a hedge.
There's an easy bent towards designing everything for scale. It's optimistic. It feels good. It's safe, defendable, and sound to argue that this complexity, cost, and deep dependency is warranted when your product is surely on the verge of changing the course of humanity.
The reality is your SaaS platform for ethically sourced, vegan dog food is below inconsequential, and the few users that you do have (and may positively affect) absolutely do not need this tower of abstraction to run.
Yeah, I read the line "My general infrastructure advice is 'less is better'" and was like "when did this list of stuff become the definition of 'less'?"
My reaction exactly. I don't know their footprint but this is a long list of stuff.
I thought the same reading it – is it really this hard to build an app these days?
Things were more far more manual and much less secure, scalable and reliable in the past, but they were also far far simpler.
Agreed. It’s just ridiculous. Some just love to spend money and make things more complex.
We had FB up to 6 figures in servers and a billion MAUs (conservatively) before even tinkering with containers.
The “control plane” was ZooKeeper. Everything had bindings to it, Thrift/Protobuf goes in a znode fine. List of servers for FooService? znode.
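A rough sketch of that service-discovery pattern using the kazoo client (the hosts, paths, and addresses below are placeholders, not anyone's actual setup): each service instance registers an ephemeral znode and clients watch the parent for membership changes.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder ensemble
zk.start()

SERVICE_PATH = "/services/fooservice"
zk.ensure_path(SERVICE_PATH)

# Server side: ephemeral node disappears automatically if the process dies.
zk.create(f"{SERVICE_PATH}/host-", b"10.0.0.12:9090",
          ephemeral=True, sequence=True)

# Client side: keep an up-to-date server list without polling.
@zk.ChildrenWatch(SERVICE_PATH)
def on_members_change(children):
    servers = [zk.get(f"{SERVICE_PATH}/{c}")[0].decode() for c in children]
    print("FooService members:", servers)
```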
The packaging system was a little more complicated than a tarball, but it was spiritually a tarball.
Static link everything. Dependency hell: gone. Docker: redundant.
The deployment pipeline used hypershell to drop the packages and kick the processes over.
There were hundreds of services and dozens of clusters of them, but every single one was a service because it needed a different SKU (read: instance type), or needed to be in Java or C++, or some engineering reason. If it didn’t have a real reason, it goes in the monolith.
This was dramatically less painful than any of the two dozen server type shops I’ve consulted for using kube and shit. It’s not that I can’t use Kubernetes, I know the k9s shortcuts blindfolded. But it’s no fun. And pros built these deployments and did it well, serious Kubernetes people can do everything right and it’s complicated.
After 4 years of hundreds of elite SWEs and PEs (SRE) building a Borg-alike, we’d hit parity with the bash and ZK stuff. And it ultimately got to be a clear win.
But we had an engineering reason to use containers: we were on bare metal, containers can make a lot of sense on bare metal.
In a hyperscaler that has a zillion SKUs on-demand? Kubernetes/Docker/OCI/runc/blah is the friggin Bezos tax. You’re already virtualized!
Some of the new stuff is hot shit, I’m glad I don’t ssh into prod boxes anymore, let alone run a command on 10k at the same time. I’m glad there are good UIs for fleet management in the browser and TUI/CLI, and stuff like TailScale where mortals can do some network stuff without a guaranteed zero day. I’m glad there are layers on top of lock servers for service discovery now. There’s a lot to keep from the last ten years.
But this yo dawg I heard you like virtual containers in your virtual machines so you can virtualize while you virtualize shit is overdue for its CORBA/XML/microservice/many-many-many repos moment.
You want reproducibility. Statically link. Save Docker for a CI/CD SaaS or something.
You want pros handing the datacenter because pets are for petting: pay the EC2 markup.
You can’t take risks with customer data: RDS is a very sane place to splurge.
Half this stuff is awesome, let’s keep it. The other half is job security and AWS profits.
> We had FB up to 6 figures in servers and a billion MAUs (conservatively) before even tinkering with containers.
that would have been around the time when containers entered the public/developer consciousness, no?
The funny thing is a lot of smaller startups are seeing just how absurdly expensive these services are, and are just switching back to basic bare-metal server hosting.
For 99% of businesses it's a wasteful, massive overkill expense. You don't NEED all the shiny tools they offer; they don't add anything to your business but cost. Unless you're a Netflix or an Apple that needs massive global content distribution and processing services, there's a good chance you're throwing money away.
I am a '10s developer/systems engineer and my eyes kept getting wider with each new technology on the list. I don't know if it's overkill or just the state of things right now.
There is no way one person can thoroughly understand so many complex pieces of technology. I have worked for 10 years more or less at this point, and I would only call myself confident in 5 technical products, maybe 10 if I'm being generous to myself.
The more complex you make it the better your job security eh? Maybe they’ll even give you a whole team to look after it all. Absolute madness.
Not really, it's just like counting: awk, grep, sed, uniq, tail, etc. "CloudOS" is in its early days right now. You need to be careful about which tool or library you pick.
Couldn’t agree more. What a huge amount of tech and complexity just to get something off the ground
No, not at all. Maybe baffled by the use of expensive cloud services instead of running on your own bare metal where the cost is in datacenter space and bandwidth. The loss of control coupled with the cost is baffling.
There's _a lot_ in the article that existed in the 00s. Now imagine a programmer from the 70s...
I think engineers in the '20s who were putting out quality Enigmas would be stunned by all the marketing lingo.
My last web development project was in the FTP upload era. Reading this, I'm kinda glad I'm not in web dev.
I am in 2024.
Reading this I couldn’t help but think: yeah all of these points make sense in isolation, but if you look at the big picture, this is an absurd level of complexity.
Why do we need entire teams making 1000s of micro decisions to deploy our app?
I’m hungry for a simpler way, and I doubt I’m alone in this.
You’re not alone. There is a constant undercurrent of pushback against this craziness. You see it all the time here on hacker news and with people I talk to irl.
That does not mean each of these things doesn't solve problems. The issue, as always, is the complexity-utility tradeoff. Some of these things have too much complexity for too little utility. I'm not qualified to judge here, but if the suspects have Turing-complete YAML templates on their hands, it probably ties them to the crime scene.
It smells like ZIRP is not over yet. VCs are still burning money in the AWS fire pit.
ZIRP was never the root problem.
The problem was: too much money, too few consequences for burning it.
The existence of the uber-wealthy means that markets can no longer function efficiently. Every market remains irrational longer than anyone who's not uber-wealthy can remain solvent.
Welcome to the new normal.
1 reply →
I've climbed the mountain of learning the basics of Kubernetes / EKS, and I'm thinking we're going to switch to ECS. Kubernetes is way too complicated for our needs. It wants to be in control and is hard to direct with, e.g., CloudFormation. Load balancers are provisioned from the add-on, making it hard to reference them outside Kubernetes. Logging on EKS Fargate to CloudWatch appears broken, despite following the docs. CPU/memory metrics don't work like they do on EKS EC2; it appears to require ADOT.
I recreated the environment in ECS in 1/10th the time and everything just worked.
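For reference, a stripped-down boto3 sketch of standing up a Fargate service on ECS looks roughly like this; the cluster name, image, and subnets are placeholders, and pieces like the task execution role are omitted for brevity:

```python
import boto3

ecs = boto3.client("ecs")

# Register a task definition, then run it as a service on an existing cluster.
task_def = ecs.register_task_definition(
    family="web",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[{
        "name": "web",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",  # placeholder
        "portMappings": [{"containerPort": 8080}],
        "essential": True,
    }],
)

ecs.create_service(
    cluster="prod",  # placeholder cluster
    serviceName="web",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    desiredCount=2,
    launchType="FARGATE",
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-aaa", "subnet-bbb"],  # placeholder subnets
        "assignPublicIp": "DISABLED",
    }},
)
```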
I've been running ECS for about 5 years now. It has come a long way from a "lightweight" orchestration tool into something that's actually pretty impressive. The recent changes to the GUI are also helpful for people that don't have a ton of experience with orchestration.
We have moved off of it though, you can eventually need more features than it provides. Of course that journey always ends up in Kubernetes land, so you eventually will find your way back there.
Logging to Cloudwatch from kubernetes is good for one thing... audit logs. Cloudwatch in general is a shit product compared to even open source alternatives. For logging you really need to look at Fluentd or Kibana or DataDog or something along those lines. Trying to use Cloudwatch for logs is only going to end in sadness and pain.
GKE is a much better product to me still than EKS but at least in the last two years or so EKS has become a usable product. Back in like 2018 though? Hell no, avoid avoid avoid.
I started with ECS (because I wanted to avoid the complexity of K8s) and regret it. I feel I wasted a lot of time there.
In ECS, service updates would take 15 min or more (vs basically instant in K8s).
ECS has weird limits on how many containers you can run on one instance [0]. And in the network mode where you can run more containers on a host, the DNS is a mess (you need to look up SRV records to find out the port).
Using ECS with CDK/CloudFormation is very painful. They don't support everything (especially regarding Blue/Green deployments), and sometimes they can't apply changes you make to a service. When initially setting up everything, I had to recreate the whole cluster from scratch several times. You can argue that's because I didn't know enough, but if that ever happened to me in prod I'd be screwed.
I haven't used EKS (I switched to Azure), so maybe EKS has their own complex painful points. I'm trying to keep my K8s as vanilla as possible to avoid the cloud lock-in.
[0] https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesgu...
Interesting that you say you worry about re-creating the cluster from scratch because I've experienced exactly the opposite. Our EKS cluster required so many operations outside CloudFormation to configure access control, add-ons, metrics server, ENABLE_PREFIX_DELEGATION, ENABLE_POD_ENI... It would be a huge risk to rebuild the EKS cluster. And applications hosted there are not independent because of these factors. It makes me very anxious working on the EKS cluster. Yes you can pay an extra $70/month to have a dev cluster, but it will never be equal to prod.
On the other hand, I was able to spin up an entire ECS cluster in a few minutes time with no manual operations and entirely within CloudFormation. ECS costs nothing extra, so creating multiple clusters is very reasonable, though separate clusters would impact packing efficiency. The applications can be fully independent.
> ECS has weird limits on how many containers you can run on one instance
Interesting. With ECS it says for c5.large the task limit is 2 without trunking, 10 with.
With EKS
1 reply →
I feel like this is overkill for a startup.
Why not dump your application server and dependencies into a rented data center (or EC2 if you must) and set up a coarse DR? Maybe start with a monolith in PHP or Rails.
None of that word salad sounds like a startup to me, but then again everyone loves to refer to themselves as a startup (must be a recruiting tool?), so perhaps my dude is spot on.
I don't want to be negative, but this post reads like a list of things that I want to avoid in my career. I did a brief stint in cloud stuff at a FAANG and I don't care to go back to it.
Right now I'm engineer No. 1 at a startup, just doing DDD with a Django monolith. I'm still pretty Jr. and I'm wondering if there's a way to scale without needing to get into all of the things the author of this article mentions. Is it possible to get to a $100M valuation without needing all of this extra stuff? I realize it varies from business to business, but if anyone has examples of successes where people just used simple architectures I'd appreciate it.
You can scale to any valuation with any architecture. Whether or not you need sophisticated scaling solutions depends on the characteristics of your product, mostly how pure of a software play it is. Pure software means you will run into scaling challenges quicker, since likely part of your value add is in fact managing the complexity of scaling.
If you are running a marketplace app and collect fees you're going to be able go much further on simpler architectures than if you're trying to generate 10,000 AI images per second.
Don't need any of it. Start simple. Some may be useful though. The list makes good points. Keep it around and if you find yourself suffering from the lack of something, look through the list and see if anything there would be good ROI. But don't adopt something just because this list says you should.
One thing though, I'd start with Go. It's no more complex than Python, more efficient, and most importantly IMO, since it compiles down to a binary it's easier to build, deploy, share, etc. And there's less divergence in the ecosystem; generally there's one simple way to do things like building and packaging. I've not had to deal with versions or tooling or environment issues nearly as much since switching.
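For what it's worth, the "compiles down to a binary" point is mostly about how small the build-and-ship loop gets. A minimal sketch (names and port are arbitrary):

```go
// main.go - the binary this produces is the entire deployable artifact.
package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// Cross-compile a static Linux binary from any workstation and copy it anywhere:
//   CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o app .
```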
You don't need this many tools, especially really early. It also depends on the particulars of your business. E.g. if you are B2B SaaS, then you need a ton of stuff automatically to get SOC2 and generally appease the security requirements of your customers.
That said, anything that's set-and-forget is great to start with. Anything that requires its own care and feeding can wait unless it's really critical. I think we have a project each quarter to optimize our Datadog costs and renegotiate our contract.
Also if you make microservices, you are going to need a ton of tools.
1 reply →
I bet you can get pretty far with just ec2 and autoscaling, or comparable tech in other cloud platforms. With a managed database service.
1 reply →
I'm currently early in my career and "the software guy" in a non-software team and role, but I'm looking to move into a more engineering direction. You've pretty much got my dream next job at the moment — if you don't mind me asking, how did you manage to find your role, especially being "still pretty Jr."?
4 replies →
Currently working at a $100M valuation tech company that fundamentally is built on a Django monolith with some other fluffy stuff lying around it. You can go far with a Django monolith and some load balancing.
I bet Craigslist runs on much simpler infrastructure. Not sure how much they’re worth though
1 reply →
I work at a startup and most of the stuff in the article covers things we use and solve real world problems.
If you're looking for successful businesses, indie hackers like levelsio show you how far you can get with very simple architectures. But that's solo dev work - once you have a team and are dealing with larger-scale data, things like infrastructure as code, orchestration, and observability become important. Kubernetes may or may not be essential depending on what you're building; it seems good for AI companies, though.
2 replies →
I would like to know what you’re being downvoted for. It’s not bad advice, necessarily… this was the way 20 years ago. I mean isn’t hacker news running kind of like this as a monolith on a single server? People might be surprised how far you can get with a simple setup.
Key term here: 'cloud native', which is supposedly the future.
The kitchen-sink database used by everybody is such a common problem, yet it is repeated over and over again. If you grow, it becomes significant tech debt and a performance bottleneck.
Fortunately, with managed DBs like RDS it is really easy to run individual DB clusters per major app.
Management problem masquerading as a tech problem.
Being shared between applications is literally what databases were invented to do. That’s why you learn a special dsl to query and update them instead of just doing it in the same language as your application.
The problem is that data is a shared resource. The database is where multiple groups in an organization come together to get something they all need. So it needs to be managed. It could be a dictator DBA or a set of rules designed in meetings and administered by ops, or whatever.
But imagine it was money. Different divisions produce and consume money just like data. Would anyone imagine suggesting either every team has their own bank account or total unfettered access to the corporate treasury? Of course not. You would make a system. Everyone would at least mildly hate it. That’s how databases should generally be managed once the company is any real size.
Why would you make it a shared resource if you don’t have to?
Decades of experience have shown us the massive costs of doing so - the crippled velocity and soul-crushing agony of DBA change-control teams, the overhead salary of database priests, the arcane performance nightmares, the nuclear blast radius, the fundamental organizational counter-incentives of a shared resource.
Why on earth would we choose to pay those terrible prices in this day and age, when infrastructure is code, managed databases are everywhere and every team can have their own thing. You didn’t have a choice previously, now you do.
6 replies →
...I worked at a large software organization where larger teams had their own bank account, and there was a lot of internal billing, etc, mixed with plenty of funny-money to go along with it. That's not a contradiction, though, it perfectly illustrated your point for me.
The moment you have two databases is the moment you need to deal with data consistency problems.
If you can't do something like determine if you can delete data, as the article mentions, you won't be able to produce an answer to how to deal with those problems.
The downside is then you have many, many DBs to fight with, to monitor, to tune, etc.
This is rarely a problem when things are small, but as they grow, the bad schema decisions made by empowering DBA-less teams to run their own infra become glaringly obvious.
Not a downside to me. Each team maintains their own DB and pays for their own choices.
In the kitchen sink model all teams are tied together for performance and scalability, and some bad apple applications can ruin the party for everyone.
Seen this countless times doing due diligence on startups. The universal kitchen sink DB is almost always one of the major tech debt items.
9 replies →
Bad schema decisions are made regardless of whether you’re one database or 50. At least with many databases the problems are localized.
11 replies →
It's because I hate databases and programming separately. I would rather have slow code than have to dig into some database procedure. It's just another level of separation that's too mentally hard to manage. It's like... my queries go into a VM and now I have to worry about how the VM is performing.
I wish there were (and maybe there is) a programming language with first-class database support. I mean really first class: not just "let me run queries", but embedded into the language in a primal way, so I can handle my database programming fanciness and my general development together.
Sincerely someone who inherited a project from a DBA.
6 replies →
Lots of interesting comments on this one. Anyone have any good resources for learning how not to fuck up schema/db design for those of us who will probably never have a DBA on the team?
Good question. We don't have a DBA either. I've learned SQL as needed and while I'm not terrible, it's still daunting when making the schema for a new module that might require 10-20 tables or more.
One thing that has worked well for us is to always include the top-most parent key in all child tables down the hierarchy (sketched after this comment). This way we can load all the data for, say, an order without joins/exists.
Oh and never use natural keys. Each time I thought finally I had a good use-case, it has bitten me in some way.
Apart from that we just try to think about the required data access and the queries needed. Main thing is that all queries should go against indexes in our case, so we make sure the schema supports that easily. Requires some educated guesses at times but mostly it's predictable IME.
Anyway would love to see a proper resource. We've made some mistakes but I'm sure there's more to learn.
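Not a proper resource, but here is a rough sketch of the parent-key and surrogate-key tips above, assuming Postgres and the lib/pq driver (table and column names are made up):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver, one possible choice
)

// Every child table carries order_id, the top-most parent key, so "all data
// for one order" becomes simple indexed lookups rather than deep joins.
// Keys are surrogate identity columns, not natural keys.
const schema = `
CREATE TABLE orders (
    order_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL
);

CREATE TABLE order_lines (
    order_line_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id      bigint NOT NULL REFERENCES orders (order_id)
);

CREATE TABLE order_line_notes (
    note_id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_line_id bigint NOT NULL REFERENCES order_lines (order_line_id),
    order_id      bigint NOT NULL REFERENCES orders (order_id)
);

CREATE INDEX ON order_lines (order_id);
CREATE INDEX ON order_line_notes (order_id);
`

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/example?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Apply the schema; all per-order queries can then hit the order_id indexes.
	if _, err := db.Exec(schema); err != nil {
		log.Fatal(err)
	}
}
```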
7 replies →
> not to fuck up schema/db design
The neat thing is, you don't. Nobody ever avoids fucking up db design.
The best you can do is decide what is really important to get right, and not fuck that part up.
3 replies →
If you are a startup that can't afford a DBA, then why, why, why are you using Kubernetes?
Because I can go from main.go to a load-balanced, autoscaling app with rolling deploys, segregated environments, logging & monitoring in about 30 minutes, and never need to touch _any_ of that again. Plus, if I leave, the guy who comes after me can look at a helm chart, terraform module + pipeline.yml and figure out how it works. Meanwhile, our janky shell-script-based task scheduler craps out on something new every month. What started as 15 lines of "docker run X, sleep 30, docker kill X" is now a polyglot monster to handle all sorts of edge cases.
I have spent vanishingly close to 0 hours on maintaining our (managed) kubernetes clusters in work over the past 3 years, and if I didn't show up tomorrow my replacement would be fine.
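For the curious, the application side of that story is tiny; what makes rolling deploys hands-off is mostly a health endpoint for the probes plus clean SIGTERM handling so old pods drain before they exit. A hedged sketch, not necessarily the parent's exact setup:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Readiness/liveness endpoint for the Deployment's probes.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	// On SIGTERM (sent during a rolling update), stop accepting new
	// connections and drain in-flight requests before exiting.
	go func() {
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
		<-stop
		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
		defer cancel()
		srv.Shutdown(ctx)
	}()

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```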
If you can do all that in 30 minutes (or even a few hours), I would love to read an article/post about your setup, or any resources you might recommend.
4 replies →
I spent zero hours on a MySQL server on bare hardware for seven years.
Admittedly, I was afraid of ever restarting as I wasn’t sure it would reboot. But still…
3 replies →
You'll need to touch it again. These paid services tend to change all the time.
You also need to pay them which is an event.
Why wouldn't you use Kubernetes? There are basically 3 classes of deployments:
1) We don't have any software, so we don't have a prod environment.
2) We have 1 team that makes 1 thing, so we just launch it out of systemd.
3) We have between 2 and 1000 teams that make things and want to self-manage when stuff gets rolled out.
Kubernetes is case 3. Like it or not, teams that don't coordinate with each other is how startups scale, just like big companies. You will never find a director of engineering that says "nah, let's just have one giant team and one giant codebase".
On AWS, at least, there are alternatives such as ECS and even plain old EC2 auto scaling groups. Teams can have the autonomy to run their infrastructure however they like (subject to whatever corporate policy and compliance regime requirements they might have to adhere to).
Kubernetes is appealing to many, but it is not 100% frictionless. There are upgrades to manage, control plane limits, leaky abstractions, different APIs from your cloud provider, different RBAC, and other things you might prefer to avoid. It's its own little world on top of whatever world you happen to be running your foundational infrastructure on.
Or, as someone has artistically expressed it: https://blog.palark.com/wp-content/uploads/2022/05/kubernete...
4 replies →
One giant codebase is fine. Monorepo is better than lots of scattered repos linked together with git hashes. And it doesn't really get in the way of each team managing when stuff gets rolled out.
3 replies →
Google has one giant codebase. I am pretty sure they aren't the only ones.
This is my case. I'm a one-man show ATM, so no DBA. I'm still using Kubernetes. Many things can be automated as simply as a helm apply. Plus you get the benefit of not having a hot mess of systemd services, ad hoc tools you don't remember how you configured, a plethora of bash scripts for common tasks, and so on.
I see Kubernetes as a one-time (mental and time) investment that buys me somewhat smoother sailing plus some other benefits.
Of course it is not all rainbows and unicorns. Having a single nginx server for a single /static directory would be my dream instead of MinIO and such.
I wouldn't push to implement Kubernetes until I had 100 engineers and a reason to use it.
I think a lot of startups have a set of requirements that is something like:
- I want to spin up multiple redundant instances of some set of services
- I want to load balance over those services
- I want some form of rolling deploy so that I don’t have downtime when I deploy
- I want some form of declarative infrastructure, not click-ops
Given these requirements, I can’t think of an alternative to managed k8s that isn’t more complex.
A startup with no DBA does not need redundant anything. Too small.
3 replies →
AWS Copilot (if you're on AWS). It's a bit like the older Elastic Beanstalk for EC2.
Because it works, the infra folks you hired already know how to use it, the API is slightly less awful than working with AWS directly, and your manifests are kinda sorta portable in case you need to switch hosting providers for some reason.
Helm is the only infrastructure package manager I've ever used where you could reliably get random third party things running without a ton of hassle. It's a huge advantage.
To make up for having a better schema in Terraform than in the database.
Because they are on AWS and can't use Cloud Run.
> Not using Terraform Cloud
We adopted TFC at the start of 2023 and it was problematic right from the start; stability issues, unforeseen limitations, and general jankiness. I have no regrets about moving us away from local execution, but Terraform Cloud was a terrible provider.
When they announced their pricing changes, the bill for our team of 5 engineers would have been roughly 20x, and more than hiring an engineer to literally sit there all day just running it manually. No idea what they’re thinking, apart from hoping their move away from open source would lock people in?
We ended up moving to Scalr, and although it hasn’t been a long time, I can’t speak highly enough of them so far. Support was amazing throughout our evaluation and migration, and where we’ve hit limits or blockers, they’ve worked with us to clear them very quickly.
I would love to see this type of thing from multiple sources. This reflects a lot of my own experience.
I think the format of this is great. I suppose it would take a motivated individual to go around and ask people to essentially fill out a form like this to get that.
I also think it's a great format.
One suggestion if we're gonna standardize around this format: avoid the double negatives. In some cases the author says "avoided XYZ" and then the judgment was "no regrets". Too many layers for me to parse there. Instead, I suggest each section being the product that was used. If you regret that product, the details are where you mention the product you should have used. Or you have another section for product ABC and you provide the context by saying "we adopted ABC after we abandoned XYZ".
I don't recommend trying to categorize into general areas like logging, postmortems, etc. Just do a top-level section for each product.
For people who enjoyed this post but want to see the other side of the spectrum where self hosted is the norm I'll point to the now classic series of posts on how Stack Overflow runs its infra: https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...
If anyone has newer posts like the above, please reply with links as I would love to read them.
https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47... is another good one. There are a few different posts on it scattered around:
https://world.hey.com/dhh/we-stand-to-save-7m-over-five-year...
https://world.hey.com/dhh/our-cloud-exit-has-already-yielded...
Related, looks like X is doing similar: https://twitter.com/XEng/status/1717754398410240018
Disagree on the point and reasoning about the single database.
Sounds like they experienced a badly managed and badly constrained database. The described FKs and relations: that's what key constraints and other guard rails and cascades are for - so that you are able to manage a schema. That's exactly how you do it: add in new tables that reference old data.
I think the regret is actually not managing the database, and not so much about having a single database.
"database is used by everyone, it becomes cared for by no one". How about "database is used by everyone, it becomes cared for by everyone".
Reading further
> Endorse-ish: Schema migration by Diff
Well that explains it... What a terrible approach to migrations for data integrity.
Can you explain? Having a tool to detect changes and create a migration doesn't sound bad? In a nutshell, that's how Django migrations work as well, and they work really well.
Genuinely curious (I don't have much experiences with DBs), how is schema migration done 'properly' these days?
2 replies →
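For contrast with diff-generated migrations: the usual alternative is explicit, versioned up/down migration files that get code-reviewed like anything else and applied in order. A rough sketch with the golang-migrate library (paths and DSN are made up):

```go
package main

import (
	"log"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/database/postgres" // database driver
	_ "github.com/golang-migrate/migrate/v4/source/file"       // file:// source
)

func main() {
	// Files like 0002_add_orders_table.up.sql / .down.sql live in db/migrations
	// and are written (and reviewed) by hand, not generated from a schema diff.
	m, err := migrate.New(
		"file://db/migrations",
		"postgres://localhost:5432/app?sslmode=disable",
	)
	if err != nil {
		log.Fatal(err)
	}
	// Apply all pending migrations in version order.
	if err := m.Up(); err != nil && err != migrate.ErrNoChange {
		log.Fatal(err)
	}
}
```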
> How about "database is used by everyone, it becomes cared for by everyone".
So everyone needs to know every use case of that database? That seems very unlikely if there are multiple teams using the same DB.
FKs? Unique constraints? Not-null columns? If they're not added at the creation of the table, they will never be added - the moment the DB is part of a public API you cannot do a lot of things safely.
The only moment when you want to share a DB is when you really need to squeeze every last bit of performance - and even then, you want one owner and severely limited user accounts (with a whitelist of accessible views and stored procedures).
The database should never ever become part of a public API.
You don’t share a DB for performance reasons (rather the opposite), you do it to ensure data integrity and consistency.
And no, not everyone needs to know every use case. But every team needs to have someone who coordinates any overlapping schema concerns with the other teams. This needs to be managed, but it’s also not rocket science.
1 reply →
This is fabulous. I keep lists like this in my notebook(s). The critical thing here is that you shouldn't dwell on your "wrong" choices, instead document the choice, what you thought you were getting, what you got, and what information would have been helpful to know at the time of decision (or which information you should have given more weight at the time of the decision.) If you do this, you will consistently get better and better.
And by far "automate all the things" is probably my number one suggestion for DevOps folks. Something that saves you 10 minutes a day pays for itself in a month when you have a couple of hours available to diagnose and fix a bug that just showed up. (5 days a week X 4 weeks X 10 minutes = 200 minutes) The exponential effect of not having to do something is much larger than most people internalize (they will say, "This just takes me a couple of minutes to do." when in fact it takes 20 to 30 minutes to do and they have to do it repeatedly.)
As a machine learning platform engineer, these sound like technology choices as opposed to infrastructure decisions. I would love to read this post again, but framed around the infrastructure trade-offs that were made. But thanks for the post.
Side note: there is a small typo repeated twice, "Kuberentes".
Awesome writeup! Just had a couple comments/questions.
> Not adopting an identity platform early on
The reason for not adopting an IDP early is because almost every vendor price gouges for SAML SSO integration. Would you say it's worth the cost even when you're a 3-5 person startup?
> Datadog
What would you recommend as an alternative? Cloudwatch? I love everything about Datadog, except for their pricing....
> Nginx load balancer for EKS ingress
Any reason for doing this instead of an Application Load Balancer? Or even HA Proxy?
For Datadog, unfortunately there's no obvious alternative, despite many companies trying to take market share. This is to say, Datadog has both second-to-none DX and a wide breadth of services.
Grafana Labs comes closest in terms of breadth, but their DX is abysmal (I say this as a heavy Grafana/Prometheus user). Same comment about New Relic, though they have better DX than Grafana. Chronosphere has some nice DX around Prometheus-based metrics but lacks the full product suite. I could go on, but essentially, all vendors lack either breadth, DX, or both.
Almost every time I read someone's insights who works in an environment with IaaS buy-in, my takeaway is the same: oh boy, what an alphabet soup.
The initial promise of "we'll take care of this for you, no in-house knowledge needed" has not materialized. For any non-trivial use case, all you do is replace transferrable, tailored knowledge with vendor-specific voodoo.
People who are serious about selling software-based services should do their own infrastructure.
Even if others disagree with your endorsements or regrets, this record shows you're actually aware of the important decisions you made over the past four years and tracked outcomes. Did you record the decisions when you made them and revisit later?
> Code is of course powerful, but I’ve found the restrictive nature of Terraform’s HCL to be a benefit with reduced complexity.
No way. We used Terraform before and the code just got unreadable. Simple things like looping can get so complex. Abstraction via modules is really tedious and decreases visibility. CDKTF allowed us to reduce complexity drastically while keeping all the abstracted parts really visible. Best choice we ever made!
Sounds like a whole lot of stuff for a startup. Maybe start with a simple stack until there's market fit. Even Amazon didn't start this way.
Great post. I do wonder - what are the simplest K8s alternatives?
Many say in the database world, "use Postgres", or "use sqlite." Similarly there are those databases that are robust that no one has heard of, but are very limited like FoundationDB. Or things that are specialized and generally respected like Clickhouse.
What are the equivalents of above for Kubernetes?
It’s mainly running your own control plane that is complex. Managed k8s (EKS, AKS, GKE) is not difficult at all. Don’t listen to all the haters. It’s the same crowd who think they can replace systemd with self hacked init scripts written in bash, because they don’t trust abstractions and need to see everything the computer does step-by-step.
I also stayed away for a long time due to all the fear spread here, after taking the leap, I’m not looking back.
The lightweight “simpler” alternative is docker-compose. I put simpler in quotes because once you factor in all the auxiliary software needed to operate the compose files in a professional way (IaC, Ansible, monitoring, auth, VM provisioning, ...), you will accumulate the same complexity yourself; the only difference is you are doing it with tools that may be more familiar to what you are used to. Kubernetes gives you a single control plane for all this. Does it come with a learning curve? Yes, but once you get over it there is nothing inherent about it that makes it unnecessarily complex. You don't need the autoscaler, replicasets, and those more advanced features just because you are on k8s.
If you want to go even simpler, the clouds have offerings to just run a container, serverless, no fuss. I have to warn everyone, though, that using ACI on Azure was the biggest mistake of my career. Conceptually it sounds like a good idea, but Azure's execution of it is just a joke: updating a very small container image takes upwards of 20-30 minutes, there are no logs on startup crashes, it randomly stops serving traffic, and it has bad integration with storage.
The simplest k8s alternative (that is an actual alternative) is Nomad.
Kubernetes isn't like that.
It's just that you should start with a handful of backed-up pet servers. Then automate their deployment when you need it. And only then go for a tool that abstracts the automated deployment, when you need it.
But I fear the simplest option on the Kubernetes area is Kubernetes.
I don’t know that this is good advice.
I shunned k8s for a long time because of the complexity, but the managed options are so much easier to use and deploy than pet servers that I can't justify it any more. For anything other than truly trivial cases, IMO kubernetes (or similar, like nomad) is easier than any alternative.
The stack I use is hosted Postgres and VKS from Vultr. It’s been rock solid for me, and the entire infrastructure can be stored in code.
This is good advice, if you haven't experienced the pain of doing it yourself, you won't know what the framework does for you. There are limits to this reasoning of course, we don't reimplement everything on the stack just for the learning experience. But starting with just docker might be a good idea.
You can always use old boring AWS EC2 and such. And sprinkle in some Terraform if you feel fancy. That would be my “use sqlite”
Kubernetes is probably “use postgres”
> Multiple applications sharing a database [regret]
The industry has known this to be a stereotypically bad idea for generations now. It led to things like the enterprise service bus, service-oriented architectures, and finally "micro services". Recently I've seen "micro services" that share the same database, so we've come full circle.
Yet, every place I've worked was either laboring under a project to decouple two or more applications that were conjoined at the DB, or were still at the "this sucks but no one wants to fix it" stage.
How do we keep making this same mistake in industry?
Well, it's a bit unfortunate this post was published on Feb 1st; it got really outdated really fast around the "choose flux for gitops" part.
Context https://www.silverliningsinfo.com/automation/weaveworks-unra...
Mind sharing a bit more of the details?
> engineers at Weaveworks built the first version of Flux
> Weaveworks donated Flux and Flagger to the CNCF
https://fluxcd.io/blog/2022/11/flux-is-a-cncf-graduated-proj...
> Weaveworks will be closing its doors and shutting down commercial operations
- Alexis Richardson, 5 Feb 2024
https://www.linkedin.com/posts/richardsonalexis_hi-everyone-...
If the project has legs, it's now under CNCF.
So far it seems fine, and the maintainers seem to be doing OK too.
Is the project future at risk? https://github.com/fluxcd/flux2/discussions/4544
What's the news there? I was just about to try it out this weekend.
Something I’ve noticed with PaaS services like RDS or Azure SQL is that people arguing against it are assuming that the alternative is “competence”.
Even in a startup, it’s difficult to hire an expert in every platform that can maintain a robust, secure system. It’s possible, but not guaranteed, and may require a high pay to retain the right staff.
Many government agencies on the other hand are legally banned from offering a competitive wage, so they can literally never hire anyone that competent.
This cap on skill level means that if they do need reliable platforms, the only way they can get one is by paying 10x the real market rate for an over-priced cloud service.
These are the “whales” that are keeping the cloud vendors fat and happy.
Props to the author for writing up the results from his exercise. But I think he should have focused on a few controversial ones, and not the rote ones.
Many of the decisions presented are not disagreeable (choosing Slack), and some lack framing that clarifies the associated loss (not adopting an identity platform early on). I think they're all good choices worth mentioning; I would have preferred a deeper look into the few that seemed easy and turned out to be hard, or the ones that were hard and got even harder.
> not the rote ones
It helps to hear the validation, although I think almost every decision has a dissenting voice in the HN comments.
Can any of your engineers run the product locally and iterate fast?
Yeah, typically running a single Go service, or using devspace to combine multiple services from published containers.
Okta... after everything that's happened recently with them?
Yeah... this stood out! Do you have any good alternatives? I wish CloudFlare would do it (IDP).
Without some sort of background on cost or scale it is hard to judge any of these decisions.
The Bazel one made me chuckle - I worked at a company with an scm & build setup clearly inspired by Google’s setup. As a non-ex-Googler, I found it obviously insane, but there was just no way to get traction on that argument. I love that the rest of this list is pretty cut and dry, but Bazel is the one thing that the author can’t bring themself to say “don’t regret” even though they clearly don’t regret not using it.
I've seen Bazel reduce competent engineers to tears. There was a famous blog post a half-decade ago called something like "Bazel is the worst build system, except for all the others" and this still seems to ring true for me today.
There are some teams I work with that we'll never bother to make use Bazel because we know in advance that it would cripple them.
Having led a successful Bazel migration, I'd still recommend many projects to stick to the native or standard supported toolchain until there's a good reason to migrate to a build system (And I don't consider GitHub actions to be a build system).
I’m curious, what do you find insane about Bazel? In my experience it makes plenty of sense. And after using it for some months, I find more insane how build systems like CMake depend on you having some stuff preinstalled in your system and produce a different result depending on which environment they’re run on.
> Discourage private messages and encourage public channels.
I wish my current company did this. It's infuriating. The other day, I asked a question about how to set something up, and a manager linked me to a channel where they'd discussed that very topic - but it was private, and apparently I don't warrant an invite, so instead I have to go bother some other engineers (one of whom is on vacation.)
Private channels should be for sensitive topics (legal, finance, etc) or for "cozy spaces" - a team should have a private channel that feels like their own area, but for things like projects and anything that should be searchable, please keep things public.
I think Kubernetes was a mistake and he should have gone with AWS ECS (using Fargate or backed by autoscaling EC2); with that single change he wouldn't even need to think about a bunch of other topics on his list. Something to think about: AWS Lambda first, then fall back to AWS ECS for everything else that really needs to be on 100% of the time.
I love this write-up and the way it's presented. I disagree with some of the decisions and recommendations, but it's great to read through the reasoning even in those cases.
It'd be amazing if more people published similar articles and there was a way to cross-compare them. At the very least, I'm inspired to write a similar article.
> There are no great FaaS options for running GPU workloads, which is why we could never go fully FaaS.
I keep wondering when this is going to show up. We have a lot of service providers, but even more frameworks, and every vendor seems to have their own bespoke API.
I don’t think anybody should go “fully FaaS”, it’s like saying screwdrivers are useless, all you need is a hammer.
That being said, Cloudflare is on the path to offering a great GPU FaaS system for inference.
I believe it’s still in beta, but it’s the most promising option at the moment.
Right, I still find it faster to manually provision a specific instance type, install PyTorch on it, and deploy a little flask app for an inference server.
Check out beam.cloud. They're impressing me with how they offer GPU runtimes as a FaaS.
I just started playing with modal.com and so far it seems good. I haven't run anything in production yet, so YMMV.
stuff like this makes me want to experiment with going back to just one huge $100k server and running it all on one box in a server rack.
I am doing that. I am part of a research group, and don’t have the $$ or ability to pay so much for all these services.
So we got a $90k server with 184TB of raw storage (SAS SSD), 64 cores, and 1TB of memory. Put it on a 10Gb line at our university and it is rock solid. We probably have less downtime than GitHub, even with reboots every few months.
Have some large (multi-TB) databases on it and web APIs for accessing the data. It would be hugely expensive in the cloud, especially with egress costs.
You have to be comfortable sys-admining though. Fortunately I am.
> Ubuntu for dev servers
I didn't understand this section. Ubuntu servers as dev environment, what do you mean? As in an environment to deploy things onto, or a way for developers to write code like with VSCode Remote?
My take from this was more: being uniform reduces the overhead of maintenance.
Being able to write a bash script that runs on every machine is nice.
seems like the latter given "Originally I tried making the dev servers the same base OS that our Kubernetes nodes ran on, thinking this would make the development environment closer to prod"
But I thought the whole point of the container ecosystem was to abstract away the OS layer. Given that the kernel is backwards compatible to a fault, shouldn't it be enough to have a kernel that is as least as recent as the one on your k8s platform (provided that you're running with the default kernel or something close to it)?
1 reply →
Who's using Pulumi here and how mature is it in comparison to terraform?
I'm using Pulumi in production pretty heavily for a bunch of different app types (ECS, EKS, CloudFront, CloudFlare, Vault, Datadog monitors, Lambdas of all types, EC2s with ASGs, etc.), it's reasonably mature enough.
As mentioned in the other comment, the most commonly used providers for Terraform are "bridged" to Pulumi, so the maturity is nearly identical to Terraform. I don't really use Pulumi's pre-built modules (Crosswalk), but I don't find I've ever missed them.
I really like both Pulumi and Terraform (which I also used in production for hundreds of modules for a few years), which it seems like isn't always a popular opinion on HN, but I have and you absolutely can run either tool in production just fine.
My slight preference is for Pulumi because I get slightly more willing assistance from devs on our team to reach in and change something in infra-land if they need to while working on app code.
We do still use some Pulumi and some Terraform, and they play really nicely together: https://transcend.io/blog/use-terraform-pulumi-together-migr...
IaaC is one of the worst acronyms ever.
Infrastructure should be declared, not coded.
Say what you want. The tool then builds that, or changes whats there to match.
I've tried Pulumi, and understanding the bit that runs before it tries to do stuff versus the bit that runs after it tries to do stuff, and working out where the bugs are, is a PITA (there's a sketch of this after this comment). It lulls you into a false sense of security that you can refer to your own variables in code, but that doesn't carry over to when it is actually running the plan on the cloud service (i.e. actually creating the infrastructure), because you can only refer to the outputs of other infrastructure.
CFN is too far in the other direction, primarily because it's completely invisible and hard to debug.
Terraform has enough programmability (eg for_each, for-expressions etc) that you can write "here is what I want and how the things link together" and terraform will work out how to do it.
The language is... sometimes painful, but it works.
The provider support is unmatched and the modules are of reasonable quality.
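To make the Pulumi two-phase point above concrete: anything that comes back from the cloud is an Output that only resolves while the plan is being applied, so ordinary-looking code can't just read it. A rough Go sketch, assuming the Pulumi AWS provider (resource names are arbitrary):

```go
package main

import (
	"github.com/pulumi/pulumi-aws/sdk/v6/go/aws/s3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		bucket, err := s3.NewBucket(ctx, "logs", nil)
		if err != nil {
			return err
		}

		// This code runs "before it tries to do stuff": bucket.Arn is an
		// Output, not a string, so it can't be used directly. The callback
		// below only runs once the real ARN exists, during the apply.
		policySketch := bucket.Arn.ApplyT(func(arn string) string {
			return `{"Resource": "` + arn + `/*"}`
		}).(pulumi.StringOutput)

		ctx.Export("policySketch", policySketch)
		return nil
	})
}
```

The ApplyT callback is the part that runs "after it tries to do stuff", which is where the confusion about where a bug actually lives tends to come from.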
I think currently under the hood it's actually still terraform. I know they are working on their own native providers.
> Startups don’t have the luxury of a DBA …
I understand, but I think they don’t have the luxury of not having a DBA. Data is important; it’s arguably more important than code. Someone needs to own thinking about data, whether it is stored in a hierarchical, navigation-based database such as a filesystem, a key-value store like S3 (which, sure, can emulate a filesystem), or in a relational database. Or, for that matter, in vendor systems such as Google Workspace email accounts or Office365 OneDrive.
Early on, depending on what you're building, you don't need a full-fledged DBA and can get away with at least one person who knows DB fundamentals.
But if you only want to hire React developers (or swap for the framework of the week) then you'll likely end up with zero understanding of the DB. Down the line you have a mess with inconsistent or corrupted data that'll come back with a vengeance.
It's short-sighted for serious endeavors.
> Ubuntu
We have a dotnet web app deployed on Ubuntu and it leaves a lot to be desired. The .NET 6 package from the default repo didn't recognise other dotnet components installed, and .NET 8 is not even coming to 22.04 - you have to install it from the MS repo. But that is not compatible with the default repo's package for .NET 6, so you have to remove that first and faff around with exact versions to get them installed side by side...
At least I don't have to deal with RHEL. Why is renewing a dev subscription so clunky?!
I don't get why all startups don't just start with a PaaS like Render, Fly.io or Heroku. Why spend time setting up your own infra and potentially have to hire dedicated staff to manage it when you can do away with all that and get on with trying to move your business forward?
If and when you start experiencing scaling problems (great!), that's the time to think about migrating to setting up infra.
Because like every service-oriented offering, each platform differentiates as hard as it can to lock you in to their way of doing things.
Things largely look the same on the surface; this takes the most effect at the implementation-detail level, where adjusting and countercorrecting down the track is fiddly and uses an adrenally-draining level of attention span - right when you're at the point where you're scaling and you no longer have the time to deal with implementation detail level stuff.
You're on <platform> and you're doing things their way and pivoting the architecture will only be prioritised if the alternative would be bankruptcy.
When you're starting out, you just need a server to run your application and a database.
It literally doesn't matter what service you're using at that point.
I don't see how you need to be "doing things their way" when that's all you have.
> Using cert-manager to manage SSL certificates
> Very intuitive to configure and has worked well with no issues. Highly recommend using it to create your Let’s Encrypt certificates for Kubernetes.
> The only downside is we sometimes have ANCIENT (SaaS problems am I right?) tech stack customers that don’t trust Let’s Encrypt, and you need to go get a paid cert for those.
Cert-manager allows you to use any CA you like including paid ones without automation.
I find the amount of services/products used insane. Is this all handled/known by those mythical full-stack-dev-sec-ops developers?
I would have liked some data around why these technologies were chosen and preferably based on loads from customers.
Seems like yagni to me but please prove me wrong
It is a shame karpenter is AWS only. I was thinking about how our k8s autoscaler could be better and landed on the same kind of design as karpenter where you work from unschedulable pods backwards. Right now we have an autoscaler which looks at resource utilization of a node pool but that doesn’t take into account things like topology spread constraints and resource fragmentation.
https://github.com/Azure/karpenter-provider-azure there is this in the works for karpenter on aks
It’s actually released in preview, they called it Node Auto Provisioning. Doesn’t work with Azure Linux unfortunately.
2 replies →
Ironic that the article begins with an image of server chassis with wires running around while the description is entirely about cloud infra.
> My general infrastructure advice is “less is better”.
I found this slightly ironic given there are ~50 headers in the article :)
I liked the format of the writeup
Terraform is great but it's so frustrating sometimes. You just pray that the provider supports the specific configuration of whatever resources you're working with, because otherwise, once those resources are up in multiple environments, you'll have to edit those configs somehow.
I see homebrew in here as a way to distribute <stuff> internally.
We have non-developers (artists, designers) on our team, and asking them to manage homebrew is a non-starter. We're also on windows.
We currently just shove everything (and I mean everything) into Perforce. Are there any better ways of distributing this for a small team?
I've seen a lot of comments about how bad Datadog is because of cost, but surprisingly I haven't seen open-source alternatives like OpenTelemetry/Prometheus/Grafana/Tempo mentioned.
Is it because most people are willing to pay someone else to manage monitoring infrastructure or other reasons?
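For what it's worth, the application side of the self-hosted path is very little code; the real cost is operating Prometheus/Grafana themselves. A minimal sketch with the official Go client (the metric name is made up):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "app_http_requests_total",
		Help: "Requests served, by path.",
	},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(requests)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requests.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("ok"))
	})

	// Prometheus scrapes this endpoint; Grafana then queries Prometheus.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```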
The way I think of Datadog is that it provides second-to-none DX combined with a wide suite of product offerings that is good enough for most companies most of the time. Does it have opaque pricing that can be 100x more expensive than alternatives? Absolutely! Will people continue to use it? Yes!
Something to keep in mind is that most companies are not like the folks in this thread. They might not have the expertise, time, or bandwidth to invest in building observability.
The vast majority of companies just want something that basically works and doesn't take a lot of training to use. I think of Datadog as the Apple of observability vendors - it doesn't offer everything, and there are real limitations (and price tags) for more precise use cases, but in the general case it just works (especially if you stay within its ecosystem).
> There are no great FaaS options for running GPU workloads
This hits hard. Someone please take my (client's) money and provide sane GPU FaaS. Banana.dev is cool but not really enterprise ready. I wish there was a AWS/GCP/Azure analogue that the penny pinchers and MBAs in charge of procurement can get behind.
I am confused. Doesn't Modal Labs solve this?
Definitely. But the sad reality is that in some corporate environments (incumbent finance, government) if it's not a button click in portal.azure.com away, you can spend 6-12 months in meetings with low energy gloomboys to get your access approved.
1 reply →
This guy gets it, I agree with it all. The exception being, use Fargate without K8s and lean on Terraform and AWS services rather than the K8s alternatives. When you have no choice left and you have to use K8s, then I would pick it up. No sense going down into the mines if you don't have to.
As someone who isn't a developer, reading this was eye-opening. It's interesting just how unbundled the state of running a software company is. And this is only your selection of the tools and options, never mind the entire landscape.
Interesting read, I agree with adopting an identity platform but this can definitely be contentious if you want to own your data.
I wonder how much one should pay attention to future problems at the start of a startup versus "move fast and break things." Some of this stuff might just put you off finishing.
Interesting enough read. But I’m not sure he’s a regretful enough boy to write a blog to merit the title.
I was hoping there would be a section for Search Engines. It's one of those things you tend to get locked in to, and it's hard to clearly know your requirements well enough early on.
Any references to something like this with a Search slant would be greatly appreciated.
Curious about the mention of buying IPs. Anyone else can share feedback/thoughts on this?
This was done for multiple reasons but mainly security and to allow customers to whitelist a certain ip range.
After reading through this entire post, I'm pleasantly surprised that there isn't one item where I don't mirror the same endorse/regret as the author. I'm not sure if this is coincidence or popular opinion.
What’s the right way to manage npm installs and deploy it to an AWS ec2 instance from github? Kubernetes? GitOps? EKS? I roll my own solution now with cron and bash because everything seems so bloated.
> We use Okta to manage our VPN access and it’s been a great experience.
I have no first-hand experience with Okta, but everything I read about it makes me scared to use it, i.e. stability and security.
What are startups using for a logging tool that isn’t datadog?
https://highlight.io
Loki
https://axiom.co/
Using k8s over ECS and raw-dogging Terraform instead of using the CDK? It's no wonder you end up needing to hire entire teams of people just to manage infra
AWS ECS is better than Kubernetes, and Cloudformation is better than Terraform. Just my IMHO.
Both are simpler and do the same thing.
Not sure about the fascination with Go - one can write a fully scalable, functional, readable, maintainable, upgradable REST API service with Java 17 and above.
I struggle with the type system in both, but today I was going through obscure Go code and wishing interfaces were explicitly implemented. The lack of sum types is making me sad.
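The usual workaround for the missing sum types is a "sealed" interface with an unexported method, so only types in the same package can implement it; the compiler still won't check a switch for exhaustiveness, though. A sketch:

```go
package payment

import "fmt"

// Result is a poor man's sum type: the unexported method "seals" it,
// so only the variants defined below can satisfy the interface.
type Result interface{ isResult() }

type Approved struct{ AuthCode string }
type Declined struct{ Reason string }

func (Approved) isResult() {}
func (Declined) isResult() {}

func Describe(r Result) string {
	// Unlike a real sum type, a missing case here is only caught by the
	// default branch at runtime (or by a linter), not by the compiler.
	switch v := r.(type) {
	case Approved:
		return "approved: " + v.AuthCode
	case Declined:
		return "declined: " + v.Reason
	default:
		return fmt.Sprintf("unknown result %T", v)
	}
}
```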
Noob here - all these are great... but why can't I just use Heroku to radically not have to deal with a large part of these things?
"Since the database is used by everyone, it becomes cared for by no one. Startups don’t have the luxury of a DBA, and everything owned by no one is owned by infrastructure eventually"
I think adding a DBA or hiring one to help you layout your database should not be considered a 'luxury'...
Yeah I mean, hiring one person to own that for 5-10 teams is pretty cheap... Cheaper than each team constantly solving the same problems and relearning the same gotchas/operational stuff that doesn't add much value when writing your application code.
There's even consultants you can hire out by the day instead of a full-time DBA.
Maybe you need help with setup for a few weeks/months, and then some routine billable hours per month for maintenance / change advice.
I see more 'Endorse' items than 'Regret' items.
Anyway, amazing write up.
Learning about alternatives to Jira is always good.
Nice, I run Kamal on Hetzner with Cloudflare.
> Zero Trust VPN
VPNs can be wonderful, and you can use Tailscale or AWS VPN or OpenVPN or IPsec, and you can authenticate using Okta or GSuite or Auth0 or Keycloak or Authelia.
But since when is this Zero Trust? It takes a somewhat unusual firewall scheme to make a VPN do anything that I would seriously construe as Zero Trust, and getting authz on top of that is a real PITA.
"Multiple applications sharing a database" and Kubernetes sound really funny together:)
The fallacy of a "choice" between GCP and AWS never ceases to entertain me.
> Go is for services that are non-GPU bound.
What are they using for GPU-bound services? Python?
Python indeed
After working with infrastructure for 20 years, I fully endorse this post.
Half the stuff is K8s related... Makes me very happy to use Cloud Run.
What is the cost? With 1/10th of the sum, one capable engineer can set up way better infra on premises. The days of free money are over, guys. Wake up!
I should really learn AWS huh
> homebrew for Linux
No, just no. I see this cropping up now and then. Homebrew is unsafe for Linux, and is only recommended by Mac users that don't want to bother to learn about existing package management.
good