Comment by tgrowazay

8 hours ago

Self-hosted 8xH100 is ~$250k, depreciated across three years => $80k/year, with power and cooling => $90k/year (~$10/hour total).

AWS charges $55/hour for EC2 p5.48xlarge instance, which goes down with 1 or 3 year commitments.

With 1 year commitment, it costs ~$30/hour => $262k per year.

3-year commitment brings price down to $24/hour => $210k per year.

This price does NOT include egress, and other fees.

So, yeah, there is a $120k-$175k difference that can pay for a full-time on-site SRE, even if you only need one 8xH100 server.

Numbers get better if you need more than one server like that.

12 comments

tgrowazay

Aurornis 8 hours ago

$120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.

Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.

If there's an issue with the server while they're sick or on vacation, you just stop and wait.

If they take a new job, you need to find someone to take over or very quickly hire a replacement.

There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.

Going on-prem like this is highly risky. It works well until the hardware starts developing problems or the person in charge gets a new job. The weeks and months lost to dealing with the server start to become a problem. The SRE team starts to get tired of having to do all of their work on weekends because they can't block active use during the week. Teams start complaining that they need to use cloud to keep their project moving forward.

Figs 8 hours ago

> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
> If there's an issue with the server while they're sick or on vacation, you just stop and wait.
Very much depends on what you're doing, of course, but "you just stop and wait" for sickness/vacation sometimes is actually good enough uptime -- especially if it keeps costs down. I've had that role before... That said, it's usually better to have two or three people who know the systems though (even if they're not full time dedicated to them) to reduce the bus factor.
Manuel_D 5 hours ago

> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one.
You can still use cloud for excess capacity when needed. E.g. use on-prem for base load, and spin up cloud instances for peaks in load.
PunchyHamster 6 hours ago

> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
they come with warranty, often with technican guaranteed to arrive within few hours or at most a day. Also if SHTF just getting cloud to augument current lackings isn't hard
justsomehnguy 7 hours ago

If a business which require at least a quarter million bucks worth of hardware for the basic operation yet it can't pay the market rate for someonr who would operate it - maybe the basics of that business is not okay?
formerly_proven 8 hours ago
> There's a second bus factor: What happens when that 8xH100 starts to get flakey?
These come in a non-flakey variant?
- spwa4 7 hours ago
  
  It's called a warranty.
  And the other argument: every company I've ever know to do AWS has an AWS sysadmin (sorry "devops"), same for Azure. Even for small deployments. And departments want their own person/team.
stego-tech 5 hours ago
Out of all the comments on numbers, SREs, and scaling, you get the response for meeting numbers with numbers!
> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.
Know your market, and pay accordingly. You cannot fuck around with SREs.
> Hiring 1 person to run the infrastructure means that 1 person is on-call 24/7 forever.
This is less of an issue than you might think, but strongly dependent upon the quality of talent you’ve retained and the budget you’ve given them. Shitbox hardware or cheap-ass talent means you’ll need to double or triple up locally, but a quality candidate with discretion can easily be supported by a counterpart at another office or site, at least short-term. Ideally though, yeah, you’ll need two engineers to manage this stack, but AWS savings on even a modest (~700 VMs) estate will cover their TC inside of six months, generally.
> There's a second bus factor: What happens when that 8xH100 starts to get flakey? You can't move the jobs to another server because you only have one. You can start diagnosing things and replacing parts and hope it gets to the root issue, but that's more downtime.
This strikes at another workload I neglected to mention, and one I highly recommend keeping in the public cloud: GPUs.
GPUs on-prem suck. Drivers are finnicky, firmware is flakey, vendor support inconsistent, and SR-IOV is a pain in the ass to manage at scale. They suck harder than HBAs, which I didn’t think was possible.
If you’re consuming GPUs 24x7 and can afford to support them on-prem, you’re definitely not here on HN killing time. For everyone else, tune your scaling controls on your cloud provider of choice to use what you need, when you need it, and accept the reality that hyperscalers are better suited for GPU workloads - for now.
> Going on-prem like this is highly risky.
Every transaction is risky, but the risk calculus for “static” (ADDS) or “stable” (ERP, HRIS, dev/test) work makes on-prem uniquely appealing when done right. Segment out your resources (resist the urge for HPC or HCI), build sensible redundancies (on-prem or in the cloud), and lean on workhorse products over newer, fancier platforms (bulletproof hypervisors instead of fragile K8s clusters), and you can make the move successful and sensible. The more cowboy you go with GPUs, K8s, or local Terraform, the more delicate your infra becomes on-prem - and thus the riskier it is to keep there.
Keep it simple, silly.
- throwup238 2 hours ago
  
  > Out of all the comments on numbers, SREs, and scaling, you get the response for meeting numbers with numbers!
  >> $120K isn't going to cover the fully loaded costs of an SRE who can set up and run that.
  > Literally this. I can do SRE on-prem and cloud, and my 50/30/20 budget break-even point (as in, needs and savings but no wants - so 70%) is $170k before taxes. Rent is astonishingly high right now, and the sort of mid-career professional you want to handle SRE for your single DC is going to take $150k in this market before fucking off to the first $200k job they get.
  That's $120k per pod. Four pods per rack at 50kW.
  What universe are we living in that a single SRE can't manage even a single rack for less than half a million in total comp?
- lazylizard 2 hours ago
  
  i am not sre, merely sysadmin.
  and somehow i have this impression that gpus on slurm/pbs could not be simpler.
  u can use a vm for the head node, dont even need the clustering really..if u can accept taking 20min to restore a vm.. and the rest of the hardware are homogeneous - you setup 1 right and the rest are identical.
  and its a cluster with a job queue.. 1 node going down is not the end of the world..
  ok if u have pcie GPUs sometimes u have to re-seat them and its a pain. otherwise if ur h200 or disks fail u just replace them, under warranty or not...
charcircuit 7 hours ago

>If there's an issue with the server while they're sick or on vacation, you just stop and wait.
You can ask AI to troubleshoot and fix the issue.