Comment by stingraycharles

1 day ago

Potentially useful context: OP is one of the cofounders of Tailscale.

> Traditional Cloud 1.0 companies sell you a VM with a default of 3000 IOPS, while your laptop has 500k. Getting the defaults right (and the cost of those defaults right) requires careful thinking through the stack.

I wish them a lot of luck! I admire the vision and am definitely a target customer, I'm just afraid this goes the way things always go: start with great ideals, but as success grows, so must profit.

Cloud vendor pricing often isn't based on cost. Some services they lose money on, others they profit heavily from. These things are often carefully chosen: the type of costs that only go up when customers are heavily committed—bandwidth, NAT gateway, etc.

But I'm fairly certain OP knows this.

I've run an OpenStack cloud. Local NVMe, directly attached to VMs, is unbeatable. All clouds offer this, but that storage is ephemeral, and it was when I implemented it in OpenStack too.

There's not enough redundancy. You could RAID1 those NVMes before they get attached to a VM, which helps with hardware failures, but then you have fewer drives to attach. And even if you RAID them, there's no good way to move that VM to another host when the host develops a RAM, CPU, or other hardware issue.
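The RAID1 setup described here is typically a one-liner with mdadm on the hypervisor, before the array is handed to the VM. A minimal sketch; the device names are assumptions, not from the original setup:

```shell
# Mirror two local NVMe drives; /dev/nvme0n1 and /dev/nvme1n1 are
# illustrative names. Run as root on the host, before attaching
# the result to a VM.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/nvme0n1 /dev/nvme1n1
# The VM then gets /dev/md0 instead of a raw drive: it survives a
# single drive failure, at the cost of half the attachable capacity.
```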

These VMs with directly attached NVMes basically have to be treated as bare metal servers, and you have to do redundancy at the application layer (like database replication).
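As a concrete example of that app-layer redundancy (assuming PostgreSQL here; the comment doesn't name a database), a streaming-replication standby on a second machine needs little more than:

```shell
# On the standby host: clone the primary and configure streaming
# replication in one step. Hostname and user are placeholders.
pg_basebackup -h primary.internal -U replicator \
    -D /var/lib/postgresql/data -R
# -R writes standby.signal and the primary_conninfo setting, so the
# standby keeps following the primary. If the primary's local NVMe
# (or the whole host) dies, the standby already has the data.
```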

But again, all of the major cloud services offer these types of machines if you NEED NVMe IO speed. There are quirks, though. In Azure, for example, you have to expect the VM to be moved whenever Azure feels like it, and expect that ephemeral data to be wiped.

In OpenStack, by contrast, we would do local block-level migrations if we HAD to move the VM to another host. That migration required the VM to be turned off, but it did copy the local NVMe data to the other host. When this happened it was all planned, and the particular application had app-level redundancy built in, so it was not a problem. If the host crashed, that VM would just be down until the host was fixed and came back online.

  • > Even if you RAID them, there's not a good way to move that VM to another host if there's a RAM or CPU or other hardware issue on that host.

    This is the critical point. All hardware fails eventually. The CPU and RAM are, in a real sense, also ephemeral. The only relevant question is what the risk tolerance of the use-case is. If restoring from async backup is sufficient, then embrace ephemerality and keep backups. If you need round-the-clock availability, pick an architecture that lets you fall over gracefully to another machine, and embrace the ephemerality when you inevitably need to do so.

I was just curious, so I actually tested this.

Using fio

Hetzner (CX23, 2 vCPU, 4 GB): ~3,900 IOPS (read/write), ~15.3 MB/s, avg latency ~2.1 ms, 99.9th percentile ≈ 5 ms, max ≈ 7 ms

DigitalOcean (SFO1, 2 GB RAM, 30 GB disk): ~3,900 IOPS (same!), ~15.7 MB/s (same!), avg latency ~2.1 ms (same!), 99.9th percentile ≈ 18 ms, max ≈ 85 ms (!!)

Using sequential dd:

Hetzner: 1.9 GB/s, DO: 850 MB/s

That's the low-end plan on both, but the Hetzner instance is €4 and the DO instance is $18.
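For anyone reproducing the sequential test, something like the following is the usual shape (a sketch; the exact flags used above aren't shown, and `conv=fsync` matters because without a flush the page cache inflates the number):

```shell
# Sequential write throughput, roughly the "sequential dd" above.
# conv=fsync makes dd flush to disk before reporting, so the page
# cache can't inflate the result.
dd if=/dev/zero of=testfile bs=1M count=1024 conv=fsync
rm -f testfile  # remove the scratch file afterwards
```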

  • I love Hetzner so much. I'm not affiliated, just a really happy customer; these guys do everything right.

    • As long as you never have to interact with them. If you run into issues they have caused themselves, you'll find yourself dealing with a unique mix of arrogance and incompetence.

      3 replies →

  • Just for comparison I use the cheapest netcup root server:

    RS 1000 G12: AMD EPYC™ 9645, 8 GB DDR5 RAM (ECC), 4 dedicated cores, 256 GB NVMe

    Costs €12.79

    Results with the following command:

    fio --name=randreadwrite \
        --filename=testfile \
        --size=5G \
        --bs=4k \
        --rw=randrw \
        --rwmixread=70 \
        --iodepth=32 \
        --ioengine=libaio \
        --direct=1 \
        --numjobs=4 \
        --runtime=60 \
        --time_based

    IOPS: read 70.1k, write 30.1k (~100k total)

    Throughput: read 274 MiB/s, write 117 MiB/s

    Latency: read avg 1.66 ms, P99.9 2.61 ms, max 5.64 ms; write avg 0.39 ms, P99.9 2.97 ms, max 15.31 ms

    • Nice. On a Hetzner AX41-NVMe (~€50, from 2020), non-RAID, I get:

      IOPS: read 325k, write 139k

      Throughput: read 1271MB/s, write 545MB/s

      Latency: read avg 0.3 ms, P99.9 2.7 ms, max 20 ms; write avg 0.14 ms, P99.9 0.35 ms, max 3.3 ms

      so roughly 100 times the IOPS and throughput of the cloud VMs

    • That is a bit of an unfair comparison. The Hetzner and DO instances are shared hosting; you are using dedicated resources.

      Using a Netcup VPS 1000 G12 is more comparable.

      read: IOPS=18.7k, BW=73.1MiB/s

      write: IOPS=8053, BW=31.5MiB/s

      Latency Read avg: 5.39 ms, P99.9: 85.4 ms, max 482.6 ms

      Write avg: 3.36 ms, P99.9: 86.5 ms, max 488.7 ms

      2 replies →

Many cloud vendors have you pay through the nose for IOPS and bandwidth.

Edit: I posted this before reading; these two are the same ones he points out.

  • Yes, but you can’t directly compare SAN-style storage with a local NVMe. I agree that it’s too expensive, though not nearly as insane as the bandwidth pricing. If you go to a vendor and ask for a petabyte of storage that needs to be fully redundant, with the ability to take PIT-consistent multi-volume snapshots, be ready to pay up. And that is what’s being offered here.

    And yes, IO typically happens in 4 KiB blocks, so you need a decent number of IOPS to get the full bandwidth.
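    That relationship is easy to sanity-check against the numbers upthread: at 4 KiB per operation, the ~3,900 IOPS both vendors delivered works out to almost exactly the ~15 MB/s both reported.

    ```shell
    # IOPS x block size = throughput. Figures from the benchmark upthread.
    awk 'BEGIN { printf "%.1f MiB/s\n", 3900 * 4096 / 1048576 }'
    # prints: 15.2 MiB/s
    ```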

    • Sure, but a petabyte of block storage with redundancy and PIT backups is a poor abstraction to build on, in large part because it’s not a thing that can be built without paying a wild amount of money, taking a huge performance hit, or both. If you do your PIT recovery at a higher layer, you have to work a bit harder, but you get far better cost, performance, and recovery.

      That latter part is a big deal, too. If I buy 1PB of block storage, I’m decently likely to be running a fancy journaled or WAL-ed or rollback-logged thing on top, and that thing might be completely unable to read from a read only snapshot. So actually reading from a PIT snapshot is a pain regardless of what I paid for it. Even using EBS or similar snapshots is far from being an amazing experience.

>3000 IOPS

If that's true, I wonder if this is a deliberate decision by cloud providers to push users towards microservice architectures with proprietary cloud storage like S3, so you can't do on-machine dbs even for simple servers.

  • It's probably a combination of high-density storage nodes getting I/O bound and SSDs having finite write endurance. Anything that improves the first problem costs them money and makes the second problem worse, and the second one costs them money again. So why would they make the default something that costs them twice over, when most people don't need it?

    Instead they make the default "meager IOPS" and then charge more to the people who need more.
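    To put rough numbers on the endurance side (both figures here are illustrative assumptions, not from the comment: a drive rated around 600 TBW, driven at the 500k 4 KiB IOPS the article cites for a laptop):

    ```shell
    # Time to exhaust 600 TB of write endurance at a sustained
    # 500k IOPS of 4 KiB writes. Both figures are illustrative.
    awk 'BEGIN {
        bw = 500000 * 4096             # bytes/s, about 2.0 GB/s
        printf "%.1f days\n", 600e12 / bw / 86400
    }'
    # prints: 3.4 days
    ```

    So an uncapped default really could let a single tenant wear out a drive in days, which makes the "meager default, pay for more" pricing less mysterious.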

    • I'm not sure about this, but I remember that a lot of servers at my old company stuck with hard disks as late as 2018, for exactly this reason: HDDs, for all their faults, don't have write-endurance issues. That was quite surprising to me back then.

    • How often is the storage in cloud providers even local? And how often is a laptop doing anything other than raw access to a single local disk with a basic FS?

      I remember my work laptop's IOPS beating a single VM on the first SSD-based SAN I deployed as well. Of course, the SAN scaled well beyond it with 1,000 VMs.

> Cloud vendor pricing often isn't based on cost.

Business 101 teaches us that pricing isn't based on cost. Call it top-down vs. bottom-up pricing, but the first-principles formula of "it costs me $X to make a widget, so sell it for $Y = 1.y * $X" is not how pricing works in practice.

  • Just to spell this out more clearly for the back row of the classroom:

    The price is what the customer will pay, regardless of your costs.

  • That's not Business 101.

    • > That's not Business 101.

      It kinda is, but obscured by GP's formula.

      More simply: if it costs you $X to produce a product and the market is willing to pay $Y (which has no relation to $X), why would you price it as a function of $X?

      If it costs me $10 to make a widget and the market is happy to pay $100, why would I base my pricing on $10 * 1.$MARGIN?

      1 reply →