Comment by ggregoire

13 hours ago

> scaled up by increasing the instance size

I always wondered what kind of instances companies at that level of scale are using. Anyone here have some ideas? How much CPU/RAM? Do they use the same instance types available to everyone, or do AWS and co. offer custom hardware for these big customers?

The major hyperscalers all offer a plethora of virtual machine SKUs that are essentially an entire two-socket box with many-core CPUs.

For example, Azure's Standard_E192ibds_v6 has 96 cores, 1.8 TB of memory, and 10 TB of local SSD storage capable of 3 million IOPS.

Beyond those "general purpose" VMs you get the enormous machines with 8, 16, or even 32 sockets.[1] These are almost exclusively used for SAP HANA in-memory databases or similar ERP workloads.

Azure's Standard_M896ixds_24_v3 provides 896 cores, 32 TB of memory, and 185 Gbps Ethernet networking. It is generally available, but you have to request quota through a support ticket, and you may have to wait and/or get your finances "approved" by Microsoft. Something like this will set you back [edited] $175K per month[/edited]. (I suspect OpenAI is getting a huge effective discount.)

Personally, I'm a fan of "off label" use of the High Performance Compute (HPC) sizes[2] for database servers.

The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory. That's similar to the E-series VM above, but with a higher compute-to-memory ratio. The effective memory throughput is also way better because the CPUs carry a large slab of extra stacked L3 cache (AMD's 3D V-Cache). In my benchmarks it absolutely smoked the general-purpose VMs at a similar price point.
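
For the curious, the comparison I mean is a STREAM-style bandwidth test. Here's a minimal Go sketch of the triad kernel (the slice size and the GB/s accounting are illustrative choices, not the exact benchmark I ran):

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    // ~1 GiB per slice so the working set dwarfs even the couple of
    // gigabytes of stacked L3 on the HX parts (size is my assumption).
    const n = 1 << 27
    a := make([]float64, n)
    b := make([]float64, n)
    c := make([]float64, n)
    for i := range b {
        b[i], c[i] = 1.0, 2.0
    }

    start := time.Now()
    for i := 0; i < n; i++ {
        a[i] = b[i] + 3.0*c[i] // STREAM "triad": two reads, one write per element
    }
    elapsed := time.Since(start).Seconds()

    // Three slices of 8-byte elements stream through memory once each.
    gb := float64(3*n*8) / 1e9
    fmt.Printf("triad bandwidth: %.1f GB/s\n", gb/elapsed)
}
```

It's single-threaded, so it understates what the machine can do in absolute terms, but the relative gap between the cache-heavy and standard SKUs shows up clearly.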

[1] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

[2] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

  • > Something like this will set you back $30K-$60K per year

    lol, no, cloud is nowhere near that good value. It’s $3.5M annually.

    > The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory

    This one is $124k per year.

    • Thanks for the correction, fixed.

      I noticed that the M896i is so obscure and rarely used that there are typos associated with it everywhere, including the official docs! In one place it says it has 23 TB of memory when it actually has 32 TB.

  • On the AWS side there are "HANA certified" instances that max out at 1920 cores and 32 TB of RAM: the u7inh-32tb.480xlarge.

    https://docs.aws.amazon.com/sap/latest/general/sap-hana-aws-...

    • I'm pretty sure both Azure and AWS are merely reselling the same HPE Compute Scale-up Server 3200 chassis with some variations. Azure seems to have only the 16-socket model, but AWS has the 32-socket model.

      That AWS instance uses these 60-core processors: https://www.intel.com/content/www/us/en/products/sku/231747/...

      To anyone wondering about these huge-memory systems: avoid them if at all possible! Use them only if you absolutely must.

      For one, these systems need specialised parts that are more expensive per unit of compute: roughly $283 per CPU core versus something like $85 for a current-gen AMD EPYC core, and those EPYCs are also about 2x as fast per core as the older Intel Xeon Scalable parts this chassis requires! So the cost-efficiency ratio works out to something like 6:1 in favour of AMD processors. (Comparing the cost of one large host against multiple smaller ones gets complicated beyond the CPUs, though.)
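
      The arithmetic behind that ratio, as a quick sketch (the per-core prices and the 2x speed factor are the rough figures above, not careful benchmarks):

      ```go
      package main

      import "fmt"

      func main() {
          const (
              xeonPerCore = 283.0 // $/core for the scale-up Xeon parts (figure from above)
              epycPerCore = 85.0  // $/core for a current-gen AMD EPYC (figure from above)
              epycSpeedup = 2.0   // assumed ~2x per-core speed advantage for the EPYC
          )
          priceRatio := xeonPerCore / epycPerCore // ~3.3:1 on price alone
          costEff := priceRatio * epycSpeedup     // ~6.7:1 with the speed gap folded in
          fmt.Printf("price %.1f:1, cost efficiency ~%.1f:1\n", priceRatio, costEff)
      }
      ```

      (I rounded that to 6:1 above.)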

      The second effect is that 32-way systems have huge inter-processor cache synchronisation overheads. Only very carefully coded software can scale to use thousands of cores without absolutely drowning in cache line invalidations.
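
      A small Go sketch of the effect, assuming 64-byte cache lines (the exact slowdown varies a lot by CPU): two goroutines bump counters that share a line, then counters padded onto separate lines. There's no data race here, since each goroutine owns its own field; the cost is pure cache-line ping-pong.

      ```go
      package main

      import (
          "fmt"
          "sync"
          "time"
      )

      const iters = 100_000_000

      // Both counters land on the same 64-byte cache line, so each
      // core's write invalidates the other core's cached copy.
      type sharedLine struct {
          a, b int64
      }

      // Padding pushes b onto its own cache line, so the two
      // goroutines stop fighting over the same line.
      type paddedLines struct {
          a int64
          _ [56]byte
          b int64
      }

      func run(incA, incB func()) time.Duration {
          start := time.Now()
          var wg sync.WaitGroup
          wg.Add(2)
          go func() {
              defer wg.Done()
              for i := 0; i < iters; i++ {
                  incA()
              }
          }()
          go func() {
              defer wg.Done()
              for i := 0; i < iters; i++ {
                  incB()
              }
          }()
          wg.Wait()
          return time.Since(start)
      }

      func main() {
          var s sharedLine
          var p paddedLines
          fmt.Println("same cache line:", run(func() { s.a++ }, func() { s.b++ }))
          fmt.Println("separate lines: ", run(func() { p.a++ }, func() { p.b++ }))
      }
      ```

      On a typical multi-core box the padded version runs several times faster, and that's with only two cores contending; at 32 sockets the ping-pong gets far worse.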

      At these scales you're almost always better off scaling out "medium" sized boxes. A single writer and multiple read-only secondary replicas will take you very far, up to hundreds of gigabits of aggregate database traffic.
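
      A hedged Go sketch of that pattern (the DSNs, the Postgres driver, and the round-robin policy are illustrative assumptions, not any particular product):

      ```go
      package main

      import (
          "database/sql"
          "log"
          "sync/atomic"

          _ "github.com/lib/pq" // any SQL driver works; Postgres is an assumption
      )

      type Cluster struct {
          primary  *sql.DB
          replicas []*sql.DB
          next     atomic.Uint64
      }

      // Writer returns the single writable primary.
      func (c *Cluster) Writer() *sql.DB { return c.primary }

      // Reader round-robins across the read-only secondaries.
      func (c *Cluster) Reader() *sql.DB {
          n := c.next.Add(1)
          return c.replicas[n%uint64(len(c.replicas))]
      }

      func main() {
          open := func(dsn string) *sql.DB {
              db, err := sql.Open("postgres", dsn)
              if err != nil {
                  log.Fatal(err)
              }
              return db
          }
          c := &Cluster{
              primary: open("host=primary dbname=app"),
              replicas: []*sql.DB{
                  open("host=replica1 dbname=app"),
                  open("host=replica2 dbname=app"),
              },
          }
          // All mutations funnel through the one writer...
          if _, err := c.Writer().Exec("UPDATE accounts SET balance = balance - 1 WHERE id = $1", 42); err != nil {
              log.Fatal(err)
          }
          // ...while the replica fleet absorbs the read traffic.
          var balance int64
          if err := c.Reader().QueryRow("SELECT balance FROM accounts WHERE id = $1", 42).Scan(&balance); err != nil {
              log.Fatal(err)
          }
          log.Println("balance:", balance)
      }
      ```

      The nice property is that read capacity scales by just adding replicas, with no exotic hardware anywhere.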