Comment by lhl

1 year ago

Last fall I built a new workstation with an EPYC 9274F (24C Zen 4, 4.1-4.3GHz, $2400), 384GB of 12 x 32GB DDR5-4800 RDIMM ($1600), and a Gigabyte MZ33-AR0 motherboard. I'm slowly populating it with GPUs (including via C-Payne MCIO gen5 adapters), so memory wasn't my focus, but I did spend some time recently poking at it.

I spent extra on the 9274F because published benchmarks [1] showed it hitting STREAM TRIAD results of 395 GB/s (against 460.8 GB/s of theoretical peak memory bandwidth). Sadly, my results have been nowhere near that. I tested with LIKWID, Sysbench, and llama-bench, and even with an updated BIOS and NUMA tweaks I was getting less than half the Fujitsu benchmark numbers:

  Results for results-f31-l3-srat:
  {
      "likwid_copy": 172.293857421875,
      "likwid_stream": 173.132177734375,
      "likwid_triad": 172.4758203125,
      "sysbench_memory_read_gib": 191.199125,
      "llama_llama-2-7b.Q4_0": {
          "tokens_per_second": 38.361456,
          "model_size_gb": 3.5623703002929688,
          "mbw": 136.6577115303955
      }
  }

For those interested in all the system details/running their own tests (also MLC and PMBW results among others): https://github.com/AUGMXNT/speed-benchmarking/tree/main/epyc...

[1] https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-perfor...

Assuming the channels are populated correctly, which I believe they are, I can only think this issue is related to the motherboard itself or the RAM. I'd start by measuring single-core RAM bandwidth and latency.

Since the CPU is clocked quite high, the figures you should be getting are, I'd guess, around ~100 ns of latency (probably less) and 40-ish GB/s of bandwidth. If those figures don't match, it could be a motherboard (HW), BIOS (SW), or RAM stick issue.

If those figures closely match, then it's not a RAM issue but a motherboard one (BIOS or HW), and you could continue debugging by adding more and more cores to the experiment to find the point where bandwidth saturates. It could also be a power issue with the mobo.

  • Yeah, the channels are populated correctly. As you can see from the mlc-results.txt, the latency looks fine:

       mlc --idle_latency
      Intel(R) Memory Latency Checker - v3.11b
      Command line parameters: --idle_latency
    
      Using buffer size of 1800.000MiB
      Each iteration took 424.8 base frequency clocks (       104.9   ns)
    

    As do the per-NUMA-node --bandwidth_matrix results:

                      Numa node
      Numa node            0       1       2       3       4       5       6       7
             0        45999.8 46036.3 50490.7 50529.7 50421.0 50427.6 50433.5 52118.2
             1        46099.1 46129.9 52768.3 52122.3 52086.5 52767.6 52122.6 52093.4
             2        46006.3 46095.3 52117.0 52097.2 50385.2 52088.5 50396.1 52077.4
             3        46092.6 46091.5 52153.6 52123.4 52140.3 52134.8 52078.8 52076.1
             4        45718.9 46053.1 52087.3 52124.0 52144.8 50544.5 50492.7 52125.1
             5        46093.7 46107.4 52082.0 52091.2 52147.5 52759.1 52163.7 52179.9
             6        45915.9 45988.2 50412.8 50411.3 50490.8 50473.9 52136.1 52084.9
             7        46134.4 46017.2 52088.9 52114.1 52125.0 52152.9 52056.6 52115.1
    

    I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.

    Updating from the board-delivered F14 BIOS to the latest 9004-series F31 (the F33 releases bricked the board and required a BIOS flasher for manual recovery) gave a marginal (5-10%) improvement, but nothing major.

    While it's 1DPC, the memory is 2R (but still runs at 4800), and it trains on every boot. The PMBW graph is probably the most useful behavior chart: https://github.com/AUGMXNT/speed-benchmarking/blob/main/epyc...

    Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix, the performance.

    I might write up a more step-by-step guide at some point to help others, but for now the testing scripts are there. I think most people looking at theoretical MBW should do their own real-world testing, as it seems to vary a lot more than GPU bandwidth does.

    • To saturate the bandwidth you would need ~16 Zen 4 cores, but you could first try running

          likwid-bench -t load -i 100 -w S0:5GB:8:1:2
      

      and see what you get. I think you should be able to get somewhere around ~200 GB/s.
