Comment by menaerus
1 year ago
Assuming that you populated the channels correctly, which I believe you did, I can only think that this issue could be related to the motherboard itself or RAM. I think you could start by measuring the single-core RAM bandwidth and latency.
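For example, something along these lines (just a sketch; the exact sizes and flags don't matter much, and mlc and likwid-bench are the tools that come up later in this thread anyway):

    # unloaded memory latency as seen from a single core
    mlc --idle_latency
    # single-threaded streaming read bandwidth (working-set size is illustrative)
    likwid-bench -t load -w S0:2GB:1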
Since the CPU is clocked quite high, the figures you should be getting are, I'd guess, around ~100 ns of latency (probably a bit less) and 40-ish GB/s of bandwidth. If those figures don't match, it could be either a motherboard (HW) or BIOS (SW) issue, or a RAM stick issue.
If those figures closely match, then it's not a RAM issue but a motherboard one (BIOS or HW), and you could continue debugging by adding more and more cores to the experiment to find the point where the bandwidth saturates. It could be a power issue with the mobo.
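A rough sketch of that sweep (working-set size and thread counts here are only examples, reusing the same likwid-bench load kernel as below):

    # add threads until the reported MByte/s stops scaling
    for t in 1 2 4 8 12 16 24 48; do
      echo "== $t threads =="
      likwid-bench -t load -w S0:5GB:$t:1:2
    done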
Yeah, the channels are populated correctly. As you can see from mlc-results.txt, the latency looks fine:

As do the per-channel --bandwidth_matrix results:
I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.
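For anyone reproducing this, the NUMA layout that's actually active under each config can be checked with the usual tools, e.g.:

    # list NUMA nodes with their CPUs and memory
    numactl --hardware
    # hwloc's topology view of sockets/CCDs/caches
    lstopo --no-io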
Updating from the board-delivered F14 to the latest 9004 F31 BIOS (the F33 releases bricked the board and required using a BIOS flasher for manual recovery) gave a marginal (5-10%) improvement, but nothing major.
While 1DPC, the memory is 2R (but it still registers at 4800) and trains on every boot. The PMBW graph is probably the most useful behavior chart: https://github.com/AUGMXNT/speed-benchmarking/blob/main/epyc...
Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix, the performance.
I might write up a more step-by-step guide at some point to help others, but for now the testing scripts are there. I think most people looking at theoretical MBW should probably do their own real-world testing, as it seems to vary a lot more than GPU bandwidth.
To saturate the bandwidth you would need ~16 Zen 4 cores, but you could first try running something like `likwid-bench -t load -w S0:5GB:8:1:2` and see what you get. I think you should be able to get somewhere around ~200 GB/s.
With likwid-bench at S0:5GB:8:1:2 I get 129136.28 MB/s; at S0:5GB:16:1:2, 184734.43 MB/s (this is roughly the max: S0:5GB:12:1:2 is 186228.62 MB/s and S0:5GB:48:1:2 is 183598.29 MB/s). According to lstopo, my 9274F has 8 dies with 3 cores each (currently each die is set to its own NUMA domain, the L3 strat). In any case, I also gave `numactl --interleave=all likwid-bench -t load -w S0:5GB:48:1:2 -i 100` a spin and it topped out in about the same place: 184986.45 MB/s.