
Comment by lhl

1 year ago

Yeah, the channels are populated correctly. As you can see from mlc-results.txt, the idle latency looks fine:

   mlc --idle_latency
  Intel(R) Memory Latency Checker - v3.11b
  Command line parameters: --idle_latency

  Using buffer size of 1800.000MiB
  Each iteration took 424.8 base frequency clocks (       104.9   ns)
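
To sanity-check that the ~105 ns isn't hiding cross-node accesses, the same measurement can be pinned to a single node with numactl. A minimal sketch, assuming mlc is on the PATH (it typically wants root); node 0 and node 7 are just placeholder choices:

  # local-only: threads and allocation on the same NUMA node
  numactl --cpunodebind=0 --membind=0 mlc --idle_latency

  # deliberately remote allocation, for comparison
  numactl --cpunodebind=0 --membind=7 mlc --idle_latency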

As do the per-NUMA-node --bandwidth_matrix results:

                  Numa node
  Numa node            0       1       2       3       4       5       6       7
         0        45999.8 46036.3 50490.7 50529.7 50421.0 50427.6 50433.5 52118.2
         1        46099.1 46129.9 52768.3 52122.3 52086.5 52767.6 52122.6 52093.4
         2        46006.3 46095.3 52117.0 52097.2 50385.2 52088.5 50396.1 52077.4
         3        46092.6 46091.5 52153.6 52123.4 52140.3 52134.8 52078.8 52076.1
         4        45718.9 46053.1 52087.3 52124.0 52144.8 50544.5 50492.7 52125.1
         5        46093.7 46107.4 52082.0 52091.2 52147.5 52759.1 52163.7 52179.9
         6        45915.9 45988.2 50412.8 50411.3 50490.8 50473.9 52136.1 52084.9
         7        46134.4 46017.2 52088.9 52114.1 52125.0 52152.9 52056.6 52115.1
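
To go from the per-node matrix to a single aggregate figure, mlc's other bandwidth modes are the easiest cross-check. A minimal sketch; exact flag spellings can differ slightly between mlc versions:

  # peak bandwidth with all threads, for several read/write mixes
  sudo ./mlc --max_bandwidth | tee mlc-max-bw.txt

  # latency as a function of injected load; shows where bandwidth saturates
  sudo ./mlc --loaded_latency | tee mlc-loaded-latency.txt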

I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.
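
For anyone reproducing that NUMA sweep, it helps to dump the layout before each run so results can be matched to a configuration. A sketch using standard tools (numactl, hwloc's lstopo, likwid-topology):

  # NUMA nodes, their CPUs, memory sizes and distances
  numactl --hardware

  # package / CCD / L3 / core hierarchy (lstopo ships with hwloc)
  lstopo --no-io

  # likwid's view of the same topology; its S0/M0 domain numbering follows this
  likwid-topology -g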

Updating from the board-delivered F14 to the latest 9004 F31 BIOS (the F33 releases bricked the board and required a BIOS flasher for manual recovery) gave a marginal (5-10%) improvement, but nothing major.

While it's 1DPC, the memory is 2R (but still registers at 4800) and trains on every boot. The PMBW graph is probably the most useful behavior chart: https://github.com/AUGMXNT/speed-benchmarking/blob/main/epyc...

Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix, the performance.

I might write up a more step-by-step guide at some point to help others, but for now the testing scripts are there - I think most people looking at theoretical MBW should probably do their own real-world testing, as it seems to vary a lot more than GPU bandwidth.

To saturate the bandwidth you would need ~16 zen4 cores, but you could first try running

    likwid-bench -t load -i 100 -w S0:5GB:8:1:2

and see what you get. I think you should be able to get somewhere around ~200 GB/s.

  • w/ likwid-bench S0:5GB:8:1:2: 129136.28 MB/s. At S0:5GB:16:1:2: 184734.43 MB/s (roughly the max; S0:5GB:12:1:2 is 186228.62 MB/s and S0:5GB:48:1:2 is 183598.29 MB/s). According to lstopo, my 9274F has 8 dies with 3 cores on each (currently each die is set to its own NUMA domain via the L3 strategy). In any case, I also gave `numactl --interleave=all likwid-bench -t load -w S0:5GB:48:1:2 -i 100` a spin and it topped out in about the same place: 184986.45 MB/s.

    • Yes, you're correct that your CPU has 8 CCDs, but the bw with 8 threads is already too low. Those 8 cores should be able to get you to roughly half of the theoretical bw; 8x zen5 cores, for comparison, can reach the ~230 GB/s mark.

      Can you repeat the same likwid experiment but with 1, 2, and 4 threads? I'm wondering at what point it begins to deteriorate quickly.

      Maybe also worth doing: repeat the 8-thread run but force likwid to pick every third physical core so that you get a 1-thread-per-CCD setting (a sketch of both experiments follows below).
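
      A sketch of both experiments, assuming the S0 domain spans the whole socket and that each CCD is still exposed as its own NUMA domain (M0..M7) as described above. The exact stride for "every third physical core" depends on how likwid enumerates hardware threads, so the per-CCD variant below sidesteps that by giving each NUMA domain its own single-threaded workgroup. The peak-bandwidth arithmetic at the end assumes a fully populated 12-channel DDR5-4800 configuration:

        # thread-count sweep on the socket domain
        for t in 1 2 4 8 16; do
          likwid-bench -t load -i 100 -w S0:5GB:${t}:1:2
        done

        # 1 thread per CCD: one workgroup per NUMA/L3 domain (M0..M7)
        likwid-bench -t load -i 100 \
          -w M0:1GB:1 -w M1:1GB:1 -w M2:1GB:1 -w M3:1GB:1 \
          -w M4:1GB:1 -w M5:1GB:1 -w M6:1GB:1 -w M7:1GB:1

        # for scale: 12 channels * 4800 MT/s * 8 B = 460.8 GB/s theoretical peak,
        # so ~230 GB/s is roughly half of it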
