Comment by brucehoult

19 hours ago

I don't know how they got their 3 GB/s memory bandwidth.

My own testing shows 5347.7 MB/s on a 64 MiB to 64 MiB `memcpy()` using a basic 7-instruction RVV copy loop on an X100 core. Counting both the read and the write, that's a total of 10.7 GB/s memory bandwidth.

The A100 "AI" cores do better, with 13225.9 MB/s on the 64 MiB to 64 MiB copy, for a total 26.5 GB/s memory bandwidth.

Both core types do a 25 GB/s `memcpy()` (50 GB/s total) in cache.
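For anyone who wants to check the arithmetic, here is a small sketch that recomputes the 64 MiB rows from the byte count and the ns column of the tables. The function names are made up for illustration. One observation from doing this: the "MB/s" column appears to be computed with 2^20 bytes per MB, since that is the divisor that reproduces the printed figures.

```python
MIB = 1 << 20  # bytes per MiB; assumed divisor, chosen because it reproduces the table

def copy_rate(nbytes: int, ns: float) -> float:
    """Bytes copied per second, expressed in units of 2^20 bytes/s."""
    return nbytes / (ns * 1e-9) / MIB

def total_traffic(rate: float) -> float:
    """memcpy reads each byte once and writes it once, so traffic is 2x the copy rate."""
    return 2 * rate

# 64 MiB rows from the two tables: (bytes, ns per call)
x100 = copy_rate(67108864, 11967851.6)
a100 = copy_rate(67108864, 4838988.3)
print(f"X100: {x100:.1f} copy, {total_traffic(x100):.1f} read+write")
print(f"A100: {a100:.1f} copy, {total_traffic(a100):.1f} read+write")
```

Doubling is the only step between the copy rate and the "total" figure; nothing here measures reads and writes separately.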

On X100 cores:

    bruce@k3:~$ ./test_memcpy 
    Byte size :              ns     Speed
            0 :             6.3       0.0 MB/s
            1 :             6.5     147.6 MB/s
            2 :             6.5     295.7 MB/s
            4 :             6.3     602.7 MB/s
            8 :             6.4    1193.6 MB/s
           16 :             6.4    2402.1 MB/s
           32 :             6.4    4796.1 MB/s
           64 :             7.1    8558.1 MB/s
          128 :             7.1   17313.7 MB/s
          256 :            12.6   19444.2 MB/s
          512 :            20.8   23424.8 MB/s
         1024 :            39.8   24563.3 MB/s
         2048 :            80.4   24284.2 MB/s
         4096 :           158.0   24722.1 MB/s
         8192 :           312.5   24997.6 MB/s
        16384 :           609.6   25630.4 MB/s
        32768 :          1287.0   24281.6 MB/s
        65536 :          2761.8   22630.4 MB/s
       131072 :          6463.0   19340.9 MB/s
       262144 :         12897.6   19383.5 MB/s
       524288 :         25779.1   19395.6 MB/s
      1048576 :         52356.4   19099.9 MB/s
      2097152 :        111030.3   18013.1 MB/s
      4194304 :        569240.2    7026.9 MB/s
      8388608 :       1468409.2    5448.1 MB/s
     16777216 :       2905474.6    5506.8 MB/s
     33554432 :       5769324.2    5546.6 MB/s
     67108864 :      11967851.6    5347.7 MB/s

And on A100:

    bruce@k3:~$ ai ./test_memcpy 
    Byte size :              ns     Speed
            0 :            21.0       0.0 MB/s
            1 :            82.7      11.5 MB/s
            2 :            82.9      23.0 MB/s
            4 :            82.9      46.0 MB/s
            8 :            82.8      92.2 MB/s
           16 :            82.9     184.2 MB/s
           32 :            82.9     368.2 MB/s
           64 :            87.2     699.7 MB/s
          128 :            87.1    1401.7 MB/s
          256 :            87.2    2799.1 MB/s
          512 :            77.2    6326.1 MB/s
         1024 :            82.9   11784.2 MB/s
         2048 :            98.4   19855.9 MB/s
         4096 :           193.5   20191.4 MB/s
         8192 :           313.5   24916.8 MB/s
        16384 :           627.0   24919.0 MB/s
        32768 :          1254.2   24915.7 MB/s
        65536 :          2508.0   24920.1 MB/s
       131072 :          5017.3   24913.6 MB/s
       262144 :         10036.5   24909.0 MB/s
       524288 :         20075.0   24906.6 MB/s
      1048576 :         62556.9   15985.4 MB/s
      2097152 :        152324.5   13129.9 MB/s
      4194304 :        303466.3   13181.0 MB/s
      8388608 :        610230.0   13109.8 MB/s
     16777216 :       1186394.5   13486.2 MB/s
     33554432 :       2317591.8   13807.4 MB/s
     67108864 :       4838988.3   13225.9 MB/s

That's using the following `memcpy()` in both cases.

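    .globl memcpy
    memcpy:
            mv      a3, a0                  # a3 = dst cursor; a0 stays as the return value
    0:      vsetvli a4, a2, e8, m4, ta, ma  # a4 = bytes this pass (8-bit elements, LMUL=4)
            vle8.v  v0, (a1)                # load a4 bytes from src
            sub     a2, a2, a4              # remaining -= a4
            add     a1, a1, a4              # src += a4
            vse8.v  v0, (a3)                # store a4 bytes to dst
            add     a3, a3, a4              # dst += a4
            bnez    a2, 0b                  # loop until nothing remains
            ret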
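In case the loop structure isn't obvious to readers who don't write RVV, here is a rough Python model of the same strip-mining pattern. It is a simplification: it assumes `vsetvli` always returns `min(remaining, VLMAX)`, and the default `vlmax=128` assumes VLEN=256 (with `e8, m4`, VLMAX is 4 × VLEN / 8 bytes). Neither core's VLEN is stated above, so treat that number as a placeholder.

```python
def strip_mined_copy(dst: bytearray, src: bytes, n: int, vlmax: int = 128) -> int:
    """Copy n bytes in chunks of at most vlmax; return the number of loop passes."""
    s = d = 0
    passes = 0
    while True:
        vl = min(n, vlmax)             # vsetvli a4, a2, e8, m4, ta, ma (simplified)
        dst[d:d + vl] = src[s:s + vl]  # vle8.v v0, (a1) then vse8.v v0, (a3)
        n -= vl                        # sub a2, a2, a4
        s += vl                        # add a1, a1, a4
        d += vl                        # add a3, a3, a4
        passes += 1
        if n == 0:                     # bnez a2, 0b
            break
    return passes

src = bytes(range(256)) * 2 + b"xyz"   # 515 bytes: four full chunks plus a 3-byte tail
dst = bytearray(len(src))
passes = strip_mined_copy(dst, src, len(src))
print(passes, bytes(dst) == src)
```

The point of the pattern is that there is no separate tail loop: the last pass simply runs with a shorter vector length, and a zero-length copy falls through after one pass that moves nothing.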

Why does bandwidth (MB/s) decrease at larger sizes? Is it possible that caches play a larger role during smaller `memcpy()` calls, and you see the real CPU<->RAM bandwidth when you’re touching larger areas of memory?

EDIT: never mind, your comment seems to indicate that to be the case
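One way to see where the copies stop fitting in cache is to look for the largest step-to-step slowdown in each table. The sketch below does that on a few rows copied from the large-size end of the two tables; `biggest_drop` is a hypothetical helper, and the result says nothing definite about actual cache sizes, only where the measured rate falls off. Keep in mind that a copy of N bytes touches 2N bytes of memory (source plus destination).

```python
def biggest_drop(rows):
    """Return (size_before, size_after, ratio) for the largest step-to-step slowdown."""
    best = None
    for (s0, r0), (s1, r1) in zip(rows, rows[1:]):
        ratio = r1 / r0  # below 1.0 means the larger copy ran slower
        if best is None or ratio < best[2]:
            best = (s0, s1, ratio)
    return best

# (bytes, MB/s) rows taken from the tables in the comment above
x100 = [(1048576, 19099.9), (2097152, 18013.1), (4194304, 7026.9), (8388608, 5448.1)]
a100 = [(262144, 24909.0), (524288, 24906.6), (1048576, 15985.4), (2097152, 13129.9)]

for name, rows in (("X100", x100), ("A100", a100)):
    before, after, ratio = biggest_drop(rows)
    print(f"{name}: largest drop between {before} and {after} bytes (x{ratio:.2f})")
```

Only rows with a non-zero rate should be passed in, since the 0-byte row would cause a division by zero.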