Comment by brucehoult
19 hours ago
I don't know how they got their 3 GB/s memory bandwidth.
My own testing shows 5347.7 MB/s on a 64 MiB to 64 MiB `memcpy()` using a basic 7-instruction RVV copy loop on an X100 core. That's a total of 10.7 GB/s of memory bandwidth, counting both the read and the write.
The A100 "AI" cores do better, with 13225.9 MB/s on the 64 MiB to 64 MiB copy, for a total of 26.5 GB/s of memory bandwidth.
Both core types do about 25 GB/s `memcpy()`, a total of 50 GB/s, in cache.
On X100 cores:
bruce@k3:~$ ./test_memcpy
Byte size : ns Speed
0 : 6.3 0.0 MB/s
1 : 6.5 147.6 MB/s
2 : 6.5 295.7 MB/s
4 : 6.3 602.7 MB/s
8 : 6.4 1193.6 MB/s
16 : 6.4 2402.1 MB/s
32 : 6.4 4796.1 MB/s
64 : 7.1 8558.1 MB/s
128 : 7.1 17313.7 MB/s
256 : 12.6 19444.2 MB/s
512 : 20.8 23424.8 MB/s
1024 : 39.8 24563.3 MB/s
2048 : 80.4 24284.2 MB/s
4096 : 158.0 24722.1 MB/s
8192 : 312.5 24997.6 MB/s
16384 : 609.6 25630.4 MB/s
32768 : 1287.0 24281.6 MB/s
65536 : 2761.8 22630.4 MB/s
131072 : 6463.0 19340.9 MB/s
262144 : 12897.6 19383.5 MB/s
524288 : 25779.1 19395.6 MB/s
1048576 : 52356.4 19099.9 MB/s
2097152 : 111030.3 18013.1 MB/s
4194304 : 569240.2 7026.9 MB/s
8388608 : 1468409.2 5448.1 MB/s
16777216 : 2905474.6 5506.8 MB/s
33554432 : 5769324.2 5546.6 MB/s
67108864 : 11967851.6 5347.7 MB/s
And on A100:
bruce@k3:~$ ai ./test_memcpy
Byte size : ns Speed
0 : 21.0 0.0 MB/s
1 : 82.7 11.5 MB/s
2 : 82.9 23.0 MB/s
4 : 82.9 46.0 MB/s
8 : 82.8 92.2 MB/s
16 : 82.9 184.2 MB/s
32 : 82.9 368.2 MB/s
64 : 87.2 699.7 MB/s
128 : 87.1 1401.7 MB/s
256 : 87.2 2799.1 MB/s
512 : 77.2 6326.1 MB/s
1024 : 82.9 11784.2 MB/s
2048 : 98.4 19855.9 MB/s
4096 : 193.5 20191.4 MB/s
8192 : 313.5 24916.8 MB/s
16384 : 627.0 24919.0 MB/s
32768 : 1254.2 24915.7 MB/s
65536 : 2508.0 24920.1 MB/s
131072 : 5017.3 24913.6 MB/s
262144 : 10036.5 24909.0 MB/s
524288 : 20075.0 24906.6 MB/s
1048576 : 62556.9 15985.4 MB/s
2097152 : 152324.5 13129.9 MB/s
4194304 : 303466.3 13181.0 MB/s
8388608 : 610230.0 13109.8 MB/s
16777216 : 1186394.5 13486.2 MB/s
33554432 : 2317591.8 13807.4 MB/s
67108864 : 4838988.3 13225.9 MB/s
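That's using the following `memcpy()` in both cases:

            .globl memcpy
    memcpy:
            mv      a3, a0                    # keep a0 intact as the return value
    0:      vsetvli a4, a2, e8, m4, ta, ma    # a4 = bytes handled this pass
            vle8.v  v0, (a1)                  # load a4 bytes from src
            sub     a2, a2, a4                # bytes remaining
            add     a1, a1, a4                # advance src
            vse8.v  v0, (a3)                  # store a4 bytes to dst
            add     a3, a3, a4                # advance dst
            bnez    a2, 0b                    # loop until nothing remains
            ret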
Why does bandwidth (MB/s) decrease at larger sizes? Is it possible that caches play a larger role during smaller copies, and you see the real CPU<->RAM bandwidth only when you’re touching larger areas of memory?
EDIT: never mind, your comment seems to indicate that to be the case