Comment by WithinReason

2 days ago

This is 11 bit ops and a subtract, which I assume is ~11 clocks, while you can just do:

l1 = dot(A[:11000000], B[:11000000])
l2 = dot(A[:00110000], B[:00110000])
l3 = dot(A[:00001100], B[:00001100])
l4 = dot(A[:00000011], B[:00000011])

result = l1 + l2 * 4 + l3 * 16 + l4 * 64

That's 8 bit ops plus four 8-bit dots, so likely ~8 clocks with less serial dependence.
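
A minimal Python sketch of one possible reading of the masked-dot idea, to make the arithmetic concrete. It assumes each byte of A packs four unsigned 2-bit values and that B is split into four lanes, one per 2-bit field; neither detail is stated in the comment. The names `MASKS`, `SCALES`, `masked_dot` and `reference_dot` are hypothetical. Under that reading the weights 1, 4, 16, 64 bring the unshifted fields to a common scale, so the result comes out as 64 times the plain dot product and a single final shift would recover it.

```python
# Assumed layout (not from the comment): each byte of A packs four unsigned
# 2-bit values; B is given as four lanes, one per 2-bit field of A.
MASKS = (0b11000000, 0b00110000, 0b00001100, 0b00000011)
SCALES = (1, 4, 16, 64)  # weights from: l1 + l2 * 4 + l3 * 16 + l4 * 64
SHIFTS = (6, 4, 2, 0)    # bit position of each field, used only by the reference


def dot(xs, ys):
    """Plain integer dot product."""
    return sum(x * y for x, y in zip(xs, ys))


def masked_dot(packed_a, b_lanes):
    """Mask each field in place (no shift), dot it, then rescale the partial sums."""
    total = 0
    for mask, scale, lane in zip(MASKS, SCALES, b_lanes):
        total += scale * dot([a & mask for a in packed_a], lane)
    return total  # 64 times the true dot product under the assumed layout


def reference_dot(packed_a, b_lanes):
    """Unpack every field with shift-and-mask, then take the ordinary dot product."""
    return sum(
        dot([(a >> shift) & 0b11 for a in packed_a], lane)
        for shift, lane in zip(SHIFTS, b_lanes)
    )


if __name__ == "__main__":
    packed_a = [0b11100100, 0b00011011]
    b_lanes = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(masked_dot(packed_a, b_lanes), 64 * reference_dot(packed_a, b_lanes))
```

The four partial dots do not depend on each other, which is where the "less serial dependence" comes from; whether that actually lands at ~8 clocks depends on the target hardware.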