← Back to context

Comment by WithinReason

3 days ago

How do you do ternary matmul with popcnt on 1.58 bit packed data?

Assuming 2 bit per values (first bit is sign and second bit is value).

actv = A[_:1] & B[_:1]

sign = A[_:0] ^ B[_:0]

dot = pop_count(actv & !sign) - pop_count(actv & sign)

It can probably be made more efficient by taking a column-first format.

Since we are in CPU land, we mostly deal with dot products that match the cache size, I don't assume we have a tiled matmul instruction which is unlikely to support this weird 1-bit format.

  • Haven't looked closely, but on modern x86 CPUs it might be possible to do much better with the gf2affineqb instructions, which let us do 8x8 bit matrix multiplications efficiently. Not sure how you'd handle the 2-bit part, of course.

  • This is 11 bit ops and a subtract, which I assume is ~11 clocks, while you can just do:

    l1 = dot(A[:11000000],B[:11000000]) l2 = dot(A[:00110000],B[:00110000]) l3 = dot(A[:00001100],B[:00001100]) l4 = dot(A[:00000011],B[:00000011])

    result = l1 + l2 * 4 + l3 * 16 + l4 * 64

    which is 8 bit ops and 4x8 bit dots, which is likely 8 clocks with less serial dependence