Comment by lacedeconstruct

1 hour ago

yes but >> 8 is so much faster

8 comments

lacedeconstruct

You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.
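A minimal sketch of what "subtract 8 from the exponent" could look like in C, assuming 32-bit IEEE-754 `float`. The helper name `div256_bits` is made up for illustration. When the exponent field is too small to absorb the subtraction (zero, subnormal, or a result that would underflow), the sketch falls back to an ordinary multiply by 1/256 and lets the hardware handle it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: divide a float by 256 by editing its exponent field.
 * Assumes float is a 32-bit IEEE-754 binary32 value. */
static float div256_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* well-defined type pun */

    uint32_t exponent = (bits >> 23) & 0xFFu;  /* biased 8-bit exponent */

    if (exponent == 0xFFu)
        return x;                              /* inf and NaN stay as they are */

    if (exponent <= 8u)
        return x * 0.00390625f;                /* zero, subnormal, or underflow:
                                                  fall back to multiply by 1/256 */

    bits -= 8u << 23;                          /* exponent -= 8, i.e. x / 2^8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For normal inputs this gives the same result as `x * (1.0f / 256.0f)`, since multiplying by a power of two is exact unless the result underflows.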

StilesCrisis 20 minutes ago

It's just multiplication. Floating multiply is extraordinarily fast.

lacedeconstruct 13 minutes ago

The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable

dist-epoch 1 hour ago

Only in micro-benchmarks.

For real usage, today's CPUs are limited by memory bandwidth.

lacedeconstruct 36 minutes ago

What are you talking about? In a hot loop in my software renderer this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // >> 8 divides by 256 instead of 255: slightly off, but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
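Worth noting that `>> 8` divides by 256, not 255, so the result is biased low: an opaque white source gives `(255 * 255) >> 8 = 254`. A sketch of an exact divide by 255 that still uses only shifts and adds follows, assuming 8-bit channels and alpha in 0..255. The function names are hypothetical and not from the renderer above.

```c
#include <stdint.h>

/* Truncating x / 255 with shifts and adds.
 * Exact for 0 <= x <= 65279, which covers 255 * 255 = 65025. */
static inline uint32_t div255_floor(uint32_t x)
{
    return (x + 1u + (x >> 8)) >> 8;
}

/* x / 255 rounded to nearest, valid for 0 <= x <= 65025. */
static inline uint32_t div255_round(uint32_t x)
{
    uint32_t t = x + 128u;
    return (t + (t >> 8)) >> 8;
}

/* Hypothetical single-channel blend using the exact rounded divide. */
static inline uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t inv_alpha = 255u - alpha;
    uint32_t sum = (uint32_t)src * alpha + (uint32_t)dst * inv_alpha;
    return (uint8_t)div255_round(sum);
}

/* The same blend with the >> 8 shortcut, for comparison. */
static inline uint8_t blend_channel_shift(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t inv_alpha = 255u - alpha;
    uint32_t sum = (uint32_t)src * alpha + (uint32_t)dst * inv_alpha;
    return (uint8_t)(sum >> 8);
}
```

With these, `blend_channel(255, 0, 255)` returns 255 while `blend_channel_shift(255, 0, 255)` returns 254, and a fully transparent source over `dst = 200` gives 200 versus 199. Whether the extra add and shift matter in a hot loop depends on the renderer, so this is a trade-off to measure.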

dist-epoch 35 minutes ago

Because you are working in the cache.
Also, you should use SIMD.

szundi 43 minutes ago

[dead]