Comment by lacedeconstruct
1 hour ago
What are you talking about? In a hot loop in my software renderer this is like 10x faster:
// color4_t result = {
//     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
//     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
//     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
//     .a = src.a + (dst.a * inv_alpha) * INV_255
// };
// >> 8 is 1/256 rather than 1/255, but much faster
color4_t result = {
    .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
    .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
    .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
    .a = src.a + ((dst.a * inv_alpha) >> 8)
};
If the shift version really is 10x faster, the issue is some kind of weird compilation failure in the commented-out version. For one, the shift only cuts a third of the multiplies: each color channel goes from three multiplies to two.
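Side note on the two versions above: `>> 8` divides by 256, not 255, so the results come out slightly dark (255 * 255 >> 8 is 254, not 255). If that matters, a rounded divide by 255 can still be done with adds and shifts. A minimal sketch, assuming 8-bit channels so the input never exceeds 255 * 255; the names `div256_shift` and `div255_round` are made up for illustration and are not from the renderer being discussed:

```c
#include <stdint.h>

/* Truncating shift: divides by 256, so it slightly under-estimates x / 255. */
static inline uint8_t div256_shift(uint32_t x) {
    return (uint8_t)(x >> 8);
}

/* Rounded x / 255 using only adds and shifts.
 * Assumption: 0 <= x <= 255 * 255, i.e. x is a product (or blended sum)
 * of 8-bit channel values. */
static inline uint8_t div255_round(uint32_t x) {
    uint32_t t = x + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}
```

This costs two extra adds and one extra shift per channel compared to the plain `>> 8`, with no multiply or divide.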
Because you are working in the cache.
Also, you should use SIMD.
> Also, you should use SIMD.
Ironically no, clang is better at auto-vectorizing.
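For anyone who wants to try the auto-vectorization route, here is a rough sketch of the kind of loop a compiler has a chance of vectorizing. Assumptions, none of which come from the posts above: pixels are tightly packed 8-bit RGBA, `inv_alpha` is `255 - src.a`, and `blend_span` is a hypothetical helper name. Whether a given compiler actually vectorizes it depends on the version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: blends n_pixels RGBA8 source pixels over dst, in place.
 * `restrict` promises the compiler that src and dst do not overlap, which
 * removes one common obstacle to auto-vectorization. */
void blend_span(uint8_t *restrict dst, const uint8_t *restrict src,
                size_t n_pixels) {
    for (size_t i = 0; i < n_pixels; ++i) {
        const uint8_t *s = src + 4 * i;
        uint8_t *d = dst + 4 * i;
        uint32_t a = s[3];
        uint32_t inv_alpha = 255 - a; /* assumed definition of inv_alpha */

        /* Same integer math as the shift version in the thread (divide by 256). */
        d[0] = (uint8_t)((s[0] * a + d[0] * inv_alpha) >> 8);
        d[1] = (uint8_t)((s[1] * a + d[1] * inv_alpha) >> 8);
        d[2] = (uint8_t)((s[2] * a + d[2] * inv_alpha) >> 8);
        d[3] = (uint8_t)(a + ((d[3] * inv_alpha) >> 8));
    }
}
```

With clang, building with something like `-O3 -Rpass=loop-vectorize` prints a remark for each loop it vectorized, which is a quick way to check instead of guessing.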