Comment by dist-epoch 1 hour ago

Only in micro-benchmarks. For real usage, today's CPUs are limited by memory bandwidth.

4 comments

lacedeconstruct 1 hour ago

What are you talking about? In a hot loop in my software renderer this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // 1/256 but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };

dist-epoch 1 hour ago

Because you are working in the cache. Also, you should use SIMD.

lacedeconstruct 43 minutes ago

> Also, you should use SIMD.

Ironically no, clang is better at auto vectorizing.

szundi 1 hour ago

[dead]
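
A side note on the `>> 8` version in the thread: shifting divides by 256 rather than 255, so a fully opaque white source comes out as 254 instead of 255. If that matters, there is a well-known shift-and-add identity that gives the exact truncated `x / 255` for sums up to 255 * 255. The sketch below is only an illustration under assumptions not stated in the thread: channels are 8-bit, `inv_alpha` is `255 - alpha`, and the helper names (`div255`, `blend_shift`, `blend_exact`, `count_div255_mismatches`) are made up for this example. It says nothing about which variant is faster in a given renderer.

```c
#include <assert.h>
#include <stdint.h>

/* Exact truncating division by 255 using shifts and adds.
   Valid for 0 <= x <= 65025 (255 * 255), the largest sum the blend can produce. */
static inline uint32_t div255(uint32_t x) {
    assert(x <= 65025u);
    return (x + 1u + (x >> 8)) >> 8;
}

/* Blend one channel the way the thread's fast path does: divide by 256. */
static inline uint8_t blend_shift(uint8_t src, uint8_t dst, uint8_t alpha) {
    uint32_t inv_alpha = 255u - alpha;  /* assumption: inv_alpha = 255 - alpha */
    uint32_t sum = (uint32_t)src * alpha + (uint32_t)dst * inv_alpha;
    return (uint8_t)(sum >> 8);
}

/* Same blend, but with the exact divide-by-255 identity. */
static inline uint8_t blend_exact(uint8_t src, uint8_t dst, uint8_t alpha) {
    uint32_t inv_alpha = 255u - alpha;  /* assumption: inv_alpha = 255 - alpha */
    uint32_t sum = (uint32_t)src * alpha + (uint32_t)dst * inv_alpha;
    return (uint8_t)div255(sum);
}

/* Exhaustively compare div255 against plain integer division over its valid range. */
static inline uint32_t count_div255_mismatches(void) {
    uint32_t mismatches = 0;
    for (uint32_t x = 0; x <= 65025u; ++x) {
        if (div255(x) != x / 255u) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With these helpers, `blend_shift(255, 0, 255)` loses one level of brightness while `blend_exact(255, 0, 255)` keeps full white. Whether the extra add and shift is worth it depends on the workload, so measure in the actual hot loop.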