lacedeconstruct 1 hour ago
yes but >> 8 is so much faster

    xigoi 12 minutes ago
    You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.

    StilesCrisis 20 minutes ago
    It's just multiplication. Floating multiply is extraordinarily fast.

        lacedeconstruct 13 minutes ago
        The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable.

    dist-epoch 1 hour ago
    Only in micro-benchmarks. For real usage, today's CPUs are limited by memory bandwidth.

        lacedeconstruct 36 minutes ago
        What are you talking about? In a hot loop in my software renderer this is like 10x faster:

            // color4_t result = {
            //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
            //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
            //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
            //     .a = src.a + (dst.a * inv_alpha) * INV_255
            // };

            // 1/256 but much faster
            color4_t result = {
                .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
                .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
                .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
                .a = src.a + ((dst.a * inv_alpha) >> 8)
            };

            dist-epoch 35 minutes ago
            Because you are working in the cache. Also, you should use SIMD.

        szundi 43 minutes ago
        [dead]
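xigoi's point about floats can be sketched in C. This is only an illustration, assuming a 32-bit IEEE 754 `float`; the helper names `div256_exponent` and `shift_bits_right8` are hypothetical and do not come from the thread. The first subtracts 8 from the biased exponent field and falls back to the standard `ldexpf` whenever that would underflow or the input is zero, subnormal, infinite, or NaN. The second shifts the raw bit pattern, which is the operation xigoi says yields garbage.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "this sketch assumes a 32-bit IEEE 754 float");

/* Hypothetical helper: divide x by 256 by subtracting 8 from the
 * biased exponent field of its bit pattern. */
static float div256_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    uint32_t exponent = (bits >> 23) & 0xFFu;

    /* exponent == 0    : zero or subnormal input
     * exponent <= 8    : subtracting 8 would leave the normal range
     * exponent == 0xFF : infinity or NaN
     * In all of these cases, let ldexpf handle the scaling. */
    if (exponent <= 8u || exponent == 0xFFu)
        return ldexpf(x, -8);

    bits -= 8u << 23;               /* exponent field -= 8, i.e. x / 2^8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical helper: shift the raw bit pattern right by 8.
 * This moves sign and exponent bits into the mantissa, so the
 * result is unrelated to x / 256. */
static float shift_bits_right8(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits >>= 8;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, `div256_exponent(1024.0f)` returns `4.0f`, while `shift_bits_right8(1024.0f)` returns a tiny subnormal value. In ordinary code a multiply by `1.0f / 256.0f` is exact for the same inputs and is usually the simpler choice.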
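One detail of the `>> 8` blend in lacedeconstruct's snippet is worth spelling out: the shift divides by 256, while the commented-out version scales by `INV_255`, so the two are not numerically identical. The sketch below compares them on a single 8-bit channel. It assumes `inv_alpha` means `255 - alpha`, which the thread does not state, and the helper names `div255_round`, `blend_shift`, and `blend_exact` are hypothetical. `div255_round` is a well-known add-and-shift form of rounded division by 255 for inputs up to 255 * 255, so it stays in integer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: x / 255 rounded to nearest, using only adds
 * and shifts. Valid for 0 <= x <= 255 * 255. */
static uint32_t div255_round(uint32_t x)
{
    assert(x <= 255u * 255u);
    uint32_t t = x + 128u;
    return (t + (t >> 8)) >> 8;
}

/* Blend one channel the way the thread's snippet does:
 * >> 8 divides by 256 and truncates. */
static uint8_t blend_shift(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t inv_alpha = 255u - alpha;   /* assumed meaning of inv_alpha */
    return (uint8_t)((src * alpha + dst * inv_alpha) >> 8);
}

/* Same blend, but dividing by 255 with rounding. */
static uint8_t blend_exact(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t inv_alpha = 255u - alpha;   /* assumed meaning of inv_alpha */
    return (uint8_t)div255_round(src * alpha + dst * inv_alpha);
}
```

With these definitions, `blend_shift(255, 255, 255)` gives 254 where `blend_exact(255, 255, 255)` gives 255, and `blend_shift(200, 100, 0)` gives 99 where `blend_exact(200, 100, 0)` gives 100. Whether a one-step darkening per blend matters depends on the renderer; it tends to show up when many layers are composited.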