Comment by nubg

5 hours ago

> "engineers optimizing inferencing"

are we sure this is not a fancy way of saying quantization?

When MP3 became popular, people were amazed that you could compress audio to 1/10th its size with minor quality loss. A few decades later, we have audio codecs that are far better than MP3, and they took a lot more effort than "MP3 but at a lower bitrate."

The same is happening in AI research now.

Or distilled models, or slightly smaller models with the same architecture. Lots of options, all of them conveniently fitting inside "optimizing inferencing."

A ton of GPU kernels are hugely inefficient. Not saying the numbers are realistic, but look at the hundreds-fold gains in the Anthropic performance take-home exam that floated around on here.

And if you've worked with PyTorch models a lot, you know custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.
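To make the FlashAttention point concrete, here is a toy sketch of its core trick (my illustration, not the actual CUDA kernel): instead of materializing all N attention scores before the softmax, stream over keys with an "online" softmax that keeps only a running max, a running denominator, and a running weighted sum. The function names are mine.

```python
import math

def naive_attention(q, K, V):
    # Materializes every score at once -- O(N) extra memory per query.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    out = [0.0] * len(V[0])
    for w, v in zip(exps, V):
        for j, vj in enumerate(v):
            out[j] += (w / z) * vj
    return out

def streaming_attention(q, K, V):
    # Processes one key/value pair at a time -- O(1) extra memory per query,
    # rescaling the accumulator whenever a new running max appears.
    m = -math.inf            # running max of scores
    z = 0.0                  # running softmax denominator
    acc = [0.0] * len(V[0])  # running weighted sum of values
    for k, v in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescale old terms to the new max
        w = math.exp(s - m_new)
        z = z * scale + w
        acc = [a * scale + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]

q = [0.5, -1.0, 2.0]
K = [[1.0, 0.0, 0.5], [-0.5, 1.5, 0.0], [2.0, -1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = naive_attention(q, K, V)
b = streaming_attention(q, K, V)
```

Both paths produce the same output; the streaming one is what lets the real kernel tile the computation and skip the huge intermediate score matrix.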

This isn't just quantization; it's genuinely better optimization.

Even when it comes to quantization, Blackwell has far better quantization primitives and new floating-point types that support row- or layer-wise scaling, which quantize with far less quality loss.
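A minimal sketch of why finer-grained scaling helps (the function names are mine, not any Blackwell API): with a single per-tensor scale, one large-magnitude row forces a coarse quantization step on every other row, so small values collapse to zero. A per-row scale avoids that.

```python
def quantize(xs, scale):
    # Symmetric int8 quantization: round to the nearest step, clamp to +/-127,
    # then dequantize back to floats so we can measure the error.
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return [qi * scale for qi in q]

def max_abs_err(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

rows = [
    [0.01, -0.02, 0.015, -0.005],  # small-magnitude row
    [50.0, -30.0, 20.0, -45.0],    # large-magnitude outlier row
]

# Per-tensor: one scale for everything, dictated by the global max.
global_scale = max(abs(x) for r in rows for x in r) / 127

small = rows[0]
err_small_global = max_abs_err(small, quantize(small, global_scale))
# Per-row: the small row gets its own, much finer scale.
local_scale = max(abs(x) for x in small) / 127
err_small_local = max_abs_err(small, quantize(small, local_scale))
```

With the global scale the entire small row rounds to zero (error ~0.02), while the per-row scale keeps its error near the fine step size: the same intuition behind block- and row-scaled formats.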

There has also been a ton of work in the past year on sub-quadratic attention for new models, which removes a huge bottleneck. Like quantization it can be a tradeoff, but a lot of progress has been made on moving that Pareto frontier as well.
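One sub-quadratic trick, sketched in pure Python (my illustration of kernelized/linear attention generally, not any specific paper's code): if you replace the softmax with a positive feature map phi, then (phi(Q) @ phi(K).T) @ V equals phi(Q) @ (phi(K).T @ V) by matrix associativity, and the second grouping never forms the N x N score matrix, only a d x d one.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def phi(A):
    # A simple positive feature map; real linear-attention variants use
    # elu(x)+1 or random features, but exp() keeps the sketch short.
    return [[math.exp(x) for x in row] for row in A]

Q = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.0, -0.2]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# O(N^2) grouping: builds an N x N intermediate.
quadratic = matmul(matmul(phi(Q), transpose(phi(K))), V)
# O(N) grouping: builds only a d x d intermediate.
linear = matmul(phi(Q), matmul(transpose(phi(K)), V))
```

Both groupings give identical outputs; the whole research question is how much model quality you give up by dropping the softmax, which is where the Pareto-frontier work comes in.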

It's almost like when you're spending hundreds of billions in capex on GPUs, you can afford to hire engineers to make them perform better, rather than just nerfing the models with more quantization.