Comment by gajjanag

3 hours ago

Wow, yes - you are completely correct (read through the note in detail now).

Though, as your paper also notes, the quantizer values themselves aren't fundamentally novel to either paper. Lloyd-Max scalar quantizers have been studied for a very, very long time. And the specific Lloyd-Max values for the Gaussian input distribution have been obtained in many papers across signal processing and information theory.
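
As a rough illustration of how those classical values arise, here is a minimal sketch of Lloyd's iteration for a standard Gaussian source. The function name `lloyd_max_gaussian`, the starting points, and the iteration count are choices made for this example and are not taken from either paper.

```python
import math


def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_gaussian(levels, iters=200):
    """Return `levels` reconstruction points for N(0, 1), sorted ascending."""
    # Illustrative starting guess: evenly spaced points in [-2, 2].
    centroids = [-2.0 + 4.0 * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        # Decision thresholds are midpoints between neighbouring centroids.
        mids = [(a + b) / 2.0 for a, b in zip(centroids, centroids[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Each centroid is the conditional mean of N(0, 1) on its cell:
        # E[X | lo < X < hi] = (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo)).
        centroids = [
            (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
            for lo, hi in zip(edges, edges[1:])
        ]
    return centroids


if __name__ == "__main__":
    print(lloyd_max_gaussian(2))
    print(lloyd_max_gaussian(4))
```

For two levels this gives ±sqrt(2/π) ≈ ±0.798, and for four levels roughly ±0.453 and ±1.510, in line with the long-tabulated values.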


amitport 3 hours ago

Thanks for that!

It is worth noting that taking advantage of the post-rotation distribution was not actually done until DRIVE (2021), which was made possible via our proper scaling. Furthermore, applying a Lloyd-Max codebook post-rotation was introduced in EDEN.

We consider these to be the foundational works in this regard.

gajjanag 3 hours ago
> Thanks for that! It is worth noting that taking advantage of the post-rotation distribution

I again feel this claim is too strong. Rotations have been used in information theory/wireless communications for decades at this point, with appropriate scaling done at channel inputs/outputs to hit channel capacity. The signals then pass through the appropriate codebooks that take advantage of the post-rotated+whitened signal.
Our cellphones today are powered by such technology.
I agree with your claim when restricted to deep learning. But I do not agree with the broad characterization that taking advantage of post-rotation distributions was only first done in your work.
- amitport 2 hours ago
  
  Thanks for the pushback, and I appreciate the reference to classical information theory.
  While I probably overstated things by using the very general phrase "taking advantage," I want to be very precise about the claim, as I believe these works are foundational to quantization, beyond the scope of deep learning. The mechanism of applying a deterministic biased quantizer, such as Lloyd-Max, to the induced post-rotation distribution, alongside mathematically correcting its inherent bias, is a distinct contribution (which asymptotically improves the worst-case error).
  If there is a classical paper that utilizes such a combination, I would genuinely be very eager to review it. But to my knowledge, this was not introduced prior to DRIVE and EDEN.
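
For readers trying to picture the combination being discussed (rotate, apply a deterministic Lloyd-Max codebook, then rescale to correct the bias), a minimal sketch follows. It is an illustration under stated assumptions, not the DRIVE or EDEN implementation: the randomized Hadamard rotation, the function names, the hard-coded approximate 2-bit Gaussian codebook, and the particular scale `||x||^2 / <Rx, q>` are all choices made for this example.

```python
import math
import random

# Approximate 4-level Lloyd-Max points for N(0, 1) (classical tabulated values).
CODEBOOK_2BIT = (-1.510, -0.4528, 0.4528, 1.510)


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [s * t for t in v]


def rotate(x, signs):
    """Apply R = H D: random sign flips D, then the orthonormal Hadamard H."""
    return fwht([s * t for s, t in zip(signs, x)])


def unrotate(y, signs):
    """Apply R^T = D H (H is symmetric and its own inverse)."""
    return [s * t for s, t in zip(signs, fwht(y))]


def quantize_and_reconstruct(x, signs, codebook=CODEBOOK_2BIT):
    """Rotate x, snap each coordinate to the codebook, rescale, rotate back."""
    d = len(x)
    norm = math.sqrt(sum(t * t for t in x))
    y = rotate(x, signs)
    # After rotation, coordinates are roughly N(0, norm^2 / d); normalize to N(0, 1).
    z = [t * math.sqrt(d) / norm for t in y]
    q = [min(codebook, key=lambda c: abs(c - t)) for t in z]
    # Assumed bias-correcting scale: chosen so that <x_hat, x> = ||x||^2 exactly.
    scale = norm * norm / sum(a * b for a, b in zip(y, q))
    return unrotate([scale * c for c in q], signs)


if __name__ == "__main__":
    random.seed(0)
    d = 64
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    signs = [random.choice((-1.0, 1.0)) for _ in range(d)]
    x_hat = quantize_and_reconstruct(x, signs)
    print(sum(a * b for a, b in zip(x, x_hat)), sum(a * a for a in x))
```

With this scale the reconstruction's inner product with the input equals the input's squared norm by construction, which is the sense in which the sketch "corrects" the shrinkage a plain Lloyd-Max quantizer introduces. Whether this matches the papers' exact estimator is not something the thread establishes.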