Comment by rasz

7 years ago

3 GFLOPS; we are well past the point of diminishing returns here. Opus seems good enough.

Keep in mind that the very first CELP speech codec (in 1984) took 90 seconds to encode just 1 second of speech... on a Cray supercomputer. Ten years later, people had that running in their cell phones. It's not just that hardware keeps getting faster; algorithms are also getting more efficient. LPCNet is already 1/100 the complexity of the original WaveNet (which is just 2 years old), and I'm pretty sure it's still far from optimal.

  • This is roughly >100x the computation for a 2x improvement, which might sound great, except we are already talking single-digit kbit/s here; hence diminishing returns.

Opus is awesome and covers a previously unmatched spectrum of use cases... but that isn't everything.

Opus isn't good enough to replace AMBE for use over radio. Opus doesn't make very high-quality speech synthesis any easier, etc.

Opus's loss robustness could be much better using tools from this toolbox, and we're a long way from not wanting better performance in the face of packet loss.

  • Opus is still improving: v1.1 to v1.2 and then on to v1.3 (current in FFmpeg) saw huge reductions in encoding compute, and the minimum bitrate for stereo wideband has fallen year after year.

    The limiting factor for Opus's penetration has been compute: FEC is still rarely supported on VoIP deskphones for this reason, and likewise for handling multiple Opus calls at once.

3 GFLOP/sec sounds like a lot, but it's considerably less math than the radio DSPs inside any modern phone's baseband are doing during a phone call.

  • I don't know much about phone tech: are the basebands really doing math, or just instrumenting? My assumption would be that there is just some sensor writing to a buffer at a high frequency, but that whatever processes that buffer operates at a lower frequency.

    • Your question is hard to parse. What do you mean by "instrumenting"? If it helps, though: the term "baseband" itself refers to the low-frequency representation containing just the bandwidth of the signal. I.e., that is the lower frequency...


Not really, no. Especially not if this is implemented in a specialized accelerator, where a GFLOP is not that much. Also, like most other neural-network algorithms, this could be done in fixed point, further reducing the computational cost.
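The fixed-point idea can be illustrated with a minimal NumPy sketch (my own toy example, not LPCNet's actual implementation): quantize weights and activations to int8, do the matrix-vector product in integer arithmetic, and rescale at the end.

```python
import numpy as np

def quantize(w, bits=8):
    """Symmetric linear quantization to signed integers."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32)).astype(np.float32)  # layer weights
x = rng.standard_normal(32).astype(np.float32)        # input activations

Wq, w_scale = quantize(W)
xq, x_scale = quantize(x)

# Integer matrix-vector product, accumulated in int32 to avoid overflow
acc = Wq.astype(np.int32) @ xq.astype(np.int32)

# Rescale the integer result back to the floating-point domain
y_approx = acc * (w_scale * x_scale)
y_exact = W @ x

rel_err = np.linalg.norm(y_approx - y_exact) / np.linalg.norm(y_exact)
```

On hardware with fast 8-bit multiply-accumulate instructions, the integer product is much cheaper than the float one, and the relative error for a layer this size stays in the low single-digit percent range.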

I wonder if a particular network can be implemented more economically if we only run it in "transformation" mode, without the need to train it.

There are techniques to compress deep networks by pruning weak connections. I don't believe the author is using this, so the computational cost could likely be reduced by a factor of 10. Simple tweaks to the NN architecture might also help (was the author aiming for as small a network as possible to begin with?).

  • Actually, what's in the demo already includes pruning (through sparse matrices), and indeed it keeps just 1/10 of the weights as non-zero. In practice it's not quite a 10x speedup because the network has to be a bit bigger to get the same performance, but it's still a pretty significant improvement. Of course, the weights are pruned in 16x1 blocks to avoid hurting vectorization (see the first LPCNet paper and the WaveRNN paper for details).
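The block-pruning scheme described above can be sketched in a few lines of NumPy (a simplified illustration of magnitude-based block pruning; the function name `block_prune` and the exact selection rule are my own, not taken from the LPCNet code):

```python
import numpy as np

def block_prune(W, block=16, density=0.1):
    """Keep only the strongest `density` fraction of 16x1 weight blocks.

    Zeroing whole contiguous blocks (rather than individual weights)
    preserves SIMD-friendly memory access, as in LPCNet/WaveRNN.
    """
    rows, cols = W.shape
    assert rows % block == 0
    # View the matrix as (rows/block) x cols blocks of shape block x 1
    B = W.reshape(rows // block, block, cols)
    norms = np.sum(B ** 2, axis=1)          # squared L2 norm per block
    k = int(norms.size * density)           # number of blocks to keep
    thresh = np.sort(norms.ravel())[::-1][k - 1]
    mask = (norms >= thresh)[:, None, :]    # broadcast over the block dim
    return (B * mask).reshape(rows, cols)

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))
Wp = block_prune(W)
kept = float(np.mean(Wp != 0))  # ≈ 0.1 of the weights remain non-zero
```

A real implementation prunes gradually during training so the surviving weights can adapt, and stores the result in a sparse format so the zeroed blocks cost nothing at inference time.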