Also, while the author complains that there is not a lot of high quality data around [0], you do not need a lot of data to train small models. Depending on the problem you are trying to solve, you can do a lot with single-digit gigabytes of audio data. See, e.g., https://jmvalin.ca/demo/rnnoise/
[0] Which I do agree with, particularly if you need it to be higher quality or labeled in a particular way: the Fisher database mentioned is narrowband and 8-bit mu-law quantized, and while there are timestamps, they are not accurate enough for millisecond-level active speech determination. It is also less than 6000 conversations totaling less than 1000 hours (x2 speakers, but each is silent over half the time, a fact that can also throw a wrench in some standard algorithms, like volume normalization). It is also English-only.
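To illustrate the volume-normalization wrench: RMS measured over a whole file that is more than half silence underestimates the speech level, so normalizing to a target RMS over-amplifies the speech. A hypothetical sketch with synthetic data (not Fisher itself; the segment lengths and target level are made up):

```python
import numpy as np

# Synthetic stand-ins: 8000 samples of "active speech", then a longer silence
rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(8000)   # active speech at ~0.1 RMS
silence = np.zeros(12000)                  # speaker silent >half the time
x = np.concatenate([speech, silence])

target = 0.1
rms_all = np.sqrt(np.mean(x ** 2))         # diluted by the silence
rms_active = np.sqrt(np.mean(speech ** 2)) # measured on speech only

g_all = target / rms_all       # gain >> 1: would over-amplify the speech
g_active = target / rms_active # gain ~= 1: sensible
print(g_all, g_active)
```

Any normalizer that doesn't gate out the silence first (i.e. doesn't do active-speech-level measurement) ends up applying the inflated gain.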
Wrt data, it's not like there is a shortage of transcribed audio in the form of music sheets, lyrics and subtitles.
If one asks ~~nicely~~ expensively enough, they can even get isolated multitracks or teleprompter feeds together with the audiovisual tracks. Heck, if they wanted, they could set up dedicated transcription teams for the plethora of podcasts, with the costs somewhere in the rounding-error range. But you can't siphon that off of torrents, and paying for training material goes against the core ethics of the big players.
Too bad you can't really scrape tiktok/instagram reels with subtitles... Oh no, oh no, oh no no no no
I refuse to believe that none of these people has ever heard of Nyquist, and that no one was able to come up with "ayyy lmao let's put a low pass on this before downsampling".
Edit: a 2-day-old account posting stuff that doesn't pass the sniff test. Hmmmm... baited by a bot?
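For anyone following along, a minimal sketch of the Nyquist point (made-up frequencies; using scipy's polyphase resampler, which low-passes before decimating, as the "correct" path): a 14 kHz tone decimated from 48 kHz to 16 kHz without a low pass folds down to |14k - 16k| = 2 kHz.

```python
import numpy as np
from scipy.signal import resample_poly

fs = 48_000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 14_000 * t)  # above the new Nyquist (8 kHz at 16 kHz)

naive = x[::3]                           # decimate without filtering -> aliasing
proper = resample_poly(x, up=1, down=3)  # anti-alias filter, then decimate

def dominant_freq(sig, rate):
    spec = np.abs(np.fft.rfft(sig))
    return np.fft.rfftfreq(len(sig), 1 / rate)[np.argmax(spec)]

print(dominant_freq(naive, 16_000))  # -> 2000.0: the alias, audible garbage
print(np.mean(proper ** 2))          # tone mostly killed by the filter instead
```

The naive path doesn't lose the out-of-band energy, it relocates it into the audible band, which is exactly what the low pass is there to prevent.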
In ~30 years of working in the DSP domain, I've seen an insane number of ways to do signal processing wrong, even for the simplest things like passing a buffer or doing resampling.
The last example I saw, at one large company, was by a developer lacking audio/DSP experience: they used ffmpeg's resampling library, but after every 10 ms audio frame processed by the resampler they'd invoke flush(), just for the convenience of having the same number of input and output buffers... :)
Haha wow, I guess that gives a very noticeable 100 Hz comb-like artifact... did no one care to do a quick sanity check on that output?!
What's the point of saying that without backing it up? Either you think it's so obvious it doesn't need backing up (in which case you don't need to say it), or ...?
The reason it matters is that soon, any time somebody sees a comment they don't like or think is stupid, they'll just say, "eh a bot said that," and totally dilute the rest of the discussion, even if the comment was real.
We are *quickly* approaching a Tuesday where "bot detected, opinion rejected" is going to be a default assumption.
imo audio DSP experts are diametrically opposed to AI on moral grounds. Good luck hiring the good ones. It's like paying doctors to design guns.
Not sure your analogy works: Guantanamo had no trouble hiring medical personnel.
> imo audio DSP experts are diametrically opposed to AI on moral grounds.
Can you elaborate on this point? I don't know the moral grounds of audio DSP experts, and thus I don't understand why, in your opinion, they wouldn't take an offer if you paid them a serious amount of money.
Just to be clear: considering what a typical daily job in DSP programming is like, I can imagine that many audio DSP experts are not the best culture fit for AI companies, but this doesn't have anything to do with morality.
In my experience, most of the people in audio DSP are musicians or otherwise very well exposed to music and the arts, and many see using AI as fundamentally immoral or unethical. It's technology designed via theft with the intent to harm professionals in these spaces.
Like I said, it's like paying a doctor to design a better gun.