Comment by nateb2022

2 months ago

> Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.

https://en.wikipedia.org/wiki/Zero-shot_learning

edit: since there seems to be some degree of confusion regarding this definition, I'll break it down more simply:

We are modeling the conditional probability P(Audio|Voice). If the model samples from this distribution for a Voice class not observed during training, it is by definition zero-shot.

"Prediction" here is not a simple classification, but the estimation of this conditional probability distribution for a Voice class not observed during training.

Providing reference audio to a model at inference-time is no different than including an AGENTS.md when interacting with an LLM. You're providing context, not updating the model weights.
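
To make the distinction concrete, here is a minimal sketch, assuming a toy stand-in for a TTS model. `ToyTTS`, `embed_speaker`, `synthesize`, and `fine_tune` are hypothetical names invented for illustration and do not correspond to any real library. The only point it shows is that conditioning on a reference clip reads the weights, while fine-tuning on that clip writes them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTTS:
    """Hypothetical toy model; not any real TTS library's API."""
    weights: List[float] = field(default_factory=lambda: [0.5, -0.25])

    def embed_speaker(self, reference_audio: List[float]) -> float:
        # Summarise the reference clip as a single "voice" value (its mean).
        return sum(reference_audio) / len(reference_audio)

    def synthesize(self, text: str, reference_audio: List[float]) -> List[float]:
        # Inference-time conditioning: the clip is context, weights are only read.
        voice = self.embed_speaker(reference_audio)
        w0, w1 = self.weights
        return [w0 * ord(ch) / 128 + w1 * voice for ch in text]

    def fine_tune(self, text: str, target_audio: List[float], lr: float = 0.01) -> None:
        # One gradient step on mean squared error: the weights are modified.
        voice = self.embed_speaker(target_audio)
        w0, w1 = self.weights
        pairs = list(zip(text, target_audio))
        g0 = g1 = 0.0
        for ch, target in pairs:
            x = ord(ch) / 128
            err = (w0 * x + w1 * voice) - target
            g0 += 2 * err * x
            g1 += 2 * err * voice
        self.weights = [w0 - lr * g0 / len(pairs), w1 - lr * g1 / len(pairs)]


model = ToyTTS()
before = list(model.weights)
clip = [0.2, 0.4]  # made-up "audio" samples for an unseen speaker

audio = model.synthesize("hi", clip)
weights_after_conditioning = list(model.weights)  # same as `before`

model.fine_tune("hi", clip)
weights_after_fine_tune = list(model.weights)  # differs from `before`
```

Under the usage argued for in this comment, the `synthesize` call is the "zero-shot" case and the `fine_tune` call is what would count as one-shot learning.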



woodson 2 months ago

This generic answer from Wikipedia is not very helpful in this context. Zero-shot voice cloning in TTS usually means that recordings of the target speaker do not need to be included in the data used to train the TTS model. In other words, you provide an audio sample of the target speaker together with the text to be spoken, and the model generates audio that sounds like it was spoken by that speaker.

coder543 2 months ago
Why wouldn’t that be one-shot voice cloning? The concept of calling it zero shot doesn’t really make sense to me.
- ben_w 2 months ago
  
  Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?"
  As with other replies, yes this is a silly name.
  
- nateb2022 2 months ago
  
  Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.
  If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
  
- oofbey 2 months ago
  
  It’s nonsensical to call it “zero shot” when a sample of the voice is provided. The term “zero shot cloning” implies you have some representation of the voice from another domain - e.g. a text description of the voice. What they’re doing is ABSOLUTELY one shot cloning. I don’t care if lots of TTS folks use the term this way, they’re wrong.
- woodson 2 months ago
  
  I don't disagree, but that's what people started calling it. Zero-shot doesn't make sense anyway: how would the model know what voice it should sound like (unless it's a celebrity voice or similar that is included in the training data, where specifying a name is enough)?
  
- geocar 2 months ago
  
  So if you get your target to record (say) 1 hour of audio, that's a one-shot.
  If you didn't do that (because you have 100 hours of other people talking), that's zero-shots, no?
  
nateb2022 2 months ago
> This generic answer from Wikipedia is not very helpful in this context.
Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.
Your explanation just rephrases the very definition you dismissed.
- woodson 2 months ago
  
  From your definition:
  > a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.
  That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.
  

numpad0 2 months ago

I think the point is it's not zero-shot if a sample is needed. A system that requires one sample is usually considered one-shot, or few-shot if it needs a few, etc.