Comment by SoftTalker

1 day ago

What does "zero-shot" mean in this context?

The *-shot jargon is just in-crowd nonsense that has been meaningless since day one (or zero). Like Big O notation but even more arbitrary (as evidenced by all the answers to your comment).

> Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.

https://en.wikipedia.org/wiki/Zero-shot_learning

edit: since there seems to be some degree of confusion regarding this definition, I'll break it down more simply:

We are modeling the conditional probability P(Audio|Voice). If the model samples from this distribution for a Voice class not observed during training, it is by definition zero-shot.

"Prediction" here is not a simple classification, but the estimation of this conditional probability distribution for a Voice class not observed during training.

Providing reference audio to a model at inference time is no different from including an AGENTS.md when interacting with an LLM. You're providing context, not updating the model weights.
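As a toy sketch of that distinction (hypothetical speaker encoder and model, not a real TTS API): the reference audio is summarized into an embedding that conditions generation, while the model's weights stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "model" weights, fixed at training time.
W = rng.standard_normal((8, 8))

def speaker_embedding(reference_audio: np.ndarray) -> np.ndarray:
    """Stand-in speaker encoder: summarize reference audio into a vector."""
    return reference_audio.reshape(-1, 8).mean(axis=0)

def synthesize(text_features: np.ndarray, spk_emb: np.ndarray) -> np.ndarray:
    """Condition generation on the speaker embedding: context, not training."""
    return (text_features + spk_emb) @ W

reference = rng.standard_normal(64)   # audio from a speaker never seen in training
text_feats = rng.standard_normal(8)

before = W.copy()
audio = synthesize(text_feats, speaker_embedding(reference))

# The weights are untouched: the reference acted purely as inference-time context.
assert np.array_equal(W, before)
```

The reference audio plays the same role as a prompt: it parameterizes P(Audio|Voice) for an unseen Voice without any gradient update.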

  • This generic answer from Wikipedia is not very helpful in this context. Zero-shot voice cloning in TTS usually means that audio of the target speaker (the voice you want the generated speech to sound like) does not need to be included in the data used to train the TTS model. In other words, you can provide an audio sample of the target speaker together with the text to be spoken, and the model generates audio that sounds like it was spoken by that speaker.

    • > This generic answer from Wikipedia is not very helpful in this context.

      Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.

      Your explanation just rephrases the very definition you dismissed.

  • I think the point is that it's not zero-shot if a sample is needed. A system that requires one sample is usually considered one-shot, or few-shot if it needs a few, and so on.