Comment by nateb2022
3 days ago
> This generic answer from Wikipedia is not very helpful in this context.
Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.
Your explanation just rephrases the very definition you dismissed.
From your definition:
> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.
That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.
> That's not what happens in zero-shot voice cloning
It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).
In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.
The definition applies 1:1. During inference, the model predicts the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," and that very same class was "not observed during training."
You're getting hung up on the semantics.
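For anyone skimming, here is a toy sketch of the setup being argued about: a generator conditioned on an embedding computed from a reference clip of a speaker who was not in the training set, with no weight updates at inference. Everything in it is an assumption for illustration (`speaker_embedding`, `synthesize`, the speaker names, and the numbers are hypothetical). Real systems use learned neural encoders and decoders, not mean/spread statistics.

```python
import math

# Hypothetical speaker IDs seen during training (illustration only).
TRAINING_SPEAKERS = {"alice", "bob"}

def speaker_embedding(reference_clip):
    """Toy stand-in for a speaker encoder: summarise a clip as (mean, spread)."""
    n = len(reference_clip)
    mean = sum(reference_clip) / n
    variance = sum((x - mean) ** 2 for x in reference_clip) / n
    return (mean, math.sqrt(variance))

def synthesize(content, embedding):
    """Toy stand-in for a generator: render content conditioned on the embedding."""
    mean, spread = embedding
    return [mean + spread * c for c in content]

# Inference: "carol" is not in the training set and supplies one short clip.
new_speaker = "carol"
reference_clip = [2.0, 4.0, 6.0]

# The "zero-shot" part: the generator's parameters are never updated for carol;
# her voice enters only through the conditioning embedding.
embedding = speaker_embedding(reference_clip)
audio = synthesize([-1.0, 0.0, 1.0], embedding)
```

The only point of the sketch is that the unseen speaker is handled through conditioning at inference time, not through extra training on that speaker.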
Jeez, OP asked what it means in this context (zero-shot voice cloning), and you quoted a generic definition copied from Wikipedia. I defined it concretely for this context. Don't take it as a slight; there is no need to get all argumentative.