Comment by woodson
3 days ago
From your definition:
> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.
That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.
> That's not what happens in zero-shot voice cloning
It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).
In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.
The definition applies 1:1. During inference, it is predicting the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," which very same class was "not observed during training."
You're getting hung up on the semantics.
Jeez, OP asked what it means in this context (zero-shot voice cloning), where you quoted a generic definition copied from Wikipedia. I defined it concretely for this context. Don't take it as a slight, there is no need to get all argumentative.