Comment by nateb2022

3 days ago

> That's not what happens in zero-shot voice cloning

It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).

In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.

The definition applies 1:1. During inference, the model is predicting the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," and that very class was "not observed during training."
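To make that concrete, here's a toy sketch of the idea, not a real TTS system. Everything in it is an assumption for illustration: a "voice" is reduced to a single made-up pitch-like number, and `encode_speaker` and `generate` are hypothetical fixed functions standing in for a trained speaker encoder and a trained generator. The only point is that the generator is conditioned, at inference time, on a speaker it never saw in the training set.

```python
import random
import statistics

random.seed(0)

# Toy stand-in for voices: each speaker is a 1-D Gaussian over a
# made-up feature (think "average pitch"). All values are hypothetical.
TRAIN_SPEAKERS = {"alice": 120.0, "bob": 210.0}  # classes seen in training
UNSEEN_SPEAKER_MEAN = 165.0                      # class NOT seen in training


def record(mean, n=50, spread=5.0):
    """Simulate a short reference clip from a speaker."""
    return [random.gauss(mean, spread) for _ in range(n)]


def encode_speaker(reference_clip):
    """Hypothetical speaker encoder: clip -> conditioning value."""
    return statistics.fmean(reference_clip)


def generate(speaker_embedding, n=50, spread=5.0):
    """Hypothetical generator: draw samples from p(audio | speaker)."""
    return [random.gauss(speaker_embedding, spread) for _ in range(n)]


# Inference: one reference clip of the unseen speaker, no retraining.
reference = record(UNSEEN_SPEAKER_MEAN)
embedding = encode_speaker(reference)
cloned = generate(embedding)
cloned_mean = statistics.fmean(cloned)

# Which known or unknown speaker do the generated samples sit closest to?
candidates = dict(TRAIN_SPEAKERS, unseen=UNSEEN_SPEAKER_MEAN)
closest = min(candidates, key=lambda name: abs(candidates[name] - cloned_mean))
print(closest, round(cloned_mean, 1))
```

The generated samples land around the unseen speaker rather than around either training speaker, which is the "class not observed during training" part of the definition carried over to generation.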

You're getting hung up on the semantics.

Jeez, OP asked what it means in this context (zero-shot voice cloning), and you quoted a generic definition copied from Wikipedia. I defined it concretely for this context. Don't take it as a slight; there is no need to get all argumentative.