Comment by minimaxir

4 months ago

If you're just generating labels from existing documents, you don't need that many data points, but the LLM may hallucinate labels if you have too few relative to the number of labels you want.

For training the model downstream, the main constraint on dataset size is how many distinct labels you want for your use case. The rules of thumb are:

a) ensure each label has at least a few samples

b) have at least N^2 data points total for N labels, to avoid issues akin to the curse of dimensionality
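
Those two rules of thumb are easy to sanity-check programmatically. Here's a minimal sketch (the function name and the `min_per_label` threshold are my own choices, not from any particular library):

```python
from collections import Counter

def check_dataset(labels, min_per_label=5):
    """Check a labeled dataset against two rules of thumb:
    (a) every label has at least `min_per_label` samples,
    (b) there are at least N^2 total data points for N distinct labels.
    """
    counts = Counter(labels)
    n_labels = len(counts)
    # Labels that fall below the per-label minimum (rule a)
    sparse_labels = sorted(lab for lab, c in counts.items() if c < min_per_label)
    return {
        "n_labels": n_labels,
        "n_samples": len(labels),
        "sparse_labels": sparse_labels,
        "meets_n_squared": len(labels) >= n_labels ** 2,  # rule b
    }

# Example: 3 labels, 12 samples; "c" is under-sampled, but 12 >= 3^2
report = check_dataset(["a"] * 5 + ["b"] * 5 + ["c"] * 2)
```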