Comment by deepsquirrelnet

4 months ago

I think a less order-biased, more straightforward way would be to just vectorize everything, cluster the vectors, and then label the clusters with the LLM.
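A rough sketch of that batch approach, under some loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the clustering is a simple threshold-based connected-components pass (not any particular library's algorithm), and `ask_llm` is a hypothetical callable you would swap for an actual LLM call. The point is only that the grouping does not depend on the order the items arrive in.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: bag-of-words term counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.5):
    # Link every pair whose similarity clears the threshold, then take
    # connected components (union-find). The grouping is order-independent.
    vecs = [embed(t) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, text in enumerate(texts):
        groups.setdefault(find(i), []).append(text)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in groups.values())


def label_clusters(clusters, ask_llm):
    # ask_llm is a hypothetical callable: list of texts -> label string.
    return {ask_llm(members): members for members in clusters}
```

Labeling happens once per cluster, after all the data has been seen, so no label can be shaped by whichever items happened to show up first.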

OP here. Yes, that works too and gets you to the same result. It removes the risk of order bias, but the trade-off is higher marginal cost and latency.

The idea is also that this would be a classification system used in production, where you classify data as it comes in, so the "rolling labels" problem still exists there.

In my experience though, you can dramatically reduce unwanted bias by tuning your cosine similarity filter.
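To make the "cosine similarity filter" idea concrete, here is a minimal sketch of one way it could work in a streaming setup. This is my reading of it, not necessarily the exact implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `RollingLabeler` is a hypothetical name, and `new_label` is a hypothetical callable standing in for the LLM call that names a new category.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: bag-of-words term counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RollingLabeler:
    def __init__(self, new_label, threshold=0.6):
        self.new_label = new_label  # hypothetical: text -> label (LLM call)
        self.threshold = threshold  # the similarity filter being tuned
        self.labels = []            # (label, exemplar vector) pairs

    def classify(self, text):
        vec = embed(text)
        best_label, best_sim = None, 0.0
        for label, exemplar in self.labels:
            sim = cosine(vec, exemplar)
            if sim > best_sim:
                best_label, best_sim = label, sim
        # Reuse an existing label only if the closest exemplar is close enough.
        if best_label is not None and best_sim >= self.threshold:
            return best_label
        # Otherwise ask for a new label and remember this item as its exemplar.
        label = self.new_label(text)
        self.labels.append((label, vec))
        return label
```

A higher threshold creates more, narrower labels (and more LLM calls); a lower one folds more items into whatever labels were created first, which is where the order bias creeps in.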