Comment by kgeist

4 months ago

I did something similar: I had an LLM generate a list of "blockers" per transcribed customer call, computed embeddings for the blockers, and clustered them.
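
Roughly what that looks like in code, minus the LLM extraction step. The blocker strings, the model name, and the distance threshold below are just placeholders rather than what I actually used, and the `metric=` kwarg assumes a recent scikit-learn:

```python
# Sketch: embed a list of LLM-extracted "blockers" and cluster near-duplicates.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

blockers = [
    "pricing is too high for small teams",
    "too expensive compared to competitors",
    "missing SSO integration",
    "no SAML/SSO support",
]

# Any small local embedding model works; this one is just an example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(blockers, normalize_embeddings=True)

# Group similar blockers; the distance threshold is a tunable guess.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.4,
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(embeddings)

for label, text in sorted(zip(labels, blockers)):
    print(label, text)
```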

The OP has 6k labels and discusses time + cost, but what I found is:

- a small but good-enough locally hosted embedding model can be faster than OpenAI's embedding models (provided you have a fast GPU), and it costs nothing

- for just 6k labels you don't need Pinecone at all; in plain Python it took me only a couple of seconds to do all the calculations in memory (see the sketch after this list)
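
To give a sense of scale, here's a sketch of the in-memory version, with random vectors standing in for real embeddings. A 6k x 6k float32 similarity matrix is only ~140 MB and one matrix multiply away:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 6k real embeddings of dimension 384.
embeddings = rng.normal(size=(6000, 384)).astype(np.float32)

# Normalize rows so the dot product equals cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# All 36M pairwise similarities in one matrix multiplication.
similarity = embeddings @ embeddings.T

# Example query: top-5 labels most similar to label 0, excluding itself.
top5 = np.argsort(-similarity[0])[1:6]
print(top5, similarity[0][top5])
```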

For classification + embedding you can use locally hosted models; it's not a particularly complex task that requires huge models or huge GPUs. If you plan to run such classification tasks regularly, a one-time investment in a GPU lets you run as many experiments on your data as you like without having to think about cost anymore.

Agreed. I've run sentence-transformers/all-MiniLM-L6-v2 locally on CPU for a similar task, and it was roughly 2x faster than calling the OpenAI embedding API, not to mention free.
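
If anyone wants to reproduce that kind of comparison, something like this works. The OpenAI model name is just one of the current options, OPENAI_API_KEY must be set, and the numbers will obviously depend on hardware, batch size, and network latency:

```python
import time

from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["customer asked about SSO support"] * 512  # illustrative batch

# Local model on CPU.
local = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
t0 = time.perf_counter()
local.encode(texts, batch_size=64)
print("local CPU:", time.perf_counter() - t0, "s")

# OpenAI embedding API.
client = OpenAI()
t0 = time.perf_counter()
client.embeddings.create(model="text-embedding-3-small", input=texts)
print("OpenAI API:", time.perf_counter() - t0, "s")
```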

OP here. I agree with you. In production we use VoyageAI, which is usually 2x faster than OpenAI at similar quality (p90 < 200ms), but we're looking at spinning up a local embedding model in our cloud environment; that would bring p95 under 100ms and make the cost negligible as well.