Comment by kgeist
4 months ago
I did something similar: made an LLM generate a list of "blockers" per transcribed customer call, calculated the blockers' embeddings, and clustered them.
The OP has 6k labels and discusses time + cost, but what I found is:
- a small, good-enough locally hosted embedding model can be faster than OpenAI's embedding models (provided you have a fast GPU available), and it doesn't cost anything
- for just 6k labels you don't need Pinecone at all; in plain Python it took me a couple of seconds to do all the calculations in memory
For classification + embedding you can use locally hosted models; it's not a particularly complex task that requires huge models or huge GPUs. If you plan to run such classification tasks regularly, you can make a one-time investment (buy a GPU) and then run as many experiments on your data as you like without having to think about cost anymore.
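A rough sketch of what that pipeline can look like, assuming a sentence-transformers model and scikit-learn k-means; the model name, batch size, and cluster count below are illustrative placeholders, not exactly what I ran:

```python
# Embed a list of short labels with a small local model and cluster them
# entirely in memory -- no vector database needed at this scale.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# In practice this would be the ~6k LLM-extracted labels/blockers.
labels = ["pricing too high", "missing SSO integration", "slow onboarding"]

# Small local model; runs on CPU, faster with a GPU available.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(labels, batch_size=256, normalize_embeddings=True)

# With normalized vectors, Euclidean k-means behaves like cosine clustering.
kmeans = KMeans(n_clusters=min(50, len(labels)), n_init="auto", random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

for label, cid in zip(labels, cluster_ids):
    print(cid, label)
```

At 6k labels the embedding matrix is only a few megabytes, so clustering it (or computing pairwise cosine similarities) in memory takes seconds.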
Agreed. I've run sentence-transformers/all-MiniLM-L6-v2 locally on CPU for a similar task, and it was roughly 2x faster than calling the OpenAI embedding API, not to mention free.
OP here. I agree with you. In production we use VoyageAI, which is usually 2x faster than OpenAI at similar quality (p90 < 200ms), but we're looking at spinning up a local embedding model in our own cloud environment, which would bring p95 under 100ms and make the cost negligible as well.