Qwen3 is the open-weight state of the art at the moment. Qwen3-Embedding-8B and Qwen3-Reranker-8B are surprisingly good (according to some benchmarks, better than Gemini 2.5 embedding). The 4B variant is nearly as good, so you might as well use it unless the 8B measurably benefits your use case. If you don't need SOTA-level embedding precision because you'll run a stronger reranker afterwards, you can run Qwen3-Embedding-4B at Q4, which is only about 2GB and embeds very fast on most hardware. A weaker but close alternative is `Qwen3-Embedding-0.6B` at Q8, which is about 600MB and runs fine on any reasonably powerful CPU. If that does the job for you, you may not even need a GPU: grab an instance with 16 vCPUs and you'll have plenty of throughput, probably more than you need until your RAG has thousands of active users.
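As an illustration, here is a minimal sketch of what that CPU-only option could look like with llama-cpp-python and a Q8 GGUF of `Qwen3-Embedding-0.6B`. The file name and thread count are placeholders, not something prescribed here; point them at whatever quant and instance size you actually use.

```python
# Sketch: CPU-only embeddings via llama-cpp-python with a Q8 GGUF of
# Qwen3-Embedding-0.6B. Model path and thread count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-Embedding-0.6B-Q8_0.gguf",  # hypothetical local file
    embedding=True,   # run the model in embedding mode
    n_threads=16,     # match this to your vCPU count
    verbose=False,
)
# Note: Qwen3 embedding models use last-token pooling; depending on the GGUF's
# metadata you may need to set the pooling type explicitly.

docs = [
    "Qwen3 is a family of open-weight models.",
    "A reranker can recover precision lost to an aggressively quantized embedder.",
]

# create_embedding accepts a string or a list of strings and returns one
# pooled vector per input.
result = llm.create_embedding(docs)
vectors = [item["embedding"] for item in result["data"]]
print(len(vectors), len(vectors[0]))  # number of docs, embedding dimension
```

The 4B Q4 quant works the same way; you would just point `model_path` at the larger file and, if a GPU is available, offload layers to it.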
I'm running the Qwen3 4B model on consumer hardware, and it beats Gemini on English-language tasks.
The Qwen3 embedding models were released recently and do very well on benchmarks.