Comment by keizo
2 days ago
Has anyone done some simple latency profiling of the Gemini embedding API vs the OpenAI embedding API? It seems like that API call is one of the biggest chunks of time in a simple RAG setup.
In my experience the embedding API call is trivial compared to the time the LLM takes to compose the response.
Gemini Flash and Groq are pretty fast, and that part is streamable. Curiosity got the best of me, so I had Claude Code write a quick test. Caveat: the test is just 20 requests with a 1-second delay between them, run once, so take it with a grain of salt, but it's interesting nonetheless. An extra half second in a search is super noticeable, so Google is looking like a reasonable improvement.
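For anyone who wants to reproduce something similar, here is a minimal sketch of that kind of test: 20 sequential calls per provider with a 1 s delay, timing each embedding request. This is not keizo's actual script; the model names, sample text, and key handling are assumptions.

```python
# Quick-and-dirty embedding latency comparison (a sketch, not a benchmark).
# Assumes: pip install openai google-genai, and OPENAI_API_KEY plus
# GOOGLE_API_KEY (or GEMINI_API_KEY) set in the environment.
import time
import statistics

from openai import OpenAI
from google import genai

N_REQUESTS = 20
DELAY_S = 1.0
TEXT = "what is the capital of france?"  # short RAG-style query (assumed)

openai_client = OpenAI()        # picks up OPENAI_API_KEY from the env
gemini_client = genai.Client()  # picks up the Google API key from the env


def time_calls(label, fn):
    """Call fn() N_REQUESTS times, sleeping DELAY_S between calls,
    and print median/max wall-clock latency."""
    latencies = []
    for _ in range(N_REQUESTS):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
        time.sleep(DELAY_S)
    print(f"{label}: median {statistics.median(latencies) * 1000:.0f} ms, "
          f"max {max(latencies) * 1000:.0f} ms")


time_calls("openai text-embedding-3-small",
           lambda: openai_client.embeddings.create(
               model="text-embedding-3-small", input=TEXT))

time_calls("gemini gemini-embedding-001",
           lambda: gemini_client.models.embed_content(
               model="gemini-embedding-001", contents=TEXT))
```

Single-threaded sequential calls like this measure per-request latency, which is what matters for an interactive search box; throughput under concurrency is a different test.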