Comment by keizo
2 days ago
Has anyone done some simple latency profiling of the Gemini embedding API vs the OpenAI embedding API? It seems like that API call is one of the biggest chunks of time in a simple RAG setup.
In my experience the embedding API call is trivial compared to the time the LLM takes to compose the response.
Gemini Flash and Groq are pretty fast, and that part is streamable. Curiosity got the best of me, so I had Claude Code write a quick test. Caveat: the test is just 20 requests with a 1-second delay between them, run once, so take it with a grain of salt, but it's interesting nonetheless. An extra half second in a search is super noticeable, so Google is looking like a reasonable improvement.
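For anyone who wants to reproduce something similar, here is a minimal sketch of that kind of test: 20 sequential calls per provider with a 1 s delay, timing each embedding request. This is not keizo's actual script; the model names, sample text, and key handling are assumptions.

```python
# Quick-and-dirty embedding latency comparison (a sketch, not a benchmark).
# Assumes: pip install openai google-genai, and OPENAI_API_KEY plus
# GOOGLE_API_KEY (or GEMINI_API_KEY) set in the environment.
import time
import statistics

from openai import OpenAI
from google import genai

N_REQUESTS = 20
DELAY_S = 1.0
TEXT = "what is the capital of france?"  # short RAG-style query (assumed)

openai_client = OpenAI()        # picks up OPENAI_API_KEY from the env
gemini_client = genai.Client()  # picks up the Google API key from the env


def time_calls(label, fn):
    """Call fn() N_REQUESTS times, sleeping DELAY_S between calls,
    and print median/max wall-clock latency."""
    latencies = []
    for _ in range(N_REQUESTS):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
        time.sleep(DELAY_S)
    print(f"{label}: median {statistics.median(latencies) * 1000:.0f} ms, "
          f"max {max(latencies) * 1000:.0f} ms")


time_calls("openai text-embedding-3-small",
           lambda: openai_client.embeddings.create(
               model="text-embedding-3-small", input=TEXT))

time_calls("gemini gemini-embedding-001",
           lambda: gemini_client.models.embed_content(
               model="gemini-embedding-001", contents=TEXT))
```

Single-threaded sequential calls like this measure per-request latency, which is what matters for an interactive search box; throughput under concurrency is a different test.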