Comment by makeitdouble
18 days ago
> After trial and error with different models
As a mere occasional customer, I've been scanning 4 to 5 pages of the same document layout every week in Gemini for half a year, and every single week the results were slightly different.
Note that the docs are bilingual, so that could affect the results, but what struck me was the lack of consistency: even with the same model, running it two or three times in a row gives different results.
That's fine for my usage, but it sounds like a nightmare if, every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.
And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.
Consider turning down the temperature in the configuration? LLMs have a bit of randomness in them.
Gemini 2.0 Flash seems better than 1.5 - https://deepmind.google/technologies/gemini/flash/
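Not the parent, but for reference, this is roughly what turning the temperature down looks like with the google-generativeai Python SDK (assuming that's the client in use; the API key, model name, and prompt are placeholders):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-2.0-flash")

    # generation_config controls sampling; temperature=0 requests (near-)greedy
    # decoding, which cuts down run-to-run variation but doesn't guarantee it
    # (see the MoE/batching discussion further down the thread).
    response = model.generate_content(
        "Extract the line items from this document: ...",
        generation_config={"temperature": 0.0},
    )
    print(response.text)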
> and every single week the results were slightly different.
This is one of the reasons why open source offline models will always be part of the solution, if not the whole solution.
Inconsistency comes from scaling - if you are optimizing your infra to be cost-effective, you will arrive at the same tradeoffs. Not saying it's not nice to be able to make some of those decisions on your own - but if you're picking LLMs for simplicity, we are years away from running your own being in the same league for most people.
And if you are not, you won't.
You can decide if you change your local setup or not. You cannot decide the same for a service.
There is nothing inevitable about inconsistency in a local setup.
At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.
This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.
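To make the "greedy sampling" point concrete: at temperature zero the sampler should just take the argmax over the logits, so any run-to-run variation has to come from the logits themselves. A toy illustration (made-up numbers, not real model output):

    import numpy as np

    # At temperature 0, token selection is argmax over the logits. If two
    # candidates are nearly tied, a last-bit numerical wobble (e.g. from a
    # different reduction order in a different batch) flips the choice, and
    # the divergence compounds over the rest of the generation.
    logits_run_a = np.array([2.3100000001, 2.3100000000, -1.0])
    logits_run_b = np.array([2.3100000000, 2.3100000001, -1.0])
    print(np.argmax(logits_run_a))  # 0
    print(np.argmax(logits_run_b))  # 1 -> a different token gets chosen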
If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?
Quantized floating point math can, under certain scenarios, be non-associative.
When you combine that fact with being part of a diverse batch of requests over an MoE model, outputs are non-deterministic.
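The non-associativity part is easy to see with ordinary Python floats - a toy illustration, nothing to do with any particular serving stack: same numbers, different summation order, (usually) different last bits.

    import random

    random.seed(0)
    values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

    forward = sum(values)
    backward = sum(reversed(values))
    print(forward == backward)      # typically False
    print(abs(forward - backward))  # tiny, but enough to flip a near-tied argmax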
That's why you have the Azure OpenAI APIs, which give a lot more consistency.