Comment by makeitdouble
18 days ago
> After trial and error with different models
As a mere occasional customer, I've been scanning 4 to 5 pages of the same document layout every week in Gemini for half a year, and every single week the results were slightly different.
Note that the docs are bilingual, so that could affect the results, but what struck me was the lack of consistency: even with the same model, running it two or three times in a row gives different results.
That's fine for my usage, but it sounds like a nightmare if, every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.
And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.
Consider turning down the temperature in the configuration? LLMs have a bit of randomness in them.
Gemini 2.0 Flash seems better than 1.5 - https://deepmind.google/technologies/gemini/flash/
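Not the parent, but for reference, this is roughly what turning the temperature down looks like with the google-generativeai Python SDK (assuming that's the client in use; the API key, model name, and prompt are placeholders):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-2.0-flash")

    # generation_config controls sampling; temperature=0 requests (near-)greedy
    # decoding, which cuts down run-to-run variation but doesn't guarantee it
    # (see the MoE/batching discussion further down the thread).
    response = model.generate_content(
        "Extract the line items from this document: ...",
        generation_config={"temperature": 0.0},
    )
    print(response.text)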
> and every single week the results were slightly different.
This is one of the reasons why open source offline models will always be part of the solution, if not the whole solution.
Inconsistency comes from scaling - if you are optimizing your infra to be cost-effective, you will arrive at the same tradeoffs. Not saying it's not nice to be able to make some of those decisions on your own - but if you're picking LLMs for simplicity, we are years away from running your own being in the same league for most people.
And if you are not, you won't.
You can decide if you change your local setup or not. You cannot decide the same for a service.
There is nothing inevitable about inconsistency in a local setup.
At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.
This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.
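To make the "greedy sampling" point concrete: at temperature zero the sampler should just take the argmax over the logits, so any run-to-run variation has to come from the logits themselves. A toy illustration (made-up numbers, not real model output):

    import numpy as np

    # At temperature 0, token selection is argmax over the logits. If two
    # candidates are nearly tied, a last-bit numerical wobble (e.g. from a
    # different reduction order in a different batch) flips the choice, and
    # the divergence compounds over the rest of the generation.
    logits_run_a = np.array([2.3100000001, 2.3100000000, -1.0])
    logits_run_b = np.array([2.3100000000, 2.3100000001, -1.0])
    print(np.argmax(logits_run_a))  # 0
    print(np.argmax(logits_run_b))  # 1 -> a different token gets chosen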
If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?
Quantized floating point math can, under certain scenarios, be non-associative.
When you combine that fact with being part of a diverse batch of requests over an MoE model, outputs are non-deterministic.
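The non-associativity part is easy to see with ordinary Python floats - a toy illustration, nothing to do with any particular serving stack: same numbers, different summation order, (usually) different last bits.

    import random

    random.seed(0)
    values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

    forward = sum(values)
    backward = sum(reversed(values))
    print(forward == backward)      # typically False
    print(abs(forward - backward))  # tiny, but enough to flip a near-tied argmax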
That's why you have the Azure OpenAI APIs, which give a lot more consistency.