
Comment by GoatInGrey

5 hours ago

That's a significant rub with LLMs, particularly hosted ones: the variability. Add in quantization, speculative decoding, and runtime adjustment of temperature, nucleus sampling, attention head count, and skipped layers, and you can get wildly different behavior even with the same prompt and context sent to the same model endpoint a couple of hours apart.

And that's before you even get to the other quirks of LLMs.
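
To make the sampling point concrete, here's a minimal sketch (not any provider's actual implementation) of temperature plus nucleus (top-p) sampling over a toy logit vector; the numbers and the `sample_token` helper are made up for illustration, but they show why two calls with identical inputs can return different tokens even before any infrastructure-level drift is involved:

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Toy temperature + nucleus (top-p) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    # Temperature scaling: lower values sharpen the distribution, higher values flatten it.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # Nucleus filtering: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    # Random draw over the surviving tokens -- the source of run-to-run variability.
    return rng.choice(keep, p=kept_probs)

# Same "prompt" (same logits), two batches of samples: the picks can differ.
logits = np.array([2.0, 1.8, 1.5, 0.2, -1.0])
print([sample_token(logits) for _ in range(5)])
print([sample_token(logits) for _ in range(5)])
```

With temperature 0 (greedy decoding) the draw collapses to the argmax and repeats exactly; any nonzero temperature combined with a top-p cutoff reintroduces randomness, which hosted endpoints may further compound by changing these knobs or the serving stack underneath you.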