Comment by GoatInGrey
5 hours ago
That's a significant rub with LLMs, particularly hosted ones: the variability. Add in quantization, speculative decoding, and runtime adjustment of temperature, nucleus sampling, attention head count, and skipped layers, and you can get wildly different behavior even with the same prompt and context sent to the same model endpoint a couple of hours apart.
And that's all before you even get to the other quirks of LLMs.
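To make the sampling piece concrete, here's a toy sketch of temperature plus nucleus (top-p) sampling; the logits, parameter values, and `sample_token` helper are made up for illustration and aren't any provider's actual pipeline. Even with identical inputs and identical settings, the final random draw alone means repeated runs diverge:

```python
# Toy illustration (hypothetical, not any provider's real code) of why
# temperature + nucleus (top-p) sampling alone make outputs nondeterministic.
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    # Temperature scaling: higher T flattens the distribution.
    # Subtract the max logit first for numerical stability.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # Nucleus sampling: keep the smallest set of top tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return keep[rng.choice(len(keep), p=kept_probs)]

logits = np.array([2.0, 1.5, 1.0, 0.2, -1.0])  # made-up next-token scores
for run in range(3):
    # Same "prompt" (logits), same settings -- different tokens each run.
    print(f"run {run}:", [int(sample_token(logits)) for _ in range(8)])
```

Quantization, skipped layers, and speculative decoding then shift the logits themselves between runs, compounding the spread.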