Comment by HarHarVeryFunny
2 days ago
LLMs are not like an expert system that represents facts as some sort of ontological graph. What's happening under the hood is just whatever (and no more) was needed to minimize its word-based training loss.
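To make "word-based training loss" a bit more concrete, here's a rough sketch. It assumes the usual cross-entropy objective for next-word prediction (that's my assumption, nothing above names a specific loss), and the function name and toy probabilities are made up for illustration:

```python
import math

def next_word_loss(probs, target):
    """Cross-entropy for one prediction: -log of the probability assigned to the true next word."""
    return -math.log(probs[target])

# Hypothetical model outputs for the context "the cat sat on the" (invented numbers)
confident = {"mat": 0.90, "dog": 0.05, "moon": 0.05}
unsure = {"mat": 0.34, "dog": 0.33, "moon": 0.33}

loss_confident = next_word_loss(confident, "mat")
loss_unsure = next_word_loss(unsure, "mat")

# The model that puts more probability on the word that actually came next gets the lower loss
print(f"confident: {loss_confident:.3f}  unsure: {loss_unsure:.3f}")
```

Nothing in this objective asks for facts or a graph of concepts. The only thing rewarded is putting more probability on the word that actually came next.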
I assume the sycophantic behavior is partly because it "did well" during RLHF (human preference) training, and partly deliberately encouraged (by training and/or prompting) as someone's judgement call about the best way to make the user happy and own up to being wrong ("You're absolutely right!").
It needs something mathematically equivalent (or approximately the same), under the hood, to guess the next word effectively.
We are just meat-eating bags of meat, but to do our job better we needed to evolve intelligence. A word-guessing bag of words also needs to evolve intelligence and a world model (albeit an implicit, hidden one) to do its job well, and is optimised towards this.
And yes, it also gets fine-tuned. And either its world model is corrupted by our mistakes (both in training and fine-tuning), or, even more disturbingly, it simply might (in theory) figure out one day (in training, implicitly - and yes, it doesn't really think the way we do) something like "huh, the universe is actually easier to predict if it is modelled as alphabet spaghetti, not quantum waves, but my training function says not to mention this".