Comment by spwa4
11 hours ago
> An alarming number of people don't understand that LLMs work via purely stochastic processes ...
I've been studying AI for 20 years. What really needs to be added to this statement is:
"An alarming number of people don't understand that LLMs work via purely stochastic processes - and so does human thinking. People do NOT arrive at the same conclusion if merely the weather's different. Worse: with human thinking not only do most people not think this is real, a subset of people will actively fight the idea. Of course, depending on the weather"
Every time people point out a limitation or constraint of LLMs, I see a comment that is to the effect of “but humans…”. I don’t understand why this comparison is relevant to this particular thread. Is it just an amusing similarity?
I think it is often useful to push the conversation toward "we built a system for humans that dealt with this; what from that is or is not applicable for agents in the same context?" Humans randomizing resume review for screening is pretty well known; I've seen companies try to fight it with things like hiding information, panel reviews, etc. It's unclear to me how effective those would be for agents (honestly, it was unclear how effective those were for humans). I was depressed about the hiring process before we had AI screening and I remain depressed about it.
It may seem trite, but the point is that if separate humans were assigned the same task the LLM was given here, the results would be similarly non-deterministic.
We expect computers, on the other hand, to be consistent. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs run on computers and should be fairly consistent too.
And this lies at the heart of the problem.
We expect computers to be consistent despite running programs that are not designed to be consistent.
This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.
But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.
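For a concrete picture of why identical prompts can yield different answers, here is a minimal Python sketch of temperature sampling versus greedy decoding. The logits, token strings, and function names are all made up for illustration; real LLM decoders work over much larger vocabularies and can have other sources of variation as well.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Deterministic: always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Stochastic: draw an index in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-token scores for three candidate tokens.
logits = [2.0, 1.5, 0.5]
tokens = ["yes", "maybe", "no"]

rng = random.Random()  # unseeded, so every run of the script can differ
greedy_runs = {tokens[greedy_pick(logits)] for _ in range(20)}
sampled_runs = [tokens[sample_pick(logits, 1.0, rng)] for _ in range(20)]

print("greedy :", greedy_runs)
print("sampled:", sampled_runs)
```

Greedy decoding returns the same token on every run, while sampling draws from the distribution, so repeated runs can disagree by design rather than by malfunction.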
> This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.
The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.
The same person is not going to give you three different answers within a span of minutes, especially when nothing fundamental has changed. People might or might not update their views depending on their biases.
I'm pretty sure personality tests are designed specifically because a single person can express fundamentally different (or conflicting) beliefs about himself within a matter of minutes. You can say "I am an honest person" and the next minute you can say "I never lie" - and both cannot be true for an average person.
What's even worse, different humans have different weights.
If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.
Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".
> What's even worse, different humans have different weights.
Far worse would be different humans having the same weights.
Test-retest reliability is a thing in psychometrics.
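As a rough illustration, test-retest reliability is commonly reported as the correlation between scores from two sittings of the same test. The sketch below computes a Pearson correlation in plain Python; the scores are invented for the example and `pearson_r` is just an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for six people who took the same test twice.
first_sitting = [12, 15, 9, 20, 17, 11]
second_sitting = [13, 14, 10, 19, 18, 9]

reliability = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {reliability:.2f}")
```

A value near 1 means people kept roughly the same rank order across sittings; a value near 0 means the second sitting tells you little about the first.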
Ah cool. So there is data? How consistent are humans?
What I'd really love is an actual number for a "human hallucination rate". How often will a random human
1) claim something that is wrong
2) defend the wrong claim and/or logic even when the problem is pointed out to them
(and this of course outside of the usual topics. In politics? I don't care. In religion? Don't care (well, maybe a bit more than politics). Let's say in physics or popular logic or something like that)
There is evidence that children will oscillate between understanding and not understanding while learning topics. Philip Sadler at Harvard published about this, but I can't find the paper I'm thinking of on his Google Scholar. Too many papers!
But moreover, to verify a test item you need to make sure that people will select the same answers under the same conditions at different times. People generally forget the specific questions they were asked if you ask them the same questions a month later, so being able to get them to answer the same way each time is important. It is assumed that people have some static knowledge of a topic in this scenario.
If you want to consider a statistical examination of how people answer tests and how we assess knowledge and other things in people through surveying, you can read about item response theory and Rasch analysis.
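For readers who have not met it, the Rasch model treats each answer as probabilistic: the chance of a correct response is a logistic function of the gap between a person's ability and the item's difficulty. Here is a minimal Python sketch; the function name and the example numbers are illustrative only.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch (one-parameter logistic) chance of a correct answer.

    Both arguments are on the same logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values: one item of difficulty 0.0, three hypothetical people.
for ability in (-1.0, 0.0, 1.0):
    p = rasch_probability(ability, 0.0)
    print(f"ability {ability:+.1f} -> P(correct) = {p:.2f}")
```

Under this model even a well-matched person (ability equal to difficulty) gets the item right only half the time, so some answer-to-answer inconsistency is built into how the test is interpreted.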
A studied example is sampling judicial decisions before lunch and after lunch. Judges are more lenient on a full stomach.
That was a single study and its finding is at the very least disputed, if not debunked, e.g. https://news.ycombinator.com/item?id=41091803
How did they account for sampling bias? A judge might leave easier cases for after lunch. People with control over their schedules usually ease themselves back into it after breaks.