Comment by somenameforme
2 months ago
The Turing Test has not been meaningfully passed. Instead we redefined the test to make it passable. In Turing's original concept, a competent interrogator and the human participants were all actively expected to collude against the machine. The entire point was that even with that collusion, the machine would be able to hold its own and pass. Instead, modern takes have paired incompetent interrogators with participants colluding with the machine, probably in an effort to be part of 'something historic'.
In both successes (probably more, but referencing the two most high-profile - Eugene Goostman and the LLMs), the interrogators consistently asked pointless questions with no meaningful chance of eliciting compelling information - 'How's your day? Do you like psychology?' etc. - and the participants not only made no effort to make their humanity clear, but were often actively adversarial, intentionally answering illogically, inappropriately, or 'computery' even to such simple questions. For instance, here is dialog from a human in one of the tests:
----
[16:31:08] Judge: don't you thing the imitation game was more interesting before Turing got to it?
[16:32:03] Entity: I don't know. That was a long time ago.
[16:33:32] Judge: so you need to guess if I am male or female
[16:34:21] Entity: you have to be male or female
[16:34:34] Judge: or computer
----
And the tests are typically time-constrained by woefully poor typing skills (is this the new normal in the smartphone generation?) to the point that you tend to get anywhere from 1-5 interactions of just several words each. The above snippet was a complete interaction, so you get two responses from a human trying to trick the judge into deciding he's a computer. And obviously a judge determining that the above was probably a computer says absolutely nothing about the quality of responses from the computer - instead it's some weird anti-Turing Test where humans successfully act like a [bad] computer, ruining the entire point of the test.
The problem with any metric is that it often ends up being gamed to be beaten, and this is a perfect example of that. I suspect that in a true run of the Turing Test we're still nowhere even remotely close to passing it.
I don't doubt that all of the formal Turing Tests have been badly done. But I suspect that if you did one properly, at least one run would misjudge an LLM. Maybe it's a low percentage, but that's vastly better than zero.
So I'd say we're at least "remotely close", which is sufficient for me to reconsider Searle.
I thought it was funny that in the Cameron R. Jones run of the test, 73% of judges thought GPT-4.5 was the human rather than the actual human. I think it illustrates both the limits of the test and that LLMs are getting quite good. (paper: https://arxiv.org/abs/2503.23674)
I think if you have to accuse the humans of woeful typing and of being smartphone-generation fools, you're kind of scoring one for the LLM. In the Turing Test the machine was only supposed to match an average human.
Turing didn't specify the qualifications of the interrogator, but they're implied to be rather high in his somewhat naive examples. For instance, he gives arithmetic as a question the machine would answer rapidly and the man would not. In any case, it's somebody clearly aiming to figure out who is who, instead of just screwing around with 'hi', 'how are you doing', and other such completely pointless questions. His 5-minute timeline was based on conversational speed, which is about 2 words per second. Over a 5-minute interrogation that'd be around 300 words apiece, contrasted against modern takes, which often manage about an order of magnitude less.
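To spell out the arithmetic (the 2 words/second conversational rate is the comment's own estimate, not a figure from Turing's paper):

$$2\ \text{words/s} \times 300\ \text{s} = 600\ \text{words total} \approx 300\ \text{words per participant}$$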
The LLM Turing Test [1] was particularly abysmal. They used college students participating for course credit, actively filtered the users to ensure people had no clue what was going on, intentionally framed it as a conversation instead of a pointed interrogation, and then had a bot whose prompt was basically 'act stupid, ask questions, usually use fewer than 5 words' - and the kids were screwing around most of the time. For instance, here is a complete interrogation from that experiment (against a bot):
- hi
- heyy what's up
- hru
- I'm good, just tired lol. hbu?
The 'ask questions' instruction was a reasonable way of breaking the test, because it made interrogators who had no clue what they were doing waste all of their time, so there were often zero meaningful questions or answers in any given interrogation. In any case, I think scores significantly above 50% are a clear indicator of humans screwing around or of some other 'quirk' in the experiment, because, White Zombie notwithstanding, one cannot be more human than human.
[1] - https://osf.io/jk7bw/overview
> instead it's some weird anti-Turing Test where humans successfully act like a [bad] computer
This is ex post facto denial and cope. The Turing Test isn't a test between computers and the idealized human; it's a test between functional computers and functional humans. If the average human performs like the above, then, well, I guess the logical conclusion is that computers are already better at being "idealized humans" than humans are.