
Comment by tim333

2 months ago

I thought it was funny that in the Cameron R. Jones attempt at running the test, 73% of judges thought GPT-4.5 was the human rather than the actual human. I think it illustrates both the limits of the test and that LLMs are getting quite good. (paper: https://arxiv.org/abs/2503.23674)

I think if you are having to accuse the humans of woeful typing and of being smartphone-gen fools, you are kind of scoring one for the LLM. In the Turing test the machine was only supposed to match an average human.

Turing didn't specify the credentials of the interrogator, but they're implied to be rather high in his somewhat naive examples. For instance, he gives the example of posing an arithmetic problem, which the machine would answer rapidly and the man would not. In any case it's somebody clearly aiming to figure out who is who, instead of just screwing around with 'hi', 'how are you doing', and other such completely pointless questions. His 5-minute time limit was based on conversational speed, which is about 2 words per second; over a 5-minute interrogation that works out to roughly 600 words exchanged, or around 300 words apiece, contrasted against modern takes which are often about an order of magnitude less.
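
For what it's worth, here's the back-of-the-envelope word budget, taking the 2 words/second figure above at face value (just a rough sketch, not anything from the paper):

    # Rough word budget for a 5-minute Turing test session,
    # assuming ~2 words/second of total conversational throughput.
    words_per_second = 2
    duration_seconds = 5 * 60
    total_words = words_per_second * duration_seconds  # 600 words exchanged
    per_side = total_words // 2                         # ~300 words apiece
    print(total_words, per_side)                        # 600 300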

The LLM Turing test was particularly abysmal. They used college students doing it for credit, actively filtered the users to ensure people had no clue what was going on, intentionally framed it as a conversation instead of a pointed interrogation, and then had a bot whose prompt was basically 'act stupid, ask questions, usually use fewer than 5 words', while the kids were screwing around most of the time. For instance, here is a complete interrogation from that experiment [1] (against a bot):

- hi

- heyy what's up

- hru

- I'm good, just tired lol. hbu?

The 'ask questions' instruction was a reasonable way of breaking the test, because it made interrogators who had no clue what they were doing waste all of their time, so there were often zero meaningful questions or answers in any given interrogation. In any case, I think scores significantly above 50% are a clear indicator of humans screwing around or of some other 'quirk' in the experiment, because, White Zombie notwithstanding, one cannot be more human than human.

[1] - https://osf.io/jk7bw/overview