https://arxiv.org/abs/2503.23674
Large Language Models Pass the Turing Test
Cameron R. Jones, Benjamin K. Bergen
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
It's also interesting in terms of what it says about humans, because now we have an intelligence (or the appearance of intelligence; is there a functional difference?) that is not human. For example, here is the list of conversation subjects, ranked from most to least useful for ferreting out whether the agent is NOT HUMAN.
Strange
Jailbreak
Scenario/Game
Logic/Math
General Knowledge
Humor
Rude
Current Events
Are you a bot/human?
Surroundings
Opinion
Time
Daily Activities
Personal details
Weather
Human experience
In the dating threads, humor has been suggested as a way to estimate people's intelligence. What this shows is that a more effective way is to ask your prospective date how they'd commit a crime, how they'd respond to something strange, to take part in some kind of strategy planning, or to play mental puzzle games like "what kind of bird is ARTPOR?" I'm looking forward to your dating reports.
This of course also implies that humans in their typical social state (talking about their experiences, the weather, personal details, how their day was, and what their opinion is on this or that) are NOT demonstrating signs of human intelligence. These conversations are basically just "generative" rituals that could be replaced by an app at this point.