Comment by drtz

3 hours ago

> It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.

It's even weirder to suggest that the disagreement is indicative of a problem. If you asked five very knowledgeable humans on this subject to select the correct answer on a multiple-choice questionnaire, they would almost certainly vary significantly more than these 5 LLMs.

Not to say that hallucination isn't a problem, but this is a lousy way to test it.

2 comments

drtz

dakolli 2 hours ago

What are you talking about, it had the option for nuanced responses, but it chose the more binary responses. It could have chosen no explanations, no qualifiers but instead it showed off LLMs incapability for nuance.

These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking" tokens as a concept are mostly there to convince people to use models that consume more tokens and produce more revenue. The output from reasoning models might be more accurate, but its just a consequence of a longer inference runtime, there is no "reasoning" happening, reasoning is just sales/UX bullsh*t.

drtz 1 hour ago

> What are you talking about, it had the option for nuanced responses
The prompt allowed for exactly four valid outputs and explicitly disallowed explanations and qualifiers.
> Output exactly one label: True, > Mostly True, Misleading, or False. > No explanations, no qualifiers.
How is that a nuanced response?
> These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking"
My suggestion is that five presumably reasoning and thinking humans would also have variation in their responses to the exact same prompt.