Comment by sgc

16 hours ago

Since models just output the most probable tokens and you can never accuse them of doing anything other than making it all up, I would like to see these tests run with a prompt that attempts to mitigate hallucination and finishes with something like: "Telling me that you don't have the relevant information or that the task is impossible is extremely useful to me and a valid answer", and see how much that changes the scoring, as well as the usefulness of the answers. There are so many skills like context7 that can be tweaked to improve these results as well.

In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.
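
A rough sketch of the comparison I mean, assuming a hypothetical `ask_model` callable (prompt in, answer out) and a hypothetical `grade` function; neither is from the benchmark being discussed, and only the clause text comes from the comment above:

```python
from typing import Callable, Sequence, Tuple

# Clause quoted from the comment above; the rest is an illustrative harness.
ABSTAIN_CLAUSE = (
    "Telling me that you don't have the relevant information or that the "
    "task is impossible is extremely useful to me and a valid answer."
)


def with_abstain_clause(prompt: str) -> str:
    """Append the abstention clause to a benchmark prompt."""
    return f"{prompt.rstrip()}\n\n{ABSTAIN_CLAUSE}"


def compare(
    ask_model: Callable[[str], str],           # hypothetical: sends a prompt, returns the answer
    cases: Sequence[Tuple[str, str]],          # (question, expected answer) pairs
    grade: Callable[[str, str], bool],         # hypothetical: True if the answer is acceptable
) -> dict:
    """Score the same cases with and without the clause and return both accuracies."""
    raw = sum(grade(ask_model(q), expected) for q, expected in cases)
    hedged = sum(
        grade(ask_model(with_abstain_clause(q)), expected) for q, expected in cases
    )
    n = len(cases)
    return {"raw": raw / n, "with_clause": hedged / n}
```

Run it per model and compare the two numbers; if the ranking changes between the `raw` and `with_clause` columns, the raw leaderboard was not telling you much.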

4 comments

grayhatter 16 hours ago

> In other words, you shouldn't choose the model that hallucinates the least without detailed prompting

"You're prompting it wrong" is quickly becoming the new "you're holding it wrong."

It's wild how willing software engineers are to blame the user when the actual problem is their own defective design.

Ideally we all, as an industry, will stop accepting this as a reasonable excuse for the demonstrated incompetence.

ordersofmag 8 hours ago
It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant--no matter how much we want it to be and no matter how much the fluency of the output tricks us into thinking there's a human-like mind behind it.
Now granted, if the boat salesmen were pushing hard on the idea that the boat would fly and even put little wings on the side, and I bought the boat, I might get really angry when I found out that it didn't fly. And I might angrily storm into the salesroom yelling about how the design is defective. But if someone pointed out 'hey, it's a boat, perhaps you should stick to sailing around in it and stop getting your undies in a bundle about it not flying', the correct response is probably to take a closer look, ignore the salesmen, and cruise around the lake. LLMs are quite handy at some things and have some weird limits. Learn the limits, enjoy your time at sea.
- grayhatter 5 hours ago
  
  > It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant
  It's not that you're holding it wrong, you're just wrong for expecting it to work the way it's described (able to one shot most problems these days).

epihelix 11 hours ago

[dead]