Comment by grayhatter

5 hours ago

The way you define the evaluation criteria seems very problematic[^1].

I don't understand the point of describing it as 3 possible outcomes. I objected to it because the only reason I would do something like that would be to obscure the severity of the model's defects. I'm sure I'm missing something, but I suspect the reason it's done that way is to [intentionally] obscure the metrics that are actually meaningful.

I would expect any engineer to evaluate any model using accuracy (error rate) and usefulness (definitive answer rate) as strictly independent metrics. Did it answer, and if it answered, did it emit incorrect or misleading information, and how many quantifiable bits of each?

The false negative rate (model confirmed via another method to contain the requested output/information, but unable to produce it for the given test) is significant, but giving a non-definitive answer is significantly different from giving a definitive and incorrect answer. Why would you want to group those with hallucinations?

What matters is the number/rate of useful answers (correct and incorrect) and the error rate (given any answer, how often that answer will be defective in some way).

To be clear, I'm differentiating hallucination rate from eagerness to answer, even though they're obviously linked, because I believe presenting 20 correct answers, 20 incorrect answers, and 60 abstentions as a hallucination rate of 25% is obviously malicious. If I give you 40 answers, 20 correct and 20 incorrect, the error rate is 50%, and if it refused an additional 60 times, its usefulness rate would be 40%... arguably 20% depending on how strict you choose to be about the definition of useful. The matrix we should be using is a 2x2: true positive, false positive, true negative, false negative. But being that honest might make the model look bad!
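
For anyone who wants to check the arithmetic, here's a rough sketch of the two metrics as I'm describing them. The `Tally` type and the function names are made up for illustration and aren't from any benchmark's tooling; the `strict` flag is just my shorthand for "only correct answers count as useful".

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical container for one eval run's outcome counts."""
    correct: int
    incorrect: int
    abstained: int


def error_rate(t: Tally) -> float:
    """Of the questions the model actually answered, what fraction was wrong?"""
    answered = t.correct + t.incorrect
    return t.incorrect / answered if answered else 0.0


def usefulness_rate(t: Tally, strict: bool = False) -> float:
    """Of all questions asked, what fraction got a definitive answer?

    With strict=True only correct answers count as useful (an assumption
    about what "strict" should mean here).
    """
    total = t.correct + t.incorrect + t.abstained
    useful = t.correct if strict else t.correct + t.incorrect
    return useful / total if total else 0.0


run = Tally(correct=20, incorrect=20, abstained=60)
print(error_rate(run))                    # 0.5
print(usefulness_rate(run))               # 0.4
print(usefulness_rate(run, strict=True))  # 0.2
```

The two numbers use different denominators on purpose: error rate is conditioned on an answer being given, usefulness is over every question asked, so neither one can hide the other.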

[^1]: Just in case it's unclear, I'm using "you" exclusively rhetorically. I don't think you personally are being misleading, only that you're explaining how it's done... but that's the problem, isn't it?