Comment by card_zero
17 hours ago
Gee, a thing by a guy, with a name. What are you saying exactly? So the test in question is a test the LLM is asked to carry out, right? Then your point is that if it's a load of vacuous flannel 49% of the time, but meaningful 51% of the time, on average this is genuine work so we can't complain about the 49%?
Wait, you're probably talking about the test of discarding a report based on something superficial like spelling errors, which fails with LLMs due to their basic con-man personalities and smooth talking. And therefore...?