Comment by jerf
2 months ago
It might help a bit to expand this test to a short phrase. With such a small test the model can be right for the wrong reasons; opening up a bit of space to be wrong in might sharpen the differences.
(My one-off test of the default ChatGPT model, whatever that is, got 'How many b's are there in "Billy Bob beat the record for bounding?"' correct first try, with correct reasoning given.)
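For anyone repeating this, it helps to have the ground truth on hand before judging the model's answer. A minimal sketch in Python; the helper name `count_letter` and its `case_sensitive` flag are illustrative assumptions, not part of the original test:

```python
def count_letter(text: str, letter: str, case_sensitive: bool = False) -> int:
    """Count how many times a single letter occurs in text."""
    if not case_sensitive:
        # Fold both sides to lowercase so "B" and "b" count as the same letter.
        text, letter = text.lower(), letter.lower()
    return text.count(letter)


# The phrase from the one-off test above.
phrase = "Billy Bob beat the record for bounding?"

print(count_letter(phrase, "b"))                       # 5 (B and b together)
print(count_letter(phrase, "b", case_sensitive=True))  # 3 (lowercase b only)
```

The two counts differ, so a longer phrase also tests whether the model says which interpretation it is using, which is one more place for it to be wrong.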