Comment by jerf

2 months ago

It might help a bit to expand this test to a short phrase. With such a small test, the model can be right for the wrong reasons; giving it a bit more room to be wrong in might sharpen the differences.

(My one-off test of the default ChatGPT model, whatever that is, got 'How many b's are there in "Billy Bob beat the record for bounding?"' correct first try, with correct reasoning given.)
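
For anyone repeating this with other phrases, the ground truth is cheap to compute, which makes it easy to check a model's answer. A minimal sketch in Python: the helper name `count_letter` is made up here, and treating upper- and lower-case "b" as the same letter is an assumption about what the question intends.

```python
def count_letter(text: str, letter: str, case_sensitive: bool = False) -> int:
    """Count how many times a single letter occurs in text."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    if not case_sensitive:
        # Assumption: "b" and "B" count as the same letter.
        text, letter = text.lower(), letter.lower()
    return text.count(letter)


phrase = "Billy Bob beat the record for bounding?"
print(count_letter(phrase, "b"))                       # 5
print(count_letter(phrase, "b", case_sensitive=True))  # 3 (lowercase b only)
```

The two counts differ, so a longer phrase with mixed case also shows whether the model says how it handled capitals.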