Comment by logicallee

2 years ago

I took a quick glance through the article. It states:

"LLMs can’t count or easily do other mathematical operations due to tokenization, and because tokens correspond to a varying length of characters, the model can’t use the amount of generated tokens it has done so far as a consistent hint."

It then proceeds to test whether tipping helps by using exactly the thing it says current LLMs can't do.

I think that is frankly unfair. It would be like picking something a human can't do, and then using that as the standard to judge whether humans do better when offered a tip.

I think the proper way to test whether tipping improves performance is with a metric that is clearly within the capabilities of LLMs.

Pick something they can do.
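For concreteness, here is a rough sketch of the kind of A/B test I mean, using short factual Q&A as the within-capability task. This assumes the openai Python client with an API key in the environment; the model name and questions are just placeholders:

    # Hypothetical tip A/B test on a task LLMs handle reliably.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    QA = [
        ("What is the capital of France?", "paris"),
        ("In what year did Apollo 11 land on the Moon?", "1969"),
    ]

    def accuracy(suffix: str) -> float:
        """Answer each question with `suffix` appended; return exact-match rate."""
        correct = 0
        for question, answer in QA:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[{"role": "user", "content": question + suffix}],
            )
            if answer in resp.choices[0].message.content.lower():
                correct += 1
        return correct / len(QA)

    print("baseline:", accuracy(""))
    print("with tip:", accuracy(" I'll tip you $200 for a correct answer."))

In practice you'd want a benchmark hard enough that baseline accuracy isn't already at the ceiling, but at least a difference between those two numbers would be measuring something the models can actually do.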