Comment by theshrike79

10 hours ago

The easiest way to fix these is give the model an environment to run code.

Any model can easily one-shot a python script that can count the occurrence of any letter anywhere and return the result.
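The kind of one-shot script being described could be as simple as this sketch (function name and examples are my own illustration):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of `letter` in `word`,
    sidestepping tokenisation entirely by working on raw characters."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # → 3
print(count_letter("aardvark", "a"))    # → 3
```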

It's just a tooling issue. You can't really "train" an LLM to do it, because of tokenisation and ... stuff.

I am not convinced they are executing code. If they were, I wouldn't expect LLMs to guess at the results of math questions as often as they do.

Of course you could train it. Some quick scripting to find all words with repeated letters, build up sample sentences ("aardvark" has three a's), and you've hard-coded the answers to the simple questions that make your LLM look stupid.
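A rough sketch of that data-generation idea (the function name and sentence template are my own illustration, not from the comment):

```python
from collections import Counter

def build_samples(words: list[str]) -> list[str]:
    """For each word, emit one training sentence per letter
    that appears more than once in it."""
    samples = []
    for word in words:
        for letter, n in Counter(word.lower()).items():
            if n > 1:  # only letters that repeat
                samples.append(
                    f'The word "{word}" contains the letter "{letter}" {n} times.'
                )
    return samples

for line in build_samples(["aardvark", "strawberry"]):
    print(line)
# "aardvark" yields sentences for a (3) and r (2);
# "strawberry" yields one for r (3).
```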

  • I have personally observed Grok running Python code in a chat to determine the current date so it could accurately tell me whether the 20th is a Friday (it wasn't in that specific month)

    ... it did that in a story prompt that wasn't set in a) our world or b) the current time =)